You may have heard that data engineering is moving towards cloud data warehouses, which are cheaper from both compute and storage perspectives. In this article, we will explore a popular modern data warehouse called Snowflake, which not only provides inexpensive resources but also comes with advanced features that make analytics more powerful and efficient.
Snowflake is a SaaS (Software as a Service) solution that provides storage, compute, and analytics under one umbrella. SaaS also means that you don’t have to install any special software, maintain or upgrade any servers, or take backups for disaster recovery. All of these come as an integrated part of Snowflake.
Where does Snowflake run?
Underneath, Snowflake runs on public cloud platforms: AWS, Google Cloud, and Microsoft Azure. This means it cannot be run on-premises or on privately managed clouds. While creating a Snowflake account, you can choose the cloud provider that will host Snowflake, as shown in the image below.
Database Storage Layer
Snowflake uses the underlying cloud provider for storage. For example, if you choose AWS, it will store the data in AWS S3, and while storing the data it will also reorganize it into a more optimized and compressed format. To make analytics queries faster, it also stores the data in a columnar format. All of the underlying storage and optimization is abstracted from the end user. As a customer, you cannot access the data directly; you always have to work with it through SQL queries in the Snowflake editor.
Query Processing Layer
Processing in Snowflake is done through SQL. You can create virtual warehouses according to your use case. For example, if your queries are simple, you can use a small virtual warehouse; if your queries are complex and long-running, you can use a medium or large warehouse. Choosing the warehouse size according to your analytics needs can save a lot of money. Each virtual warehouse has its own compute resources, which it does not share with other warehouses.
Snowflake does not just provide storage and compute; it also provides many additional services such as query optimization, metadata management, access control, and much more. In simple words, it can be a one-stop solution for all your analytics use cases.
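As a rough sketch of the sizing idea above, two warehouses of different sizes could be created as follows. The warehouse names `adhoc_wh` and `etl_wh` are hypothetical, and the sizes are only illustrative; the right size depends on your own workload.

```sql
-- Hypothetical small warehouse for simple, short-running queries.
-- INITIALLY_SUSPENDED keeps it from consuming credits until first use.
CREATE WAREHOUSE IF NOT EXISTS adhoc_wh
  WAREHOUSE_SIZE = 'XSMALL'
  INITIALLY_SUSPENDED = TRUE;

-- Hypothetical larger warehouse for complex, long-running queries.
CREATE WAREHOUSE IF NOT EXISTS etl_wh
  WAREHOUSE_SIZE = 'LARGE'
  INITIALLY_SUSPENDED = TRUE;

-- Choose which warehouse the current session uses for its queries.
USE WAREHOUSE adhoc_wh;
```

Because each warehouse has its own compute, a heavy job on `etl_wh` should not slow down queries running on `adhoc_wh`.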
Snowflake Unique Features
In Snowflake, you can recover deleted or changed data within a defined period using time travel. The Standard and Enterprise editions provide 1 day of data recovery by default. In the Enterprise edition, you can also raise this to up to 90 days.
Let’s say you modified a table but later realized that the changes were not appropriate, and you want to go back in time. For example, you may want to see what the data looked like 5 minutes ago. Time travel lets you query the table as it existed at that point.
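The retention period described above is controlled per object. A minimal sketch, assuming a hypothetical table named `customer` and an Enterprise edition account:

```sql
-- Extend the time-travel window for one table to the 90-day maximum.
-- Assumption: the account is on Enterprise edition; on Standard edition
-- the retention period cannot be raised above 1 day.
ALTER TABLE customer SET DATA_RETENTION_TIME_IN_DAYS = 90;

-- Check the current setting for the table.
SHOW PARAMETERS LIKE 'DATA_RETENTION_TIME_IN_DAYS' IN TABLE customer;
```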
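The original query for this scenario is not reproduced here, so the following is only a sketch of how such a look-back could be written. It assumes a hypothetical table named `customer` and uses the `AT` clause with an offset given in seconds.

```sql
-- Read the table as it existed 5 minutes (60 * 5 seconds) in the past.
-- The offset is negative because it counts backwards from now.
SELECT *
FROM customer AT(OFFSET => -60*5);
```

The requested point in time has to fall within the table's retention period; otherwise the query fails.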
Even after the time-travel period is finished, data does not get deleted immediately. It moves to the Fail-Safe zone. However, no user operations are allowed in the Fail-Safe zone, and only Snowflake can recover the data from that zone on request. In the Fail-Safe zone, data is kept for 7 more days by Snowflake. Moreover, data recovery from Fail-Safe is not immediate; it may take days to recover the data.
Zero-copy cloning is another unique feature of Snowflake. With it, you can create clones of databases, schemas, and tables without downtime and almost instantly. Cloning in Snowflake means cloning the metadata. Internally, all data is saved in micro-partitions. Until you change the contents of the cloned database or table, the clone keeps referring to the same underlying micro-partitions. Even when data is updated, only the relevant micro-partitions are affected. You can also create clones from time travel.
For example, you can clone the customer table as it existed at the exact date and time of a specified timestamp.
Snowflake is a modern data warehouse with many advanced features that today’s data engineers need, but there are a few things you should keep in mind to use it more efficiently.
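A sketch of both kinds of clone, assuming a table named `customer`. The clone names and the timestamp value are made up for illustration.

```sql
-- Plain zero-copy clone of the table in its current state.
CREATE TABLE customer_dev CLONE customer;

-- Clone of the table as it existed at a specific point in time.
-- The timestamp is an arbitrary example and must lie within the
-- table's time-travel retention period.
CREATE TABLE customer_restored CLONE customer
  AT(TIMESTAMP => '2024-06-01 12:00:00'::timestamp_ltz);
```

Neither statement copies data up front; both clones start out pointing at the micro-partitions of the original table.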
Time-Travel & Fail-Safe
Data kept in the time-travel and Fail-Safe zones takes up storage and adds to your monthly expenses. If some data is not critical and can be recreated, set its retention time to 0 and turn off Fail-Safe for it.
Whenever you create a virtual warehouse in Snowflake, never forget to turn on the auto-suspend feature, since it can save a lot of money when you are not running any queries on Snowflake.
Enabling auto-resume ensures that whenever a user runs a query against a warehouse that has been auto-suspended, the warehouse starts up again automatically.
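One way to apply this advice is sketched below. Fail-Safe cannot be switched off on a permanent table; the usual approach is to keep such data in a transient table, which has no Fail-Safe period. The table names and columns here are hypothetical.

```sql
-- Transient table: no Fail-Safe period, and time travel disabled
-- by setting the retention time to 0 days.
CREATE TRANSIENT TABLE staging_events (
  event_id INTEGER,
  payload  VARIANT
)
DATA_RETENTION_TIME_IN_DAYS = 0;

-- For an existing table, only the retention time can be changed.
ALTER TABLE scratch_results SET DATA_RETENTION_TIME_IN_DAYS = 0;
```

Use this only for data that can be reloaded or regenerated, since a mistake on such a table cannot be undone through time travel or Fail-Safe.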
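Both settings can be applied to an existing warehouse with a single statement. A minimal sketch, assuming a hypothetical warehouse named `reporting_wh` and an arbitrary idle period of 5 minutes:

```sql
-- AUTO_SUSPEND is given in seconds of inactivity.
-- AUTO_RESUME restarts the warehouse when a new query arrives.
ALTER WAREHOUSE reporting_wh SET
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;
```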
Set Resource Monitors
Using resource monitors, you can send notifications and alerts when your usage exceeds a certain threshold. Always set appropriate resource monitors to guard your expenses. The figure below shows how to set a resource monitor on a specific warehouse.
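The same setup can also be expressed in SQL. The following is a sketch only: the monitor name, warehouse name, credit quota, and thresholds are all assumptions, and creating a resource monitor normally requires the ACCOUNTADMIN role.

```sql
-- Hypothetical monthly budget of 100 credits.
CREATE RESOURCE MONITOR monthly_limit WITH
  CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY    -- send a notification at 80% of the quota
    ON 100 PERCENT DO SUSPEND; -- suspend once running queries finish

-- Attach the monitor to one specific warehouse.
ALTER WAREHOUSE reporting_wh SET RESOURCE_MONITOR = monthly_limit;
```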
Set Workloads Timeout
Snowflake warehouses keep warm caches so that frequently accessed data can be reused by other users. However, keeping these caches costs resources and is not useful for every kind of workload. For example, for a data-loading task you can set the timeout so that the warehouse suspends almost immediately, since those caches will not be reused. Similarly, for warehouses that serve select queries, you can keep the warm cache for a longer period, since it can be reused by other users.
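In practice this usually comes down to giving each workload its own warehouse with a different auto-suspend timeout. A sketch under that assumption, with hypothetical warehouse names and illustrative timeouts:

```sql
-- Loading warehouse: its cache is rarely reused, so suspend quickly.
ALTER WAREHOUSE loading_wh SET AUTO_SUSPEND = 60;   -- 1 minute idle

-- Query warehouse: keep the cache warm longer for repeated selects.
ALTER WAREHOUSE bi_wh SET AUTO_SUSPEND = 600;       -- 10 minutes idle
```

The warehouse cache is dropped when the warehouse suspends, so a longer timeout trades extra running time for a better chance of cache hits.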