
US-12625881-B2 - Caching systems and methods

US12625881B2

Abstract

Example caching systems and methods are described. In one implementation, a method receives a set of queries to be processed by a set of virtual warehouses and distributes the queries to those warehouses for execution. While the queries are being processed, the method creates a new virtual warehouse whose cache resources are populated with the data files associated with the set of queries at the time the warehouse is created. The cache resources vary among the warehouse's processors: a first subset of the processors comprises minimal cache resources, while a second subset comprises cache resources providing faster input-output operations. The method then redistributes the set of queries across the set of virtual warehouses.
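The abstract's flow (distribute queries, create a cache-warmed warehouse mid-execution, redistribute) can be sketched roughly as follows. This is an illustrative sketch only; all class and function names here are invented for the example, not taken from the patent:

```python
from dataclasses import dataclass, field

@dataclass
class VirtualWarehouse:
    """A compute cluster with its own local cache of data files."""
    name: str
    cache: set = field(default_factory=set)    # file IDs cached locally
    queries: list = field(default_factory=list)

def distribute(queries, warehouses):
    """Assign each query to a warehouse (round-robin for simplicity)."""
    for wh in warehouses:
        wh.queries.clear()
    for i, q in enumerate(queries):
        warehouses[i % len(warehouses)].queries.append(q)

def create_warm_warehouse(name, queries, file_for_query):
    """Create a new warehouse whose cache is pre-populated with the
    data files the in-flight queries are known to touch."""
    wh = VirtualWarehouse(name)
    wh.cache = {file_for_query[q] for q in queries if q in file_for_query}
    return wh

# Example: scale out while a batch of queries is running.
queries = [f"q{i}" for i in range(6)]
file_for_query = {q: f"file_{q}" for q in queries}
warehouses = [VirtualWarehouse("wh1"), VirtualWarehouse("wh2")]
distribute(queries, warehouses)

# The new warehouse starts with a warm cache, so redistributed queries
# do not all miss to remote storage.
warehouses.append(create_warm_warehouse("wh3", queries, file_for_query))
distribute(queries, warehouses)
```

Because the new warehouse's cache already holds the files the running queries reference, redistribution does not force a cold start on the added compute.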

Inventors

  • Thierry Cruanes
  • Benoit Dageville
  • Marcin Zukowski

Assignees

  • SNOWFLAKE INC.

Dates

Publication Date
2026-05-12
Application Date
2024-02-26

Claims (20)

  1. A method comprising: receiving a set of queries to be processed by a set of virtual warehouses, wherein each virtual warehouse of the set of virtual warehouses includes a first set of processors and cache resources corresponding to the first set of processors; distributing, by a second set of processors, the set of queries to the set of virtual warehouses to be executed by the first set of processors; creating, during the processing of the set of queries by the set of virtual warehouses, a new virtual warehouse; determining, based on metadata, a file on a remote storage device, the metadata indicating information about data organization on the remote storage device; retrieving the file from the remote storage device; storing the file in cache resources associated with the new virtual warehouse, wherein: the cache resources associated with the new virtual warehouse are populated with data files associated with the set of queries at a time the virtual warehouse is created to restore a metadata state associated with the set of queries based on a set of statistics, wherein the metadata state associated with the set of queries is based on a previous state of a previously terminated virtual warehouse from the set of virtual warehouses and includes the file stored in the cache resources and information related to a last time the file was accessed; and the cache resources vary among the first set of processors, wherein a first subset of the first set of processors comprises minimal cache resources and a second subset of the first set of processors comprises cache resources providing faster input-output operations; and redistributing, as a result of the creating, the set of queries across the set of virtual warehouses.
  2. The method of claim 1, wherein the set of queries are associated with a database comprising a set of database tables.
  3. The method of claim 2, wherein at least one of the set of database tables is encrypted and is subsequently decrypted before executing the set of queries.
  4. The method of claim 2, wherein at least one of the set of database tables is compressed and is subsequently decompressed before executing the set of queries.
  5. The method of claim 2, wherein: each of the second set of processors processes one of the set of database tables; and data from the one of the set of database tables is stored in a cache associated with that processor.
  6. The method of claim 1, wherein the set of queries are received from a set of clients, further comprising generating a set of results from the execution of the set of queries, and returning the set of results to the set of clients.
  7. The method of claim 1, further comprising optimizing the set of queries.
  8. The method of claim 2, wherein the database is a relational database.
  9. The method of claim 8, wherein the relational database is a structured query language database.
  10. The method of claim 2, wherein the database is a multi-tenant database that isolates computing resources and data between different customers.
  11. The method of claim 2, wherein the database is external to a system that includes the first set of processors.
  12. The method of claim 1, wherein the set of statistics comprises the metadata related to one or more databases.
  13. The method of claim 1, wherein the set of statistics is automatically accumulated.
  14. The method of claim 1, wherein the set of statistics is automatically updated.
  15. A system comprising: a memory; and a processing device operatively coupled to the memory, the processing device to: receive a set of queries to be processed by a set of virtual warehouses, wherein each virtual warehouse of the set of virtual warehouses includes a first set of processors and cache resources corresponding to the first set of processors; distribute, by a second set of processors, the set of queries to the set of virtual warehouses to be executed by the first set of processors; create, during the processing of the set of queries by the set of virtual warehouses, a new virtual warehouse; determine, based on metadata, a file on a remote storage device, the metadata indicating information about data organization on the remote storage device; retrieve the file from the remote storage device; store the file in cache resources associated with the new virtual warehouse, wherein: the cache resources associated with the new virtual warehouse are populated with data files associated with the set of queries at a time the virtual warehouse is created to restore a metadata state associated with the set of queries based on a set of statistics, wherein the metadata state associated with the set of queries is based on a previous state of a previously terminated virtual warehouse from the set of virtual warehouses and includes the file stored in the cache resources and information related to a last time the file was accessed; and the cache resources vary among the first set of processors, wherein a first subset of the first set of processors comprises minimal cache resources and a second subset of the first set of processors comprises cache resources providing faster input-output operations; and redistribute, as a result of the creating, the set of queries across the set of virtual warehouses.
  16. The system of claim 15, wherein the cache resources associated with the new virtual warehouse include a memory device and a disk storage device.
  17. A non-transitory computer-readable medium having instructions stored thereon that, when executed by a processing device, cause the processing device to: receive a set of queries to be processed by a set of virtual warehouses, wherein each virtual warehouse of the set of virtual warehouses includes a first set of processors and cache resources corresponding to the first set of processors; distribute, by a second set of processors, the set of queries to the set of virtual warehouses to be executed by the first set of processors; create, during the processing of the set of queries by the set of virtual warehouses, a new virtual warehouse; determine, based on metadata, a file on a remote storage device, the metadata indicating information about data organization on the remote storage device; retrieve the file from the remote storage device; store the file in cache resources associated with the new virtual warehouse, wherein: the cache resources associated with the new virtual warehouse are populated with data files associated with the set of queries at a time the virtual warehouse is created to restore a metadata state associated with the set of queries based on a set of statistics, wherein the metadata state associated with the set of queries is based on a previous state of a previously terminated virtual warehouse from the set of virtual warehouses and includes the file stored in the cache resources and information related to a last time the file was accessed; and the cache resources vary among the first set of processors, wherein a first subset of the first set of processors comprises minimal cache resources and a second subset of the first set of processors comprises cache resources providing faster input-output operations; and redistribute, as a result of the creating, the set of queries across the set of virtual warehouses.
  18. The non-transitory computer-readable medium of claim 17, wherein the set of queries comprises a set of database tables.
  19. The non-transitory computer-readable medium of claim 18, wherein at least one of the set of database tables is encrypted and is subsequently decrypted before executing the set of queries.
  20. The non-transitory computer-readable medium of claim 18, wherein at least one of the set of database tables is compressed and is subsequently decompressed before the executing of the set of queries.
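Claims 1, 15, and 17 all describe restoring, into a newly created warehouse, the metadata state of a previously terminated one: which files were cached and when each was last accessed. A minimal sketch of that idea, assuming a simple dict-backed "remote storage" and invented names throughout (nothing here is the patent's actual implementation):

```python
import time

class TieredCache:
    """Sketch of a warehouse-local cache backed by remote storage.
    Tracks last-access times so the cache's metadata state can be
    exported at termination and restored into a new warehouse."""

    def __init__(self, remote_storage, state=None):
        self.remote = remote_storage       # file_id -> bytes (remote tier)
        self.local = {}                    # file_id -> bytes (local tier)
        self.last_access = {}              # file_id -> last-access timestamp
        if state:                          # restore a prior warehouse's state:
            for file_id, ts in state.items():  # re-fetch each file it had cached
                if file_id in self.remote:
                    self.local[file_id] = self.remote[file_id]
                    self.last_access[file_id] = ts

    def read(self, file_id):
        if file_id not in self.local:      # cache miss: fetch from remote tier
            self.local[file_id] = self.remote[file_id]
        self.last_access[file_id] = time.time()
        return self.local[file_id]

    def export_state(self):
        """Metadata state handed off when this warehouse is terminated."""
        return dict(self.last_access)

# A warehouse caches a file, is terminated, and its state is restored
# into a replacement warehouse, which therefore starts warm.
remote = {"f1": b"table-part-1", "f2": b"table-part-2"}
old = TieredCache(remote)
old.read("f1")
state = old.export_state()          # survives the warehouse's termination

new = TieredCache(remote, state=state)
```

The last-access timestamps carried in the exported state are what would let an eviction policy (e.g. LRU) in the new warehouse behave as if the cache had never been torn down.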

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of co-pending U.S. patent application Ser. No. 18/202,502, filed May 26, 2023, which is a continuation of U.S. patent application Ser. No. 16/805,638, filed Feb. 28, 2020, which is a continuation of U.S. patent application Ser. No. 14/518,971, filed Oct. 20, 2014, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application No. 61/941,986, filed Feb. 19, 2014, and these applications are hereby incorporated by reference herein in their entirety.

TECHNICAL FIELD

The present disclosure relates to resource management systems and methods that manage the caching of data.

BACKGROUND

Many existing data storage and retrieval systems are available today. For example, in a shared-disk system, all data is stored on a shared storage device that is accessible from all of the processing nodes in a data cluster. In this type of system, all data changes are written to the shared storage device to ensure that all processing nodes in the data cluster access a consistent version of the data. As the number of processing nodes increases in a shared-disk system, the shared storage device (and the communication links between the processing nodes and the shared storage device) becomes a bottleneck that slows data read and data write operations. This bottleneck is further aggravated with the addition of more processing nodes. Thus, existing shared-disk systems have limited scalability due to this bottleneck problem.

Another existing data storage and retrieval system is referred to as a “shared-nothing architecture.” In this architecture, data is distributed across multiple processing nodes such that each node stores a subset of the data in the entire database. When a new processing node is added or removed, the shared-nothing architecture must rearrange data across the multiple processing nodes. This rearrangement of data can be time-consuming and disruptive to data read and write operations executed during the data rearrangement. And, the affinity of data to a particular node can create “hot spots” on the data cluster for popular data. Further, since each processing node also performs the storage function, this architecture requires at least one processing node to store data. Thus, the shared-nothing architecture fails to store data if all processing nodes are removed. Additionally, management of data in a shared-nothing architecture is complex due to the distribution of data across many different processing nodes.

The systems and methods described herein provide an improved approach to data storage and data retrieval that alleviates the above-identified limitations of existing systems.

BRIEF DESCRIPTION OF THE DRAWINGS

Non-limiting and non-exhaustive embodiments of the present disclosure are described with reference to the following figures, wherein like reference numerals refer to like parts throughout the various figures unless otherwise specified.

FIG. 1 is a block diagram depicting an example embodiment of the systems and methods described herein.
FIG. 2 is a block diagram depicting an embodiment of a resource manager.
FIG. 3 is a block diagram depicting an embodiment of an execution platform.
FIG. 4 is a block diagram depicting an example operating environment with multiple users accessing multiple databases through multiple virtual warehouses.
FIG. 5 is a block diagram depicting another example operating environment with multiple users accessing multiple databases through a load balancer and multiple virtual warehouses contained in a virtual warehouse group.
FIG. 6 is a block diagram depicting another example operating environment having multiple distributed virtual warehouses and virtual warehouse groups.
FIG. 7 is a flow diagram depicting an embodiment of a method for managing data storage and retrieval operations.
FIG. 8 is a flow diagram depicting an embodiment of a method for managing a data cache.
FIG. 9 is a block diagram depicting an example computing device.

DETAILED DESCRIPTION

The systems and methods described herein provide a new platform for storing and retrieving data without the problems faced by existing systems. For example, this new platform supports the addition of new nodes without the need for rearranging data files as required by the shared-nothing architecture. Additionally, nodes can be added to the platform without creating bottlenecks that are common in the shared-disk system. This new platform is always available for data read and data write operations, even when some of the nodes are offline for maintenance or have suffered a failure. The described platform separates the data storage resources from the computing resources so that data can be stored without requiring the use of dedicated computing resources. This is an improvement over the shared-nothing architecture, which fails to store data if all computing resources are removed. Therefore, the new platform continues to store data even though t