US-12619606-B2 - System and methods for processing query command within data warehouse architecture

US12619606B2US 12619606 B2US12619606 B2US 12619606B2US-12619606-B2

Abstract

Embodiments of the disclosure describe a system and method for processing Structured Query Language (SQL) query within data warehouse architecture. The method includes receiving, by nodes associated with an engine layer, a SQL query from a client device, the engine layer indicates a component of the data warehouse architecture. Further, the method includes receiving a topology from a storage layer in response to receiving the SQL query, the topology indicates an arrangement of the stored data among partitions associated with the storage layer. Further, the method includes determining an execution plan tree, the execution plan tree indicates operations to be executed by the engine layer and the storage layer corresponding to the SQL query. The method includes distributing, a fragment of the execution plan tree to the storage layer based on the topology, for processing the operations.

Inventors

Kartik Kulkarni
Darshan Nagaraj

Assignees

AkashX Inc.

Dates

Publication Date: 20260505
Application Date: 20240628

Claims (12)

1 . A method for processing a Structured Query Language (SQL) query within a data warehouse architecture, the method comprising: receiving, by one or more nodes associated with an engine layer, a SQL query from a client device, wherein the engine layer indicates component of the data warehouse architecture configured for receiving, interpreting, optimizing, and executing the SQL query against stored data; receiving, by the one or more nodes associated with the engine layer, a topology from a storage layer in response to receiving the SQL query, wherein the topology indicates arrangement of the stored data among one or more partitions associated with the storage layer, and wherein the storage layer indicates component of the data warehouse architecture for storing and managing data; determining, by the one or more nodes associated with the engine layer, an execution plan tree based on correlating the SQL query, a received data volume value, and pre-defined capabilities of the storage layer, wherein determining the execution plan tree comprises: optimizing, by the engine layer, the received SQL query to determine the received data volume value indicating an amount of data volume that the storage layer sends to the engine layer for processing the received SQL query; correlating the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer; and assigning, one or more operations for processing the received SQL query to the engine layer and the storage layer, based on correlating the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer, thereby determining the execution plan tree, and wherein the execution plan tree indicates the one or more operations to be executed by the engine layer and the storage layer corresponding to the received SQL query; distributing, by the one or more nodes associated with the engine layer, a fragment of the execution plan tree to the storage layer based on the topology, for processing the one or more operations included in the execution plan tree such that the received or fetched data volume value and a computational load value of the engine layer is reduced, wherein distributing the fragment of the execution plan tree to the storage layer based on the topology comprises sending the fragment of the execution plan tree to the storage layer based on: identification of one or more execution nodes in the execution plan tree assigned to the storage layer, and the topology, wherein each of the one or more execution nodes include metadata thereby required for processing the received SQL query; processing, by the storage layer, the one or more operations corresponding to the fragment of the execution plan tree; sending, by the storage layer, processed results to the engine layer thereby reducing the received data volume value and the computational load value of the engine layer; and generating, by the engine layer, a compiled result-set by combining the processed results received from the storage layer with computed results generated within the engine layer, thereby passing the compiled result-set to the client device.
2 . The method of claim 1 , wherein the one or more nodes receiving the SQL query is a query coordinator and the one or more nodes determining the execution plan tree is a query optimizer.
3 . The method of claim 1 , wherein determining the execution plan tree further comprises: parsing, by the engine layer, the received SQL query, indicating determining syntactic structure and semantics corresponding to the received SQL query; and receiving, by the engine layer, the pre-defined capabilities indicating computation characteristics of the storage layer.
4 . The method of claim 1 , wherein distributing the fragment of the execution plan tree to the storage layer based on the topology further comprises: identifying the one or more execution nodes in the execution plan tree assigned to the storage layer, wherein the metadata indicates operation codes or functions for each of the one or more execution nodes, catalog content, list of objects, filters, predicates, projections, and data types, wherein the fragment of the execution plan tree is sent to the storage layer based on the identification and the topology such that the one or more nodes associated with the engine layer send the fragment to a corresponding one or more partitions associated with the storage layer, wherein the fragment of the execution plan tree includes at least one of: an individual execution node or a plurality of execution nodes.
5 . The method of claim 1 , wherein the storage layer of the data warehouse architecture is disaggregated and decoupled from the engine layer.
6 . A method of processing a Structured Query Language (SQL) query within a data warehouse architecture, the method comprising: receiving, by one or more nodes associated with an engine layer, a SQL query from a client device, wherein the engine layer indicates component of the data warehouse architecture configured for receiving, interpreting, optimizing, and executing the SQL query against stored data; determining, by the one or more nodes associated with the engine layer, an execution plan tree based on correlating the SQL query, a received data volume value, and pre-defined capabilities of a storage layer, wherein determining the execution plan tree comprises: optimizing, by the engine layer, the received SQL query to determine the received data volume value indicating an amount of data volume that the storage layer sends to the engine layer for processing the received SQL query; correlating the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer; and assigning, one or more operations for processing the received SQL query to the engine layer and the storage layer, based on correlating the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer, thereby determining the execution plan tree, and wherein the execution plan tree indicates the one or more operations to be executed by the engine layer and the storage layer corresponding to the received SQL query; sending, by the one or more nodes associated with the engine layer, a fragment of the execution plan tree to the storage layer for processing the one or more operations included in the execution plan tree such that the received or fetched data volume value and a computational load value of the engine layer is reduced, wherein the fragment of the execution plan tree is sent to the storage layer based on: identification of one or more execution nodes in the execution plan tree assigned to the storage layer, and a topology received from the storage layer, wherein each of the one or more execution nodes include metadata thereby required for processing the received SQL query, and wherein the topology indicates arrangement of the stored data among one or more partitions associated with the storage layer; processing, by the storage layer, the one or more operations corresponding to the fragment of the execution plan tree; sending, by the storage layer, processed results to the engine layer thereby reducing the received data volume value and the computational load value of the engine layer; and generating, by the engine layer, a compiled result-set by combining the processed results received from the storage layer with computed results generated within the engine layer, thereby passing the compiled result-set to the client device.
7 . A system for processing a Structured Query Language (SQL) query within a data warehouse architecture, wherein the system comprises: a memory; and at least one processor in communication with the memory, the at least one processor configured to receive, by one or more nodes associated with an engine layer, a SQL query from a client device, wherein the engine layer indicates component of the data warehouse architecture configured for receiving, interpreting, optimizing, and executing the SQL query against stored data; receive, by the one or more nodes associated with the engine layer, a topology from a storage layer in response to receiving the SQL query, wherein the topology indicates arrangement of the stored data among one or more partitions associated with the storage layer, and wherein the storage layer indicates component of the data warehouse architecture for storing and managing data; determine, by the one or more nodes associated with the engine layer, an execution plan tree based on correlating the SQL query, a received data volume value, and pre-defined capabilities of the storage layer, wherein to determine the execution plan tree, the at least one processor is configured to: optimize, by the engine layer, the received SQL query to determine the received data volume value indicating an amount of data volume that the storage layer sends to the engine layer for processing the received SQL query; correlate the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer; and assign, one or more operations for processing the received SQL query to the engine layer and the storage layer, based on correlating the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer, thereby determining the execution plan tree, and wherein the execution plan tree indicates the one or more operations to be executed by the engine layer and the storage layer corresponding to the received SQL query; distribute, by the one or more nodes associated with the engine layer, a fragment of the execution plan tree to the storage layer based on the topology, for processing the one or more operations included in the execution plan tree such that the received or fetched data volume value and a computational load value of the engine layer is reduced, wherein to distribute the fragment of the execution plan tree to the storage layer based on the topology, the at least one processor is configured to send the fragment of the execution plan tree to the storage layer based on: identification of one or more execution nodes in the execution plan tree assigned to the storage layer, and the topology, wherein each of the one or more execution nodes include metadata thereby required for processing the received SQL query; process, by the storage layer, the one or more operations corresponding to the fragment of the execution plan tree; send, by the storage layer, processed results to the engine layer thereby reducing the received data volume value and the computational load value of the engine layer; and generate, by the engine layer, a compiled result-set by combining the processed results received from the storage layer with computed results generated within the engine layer, thereby passing the compiled result-set to the client device.
8 . The system of claim 7 , wherein the one or more nodes receiving the SQL query is a query coordinator and the one or more nodes determining the execution plan tree is a query optimizer.
9 . The system of claim 7 , wherein to determine the execution plan tree, the at least one processor is configured to: parse, using the engine layer, the received SQL query, indicating determining syntactic structure and semantics corresponding to the received SQL query; and receive, using the engine layer, the pre-defined capabilities indicating computation characteristics of the storage layer.
10 . The system of claim 7 , wherein the fragment of the execution plan tree is distributed to the storage layer based on the topology by: identifying the one or more execution nodes in the execution plan tree assigned to the storage layer, wherein the metadata indicates operation codes or functions for each of the one or more execution nodes, catalog content, list of objects, filters, predicates, projections, and data types, wherein the fragment of the execution plan tree is sent to the storage layer based on the identification and the topology such that the one or more nodes associated with the engine layer send the fragment to a corresponding one or more partitions associated with the storage layer, wherein the fragment of the execution plan tree includes at least one of: an individual execution node or a plurality of execution nodes.
11 . The system of claim 7 , wherein the storage layer of the data warehouse architecture is disaggregated and decoupled from the engine layer.
12 . A system of processing a Structured Query Language (SQL) query within a data warehouse architecture, wherein the system comprises: a memory; and at least one processor in communication with the memory, the at least one processor configured to: receive, by one or more nodes associated with an engine layer, a SQL query from a client device, wherein the engine layer indicates component of the data warehouse architecture configured for receiving, interpreting, optimizing, and executing the SQL query against stored data; determine, by the one or more nodes associated with the engine layer, an execution plan tree based on correlating the SQL query, a received data volume value, and pre-defined capabilities of a storage layer, wherein to determine the execution plan tree, the at least one processor is configured to: optimize, by the engine layer, the received SQL query to determine the received data volume value indicating an amount of data volume that the storage layer sends to the engine layer for processing the received SQL query; correlate the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer; and assign, one or more operations for processing the received SQL query to the engine layer and the storage layer, based on correlating the SQL query, the received data volume value, and the pre-defined capabilities of the storage layer, thereby determining the execution plan tree, and wherein the execution plan tree indicates the one or more operations to be executed by the engine layer and the storage layer corresponding to the received SQL query; send, by the one or more nodes associated with the engine layer, a fragment of the execution plan tree to the storage layer for processing the one or more operations included in the execution plan tree such that the received or fetched data volume value and a computational load value of the engine layer is reduced, wherein the fragment of the execution plan tree is sent to the storage layer based on: identification of one or more execution nodes in the execution plan tree assigned to the storage layer, and a topology received from the storage layer, wherein each of the one or more execution nodes include metadata thereby required for processing the received SQL query, and wherein the topology indicates arrangement of the stored data among one or more partitions associated with the storage layer; process, by the storage layer, the one or more operations corresponding to the fragment of the execution plan tree; send, by the storage layer, processed results to the engine layer thereby reducing the received data volume value and the computational load value of the engine layer; and generate, by the engine layer, a compiled result-set by combining the processed results received from the storage layer with computed results generated within the engine layer, thereby passing the compiled result-set to the client device.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS This application claims the benefit of priority to U.S. Provisional Patent Application No. 63/619,341, filed Jan. 10, 2024, the contents of which are incorporated herein by reference in their entirety. FIELD OF THE INVENTION The disclosure generally relates to query processing engines, and more particularly relates to systems and methods for processing Structured Query Language (SQL) queries within a data warehouse architecture. BACKGROUND The data warehouse architecture is a centralized repository that stores huge volumes of data from multiple sources within an organization. Further, the data warehouse is adapted to support business intelligence and analytics activities to provide a foundation for making data-driven decisions. Data warehouses (analytics databases), such as Teradata® were successful in a fast Structured Query Language (SQL) processing over large data volumes. However, such data warehouses had one limitation i.e., data storage and compute resources were tightly coupled. As a result, if the data size increases, such data warehouses are required to increase both the storage and compute resources because compute resources and storage are tightly coupled. However, such data warehouses failed to use all the data for analysis. Thus, the compute resources associated with cold data (information that is infrequently accessed or rarely used) were unused. Further, with exponential growth in data, it became exorbitantly expensive to buy and run such systems. To solve the aforementioned problems, modern data warehouses, such as Snowflake®, came into business. The architecture of such modern data warehouses is built upon the concept of a compute-storage separation (also referred to as compute-storage disaggregation). In the compute-storage separation, data is permanently stored in cheap and highly scalable cloud storage services, such as Amazon-S3®, and the SQL query engine may be launched on-demand in a temporary, use-and-destroy compute instances that are completely decoupled from the cloud storage. The advantage of such architecture is that the volume of data may become huge and grow perpetually within the cheap and scalable cloud storage. However, the expensive computing resources required for analysis do not have to increase proportionally. Also, only the cloud storage service is required to be always on, but the SQL query engine may be turned on only when some analysis is required to be run. However, the modern compute-storage-disaggregated architecture has a major problem i.e., the access path between the SQL compute engine and the cloud storage is slow (higher latency and lower bandwidth) than that in the compute-storage-aggregated architecture in which data is stored in the hard drives directly attached to the CPU (as opposed to being separated by a network in between). This problem results in excessive compute consumption by the SQL engines. To satisfy the high-throughput-analysis requirements of the users, the SQL engines typically launch many instances trying to parallelly fetch data from the cloud storage to a local memory and then perform computation on locally fetched data. Since many parallel compute instances may be required to concurrently fetch data from the cloud storage and all such parallel compute instances may spend time waiting for the slow cloud storage to send the data, it may result in high cloud bills when using data warehouses/analytics as compute is the most expensive resource on all the public clouds. Therefore, in view of the above-mentioned problems, there is a need to provide a method and system for processing the SQL queries within the data warehouse architecture. SUMMARY This summary is provided to introduce a selection of concepts, in a simplified format, that are further described in the detailed description of the disclosure. This summary is neither intended to identify key or essential inventive concepts of the disclosure and nor is it intended to determine the scope of the disclosure. In an embodiment, a method for processing Structured Query Language (SQL) queries within data warehouse architecture is disclosed. The method includes receiving, by one or more nodes associated with an engine layer, an SQL query from a client device, wherein the engine layer indicates a component of the data warehouse architecture configured for receiving, interpreting, optimizing, and executing the SQL query against stored data. Further, the method includes receiving, by the one or more nodes associated with the engine layer, a topology from a storage layer in response to receiving the SQL query, wherein the topology indicates an arrangement of the stored data among one or more partitions associated with the storage layer, wherein the storage layer indicates a component of the data warehouse architecture for storing and managing data. Furthermore, the method includes determining, by the one or more nodes associated with the engine layer, an ex