Search

US-20260127156-A1 - Systems and Methods for Data Processing for Cloud-Based Data Warehouses

US20260127156A1US 20260127156 A1US20260127156 A1US 20260127156A1US-20260127156-A1

Abstract

Systems and methods are provided for processing data for cloud-based data warehousing. The method may receive a request to insert a large amount of data items into a table of a cloud-based database. The method may create a plurality of temporary tables in the cloud-based database in response to the request. The method may divide the large amount of data items into a plurality of sets of data items. The method may insert each set of data items of the plurality of sets of data items into a corresponding temporary table of the plurality of temporary tables. The method may merge the plurality of temporary tables into the table of the cloud-based database.

Inventors

  • Bharath SRINIVASAIAH
  • Umamaheswara Reddy KUDUMULA
  • Krishna KANUMURI
  • Sangameswara Rao GAVINI
  • Christopher Roper
  • Ian Gunn
  • Pisharath KRISHNAN
  • Anil LOOMBA
  • Ashok Chennuru

Assignees

  • Elevance Health, Inc.

Dates

Publication Date
20260507
Application Date
20251104

Claims (11)

  1. 1 . A method for processing data for cloud-based data warehousing, comprising: receiving a request to insert a plurality of data items exceeding a predefined threshold into a table of a cloud-based database, wherein the cloud-based database imposes a limit on a maximum number of parallel insert operations that can be performed on the table; identifying a particular number of temporary tables to be created, wherein the identifying of the particular number of temporary tables to be created comprises: choosing a first number; creating the first number of temporary tables in the cloud-based database; dividing a collection of data items into the first number of sets of data items; inserting each set of data items of the first number of sets of data items into a corresponding temporary table of the first number of temporary tables; measuring a first time period needed for the inserting of the first number of sets of data items into the first number of temporary tables to complete; choosing a second number that is different from the first number; repeating the creating, dividing, inserting, and measuring operations based on the second number, wherein a second time period is measured for the inserting operation related to the second number; identifying the particular number of temporary tables to be created based on the measured time periods, wherein the identifying of the particular number of temporary tables to be created based on the measured time periods comprises: comparing the first time period with the second time period; choosing the first number to be the particular number when the first time period is shorter than the second time period; and choosing the second number to be the particular number when the second time period is shorter than the first time period; creating the particular number of temporary tables in the cloud-based database; dividing the plurality of data items into the particular number of sets of data items, wherein a maximum size of each set of data items is based on a number of processes writing to the table; inserting each set of data items of the particular number of sets of data items into a corresponding temporary table of the particular number of temporary tables, wherein the inserting of a first set of data items of the particular number of sets of data items into a first temporary table of the particular number of temporary tables is performed in parallel with the inserting of a second set of data items of the particular number of sets of data items into a second temporary table of the particular number of temporary tables; determining whether a number of parallel insert operations performed on the table is less than a threshold value, wherein the threshold value is less than or equal to the maximum number; merging the particular number of temporary tables into the table of the cloud-based database when the number of parallel insert operations performed on the table is determined to be less than the threshold value; and waiting a time period before repeating the determining operation when the number of parallel insert operations performed on the table is determined to be greater than or equal to the threshold value.
  2. 2 . A method for processing data for cloud-based data warehousing, comprising: receiving a request to insert a plurality of data items exceeding a predefined threshold into a table of a cloud-based database, wherein the cloud-based database imposes a limit on a maximum number of parallel insert operations that can be performed on the table; creating a plurality of temporary tables in the cloud-based database based on an amount of the plurality of data items; dividing the plurality of data items into a plurality of sets of data items, wherein a maximum size of each set of data items is based on a number of processes writing to the table; inserting each set of data items of the plurality of sets of data items into a corresponding temporary table of the plurality of temporary tables; and merging the plurality of temporary tables into the table of the cloud-based database.
  3. 3 . The method of claim 2 , wherein the inserting of a first set of data items of the plurality of sets of data items into a first temporary table of the plurality of temporary tables is performed in parallel with the inserting of a second set of data items of the plurality of sets of data items into a second temporary table of the plurality of temporary tables.
  4. 4 . The method of claim 2 , further comprising: determining whether a number of parallel insert operations performed on the table is less than a threshold value, wherein the threshold value is less than or equal to the maximum number, wherein the merging of the plurality of temporary tables into the table is performed when the number of parallel insert operations performed on the table is determined to be less than the threshold value; and waiting a time period before repeating the determining operation when the number of parallel insert operations performed on the table is determined to be greater than or equal to the threshold value.
  5. 5 . The method of claim 4 , further comprising identifying a particular number of temporary tables to be created, wherein the identifying of the particular number of temporary tables to be created comprises: choosing a first number; creating the first number of temporary tables in the cloud-based database; dividing a collection of data items into the first number of sets of data items; inserting each set of data items of the first number of sets of data items into a corresponding temporary table of the first number of temporary tables; measuring a first time period needed for the inserting of the first number of sets of data items into the first number of temporary tables to complete; choosing a second number that is different from the first number; repeating the creating, dividing, inserting, and measuring operations based on the second number; and identifying the particular number of temporary tables to be created based on the measured time periods.
  6. 6 . The method of claim 5 , wherein a second time period is measured for the inserting operation related to the second number, wherein the identifying of the particular number of temporary tables to be created based on the measured time periods comprises: comparing the first time period with the second time period; choosing the first number to be the particular number when the first time period is shorter than the second time period; and choosing the second number to be the particular number when the second time period is shorter than the first time period.
  7. 7 . A computer system for processing data for cloud-based data warehousing, comprising: one or more processors; a display; and a memory, wherein the memory stores one or more programs configured for execution by the one or more processors, and the one or more programs comprising instructions for: receiving a request to insert a plurality of data items exceeding a predefined threshold into a table of a cloud-based database, wherein the cloud-based database imposes a limit on a maximum number of parallel insert operations that can be performed on the table; creating a plurality of temporary tables in the cloud-based database based on an amount of the plurality of data items; dividing the plurality of data items into a plurality of sets of data items, wherein a maximum size of each set of data items is based on a number of processes writing to the table; inserting each set of data items of the plurality of sets of data items into a corresponding temporary table of the plurality of temporary tables; and merging the plurality of temporary tables into the table of the cloud-based database.
  8. 8 . The computer system of claim 7 , wherein instructions for inserting a first set of data items of the plurality of sets of data items into a first temporary table of the plurality of temporary tables is performed in parallel with instructions for inserting a second set of data items of the plurality of sets of data items into a second temporary table of the plurality of temporary tables.
  9. 9 . The computer system of claim 7 , wherein the one or more programs further comprise instructions for: determining whether a number of parallel insert operations performed on the table is less than a threshold value, wherein the threshold value is less than or equal to the maximum number, wherein the instructions for merging the plurality of temporary tables into the table is performed when the number of parallel insert operations performed on the table is determined to be less than the threshold value; and waiting a time period before repeating the determining operation when the number of parallel insert operations performed on the table is determined to be greater than or equal to the threshold value.
  10. 10 . The computer system of claim 9 , wherein the one or more programs further comprise instructions for identifying a particular number of temporary tables to be created, wherein the instructions for identifying the particular number of temporary tables to be created comprises instructions for: choosing a first number; creating the first number of temporary tables in the cloud-based database; dividing a collection of data items into the first number of sets of data items; inserting each set of data items of the first number of sets of data items into a corresponding temporary table of the first number of temporary tables; measuring a first time period needed for the inserting of the first number of sets of data items into the first number of temporary tables to complete; choosing a second number that is different from the first number; repeating the creating, dividing, inserting, and measuring operations based on the second number; and identifying the particular number of temporary tables to be created based on the measured time periods.
  11. 11 . The computer system of claim 10 , wherein a second time period is measured for the inserting operation related to the second number, wherein the instructions for identifying the particular number of temporary tables to be created based on the measured time periods comprise instructions for: comparing the first time period with the second time period; choosing the first number to be the particular number when the first time period is shorter than the second time period; and choosing the second number to be the particular number when the second time period is shorter than the first time period.

Description

RELATED APPLICATIONS This application claims the benefit of, and priority to, U.S. Provisional Patent Application Ser. No. 63/715,933, filed Nov. 4, 2024, entitled “Systems and Methods for Dynamic Query Optimization for Distributed Computing Environments”, which is incorporated by reference herein in its entirety. TECHNICAL FIELD The present disclosure generally relates to data processing and management within cloud-based data warehousing platforms, specifically focusing on cloud-based data warehouses' data processing capabilities. BACKGROUND A data warehouse is an enterprise data platform used for the analysis and reporting of structured and semi-structured data from multiple data sources, such as point-of-sale transactions, marketing automation, customer relationship management, and more. Data warehouses include an analytical database and critical analytical components and procedures. They support ad hoc analysis and custom reporting, such as data pipelines, queries, and business applications. They can consolidate and integrate massive amounts of current and historical data in one place and are designed to give a long-range view of data over time. These data warehouse capabilities have made data warehousing a primary staple of enterprise analytics that help support informed business decisions. A cloud-based data warehouse is a centralized database in a public cloud for storing, processing, integrating, and managing large volumes of structured and semi-structured data. Being a cloud-based data warehouse means that, instead of hosting physical servers and infrastructure on premises, everything happens in online—offsite servers take care of the heavy lifting, and users can access their data and analytics tools over the internet without the need for downloading or setting up any software or applications. Cloud-based data warehouses are easier to set up compared to their traditional counterparts, which generally entails a complex setup. A cloud-based data warehouse stores, integrates, and processes large volumes of data from several sources, whether on-premises or on the internet. A cloud-based data warehouse is critical to make quick, data-driven decisions. It offers improved computational ability and simplified data management, allowing users to extract valuable insights from updated, accurate, and enriched data when needed. While cloud-based data warehouses offer significant benefits, especially when it comes to scalability and flexibility, it has its own set of challenges and complexities. For example, some cloud-based data warehouses may impose limitations on the number of parallel operations performed on a table. These limitations can lead to bottlenecks and performance issues in data processing. Attempts to address the technical problem of limited parallel operations in cloud-based data warehouse involved manual monitoring and management of data processing to avoid exceeding the limit. However, such approach was suboptimal as it required significant time and effort, and did not provide a dynamic and automated solution to the problem. There is an increasing need for efficient data processing for cloud-based data warehouses. SUMMARY The present disclosure aims to overcome the default limitation on the maximum parallel insert operations allowed on a table in cloud-based data warehouses by dynamically creating session-level temporary tables for insert operations. The chief objective of the present disclosure is to increase data processing capacity and optimize parallel insert operations, thereby reducing overall runtimes and getting data to business quicker. The techniques described herein improve the performance of cloud-based data warehouses while effectively managing data processing within the cloud-based data warehouses' limitations. In one aspect, a method is provided for processing data for cloud-based data warehousing, according to some embodiments. The method may include receiving a request to insert a plurality of data items exceeding a predefined threshold into a table of a cloud-based database. The cloud-based database may impose a limit on the maximum number of parallel insert operations that can be performed on the table. The method may also include creating a plurality of temporary tables in the cloud-based database based on the amount of the plurality of data items. The method may also include dividing the plurality of data items into a plurality of sets of data items. The maximum size of each set of data items may be based on the number of processes writing to the table (e.g., to avoid queueing), with the goal of optimizing parallel insert operations within the cloud database's limitations, for example. The method may also include inserting each set of data items of the plurality of sets of data items into a corresponding temporary table of the plurality of temporary tables. The method may also include merging the plurality of temporary tables into the table of the cloud-based database