US-12619593-B2 - Efficient computation over a wide table

US12619593B2US 12619593 B2US12619593 B2US 12619593B2US-12619593-B2

Abstract

Embodiments of the present disclosure provide techniques for efficient computation over a wide table. A processing device determines that a first number of columns of a first table is greater than a threshold number of columns. The processing device transforms the first table into a second table based on the determination, where the second table includes a second number of columns that is less than the first number of columns, and where the second table includes a first column that includes first fields that identify columns of the first table, a second column that includes second fields that identify data types of fields of the first table, and a third column that includes third fields that include data of the fields of the first table. The processing device executes a UDTF on the second table.

Inventors

Suraj P. Acharya
Yimeng Li
Raghavendran Ramakrishnan
Chao Tan
Yunqiao Zhang

Assignees

SNOWFLAKE INC.

Dates

Publication Date: 20260505
Application Date: 20240626

Claims (20)

1 . A method, comprising: determining that a first number of columns of a first table in a column-oriented database is greater than a threshold number of columns, wherein the first table includes data in fields; transforming, by a processing device, the first table into a second table in the column-oriented database based on the determination that the first number of columns of the first table in the column-oriented database is greater than the threshold number of columns, wherein the second table comprises a second number of columns that is less than the first number of columns, and wherein the second table comprises a first column that includes first fields that identify columns of the first table, a second column that includes second fields that identify data types of the fields of the first table, and a third column that includes third fields that include the data of the fields of the first table; and executing a user-defined table function (UDTF) optimized for columnar data processing on the second table in the column-oriented database, wherein executing the UDTF on the second table comprises executing a classification UDTF on the second table, and wherein executing the classification UDTF on the second table comprises: identifying personal data in a row of the second table corresponding to a column in the first table; and associating the row with a system defined tag based on the identification.
2 . The method of claim 1 , further comprising: obtaining the first table prior to determining that the first number of columns of the first table is greater than the threshold number of columns.
3 . The method of claim 1 , wherein the personal data comprises an identifier of an individual, a quasi-identifier of the individual, or sensitive information of the individual.
4 . The method of claim 1 , further comprising: outputting results of the executed UDTF to a computing device.
5 . The method of claim 1 , wherein executing the UDTF on the second table comprises executing the UDTF on a column-by-column basis on the second table.
6 . The method of claim 1 , further comprising: obtaining, based on the execution of the UDTF, a tabular value for each row of the second table.
7 . The method of claim 1 , wherein transforming the first table into the second table comprises executing an unpivot operation on the first table.
8 . The method of claim 1 , further comprising: sampling the first table to obtain a row count of the first table, wherein executing the UDTF is based on the row count.
9 . The method of claim 1 , wherein the first table comprises a first number of rows and the second table comprises a second number of rows, and wherein the first number of rows is less than the second number of rows.
10 . The method of claim 1 , wherein transforming the first table into the second table comprises transforming the first table into the second table at an execution platform, and wherein executing the UDTF on the second table comprises executing the UDTF on the second table at the execution platform.
11 . A system, comprising: a memory; and a processing device operatively coupled to the memory, the processing device to: determine that a first number of columns of a first table in a column-oriented database is greater than a threshold number of columns, wherein the first table includes data in fields; transform the first table into a second table in the column-oriented database based on the determination that the first number of columns of the first table in the column-oriented database is greater than the threshold number of columns, wherein the second table comprises a second number of columns that is less than the first number of columns, and wherein the second table comprises a first column that includes first fields that identify columns of the first table, a second column that includes second fields that identify data types of the fields of the first table, and a third column that includes third fields that include the data of the fields of the first table; and execute a user-defined table function (UDTF) on the second table optimized for columnar data processing in the column-oriented database, wherein to execute the UDTF on the second table, the processing device is to execute a classification UDTF on the second table, and wherein to execute the classification UDTF on the second table, the processing device is to: identify personal data in a row of the second table corresponding to a column in the first table; and associate the row with a system defined tag based on the identification.
12 . The system of claim 11 , wherein to execute the UDTF on the second table, the processing device is to execute the UDTF on a column-by-column basis on the second table.
13 . The system of claim 11 , wherein the personal data comprises an identifier of an individual, a quasi-identifier of the individual, or sensitive information of the individual.
14 . The system of claim 11 , wherein the processing device is further to: output results of the executed UDTF to a computing device.
15 . The system of claim 11 , wherein the first table comprises a first number of rows and the second table comprises a second number of rows, and wherein the first number of rows is less than the second number of rows.
16 . A non-transitory computer-readable medium having instructions stored thereon which, when executed by a processing device, cause the processing device to: determine that a first number of columns of a first table is greater than a threshold number of columns; transform, by the processing device, the first table into a second table based on the determination, wherein the second table comprises a second number of columns that is less than the first number of columns, and wherein the second table comprises a first column that includes first fields that identify columns of the first table, a second column that includes second fields that identify data types of fields of the first table, and a third column that includes third fields that include data of the fields of the first table; and execute a user-defined table function (UDTF) on the second table, wherein to execute the UDTF on the second table, the instructions, when executed by the processing device, cause the processing device to execute a classification UDTF on the second table, and wherein to execute the classification UDTF on the second table, the instructions, when executed by the processing device, cause the processing device to: identify personal data in a row of the second table corresponding to a column in the first table; and associate the row with a system defined tag based on the identification.
17 . The non-transitory computer-readable medium of claim 16 , wherein to execute the UDTF on the second table, the instructions, when executed by the processing device, cause the processing device to execute the UDTF on a column-by-column basis on the second table.
18 . The non-transitory computer-readable medium of claim 16 , wherein the personal data comprises an identifier of an individual, a quasi-identifier of the individual, or sensitive information of the individual.
19 . The non-transitory computer-readable medium of claim 16 , wherein the instructions, when executed by the processing device, cause the processing device further to: output results of the executed UDTF to a computing device.
20 . The non-transitory computer-readable medium of claim 16 , wherein the first table comprises a first number of rows and the second table comprises a second number of rows, and wherein the first number of rows is less than the second number of rows.

Description

TECHNICAL FIELD The present disclosure relates to databases, and particularly to efficient computation over a wide table. BACKGROUND Databases are widely used for data storage and access in computing applications. Databases may include one or more tables that include or reference data that may be read, modified, or deleted using queries. Databases may be used for storing and/or accessing personal information or other sensitive information. Secure storage and access of database data may be provided by encrypting and/or storing data in an encrypted form to prevent unauthorized access. In some cases, data sharing may be desirable to let computing devices of entities perform queries against a set of data. BRIEF DESCRIPTION OF THE DRAWINGS The described embodiments and the advantages thereof may best be understood by reference to the following description taken in conjunction with the accompanying drawings. These drawings in no way limit any changes in form and detail that may be made to the described embodiments by one skilled in the art without departing from the spirit and scope of the described embodiments. FIG. 1A is a block diagram depicting an example computing environment in which the methods disclosed herein may be implemented in accordance with some aspects of the present disclosure. FIG. 1B is a block diagram illustrating an example virtual warehouse in accordance with some aspects of the present disclosure. FIG. 2 is a block diagram illustrating components of the compute service manager, in accordance with aspects of the present disclosure. FIG. 3 is a block diagram depicting an embodiment of an execution platform in accordance with some embodiments of the present disclosure. FIG. 4 is a diagram illustrating an example of a first query plan and an example of a second query plan in accordance with some aspects of the present disclosure. FIG. 5 is a diagram illustrating an example of a first table and an example of a second table in accordance with some aspects of the present disclosure. FIG. 6 is a diagram illustrating an example of classification in accordance with some aspects of the present disclosure. FIG. 7 is a diagram illustrating an example of a computing system in accordance with some aspects of the present disclosure. FIG. 8 is a flow diagram of a method for efficient computation over a wide table in accordance with some embodiments of the present disclosure. FIG. 9 is a block diagram of an example computing device that may perform one or more of the operations described herein in accordance with some aspects of the present disclosure. DETAILED DESCRIPTION Data providers often have data assets that are cumbersome to share, but of interest to another entity. For example, a large online retail company may have a data set that includes the purchasing habits of millions of consumers over the last ten years. If the online retailer wishes to share all or a portion of this data with another entity, the online retailer may need to use old and slow methods to transfer the data, such as a file-transfer-protocol (FTP), or even copying the data onto physical media and mailing the physical media to the other entity. This has several disadvantages. First, it is slow as copying terabytes or petabytes of data can take days. Second, once the data is delivered, the provider cannot control what happens to the data. The recipient can alter the data, make copies, or share it with other parties. Third, the only entities that would be interested in accessing such a large data set in such a manner are large corporations that can afford the complex logistics of transferring and processing the data as well as the high price of such a cumbersome data transfer. Thus, smaller entities (e.g., “mom and pop” shops) or even smaller, nimbler cloud-focused startups are often priced out of accessing this data, even though the data may be valuable to their businesses. This may be because raw data assets are generally too unpolished and full of potentially sensitive data to simply outright sell/provide to other companies. Data cleaning, de-identification, aggregation, joining, and other forms of data enrichment need to be performed by the owner of data before it is shareable with another party. This is burdensome on computational resources, time-consuming, and expensive. Finally, it is difficult to share data assets with many entities because traditional data sharing methods do not allow scalable sharing for the reasons mentioned above. Traditional sharing methods also introduce latency and delays in terms of all parties having access to the most recently-updated data. Private and public data exchanges may allow data providers to more easily and securely share their data assets with other entities (e.g., data consumers). A public data exchange (also referred to herein as a “Snowflake data marketplace,” or a “data marketplace”) may provide a centralized repository with open access where a data provider may publish and control liv