US-12626186-B2 - Automated machine learning for network-based database systems


Abstract

The subject technology receives first party training data provided by an end-user of a baseline machine learning model. The subject technology determines a first set of common features based on the first party training data. The subject technology receives, from at least one data source, a set of datasets comprising raw data from an external data supplier different from the end-user. The subject technology determines a second set of common features based on the set of datasets. The subject technology trains, using the first set of common features and the second set of common features, a second machine learning model, the second machine learning model incorporating additional training data from the external data supplier during training compared to the baseline machine learning model. The subject technology generates a boosted machine learning model based at least in part on the training, the boosted machine learning model comprising the trained second machine learning model.
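
As an illustration of the feature-determination steps summarized above, the following is a minimal sketch, not the patented implementation. It assumes that "common features" are the fields of a shared schema present in every row of a dataset, and that the supplier's features are appended as additional columns by joining on a shared key. The names `common_features`, `append_as_columns`, the schema, and the `customer_id` key are hypothetical and do not appear in the source.

```python
from typing import Dict, List, Optional

Row = Dict[str, object]


def common_features(rows: List[Row], schema: List[str]) -> List[str]:
    """Return the schema fields present in every row, in schema order."""
    if not rows:
        return []
    return [field for field in schema if all(field in row for row in rows)]


def append_as_columns(
    first: List[Row],
    second: List[Row],
    key: str,
    first_feats: List[str],
    second_feats: List[str],
) -> List[Row]:
    """Append supplier features to first-party rows as extra columns.

    ASSUMPTION: rows are matched on a shared key; unmatched rows get None.
    """
    by_key = {row[key]: row for row in second}
    merged: List[Row] = []
    for row in first:
        out: Row = {f: row[f] for f in first_feats}
        match: Optional[Row] = by_key.get(row[key])
        for f in second_feats:
            if f not in out:  # do not overwrite first-party columns
                out[f] = match[f] if match is not None else None
        merged.append(out)
    return merged


# Hypothetical schema and data, for illustration only.
SCHEMA = ["customer_id", "tenure_months", "region_income"]
first_party = [
    {"customer_id": 1, "tenure_months": 12, "internal_note": "a"},
    {"customer_id": 2, "tenure_months": 30, "internal_note": "b"},
]
supplier = [{"customer_id": 1, "region_income": 54000, "vendor_flag": True}]

first_feats = common_features(first_party, SCHEMA)
second_feats = common_features(supplier, SCHEMA)
training_rows = append_as_columns(
    first_party, supplier, "customer_id", first_feats, second_feats
)
```

In this sketch the merged rows would then be passed to whatever training routine produces the second model; the source does not specify that routine, so it is left out here.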

Inventors

Rachel Frances Blum
Nancy Dou
Matthew J. Glickman
Boxin Jiang
Orestis KOSTAKIS
Justin Langseth
Michael Earle Rainey
Haoran YU

Assignees

SNOWFLAKE INC.

Dates

Publication Date: 2026-05-12
Application Date: 2022-08-23

Claims (14)

1. A system comprising: at least one hardware processor; and a memory storing instructions that cause the at least one hardware processor to perform operations comprising: receiving first party training data provided by an end-user of a baseline machine learning model; determining a first set of common features based on the first party training data; receiving, from at least one data source, a set of datasets comprising raw data from an external data supplier different from the end-user; determining a second set of common features based on the set of datasets; training, using the first set of common features and the second set of common features, a second machine learning model, the second machine learning model incorporating additional training data from the external data supplier during training compared to the baseline machine learning model; generating a boosted machine learning model based at least in part on the training, the boosted machine learning model comprising the trained second machine learning model; performing, using the baseline machine learning model, a first set of predictions using first party testing data; evaluating the first set of predictions based at least in part on a first comparison with the first party testing data; determining a first accuracy score based at least in part on the evaluating; performing, using the boosted machine learning model, a second set of predictions using the first party testing data; evaluating the second set of predictions based at least in part on a second comparison with the first party testing data; determining a second accuracy score based at least in part on the evaluating; determining a percentage boost value based at least in part on a first area under a curve of the first set of predictions and a second area under a curve of the second set of predictions; and providing the percentage boost value to a boost pricing model to generate, using at least the percentage boost value and a number of rows scored, a value indicating a cost of utilizing the boosted machine learning model.
2. The system of claim 1, wherein the second accuracy score is greater than the first accuracy score.
3. The system of claim 1, wherein the operations further comprise: receiving, after generating the boosted machine learning model, a set of additional testing data, the set of additional testing data including a particular set of features corresponding to new testing data; determining a set of other features missing from the particular set of features, the set of other features being included in the first set of common features or the second set of common features; generating, using a set of feature gap filler models, a set of values corresponding to the set of other features; and performing a concatenation operation to include the set of values in the first party testing data.
4. The system of claim 3, wherein performing, using the boosted machine learning model, the second set of predictions is based on the first party testing data including the set of values.
5. The system of claim 1, wherein the end-user of the baseline machine learning model comprises an entity or particular user performing machine learning development using a database system, the entity or the particular user being separate from the external data supplier that is associated with a different entity.
6. The system of claim 5, wherein determining the first set of common features and determining the second set of common features are based on a sub-industry standard schema of common fields, or based on determining a set of columns or a set of rows that are included in the first party training data and the set of datasets.
7. The system of claim 1, wherein the second set of common features is appended to the first set of common features as a set of additional rows of data or a set of additional columns of data.
8. A method comprising: receiving first party training data provided by an end-user of a baseline machine learning model; determining a first set of common features based on the first party training data; receiving, from at least one data source, a set of datasets comprising raw data from an external data supplier different from the end-user; determining a second set of common features based on the set of datasets; training, using the first set of common features and the second set of common features, a second machine learning model, the second machine learning model incorporating additional training data from the external data supplier during training compared to the baseline machine learning model; generating a boosted machine learning model based at least in part on the training, the boosted machine learning model comprising the trained second machine learning model; performing, using the baseline machine learning model, a first set of predictions using first party testing data; evaluating the first set of predictions based at least in part on a first comparison with the first party testing data; determining a first accuracy score based at least in part on the evaluating; performing, using the boosted machine learning model, a second set of predictions using the first party testing data; evaluating the second set of predictions based at least in part on a second comparison with the first party testing data; determining a second accuracy score based at least in part on the evaluating; determining a percentage boost value based at least in part on a first area under a curve of the first set of predictions and a second area under a curve of the second set of predictions; and providing the percentage boost value to a boost pricing model to generate, using at least the percentage boost value and a number of rows scored, a value indicating a cost of utilizing the boosted machine learning model.
9. The method of claim 8, wherein the second accuracy score is greater than the first accuracy score.
10. The method of claim 8, further comprising: receiving, after generating the boosted machine learning model, a set of additional testing data, the set of additional testing data including a particular set of features corresponding to new testing data; determining a set of other features missing from the particular set of features, the set of other features being included in the first set of common features or the second set of common features; generating, using a set of feature gap filler models, a set of values corresponding to the set of other features; and performing a concatenation operation to include the set of values in the first party testing data.
11. The method of claim 10, wherein performing, using the boosted machine learning model, the second set of predictions is based on the first party testing data including the set of values.
12. The method of claim 8, wherein the end-user of the baseline machine learning model comprises an entity or particular user performing machine learning development using a database system, the entity or the particular user being separate from the external data supplier that is associated with a different entity.
13. The method of claim 12, wherein determining the first set of common features and determining the second set of common features are based on a sub-industry standard schema of common fields, or based on determining a set of columns or a set of rows that are included in the first party training data and the set of datasets.
14. A non-transitory computer-storage medium comprising instructions that, when executed by one or more processors of a machine, configure the machine to perform operations comprising: receiving first party training data provided by an end-user of a baseline machine learning model; determining a first set of common features based on the first party training data; receiving, from at least one data source, a set of datasets comprising raw data from an external data supplier different from the end-user; determining a second set of common features based on the set of datasets; training, using the first set of common features and the second set of common features, a second machine learning model, the second machine learning model incorporating additional training data from the external data supplier during training compared to the baseline machine learning model; generating a boosted machine learning model based at least in part on the training, the boosted machine learning model comprising the trained second machine learning model; performing, using the baseline machine learning model, a first set of predictions using first party testing data; evaluating the first set of predictions based at least in part on a first comparison with the first party testing data; determining a first accuracy score based at least in part on the evaluating; performing, using the boosted machine learning model, a second set of predictions using the first party testing data; evaluating the second set of predictions based at least in part on a second comparison with the first party testing data; determining a second accuracy score based at least in part on the evaluating; determining a percentage boost value based at least in part on a first area under a curve of the first set of predictions and a second area under a curve of the second set of predictions; and providing the percentage boost value to a boost pricing model to generate, using at least the percentage boost value and a number of rows scored, a value indicating a cost of utilizing the boosted machine learning model.
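
The evaluation and pricing steps recited in the independent claims can be sketched as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the claims do not say which curve the area is taken under, how the percentage boost is derived from the two areas, or what form the boost pricing model takes. Here the curve is assumed to be the ROC curve for binary labels, the boost is assumed to be the relative AUC improvement, and the pricing model is assumed to be linear in both the boost and the number of rows scored. The function names and the `rate_per_row` parameter are hypothetical.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC computed by comparing every positive score with every negative score."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # A correctly ordered pair counts 1, a tie counts 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def percentage_boost(auc_baseline: float, auc_boosted: float) -> float:
    """ASSUMPTION: boost is the relative AUC improvement, in percent."""
    if auc_baseline <= 0:
        raise ValueError("baseline AUC must be positive")
    return (auc_boosted - auc_baseline) / auc_baseline * 100.0


def boost_cost(pct_boost: float, rows_scored: int, rate_per_row: float) -> float:
    """ASSUMPTION: hypothetical linear pricing; nothing is charged without a boost."""
    return rows_scored * rate_per_row * max(pct_boost, 0.0) / 100.0


# Hypothetical first party testing labels and model outputs.
labels = [1, 0, 1, 0]
baseline_scores = [0.6, 0.7, 0.8, 0.2]
boosted_scores = [0.9, 0.3, 0.8, 0.2]

auc_base = roc_auc(labels, baseline_scores)   # 0.75
auc_boost = roc_auc(labels, boosted_scores)   # 1.0
boost = percentage_boost(auc_base, auc_boost)
cost = boost_cost(boost, rows_scored=10_000, rate_per_row=0.001)
```

Clamping the boost at zero is a design choice of this sketch so that a boosted model that performs no better than the baseline incurs no cost; the source does not state how that case is handled.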

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Patent Application Ser. No. 63/371,583, filed Aug. 16, 2022, entitled “IMPROVED AUTOMATED MACHINE LEARNING FOR NETWORK-BASED DATABASE SYSTEMS,” the contents of which are incorporated herein by reference in their entirety for all purposes.

TECHNICAL FIELD

Embodiments of the disclosure relate generally to a network-based database system or a cloud data platform.

BACKGROUND

Cloud-based data warehouses and other database systems or data platforms sometimes provide support for transactional processing that enables such systems to perform operations that are not available through the built-in, system-defined functions. However, for mitigating security risks, security mechanisms to ensure that user code running on such systems remains isolated are needed.

Improving the accuracy of machine learning (ML) modeling is a goal of many companies. Existing approaches, however, sometimes lack the utilization of various data sources to achieve such improved accuracy of ML models, thereby limiting an amount of improvement.

BRIEF DESCRIPTION OF THE DRAWINGS

The present disclosure will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the disclosure.

FIG. 1 illustrates an example computing environment that includes a network-based database system in communication with a cloud storage platform, in accordance with some embodiments of the present disclosure.

FIG. 2 is a block diagram illustrating components of a compute service manager, in accordance with some embodiments of the present disclosure.

FIG. 3 is a block diagram illustrating components of an execution platform, in accordance with some embodiments of the present disclosure.

FIG. 4 is a computing environment conceptually illustrating an example software architecture for providing automated machine learning using raw data from various sources, in accordance with some embodiments of the present disclosure.

FIG. 5 is a flow diagram illustrating operations of a database system or computing environment in performing a method, in accordance with some embodiments of the present disclosure.

FIG. 6 is a flow diagram illustrating operations of a database system or computing environment in performing a method, in accordance with some embodiments of the present disclosure.

FIG. 7 is a flow diagram illustrating operations of a database system or computing environment in performing a method, in accordance with some embodiments of the present disclosure.

FIG. 8 illustrates a diagrammatic representation of a machine in the form of a computer system within which a set of instructions may be executed for causing the machine to perform any one or more of the methodologies discussed herein, in accordance with some embodiments of the present disclosure.

DETAILED DESCRIPTION

Reference will now be made in detail to specific example embodiments for carrying out the inventive subject matter. Examples of these specific embodiments are illustrated in the accompanying drawings, and specific details are set forth in the following description in order to provide a thorough understanding of the subject matter. It will be understood that these examples are not intended to limit the scope of the claims to the illustrated embodiments. On the contrary, they are intended to cover such alternatives, modifications, and equivalents as may be included within the scope of the disclosure.

Embodiments of the subject technology enable improving the accuracy of ML modeling (e.g., increasing signal for more accurate prediction) by securely connecting to datasets made available by various data suppliers in a data cloud. Moreover, such data suppliers (e.g., customers that have agreed to share data, such as data providers with opt-in consent) could be compensated for their participation in securely sharing datasets. Until now, this type of data pooling for cross-company, cross-dataset analysis at this scale could not be done. Historically, to perform cross-company dataset joins, sets of data would have to be configured, encrypted or anonymized, and moved in and out of databases for joint analysis, requiring a significant amount of engineering with limited impact.

Using approaches described by embodiments herein, the subject system automatically enriches the user's data with existing data from other participating users to produce better ML models. All computations are performed without data being revealed to any other participant. In addition, because the processes described herein are opt-in for suppliers with no required configuration for either supplier or requester, and the processes are fully automated, improved ML results can be achieved with no engineering resources for high impact in an instantaneous manner.

FIG. 1 illustrates an example computing environment 100 that includes a database system in the example form of a network-based database system