US-12626150-B2 - Self-learned reference mechanism for computer model selection

Abstract

A method, system, and computer program product for self-learning reference mechanisms for model selection in AutoAI. The method identifies a set of data summary statistics within a data set. A data pattern group is identified within the set of data summary statistics. The data pattern group is determined to be mature. A model selection acceleration mechanism (MSAM) model is generated based on the data pattern group. The method predicts a set of top-k models for the data set based on the MSAM model.
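
The first steps summarized above (computing summary statistics, assigning a data pattern group, and checking maturity) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the particular statistics, the bucketing rule in `pattern_group`, the `KnowledgeBase` class, and the `min_per_model` threshold are all hypothetical. The claims tie maturity to whether sample sizes allow multinomial logistic regression; the sketch approximates that with a simple per-model count.

```python
from collections import Counter, defaultdict


def summary_statistics(rows):
    """Compute a few illustrative statistics for a non-empty list of dict records."""
    columns = sorted({name for row in rows for name in row})
    cells = [row.get(name) for row in rows for name in columns]
    numeric = sum(isinstance(cell, (int, float)) for cell in cells)
    return {
        "n_rows": len(rows),
        "n_columns": len(columns),
        "missing_rate": sum(cell is None for cell in cells) / len(cells),
        "numeric_rate": numeric / len(cells),
    }


def pattern_group(stats):
    """Hypothetical grouping rule: bucket data sets by size and dominant type."""
    size = "small" if stats["n_rows"] < 1000 else "large"
    kind = "numeric" if stats["numeric_rate"] >= 0.5 else "categorical"
    return (size, kind)


class KnowledgeBase:
    """Stores (statistics, best model) records per data pattern group."""

    def __init__(self, min_per_model=10):
        # Assumed stand-in for "sample sizes allow multinomial logistic regression".
        self.min_per_model = min_per_model
        self.groups = defaultdict(list)

    def add(self, rows, best_model):
        """Record a data set together with the model chosen by raw model selection."""
        stats = summary_statistics(rows)
        group = pattern_group(stats)
        self.groups[group].append((stats, best_model))
        return group

    def is_mature(self, group):
        """A group is mature once at least two models each have enough samples."""
        counts = Counter(model for _, model in self.groups.get(group, []))
        return len(counts) >= 2 and min(counts.values()) >= self.min_per_model


kb = KnowledgeBase(min_per_model=2)
data_set = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": None}]
group = kb.add(data_set, best_model="random_forest")
print(group, kb.is_mature(group))  # still immature: only one record so far
```

Calling `add` again with new data sets and then `is_mature` mirrors the update and reevaluation steps: a group that starts immature becomes mature once enough records accumulate.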

Inventors

Jiang Bo Kang
Yao Dong Liu
Jun Wang
Dong Hai Yu
Bo Song

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 2026-05-12
Application Date: 2022-06-28

Claims (17)

1. A computer-implemented method, comprising: identifying a set of data summary statistics within a data set; identifying a data pattern group within the set of data summary statistics; determining that the data pattern group is immature based on a determination that sample sizes within the data pattern group do not allow multinomial logistic regression to be performed on the data pattern group; creating a knowledge base using the data set; updating the knowledge base with a new data set; reevaluating maturity of the data pattern group in the knowledge base; determining that the data pattern group is now mature based on a determination that sample sizes within the data pattern group allow multinomial logistic regression to be performed on the data pattern group; generating a model selection acceleration mechanism (MSAM) model using the data pattern group of the knowledge base to predict a set of top-k models for operating on the data pattern group; and executing the MSAM model to output the predicted set of top-k models for operating on the data pattern group.
2. The method of claim 1, wherein the set of top-k models is a first set of top-k models, the method further comprising: performing raw model selection on the set of data; and selecting a second set of top-k models from the raw model selection.
3. The method of claim 2, wherein identifying the data pattern group further comprises: identifying, within the knowledge base, a subset of data having a set of predictors and at least one model of the second set of top-k models generated from raw model selection.
4. The method of claim 2, further comprising: evaluating the MSAM model based on the first set of top-k models and the second set of top-k models.
5. The method of claim 4, wherein the MSAM model is a first MSAM model, the method further comprising: determining that an accuracy value for the MSAM model is below an accuracy threshold; and generating a second MSAM model based on a subsequent data pattern group of the data set.
6. The method of claim 1, wherein the knowledge base is a self-learning knowledge base configured to self-evaluate maturity of data pattern groups within the data set.
7. A system, comprising: one or more processors; and a computer-readable storage medium, coupled to the one or more processors, storing program instructions that, when executed by the one or more processors, cause the one or more processors to perform operations comprising: identifying a set of data summary statistics within a data set; identifying a data pattern group within the set of data summary statistics; determining that the data pattern group is immature based on a determination that sample sizes within the data pattern group do not allow multinomial logistic regression to be performed on the data pattern group; creating a knowledge base using the data set; updating the knowledge base with a new data set; reevaluating maturity of the data pattern group in the knowledge base; determining that the data pattern group is now mature based on a determination that sample sizes within the data pattern group allow multinomial logistic regression to be performed on the data pattern group; generating a model selection acceleration mechanism (MSAM) model using the data pattern group of the knowledge base to predict a set of top-k models for operating on the data pattern group; and executing the MSAM model to output the predicted set of top-k models for operating on the data pattern group.
8. The system of claim 7, wherein the set of top-k models is a first set of top-k models, the operations further comprising: performing raw model selection on the set of data; and selecting a second set of top-k models from the raw model selection.
9. The system of claim 8, wherein identifying the data pattern group further comprises: identifying, within the knowledge base, a subset of data having a set of predictors and at least one model of the second set of top-k models generated from raw model selection.
10. The system of claim 8, wherein the operations further comprise: evaluating the MSAM model based on the first set of top-k models and the second set of top-k models.
11. The system of claim 10, wherein the MSAM model is a first MSAM model, the operations further comprising: determining that an accuracy value for the MSAM model is below an accuracy threshold; and generating a second MSAM model based on a subsequent data pattern group of the data set.
12. The system of claim 7, wherein the knowledge base is a self-learning knowledge base configured to self-evaluate maturity of data pattern groups within the data set.
13. A computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions being executable by one or more processors to cause the one or more processors to perform operations comprising: identifying a set of data summary statistics within a data set; identifying a data pattern group within the set of data summary statistics; determining that the data pattern group is immature based on a determination that sample sizes within the data pattern group do not allow multinomial logistic regression to be performed on the data pattern group; creating a knowledge base using the data set; updating the knowledge base with a new data set; reevaluating maturity of the data pattern group in the knowledge base; determining that the data pattern group is now mature based on a determination that sample sizes within the data pattern group allow multinomial logistic regression to be performed on the data pattern group; generating a model selection acceleration mechanism (MSAM) model using the data pattern group of the knowledge base to predict a set of top-k models for operating on the data pattern group; and executing the MSAM model to output the predicted set of top-k models for operating on the data pattern group.
14. The computer program product of claim 13, wherein the set of top-k models is a first set of top-k models, the operations further comprising: performing raw model selection on the set of data; and selecting a second set of top-k models from the raw model selection.
15. The computer program product of claim 14, wherein identifying the data pattern group further comprises: identifying, within the knowledge base, a subset of data having a set of predictors and at least one model of the second set of top-k models generated from raw model selection.
16. The computer program product of claim 14, wherein the operations further comprise: evaluating the MSAM model based on the first set of top-k models and the second set of top-k models.
17. The computer program product of claim 16, wherein the MSAM model is a first MSAM model, the operations further comprising: determining that an accuracy value for the MSAM model is below an accuracy threshold; and generating a second MSAM model based on a subsequent data pattern group of the data set.
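
The MSAM model and its evaluation, as recited in the claims above, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `MSAM` class, the `accuracy` and `needs_rebuild` helpers, the feature vector (shares of numeric, categorical, and text columns), the model names, and the 0.8 threshold are all hypothetical. The sketch fits a small multinomial logistic regression by batch gradient descent and ranks candidate models by predicted probability. The claims do not define the accuracy value; here it is assumed to be the share of data sets whose best raw-selection model appears in the predicted top-k.

```python
import math


def softmax(scores):
    """Convert raw scores to probabilities, shifted by the maximum for stability."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


class MSAM:
    """Toy multinomial logistic regression over data summary statistics."""

    def __init__(self, labels, n_features):
        self.labels = list(labels)
        # One weight vector per candidate model; index 0 holds the bias term.
        self.weights = [[0.0] * (n_features + 1) for _ in self.labels]

    def probabilities(self, features):
        x = [1.0] + list(features)
        return softmax([sum(w * v for w, v in zip(ws, x)) for ws in self.weights])

    def fit(self, samples, targets, epochs=200, rate=0.5):
        """Minimize cross-entropy loss with plain batch gradient descent."""
        for _ in range(epochs):
            grads = [[0.0] * len(ws) for ws in self.weights]
            for features, target in zip(samples, targets):
                x = [1.0] + list(features)
                probs = self.probabilities(features)
                for k, label in enumerate(self.labels):
                    error = probs[k] - (1.0 if label == target else 0.0)
                    for j, value in enumerate(x):
                        grads[k][j] += error * value
            for k, ws in enumerate(self.weights):
                for j in range(len(ws)):
                    ws[j] -= rate * grads[k][j] / len(samples)
        return self

    def top_k(self, features, k=2):
        """Return the k candidate models with the highest predicted probability."""
        ranked = sorted(zip(self.labels, self.probabilities(features)),
                        key=lambda pair: pair[1], reverse=True)
        return [label for label, _ in ranked[:k]]


def accuracy(msam, samples, raw_selections, k=2):
    """Assumed metric: share of data sets whose best raw model is in the top-k."""
    hits = sum(raw[0] in msam.top_k(x, k) for x, raw in zip(samples, raw_selections))
    return hits / len(samples)


def needs_rebuild(msam, samples, raw_selections, threshold=0.8):
    """True when accuracy falls below the (hypothetical) accuracy threshold."""
    return accuracy(msam, samples, raw_selections) < threshold


# Hypothetical features: shares of numeric, categorical, and text columns.
samples = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
targets = ["tree_ensemble", "linear_model", "neural_net"]
msam = MSAM(sorted(set(targets)), n_features=3).fit(samples, targets)
print(msam.top_k([1.0, 0.0, 0.0], k=2))
print(needs_rebuild(msam, samples, [[t] for t in targets]))
```

When `needs_rebuild` returns true, a second MSAM model would be fitted on a subsequent data pattern group, which corresponds to the fallback described in claims 5, 11, and 17.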

Description

BACKGROUND

Automated machine learning (AutoML) is a process of automating manual tasks that are used to build and train machine learning (ML) models. AutoAI is a variation of AutoML and similarly applies automation operations to build predictive ML models. AutoAI runs automated operations of data pre-processing, automated model selection, automated feature engineering, and hyperparameter optimization to build and evaluate candidate ML model pipelines.

SUMMARY

According to an embodiment described herein, a computer-implemented method for self-learning reference mechanisms for model selection in AutoAI is provided. The method identifies a set of data summary statistics within a data set. A data pattern group is identified within the set of data summary statistics. The data pattern group is determined to be mature. A model selection acceleration mechanism (MSAM) model is generated based on the data pattern group. The method predicts a set of top-k models for the data set based on the MSAM model.

According to an embodiment described herein, a system for self-learning reference mechanisms for model selection in AutoAI is provided. The system includes one or more processors and a computer-readable storage medium, coupled to the one or more processors, storing program instructions that, when executed by the one or more processors, cause the one or more processors to perform operations. The operations identify a set of data summary statistics within a data set. A data pattern group is identified within the set of data summary statistics. The data pattern group is determined to be mature. A model selection acceleration mechanism (MSAM) model is generated based on the data pattern group. The operations predict a set of top-k models for the data set based on the MSAM model.

According to an embodiment described herein, a computer program product for self-learning reference mechanisms for model selection in AutoAI is provided. The computer program product includes a computer-readable storage medium having program instructions embodied therewith, the program instructions being executable by one or more processors to cause the one or more processors to identify a set of data summary statistics within a data set. A data pattern group is identified within the set of data summary statistics. The data pattern group is determined to be mature. A model selection acceleration mechanism (MSAM) model is generated based on the data pattern group. The computer program product predicts a set of top-k models for the data set based on the MSAM model.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a block diagram of a computing environment for implementing concepts and computer-based methods, according to at least one embodiment.

FIG. 2 depicts a flow diagram of a computer-implemented method for self-learning reference mechanisms for model selection in AutoAI, according to at least one embodiment.

FIG. 3 depicts a flow diagram of a computer-implemented method for self-learning reference mechanisms for model selection in AutoAI, according to at least one embodiment.

FIG. 4 depicts a block diagram of a computing system for self-learning reference mechanisms for model selection in AutoAI, according to at least one embodiment.

FIG. 5 is a schematic diagram of a cloud computing environment in which concepts of the present disclosure may be implemented, in accordance with an embodiment of the present disclosure.

FIG. 6 is a diagram of model layers of a cloud computing environment in which concepts of the present disclosure may be implemented, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION

The present disclosure relates generally to automated ML processes. More particularly, but not exclusively, embodiments of the present disclosure relate to a computer-implemented method for self-learned reference mechanisms for model selection in automated ML processes. The present disclosure relates further to a related system for automated ML processes, and a computer program product for operating such a system.

AutoAI runs automated operations including data pre-processing, automated model selection, automated feature engineering, and hyperparameter optimization to build and evaluate candidate ML model pipelines. AutoAI may operate by providing data in a structured file or database. The data is automatically prepared. A model type may be selected, at least in part based on the prepared data. The AutoAI process may then generate and rank model pipelines. Once generated and ranked, a model may be selected, saved, and deployed. Data preparation can include feature type detection, missing values imputation, and feature encoding and scaling. Models may be selected based on application of the model or algorithms to data within a data set. In generating and ranking model pipelines, the AutoAI process may optimize hyperparameters and perform feature engineering optimization. Some AutoAI implementations use a data allocation usi