US-12619867-B2 - Automated creation of machine-learning modeling pipelines

US 12619867 B2

Abstract

A computer-implemented method of generating a machine learning model pipeline (“pipeline”) for a task, where the pipeline includes a machine learning model and at least one feature. A machine learning task including a data set and a set of first tags related to the task are received from a user. It is determined whether a database stores a first machine learning model pipeline correlated in the database with a second tag matching at least one first tag received from the user. Upon determining that the database stores the first machine learning model pipeline, the first machine learning model pipeline is retrieved, the retrieved first machine learning model pipeline is run, and the machine learning task is responded to. Pipelines may also be created based on stored pipelines correlated with a tag related to a tag in the task, or from received feature generator(s) and models.
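The abstract describes a three-way lookup: reuse a pipeline stored under a matching tag, else adapt one stored under a related tag, else build a new pipeline from user-supplied building blocks. A minimal sketch of that dispatch logic follows; all identifiers (`respond_to_task`, `db`, `related_tags_of`) are illustrative assumptions, not names from the patent:

```python
# Hypothetical sketch of the pipeline-reuse flow described in the abstract.
# The database is modeled as a mapping from tag to a runnable pipeline.

def respond_to_task(task, dataset, tags, db, related_tags_of):
    """Respond to a machine learning task, preferring stored pipelines."""
    # 1. Exact tag match: retrieve and run a stored pipeline.
    for tag in tags:
        pipeline = db.get(tag)
        if pipeline is not None:
            return pipeline(dataset)
    # 2. Related tag match: reuse a pipeline stored under a related tag.
    for tag in tags:
        for related in related_tags_of(tag):
            pipeline = db.get(related)
            if pipeline is not None:
                return pipeline(dataset)
    # 3. No match: a new pipeline must be built from the received
    #    feature generator(s) and models (not shown here).
    raise LookupError("no stored pipeline; build one from building blocks")
```

A usage sketch: with `db = {"energy": run_energy_pipeline}` and a relation mapping `"wind"` to `["energy"]`, a task tagged `"wind"` falls through the exact-match loop and reuses the `"energy"` pipeline.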

Inventors

  • Francesco Fusco
  • Fearghal O'Donncha
  • Seshu TIRUPATHI

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2020-12-29

Claims (17)

  1. A computer-implemented method of generating a first machine learning model pipeline for a machine learning task, the first machine learning model pipeline including a first machine learning model and at least one feature, the computer-implemented method comprising: receiving, by a computing device, from a user, the machine learning task, a data set, and a set of first tags related to the machine learning task; receiving, by the computing device, from the user, a feature generator comprising executable logic for generating a first plurality of features based on the data set, wherein the feature generator received, by the computing device, from the user is a model building block that defines one or more operations for selecting and processing data in the data set, the first plurality of features is input to the generated first machine learning model pipeline to perform the machine learning task, and the first plurality of features includes the at least one feature; receiving, by the computing device, from the user, a tags library, the tags library structured as a node-relationship graph defining relations between the set of first tags; searching, by the computing device, the tags library using a graph traversal algorithm to identify a second tag matching at least one first tag of the set of first tags; determining, by the computing device, based on the searching of the tags library, whether a database including a machine learning model pipelines library stores the first machine learning model pipeline correlated in the database with the second tag matching the at least one first tag; and upon determining that the database stores the first machine learning model pipeline: retrieving, by the computing device, the first machine learning model pipeline; running, by the computing device, the retrieved first machine learning model pipeline on the data set based on the first plurality of features; and responding to the machine learning task, by the computing device, based on an output of the running of the retrieved first machine learning model pipeline.
  2. The computer-implemented method of claim 1, further comprising: upon determining that the first machine learning model pipeline is not stored in the database, searching, by the computing device, the tags library using the graph traversal algorithm to identify a third tag related to the at least one first tag; determining whether the database stores a second machine learning model pipeline correlated with the third tag; and upon determining that the second machine learning model pipeline is stored in the database: retrieving, by the computing device, the second machine learning model pipeline; running, by the computing device, the feature generator and a second machine learning model in the retrieved second machine learning model pipeline for a plurality of iterations to generate a second plurality of features based on the data set; selecting, by the computing device, one or more features of the second plurality of features; running, by the computing device, a third machine learning model pipeline with the selected one or more features of the second plurality of features and the second machine learning model; responding to the machine learning task, by the computing device, based on an output of the running of the third machine learning model pipeline; and storing, by the computing device, the third machine learning model pipeline in the database in correlation with the third tag.
  3. The computer-implemented method of claim 2, further comprising: receiving, by the computing device, a third machine learning model from the user for running a third plurality of features on the data set to perform the machine learning task; upon determining, by the computing device, that the second machine learning model pipeline is not stored in the database: running, by the computing device, the feature generator and the third machine learning model for a plurality of iterations to generate the third plurality of features; selecting one or more features of the third plurality of features; running a fourth machine learning model pipeline with the selected one or more features of the third plurality of features and the third machine learning model; responding to the machine learning task based on an output of the running of the fourth machine learning model pipeline; and storing the fourth machine learning model pipeline in the database in correlation with the at least one first tag.
  4. The computer-implemented method of claim 3, further comprising: receiving, by the computing device, a plurality of machine learning models, wherein the plurality of machine learning models is different from the first machine learning model, the second machine learning model, and the third machine learning model; running, by the computing device, the feature generator and the plurality of machine learning models for a plurality of iterations to generate a fourth plurality of features; selecting, by the computing device, one or more features of the fourth plurality of features; selecting, by the computing device, a fourth machine learning model of the plurality of machine learning models; running, by the computing device, a fifth machine learning model pipeline with the selected one or more features of the fourth plurality of features and the selected fourth machine learning model; responding to the machine learning task, by the computing device, based on an output of the running of the fifth machine learning model pipeline; and storing the fifth machine learning model pipeline in the database in correlation with the at least one first tag.
  5. The computer-implemented method of claim 4, further comprising: selecting, by the computing device, the one or more features of the fourth plurality of features and the fourth machine learning model based on a performance metric.
  6. The computer-implemented method of claim 5, further comprising, after responding to the machine learning task: continuing to run, by the computing device, the feature generator and the plurality of machine learning models for a plurality of iterations to improve a performance of the fifth machine learning model pipeline with respect to the performance metric.
  7. A computing device, comprising: a processing device; a storage device coupled to the processing device; machine learning modeling code stored in the storage device, wherein execution of the machine learning modeling code by the processing device causes the computing device to: receive, from a user, a machine learning task, a set of first tags related to the machine learning task, and a data set; receive, from the user, a feature generator comprising executable logic for generating a first plurality of features based on the data set, wherein the feature generator received, by the computing device, from the user is a model building block that defines one or more operations for selecting and processing data in the data set, and the first plurality of features is input to a first machine learning model pipeline to perform the machine learning task; receive, from the user, a tags library, the tags library structured as a node-relationship graph defining relations between the set of first tags; search, by the computing device, the tags library using a graph traversal algorithm to identify a second tag matching at least one first tag of the set of first tags; determine, based on the search of the tags library, whether a database including a machine learning model pipelines library stores the first machine learning model pipeline correlated in the database with the second tag matching the at least one first tag; upon determining that the database stores the first machine learning model pipeline: retrieve the first machine learning model pipeline that includes a first machine learning model; run the retrieved first machine learning model pipeline on the data set based on the first plurality of features; and respond to the machine learning task based on an output of the run of the retrieved first machine learning model pipeline.
  8. The computing device of claim 7, wherein the execution of the machine learning modeling code by the processing device further causes the computing device to: upon determining that the first machine learning model pipeline is not stored in the database, search the tags library using the graph traversal algorithm to identify a third tag related to the at least one first tag; determine whether the database stores a second machine learning model pipeline correlated with the third tag; and upon determining that the second machine learning model pipeline is stored in the database: retrieve the second machine learning model pipeline; run the feature generator and a second machine learning model in the retrieved second machine learning model pipeline for a plurality of iterations to generate a second plurality of features based on the data set; select one or more features of the second plurality of features; run a third machine learning model pipeline with the selected one or more features of the second plurality of features and the second machine learning model; respond to the machine learning task based on an output of the run of the third machine learning model pipeline; and store the third machine learning model pipeline in the database in correlation with the third tag.
  9. The computing device of claim 8, wherein the execution of the machine learning modeling code further causes the computing device to: upon determining that the second machine learning model pipeline is not stored in the database: run the feature generator and a third machine learning model for a plurality of iterations to generate a third plurality of features; select one or more features of the third plurality of features; run a fourth machine learning model pipeline with the selected one or more features of the third plurality of features and the third machine learning model; respond to the machine learning task based on an output of the run of the fourth machine learning model pipeline; and store the fourth machine learning model pipeline in the database in correlation with the at least one first tag.
  10. The computing device of claim 9, wherein the execution of the machine learning modeling code further causes the computing device to: run the feature generator and a plurality of machine learning models for a plurality of iterations to generate a fourth plurality of features, wherein the plurality of machine learning models is different from the first machine learning model, the second machine learning model, and the third machine learning model; select one or more features of the fourth plurality of features; select a fourth machine learning model of the plurality of machine learning models; run a fifth machine learning model pipeline with the selected one or more features of the fourth plurality of features and the selected fourth machine learning model; respond to the machine learning task based on an output of the run of the fifth machine learning model pipeline; and store the fifth machine learning model pipeline in the database in correlation with the at least one first tag.
  11. The computing device of claim 10, wherein the execution of the machine learning modeling code further causes the computing device to select the one or more features of the fourth plurality of features and the fourth machine learning model based on a performance metric.
  12. The computing device of claim 11, wherein the execution of the machine learning modeling code causes the computing device to, after responding to the machine learning task, continue to run the feature generator and the plurality of machine learning models for a plurality of iterations to improve a performance of the fifth machine learning model pipeline with respect to the performance metric.
  13. A non-transitory computer readable storage medium tangibly embodying a computer readable program code having computer readable instructions that, when executed, cause a computing device to: receive, from a user, a machine learning task, a set of first tags related to the machine learning task, and a data set; receive, from the user, a feature generator comprising executable logic for generating a first plurality of features based on the data set, wherein the feature generator received, by the computing device, from the user is a model building block that defines one or more operations for selecting and processing data in the data set, and the first plurality of features is input to a first machine learning model pipeline to perform the machine learning task; receive, from the user, a tags library, the tags library structured as a node-relationship graph defining relations between the set of first tags; search, by the computing device, the tags library using a graph traversal algorithm to identify a second tag matching at least one first tag of the set of first tags; determine, based on the search of the tags library, whether a database including a machine learning model pipelines library stores the first machine learning model pipeline correlated in the database with the second tag matching the at least one first tag; and upon determining that the database stores the first machine learning model pipeline: retrieve the first machine learning model pipeline that includes a first machine learning model; run the retrieved first machine learning model pipeline on the data set based on the first plurality of features; and respond to the machine learning task based on an output of the run of the retrieved first machine learning model pipeline.
  14. The non-transitory computer readable storage medium of claim 13, wherein the computer readable instructions, when executed, further cause the computing device to: upon determining that the first machine learning model pipeline is not stored in the database, search the tags library using the graph traversal algorithm to identify a third tag related to the at least one first tag; determine whether the database stores a second machine learning model pipeline correlated with the third tag; and upon determining that the second machine learning model pipeline is stored in the database: retrieve, by the computing device, the second machine learning model pipeline; run the feature generator and a second machine learning model in the retrieved second machine learning model pipeline for a plurality of iterations to generate a second plurality of features based on the data set; select one or more features of the second plurality of features; run a third machine learning model pipeline with the selected one or more features of the second plurality of features and the second machine learning model; respond to the machine learning task based on an output of the run of the third machine learning model pipeline; and store the third machine learning model pipeline in the database in correlation with the third tag.
  15. The non-transitory computer readable storage medium of claim 14, wherein the computer readable instructions, when executed, further cause the computing device to: upon determining that the second machine learning model pipeline is not stored in the database: run the feature generator and a third machine learning model for a plurality of iterations to generate a third plurality of features; select one or more features of the third plurality of features; run a fourth machine learning model pipeline with the selected one or more features of the third plurality of features and the third machine learning model; respond to the machine learning task based on an output of the run of the fourth machine learning model pipeline; and store the fourth machine learning model pipeline in the database in correlation with the at least one first tag.
  16. The non-transitory computer readable storage medium of claim 15, wherein the computer readable instructions, when executed, further cause the computing device to: run the feature generator and a plurality of machine learning models for a plurality of iterations to generate a fourth plurality of features, wherein the plurality of machine learning models is different from the first machine learning model, the second machine learning model, and the third machine learning model; select one or more features of the fourth plurality of features; select a fourth machine learning model of the plurality of machine learning models; run a fifth machine learning model pipeline with the selected one or more features of the fourth plurality of features and the selected fourth machine learning model; respond to the machine learning task based on an output of the run of the fifth machine learning model pipeline; and store the fifth machine learning model pipeline in the database in correlation with the at least one first tag.
  17. The non-transitory computer readable storage medium of claim 16, wherein the computer readable instructions, when executed, further cause the computing device to select the one or more features of the fourth plurality of features and the fourth machine learning model based on a performance metric.
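The claims describe a tags library structured as a node-relationship graph that is searched with a graph traversal algorithm to find a tag correlated with a stored pipeline. A minimal sketch using breadth-first search follows; the adjacency-map representation and the names `find_pipeline_tag`, `tags_library`, and `stored_tags` are illustrative assumptions, not taken from the patent:

```python
from collections import deque

# Illustrative sketch: the tags library is an adjacency map from a tag to its
# related tags; stored_tags is the set of tags correlated with pipelines in
# the database. BFS returns the nearest such tag, or None if none is reachable.

def find_pipeline_tag(tags_library, first_tags, stored_tags):
    """Breadth-first search from the user's first tags for the nearest
    tag that is correlated with a stored pipeline."""
    visited = set(first_tags)
    queue = deque(first_tags)
    while queue:
        tag = queue.popleft()
        if tag in stored_tags:
            return tag                  # matching or related tag with a pipeline
        for neighbor in tags_library.get(tag, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return None                         # fall through to building a new pipeline
```

For example, with `tags_library = {"wind-power": ["renewables"], "renewables": ["energy"]}` and a pipeline stored under `"energy"`, a task tagged `"wind-power"` resolves to `"energy"` after two hops.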

Description

BACKGROUND

Technical Field

The present disclosure generally relates to automated machine learning model pipelines, and more particularly, to the creation of automated machine learning model pipelines using tagged pipeline building blocks and the reuse of related pipelines.

Description of the Related Art

Automated machine learning ("AutoML") systems automate aspects of the process for generating a machine learning predictive model. Given a data set and a machine learning task, such as regression or classification, AutoML systems can automatically generate artificial intelligence model pipelines that define different aspects of feature engineering, data cleaning, model training/selection, etc., to optimally perform the task. The execution of an AutoML method generally involves a large amount of computational resources. Some applications, such as the Internet of Things, handle a large number of modeling tasks, such as time-series prediction in finance and energy. Generating an AutoML pipeline for each task independently can be expensive and very inefficient.

SUMMARY

According to various embodiments, a method, a computing device, and a non-transitory computer readable storage medium are provided for generating a machine learning model pipeline for a task, where the machine learning model pipeline includes a machine learning model and at least one feature. A machine learning task, a data set, and a set of first tags related to the task are received from a user. It is determined whether a database including a machine learning model pipelines library stores a first machine learning model pipeline correlated in the database with a second tag matching at least one first tag received from the user. Upon determining that the database stores the first machine learning model pipeline, the first machine learning model pipeline is retrieved, the retrieved first machine learning model pipeline is run, and the machine learning task is responded to based on an output of the running of the first machine learning model pipeline.

In another example, if it is determined that a first machine learning model pipeline is not stored in the database, the database is searched for a second machine learning model pipeline correlated with a third tag related to the first tag. If a second machine learning model pipeline is located, it is retrieved and used to create a machine learning model pipeline for responding to the request.

In another example, if neither a first machine learning model pipeline correlated with a first tag nor a second machine learning model pipeline correlated with a third, related tag is found in the database, a third machine learning model pipeline is created for responding to the task based on received feature generators and received machine learning models. The created third machine learning model pipeline is stored in the database, correlated with the tag(s) provided with the task.

These and other features will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings are of illustrative embodiments. They do not illustrate all embodiments. Other embodiments may be used in addition or instead. Details that may be apparent or unnecessary may be omitted to save space or for more effective illustration. Some embodiments may be practiced with additional components or steps and/or without all the components or steps that are illustrated. When the same numeral appears in different drawings, it refers to the same or like components or steps.

FIG. 1 is a block diagram of an example of a system for automatically generating machine learning modeling pipelines, in accordance with an embodiment.

FIG. 2 is a block diagram of an example of the generation of a machine learning model pipeline, in accordance with an embodiment.

FIG. 3 is a block diagram of an example of the generation of a machine learning model pipeline, in accordance with an embodiment.

FIG. 4 is a block diagram of an example of the generation of a machine learning model pipeline, in accordance with an embodiment.

FIG. 5 is a block diagram of an example of a tags library, in accordance with an embodiment.

FIG. 6 is a block diagram of an example of a portion of a pipelines library database, in accordance with an embodiment.

FIG. 7 is a block diagram of an example of the generation of a machine learning model pipeline, in accordance with an embodiment.

FIG. 8 is a flow chart of an example of a process for generating a machine learning model pipeline, in accordance with an embodiment.

FIG. 9 is a continuation of the flow chart of FIG. 8, in accordance with an embodiment.

FIG. 10 is a functional block diagram of an example of a computer hardware platform, in accordance with an embodiment.

FIG. 11 is an illus
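When no stored pipeline matches, the summary describes creating a new pipeline from received feature generators and machine learning models, with claims 5 and 11 adding selection based on a performance metric. A hedged sketch of that fallback path follows; `build_pipeline`, the iteration scheme, and the metric signature are illustrative assumptions, not details from the patent:

```python
# Illustrative sketch: run a user-supplied feature generator against candidate
# models for a number of iterations and keep the (features, model) combination
# that scores best on a performance metric (e.g. validation accuracy).

def build_pipeline(feature_generator, models, dataset, metric, iterations=3):
    """Build a new pipeline by metric-driven search over features and models."""
    best = None  # (score, features, model)
    for i in range(iterations):
        features = feature_generator(dataset, i)   # candidate feature set
        for model in models:
            score = metric(model, features)        # higher is better here
            if best is None or score > best[0]:
                best = (score, features, model)
    score, features, model = best
    # The selected pair forms the new pipeline, which the patent then stores
    # in the database correlated with the task's tag(s).
    return {"features": features, "model": model, "score": score}
```

Claims 6 and 12 extend this by continuing the loop after responding to the task, so the stored pipeline keeps improving with respect to the metric.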