US-12619907-B2 - Automated model lineage inference
Abstract
A computer-implemented method, a computer program product, and a computer system for automated model lineage inference. A computer system identifies training datasets which are used to train a machine learning model. A computer system identifies parent datasets from which the training datasets are derived. A computer system identifies associated feature transformations when the training datasets are derived from the parent datasets.
Inventors
- Rajmohan Chandrahasan
- Kriti Rajput
- Nitin Gupta
- Himanshu Gupta
- Sameep Mehta
- Emma Rose Tucker
- Manish Anand Bhide
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date: 2026-05-05
- Application Date: 2022-03-11
Claims (14)
- 1. A computer-implemented method for automated model lineage inference, the method comprising: retrieving metadata of datasets and metadata of a machine learning model on an artificial intelligence platform; analyzing the metadata of the datasets and the metadata of the machine learning model; identifying training dataset candidates for training the machine learning model, based on analysis of the metadata of the datasets and the metadata of the machine learning model; determining training datasets from the training dataset candidates, through analysis of the training dataset candidates and the machine learning model; ranking the training datasets, according to confidence with respect to the training datasets being used for training the machine learning model; publishing lineage inference information of the training datasets in a lineage store in the artificial intelligence platform; retrieving metadata of the training datasets on the artificial intelligence platform; analyzing the metadata of the training datasets; identifying parent dataset candidates for deriving the training datasets, based on analysis of the metadata of the training datasets; determining parent datasets from the parent dataset candidates, through analysis of the parent dataset candidates and the training dataset; ranking the parent datasets, according to confidence with respect to the parent datasets being used for deriving the training datasets; and publishing lineage inference information of the parent datasets in the lineage store in the artificial intelligence platform.
- 2. The computer-implemented method of claim 1, further comprising: verifying, with an owner of the machine learning model, whether the training datasets are used for training the machine learning model.
- 3. The computer-implemented method of claim 1, further comprising: deriving relationships and constraints between columns of the parent datasets and columns of the training datasets; identifying feature transformations that are applied when the training datasets are derived from the parent datasets, based on the relationships and constraints; and publishing lineage inference information of the feature transformations in the lineage store in the artificial intelligence platform.
- 4. The computer-implemented method of claim 1, further comprising: comparing schemas of the datasets and a schema of the machine learning model.
- 5. The computer-implemented method of claim 1, further comprising: comparing similarity between column names in the machine learning model and column names in the datasets.
- 6. The computer-implemented method of claim 1, further comprising: comparing names and descriptions of the datasets with a name and a description of the machine learning model.
- 7. The computer-implemented method of claim 1, further comprising: for determining the training datasets from the training dataset candidates, applying one of a neural network, a decision tree, and a support vector machine.
- 8. The method of claim 1, wherein determining training datasets from the training dataset candidates comprises: training the machine learning model using each of the training dataset candidates; and determining the training datasets based on comparing weights of the machine learning model prior to the training using a given training dataset candidate to the weights of the model after the training using the given training dataset candidate, wherein the training datasets are associated with a smaller change in weights of the machine learning model relative to other datasets from the training dataset candidates.
- 9. A computer-implemented method for automated model lineage inference, the method comprising: retrieving metadata of datasets including training datasets for training a machine learning model on an artificial intelligence platform; analyzing the metadata of the datasets including the training datasets; identifying, from the datasets, parent dataset candidates for deriving the training datasets, based on analysis of the metadata of the datasets including the training datasets; determining parent datasets from the parent dataset candidates, through analysis of the parent dataset candidates and the training datasets; ranking the parent datasets, according to confidence with respect to the parent datasets being used for deriving the training datasets; publishing lineage inference information of the parent datasets in a lineage store in the artificial intelligence platform; retrieving the metadata of the datasets and metadata of the machine learning model on the artificial intelligence platform; analyzing the metadata of the datasets and the metadata of the machine learning model; identifying training dataset candidates for training the machine learning model, based on analysis of the metadata of the datasets and the metadata of the machine learning model; determining the training datasets from the training dataset candidates, through analysis of the training dataset candidates and the machine learning model; ranking the training datasets, according to confidence with respect to the training datasets being used for training the machine learning model; and publishing lineage inference information of the training datasets in the lineage store in the artificial intelligence platform.
- 10. The computer-implemented method of claim 9, further comprising: verifying, with an owner of the machine learning model, whether the parent datasets are used for deriving the training datasets.
- 11. The computer-implemented method of claim 9, further comprising: deriving relationships and constraints between columns of the parent datasets and columns of the training datasets; identifying feature transformations that are applied when the training datasets are derived from the parent datasets, based on the relationships and constraints; and publishing lineage inference information of the feature transformations in the lineage store in the artificial intelligence platform.
- 12. A computer-implemented method for automated model lineage inference, the method comprising: deriving relationships and constraints between columns of parent datasets and columns of training datasets on an artificial intelligence platform, wherein the training datasets are used for training a machine learning model on the artificial intelligence platform and the parent datasets are used for deriving the training datasets; identifying feature transformations that are applied when the training datasets are derived from the parent datasets, based on the relationships and constraints; publishing lineage inference information of the feature transformations in a lineage store on the artificial intelligence platform; retrieving metadata of the datasets including the training datasets on the artificial intelligence platform; analyzing the metadata of the datasets including the training datasets; identifying parent dataset candidates for deriving the training datasets, based on analysis of the metadata of the datasets including the training datasets; determining the parent datasets from the parent dataset candidates, through analysis of the parent dataset candidates and the training dataset; ranking the parent datasets, according to confidence with respect to the parent datasets being used for deriving the training datasets; and publishing lineage inference information of the parent datasets in the lineage store in the artificial intelligence platform.
- 13. The computer-implemented method of claim 12, further comprising: verifying, with an owner of the machine learning model, whether the training datasets are derived from the parent datasets.
- 14. The computer-implemented method of claim 12, further comprising: retrieving metadata of datasets and metadata of the machine learning model on the artificial intelligence platform; analyzing the metadata of the datasets and the metadata of the machine learning model; identifying training dataset candidates for training the machine learning model, based on analysis of the metadata of the datasets and the metadata of the machine learning model; determining the training datasets from the training dataset candidates, through analysis of the training dataset candidates and the machine learning model; ranking the training datasets, according to confidence with respect to the training datasets being used for training the machine learning model; and publishing lineage inference information of the training datasets in the lineage store in the artificial intelligence platform.
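Claim 8 selects training datasets by how little the model's weights change when the model is trained further on each candidate. The following is a minimal Python sketch of that idea, assuming a two-parameter linear model and small in-memory datasets. The function names (fine_tune, weight_change, rank_candidates), the dataset names, the learning rate, and the step count are hypothetical and are not taken from the patent.

```python
import math


def fine_tune(weights, rows, lr=0.01, steps=50):
    """Run a few full-batch gradient-descent steps on mean squared error.

    weights is [slope, intercept] of a toy model y = slope * x + intercept.
    lr and steps are illustrative values, not values from the patent.
    """
    slope, intercept = weights
    n = len(rows)
    for _ in range(steps):
        errors = [(slope * x + intercept - y, x) for x, y in rows]
        grad_slope = sum(2 * err * x for err, x in errors) / n
        grad_intercept = sum(2 * err for err, _ in errors) / n
        slope -= lr * grad_slope
        intercept -= lr * grad_intercept
    return [slope, intercept]


def weight_change(before, after):
    """Euclidean distance between two weight vectors."""
    return math.sqrt(sum((b - a) ** 2 for b, a in zip(before, after)))


def rank_candidates(weights, candidates):
    """Return (name, change) pairs, smallest weight change first."""
    scored = []
    for name, rows in candidates.items():
        after = fine_tune(list(weights), rows)
        scored.append((name, weight_change(weights, after)))
    return sorted(scored, key=lambda item: item[1])


# Hypothetical setup: the deployed model is assumed to have learned y = 2x + 1.
model_weights = [2.0, 1.0]
xs = range(5)
candidates = {
    "orders_clean": [(x, 2 * x + 1) for x in xs],    # consistent with the model
    "orders_shifted": [(x, 2 * x + 4) for x in xs],  # same slope, different offset
    "returns": [(x, -3 * x + 5) for x in xs],        # unrelated relationship
}

ranked = rank_candidates(model_weights, candidates)
for name, change in ranked:
    print(f"{name}: weight change {change:.4f}")
```

The intuition is that a model already fitted to a dataset sits near a minimum of the loss on that dataset, so further training on it moves the weights very little, while training on an unrelated dataset pulls the weights away.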
Description
BACKGROUND
The present invention relates generally to machine learning model lineage, and more particularly to automated model lineage inference by exploiting model and dataset metadata.
Artificial intelligence (AI) and machine learning (ML) adoption is on the rise in various industries. Often, AI/ML systems are built to serve functionality first, and the associated lineage data is not tracked. It is critical to keep track of various events in the lifecycle of AI/ML models. This objective is served by a lineage service. The lineage service keeps track of various events in the model lifecycle. Various services (such as IBM Watson Machine Learning and IBM Watson knowledge catalog) can push lineage information to the lineage service. The lineage service then persists this information. Lineage services also provide options for manually ingesting lineage information. A user can manually raise lineage events on the lineage service. The model lineage may cover various events, such as model training and creation, model deployment, model promotion, model version change, model quality, feature transformations on training data, etc.
SUMMARY
In one aspect, a computer-implemented method for automated model lineage inference is provided. The computer-implemented method includes retrieving metadata of datasets and metadata of a machine learning model on an artificial intelligence platform. The computer-implemented method further includes analyzing the metadata of the datasets and the metadata of the machine learning model. The computer-implemented method further includes identifying training dataset candidates for training the machine learning model, based on analysis of the metadata of the datasets and the metadata of the machine learning model. The computer-implemented method further includes determining training datasets from the training dataset candidates, through analysis of the training dataset candidates and the machine learning model.
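The candidate-identification step just described, in which dataset metadata is compared with model metadata, could be sketched as follows. This is a rough Python illustration only: it assumes metadata is available as plain dictionaries with "name" and "columns" keys, and it uses difflib string similarity as a stand-in for whatever comparison the platform actually performs. The function names, the 0.7/0.3 weighting, and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] using difflib's ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def column_match_fraction(model_columns, dataset_columns, min_ratio=0.8):
    """Fraction of model input columns that have a close name match in the dataset."""
    if not model_columns or not dataset_columns:
        return 0.0
    matched = sum(
        1
        for col in model_columns
        if max(similarity(col, other) for other in dataset_columns) >= min_ratio
    )
    return matched / len(model_columns)


def score_dataset(model_meta, dataset_meta):
    """Combine column-name overlap and name similarity (weights are assumptions)."""
    column_score = column_match_fraction(model_meta["columns"], dataset_meta["columns"])
    name_score = similarity(model_meta["name"], dataset_meta["name"])
    return 0.7 * column_score + 0.3 * name_score


def identify_candidates(model_meta, datasets, threshold=0.5):
    """Keep datasets scoring at or above the threshold, highest score first."""
    scored = [(d["name"], score_dataset(model_meta, d)) for d in datasets]
    kept = [item for item in scored if item[1] >= threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical metadata records; none of these names come from the patent.
model_meta = {
    "name": "churn_classifier",
    "columns": ["age", "tenure_months", "monthly_charges"],
}
datasets = [
    {"name": "churn_training_v2",
     "columns": ["age", "tenure_months", "monthly_charges", "churned"]},
    {"name": "customer_master", "columns": ["customer_id", "age", "region"]},
    {"name": "web_logs", "columns": ["ip", "url", "timestamp"]},
]

training_candidates = identify_candidates(model_meta, datasets)
for name, score in training_candidates:
    print(f"{name}: score {score:.2f}")
```

Scoring on metadata alone keeps this step cheap, since no dataset contents need to be read until the later determination step narrows the candidates.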
The computer-implemented method further includes ranking the training datasets, according to confidence with respect to the training datasets being used for training the machine learning model. The computer-implemented method further includes publishing lineage inference information of the training datasets in a lineage store in the artificial intelligence platform.
In another aspect, a computer-implemented method for automated model lineage inference is provided. The computer-implemented method includes retrieving metadata of datasets including training datasets for training a machine learning model on an artificial intelligence platform. The computer-implemented method further includes analyzing the metadata of the datasets including the training datasets. The computer-implemented method further includes identifying, from the datasets, parent dataset candidates for deriving the training datasets, based on analysis of the metadata of the datasets including the training datasets. The computer-implemented method further includes determining parent datasets from the parent dataset candidates, through analysis of the parent dataset candidates and the training datasets. The computer-implemented method further includes ranking the parent datasets, according to confidence with respect to the parent datasets being used for deriving the training datasets. The computer-implemented method further includes publishing lineage inference information of the parent datasets in a lineage store in the artificial intelligence platform.
In yet another aspect, a computer-implemented method for automated model lineage inference is provided.
The computer-implemented method includes deriving relationships and constraints between columns of parent datasets and columns of training datasets on an artificial intelligence platform, wherein the training datasets are used for training a machine learning model on the artificial intelligence platform and the parent datasets are used for deriving the training datasets. The computer-implemented method further includes identifying feature transformations that are applied when the training datasets are derived from the parent datasets, based on the relationships and constraints. The computer-implemented method further includes publishing lineage inference information of the feature transformations in a lineage store on the artificial intelligence platform.
BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS
FIG. 1 is a flowchart showing operational steps of automated model lineage inference for identifying training datasets which are used to train a machine learning model, in accordance with one embodiment of the present invention.
FIG. 2 is a flowchart showing operational steps of automated model lineage inference for identifying parent datasets from which training datasets are derived, in accordance with one embodiment of the present invention.
FIG. 3 is a flowchart showing operational steps of automated model lineage inference for identifying associated feature transformations when tr