US-12619906-B2 - Interactive machine learning optimization

Abstract

Methods, computer program products, and systems are presented. The method, computer program products, and systems can include, for instance: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise; training a set of predictive models using data of the one or more synthetic dataset to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence on result data resulting from the testing, the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models.
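
The train, test, and report steps summarized above can be illustrated with a minimal sketch. It is not the patented implementation: the three forecasters, the mean-absolute-error metric, and all function names are assumptions chosen for brevity, and the "synthetic dataset" is simply a list of numbers. The sketch holds out the tail of a series, fits each stand-in model on the remainder, and returns a ranked listing of the kind that could back a developer-facing prompt.

```python
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

# A forecaster takes training data and a horizon and returns predictions.
# These three are hypothetical stand-ins for a "set of predictive models".
Forecaster = Callable[[Sequence[float], int], List[float]]


def mean_model(train: Sequence[float], horizon: int) -> List[float]:
    """Predict the training mean for every future step."""
    return [statistics.fmean(train)] * horizon


def last_value_model(train: Sequence[float], horizon: int) -> List[float]:
    """Predict the last observed value for every future step."""
    return [train[-1]] * horizon


def drift_model(train: Sequence[float], horizon: int) -> List[float]:
    """Extrapolate the average step between the first and last observation."""
    step = (train[-1] - train[0]) / (len(train) - 1)
    return [train[-1] + step * (h + 1) for h in range(horizon)]


def mean_absolute_error(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Average absolute difference between holdout values and predictions."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


def rank_models(
    series: Sequence[float],
    models: Dict[str, Forecaster],
    holdout_size: int,
) -> List[Tuple[str, float]]:
    """Fit each model on the head of the series, score it on the holdout tail,
    and return (model name, error) pairs sorted from best to worst."""
    if not 0 < holdout_size <= len(series) - 2:
        raise ValueError("holdout_size must leave at least two training points")
    train, holdout = series[:-holdout_size], series[-holdout_size:]
    results = [
        (name, mean_absolute_error(holdout, model(train, holdout_size)))
        for name, model in models.items()
    ]
    return sorted(results, key=lambda item: item[1])
```

Calling `rank_models` on a steadily increasing series with these three models would place `drift_model` first, since it is the only one of the three that extrapolates a trend.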

Inventors

Dhavalkumar C. Patel
Si Er Han
Jiang Bo Kang

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 2026-05-05
Application Date: 2021-03-29

Claims (20)

1. A computer implemented method comprising: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise, wherein the examining the enterprise dataset includes subjecting the enterprise dataset to processing for extraction of enterprise dataset characterizing parameter values and comparing the enterprise dataset characterizing parameter values to respective sets of synthetic dataset characterizing parameter values stored in a data repository, wherein the respective sets of synthetic dataset characterizing parameter values characterize respective synthetic datasets stored in the data repository, wherein the selecting one or more synthetic dataset in dependence on the examining includes identifying from the comparing at least one synthetic dataset stored in the data repository having a threshold satisfying similarity with the enterprise dataset and identifying from the respective synthetic datasets stored in the data repository a highest ranked synthetic dataset having a greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository, and wherein the selecting one or more synthetic dataset in dependence on the examining includes performing the selecting so that the selected one or more synthetic dataset have the threshold satisfying similarity with the enterprise dataset and include the highest ranked synthetic dataset having the greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository; training a set of predictive models using data of the selected one or more synthetic dataset having the threshold satisfying similarity with the enterprise dataset and including the highest ranked synthetic dataset having the greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence on result data resulting from the testing, the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models, wherein the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models includes prompting data prompting the developer user to select, in dependence on data of the result data, a certain synthetic dataset for training and testing a certain predictive model of the set of predictive models, wherein the user interface guides the developer user by presenting a recommended datasets text area displaying recommended synthetic datasets including a highlighted dataset indicator distinguishing a recommended dataset from other candidate datasets, and restricts the developer user from selecting an unrecommended dataset for a predictive model such that, during a subsequent iteration, different predictive models are trained on differentiated synthetic datasets.
2. The method of claim 1, wherein testing the set of trained predictive models includes generating a ranked listing of model specific performance metrics, and wherein the user interface presents a model identifier text area that displays the ranked listing to visually correlate each predictive model with its respective performance metric.
3. The method of claim 1, wherein the user interface presents a results visualization area that plots predicted signals produced by multiple predictive models alongside a ground truth signal to enable qualitative comparison of model behavior over a shared time axis.
4. The method of claim 1, wherein the user interface presents a model attributes text area that displays attributes including model type, capacity, and training iteration count, and dynamically updates the displayed attributes in dependence on testing result data.
5. The method of claim 1, wherein the presenting prompting data includes displaying classifications of candidate synthetic datasets by attributes including seasonality, trend, variance, outliers, or level shift, and wherein the classifications are generated by processing metadata of the respective synthetic datasets stored in the data repository.
6. The method of claim 1, wherein the presenting prompting data includes displaying integrated visual guidance that correlates model specific error signals with dataset selection information derived from the testing, enabling the developer user to interpret relationships between error behavior and training data characteristics.
7. The method of claim 1, wherein the presenting prompting data includes querying a machine learning trained model that is trained to predict a next dataset for training a predictive model of the set of predictive models, wherein the trained model is configured using crowdsourced developer action data identifying historical developer selections of synthetic datasets.
8. A system comprising: a memory; at least one processor in communication with the memory; and program instructions executable by one or more processor via the memory to perform a method comprising: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise, wherein the examining the enterprise dataset includes subjecting the enterprise dataset to processing for extraction of enterprise dataset characterizing parameter values and comparing the enterprise dataset characterizing parameter values to respective sets of synthetic dataset characterizing parameter values stored in a data repository, wherein the respective sets of synthetic dataset characterizing parameter values characterize respective synthetic datasets stored in the data repository, wherein the selecting one or more synthetic dataset in dependence on the examining includes identifying from the comparing at least one synthetic dataset stored in the data repository having a threshold satisfying similarity with the enterprise dataset and identifying from the respective synthetic datasets stored in the data repository a highest ranked synthetic dataset having a greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository, and wherein the selecting one or more synthetic dataset in dependence on the examining includes performing the selecting so that the selected one or more synthetic dataset have the threshold satisfying similarity with the enterprise dataset and include the highest ranked synthetic dataset having the greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository; training a set of predictive models using data of the selected one or more synthetic dataset having the threshold satisfying similarity with the enterprise dataset and including the highest ranked synthetic dataset having the greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence on result data resulting from the testing, the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models, wherein the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models includes prompting data prompting the developer user to select, in dependence on data of the result data, a certain synthetic dataset for training and testing a certain predictive model of the set of predictive models, wherein the user interface guides the developer user by presenting a recommended datasets text area displaying recommended synthetic datasets including a highlighted dataset indicator distinguishing a recommended dataset from other candidate datasets, and restricts the developer user from selecting an unrecommended dataset for a predictive model such that, during a subsequent iteration, different predictive models are trained on differentiated synthetic datasets.
9. The system of claim 8, wherein testing the set of trained predictive models includes generating a ranked listing of model specific performance metrics, and wherein the user interface presents a model identifier text area that displays the ranked listing to visually correlate each predictive model with its respective performance metric.
10. The system of claim 8, wherein the user interface presents a results visualization area that plots predicted signals produced by multiple predictive models alongside a ground truth signal to enable qualitative comparison of model behavior over a shared time axis.
11. The system of claim 8, wherein the user interface presents a model attributes text area that displays attributes including model type, capacity, and training iteration count, and dynamically updates the displayed attributes in dependence on testing result data.
12. The system of claim 8, wherein the presenting prompting data includes displaying classifications of candidate synthetic datasets by attributes including seasonality, trend, variance, outliers, or level shift, and wherein the classifications are generated by processing metadata of the respective synthetic datasets stored in the data repository.
13. The system of claim 8, wherein the presenting prompting data includes displaying integrated visual guidance that correlates model specific error signals with dataset selection information derived from the testing, enabling the developer user to interpret relationships between error behavior and training data characteristics.
14. The system of claim 8, wherein the presenting prompting data includes querying a machine learning trained model that is trained to predict a next dataset for training a predictive model of the set of predictive models, wherein the trained model is configured using crowdsourced developer action data identifying historical developer selections of synthetic datasets.
15. A computer program product comprising: a computer readable storage medium readable by one or more processing circuit and storing instructions for execution by one or more processor for performing a method comprising: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise, wherein the examining the enterprise dataset includes subjecting the enterprise dataset to processing for extraction of enterprise dataset characterizing parameter values and comparing the enterprise dataset characterizing parameter values to respective sets of synthetic dataset characterizing parameter values stored in a data repository, wherein the respective sets of synthetic dataset characterizing parameter values characterize respective synthetic datasets stored in the data repository, wherein the selecting one or more synthetic dataset in dependence on the examining includes identifying from the comparing at least one synthetic dataset stored in the data repository having a threshold satisfying similarity with the enterprise dataset and identifying from the respective synthetic datasets stored in the data repository a highest ranked synthetic dataset having a greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository, and wherein the selecting one or more synthetic dataset in dependence on the examining includes performing the selecting so that the selected one or more synthetic dataset have the threshold satisfying similarity with the enterprise dataset and include the highest ranked synthetic dataset having the greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository; training a set of predictive models using data of the selected one or more synthetic dataset having the threshold satisfying similarity with the enterprise dataset and including the highest ranked synthetic dataset having the greatest similarity to the enterprise dataset amongst the respective synthetic datasets stored in the data repository to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence on result data resulting from the testing, the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models, wherein the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models includes prompting data prompting the developer user to select, in dependence on data of the result data, a certain synthetic dataset for training and testing a certain predictive model of the set of predictive models, wherein the user interface guides the developer user by presenting a recommended datasets text area displaying recommended synthetic datasets including a highlighted dataset indicator distinguishing a recommended dataset from other candidate datasets, and restricts the developer user from selecting an unrecommended dataset for a predictive model such that, during a subsequent iteration, different predictive models are trained on differentiated synthetic datasets.
16. The computer program product of claim 15, wherein testing the set of trained predictive models includes generating a ranked listing of model specific performance metrics, and wherein the user interface presents a model identifier text area that displays the ranked listing to visually correlate each predictive model with its respective performance metric.
17. The computer program product of claim 15, wherein the user interface presents a results visualization area that plots predicted signals produced by multiple predictive models alongside a ground truth signal to enable qualitative comparison of model behavior over a shared time axis.
18. The computer program product of claim 15, wherein the user interface presents a model attributes text area that displays attributes including model type, capacity, and training iteration count, and dynamically updates the displayed attributes in dependence on testing result data.
19. The computer program product of claim 15, wherein the presenting prompting data includes displaying integrated visual guidance that correlates model specific error signals with dataset selection information derived from the testing, enabling the developer user to interpret relationships between error behavior and training data characteristics.
20. The computer program product of claim 15, wherein the presenting prompting data includes querying a machine learning trained model that is trained to predict a next dataset for training a predictive model of the set of predictive models, wherein the trained model is configured using crowdsourced developer action data identifying historical developer selections of synthetic datasets.
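
The dataset selection recited in the independent claims (extract characterizing parameter values, compare them against stored values for each synthetic dataset, keep the datasets that satisfy a similarity threshold, and include the highest ranked one) can be sketched as follows. This is an illustration under stated assumptions, not the claimed implementation: the three parameters (mean, variance, linear trend), the distance-based similarity score, and the names `characterize`, `similarity`, and `select_synthetic` are all hypothetical, and the data repository is modeled as a plain dictionary.

```python
import math
import statistics
from typing import Dict, List, Sequence

Params = Dict[str, float]


def characterize(series: Sequence[float]) -> Params:
    """Extract illustrative characterizing parameter values from a series."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    mean = statistics.fmean(series)
    variance = statistics.pvariance(series)
    # Least-squares slope against the sample index, used as a crude trend measure.
    x_mean = (n - 1) / 2
    denom = sum((i - x_mean) ** 2 for i in range(n))
    slope = sum((i - x_mean) * (y - mean) for i, y in enumerate(series)) / denom
    return {"mean": mean, "variance": variance, "trend": slope}


def similarity(a: Params, b: Params) -> float:
    """Map the Euclidean distance between parameter vectors into (0, 1].
    Identical parameter values give 1.0. (Assumed scoring rule.)"""
    dist = math.sqrt(sum((a[key] - b[key]) ** 2 for key in a))
    return 1.0 / (1.0 + dist)


def select_synthetic(
    enterprise_params: Params,
    repository: Dict[str, Params],
    threshold: float,
) -> List[str]:
    """Return names of synthetic datasets whose similarity meets the threshold,
    ordered from most to least similar. When the list is non-empty, its first
    entry is the highest ranked dataset in the repository."""
    scored = sorted(
        ((similarity(enterprise_params, params), name)
         for name, params in repository.items()),
        reverse=True,
    )
    return [name for score, name in scored if score >= threshold]
```

Because the result is ordered by similarity, any non-empty selection both satisfies the threshold and includes the highest ranked dataset. In practice the parameters would likely need scaling before a distance is taken, since an unscaled mean or variance can dominate the score.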

Description

BACKGROUND

Embodiments herein relate generally to the field of machine learning, and more particularly to interactive machine learning optimization.

Many information handling systems include a graphical user interface (GUI) with which a user communicates with the system. A GUI includes the use of graphic symbols or pictures, rather than just words, to represent objects or elements in the system. Program code is associated with a graphic symbol in order to allow the graphic symbol to possess certain desired behaviors. A graphic symbol, along with its associated program code, make up a GUI control element. Programs that include a GUI typically render on a display screen many graphics including graphical symbols, which can be utilized by a user to communicate with the program and/or control events in the system. To obtain the necessary user input, the program may render a selection graphical symbol on the screen. The user can make an appropriate selection by touching in the case of a touch sensitive GUI and/or with use of a pointer controller.

Data structures have been employed for improving the operation of computer systems. A data structure refers to an organization of data in a computer environment for improved computer system operation. Data structure types include containers, lists, stacks, queues, tables, and graphs. Data structures have been employed for improved computer system operation, e.g., in terms of algorithm efficiency, memory usage efficiency, maintainability, and reliability.

Artificial intelligence (AI) refers to intelligence exhibited by machines. AI research includes search and mathematical optimization, neural networks, and probability. AI solutions involve features derived from research in a variety of different science and technology disciplines ranging from computer science, mathematics, psychology, linguistics, statistics, and neuroscience. Machine learning has been described as the field of study that gives computers the ability to learn without being explicitly programmed.

SUMMARY

Shortcomings of the prior art are overcome, and additional advantages are provided, through the provision, in one aspect, of a method. The method can include, for example: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise; training a set of predictive models using data of the one or more synthetic dataset to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence on result data resulting from the testing, the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models.

In another aspect, a computer program product can be provided. The computer program product can include a computer readable storage medium readable by one or more processing circuit and storing instructions for execution by one or more processor for performing a method. The method can include, for example: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise; training a set of predictive models using data of the one or more synthetic dataset to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence on result data resulting from the testing, the prompting data prompting the developer user to direct action with respect to one or more model of the set of predictive models.

In a further aspect, a system can be provided. The system can include, for example a memory. In addition, the system can include one or more processor in communication with the memory. Further, the system can include program instructions executable by the one or more processor via the memory to perform a method. The method can include, for example: examining an enterprise dataset, the enterprise dataset defined by enterprise collected data; selecting one or more synthetic dataset in dependence on the examining, the one or more synthetic dataset including data other than data collected by the enterprise; training a set of predictive models using data of the one or more synthetic dataset to provide a set of trained predictive models; testing the set of trained predictive models with use of holdout data of the one or more synthetic dataset; and presenting prompting data on a displayed user interface of a developer user in dependence