US-12619910-B2 - Machine learning pipeline with visualizations
Abstract
A method may include obtaining a machine learning (ML) pipeline including a plurality of functional blocks within the ML pipeline. The method may also include using the ML pipeline as an input to a visualization predictor, where the visualization predictor may be trained to output one or more visualization commands based on relationships between the visualization commands and the functional blocks within the pipeline. The method may additionally include invoking the visualization commands to instantiate the ML pipeline with visualizations generated by the one or more visualization commands.
Inventors
- Lei Liu
- Wei-Peng Chen
Assignees
- FUJITSU LIMITED
Dates
- Publication Date: 2026-05-05
- Application Date: 2022-03-29
Claims (18)
- 1 . A method, comprising: obtaining a machine learning (ML) pipeline including a plurality of functional blocks within the ML pipeline; using the ML pipeline as an input to a visualization predictor, the visualization predictor trained to output one or more visualization commands based on relationships between the visualization commands and the functional blocks within the ML pipeline; instantiating the ML pipeline with the one or more visualization commands embedded within the ML pipeline; and generating the visualization predictor, generating the visualization predictor comprising: obtaining a plurality of training ML pipelines as a training dataset, each of the training ML pipelines including at least one visualization, determining first correlations between data features of precursor training datasets which are used to train the training ML pipelines and the at least one visualization, determining second correlations between code features of the training ML pipelines and the at least one visualization, and deriving a plurality of rules based on the first correlations and the second correlations, the rules providing a basis for predicting the visualization commands.
- 2 . The method of claim 1 , wherein deriving the plurality of rules includes applying association rule mining to the first correlations and the second correlations such that each of the rules includes a statement describing a relationship between one or more of the data features or the code features and a given visualization, and a confidence value of the relationship.
- 3 . The method of claim 2 , wherein the relationship includes a given code feature, the method further comprising: determining the given code feature occurs after the given visualization in the ML pipeline; and classifying an associated rule as explanatory in response to determining that the given code feature occurs after the given visualization and as exploratory if the given code feature occurs before the visualization.
- 4 . The method of claim 3 , wherein the given code feature has a relationship with a command to generate the given visualization.
- 5 . The method of claim 2 , further comprising discretizing a numerical feature of an association rule mining (ARM) training dataset to one of a limited number of buckets.
- 6 . The method of claim 2 , further comprising selecting a threshold number of rules with the confidence value below a threshold.
- 7 . The method of claim 1 , wherein the data features of the precursor training datasets include one or more meta-features of the precursor training datasets for one column of the precursor training datasets or one or more meta-features of the precursor training datasets for multiple columns of the precursor training datasets.
- 8 . The method of claim 1 , wherein using the ML pipeline as input to the visualization predictor comprises: extracting run time code features in the ML pipeline and run time dataset features in a run time training dataset associated with the ML pipeline; and mapping the run time code features and the run time dataset features to rules based on the relationships.
- 9 . The method of claim 1 , wherein a quantity of visualization commands is limited by a visualization constraint.
- 10 . One or more non-transitory computer-readable media containing instructions which, when executed by one or more processors, cause a system to perform operations, the operations comprising: obtaining a machine learning (ML) pipeline including a plurality of functional blocks within the ML pipeline; using the ML pipeline as an input to a visualization predictor, the visualization predictor trained to output one or more visualization commands based on relationships between the visualization commands and the functional blocks within the ML pipeline; instantiating the ML pipeline with the one or more visualization commands embedded within the ML pipeline; and generating the visualization predictor, generating the visualization predictor comprising: obtaining a plurality of training ML pipelines as a training dataset, each of the training ML pipelines including at least one visualization; determining first correlations between data features of precursor training datasets which are used to train the training ML pipelines and the at least one visualization; determining second correlations between code features of the training ML pipelines and the at least one visualization; and deriving a plurality of rules based on the first correlations and the second correlations, the rules providing a basis for predicting the visualization commands.
- 11 . The one or more non-transitory computer-readable media of claim 10 , wherein deriving the plurality of rules includes applying association rule mining to the first correlations and the second correlations such that each of the rules includes a statement describing a relationship between one or more of the data features or the code features and a given visualization, and a confidence value of the relationship.
- 12 . The one or more non-transitory computer-readable media of claim 11 , wherein the relationship includes a given code feature, the operations further comprise: determining the given code feature occurs after the given visualization in the ML pipeline; and classifying an associated rule as explanatory in response to determining that the given code feature occurs after the given visualization.
- 13 . The one or more non-transitory computer-readable media of claim 12 , wherein the given code feature has a relationship with a command to generate the given visualization.
- 14 . The one or more non-transitory computer-readable media of claim 11 , wherein the operations further comprise discretizing a numerical feature of an association rule mining (ARM) training dataset to one of a limited number of buckets.
- 15 . The one or more non-transitory computer-readable media of claim 11 , wherein the operations further comprise selecting a threshold number of rules with the confidence value below a threshold.
- 16 . The one or more non-transitory computer-readable media of claim 10 , wherein the data features of the precursor training datasets include one or more meta-features of the precursor training datasets for one column of the precursor training datasets or one or more meta-features of the precursor training datasets for multiple columns of the precursor training datasets.
- 17 . The one or more non-transitory computer-readable media of claim 10 , wherein using the ML pipeline as input to the visualization predictor comprises: extracting run time code features in the ML pipeline and run time dataset features in a run time training dataset associated with the ML pipeline; and mapping the run time code features and the run time dataset features to rules based on the relationships.
- 18 . The one or more non-transitory computer-readable media of claim 10 , wherein a quantity of visualization commands is limited by a visualization constraint.
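The rule-derivation steps recited in claims 1–6 (and mirrored in claims 10–15) can be illustrated with a minimal sketch. This is a hedged illustration, not the patented implementation: the feature names, the bucket scheme, and the co-occurrence counting are assumptions, and the confidence value is the standard association-rule-mining definition (co-occurrences of a feature item with a visualization, divided by occurrences of the item). The explanatory/exploratory split follows the position test in claim 3.

```python
from collections import defaultdict

def discretize(value, n_buckets=3, lo=0.0, hi=1.0):
    # Claim 5: map a numerical feature to one of a limited number of buckets.
    span = (hi - lo) / n_buckets
    return f"bucket_{min(int((value - lo) / span), n_buckets - 1)}"

def derive_rules(training_examples, min_confidence=0.6):
    # Each training example pairs the feature items extracted from one
    # training ML pipeline (code features and precursor-dataset features)
    # with the visualizations that pipeline contains.
    pair_counts = defaultdict(int)  # (feature item, visualization) co-occurrences
    item_counts = defaultdict(int)  # feature item occurrences
    for features, visualizations in training_examples:
        for item in features:
            item_counts[item] += 1
            for viz in visualizations:
                pair_counts[(item, viz)] += 1
    rules = []
    for (item, viz), n in pair_counts.items():
        confidence = n / item_counts[item]  # standard ARM confidence
        if confidence >= min_confidence:
            rules.append({"if": item, "then": viz, "confidence": confidence})
    return rules

def classify_rule(code_feature_pos, viz_pos):
    # Claim 3: a rule is "explanatory" if its code feature occurs after the
    # visualization in the pipeline, "exploratory" if it occurs before.
    return "explanatory" if code_feature_pos > viz_pos else "exploratory"

# Illustrative training examples (feature names are hypothetical).
examples = [
    ({("missing_ratio", discretize(0.9)), ("uses", "fillna")}, {"heatmap"}),
    ({("missing_ratio", discretize(0.8)), ("uses", "fillna")}, {"heatmap"}),
    ({("missing_ratio", discretize(0.1)), ("uses", "fit")}, {"scatter"}),
]
rules = derive_rules(examples)
```

In this toy run, the code feature `("uses", "fillna")` co-occurs with a heatmap in both pipelines that contain it, yielding a rule with confidence 1.0 — the kind of statement-plus-confidence pairing claim 2 describes.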
Description
FIELD

The embodiments discussed in the present disclosure are related to a machine learning pipeline with visualizations.

BACKGROUND

Machine learning (ML) generally employs ML models that are trained with training data to make predictions that automatically become more accurate with ongoing training. ML may be used in a wide variety of applications including, but not limited to, traffic prediction, web searching, online fraud detection, medical diagnosis, speech recognition, email filtering, image recognition, virtual personal assistants, and automatic translation.

The subject matter claimed in the present disclosure is not limited to embodiments that solve any disadvantages or that operate only in environments such as those described above. Rather, this background is only provided to illustrate one example technology area where some embodiments described in the present disclosure may be practiced.

SUMMARY

One or more embodiments of the present disclosure may include a method that includes obtaining a machine learning (ML) pipeline including a plurality of functional blocks within the ML pipeline. The method may also include using the ML pipeline as an input to a visualization predictor, where the visualization predictor may be trained to output one or more visualization commands based on relationships between the visualization commands and the functional blocks within the pipeline. The method may additionally include invoking the visualization commands to instantiate the ML pipeline with visualizations generated by the one or more visualization commands.

The objects and advantages of the embodiments will be realized and achieved at least by the elements, features, and combinations particularly pointed out in the claims. Both the foregoing general description and the following detailed description are given as examples and are explanatory and are not restrictive of the invention, as claimed.
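The method summarized above — feed a pipeline of functional blocks to a trained visualization predictor, then invoke the predicted commands to instantiate the pipeline with visualizations — can be sketched as follows. This is a minimal illustration under assumed data shapes: the pipeline is modeled as a list of block names, the predictor is reduced to a lookup table of block-to-command relationships, and the block names and plotting commands are hypothetical, not taken from the patent.

```python
# Assumed rule table standing in for the trained visualization predictor:
# it maps a functional block to a visualization command (illustrative only).
RULES = {
    "impute_missing": "df.isna().sum().plot.bar()",
    "train_model": "plot_confusion_matrix(model, X_test, y_test)",
}

def predict_visualizations(pipeline_blocks):
    # Output visualization commands based on relationships between the
    # commands and the functional blocks within the pipeline.
    return [RULES[b] for b in pipeline_blocks if b in RULES]

def instantiate_with_visualizations(pipeline_blocks):
    # Instantiate the pipeline with each predicted visualization command
    # embedded immediately after the functional block it relates to.
    lines = []
    for block in pipeline_blocks:
        lines.append(f"run({block!r})")
        if block in RULES:
            lines.append(RULES[block])
    return "\n".join(lines)

script = instantiate_with_visualizations(
    ["load_data", "impute_missing", "train_model"]
)
```

The resulting `script` interleaves the original functional blocks with the predicted visualization commands, which is one plausible reading of "instantiating the ML pipeline with the one or more visualization commands embedded within the ML pipeline."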
BRIEF DESCRIPTION OF THE DRAWINGS

Example embodiments will be described and explained with additional specificity and detail through the use of the accompanying drawings in which:

FIG. 1 is a diagram representing an example system for generating machine learning pipelines that include visualizations;
FIG. 2 illustrates an example environment for performing operations to prepare rules used in generating machine learning pipelines that include visualizations;
FIG. 3 is a flowchart of an example method of extracting coding features and data features from training machine learning pipelines;
FIG. 4 is a flowchart of an example method of generating a machine learning pipeline that includes visualizations;
FIG. 5 is a flowchart of an example method of deriving rules related to visualizations;
FIG. 6 is a flowchart of another example method of generating a machine learning pipeline that includes visualizations; and
FIG. 7 illustrates a block diagram of an example computing system.

DESCRIPTION OF EMBODIMENTS

Some embodiments described in the present disclosure relate to methods and systems of generating Machine Learning (ML) pipelines that include visualizations.

As ML has become increasingly common, there is often a scarcity of ML experts (e.g., skilled data scientists) available to implement new ML projects. Although various AutoML solutions (e.g., Auto-Sklearn, AutoPandas, etc.) have been proposed to resolve the ever-growing challenge of implementing new ML projects with a scarcity of ML experts, current AutoML solutions offer only simplistic and partial solutions that are insufficient to enable non-experts to fully implement new ML projects. Further, although open source software (OSS) databases of existing ML projects (e.g., Kaggle, GitHub, etc.) have also been proposed as another solution to the challenge of implementing new ML projects by non-experts, it may be difficult or impossible for a non-expert to find a potentially useful existing ML project in these databases.
Further, even if the non-expert should succeed in finding a potentially useful existing ML project in these databases, it may be difficult or impossible for the non-expert to modify the potentially useful existing ML project for the new requirements of a new ML project.

In the present disclosure, the term "ML project" may refer to a project that includes a dataset, an ML task defined on the dataset, and an ML pipeline (e.g., a script or program code with a series of functional blocks) that is configured to implement a sequence of operations to train an ML model, on the dataset, for the ML task and use the ML model for new predictions. In the present disclosure, reference to "functional blocks" may refer to operations that may be performed by the ML pipelines, in which a particular functional block may correspond to a particular type of functionality. Further, each functional block may be instantiated in its corresponding ML pipeline with a particular code snippet configured to cause execution of the functionality of the corresponding functional block. In many insta