US-12619915-B2 - Machine learning pipeline automation

US 12619915 B2

Abstract

The described technology is generally directed towards automated development of machine learning pipelines. An automated framework can extract topics from a data science workspace such as a machine learning notebook, transform and annotate cells of the machine learning notebook to various machine learning pipeline stages, and orchestrate the machine learning pipeline stages in a workflow that can be deployed into production data infrastructures.

Inventors

  • Leandro Lopes
  • Francisco Garcia Montemayor
  • Thiagarajan Ramakrishnan
  • Robert Mujica

Assignees

  • DELL PRODUCTS, L.P.

Dates

Publication Date
2026-05-05
Application Date
2022-11-22

Claims (20)

  1. A method, comprising: iteratively retraining, by equipment comprising at least one processor, a pipeline machine learning model to generate machine learning pipelines from machine learning notebooks, wherein the retraining comprises iteratively, for each machine learning notebook data structure of a group of machine learning notebook data structures: labeling cells of the machine learning notebook data structure, resulting in labeled cells, wherein the labeling comprises, for each cell of the cells: parsing respective lines of code within the cell, identifying methods in the respective lines of code, and classifying the methods respectively as a library-specific method or a user created method, and labeling the cell with a label based on library-specific methods within the cell, and not based on any user created methods within the cell; generating a machine learning pipeline using the labeled cells, wherein the generating comprises assigning the labeled cells to machine learning pipeline stages of the machine learning pipeline based on the labels applied to the labeled cells; testing the machine learning pipeline, resulting in a success of the machine learning pipeline or a failure of the machine learning pipeline; and retraining the pipeline machine learning model to generate the machine learning pipelines from the machine learning notebooks based on the success of the machine learning pipeline or the failure of the machine learning pipeline.
  2. The method of claim 1, wherein testing the machine learning pipeline results in the failure of the machine learning pipeline, and further comprising, in response to the failure of the machine learning pipeline: re-assigning, by the equipment, a labeled cell of the labeled cells to a different machine learning pipeline stage of the machine learning pipeline stages, resulting in a reconfigured machine learning pipeline; and testing, by the equipment, the reconfigured machine learning pipeline, resulting in a success of the reconfigured machine learning pipeline or a failure of the reconfigured machine learning pipeline.
  3. The method of claim 2, wherein the re-assigning of the labeled cell is based on an error log generated by the equipment in connection with testing the machine learning pipeline.
  4. The method of claim 1, wherein testing the machine learning pipeline results in the success of the machine learning pipeline, and further results in performance data representative of a performance measurement associated with the machine learning pipeline.
  5. The method of claim 4, wherein the performance measurement comprises a time measurement indicative of an execution time for the machine learning pipeline to execute.
  6. The method of claim 4, wherein the machine learning pipeline is a first machine learning pipeline and the performance data representative of the performance measurement is first performance data representative of a first performance measurement, and further comprising: comparing, by the equipment, the first performance measurement with a second performance measurement associated with a second machine learning pipeline; and selecting, by the equipment, based on a result of the comparing, the first machine learning pipeline or the second machine learning pipeline for deployment in connection with the machine learning notebook data structure.
  7. The method of claim 1, further comprising storing, by the equipment, the success of the machine learning pipeline or the failure of the machine learning pipeline in a knowledge data store, wherein the knowledge data store is usable in connection with future assignments of future labeled cells to future machine learning pipeline stages.
  8. The method of claim 1, further comprising: collecting, by the equipment, different machine learning notebook data structures and different machine learning pipeline data indicating different machine learning pipeline stages associated with the different machine learning notebook data structures; and aggregating, by the equipment, the different machine learning notebook data structures and the different machine learning pipeline data to generate a knowledge data store that is usable in connection with assigning the labeled cells to the machine learning pipeline stages.
  9. The method of claim 8, further comprising: filtering, by the equipment, at least some of the different machine learning notebook data structures, resulting in filtered machine learning notebook data structures, wherein the filtered machine learning notebook data structures are used to generate the knowledge data store.
  10. The method of claim 1, wherein: assigning the labeled cells to the machine learning pipeline stages comprises assigning the labeled cells to multiple different groups of the machine learning pipeline stages, a different group of the multiple different groups is associated with a different machine learning pipeline of different machine learning pipelines, and testing the machine learning pipeline comprises parallel testing of the different machine learning pipelines.
  11. Network equipment, comprising: at least one processor; and at least one memory that stores executable instructions that, when executed by the at least one processor, facilitate performance of operations, comprising: iteratively retraining a pipeline machine learning model to generate machine learning pipelines from machine learning notebooks, wherein the retraining comprises, for each machine learning notebook data structure of a group of machine learning notebook data structures, iteratively: labeling cells of the machine learning notebook data structure, resulting in labeled cells, wherein the labeling comprises, for each cell of the cells: parsing respective lines of code within the cell, identifying methods in the respective lines of code, classifying the methods respectively as a library-specific method or a user created method, and labeling the cell with a label based on library-specific methods within the cell, and not based on any user created methods within the cell; generating a machine learning pipeline using the labeled cells, wherein the generating comprises assigning the labeled cells to machine learning pipeline stages of the machine learning pipeline based on the labels applied to the labeled cells; testing the machine learning pipeline, resulting in a success of the machine learning pipeline or a failure of the machine learning pipeline; and retraining the pipeline machine learning model to generate the machine learning pipelines from the machine learning notebooks based on the success of the machine learning pipeline or the failure of the machine learning pipeline.
  12. The network equipment of claim 11, wherein testing the machine learning pipeline results in the failure of the machine learning pipeline, and wherein the operations further comprise, in response to the failure of the machine learning pipeline: re-assigning a labeled cell of the labeled cells to a different machine learning pipeline stage of the machine learning pipeline stages, resulting in a reconfigured machine learning pipeline; and testing the reconfigured machine learning pipeline, resulting in a success of the reconfigured machine learning pipeline or a failure of the reconfigured machine learning pipeline.
  13. The network equipment of claim 12, wherein the re-assigning of the labeled cell is based on an error log generated in connection with testing the machine learning pipeline.
  14. The network equipment of claim 11, wherein testing the machine learning pipeline results in the success of the machine learning pipeline, and further results in performance data representative of a performance measurement associated with the machine learning pipeline.
  15. The network equipment of claim 14, wherein the performance measurement comprises a time measurement indicative of an execution time for the machine learning pipeline to execute, wherein the machine learning pipeline is a first machine learning pipeline and the performance data representative of the performance measurement is first performance data representative of a first performance measurement, and wherein the operations further comprise: comparing the first performance measurement with a second performance measurement associated with a second machine learning pipeline; and based on a result of the comparing, selecting the first machine learning pipeline or the second machine learning pipeline for deployment in connection with the machine learning notebook data structure.
  16. A non-transitory machine-readable medium, comprising executable instructions that, when executed by a processor, facilitate performance of operations, comprising: iteratively retraining a pipeline machine learning model to generate machine learning pipelines from machine learning notebooks, wherein the retraining comprises iteratively, for each machine learning notebook data structure of a group of machine learning notebook data structures: labeling cells of the machine learning notebook data structure, resulting in labeled cells, wherein the labeling comprises, for each cell of the cells: parsing respective lines of code within the cell, identifying methods in the respective lines of code, classifying the methods respectively as a library-specific method or a user created method, and labeling the cell with a label based on library-specific methods within the cell, and not based on any user created methods within the cell; generating a machine learning pipeline using the labeled cells, wherein the generating comprises assigning the labeled cells to machine learning pipeline stages of the machine learning pipeline based on the labels applied to the labeled cells; testing the machine learning pipeline, resulting in a success of the machine learning pipeline or a failure of the machine learning pipeline; and retraining the pipeline machine learning model to generate the machine learning pipelines from the machine learning notebooks based on the success of the machine learning pipeline or the failure of the machine learning pipeline.
  17. The non-transitory machine-readable medium of claim 16, wherein testing the machine learning pipeline results in the failure of the machine learning pipeline, and wherein the operations further comprise, in response to the failure of the machine learning pipeline: re-assigning a labeled cell of the labeled cells to a different machine learning pipeline stage of the machine learning pipeline stages, resulting in a reconfigured machine learning pipeline; and testing the reconfigured machine learning pipeline, resulting in a success of the reconfigured machine learning pipeline or a failure of the reconfigured machine learning pipeline.
  18. The non-transitory machine-readable medium of claim 17, wherein the re-assigning of the labeled cell is based on an error log generated in connection with testing the machine learning pipeline.
  19. The non-transitory machine-readable medium of claim 16, wherein testing the machine learning pipeline results in the success of the machine learning pipeline, and further results in performance data representative of a performance measurement associated with the machine learning pipeline.
  20. The non-transitory machine-readable medium of claim 19, wherein the performance measurement comprises a time measurement indicative of an execution time for the machine learning pipeline to execute, wherein the machine learning pipeline is a first machine learning pipeline and the performance data representative of the performance measurement is first performance data representative of a first performance measurement, and wherein the operations further comprise: comparing the first performance measurement with a second performance measurement associated with a second machine learning pipeline; and selecting, based on a result of the comparing, the first machine learning pipeline or the second machine learning pipeline for deployment in connection with the machine learning notebook data structure.
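By way of illustration only, the labeling step recited in claim 1 (parsing a cell's lines of code, identifying the methods called, classifying each as library-specific or user created, and labeling the cell from the library-specific methods alone) might be sketched in Python as follows. The claims do not prescribe any particular implementation; the stage labels and the method-to-stage mapping below are hypothetical.

```python
import ast

# Hypothetical mapping from library-specific method names to pipeline-stage
# labels; neither these names nor these stages are prescribed by the claims.
LIBRARY_METHOD_LABELS = {
    "read_csv": "data_ingestion",
    "fillna": "data_preparation",
    "fit": "model_training",
    "predict": "model_evaluation",
}

def label_cell(cell_source, user_created):
    """Label one notebook cell: parse its code, identify the methods called,
    classify each as library-specific or user created, and derive the label
    from the library-specific methods alone."""
    labels = []
    for node in ast.walk(ast.parse(cell_source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # A call target is either an attribute (pd.read_csv) or a bare name.
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name is None or name in user_created:
            continue  # user created methods do not contribute to the label
        if name in LIBRARY_METHOD_LABELS:
            labels.append(LIBRARY_METHOD_LABELS[name])
    # Take the first matching library-specific method's label, if any.
    return labels[0] if labels else None

# Names of user created helpers, e.g. gathered beforehand from `def` statements.
user_created = {"my_cleanup"}
print(label_cell("df = pd.read_csv('train.csv')\nmy_cleanup(df)", user_created))
# prints: data_ingestion
```

A real system would also resolve library aliases and handle cells that match several stages; this sketch simply takes the first match to show how a label can depend only on library-specific calls.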

Description

BACKGROUND

Data analytics is rapidly growing across all industries, and having proper data analytics, data processing, and data visualization tools has become more important than ever. Machine learning models are increasingly important for these tasks. However, machine learning model development is complex and can involve significant resources; with the right tools, data scientists can develop machine learning models more efficiently.

Machine learning development and training can be facilitated by machine learning notebooks. For example, Project Jupyter is an open-source project that has developed a notebook tool called "Jupyter Notebooks." Machine learning notebooks employ kernels and cells to help organize and run the steps found in a machine learning project, such as fetching data, transforming data, training a machine learning model, and persisting the machine learning model.

After a data scientist is satisfied with the results of a machine learning notebook, the data scientist may struggle to transition their code into a high-performance, reliable machine learning pipeline. A machine learning pipeline provides a bridge from development to production of machine learning and artificial intelligence projects. Machine learning pipelines help automate machine learning workflows by processing and integrating datasets into a model, which can then be evaluated and delivered to production. For example, machine learning pipelines can handle training, serving, monitoring, and orchestrating of models on data infrastructures.

The above-described background is merely intended to provide a contextual overview of some current issues and is not intended to be exhaustive. Other contextual information may become further apparent upon review of the following detailed description.
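For reference, a Jupyter notebook (an .ipynb file) is a JSON document that stores its cells as a list, which is what an automated framework can iterate over when extracting pipeline stages. A minimal, self-contained sketch follows; the cell contents are illustrative, not taken from any particular notebook.

```python
import json

# A minimal .ipynb-style document: notebooks store cells as a JSON list.
notebook_json = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Load data"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["df = pd.read_csv('train.csv')"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["model.fit(X, y)"]},
    ],
})

# Iterate over the code cells, the units a labeling framework would examine.
notebook = json.loads(notebook_json)
code_cells = [c for c in notebook["cells"] if c["cell_type"] == "code"]
for cell in code_cells:
    print("".join(cell["source"]))
```

Each code cell's `source` field carries the lines of code that a parser can analyze independently of the others, which is what makes per-cell labeling practical.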
BRIEF DESCRIPTION OF THE DRAWINGS

The technology described herein is illustrated by way of example and not limited in the accompanying figures, in which like reference numerals indicate similar elements and in which:

FIG. 1 illustrates example labeling of cells of a machine learning notebook data structure, in accordance with one or more embodiments described herein.

FIG. 2 illustrates example assigning of labeled cells to machine learning pipeline stages based on labels applied to the labeled cells, in accordance with one or more embodiments described herein.

FIG. 3 illustrates example testing of a machine learning pipeline comprising the labeled cells assigned to the machine learning pipeline stages, resulting in a success of the machine learning pipeline or a failure of the machine learning pipeline, in accordance with one or more embodiments described herein.

FIG. 4 illustrates an example architecture for automated extraction and testing of machine learning pipelines, in accordance with one or more embodiments described herein.

FIG. 5 illustrates an example architecture and operations of a notebook collector which can be included in the architecture introduced in FIG. 4, in accordance with one or more embodiments described herein.

FIG. 6 illustrates an example architecture and operations of a pipeline learner which can be included in the architecture introduced in FIG. 4, in accordance with one or more embodiments described herein.

FIG. 7 illustrates an example architecture and operations of a pipeline tester which can be included in the architecture introduced in FIG. 4, in accordance with one or more embodiments described herein.

FIG. 8 is a flow diagram of a first example, non-limiting computer implemented method employed in connection with automated generation of machine learning pipelines, in accordance with one or more embodiments described herein.

FIG. 9 is a flow diagram of a second example, non-limiting computer implemented method employed in connection with automated generation of machine learning pipelines, in accordance with one or more embodiments described herein.

FIG. 10 is a flow diagram of a third example, non-limiting computer implemented method employed in connection with automated generation of machine learning pipelines, in accordance with one or more embodiments described herein.

FIG. 11 illustrates a block diagram of an example computer operable to provide any of the various devices described herein.

DETAILED DESCRIPTION

One or more embodiments are now described with reference to the drawings, wherein like reference numerals are used to refer to like elements throughout. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments. It may be evident, however, that the various embodiments can be practiced without these specific details, e.g., without applying to any particular networked environment or standard. In other instances, well-known structures and devices are shown in block diagram form in order to facilitate describing the embodiments in additional detail. The su