EP-4348536-B1 - SYSTEM FOR HARNESSING KNOWLEDGE AND EXPERTISE TO IMPROVE MACHINE LEARNING

Inventors

FAGAN, DAVID
COYLE, MAURICE
FENTON, MICHAEL
SMYTH, BARRY

Dates

Publication Date: 20260506
Application Date: 20220525

Claims (15)

A method for harnessing knowledge, expertise and previous activity to support the design and implementation of data science workflows including machine learning elements by recommending to a user processing steps, settings and configurations in order to achieve an analytical goal for at least one activity, the method comprising: capturing input data for the at least one activity associated with a workflow; analysing the context of the at least one activity; searching the knowledge, expertise and previous activities for previous decisions made in previous contexts similar to the context of the at least one activity; determining which previous users made the previous decisions; evaluating the results of previous analysis determined by the previous decisions; modeling and learning at least one process associated with the input data; reviewing the result of the learning and modeling to produce an output; and providing at least one recommendation to the user, via a recommendations engine, for processing steps, settings and configurations for the at least one activity during a subsequent processing of the at least one activity based on one or more of the output, the previous decisions, the results of the previous analysis, or the previous users.
The method of claim 1 wherein one or more reputations of the previous users are calculated based on the evaluating the results of previous analysis determined by the previous decisions made by the previous users and the context of the previous analysis.
The method of claim 2, wherein the one or more reputations of the previous users weights the at least one recommendation to the user.
The method of any preceding claim wherein the recommendations engine includes a plurality of software-based recommenders that make recommendations about which techniques and configurations to use at each stage of the workflow.
The method of any preceding claim wherein the recommendations engine comprises a single multipurpose recommender or a plurality of specialized tuned recommenders.
The method of any preceding claim wherein the recommendations engine includes at least one of a data enhancement recommender, a problem definition recommender, a modeling practices recommender, and a visualization recommender.
The method of any preceding claim wherein the at least one recommendation is generated by a recommendations engine for each stage of a machine learning workflow regarding techniques and configurations to improve results.
The method of any preceding claim wherein the recommendations engine utilizes explicit and implicit inputs.
The method of claim 8 wherein the implicit inputs to the recommendations engine include one or more of decisions or actions made by previous users or are extracted from existing knowledge stores including at least machine learning communities and websites.
The method of claim 8 wherein the explicit inputs to the recommendations engine include at least one of manually defined problem statements, solution definitions and other user feedback.
The method of claim 8 wherein the explicit inputs to the recommendations engine include one or more of best practice decisions for each stage of a machine learning pipeline and different contexts or minimum standards defined within an organization or community of users.
The method of any preceding claim wherein the recommendation to be applied is selected by the user or automatically selected as the most appropriate.
The method of any preceding claim wherein a distance metric can be used to retrieve the previous contexts similar to the context of the at least one activity.
A system for harnessing knowledge, expertise and previous activity to support the design and implementation of data science workflows by recommending processing steps, settings and configurations in order to achieve an analytical goal for at least one activity, the system comprising: a memory device communicatively coupled to an input/output (I/O) device, the memory device capturing input data for the at least one activity associated with a workflow; and a processor configured to: analyse the context of the at least one activity; search the knowledge, expertise, and previous activities for previous decisions made in previous contexts similar to the context of the at least one activity; determine which previous users made the previous decisions; evaluate the results of previous analysis determined by the previous decisions; model and learn at least one process associated with the input data; review the result of the learning and modeling to produce an output; and provide at least one recommendation to the user, via a recommendations engine, for processing steps, settings and configurations for the at least one activity during a subsequent processing of the at least one activity based on the output, the previous decisions, the results of the previous analysis, and the previous users.
The system of claim 14 wherein the recommendations engine includes a plurality of software-based recommenders that make recommendations about which techniques and configurations to use at each stage of the workflow.
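
By way of illustration only, the retrieval of similar previous contexts by a distance metric and the weighting of recommendations by user reputation, as recited in the claims above, might be sketched as follows. The record layout (`PastDecision`), the use of Euclidean distance, the choice of reputation as a user's mean past result, and the value of `k` are all assumptions made for this sketch; the claims do not prescribe any of them.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PastDecision:
    """Hypothetical record of one previous decision and its outcome."""
    user: str
    context: tuple   # numeric context features, assumed already scaled
    setting: str     # technique or configuration that was chosen
    result: float    # evaluation score of the resulting analysis, in [0, 1]


def reputations(history):
    """Assumed reputation: the mean result of each user's past decisions."""
    scores = defaultdict(list)
    for d in history:
        scores[d.user].append(d.result)
    return {user: sum(v) / len(v) for user, v in scores.items()}


def recommend(context, history, k=3):
    """Rank settings from the k nearest past contexts, weighted by reputation."""
    rep = reputations(history)
    # Euclidean distance stands in for "a distance metric".
    nearest = sorted(history, key=lambda d: math.dist(context, d.context))[:k]
    votes = defaultdict(float)
    for d in nearest:
        votes[d.setting] += rep[d.user] * d.result
    return sorted(votes.items(), key=lambda kv: kv[1], reverse=True)


history = [
    PastDecision("alice", (0.10, 0.90), "random_forest", 0.90),
    PastDecision("alice", (0.20, 0.80), "random_forest", 0.80),
    PastDecision("bob", (0.15, 0.85), "logistic_regression", 0.60),
    PastDecision("bob", (0.90, 0.10), "naive_bayes", 0.40),
]
ranked = recommend((0.12, 0.88), history, k=3)
```

In this toy history the three nearest contexts come from two users, and the setting chosen by the user with the higher mean past result is ranked first. A fuller sketch would also take the context of each previous analysis into account when computing reputation, as claim 2 recites.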

Description

FIELD OF INVENTION

The present invention is directed to machine learning, and more particularly to providing a system and method for harnessing human knowledge and expertise to improve machine learning processes.

BACKGROUND

When building workflows to train, test and validate machine learning models, the vast array of available techniques and configuration options at each stage of the process can be very difficult to navigate. These challenges require, or benefit from, a large amount of expertise to make the right decisions and to understand their inner workings and effects. In automated or guided machine learning systems, the use of brute-force techniques via automatic selection of techniques and configurations falls short in accounting for the creativity, subtleties, and nuances provided by human intuition and expertise. Furthermore, some systems benefit from the scientific method, which allows users of the system to iterate and improve the models in order to achieve better performance, more reliable results, and improved levels of trust and confidence in the outcomes produced.

US Patent Application No. 2018/165604A1 titled "SYSTEMS AND METHODS FOR AUTOMATING DATA SCIENCE MACHINE LEARNING ANALYTICAL WORKFLOWS" discloses systems and methods for automating data science machine learning using analytical workflows that provide for user interaction and iterative analysis. US 10509672 B2 refers to a system for establishing and maintaining a standardized and interoperable resource information assertion environment. US 2020/411199 A1 relates to platforms, systems, media and methods for capturing clinical cases and expert-derived treatment rationales to facilitate biomedical decision making.

Generally, brute-force approaches to automated machine learning processes fail to include user intuition or expertise in the process. They also do not allow domain knowledge to be applied to the machine learning process.
Each execution of a machine learning process is one-size-fits-all, i.e., these systems do not adapt to specific problem characteristics or learn as time goes on. Generally, these systems remove the "science" from "data science," making it more difficult to understand the configurations that have been and will be selected and to run experiments to uncover fresh insights and improve performance. Therefore, a clear need exists for delivering optimal machine learning performance in automatic or guided machine learning systems as well as in machine learning systems generally.

SUMMARY

As data science and machine learning explode in popularity, organizations are struggling to meet demand with skilled resources, and thus their data-driven growth is impeded. Novice data scientists greatly benefit when they can learn by observing more senior data scientists build machine learning workflows to train, test, validate, evaluate, and deploy machine learning models. Thus, there is a need to more efficiently harness this expertise and use it to accelerate the education of data scientists.

A system and method of harnessing knowledge and expertise to improve machine learning is disclosed. The system and method include capturing the data to input, preparing the captured data, enhancing the prepared data, modeling and learning the process associated with the enhanced data, reviewing the result of the learning and modeling to produce an output, visualizing the reviewed output, and receiving input and recommendations from a recommendations engine that recommends techniques and configurations to use.

The preparation of the data includes pre-processing, cleaning, and validating the input data. The preparation of the data includes at least one of exploring the data to identify outliers, missing values and other statistical characteristics, dealing with missing values, handling outliers, removing duplicate rows, initial validation and filtering.
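
As a non-limiting illustration, the data preparation steps just described (removing duplicate rows, dealing with missing values, handling outliers) might be sketched as follows. The function name `prepare`, the median imputation, the z-score outlier rule, and the threshold `z_max` are assumptions chosen for this sketch and are not taken from the disclosure.

```python
import statistics


def prepare(rows, column, z_max=3.0):
    """Deduplicate rows, impute missing values, and drop outliers in one column.

    Hypothetical helper: `rows` is a list of dicts, missing values are None.
    """
    # Remove duplicate rows while preserving their original order.
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # Deal with missing values: impute the median of the observed values.
    observed = [r[column] for r in unique if r[column] is not None]
    median = statistics.median(observed)
    for r in unique:
        if r[column] is None:
            r[column] = median

    # Handle outliers: drop rows whose z-score exceeds z_max.
    values = [r[column] for r in unique]
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return unique
    return [r for r in unique if abs(r[column] - mean) / stdev <= z_max]


rows = [
    {"id": 1, "x": 10.0},
    {"id": 1, "x": 10.0},   # exact duplicate of the first row
    {"id": 2, "x": None},   # missing value
    {"id": 3, "x": 12.0},
    {"id": 4, "x": 11.0},
    {"id": 5, "x": 500.0},  # extreme value
]
cleaned = prepare(rows, "x", z_max=1.5)
```

The low threshold of 1.5 is used here only because the sample is tiny: with five values, a population z-score cannot exceed 2, so the conventional threshold of 3 would never remove anything. In the system described, choices such as the imputation method and the threshold are the kind of settings a recommender could suggest.
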
Enhancing the prepared data includes resolving class imbalance via under-sampling or over-sampling techniques, appropriate scaling and normalization of the data, and feature engineering to produce more representative and useful features. The learn/model process comprises defining a prediction task, selecting appropriate features to use, choosing a suitable algorithm and evaluation metric, tuning hyperparameters, and generating training/test/validation datasets. The review of the result of the learning and modeling involves at least one of scoring trained models, generating evaluation metrics, and determining appropriate statistical tests of significance for robust hypothesis testing. Visualization includes creation of graphical depictions of the evaluation results for analysis and decision-making. The recommendations engine includes a plurality of software-based recommenders that make recommendations about which techniques and configurations to use at each stage of the workflow. The recommendations engine comprises at least one of a multipurpose recommender and a plurality of specialized tuned recommenders. The