EP-4740148-A1 - SYSTEM AND METHOD FOR MACHINE LEARNING MODEL RE-FORMULATION
Abstract
A computerized method for determining when an ML model in a data generation process is not stable is described. An original set of training data is applied to each ML model in a data generation process. Loss values are determined from data samples for each of the ML models. The average distance between the loss values of data samples whose difference in loss value is less than a threshold is determined for each of the ML models. The dependency of the rate of separation between the average distances on the number of model runs is analyzed as an exponential model. Based on the analyzing, it is determined whether the ML models are stable or divergent.
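The pair-selection and averaging steps summarized in the abstract can be sketched roughly as follows. This is an illustrative Python sketch, not the patented embodiment: the function names, the brute-force enumeration of all sample pairs, and the zero fallback for an empty pair set are assumptions introduced here for clarity.

```python
# Illustrative sketch of the abstract's per-model statistic (assumed names,
# not from the patent): select sample pairs with near-equal loss, then
# average the loss distances over those pairs.
import itertools

def pair_indices(losses, threshold):
    """Index pairs (i, j) whose loss-value difference is below the threshold.

    Brute-force over all pairs; for n samples this examines n*(n-1)/2 pairs.
    """
    return [(i, j)
            for i, j in itertools.combinations(range(len(losses)), 2)
            if abs(losses[i] - losses[j]) < threshold]

def avg_pair_distance(losses, pairs):
    """Average |L(x_i) - L(x_j)| over the selected pairs (the d(k) statistic).

    Returns 0.0 for an empty pair set (an assumption made for this sketch).
    """
    if not pairs:
        return 0.0
    return sum(abs(losses[i] - losses[j]) for i, j in pairs) / len(pairs)
```

In the described method, the pairs would be selected once from the first model's losses, and the same index pairs would then be re-scored against each subsequent model's losses to obtain one average distance per model run.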
Inventors
- BOUÉ, Laurent
- RAMA, Kiran
Assignees
- Microsoft Technology Licensing, LLC
Dates
- Publication Date
- 20260513
- Application Date
- 20240630
Claims (20)
- 1. A system (100) for re-formulating a machine learning model, the system comprising: a memory (108) comprising: computer readable media (116); a first set of data (118) for a first machine learning (ML) model (110); and a second set of data (120) for a second ML model (112), the first set of data (118) comprising a first set of loss values for a first set of training samples used to train the first ML model (110), the second set of data (120) comprising a second set of loss values for the first set of training samples applied to the second ML model (112) that was trained using a second set of training samples; the computer readable media (116) comprising computer executable instructions (116) that cause a processor (104) to perform the following operations: identifying (402), from the first set of loss values for the first ML model (110), a first plurality of training sample pairs from the first set of training samples that have a difference in a loss value less than a threshold; identifying (404), from the second set of data (120), a second plurality of training sample pairs that correspond to the first plurality of training sample pairs; determining (406), from the first set of loss values, a first average loss value distance between each pair from the first plurality of training sample pairs from the first set of training samples; determining (408), from the second set of loss values, a second average loss value distance between each pair from the second plurality of training sample pairs; analyzing (410), as an exponential model, a dependency of a rate of separation between the first average loss value distance and the second average loss value distance versus a number of model runs; based on the analyzing, determining (412) whether the training of the second ML model (112) using the second set of training samples is stable; and causing (414) the first ML model (110) to be reformulated when the second ML model (112) is not stable.
- 2. The system of Claim 1, wherein the first set of loss values are determined by the following: receiving the first set of training samples; receiving ground truth data; applying the first set of training samples as input into the first ML model; receiving, based on the applying, a set of predictions from the first ML model, the set of predictions comprising a prediction corresponding to each training sample in the first set of training samples; comparing each prediction in the set of predictions to a respective ground truth from the ground truth data; and based on the comparing, determining a loss value for each training sample in the first set of training samples.
- 3. The system of any of Claims 1 and 2, wherein the second set of loss values are determined by the following: receiving the first set of training samples; receiving ground truth data; applying the first set of training samples as input into the second ML model trained using the second set of training samples; receiving, based on the applying, a set of predictions from the second ML model, the set of predictions comprising a prediction corresponding to each training sample in the first set of training samples; comparing each prediction in the set of predictions to a respective ground truth from the ground truth data; and based on the comparing, determining a loss value for each training sample in the first set of training samples.
- 4. The system of any of Claims 1-3, wherein the exponential model is d(k) ~ exp(k*lambda), where “k” is the number of ML models, “d(k)” is an average loss value distance for an ML model, and lambda is a parameter extracted from the data.
- 5. The system of Claim 4, wherein d(k) = Avg |Lk(xi) - Lk(xj)|, wherein “L” is a loss value, “k” is the number of the ML model, “x” is the first set of training samples, and “i” and “j” are training sample pairs from the first set of training samples that have a difference in a loss value less than the threshold.
- 6. The system of Claim 4, wherein when a value of lambda is less than or equal to zero, a current data generation process is stable and the training of the second ML model using the second set of training samples is stable; and wherein when the value of lambda is greater than zero, the current data generation process is divergent and the training of the second ML model using the second set of training samples is not stable.
- 7. The system of any of Claims 1-6, wherein causing the first ML model to be reformulated comprises issuing an alert to a user.
- 8. A computerized method (400) comprising: identifying (402), from a first set of loss values for a first set of training samples used to train a first ML model (110), a first plurality of training sample pairs from the first set of training samples that have a difference in a loss value less than a threshold; identifying (404), from a second set of data (120) comprising a second set of loss values for the first set of training samples applied to a second ML model (112) that was trained using a second set of training samples, a second plurality of training sample pairs that correspond to the first plurality of training sample pairs; determining (406), from the first set of loss values, a first average loss value distance between each pair from the first plurality of training sample pairs from the first set of training samples; determining (408), from the second set of loss values, a second average loss value distance between each pair from the second plurality of training sample pairs; analyzing (410), as an exponential model, a dependency of a rate of separation between the first average loss value distance and the second average loss value distance versus a number of model runs; based on the analyzing, determining (412) whether the training of the second ML model (112) using the second set of training samples is stable; and causing (414) the first ML model (110) to be reformulated when the second ML model (112) is not stable.
- 9. The computerized method of Claim 8, further comprising determining the first set of loss values by: receiving the first set of training samples; receiving ground truth data; applying the first set of training samples as input into the first ML model; receiving, based on the applying, a set of predictions from the first ML model, the set of predictions comprising a prediction corresponding to each training sample in the first set of training samples; comparing each prediction in the set of predictions to a respective ground truth from the ground truth data; and based on the comparing, determining a loss value for each training sample in the first set of training samples.
- 10. The computerized method of any of Claims 8 and 9, further comprising determining the second set of loss values by: receiving the first set of training samples; receiving ground truth data; applying the first set of training samples as input into the second ML model trained using the second set of training samples; receiving, based on the applying, a set of predictions from the second ML model, the set of predictions comprising a prediction corresponding to each training sample in the first set of training samples; comparing each prediction in the set of predictions to a respective ground truth from the ground truth data; and based on the comparing, determining a loss value for each training sample in the first set of training samples.
- 11. The computerized method of any of Claims 8-10, wherein the exponential model is d(k) ~ exp(k*lambda), where “k” is the number of ML models, “d(k)” is an average loss value distance for an ML model, and lambda is a parameter extracted from the data.
- 12. The computerized method of Claim 11, wherein d(k) = Avg |Lk(xi) - Lk(xj)|, wherein “L” is a loss value, “k” is the number of the ML model, “x” is the first set of training samples, and “i” and “j” are training sample pairs from the first set of training samples that have a difference in a loss value less than the threshold.
- 13. The computerized method of Claim 11, wherein when a value of lambda is less than or equal to zero, a current data generation process is stable and the training of the second ML model using the second set of training samples is stable; and wherein when the value of lambda is greater than zero, the current data generation process is divergent and the training of the second ML model using the second set of training samples is not stable.
- 14. The computerized method of any of Claims 8-13, wherein causing the first ML model to be reformulated comprises issuing an alert to a user.
- 15. A computer storage medium storing computer-executable instructions (116) that, upon execution by a processor (104), cause the processor (104) to perform the following: identifying (402), from a first set of loss values for a first set of training samples used to train a first ML model (110), a first plurality of training sample pairs from the first set of training samples that have a difference in a loss value less than a threshold; identifying (404), from a second set of data (120) comprising a second set of loss values for the first set of training samples applied to a second ML model (112) that was trained using a second set of training samples, a second plurality of training sample pairs that correspond to the first plurality of training sample pairs; determining (406), from the first set of loss values, a first average loss value distance between each pair from the first plurality of training sample pairs from the first set of training samples; determining (408), from the second set of loss values, a second average loss value distance between each pair from the second plurality of training sample pairs; analyzing (410), as an exponential model, a dependency of a rate of separation between the first average loss value distance and the second average loss value distance versus a number of model runs; based on the analyzing, determining (412) whether the training of the second ML model (112) using the second set of training samples is stable; and causing (414) the first ML model (110) to be reformulated when the second ML model (112) is not stable.
- 16. The computer storage medium of Claim 15, wherein the first set of loss values are determined by the following: receiving the first set of training samples; receiving ground truth data; applying the first set of training samples as input into the first ML model; receiving, based on the applying, a set of predictions from the first ML model, the set of predictions comprising a prediction corresponding to each training sample in the first set of training samples; comparing each prediction in the set of predictions to a respective ground truth from the ground truth data; and based on the comparing, determining a loss value for each training sample in the first set of training samples.
- 17. The computer storage medium of any of Claims 15 and 16, wherein the second set of loss values are determined by the following: receiving the first set of training samples; receiving ground truth data; applying the first set of training samples as input into the second ML model trained using the second set of training samples; receiving, based on the applying, a set of predictions from the second ML model, the set of predictions comprising a prediction corresponding to each training sample in the first set of training samples; comparing each prediction in the set of predictions to a respective ground truth from the ground truth data; and based on the comparing, determining a loss value for each training sample in the first set of training samples.
- 18. The computer storage medium of any of Claims 15-17, wherein the exponential model is d(k) ~ exp(k*lambda), where “k” is the number of ML models, “d(k)” is an average loss value distance for an ML model, and lambda is a parameter extracted from the data.
- 19. The computer storage medium of Claim 18, wherein d(k) = Avg |Lk(xi) - Lk(xj)|, wherein “L” is a loss value, “k” is the number of the ML model, “x” is the first set of training samples, and “i” and “j” are training sample pairs from the first set of training samples that have a difference in a loss value less than the threshold.
- 20. The computer storage medium of any of Claims 15-19, wherein when a value of lambda is less than or equal to zero, a current data generation process is stable and the training of the second ML model using the second set of training samples is stable; and wherein when the value of lambda is greater than zero, the current data generation process is divergent and the training of the second ML model using the second set of training samples is not stable.
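The exponential fit and stability test recited in Claims 4-6, 11-13, and 18-20 can be sketched as a log-linear regression: since d(k) ~ exp(k*lambda) implies log d(k) is linear in k with slope lambda, fitting a least-squares line to the logged distances yields the lambda whose sign decides stability. This is an illustrative Python sketch, not the patented implementation; the least-squares estimator, the function names, and the assumption of strictly positive distances are introduced here.

```python
# Illustrative sketch (assumed approach, not the patent's embodiment) of
# fitting d(k) ~ exp(k * lambda) and applying the sign test of Claim 6:
# lambda <= 0 -> stable process, lambda > 0 -> divergent process.
import math

def estimate_lambda(distances):
    """Least-squares slope of log d(k) versus run index k.

    Requires at least two strictly positive distances, since log d(k)
    is undefined at zero.
    """
    ks = range(len(distances))
    logs = [math.log(d) for d in distances]
    n = len(distances)
    k_mean = sum(ks) / n
    l_mean = sum(logs) / n
    num = sum((k - k_mean) * (l - l_mean) for k, l in zip(ks, logs))
    den = sum((k - k_mean) ** 2 for k in ks)
    return num / den

def is_stable(distances):
    """Stable when the fitted rate of separation lambda is non-positive."""
    return estimate_lambda(distances) <= 0.0
```

On distances that grow run over run (e.g. doubling each run), the fitted lambda is positive and the process is flagged divergent, prompting the reformulation of Claim 1; shrinking or flat distances yield a non-positive lambda and the process is treated as stable.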
Description
SYSTEM AND METHOD FOR MACHINE LEARNING MODEL RE-FORMULATION

BACKGROUND

[0001] Machine learning (ML) models are used in a variety of applications to analyze data and make predictions or decisions. ML models are trained to learn patterns and relationships from the data and generalize that knowledge to new, unseen instances. During training, the ML models adjust internal parameters to minimize a difference between predicted and actual outputs. Ensuring the ML models are current is very important to product performance.

[0002] Retraining an ML model can be performed periodically or based on a variety of factors, such as a pre-defined cadence or a drop in ML model performance. The main goal of retraining the ML model is to keep the ML model up to date with whatever form of data/concept drift an ML solution may be affected by. In almost every conventional retraining procedure, the retraining procedure collects new (i.e., more recent) training data and restarts the ML training procedure (either as fine-tuning from the previous model or from scratch). However, training models too frequently can affect stability and incurs additional compute and overhead costs. Moreover, the underlying assumptions that were made during initial development of the ML model (such as feature engineering, choice of architecture, and target variables) remain unchanged as the ML models are being retrained.

SUMMARY

[0003] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
[0004] Example solutions for determining a stability of a data generation process for machine learning (ML) models include: identifying, from a first set of loss values for a first ML model, a first plurality of training sample pairs from a first set of training samples that have a difference in a loss value less than a threshold; identifying, from a second set of data, a second plurality of training sample pairs that correspond to the first plurality of training sample pairs, the second set of data comprising a second set of loss values for the first set of training samples applied to a second ML model that was trained using a second set of training samples; determining, from the first set of loss values, a first average loss value distance between each pair from the first plurality of training sample pairs from the first set of training samples; determining, from the second set of loss values, a second average loss value distance between each pair from the second plurality of training sample pairs; analyzing, as an exponential model, a dependency of a rate of separation between the first average loss value distance and the second average loss value distance versus a number of model runs; based on the analyzing, determining whether the training of the second ML model using the second set of training samples is stable; and causing the first ML model to be reformulated when the second ML model is not stable.

BRIEF DESCRIPTION OF THE DRAWINGS

[0005] The present description will be better understood from the following detailed description read considering the accompanying drawings, wherein:

[0006] FIG. 1 is a block diagram illustrating an example system in accordance with some embodiments;

[0007] FIG. 2 is a flowchart illustrating an example method for determining loss values for an initial ML model;

[0008] FIG. 3 is a flowchart illustrating an example method for determining loss values for a subsequent ML model using original training data;

[0009] FIG. 4 is a flowchart illustrating an example method for determining a stability of a data generation process;

[0010] FIG. 5 illustrates an example computing apparatus as a functional block diagram.

[0011] Corresponding reference characters indicate corresponding parts throughout the drawings. In FIGs. 1 to 5, the systems are illustrated as schematic drawings. The drawings may not be to scale. Any of the figures may be combined into a single example or embodiment.

DETAILED DESCRIPTION

[0012] Aspects of the disclosure provide a system and method for determining the stability of a data generation process for machine learning (ML) models. Conventional systems only look at model re-training indicators. These indicators tell an ML scientist to re-train the ML model when there is a drift, as measured by a change in input data distributions, a drop in the performance of the ML model, or a drop in real-world data performance. In these instances, the ML model is retrained using a current set of training data. However, what these systems fail to provide is an ability to determine if the ML model itself should be reformulated, not simply retrained.

[0013] For example, data drift is usually estimated by looking at distributions