US-20260127452-A1 - TECHNIQUES TO PREDICT COMPATIBILITY, APPLICABILITY, AND GENERALIZATION PERFORMANCE OF A MACHINE-LEARNING MODEL AT RUN TIME
Abstract
The present disclosure relates to analysis techniques to determine at run time whether a machine-learning model is applicable to a new set of input data. Particularly, aspects are directed to inputting an input query into a machine-learning model, generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the input query, obtaining one or more metrics for generalization of the machine-learning model on the input query, the one or more metrics being computed using black-box and/or clear-box techniques for predicting a correctness of a model on a sample-by-sample basis (additionally applicable on a population level), by analyzing how the machine-learning model responds to the input query, and outputting a prediction of model generalization for the machine-learning model based on the one or more metrics.
Inventors
- Abhejit Rajagopal
- Thomas A. Hope
- Peder E.Z. Larson
Assignees
- THE REGENTS OF THE UNIVERSITY OF CALIFORNIA
Dates
- Publication Date: 2026-05-07
- Application Date: 2023-10-05
Claims (20)
- 1 . A computer-implemented method comprising: obtaining testing data without ground truth labels; inputting the testing data into a machine-learning model; generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the test data; obtaining one or more metrics for generalization of the machine-learning model on the test data, where the obtaining comprises: (i) computing metrics using intermediate feature representations extracted from the nodes of the computational graph or abstract syntax tree (AST) of the machine learning model, (ii) computing metrics using the output prediction in conjunction with the input data or intermediate features, (iii) computing metrics using the model weights and architectures and its interactions with the input data, intermediate features, or output predictions, or (iv) any combination of (i-iii); and outputting a prediction of model generalization for the machine learning model at the population-level based on the one or more metrics.
- 2 . A computer-implemented method comprising: obtaining testing data without ground truth labels; inputting the testing data into a machine-learning model; generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the test data; obtaining one or more metrics for generalization of the machine-learning model on the test data, wherein the obtaining comprises executing various neural trace analysis algorithms that capture, summarize, or derive intelligence from intermediate feature representations at nodes of the machine-learning model's computational graph or abstract syntax tree (AST) during program execution in response to input data, including: (i) computing a clustering metric based on clustering of the intermediate feature representations within input samples of the testing data, wherein the clustering is calculated based on a subset of samples obtained from training data with ground truth labels; (ii) computing a modified Mixup metric based on convex combinations of each of the input samples from the testing data and a subset of samples obtained from the training data with ground truth labels corresponding to the prediction for each of the input samples; (iii) computing a confidence metric based on clustering of the intermediate feature representations within the input samples of the testing data, wherein the clustering is calculated based on the input samples from the testing data; (iv) computing a roughness metric using dimensionality-reduction of the intermediate features to define a smoothness criteria for the training data and the testing data; or (v) any combination of (i)-(iv); and outputting a prediction of model generalization for the machine learning model at a level of the testing data based on the one or more metrics.
- 3 . The computer-implemented method of claim 2 , wherein the computing the clustering metric comprises: extracting the intermediate feature representations for each input sample in the testing data; performing, using the subset of samples and principal component analysis, the dimensionality-reduction of the intermediate features to predetermined dimensions at each layer of the machine-learning model, wherein the ground truth labels of the training data indicate membership to one or more classes; computing the clustering for the testing data based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; fitting mixture models or kernel density estimators to the intermediate feature representations corresponding to the samples from the subset of samples in the partition, wherein the fitting comprises computing a secondary partition at each layer on the input samples in the testing data based on the clustering computed for the testing data to obtain parameters for the mixture models; and computing the clustering metric based on the measure of the clustering over each layer of the machine-learning model and the mixture models.
- 4 . The computer-implemented method of claim 2 , wherein the modified Mixup metric comprises: computing an agreement in classification between each of the input samples from the testing data and the subset of the training data corresponding to the prediction for each of the input samples; when the agreement is indicative that the prediction is incorrect, the input sample is mixed with a subset of the input samples from the testing data that correspond to the incorrect prediction; and when the agreement is indicative that the prediction is correct, using the agreement as the modified Mixup metric.
- 5 . The computer-implemented method of claim 2 , wherein the computing the confidence metric comprises: extracting the intermediate feature representations for each input sample in the testing data; computing a partition of the subset of samples obtained from training data by k-means or similar clustering of an output domain of each layer of the machine-learning model, wherein the ground truth labels of the training data indicate membership to one or more classes; computing the clustering for the testing data based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; fitting mixture models or kernel density estimators to the intermediate feature representations corresponding to the samples from the subset of samples in the partition, wherein the fitting comprises computing a secondary partition at each layer on the input samples in the testing data based on the clustering computed for the testing data to obtain parameters for the mixture models; and computing the confidence metric based on the measure of the clustering over each layer of the machine-learning model and the mixture models.
- 6 . The computer-implemented method of claim 2 , wherein the computing the roughness metric comprises: extracting the intermediate feature representations for each input sample in the testing data; computing a partition of the subset of samples obtained from training data by k-means or similar clustering of an output domain of each layer of the machine-learning model, wherein the ground truth labels of the training data indicate membership to one or more classes; computing the clustering for the testing data based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; and computing the roughness metric based on the measure of the clustering over each layer of the machine-learning model.
- 7 . The computer-implemented method of claim 2 , further comprising predicting the model generalization for the machine learning model at the level of the testing data based on the one or more metrics.
- 8 . A computer-implemented method comprising: obtaining a sample query without ground truth labels; inputting the sample query into a machine-learning model; generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the sample query; obtaining one or more metrics for generalization of the machine-learning model on the sample query, wherein the obtaining comprises: (i) computing metrics using intermediate feature representations extracted from the nodes of the computational graph or abstract syntax tree (AST) of the machine learning model, (ii) computing metrics using the output prediction in conjunction with the input data or intermediate features, (iii) computing metrics using the model weights and architectures and its interactions with the input data, intermediate features, or output predictions, or (iv) any combination of (i-iii); and outputting a prediction of model generalization for the machine learning model at a level of the sample query based on the one or more metrics.
- 9 . A computer-implemented method comprising: obtaining a sample query without ground truth labels; inputting the sample query into a machine-learning model; generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the sample query; obtaining one or more metrics for generalization of the machine-learning model on the sample query, wherein the obtaining comprises executing various neural trace analysis algorithms that capture, summarize, or derive intelligence from intermediate feature representations at nodes of the machine-learning model's computational graph or abstract syntax tree (AST) during program execution in response to input data, including: (i) computing a clustering metric based on clustering of the intermediate feature representations within the sample query, wherein the clustering is calculated based on a subset of samples obtained from training data with ground truth labels; (ii) computing a modified Mixup metric based on convex combinations of the sample query and a subset of samples obtained from the training data with ground truth labels corresponding to the prediction for the sample query; (iii) computing a confidence metric based on clustering of the intermediate feature representations within the sample query; or (iv) any combination of (i)-(iii); and outputting a prediction of model generalization for the machine learning model at a level of the sample query based on the one or more metrics.
- 10 . The computer-implemented method of claim 9 , wherein the computing the clustering metric comprises: extracting the intermediate feature representations for the sample query; performing, using the subset of samples and principal component analysis, the dimensionality-reduction of the intermediate features to predetermined dimensions at each layer of the machine-learning model, wherein the machine-learning model's output prediction of the sample query data indicates membership to one or more classes; computing the clustering for the sample query based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; fitting mixture models or kernel density estimators to the intermediate feature representations corresponding to the samples from the subset of samples in the partition, wherein the fitting comprises computing a secondary partition at each layer on the sample query based on the clustering computed for the sample query to obtain parameters for the mixture models; and computing the clustering metric based on the measure of the clustering over each layer of the machine-learning model and the mixture models.
- 11 . The computer-implemented method of claim 9 , wherein the modified Mixup metric comprises: computing an agreement in classification between the sample query and the subset of the training data corresponding to the prediction for the sample query; when the agreement is indicative that the prediction is incorrect, the sample query is mixed with a subset of the input samples from the training data that correspond to the incorrect prediction; and when the agreement is indicative that the prediction is correct, using the agreement as the modified Mixup metric.
- 12 . The computer-implemented method of claim 9 , wherein the computing the confidence metric comprises: extracting the intermediate feature representations for the sample query; computing a partition of the subset of samples obtained from training data by k-means or similar clustering of an output domain of each layer of the machine-learning model; computing the clustering for the sample query based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; fitting mixture models or kernel density estimators to the intermediate feature representations corresponding to the samples from the subset of samples in the partition, wherein the fitting comprises computing a secondary partition at each layer on the sample query based on the clustering computed for the sample query to obtain parameters for the mixture models; and computing the confidence metric based on the measure of the clustering over each layer of the machine-learning model and the mixture models.
- 13 . The computer-implemented method of claim 9 , further comprising predicting the model generalization for the machine learning model at the level of the sample query based on the one or more metrics.
- 14 . A system comprising: one or more processors; and a memory coupled to the one or more processors, the memory storing a plurality of instructions executable by the one or more processors, the plurality of instructions comprising instructions that when executed by the one or more processors cause the one or more processors to perform operations comprising: obtaining testing data without ground truth labels; inputting the testing data into a machine-learning model; generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the test data; obtaining one or more metrics for generalization of the machine-learning model on the test data, where the obtaining comprises: (i) computing metrics using intermediate feature representations extracted from the nodes of the computational graph or abstract syntax tree (AST) of the machine learning model, (ii) computing metrics using the output prediction in conjunction with the input data or intermediate features, (iii) computing metrics using the model weights and architectures and its interactions with the input data, intermediate features, or output predictions, or (iv) any combination of (i-iii); and outputting a prediction of model generalization for the machine learning model at the population-level based on the one or more metrics.
- 15 . A system comprising: one or more processors; and a memory coupled to the one or more processors, the memory storing a plurality of instructions executable by the one or more processors, the plurality of instructions comprising instructions that when executed by the one or more processors cause the one or more processors to perform operations comprising: obtaining testing data without ground truth labels; inputting the testing data into a machine-learning model; generating, using the machine-learning model comprising model parameters learned for a particular task, a prediction associated with the task based on the test data; obtaining one or more metrics for generalization of the machine-learning model on the test data, wherein the obtaining comprises executing various neural trace analysis algorithms that capture, summarize, or derive intelligence from intermediate feature representations at nodes of the machine-learning model's computational graph or abstract syntax tree (AST) during program execution in response to input data, including: (i) computing a clustering metric based on clustering of the intermediate feature representations within input samples of the testing data, wherein the clustering is calculated based on a subset of samples obtained from training data with ground truth labels; (ii) computing a modified Mixup metric based on convex combinations of each of the input samples from the testing data and a subset of samples obtained from the training data with ground truth labels corresponding to the prediction for each of the input samples; (iii) computing a confidence metric based on clustering of the intermediate feature representations within the input samples of the testing data, wherein the clustering is calculated based on the input samples from the testing data; (iv) computing a roughness metric using dimensionality-reduction of the intermediate features to define a smoothness criteria for the training data and the testing data; or (v) any combination of (i)-(iv); and outputting a prediction of model generalization for the machine learning model at a level of the testing data based on the one or more metrics.
- 16 . The system of claim 15 , wherein the computing the clustering metric comprises: extracting the intermediate feature representations for each input sample in the testing data; performing, using the subset of samples and principal component analysis, the dimensionality-reduction of the intermediate features to predetermined dimensions at each layer of the machine-learning model, wherein the ground truth labels of the training data indicate membership to one or more classes; computing the clustering for the testing data based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; fitting mixture models or kernel density estimators to the intermediate feature representations corresponding to the samples from the subset of samples in the partition, wherein the fitting comprises computing a secondary partition at each layer on the input samples in the testing data based on the clustering computed for the testing data to obtain parameters for the mixture models; and computing the clustering metric based on the measure of the clustering over each layer of the machine-learning model and the mixture models.
- 17 . The system of claim 15 , wherein the modified Mixup metric comprises: computing an agreement in classification between each of the input samples from the testing data and the subset of the training data corresponding to the prediction for each of the input samples; when the agreement is indicative that the prediction is incorrect, the input sample is mixed with a subset of the input samples from the testing data that correspond to the incorrect prediction; and when the agreement is indicative that the prediction is correct, using the agreement as the modified Mixup metric.
- 18 . The system of claim 15 , wherein the computing the confidence metric comprises: extracting the intermediate feature representations for each input sample in the testing data; computing a partition of the subset of samples obtained from training data by k-means or similar clustering of an output domain of each layer of the machine-learning model, wherein the ground truth labels of the training data indicate membership to one or more classes; computing the clustering for the testing data based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; fitting mixture models or kernel density estimators to the intermediate feature representations corresponding to the samples from the subset of samples in the partition, wherein the fitting comprises computing a secondary partition at each layer on the input samples in the testing data based on the clustering computed for the testing data to obtain parameters for the mixture models; and computing the confidence metric based on the measure of the clustering over each layer of the machine-learning model and the mixture models.
- 19 . The system of claim 15 , wherein the computing the roughness metric comprises: extracting the intermediate feature representations for each input sample in the testing data; computing a partition of the subset of samples obtained from training data by k-means or similar clustering of an output domain of each layer of the machine-learning model, wherein the ground truth labels of the training data indicate membership to one or more classes; computing the clustering for the testing data based on a partition of the subset of samples, wherein the membership is used as the partition to compute and average the clustering metric to measure the clustering over each layer of the machine-learning model; and computing the roughness metric based on the measure of the clustering over each layer of the machine-learning model.
- 20 . The system of claim 15 , wherein the operations further comprise predicting the model generalization for the machine learning model at the level of the testing data based on the one or more metrics.
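The claims above (in particular claims 2-3 and 15-16) describe computing a clustering metric from intermediate feature representations ("neural traces") captured at nodes of the model's computational graph. The following is a minimal sketch of that idea, assuming a PyTorch classifier, a small labeled subset of the training data, and scikit-learn for PCA and mixture-model fitting; the layer names, reduced dimensionality, and single-component mixtures are illustrative assumptions rather than the patented procedure.

```python
import numpy as np
import torch
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture


def collect_traces(model, x, layer_names):
    """Capture intermediate feature representations ("neural traces") at named
    layers of the computational graph using forward hooks."""
    traces, handles = {}, []
    modules = dict(model.named_modules())
    for name in layer_names:
        def hook(_module, _inputs, output, key=name):
            traces[key] = output.detach().flatten(1).cpu().numpy()
        handles.append(modules[name].register_forward_hook(hook))
    model.eval()
    with torch.no_grad():
        preds = model(x).argmax(dim=1).cpu().numpy()
    for h in handles:
        h.remove()
    return traces, preds


def clustering_metric(model, x_train, y_train, x_test, layer_names, n_dims=8):
    """Per-sample clustering metric: average over layers of the log-likelihood of
    each PCA-reduced test trace under class-conditional mixtures fit on the
    labeled training subset (higher = better support for the inference)."""
    y_train = np.asarray(y_train)
    train_traces, _ = collect_traces(model, x_train, layer_names)
    test_traces, _ = collect_traces(model, x_test, layer_names)
    scores = np.zeros(len(x_test))
    for name in layer_names:
        pca = PCA(n_components=n_dims).fit(train_traces[name])
        z_train = pca.transform(train_traces[name])
        z_test = pca.transform(test_traces[name])
        # Partition the training traces by ground-truth class label and fit one
        # mixture model per class (a single Gaussian component here for brevity).
        mixtures = [GaussianMixture(n_components=1).fit(z_train[y_train == c])
                    for c in np.unique(y_train)]
        # Score each test trace against its best-supported class mixture.
        scores += np.max([m.score_samples(z_test) for m in mixtures], axis=0)
    return scores / len(layer_names)
```

Under these assumptions, higher scores indicate that the test traces fall in well-supported regions of the training feature distribution at each layer; each per-sample score can serve as a sample-level generalization prediction, and averaging over the testing data gives a population-level summary.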
Description
CROSS-REFERENCES TO RELATED APPLICATIONS

The present application claims benefit and priority to U.S. Provisional Application No. 63/414,045, filed on Oct. 7, 2022, the entire contents of which are incorporated herein by reference for all purposes.

STATEMENT OF GOVERNMENT SUPPORT

The invention was made with government support under F32EB030411 awarded by the National Institutes of Health and National Institute of Biomedical Imaging and Bioengineering; and under R01CA229354 awarded by the National Institutes of Health and National Cancer Institute. The government has certain rights in the invention.

FIELD

The present disclosure relates to machine-learning model generalization (i.e., inference success), and in particular to analysis techniques to determine at run time (i.e., the inference phase) whether a machine-learning model, such as a deep neural network (DNN) or a convolutional neural network (CNN), is applicable to (or will work correctly on) a new set of input data.

BACKGROUND

A central goal of machine learning is to have a predictive model generalize to previously unseen data. In this respect, deep learning has enjoyed exceptional success on a wide variety of inference tasks (recognition, interpolation, extrapolation) and data types (images, video, text, graphs), using both supervised and unsupervised (or self-supervised) training. Yet, despite numerous attempts from statistical learning and approximation theorists, the secret to model generalization is still not well understood. Understanding the key properties of model generalization, even from an empirical standpoint, is especially important as deep learning makes its debut in critical human-facing applications, such as transportation, medical imaging, and computer-aided diagnosis, where human-in-the-loop operation is not always possible or the goal is to exceed human capabilities.

In the context of image recognition tasks, the generalization performance of a predictive model is typically evaluated using p-norms on a set of previously unseen images, colloquially called the test set. Although annotated datasets such as CIFAR, STL, and SVHN have relatively large test sets, there have been numerous examples where models with strong performance on these sets fail to generalize to similarly sampled data. This is classically understandable, as it is often intractable to obtain sufficient statistics on high-dimensional input domains (e.g., leading to an abundance of both natural and orchestrated adversarial attacks in the vicinity of the data), but it does not help to explain the uncanny generalization performance of deep networks on particular sets of previously unseen data. Moreover, statistical characterizations of performance (e.g., test-set “accuracy”) are not sufficient for building trust in model predictions, especially when the model inputs or features are not easily interpretable or verifiable by a user. Unlike low-dimensional models or models based on physics, conventional deep neural networks are unable to identify when they have made a mistake. This deficiency can again be traced to the high-dimensional nature of modern data-defined tasks like image recognition, where training data is typically so sparse that it is impossible to know when or where there is support for an inference using classical distance functions (e.g., mesh norm) without being overly pessimistic.
This is problematic, as most existing statistical and function-approximation frameworks only apply in the vicinity or in the limit of data, or under equivalent assumptions on the target function.

SUMMARY

Analysis techniques are disclosed herein to determine at run time whether a machine-learning model is applicable to a new set of input data. The techniques focus on analyzing the interior nodes of a machine-learning model (e.g., a DNN) using a combination of algorithm parameters (e.g., DNN weights), historical data with annotations (e.g., training data), and historical and run-time data without annotations. Feedback obtained from the analysis can be provided to a user (e.g., a clinician or operator) concerning whether the machine-learning model worked, whether it can be trusted, why it worked or did not work, and why it can or cannot be trusted. The techniques are applicable to any machine-learning system that accepts input data. Since the purpose of the techniques is to identify prediction successes and failures at run time without human supervision, it is envisioned that such techniques could be integrated into the field of machine-learning systems, particularly in applications where human supervision is either undesirable or infeasible. Specific examples within radiology include not only computer-aided diagnosis or classification, but also machine-learning based image reconstruction. The techniques also have applications in many other industries, such as self-driving cars, or any automated analysis or AI-assisted prediction software, such as for person identification from photos or fraud detection.
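As one concrete illustration of the per-query, run-time analysis summarized above, the following sketch computes a "modified Mixup"-style agreement score for a single unlabeled query, in the spirit of claims 9 and 11, assuming a PyTorch classifier and a labeled reference subset drawn from the training data; the single mixing coefficient `lam` and the helper name are illustrative assumptions, not values or terminology taken from the disclosure.

```python
import torch


def modified_mixup_agreement(model, query, ref_x, ref_y, lam=0.5):
    """Mix an unlabeled query with labeled reference samples of the predicted class
    (convex combinations) and report how often the model keeps the same prediction;
    agreement near 1.0 suggests the prediction is well supported."""
    model.eval()
    with torch.no_grad():
        pred = model(query.unsqueeze(0)).argmax(dim=1).item()
        # Reference samples whose ground-truth label matches the model's prediction.
        support = ref_x[ref_y == pred]
        if support.numel() == 0:
            return 0.0, pred  # no reference support for this class
        # Convex combinations of the query with each supporting reference sample.
        mixed = lam * query.unsqueeze(0) + (1.0 - lam) * support
        mixed_preds = model(mixed).argmax(dim=1)
        agreement = (mixed_preds == pred).float().mean().item()
    return agreement, pred
```

At run time, such an agreement score, together with the other metrics, can be thresholded or combined to flag a query as likely inside or outside the model's generalization envelope, which is the kind of feedback the disclosure describes providing to the user.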