US-12626189-B2 - Predicting a diagnostic test result from patient laboratory testing history
Abstract
The present disclosure relates to techniques for preprocessing samples and using the preprocessed samples and machine learning models to predict clinical diagnostic test results for a patient from the patient's historical laboratory testing data. Particularly, aspects are directed to obtaining datasets including features and/or historical laboratory test results for subjects, filtering the datasets based on a denoise-balance scheme to obtain filtered datasets, training a machine learning model using the filtered datasets to obtain a trained machine learning model, and providing the trained machine learning model. A candidate machine learning model may be an ensemble of classifiers implemented with a boosting algorithm, and the ensemble is trained by applying base machine learning algorithms on different distributions of the filtered datasets. The ensemble is then combined into a machine learning model having a set of learned model parameters for predicting results for clinical diagnostic tests.
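By way of non-limiting illustration, the boosting idea summarized above can be sketched in a few lines of Python. The sketch assumes an AdaBoost-style algorithm with one-feature threshold rules ("stumps") as the base learners; the disclosure names only "a boosting algorithm", so this choice, the function names, the +1/-1 label encoding, and the toy values are all assumptions made for the example. Each round fits a base learner on a reweighted distribution of the same training data, and the weighted vote of the learners forms the combined model.

```python
import math

# Assumed label encoding: +1 = tested abnormal, -1 = tested normal.

def fit_stump(X, y, w):
    """Find the one-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (1, -1):
                err = sum(wi for wi, row, t in zip(w, X, y)
                          if (pol if row[j] > thr else -pol) != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, pol)
    return best  # (weighted error, feature index, threshold, polarity)

def stump_predict(stump, row):
    _, j, thr, pol = stump
    return pol if row[j] > thr else -pol

def train_boosted_ensemble(X, y, rounds=5):
    """AdaBoost-style training: each round sees a reweighted distribution."""
    n = len(X)
    w = [1.0 / n] * n  # start from a uniform distribution over subjects
    ensemble = []
    for _ in range(rounds):
        stump = fit_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)  # clamp to avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # vote weight of this learner
        ensemble.append((alpha, stump))
        # Raise the weight of misclassified subjects, lower the rest, renormalize.
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Combine the base learners by a weighted vote."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Made-up feature vectors (two hypothetical lab values per subject).
X = [[1.0, 5.0], [2.0, 4.0], [3.0, 7.0], [6.0, 2.0], [7.0, 3.0], [8.0, 1.0]]
y = [-1, -1, -1, 1, 1, 1]
model = train_boosted_ensemble(X, y)
print([predict(model, row) for row in X])
```

The learned parameters in this sketch are the vote weights and the stump thresholds. A production system would more likely rely on an established gradient-boosting library than on hand-written stumps.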
Inventors
- Walter Joseph Jessen
- Stanley Ian Letovsky
Assignees
- LABORATORY CORPORATION OF AMERICA HOLDINGS
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2022-11-14
Claims (17)
- 1. A computer-implemented method comprising: obtaining datasets for subjects, wherein each of the datasets comprises subject features, wherein the subject features comprise an index and historical laboratory test results corresponding to test codes; filtering the datasets based on a denoise-balance scheme to obtain a plurality of subsets of filtered datasets, wherein the denoise-balance scheme comprises at least one of an index filter, a test code filter, a feature filter, and a balance filter, and wherein the filtered datasets comprise filtered features; training a machine learning model using a first subset of the plurality of subsets, wherein the first subset of the filtered datasets comprises: (i) a set of outcome predictor datasets including historical laboratory test results for subjects that tested abnormal for a clinical diagnostic test, and (ii) a set of control datasets including historical laboratory test results for subjects that tested normal for the clinical diagnostic test; validating the machine learning model using a second subset of the plurality of subsets; adjusting the machine learning model by repeating the training and the validating until a predetermined condition is satisfied; in response to the adjusting, obtaining a set of model parameters; and providing the trained machine learning model having the set of model parameters.
- 2. The computer-implemented method of claim 1, wherein the filtering the datasets comprises: denoising the subject features based on the index filter, wherein the denoising comprises removing a first set of the datasets from the datasets; denoising the subject features based on the test code filter, wherein the denoising comprises removing a second set of the datasets from the datasets; obtaining at least two subsets of the datasets based on a historical laboratory test result corresponding to a predetermined test code, wherein the historical laboratory test is in the subject features; calculating a feature number for each subject feature in each subset based on the feature filter; sorting a total feature number, wherein the total feature number is a sum of at least two of the feature numbers for the at least two subsets; denoising the subject features based on the feature filter, wherein the denoising comprises removing a third set of the datasets from the datasets; and balancing the feature numbers based on the balance filter, wherein a ratio of the at least two feature numbers is in a predetermined range, wherein the balancing comprises removing a fourth set of the datasets from the datasets.
- 3. The computer-implemented method of claim 2, wherein the obtaining the at least two subsets of the datasets comprises removing datasets from at least one of the at least two subsets, and wherein the calculating the feature number comprises removing datasets from the at least two subsets to maintain a predetermined ratio scope between the at least two subsets.
- 4. The computer-implemented method of claim 2, wherein the predetermined test code is (i) a Non-Alcoholic Steatohepatitis (NASH) fibrosis score test code, (ii) an albumin/creatinine ratio (ACR) test code, or (iii) an estimated glomerular filtration rate (eGFR) test code.
- 5. The computer-implemented method of claim 1, wherein the training the machine learning model further comprises: training an ensemble of classifiers implemented with a boosting algorithm on the first subset by applying base machine learning algorithms on different distributions of the first subset, wherein the training causes the ensemble of classifiers to learn a function that maps a training input space derived from the first subset to a target output space such that the function is an accurate predictor for the target output space, wherein the target output space is a result of the clinical diagnostic test, and wherein the function is learned by finding the set of model parameters that minimize a cost function that measures a difference between ground truth values for the subjects that tested abnormal or normal for the clinical diagnostic test and predicted results of the clinical diagnostic test; and combining the ensemble of classifiers into the trained machine learning model having the set of model parameters for predicting the result for the clinical diagnostic test.
- 6. The computer-implemented method of claim 1, further comprising: obtaining a dataset for a subject, wherein the dataset comprises an index and historical laboratory test results corresponding to test codes; inputting the dataset into the trained machine learning model; predicting, using the trained machine learning model, a result for a clinical diagnostic test; and outputting, using the trained machine learning model, a classification of the clinical diagnostic test based on the result for the clinical diagnostic test.
- 7. A computer-program product tangibly embodied in a non-transitory machine-readable medium, including instructions configured to cause one or more data processors to perform: obtaining datasets for subjects, wherein each of the datasets comprises subject features, wherein the subject features comprise an index and historical laboratory test results corresponding to test codes; filtering the datasets based on a denoise-balance scheme to obtain a plurality of subsets of filtered datasets, wherein the denoise-balance scheme comprises at least one of an index filter, a test code filter, a feature filter, and a balance filter, and wherein the filtered datasets comprise filtered features; training a machine learning model using a first subset of the plurality of subsets, wherein the first subset of the filtered datasets comprises: (i) a set of outcome predictor datasets including historical laboratory test results for subjects that tested abnormal for a clinical diagnostic test, and (ii) a set of control datasets including historical laboratory test results for subjects that tested normal for the clinical diagnostic test; validating the machine learning model using a second subset of the plurality of subsets; adjusting the machine learning model by repeating the training and the validating until a predetermined condition is satisfied; in response to the adjusting, obtaining a set of model parameters; and providing the trained machine learning model having the set of model parameters.
- 8. The computer-program product of claim 7, wherein the filtering the datasets comprises: denoising the subject features based on the index filter, wherein the denoising comprises removing a first set of the datasets from the datasets; denoising the subject features based on the test code filter, wherein the denoising comprises removing a second set of the datasets from the datasets; obtaining at least two subsets of the datasets based on a historical laboratory test result corresponding to a predetermined test code, wherein the historical laboratory test is in the subject features; calculating a feature number for each subject feature in each subset based on the feature filter; sorting a total feature number, wherein the total feature number is a sum of at least two of the feature numbers for the at least two subsets; denoising the subject features based on the feature filter, wherein the denoising comprises removing a third set of the datasets from the datasets; and balancing the feature numbers based on the balance filter, wherein a ratio of the at least two feature numbers is in a predetermined range, wherein the balancing comprises removing a fourth set of the datasets from the datasets.
- 9. The computer-program product of claim 8, wherein the obtaining the at least two subsets of the datasets comprises removing datasets from at least one of the at least two subsets, and wherein the calculating the feature number comprises removing datasets from the at least two subsets to maintain a predetermined ratio scope between the at least two subsets.
- 10. The computer-program product of claim 8, wherein the predetermined test code is (i) a Non-Alcoholic Steatohepatitis (NASH) fibrosis score test code, (ii) an albumin/creatinine ratio (ACR) test code, or (iii) an estimated glomerular filtration rate (eGFR) test code.
- 11. The computer-program product of claim 7, wherein the training the machine learning model further comprises: training an ensemble of classifiers implemented with a boosting algorithm on the first subset by applying base machine learning algorithms on different distributions of the first subset, wherein the training causes the ensemble of classifiers to learn a function that maps a training input space derived from the first subset to a target output space such that the function is an accurate predictor for the target output space, wherein the target output space is a result of the clinical diagnostic test, and wherein the function is learned by finding the set of model parameters that minimize a cost function that measures a difference between ground truth values for the subjects that tested abnormal or normal for the clinical diagnostic test and predicted results of the clinical diagnostic test; and combining the ensemble of classifiers into the trained machine learning model having the set of model parameters for predicting the result for the clinical diagnostic test.
- 12. The computer-program product of claim 7, wherein the one or more data processors are caused to further perform: obtaining a dataset for a subject, wherein the dataset comprises an index and historical laboratory test results corresponding to test codes; inputting the dataset into the trained machine learning model; predicting, using the trained machine learning model, a result for a clinical diagnostic test; and outputting, using the trained machine learning model, a classification of the clinical diagnostic test based on the result for the clinical diagnostic test.
- 13. A system comprising: one or more data processors; and a non-transitory computer readable medium storing instructions which, when executed on the one or more data processors, cause the one or more data processors to perform: obtaining datasets for subjects, wherein each of the datasets comprises subject features, wherein the subject features comprise an index and historical laboratory test results corresponding to test codes; filtering the datasets based on a denoise-balance scheme to obtain a plurality of subsets of filtered datasets, wherein the denoise-balance scheme comprises at least one of an index filter, a test code filter, a feature filter, and a balance filter, and wherein the filtered datasets comprise filtered features; training a machine learning model using a first subset of the plurality of subsets, wherein the first subset of the filtered datasets comprises: (i) a set of outcome predictor datasets including historical laboratory test results for subjects that tested abnormal for a clinical diagnostic test, and (ii) a set of control datasets including historical laboratory test results for subjects that tested normal for the clinical diagnostic test; validating the machine learning model using a second subset of the plurality of subsets; adjusting the machine learning model by repeating the training and the validating until a predetermined condition is satisfied; in response to the adjusting, obtaining a set of model parameters; and providing the trained machine learning model having the set of model parameters.
- 14. The system of claim 13, wherein the filtering the datasets comprises: denoising the subject features based on the index filter, wherein the denoising comprises removing a first set of the datasets from the datasets; denoising the subject features based on the test code filter, wherein the denoising comprises removing a second set of the datasets from the datasets; obtaining at least two subsets of the datasets based on a historical laboratory test result corresponding to a predetermined test code, wherein the historical laboratory test is in the subject features; calculating a feature number for each subject feature in each subset based on the feature filter; sorting a total feature number, wherein the total feature number is a sum of at least two of the feature numbers for the at least two subsets; denoising the subject features based on the feature filter, wherein the denoising comprises removing a third set of the datasets from the datasets; and balancing the feature numbers based on the balance filter, wherein a ratio of the at least two feature numbers is in a predetermined range, wherein the balancing comprises removing a fourth set of the datasets from the datasets.
- 15. The system of claim 14, wherein the obtaining the at least two subsets of the datasets comprises removing datasets from at least one of the at least two subsets, wherein the calculating the feature number comprises removing datasets from the at least two subsets to maintain a predetermined ratio scope between the at least two subsets, and wherein the predetermined test code is (i) a Non-Alcoholic Steatohepatitis (NASH) fibrosis score test code, (ii) an albumin/creatinine ratio (ACR) test code, or (iii) an estimated glomerular filtration rate (eGFR) test code.
- 16. The system of claim 13, wherein the training the machine learning model further comprises: training an ensemble of classifiers implemented with a boosting algorithm on the first subset by applying base machine learning algorithms on different distributions of the first subset, wherein the training causes the ensemble of classifiers to learn a function that maps a training input space derived from the first subset to a target output space such that the function is an accurate predictor for the target output space, wherein the target output space is a result of the clinical diagnostic test, and wherein the function is learned by finding the set of model parameters that minimize a cost function that measures a difference between ground truth values for the subjects that tested abnormal or normal for the clinical diagnostic test and predicted results of the clinical diagnostic test; and combining the ensemble of classifiers into the trained machine learning model having the set of model parameters for predicting the result for the clinical diagnostic test.
- 17. The system of claim 13, wherein the one or more data processors are caused to further perform: obtaining a dataset for a subject, wherein the dataset comprises an index and historical laboratory test results corresponding to test codes; inputting the dataset into the trained machine learning model; predicting, using the trained machine learning model, a result for a clinical diagnostic test; and outputting, using the trained machine learning model, a classification of the clinical diagnostic test based on the result for the clinical diagnostic test.
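By way of non-limiting illustration, one possible reading of the denoise-balance scheme recited in the claims above is sketched below in Python. The claims do not define the individual filters in detail, so every rule here is an assumption made for the example: the index filter drops records with a missing or repeated index, the test code filter drops records lacking the predetermined test code, the "feature number" is taken to be the count of records in a subset that contain a given test code, and the balance filter keeps features whose abnormal-to-normal count ratio lies within 1:5 to 5:1. The record layout, function name, test codes, and the abnormality threshold are likewise hypothetical.

```python
from collections import Counter

def denoise_balance(records, target_code, is_abnormal,
                    max_features=150, ratio_range=(1 / 5, 5)):
    # Index filter (assumed rule): drop missing or repeated subject indexes.
    seen, kept = set(), []
    for r in records:
        idx = r.get("index")
        if idx is None or idx in seen:
            continue
        seen.add(idx)
        kept.append(r)

    # Test code filter (assumed rule): require a result for the target code.
    kept = [r for r in kept if target_code in r["results"]]

    # Split into two subsets by the result for the predetermined test code.
    abnormal = [r for r in kept if is_abnormal(r["results"][target_code])]
    normal = [r for r in kept if not is_abnormal(r["results"][target_code])]

    # Feature number (assumed meaning): records per subset containing a code.
    def feature_numbers(subset):
        counts = Counter()
        for r in subset:
            counts.update(c for c in r["results"] if c != target_code)
        return counts

    n_abn, n_norm = feature_numbers(abnormal), feature_numbers(normal)

    # Feature filter: sort by total feature number, keep the top max_features.
    total = n_abn + n_norm
    ranked = sorted(total, key=lambda c: (-total[c], c))[:max_features]

    # Balance filter: keep features whose count ratio is inside the range.
    lo, hi = ratio_range
    features = [c for c in ranked
                if n_norm[c] > 0 and lo <= n_abn[c] / n_norm[c] <= hi]
    return abnormal, normal, features

# Made-up records with placeholder test codes.
records = [
    {"index": 1, "results": {"EGFR": 50, "ALB": 4.0, "CREA": 1.4}},
    {"index": 2, "results": {"EGFR": 95, "ALB": 4.2}},
    {"index": 2, "results": {"EGFR": 90, "ALB": 4.1}},   # repeated index
    {"index": 3, "results": {"ALB": 3.9}},               # no target code
    {"index": None, "results": {"EGFR": 40}},            # missing index
    {"index": 4, "results": {"EGFR": 55, "ALB": 3.5, "CREA": 1.6}},
    {"index": 5, "results": {"EGFR": 100, "ALB": 4.4, "GLU": 90}},
]
abnormal, normal, features = denoise_balance(
    records, "EGFR", is_abnormal=lambda value: value < 60)
print(len(abnormal), len(normal), features)
```

In this sketch the balance step removes features rather than whole datasets; the claims describe balancing as removing datasets, so an implementation following the claims more closely might instead subsample records until the ratio falls inside the range.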
Description
CROSS REFERENCE TO RELATED APPLICATIONS
The present application claims priority and benefit from U.S. Provisional Application No. 63/298,925, filed Jan. 12, 2022, and U.S. Provisional Application No. 63/278,342, filed Nov. 11, 2021, the entire contents of which are incorporated herein by reference for all purposes.
FIELD
The present disclosure relates to clinical testing, and in particular to techniques for preparing samples and using machine learning models to predict clinical diagnostic test results for a patient from their historical laboratory test results.
BACKGROUND
Clinical laboratories are healthcare facilities providing a wide range of laboratory procedures which aid clinicians in carrying out the diagnosis, treatment, and management of patients. Clinical laboratories report most laboratory test results as individual numerical or categorical values. However, individual test results, viewed in isolation, are typically of limited diagnostic value. To adequately use test results for patient diagnosis and management, clinicians usually must integrate many individual test results from a patient and interpret them in the context of clinical data and medical knowledge, judgment, and experience.
While this manual approach to test result interpretation is the current standard in most cases, computational approaches to laboratory data integration and analysis offer tremendous potential to enhance diagnostic value. In particular, many patients will have hundreds or thousands of these individual test results, often spanning years. As a consequence, many clinicians can easily overlook key results or important patterns and trends within sets of laboratory data. Furthermore, important diagnostic information may sometimes be contained within patterns across numerous data elements that may be too subtle or complex to identify without the aid of computational approaches. In addition, because the human brain faces great challenges in simultaneously considering a large number of data points, even the most experienced clinicians may be unable to extract all the useful information from existing clinical and laboratory data.
SUMMARY
In various embodiments, a computer-implemented method is provided that comprises: obtaining datasets for subjects, wherein each of the datasets comprises subject features, wherein the subject features comprise an index and historical laboratory test results corresponding to test codes; filtering the datasets based on a denoise-balance scheme to obtain filtered datasets, wherein the denoise-balance scheme comprises an index filter, a test code filter, a feature filter, and a balance filter, and wherein the filtered datasets comprise filtered features; training a machine learning model using the filtered datasets to obtain a trained machine learning model; and providing the trained machine learning model.
In some embodiments, the filtering the datasets comprises: denoising the subject features based on the index filter, wherein the denoising comprises removing a first set of the datasets from the datasets; denoising the subject features based on the test code filter, wherein the denoising comprises removing a second set of the datasets from the datasets; obtaining at least two subsets of the datasets based on a historical laboratory test result corresponding to a predetermined test code, wherein the historical laboratory test is in the subject features; calculating a feature number for each subject feature in each subset based on the feature filter; sorting a total feature number, wherein the total feature number is a sum of at least two of the feature numbers for the at least two subsets; denoising the subject features based on the feature filter, wherein the denoising comprises removing a third set of the datasets from the datasets; and balancing the feature numbers based on the balance filter, wherein a ratio of the at least two feature numbers is in a predetermined range, wherein the balancing comprises removing a fourth set of the datasets from the datasets.
In some embodiments, the obtaining the at least two subsets of the datasets comprises removing datasets from at least one of the at least two subsets. In some embodiments, the calculating the feature number comprises removing datasets from the at least two subsets to maintain a predetermined ratio scope between the at least two subsets. In some embodiments, the predetermined test code is a Non-Alcoholic Steatohepatitis (NASH) fibrosis score test code. In some embodiments, the predetermined test code is albumin/creatinine ratio (ACR). In some embodiments, the predetermined test code is estimated glomerular filtration rate (eGFR). In some embodiments, the predetermined ratio scope is about [1:5, 5:1]. In some embodiments, the feature filter determines a number of the filtered features for the machine learning model. In some embodiments, the number of the filtered features is equal to or less than 150. In some embodiments, the training t