EP-4742268-A2 - A METHOD OF GENERATING A STATISTICAL MODEL TO DETECT, OR PREDICT THE RISK OF, AN OUTCOME
Abstract
A system and method of generating a model M that can detect or predict the predisposition to an outcome, for example a person's predisposition to developing a health condition such as preeclampsia, with a positive and/or negative predictive value better than or equal to a predefined predictive value target, where the method factors in the impact of prevalence on test performance. The system and method employ an iterative population segregation methodology involving at least two population segregation steps, in which each step, independently, employs a probability-defined model. The first segregation step is selected from one of a rule-in probability-defined model and a rule-out probability-defined model, and the second segregation step is selected from the other of a rule-in probability-defined model and a rule-out probability-defined model. The present invention overcomes the technical limitations of existing predictive models and provides a superior predictive model that delivers an accurate risk prediction result in a less computationally intensive way. By isolating and segmenting particular population subsets according to the invention, a much more robust and accurate way of detecting, or predicting the risk of, an outcome is achieved.
Inventors
- TUYTTEN, ROBIN
- THOMAS, GREGOIRE
Assignees
- Metabolomic Diagnostics Limited
Dates
- Publication Date: 20260513
- Application Date: 20190211
Claims (15)
- A computer implemented method of generating a model M to detect or predict the risk of a health condition in a subject, the method comprising the steps of: providing a population of test subjects and measurement data for a plurality (n) of variables for each of the test subjects selected from biometric, life-style and/or physiological characteristics; providing a first model M1 configured to predict the presence or risk of the health condition in the population of test subjects comprising 1 to n variables, wherein the model M1 is configured to segregate the population of test subjects into first generation population subsets A and B in which the subjects in the first generation population subset A have a probability selected from one of: a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability; or a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability; characterised by the steps of: segmenting the population of subjects based on the first model M1 into first generation population subsets A and B; when the first generation population subset B comprises a sufficient number of subjects, providing a second model M2 configured to predict the presence or risk of the health condition in the first generation population subset B, the model comprising 1 to n variables, wherein the model M2 is configured to segregate the first generation population subset B into second generation population subsets A and B in which the subjects in the second generation population subset A have a probability selected from the other of: a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability; or a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability; wherein, when one of the population subsets does not comprise a sufficient number of subjects, the previous 
segregation step is repeated using an alternative model configured to generate a population subset with a sufficient number of subjects; segmenting the first generation population subset B based on the second model M2 into second generation population subset A and B; and wherein the model M comprises the first model M1 and the second model M2; and outputting a value indicative of a detection, or a prediction of risk, of the health condition in the subject based on said model M; and wherein the health condition is selected from: a hypertensive disorder, a cardiovascular disease, a proliferative disorder, an inflammatory disease, an autoimmune disease, a metabolic disorder, a neurological disease, a hepatic disorder, or a pulmonary disease, and excluding a disorder of pregnancy.
- A computer implemented method according to Claim 1 in which the model M2 is configured to segregate the first generation population subset B into second generation population subsets A and B in which the subjects in the second generation population subset A have a probability selected from the other of: a probability to have/develop the health condition lower than or equal to a predefined NPV cut-off; or a probability to have/develop the health condition higher than or equal to a predefined PPV cut-off.
- A computer implemented method according to Claim 1 in which the model M1 is configured to segregate the population of test subjects into first generation population subsets A and B in which the subjects in the first generation population subset A have a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability; wherein the model M2 is configured to segregate the first generation population subset B into second generation population subsets A and B in which the subjects in the second generation population subset A have a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability, which is optionally a PPV cut-off.
- A computer implemented method according to Claim 1 in which the model M1 is configured to segregate the population of test subjects into first generation population subsets A and B in which the subjects in the first generation population subset A have a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability; wherein the model M2 is configured to segregate the first generation population subset B into second generation population subsets A and B in which the subjects in the second generation population subset A have a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability, which is optionally an NPV cut-off.
- A computer implemented method according to any preceding Claim, including an additional segmentation step comprising: when the second generation population subset B comprises a sufficient number of subjects, providing a third model M3 configured to predict the presence or risk of the health condition in the second generation population subset B, the model comprising 1 to n variables, wherein the model M3 is configured to segregate the second generation population subset B into third generation population subsets A and B such that the subjects in the third generation population subset A have a probability selected from one of: a probability to have/develop the health condition lower than or equal to a predefined rule-out NPV cut-off; or a probability to have/develop the health condition higher than or equal to a predefined rule-in PPV cut-off; and segmenting the other of the second generation population subsets based on the third model M3 into third generation population subsets A and B.
- A computer implemented method according to Claim 5 in which: the first model M1 is configured to segment the population of test subjects such that the subjects in the first generation population subset A have a probability to have/develop the health condition higher than a predefined rule-in cut-off probability; the second model M2 is configured to segment the first generation population subset B such that the subjects in the second generation population subset A have a probability to have/develop the health condition lower than a predefined rule-out cut-off probability; and the third model M3 is configured to segment the second generation population subset B such that the subjects in at least one of the third generation population subsets have a probability to have/develop the health condition higher than a predefined rule-in PPV cut-off.
- A computer implemented method according to Claim 5 in which: the first model M1 is configured to segment the population of test subjects such that the subjects in the first generation population subset A have a probability to have/develop the health condition lower than a predefined rule-out cut-off probability; the second model M2 is configured to segment the first generation population subset B such that the subjects in the second generation population subset A have a probability to have/develop the health condition higher than a predefined rule-in cut-off probability; and the third model M3 is configured to segment the second generation population subset B such that the subjects in at least one of the third generation population subsets have a probability to have/develop the health condition lower than a predefined rule-out NPV cut-off.
- A computer implemented method according to Claim 6 in which: the first model M1 is configured to segment the population of test subjects such that the subjects in the first generation population subset A have a probability to have/develop the health condition higher than a predefined PPV rule-in cut-off; the second model M2 is configured to segment the first generation population subset B such that the subjects in the second generation population subset A have a probability to have/develop the health condition lower than a predefined rule-out cut-off probability; and the third model M3 is configured to segment the second generation population subset B such that the subjects in at least one of the third generation population subsets have a probability to have/develop the health condition higher than a predefined rule-in PPV cut-off.
- A computer implemented method according to Claim 4 in which: the first model M1 is configured to segment the population of test subjects such that the subjects in the first generation population subset A have a probability to have/develop the health condition lower than a predefined rule-out NPV cut-off; the second model M2 is configured to segment the first generation population subset B such that the subjects in the second generation population subset A have a probability to have/develop the health condition higher than a predefined rule-in cut-off probability; and the third model M3 is configured to segment the second generation population subset B such that the subjects in at least one of the third generation population subsets have a probability to have/develop the health condition lower than a predefined rule-out NPV cut-off.
- A computer implemented method according to Claim 1, including one or more additional segmentation steps in which each segmentation step comprises: when the nth generation population subset B comprises a sufficient number of subjects, providing a (n+1)th model Mn+1 configured to predict the presence or risk of the health condition in the nth generation population subset B, the model comprising 1 to n variables, wherein the model Mn+1 is configured to segregate the nth generation population subset B into (n+1)th generation population subset A and B such that the subjects in the (n+1)th generation population subset A have a probability selected from one of: a probability to have/develop the health condition lower than or equal to a predefined rule-out NPV cut-off; or a probability to have/develop the health condition higher than or equal to a predefined rule-in PPV cut-off; and segmenting the nth generation population subset B based on the (n+1)th multivariable model into (n+1)th generation population subsets A and B.
- A computer implemented method according to Claim 10, in which a final segregation step employs a final model configured to predict the presence or risk of the health condition in the penultimate generation population subset B, the model comprising 1 to n variables, wherein the final model is configured to segregate the penultimate generation population subset B into final generation population subset A and B such that the subjects in the final generation population subset A have a probability selected from one of: a probability to have/develop the health condition lower than or equal to a predefined rule-out NPV cut-off; or a probability to have/develop the health condition higher than or equal to a predefined rule-in PPV cut-off.
- A computer implemented method according to Claim 1, including one or more additional segmentation steps in which each segmentation step comprises: (a) when the nth generation population subset B comprises a sufficient number of subjects, generating a (n+1)th model Mn+1 comprising 1 to n variables, configured to predict the presence or risk of developing a health condition in the nth generation population subset B, wherein the model Mn+1 is configured to segregate the nth generation population subset B into (n+1)th generation population subset A and B whereby when the subjects in the (n+1)th generation population subset A are added to the combined earlier generation population rule-in subsets A, the composite rule-in population has a probability higher than or equal to a predefined rule-in PPV cut-off to have/develop the condition; or when the subjects in the (n+1)th generation population subset A are added to the combined earlier generation population rule-out subsets A, the composite rule-out population has a probability lower than or equal to a predefined rule-out NPV cut-off to have/develop the condition; and (b) segmenting the nth generation population subset B based on the (n+1)th multivariable model into (n+1)th generation population subsets A and B.
- A computer implemented method according to Claim 10, 11 or 12 in which successive segmentation steps are repeated until a population subset is generated that cannot be further segmented.
- A computer implemented system for generating a model M for detecting or predicting risk of a health condition in a subject, comprising: an input device; a processor; a memory; an output device; said processor operatively coupled to the input device, the memory and the output device, said processor configured with: a module or means for providing a population of test subjects and measurement data for a plurality (n) of variables for each of the test subjects selected from biometric, life-style and/or physiological characteristics; a module or means for providing a first model M1 configured to predict the presence or risk of the health condition in the population of test subjects comprising 1 to n variables, wherein the model M1 is configured to segregate the population of test subjects into first generation population subsets A and B in which the subjects in the first generation population subset A have a probability selected from one of: a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability; or a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability; characterised by: a module or means for segmenting the population of subjects based on the first model M1 into first generation population subsets A and B; optionally, a module or means for determining when the first generation population subset B comprises a sufficient number of subjects; a module or means for providing a second model M2 configured to predict the presence or risk of the health condition in the first generation population subset B, the model comprising 1 to n variables, wherein the model M2 is configured to segregate the first generation population subset B into second generation population subsets A and B in which the subjects in the second generation population subset A have a probability selected from the other of: a probability to have/develop the health condition lower than 
or equal to a predefined rule-out cut-off probability; or a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability; a module or means for segmenting the first generation population subset B based on the second model M2 into second generation population subset A and B; wherein the model M comprises the first model M1 and the second model M2; and means for outputting a value indicative of a detection, or a prediction of risk, of the health condition in the subject based on said model M and wherein the health condition is selected from: a hypertensive disorder, a cardiovascular disease, a proliferative disorder, an inflammatory disease, an autoimmune disease, a metabolic disorder, a neurological disease, a hepatic disorder, or a pulmonary disease, and excluding a disorder of pregnancy.
- A computer program product stored in a non-transitory storage medium, said storage medium operatively coupled to a processor and said computer program product causing the processor to generate a model M for detecting or predicting risk of a health condition in a subject, the processor configured to implement the steps of: providing a population of test subjects and measurement data for a plurality (n) of variables for each of the test subjects selected from biometric, life-style and/or physiological characteristics; providing a first model M1 configured to predict the presence or risk of the health condition in the population of test subjects comprising 1 to n variables, wherein the model M1 is configured to segregate the population of test subjects into first generation population subsets A and B in which the subjects in the first generation population subset A have a probability selected from one of: a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability; or a probability to have/develop the health condition higher than or equal to a predefined rule-in cut-off probability; segmenting the population of subjects based on the first model M1 into first generation population subsets A and B; characterised in that the first generation population subset B comprises a sufficient number of subjects, providing a second model M2 configured to predict the presence or risk of the health condition in the first generation population subset B, the model comprising 1 to n variables, wherein the model M2 is configured to segregate the first generation population subset B into second generation population subsets A and B in which the subjects in the second generation population subset A have a probability selected from the other of: a probability to have/develop the health condition lower than or equal to a predefined rule-out cut-off probability; or a probability to have/develop the health condition higher than or 
equal to a predefined rule-in cut-off probability; segmenting the first generation population subset B based on the second model M2 into second generation population subset A and B; and wherein the model M comprises the first model M1 and the second model M2; and outputting a value indicative of a detection, or a prediction of risk, of the health condition in the subject based on said model M and wherein the health condition is selected from: a hypertensive disorder, a cardiovascular disease, a proliferative disorder, an inflammatory disease, an autoimmune disease, a metabolic disorder, a neurological disease, a hepatic disorder, or a pulmonary disease, and excluding a disorder of pregnancy.
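The two-step segregation recited in the claims can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes a rule-out step followed by a rule-in step, treats each model as a hypothetical callable returning a per-subject probability, and simply stops when subset B is too small, whereas the claims call for repeating the step with an alternative model. The names `segregate`, `two_step_segregation` and `min_subjects` are illustrative only.

```python
from typing import Callable, Dict, List, Mapping, Sequence, Tuple

Subject = Mapping[str, float]        # hypothetical record: variable name -> measurement
Model = Callable[[Subject], float]   # hypothetical model: subject -> predicted probability


def segregate(population: Sequence[Subject], model: Model, cutoff: float,
              rule_out: bool) -> Tuple[List[Subject], List[Subject]]:
    """Split a population into subset A (meets the cut-off) and subset B (the rest)."""
    subset_a: List[Subject] = []
    subset_b: List[Subject] = []
    for subject in population:
        p = model(subject)
        # Rule-out: probability at or below the cut-off; rule-in: at or above it.
        meets_cutoff = p <= cutoff if rule_out else p >= cutoff
        (subset_a if meets_cutoff else subset_b).append(subject)
    return subset_a, subset_b


def two_step_segregation(population: Sequence[Subject], m1: Model, m2: Model,
                         rule_out_cutoff: float, rule_in_cutoff: float,
                         min_subjects: int = 1) -> Dict[str, List[Subject]]:
    """Apply M1 as a rule-out model, then M2 as a rule-in model on the remainder."""
    # Step 1 (assumed order): first generation subset A holds the ruled-out subjects.
    ruled_out, remainder = segregate(population, m1, rule_out_cutoff, rule_out=True)
    if len(remainder) < min_subjects:
        # Simplification: stop instead of retrying with an alternative model.
        return {"rule_out": ruled_out, "rule_in": [], "unclassified": remainder}
    # Step 2: second generation subset A holds the ruled-in subjects.
    ruled_in, unclassified = segregate(remainder, m2, rule_in_cutoff, rule_out=False)
    return {"rule_out": ruled_out, "rule_in": ruled_in, "unclassified": unclassified}
```

Swapping the order of the two steps (rule-in first, then rule-out) only requires exchanging the `rule_out` flags and cut-offs passed to `segregate`.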
Description
Field of the Invention
The present invention relates to a system and method of generating a model M to detect, or predict the risk of, an outcome. In particular, the invention relates to a method of generating a model M to detect or predict an outcome, in particular the risk of a health condition in a subject. Also contemplated are uses of the model to detect or predict the outcome, and in particular to predict the risk of a subject developing a health condition.
Background to the Invention
Classically, to identify a population at high risk of a defined outcome (such as developing a specific health condition), it is common to lock the false positive rate (FPR, 1-specificity) to a target value and then, for any given classifier, to observe at which sensitivity (detection rate of future cases) the ROC curve crosses the specificity criterion. Conversely, to identify a population at low risk, it is common to lock the false negative rate (FNR, 1-sensitivity) to a target value and then, for any given classifier, to observe at which specificity (detection rate of future non-cases) the ROC curve crosses the FNR criterion. Typically, one will first develop a prognostic model which maximizes AUROC and then establish its estimated detection rate at the set criterion. However, prognostic models with high AUROC are not necessarily the best models when the intended clinical application is either rule-in or rule-out [1]. What is more, prognostic models typically do not have high AUROC; for instance, the Framingham risk model for prediction of cardiovascular risk, one of the most used prognostic models, discriminates only reasonably in certain (sub)populations, with an area under the receiver-operating characteristic (ROC) curve of little over 0.70 [2][3].
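The classical approach described above, locking the FPR and reading off the sensitivity, can be sketched in Python as follows. This is an illustrative sketch only: the function name `sensitivity_at_fpr`, the convention that a subject tests positive when its score is at or above the threshold, and the brute-force threshold scan are assumptions made here, not part of the source.

```python
from typing import Sequence, Tuple


def sensitivity_at_fpr(scores: Sequence[float], labels: Sequence[int],
                       target_fpr: float) -> Tuple[float, float]:
    """Return (threshold, sensitivity) at the lowest threshold with FPR <= target_fpr.

    labels: 1 for a future case, 0 for a non-case.
    A subject tests positive when score >= threshold (assumed convention).
    """
    n_cases = sum(labels)
    n_non_cases = len(labels) - n_cases
    if n_cases == 0 or n_non_cases == 0:
        raise ValueError("need at least one case and one non-case")
    # Start with a threshold above every score: nobody tests positive.
    best = (float("inf"), 0.0)
    # Lowering the threshold can only raise FPR and sensitivity, so the scan
    # stops at the first threshold that violates the FPR criterion.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / n_non_cases > target_fpr:
            break
        best = (threshold, tp / n_cases)
    return best
```

The mirror-image rule-out criterion (locking the FNR and reading off the specificity) follows the same pattern with the roles of cases and non-cases exchanged.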
The statistics AUROC, sensitivity (Sn) and specificity (Sp) are considered prevalence-independent statistics, yet prevalence (or incidence, depending on the application) is important when assessing the clinical usefulness of a prognostic test. When a prognostic test is assessed or applied in its clinically relevant context, metrics like positive and negative predictive value (PPV and NPV), which take the disease prevalence (or incidence) into account, are more appropriate. Here, PPV corresponds to the fraction of patients that will actually develop the condition (TP, True Positives) within the group of all patients that have a positive test result (True Positives + False Positives (FP)); in other words, PPV is the probability that the disease is present when the test is positive. NPV corresponds to the fraction of patients that will actually not develop the condition (TN, True Negatives) within the group of all patients that have a negative test result (True Negatives + False Negatives (FN)); in other words, NPV is the probability that the disease is not present when the test is negative. Typically, a prognostic rule-in test should 1) identify a minimum proportion of the patients that will actually develop the disease and 2) ensure that this true positive group makes up a sufficiently large proportion of the patients testing positive. In other words, such prognostic tests must reach a minimum sensitivity and minimum PPV. Likewise, a prognostic rule-out test should 1) identify a minimum proportion of patients that will certainly not develop the disease and 2) ensure that of the patients testing negative, sufficiently few will develop the disease (false negatives). Such a test must therefore reach a minimum specificity and minimum negative predictive value (NPV). It is easily understood that predictive values are important determinants of the performance of a classifier, as they allow quantifying the "cost" associated with a change of clinical pathway following a prognostic test result.
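To make the prevalence dependence of the predictive values concrete, the following Python sketch derives PPV and NPV from sensitivity, specificity and prevalence using the TP, FP, TN and FN definitions given above. The function names are illustrative, and the helper `false_positives_per_true_positive` is an assumption added here as one simple way to express the "cost" implied by a given PPV.

```python
from typing import Tuple


def predictive_values(sensitivity: float, specificity: float,
                      prevalence: float) -> Tuple[float, float]:
    """Return (PPV, NPV) for a test applied at the given prevalence."""
    # Expected fractions of the whole population in each confusion-matrix cell.
    tp = sensitivity * prevalence
    fn = (1.0 - sensitivity) * prevalence
    tn = specificity * (1.0 - prevalence)
    fp = (1.0 - specificity) * (1.0 - prevalence)
    if tp + fp == 0.0 or tn + fn == 0.0:
        raise ValueError("predictive value undefined: no positive or no negative results")
    ppv = tp / (tp + fp)   # probability of disease given a positive test
    npv = tn / (tn + fn)   # probability of no disease given a negative test
    return ppv, npv


def false_positives_per_true_positive(ppv: float) -> float:
    """Number of false positives expected for each true positive at a given PPV."""
    if not 0.0 < ppv <= 1.0:
        raise ValueError("ppv must be in (0, 1]")
    return (1.0 - ppv) / ppv


# The same test (Sn = 0.80, Sp = 0.90) behaves very differently at 5% and 50% prevalence.
low_prevalence = predictive_values(0.80, 0.90, 0.05)
high_prevalence = predictive_values(0.80, 0.90, 0.50)
```

With sensitivity and specificity held fixed, the PPV falls and the NPV rises as the prevalence drops, which is why a criterion set on sensitivity or specificity alone does not fix the clinical cost of a test.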
For illustrative purposes, consider the following hypothetical scenario. If the total monetary cost (and/or health cost as a result of undesired side effects) of an available prophylactic treatment is high, a health care system might determine, based on a cost-benefit analysis, that it can support treatment of a high-risk group with a 1:5 chance of developing the condition (i.e., where minimally 1 in 5 of those treated will eventually develop the condition (and therefore warrant treatment) and maximally 4 in 5 are false positives, and hence will be needlessly offered the treatment). This criterion translates to a prognostic test which should select a high-risk group with a PPV = 0.2. In this scenario, a test which classifies 50% of future cases into a high-risk group with a PPV = 0.2 is considered better (from the health economics point of view) than a test which classifies 75% of future cases into a high-risk group with a PPV = 0.1. The latter would amount to a "cost" of treating 9 False Positives for each True Positive, which would be deemed not fit-for-purpose by a health care system where the cost of the prophylactic treatment is high. The limitations of defining rule-in or rule-out tests using eit