CN-122029607-A - Concurrent classification of cancer origin for organ types and tumor biology types
Abstract
Methods for cancer signal origin (CSO) prediction are disclosed. A CSO prediction may include the affected organ or organ group and the tumor biology. A method for training parallel CSO classifiers includes obtaining training samples derived from subjects with known cancer diagnoses, each training sample comprising methylation sequence reads corresponding to nucleic acid fragments in a biological sample collected from the respective subject, and each known cancer diagnosis including a known affected organ or organ group from a plurality of organs or organ groups and a known tumor biology from a plurality of tumor biology categories. The method includes generating, for each training sample, a feature vector based on the methylation sequence reads. The method includes generating a first training data set comprising the feature vectors of the training samples and the known organs or organ groups, and training an organ or organ group classifier with the first training data set to predict an organ or organ group from the plurality of organs or organ groups based on an input feature vector. The method includes generating a second training data set comprising the feature vectors of the training samples and the known tumor biology categories, and training a tumor biology classifier with the second training data set to predict a tumor biology from the plurality of tumor biology categories based on an input feature vector.
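The training flow summarized in the abstract can be illustrated with a minimal Python sketch. Everything specific in it is an assumption made for illustration: the `NearestCentroid` class is a toy stand-in classifier (the source does not name a model type here), `train_parallel_cso` is a hypothetical helper name, and the three-element feature vectors and their labels are invented values, not data from the disclosure. The point shown is only that both classifiers are fit on the same feature vectors but on separate label sets, so neither sees the other's labels.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Sample = Tuple[Vector, str, str]  # (feature vector, organ label, tumor biology label)


class NearestCentroid:
    """Toy stand-in classifier (an assumption, not the model used in the source)."""

    def fit(self, X: List[Vector], y: List[str]) -> "NearestCentroid":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = defaultdict(int)
        for x, label in zip(X, y):
            if label not in sums:
                sums[label] = [0.0] * len(x)
            for i, value in enumerate(x):
                sums[label][i] += value
            counts[label] += 1
        # One mean feature vector (centroid) per class label.
        self.centroids = {
            label: [s / counts[label] for s in total]
            for label, total in sums.items()
        }
        return self

    def predict(self, x: Vector) -> str:
        def sq_dist(centroid: List[float]) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, centroid))

        # Return the label whose centroid is closest to x.
        return min(self.centroids, key=lambda label: sq_dist(self.centroids[label]))


def train_parallel_cso(samples: List[Sample]) -> Tuple[NearestCentroid, NearestCentroid]:
    """Fit two independent classifiers on the same feature vectors."""
    X = [features for features, _, _ in samples]
    # First training data set: feature vectors + organ labels only.
    organ_labels = [organ for _, organ, _ in samples]
    # Second training data set: feature vectors + tumor biology labels only.
    biology_labels = [biology for _, _, biology in samples]
    organ_clf = NearestCentroid().fit(X, organ_labels)
    biology_clf = NearestCentroid().fit(X, biology_labels)
    return organ_clf, biology_clf


# Invented toy data: each vector stands in for methylation-derived features.
samples: List[Sample] = [
    ([0.9, 0.1, 0.8], "lung", "adenocarcinoma"),
    ([0.8, 0.2, 0.1], "lung", "squamous cell carcinoma"),
    ([0.1, 0.9, 0.9], "colon or rectum", "adenocarcinoma"),
    ([0.2, 0.8, 0.7], "colon or rectum", "adenocarcinoma"),
]

organ_clf, biology_clf = train_parallel_cso(samples)
query = [0.85, 0.15, 0.2]
print(organ_clf.predict(query), "|", biology_clf.predict(query))
```

In this toy setup the two predictions are made independently, so a sample can be assigned any combination of organ label and tumor biology label rather than one label from a single combined list.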
Inventors
- Jorge Budeno
- Catherine Kurtzman
- Rita Shakanovic
Assignees
- GRAIL, Inc. (格里尔公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20241018
- Priority Date
- 20231020
Claims (20)
- 1. A method for training independent parallel cancer signal origin classifiers, the method comprising: obtaining training samples derived from subjects having known cancer diagnoses, each training sample comprising methylation sequence reads corresponding to nucleic acid fragments in a biological sample collected from the respective subject, and each known cancer diagnosis comprising a known affected organ or organ group from a plurality of organs or organ groups and a known tumor biology from a plurality of tumor biology categories; generating, for each training sample, a feature vector based on the methylation sequence reads; generating a first training data set comprising the feature vectors of the training samples and the known organs or organ groups; training an organ or organ group classifier with the first training data set to predict an organ or organ group from the plurality of organs or organ groups based on an input feature vector; generating a second training data set comprising the feature vectors of the training samples and the known tumor biology categories; and training a tumor biology classifier with the second training data set to predict a tumor biology from the plurality of tumor biology categories based on an input feature vector.
- 2. The method of any preceding claim, further comprising: for each training sample, extracting the known organ or organ group and the known tumor biology category from the known cancer diagnosis and clinical information.
- 3. The method of any preceding claim, wherein the feature vector is based at least in part on methylation features of the methylation sequence reads.
- 4. The method of claim 3, wherein the methylation features comprise a methylation density at one or more loci, a density of hypermethylated sequence reads at one or more loci, a density of hypomethylated sequence reads at one or more loci, a count of methylation sequence reads determined to be abnormally methylated at one or more loci, or some combination thereof.
- 5. The method of any preceding claim, wherein generating the first training data set comprises excluding information about tumor biology.
- 6. The method of any preceding claim, wherein generating the second training data set comprises excluding information about the affected organ or organ group.
- 7. The method of any preceding claim, further comprising: determining, for each feature, an information gain in distinguishing the organs or organ groups; identifying discriminative features for the organ or organ group classifier based on the information gains; and modifying the feature vectors of the first training data set to consist of the discriminative features, wherein the modified feature vectors are used to train the organ or organ group classifier.
- 8. The method of any preceding claim, further comprising: determining, for each feature, an information gain in distinguishing the tumor biology categories; identifying discriminative features for the tumor biology classifier based on the information gains; and modifying the feature vectors of the second training data set to consist of the discriminative features, wherein the modified feature vectors are used to train the tumor biology classifier.
- 9. The method of any preceding claim, wherein the organ or organ group classifier or the tumor biology classifier is a machine learning model.
- 10. The method of any preceding claim, further comprising training the organ or organ group classifier and the tumor biology classifier in a parallel training process.
- 11. The method of any preceding claim, further comprising training the organ or organ group classifier prior to training the tumor biology classifier.
- 12. The method of claim 11, wherein the output of the organ or organ group classifier is appended to the feature vectors of the second training data set prior to training the tumor biology classifier.
- 13. The method of any preceding claim, further comprising training the tumor biology classifier prior to training the organ or organ group classifier.
- 14. The method of claim 13, wherein the output of the tumor biology classifier is appended to the feature vectors of the first training data set prior to training the organ or organ group classifier.
- 15. The method of any preceding claim, wherein the organs or organ groups comprise breast, prostate, lung, head or neck, anus, cervix, ovary or fallopian tube, uterus, bladder or urothelium, kidney, stomach or esophagus, liver or intrahepatic bile duct, pancreas, extrahepatic bile duct or gallbladder, colon or rectum, bone or soft tissue, skin, blood, lymphatic system or bone marrow, thyroid, indeterminate tissue, or some combination thereof.
- 16. The method of any preceding claim, wherein the tumor biology categories comprise lymphoid neoplasms, myeloid neoplasms, plasma cell neoplasms, neuroendocrine cancers or tumors, adenocarcinomas, non-human-papillomavirus-related squamous cell carcinomas, human-papillomavirus-related cancers, hepatocellular carcinoma, neoplasms of Müllerian duct origin, transitional cell carcinoma, mesenchymal neoplasms, melanocytic neoplasms, mesothelial neoplasms, other tumor biology, indeterminate tumor biology, or some combination thereof.
- 17. A method for predicting a cancer signal origin, the method comprising: obtaining a test sample derived from a subject, the test sample comprising methylation sequence reads corresponding to nucleic acid fragments in a biological sample collected from the subject; generating, for the test sample, a first feature vector based on methylation sequence reads associated with a first feature set identified as discriminative for organ or organ group classification; generating, for the test sample, a second feature vector based on methylation sequence reads associated with a second feature set identified as discriminative for tumor biology classification; applying an organ or organ group classifier to the first feature vector to predict, from a plurality of organs or organ groups, an organ or organ group of the cancer associated with the test sample; applying a tumor biology classifier to the second feature vector to predict, from a plurality of tumor biology categories, a tumor biology of the cancer associated with the test sample, wherein the organ or organ group classifier and the tumor biology classifier are independently trained on training samples derived from subjects having known cancer diagnoses, each known cancer diagnosis comprising a known affected organ or organ group from the plurality of organs or organ groups and a known tumor biology from the plurality of tumor biology categories, and each training sample comprising methylation sequence reads corresponding to nucleic acid fragments in a biological sample collected from the respective subject; and providing information for a diagnostic test to diagnose the cancer based on the predicted organ or organ group and the predicted tumor biology.
- 18. The method of claim 17, wherein the organ or organ group classifier and the tumor biology classifier are trained by: generating, for each training sample, a feature vector based on the methylation sequence reads of the training sample; generating a first training data set comprising the feature vectors of the training samples and the known organs or organ groups of the known cancer diagnoses; training the organ or organ group classifier with the first training data set to predict an organ or organ group from the plurality of organs or organ groups based on an input feature vector; generating a second training data set comprising the feature vectors of the training samples and the known tumor biology categories of the known cancer diagnoses; and training the tumor biology classifier with the second training data set to predict a tumor biology from the plurality of tumor biology categories based on an input feature vector.
- 19. The method of any one of claims 17 to 18, wherein generating the first training data set comprises excluding information about tumor biology, and wherein generating the second training data set comprises excluding information about the affected organ or organ group.
- 20. The method of any one of claims 17 to 19, further comprising: determining, for each feature, an information gain in distinguishing the organs or organ groups; identifying discriminative features for the organ or organ group classifier based on the information gains; and modifying the feature vectors of the first training data set to consist of the discriminative features, wherein the modified feature vectors are used to train the organ or organ group classifier.
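The per-classifier feature selection recited in the claims (an information gain computed for each feature, then feature vectors reduced to the discriminative features) can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: information gain is computed with the standard Shannon-entropy definition, IG(Y; X) = H(Y) - H(Y | X); features are assumed to be binarized (for example, 1 if abnormally methylated reads were observed at a locus, else 0); a top-`k` cutoff is assumed as the selection rule; and the helper names and toy data are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """IG(Y; X) = H(Y) - H(Y | X) for one discrete feature X and labels Y."""
    n = len(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_features(X: List[List[int]], labels: List[str], k: int) -> List[int]:
    """Return the indices of the k features with the highest information gain."""
    n_features = len(X[0])
    gains = [information_gain([row[j] for row in X], labels) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: gains[j], reverse=True)
    return sorted(ranked[:k])


def project(X: List[List[int]], keep: List[int]) -> List[List[int]]:
    """Reduce each feature vector to the selected feature indices."""
    return [[row[j] for j in keep] for row in X]


# Invented toy data: 4 samples, 3 binarized features.
X = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
]
organ_labels = ["lung", "lung", "colon or rectum", "colon or rectum"]
biology_labels = ["adenocarcinoma", "squamous cell carcinoma",
                  "adenocarcinoma", "adenocarcinoma"]

# Selection runs separately against each label set.
organ_keep = select_features(X, organ_labels, k=1)
biology_keep = select_features(X, biology_labels, k=1)
print("organ classifier features:", organ_keep)
print("tumor biology classifier features:", biology_keep)
print("reduced organ training vectors:", project(X, organ_keep))
```

With this toy data the two selections differ: feature 0 separates the organ labels perfectly, while feature 2 separates the tumor biology labels, so each classifier would be trained on its own reduced feature vectors.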
Description
Concurrent classification of cancer origin for organ types and tumor biology types
Background
Cancer is a leading cause of death worldwide. Cancer mortality is exacerbated by the fact that cancer is often detected at an advanced stage, which limits the efficacy of treatment options for long-term survival. Current detection methods are generally cancer-specific, i.e., each type of cancer (breast, lung, colorectal, prostate, etc.) is screened for individually, and each screening process is tailored to a particular cancer. For example, mammography is used for breast cancer detection, whereas colonoscopy or stool testing is used for colorectal cancer detection. A given screening method is generally not applicable to other cancers.
In addition, current screening methods are hampered by low detection rates or high false positive rates. Methods with low detection rates often miss early-stage cancers, because the cancer has only just begun to develop. Methods with high false positive rates misclassify cancer-free subjects as positive for cancer. Consequently, most screening tests are practical only for subjects who have a high risk of developing the screened cancer or who have symptoms suggestive of cancer, and they have limited ability to detect cancer in the general population.
Recent studies have shown that aberrant DNA methylation is involved in many disease processes, including cancer. DNA methylation plays a role in regulating gene expression and in defining tissue differentiation, cell identity, and/or embryonic lineages. Abnormal DNA methylation can therefore disrupt normal gene expression pathways or cell identity, leading to cancer or other diseases. For example, specific patterns of differentially methylated regions can be used as molecular markers for various disease states. These differentially methylated regions can be detected by sequencing cell-free DNA (cfDNA) molecules.
Typically, cell-free DNA molecules are DNA molecules found in bodily fluids. These DNA molecules are typically released through natural cell death, through active release by healthy cells, or, in the case of tumor-derived DNA molecules, through shedding from tumor cells undergoing cell death.
Nevertheless, even techniques for detecting differentially methylated regions present a number of challenges. Early cancer detection is particularly challenging because of the small ratio of tumor cells to non-cancerous cells in the subject. This ratio may be as low as on the order of 1:1,000, 1:10,000, or even 1:100,000. The challenge is therefore to detect a small cancer "signal" in otherwise healthy "noise", especially when the signal is analyzed in readily accessible samples, such as blood draws used to assess the presence of a cancer signal in plasma (e.g., in cell-free DNA).
Further challenges arise in providing insight into the cancer detected in a subject. For example, a multi-cancer detection test may provide only a binary prediction of whether a subject has cancer. Such limited insight may restrict the ability of medical personnel to proceed with diagnosing the cancer and/or treating the subject. Diagnostic examinations and treatment options are typically tailored to the specific organ or organ group affected and to the tumor biology. Thus, there is a need to increase the granularity of analytical predictions to better inform healthcare personnel of diagnostic examination options. The present disclosure is directed to addressing the challenges described above.
The background description provided herein is intended to generally present the context of the present disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
Disclosure of Invention
One or more of the inventions described herein provide improvements in cancer detection, diagnosis, and treatment, in particular by providing more granular cancer signal origin (CSO) predictions. One or more inventions described herein contemplate training parallel CSO classifiers to predict, respectively, the organ type and the tumor biology type affected by cancer. Training the CSO classifiers in parallel splits the CSO prediction analysis so as to avoid the confounding of organ type and tumor biology type that can occur when, as is typical, only a single cancer signal origin is predicted. In some examples, the CSO classifiers are trained with training data sets derived from the same set of training samples. A training data set is generated for each CSO classifier based on, for example, the methylation sequencing data for each sample and a known CSO label for the sample, the known CSO label representing the clinical ground truth, which is currently obtained only after diagnostic examination and cancer diagnosis. Training of the CSO classifiers can be parallel such that each classifier learns patterns in methylation sequencing data separately and independently t