EP-4738369-A1 - A METHOD OF ANALYZING A SAMPLE FOR OVARIAN CANCER DETECTION USING A MACHINE LEARNING MODEL
Abstract
The subject of the invention is a method for analyzing a biological material sample for the detection of ovarian cancer, comprising isolating RNA from platelets, preparing cDNA libraries, fragmenting, molecularly indexing with Illumina barcodes, sequencing, and bioinformatic processing, in which the cDNA libraries, after bioinformatic processing, are mapped to the human reference genome (hg19), then samples with a read count greater than 100,000 per sample are selected, the data is normalized using a variance-stabilizing transformation, model training is performed on training data and classification performance is assessed using a validation set; in the final stage, the sample is evaluated by the resulting algorithm, which classifies the sample as "ovarian cancer" or "healthy".
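The sample-selection step can be illustrated with a minimal Python sketch. It assumes that "read count" means the total number of mapped reads per sample, summed over genes; the abstract does not spell this out. The names `count_table`, `select_samples`, and `MIN_READS` are hypothetical and chosen only for illustration.

```python
# Assumption: "read count" is the total of mapped reads per sample,
# obtained by summing the per-gene counts. All names here are hypothetical.
MIN_READS = 100_000  # threshold stated in the abstract


def select_samples(count_table, min_reads=MIN_READS):
    """Keep samples whose total read count is strictly greater than min_reads."""
    return {
        sample_id: counts
        for sample_id, counts in count_table.items()
        if sum(counts) > min_reads
    }


# Toy per-gene counts for three samples (illustrative values only).
count_table = {
    "S1": [60_000, 45_000, 5_000],   # 110,000 reads in total: kept
    "S2": [40_000, 30_000, 10_000],  # 80,000 reads in total: dropped
    "S3": [50_000, 50_000, 0],       # exactly 100,000 reads: dropped (not greater)
}

kept = select_samples(count_table)
print(sorted(kept))
```

The strict comparison mirrors the wording "greater than 100,000"; a sample with exactly 100,000 reads is excluded under this reading.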
Inventors
- JOPEK, Maksym
- SUPERNAT, Anna
- SIECZCZYNSKI, Michal
- PASTUSZAK, Krzysztof
- ZACZEK, Anna
Assignees
- Gdanski Uniwersytet Medyczny
Dates
- Publication Date: 20260506
- Application Date: 20251010
Claims (9)
- A method for analyzing a biological material sample for the detection of ovarian cancer (OC), comprising isolating RNA from platelets, preparing cDNA libraries, fragmenting, molecularly indexing with Illumina barcodes, sequencing, and performing bioinformatic processing, characterized in that: the cDNA libraries, after bioinformatic processing, are mapped to the human reference genome (hg19); a cDNA sample with a read count greater than 100,000 per sample is selected; the data is normalized using a variance-stabilizing transformation; the model training is performed on training data, and assessment of classification quality is performed on the validation set; and the sample is evaluated by the resulting algorithm, which classifies the sample as "ovarian cancer" or "healthy individual".
- The method according to claim 1, characterized in that each cDNA sample contains information on RNA expression levels for 5,277 genes found in platelets.
- The method according to claim 1, characterized in that the algorithm was trained on samples from two independent cohorts 1 and 2, which were combined and normalized using the DESeq2 package in R.
- The method according to any one of claims 1 to 3, characterized in that the samples in the cohorts' training dataset originate from healthy donors and donors with OC.
- The method according to any one of claims 1 to 4, characterized in that the training of the algorithm was limited to samples from asymptomatic healthy women and women with OC.
- The method according to any one of claims 1 to 5, characterized in that, for all cancer stages, the specificity of the method is at least 54.9% for a model with 100% sensitivity, and the sensitivity is at least 74.5% for a model with 100% specificity, when only women are included in the model.
- The method according to any one of claims 1 to 5, characterized in that, for all cancer stages, the specificity of the method is at least 35.2% for a model with 100% sensitivity, and the sensitivity is at least 52.9% for a model with 100% specificity, when only OC are included in the model.
- The method according to any one of claims 1 to 5, characterized in that, for all cancer stages, the specificity of the method is at least 40.8% for a model with 100% sensitivity, and the sensitivity is at least 53.9% for a model with 100% specificity, when only women and OC are included in the model.
- The method according to any one of claims 1 to 5, characterized in that, for all cancer stages, the specificity of the method is at least 59.2% for a model with 100% sensitivity, and the sensitivity is at least 75.5% for a model with 100% specificity, when the platelet RNA profile and donor age are included in the model as a separate feature.
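The two operating points named in the performance claims (specificity for a model with 100% sensitivity, and sensitivity for a model with 100% specificity) can be illustrated with a minimal Python sketch. It assumes the classifier outputs a continuous score where a higher value means "ovarian cancer", and that the decision threshold is moved until one error type disappears; the claims do not state how the thresholds were chosen. The function names, scores, and labels are hypothetical toy values, not data from the application.

```python
# Assumption: label 1 = ovarian cancer, label 0 = healthy; a higher score
# means the model leans towards "ovarian cancer". All values are toy data.


def specificity_at_full_sensitivity(scores, labels):
    """Specificity when the threshold is low enough to catch every cancer sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    threshold = min(pos)  # the lowest-scoring cancer sample is still called positive
    true_negatives = sum(1 for s in neg if s < threshold)
    return true_negatives / len(neg)


def sensitivity_at_full_specificity(scores, labels):
    """Sensitivity when the threshold is high enough to clear every healthy sample."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    highest_negative = max(neg)  # only scores above this are called positive
    true_positives = sum(1 for s in pos if s > highest_negative)
    return true_positives / len(pos)


scores = [0.95, 0.80, 0.60, 0.30, 0.70, 0.40, 0.20, 0.10]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

print(specificity_at_full_sensitivity(scores, labels))  # 0.5
print(sensitivity_at_full_specificity(scores, labels))  # 0.5
```

In the toy data, two of the four healthy samples score below the weakest cancer sample, and two of the four cancer samples score above the strongest healthy sample, so both figures are 0.5. The first operating point suits a screening setting where no cancer may be missed; the second suits a setting where no healthy person may be flagged.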
Description
The subject of the invention is a method for analyzing a biological material sample, preferably RNA derived from blood platelets, for the early detection of ovarian cancer (OC) using a trained machine-learning model. OC poses a major challenge in oncology because it is asymptomatic in its early stages, which leads to late diagnosis and a five-year survival rate of only 50%. The lack of effective screening tests for asymptomatic, average-risk patients further hampers early-stage detection, underscoring the urgent need for innovative methods to identify this malignancy. Although the serum biomarker CA-125 is routinely used in clinics, it is not suitable as a screening tool due to its low sensitivity in early OC and a high rate of false positives (elevated CA-125 levels may also accompany menstruation, pregnancy, or endometriosis).
According to the state of the art, liquid biopsies most commonly use cell-free circulating DNA (cfDNA) or circulating tumor cells (CTCs). In the case of cfDNA, success in detecting ovarian cancer was reported by J. D. Cohen et al., 2018, and by D. Killock, 2018, where the CancerSEEK diagnostic test was developed to simultaneously assess protein biomarker concentrations and cfDNA (containing a fraction of circulating tumor DNA). CancerSEEK achieved a high OC detection rate of around 98%. However, for samples from stage I patients, this test demonstrated only 40% sensitivity. Using the same dataset, K. C. Wong et al., 2019, obtained approximately 96% sensitivity for OC detection, and ~77% and ~90% for stages I and II, respectively, at 99% specificity when detecting any type of cancer using a semi-naive Bayesian machine-learning model (CancerA1DE). In the publication by S. Rahaman, X. Li, J. Yu, and K.-C. Wong et al., 2019, the authors achieved 99% sensitivity at 99% specificity in cancer detection using a one-dependence estimators ensemble model (CancerEMC).
The first study employed 39 biomarkers and considered patient sex, whereas the second used only 15 biomarkers combined with age, ethnicity, and sex. Another multi-cancer screening test, Galleri (GRAIL), disclosed in C. Sheridan, 2017, achieved a 67% detection rate for OC at 99% specificity in the early stage. However, the clinical utility of this test proved limited due to high costs. In light of these studies and their limitations, the use of platelet RNA represents a valuable alternative for OC detection and beyond. Changes in the platelet RNA profile are detectable at early stages, offering a chance to capture the disease even during its asymptomatic phase. In M. A. Jopek et al., 2024, it was shown that applying logistic regression to platelet RNA expression data enables 76.5% sensitivity at 99% specificity for OC detection (72% sensitivity at a 99% specificity threshold for early detection). The ImPlatelet tool disclosed in K. Pastuszak et al., 2021, based on deep learning, is another example of the potential of platelet-based diagnostics, achieving up to 88% specificity with 95% sensitivity. Furthermore, as presented by Y. Gao et al., 2022, combining platelet RNA with the CA-125 marker yielded 85.3% specificity at 86.4% sensitivity in samples from various OC stages and 73.2% sensitivity at 86.0% specificity in the case of early OC detection. It should be noted that, while cross-study comparisons are difficult because the studies report different metrics at different levels, they nonetheless show the trade-offs between sensitivity and specificity and the importance of a balanced diagnostic approach. In addition, the studies differ in how test samples are normalized, notably separate normalization of the test and training sets versus joint normalization of test and training data.
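The difference between separate and joint normalization can be illustrated with a minimal Python sketch. It is not the variance-stabilizing transformation used in the method: per-gene mean-centering of log2(count + 1) is used here only as a simple stand-in, and "separate" is read as each set being normalized with its own statistics. All function names and counts are hypothetical.

```python
import math
from statistics import mean

# Stand-in normalization (NOT the DESeq2 variance-stabilizing transformation):
# log2(count + 1) followed by per-gene mean-centering. Names are hypothetical.


def log_counts(samples):
    """Apply log2(count + 1) to every gene count of every sample."""
    return [[math.log2(c + 1) for c in sample] for sample in samples]


def gene_means(samples):
    """Per-gene mean across the given samples (rows = samples, columns = genes)."""
    return [mean(column) for column in zip(*samples)]


def center(samples, means):
    """Subtract the supplied per-gene means from every sample."""
    return [[v - m for v, m in zip(sample, means)] for sample in samples]


# Toy counts for two genes (illustrative values only).
train = log_counts([[7, 15], [31, 63]])
test = log_counts([[127, 255], [511, 1023]])

# Joint normalization: statistics are estimated on training and test data together,
# so the training features change whenever the test set changes.
joint_means = gene_means(train + test)
joint_train = center(train, joint_means)
joint_test = center(test, joint_means)

# Separate normalization: each set is normalized with its own statistics,
# so the training features do not depend on the test set.
separate_train = center(train, gene_means(train))
separate_test = center(test, gene_means(test))

print(joint_train, joint_test)
print(separate_train, separate_test)
```

In this toy case the joint variant shifts the training values by the test-set statistics, whereas the separate variant leaves them unaffected. This is one reason why results obtained under the two schemes are hard to compare directly.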
The objective of the invention is to address the lack of tools for early ovarian cancer diagnostics by using blood as a liquid-biopsy material, enabling a minimally invasive alternative to traditional methods. An additional objective is to improve diagnostic methodology by developing specialized classifiers tailored to different demographic cohorts. Currently, the main obstacle to detecting ovarian cancer is the low sensitivity of available tests. The developed solution is characterized by high sensitivity in detecting stages I and II for a model intended for screening, and high specificity for a high-risk group. The subject of the invention is a method for analyzing a biological material sample for early detection of ovarian cancer, comprising isolating RNA from platelets, preparing cDNA libraries, fragmenting, molecular indexing with Illumina barcodes, sequencing, and bioinformatic processing, characterized in that the cDNA libraries, after bioinformatic processing, are mapped to the human reference genome (hg19), then a cDNA sample with a read count greater than 100,000 per sample is selected, the data are normalized using a variance-stabilizing transformation, and the model is trained on training-subset based on assessment of classification qua