US-20260128130-A1 - CLASSIFIER AND TRAINING SYSTEM FOR EARLY IDENTIFICATION OF PLACENTA ACCRETA SPECTRUM RISK IN HIGH-RISK PREGNANT WOMEN
Abstract
Disclosed are a classifier and a training system for early identification of placenta accreta spectrum (PAS) risk in high-risk pregnant women. The system normalizes and discretizes genome-wide maternal plasma cell-free DNA promoter coverages and identifies an optimal target combination of 23 genes, including ABHD1, ALG1L2, and EYS, which shows potential for use in preparing a diagnostic kit. The classifier, constructed with a machine learning algorithm, predicts the occurrence of PAS with high sensitivity and specificity, with areas under the receiver operating characteristic curve (AUC) all exceeding 0.85, enabling early risk assessment for high-risk pregnant women. The classifier provides a non-invasive predictive tool of significant clinical value.
Inventors
- Zhonghua SHI
- Runrun HAO
- Bin ZHANG
- Xueqi BAI
- Shanshan WANG
- Shiman HU
- Sutong KAN
- Haoyan SHI
Assignees
- CHANGZHOU MATERNAL AND CHILD HEALTH CARE HOSPITAL
Dates
- Publication Date: 20260507
- Application Date: 20250930
- Priority Date: 20240930
Claims (17)
- 1. A classifier training system for early identification of a placenta accreta spectrum risk in high-risk pregnant women, comprising:
  a dataset division module, configured to extract pre-collected medical data of placenta accreta spectrum (PAS) high-risk pregnant women, and divide the medical data into a discovery dataset, a training dataset, an internal validation dataset, and an external validation dataset;
  an impact factor extraction module, configured to perform annotation on pre-collected non-invasive prenatal testing (NIPT) data of the PAS high-risk pregnant women with genome-wide cell-free DNA promoter nucleosome coverage profiles and perform feature extraction, calculate original read coverages at pTSS regions, and perform TPM-like normalization on the original read coverages at the pTSS regions to obtain TPM-like normalized pTSS coverages NPC-TPM, wherein NPC-TPM corresponding to each gene is used as an impact factor;
  a feature selection module, configured to screen impact factors of high-risk pregnant women with occurrence of PAS and high-risk pregnant women without occurrence of PAS in the discovery dataset, to determine feature factors;
  a feature factor discretization module, configured to determine an optimal cutoff value of each feature factor, and perform discretization on the feature factors based on the optimal cutoff value; and
  a model acquisition module, configured to input the determined feature factors to a recursive feature elimination (RFE) process, gradually construct a PAS prediction classifier by using a plurality of machine learning algorithms; acquire a disease risk assessment result of a to-be-predicted target, apply k-fold cross-validation to enhance assessment robustness; and extract an optimal feature factor combination, and output an optimal classifier and an assessment result thereof,
  wherein in the model acquisition module, the optimal feature factor combination comprises the following target genes: ABHD1, ALG1L2, EYS, FAM157C, KDSR, KRT5, LANCL2, LINC00390, LINC00964, LOC105371998, LOC107987394, LOC644090, LYZL2, MIR184, MIR4802, MYT1L, NGDN, NSD2, PACRG.AS3, SAP30L.AS1, SLC16A12.AS1, TADA3, and TMEM147.AS1.
- 2. The system according to claim 1, wherein in the dataset division module, pregnant women with occurrence of PAS and pregnant women without occurrence of PAS are matched based on maternal age, gestational age at NIPT, fetal sex, and distribution of high-risk factors, samples from a primary center are randomly divided into the training dataset and the internal validation dataset, samples from each sub-center are used as the external validation dataset, and matching is performed on the training dataset at 1:1 based on the gestational age at NIPT and the fetal sex to obtain the discovery dataset.
- 3. The system according to claim 1, wherein in the impact factor extraction module, NIPT data is aligned to human reference genome hg19 by using a sequence alignment algorithm, PCR duplicates are removed, a region from −1000 bp to +1000 bp around a transcription start site (TSS) is determined as a promoter region pTSS, and the original read coverage at the pTSS region is calculated.
- 4. The system according to claim 3, wherein in the impact factor extraction module, TPM-like normalization is performed on the original read coverage at the pTSS region, wherein NPC-TPM is obtained through the following formula:
  $$\text{NPC-TPM}_i = \frac{q_i / l_i}{\sum_j (q_j / l_j)} \times 10^6 = \frac{q_i}{\sum_j q_j} \times 10^6$$
  wherein $\text{NPC-TPM}_i$ represents a TPM-like normalized pTSS coverage of a gene $i$, $q_i$ represents an original read coverage at the pTSS region, $l_i$ represents a transcript length, and $\sum_j (q_j / l_j)$ represents a sum of pTSS read coverages of all genes normalized based on the transcript length in one sample.
- 5. The system according to claim 1, wherein in the feature selection module, the impact factors of the high-risk pregnant women with occurrence of PAS and the high-risk pregnant women without occurrence of PAS in the discovery dataset are subjected to three differential analysis methods: DESeq2, limma-voom, and a rank-sum test, to obtain impact factors with a p value of less than 0.05 as the feature factors.
- 6. The system according to claim 1, wherein in the feature factor discretization module, to enhance universality and clinical utility of the classifier for different sequencing platforms, the optimal cutoff value of each feature factor is set to an NPC-TPM value with a maximum sum of sensitivity and specificity in the training dataset; a case in which a feature factor NPC-TPM is greater than a corresponding optimal cutoff value is set to 1; and a case in which a feature factor NPC-TPM is not greater than a corresponding optimal cutoff value is set to 0.
- 7. The system according to claim 1, wherein in the model acquisition module, the PAS prediction classifier is gradually constructed by using the plurality of machine learning algorithms, and the classifier is trained separately by logistic regression (LR) and by support vector machines (SVM) with linear and RBF kernels.
- 8. The system according to claim 1, wherein in the model acquisition module, a support vector machine (SVM) with an RBF kernel is selected as the optimal classifier.
- 9. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 1.
- 10. Use of a reagent for detecting expression of genes ABHD1, ALG1L2, EYS, FAM157C, KDSR, KRT5, LANCL2, LINC00390, LINC00964, LOC105371998, LOC107987394, LOC644090, LYZL2, MIR184, MIR4802, MYT1L, NGDN, NSD2, PACRG.AS3, SAP30L.AS1, SLC16A12.AS1, TADA3, and TMEM147.AS1 in preparation of a diagnostic kit for early identification of a placenta accreta spectrum risk in high-risk pregnant women.
- 11. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 2.
- 12. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 3.
- 13. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 4.
- 14. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 5.
- 15. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 6.
- 16. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 7.
- 17. A classifier for early identification of a placenta accreta spectrum risk in high-risk pregnant women, wherein the classifier is obtained through the training system according to claim 8.
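The pTSS window of claim 3 and the TPM-like normalization of claim 4 can be illustrated with a short Python sketch. It is not the applicants' implementation: the function names `ptss_window` and `npc_tpm` and the read counts are hypothetical. It uses the simplified right-hand form of the claim 4 formula, which equals the left-hand form when the length term is the same for every gene (an assumption here, consistent with every pTSS window in claim 3 spanning the same 2000 bp).

```python
from typing import Dict, Tuple

PTSS_FLANK = 1000  # bp on each side of the TSS, per claim 3


def ptss_window(tss: int, flank: int = PTSS_FLANK) -> Tuple[int, int]:
    """Return the (start, end) promoter window around a TSS, clamped at 0."""
    return max(0, tss - flank), tss + flank


def npc_tpm(raw_coverage: Dict[str, float]) -> Dict[str, float]:
    """TPM-like normalization within one sample: q_i / sum_j q_j * 1e6.

    Assumes an identical length term for every gene, so it cancels out.
    """
    total = sum(raw_coverage.values())
    if total <= 0:
        raise ValueError("sample has no pTSS coverage to normalize")
    return {gene: q / total * 1e6 for gene, q in raw_coverage.items()}


# Hypothetical raw pTSS read counts for three of the claimed genes.
sample = {"ABHD1": 120.0, "KRT5": 30.0, "NSD2": 50.0}
normalized = npc_tpm(sample)
```

Because each value is divided by the sample total, the normalized coverages of one sample always sum to one million, which is what makes samples of different sequencing depth comparable.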
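The cutoff selection and discretization of claim 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the names `optimal_cutoff` and `discretize` and the example values are hypothetical, candidate cutoffs are limited to the observed NPC-TPM values, and a higher NPC-TPM is assumed to indicate PAS (a gene with the opposite direction would need the comparison flipped when computing sensitivity and specificity).

```python
from typing import List, Sequence


def optimal_cutoff(values: Sequence[float], labels: Sequence[int]) -> float:
    """Return the observed value that maximizes sensitivity + specificity.

    labels: 1 = PAS, 0 = no PAS. A sample is called positive when its
    value is strictly greater than the cutoff (assumed direction).
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need both PAS and non-PAS samples")
    best_cutoff, best_score = None, -1.0
    for cutoff in sorted(set(values)):
        tp = sum(1 for v, y in zip(values, labels) if y == 1 and v > cutoff)
        tn = sum(1 for v, y in zip(values, labels) if y == 0 and v <= cutoff)
        score = tp / positives + tn / negatives
        if score > best_score:  # ties keep the lowest cutoff
            best_cutoff, best_score = cutoff, score
    return best_cutoff


def discretize(values: Sequence[float], cutoff: float) -> List[int]:
    """Map NPC-TPM to 1 if greater than the cutoff, else 0 (claim 6)."""
    return [1 if v > cutoff else 0 for v in values]


# Hypothetical NPC-TPM values of one feature factor in a training dataset.
train_values = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
train_labels = [0, 0, 0, 1, 1, 1]
cutoff = optimal_cutoff(train_values, train_labels)
binary = discretize(train_values, cutoff)
```

Reducing each feature to 0 or 1 discards the absolute coverage scale, which is in line with the stated purpose in claim 6 of making the classifier usable across sequencing platforms.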
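Of the three differential analysis methods named in claim 5, the rank-sum test is simple enough to sketch in plain Python. The code below is an illustration only: it uses the normal approximation with average ranks for ties and no tie or continuity correction, which is an assumption and may differ from the variant the applicants used. DESeq2 and limma-voom are not reproduced, and the names `rank_sum_p` and `screen`, the placeholder gene labels, and the data are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def rank_sum_p(pas: Sequence[float], non_pas: Sequence[float]) -> float:
    """Two-sided rank-sum p value using the normal approximation."""
    n1, n2 = len(pas), len(non_pas)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one sample")
    pooled = sorted([(v, 0) for v in pas] + [(v, 1) for v in non_pas])
    rank_sum_pas = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of 1-based ranks i+1 .. j
        in_pas = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_pas += average_rank * in_pas
        i = j
    u = rank_sum_pas - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return math.erfc(abs(z) / math.sqrt(2))


def screen(pas: Dict[str, Sequence[float]],
           non_pas: Dict[str, Sequence[float]],
           alpha: float = 0.05) -> List[str]:
    """Keep impact factors whose rank-sum p value is below alpha."""
    return [g for g in pas if rank_sum_p(pas[g], non_pas[g]) < alpha]


# Hypothetical NPC-TPM values for two placeholder genes.
pas_group = {"GENE_A": [6, 7, 8, 9, 10], "GENE_B": [1, 3, 5, 7, 9]}
non_pas_group = {"GENE_A": [1, 2, 3, 4, 5], "GENE_B": [2, 4, 6, 8, 10]}
selected = screen(pas_group, non_pas_group)
```

In this toy data the two groups are fully separated for `GENE_A` and interleaved for `GENE_B`, so only `GENE_A` passes the 0.05 threshold.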
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application claims the priority benefit of China application serial no. 202411383048.5, filed on Sep. 30, 2024. The entirety of the above-mentioned patent application is hereby incorporated by reference herein and made a part of this specification.
BACKGROUND
Technical Field
The present invention relates to the field of model prediction, and in particular, to a classifier and a training system for early identification of a placenta accreta spectrum risk in high-risk pregnant women.
Background Art
Placenta accreta spectrum (PAS) is a leading cause of critical and life-threatening obstetric conditions. However, its prenatal diagnosis has limitations such as late detection and frequent missed diagnoses, posing significant challenges especially for primary hospitals. Studies indicate that two-thirds of PAS cases remain undiagnosed clinically. Failure to identify PAS prenatally is a risk factor for intrapartum and postpartum massive hemorrhage, blood transfusion, emergency interventions, and hysterectomy. Therefore, accurate early prediction of PAS helps high-risk pregnant women make informed reproductive decisions, facilitates high-risk referrals, allows for multidisciplinary consultations, and reduces risks for pregnant or lying-in women and perinatal infants.
Cell-free DNA (cfDNA) in plasma originates from the release of apoptotic cells. The cfDNA carries nucleosome footprints that can reflect gene expression information of its tissue of origin. During pregnancy, approximately 10% of cfDNA in circulating blood originates from the placenta. Therefore, cfDNA in plasma in early pregnancy carries gene expression information of the placenta and decidua. During early-to-mid pregnancy, genome-wide cfDNA promoter nucleosome coverage profiles can reflect expression patterns of tissue of origin, demonstrating extremely high predictive value for placenta-derived diseases, particularly PAS.
Non-invasive prenatal testing (NIPT) is in common clinical use for prenatal screening. It relies on low-coverage whole-genome sequencing, which hospitals worldwide perform on different sequencing platforms, such as Illumina, Life, and BGI. In recent years, NIPT-based extraction of cfDNA promoter nucleosome coverage profiles has shown significant value not only in screening for fetal chromosomal abnormalities, but also in early prediction of pregnancy complications, such as fetal growth restriction, macrosomia, and preeclampsia. However, there is no effective early prediction model for placenta accreta spectrum.
SUMMARY
To address practical needs and drawbacks in the related art, the present invention provides a classifier and a training system for early identification of a PAS risk in high-risk pregnant women, to resolve the current lack of methods for accurately predicting occurrence of PAS during early-to-mid pregnancy.
Technical Solutions
To achieve the foregoing objective, the present invention provides the following technical solutions:
According to a first aspect, the present invention provides a classifier for early identification of a PAS risk in high-risk pregnant women based on plasma cfDNA promoter coverages, where a target gene combination includes ABHD1, ALG1L2, EYS, FAM157C, KDSR, KRT5, LANCL2, LINC00390, LINC00964, LOC105371998, LOC107987394, LOC644090, LYZL2, MIR184, MIR4802, MYT1L, NGDN, NSD2, PACRG.AS3, SAP30L.AS1, SLC16A12.AS1, TADA3, and TMEM147.AS1.
In the present invention, genome-wide cfDNA promoter coverage profiles are identified in the plasma of pregnant women with occurrence of PAS during early-to-mid pregnancy based on NIPT data, and an optimal gene combination and an optimal cutoff value of each gene are obtained based on machine learning strategies (see the second aspect) to train an optimal classifier. In an independent validation dataset, the area under the receiver operating characteristic (ROC) curve (AUC) of the classifier reaches 0.85 or more, demonstrating good potential as a screening means for PAS in high-risk pregnant women.
A method for assessing a placenta accreta spectrum risk in high-risk pregnant women includes the following steps:
Data collection and preprocessing: collect low-coverage whole-genome sequencing data of high-risk pregnant women undergoing NIPT, and perform necessary preprocessing, to ensure data quality.
Promoter region identification and coverage extraction: align NIPT data to human reference genome hg19 by using a sequence alignment algorithm with the software bwa-mem, SAMtools, and BEDtools, and determine promoter regions pTSS from −1000 bp to +1000 bp around transcription start sites (TSS) of 23 genes: ABHD1, ALG1L2, EYS, FAM157C, KDSR, KRT5, LANCL2, LINC00390, LINC00964, LOC105371998, LOC107987394, LOC644090, LYZL2, MIR184, MIR4802, MYT1L, NGDN, NSD2, PACRG.AS3, SAP30L.AS1, SLC16A12.AS1, TADA3, and TMEM147.AS1, to obtain original read coverages of the 23 genes at the pTSS regions.
Feature factor normalization: normalize the original read co