CN-121999877-A - Tissue origin deducing method and device based on cancer specific chromatin accessibility marker
Abstract
The invention discloses a tissue-of-origin inference method, device, and storage medium based on cancer-specific chromatin accessibility markers, belonging to the technical field of gene detection. The method addresses the problem of accurately tracing the origin of a cancer using low-cost shallow whole-genome sequencing data. The invention proposes a fragment dispersion index (FDI) that combines the end dispersion and coverage fluctuation of cell-free DNA fragments to characterize chromatin accessibility more accurately. Candidate accessible regions are identified from the data by a statistical testing strategy that combines global and local significance tests. The core step screens out markers unique to a specific cancer type by eliminating accessible regions shared with other cancer types and healthy controls. Based on these specific markers, multidimensional fragmentomics features are extracted, a multi-class machine learning model is constructed, and the probability that a test sample derives from each cancer type is predicted.
Inventors
- SHAO YANG
- WANG XIAONAN
- ZHAO MINCHAO
- CHANG SHUANG
- WU SHUYU
- NA CHENGLONG
- ZHANG XIAN
- CHANG ZHILI
Assignees
- 南京世和基因生物技术股份有限公司
- 南京世和医疗器械有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20251107
Claims (10)
- 1. A method of tissue origin inference based on cancer-specific chromatin accessibility markers, for non-therapeutic and non-diagnostic purposes, comprising the steps of: a) providing cell-free DNA sequencing data of a sample to be tested, wherein the data has undergone removal of adapters and low-quality sequences, alignment to a reference genome, and removal of duplicate reads; b) obtaining, for a predetermined group of cancer types, a set of cancer-specific chromatin accessibility markers for each cancer type; c) extracting, for the sample to be tested, values of at least one fragmentomics feature on all genomic regions defined by the cancer-specific chromatin accessibility markers of each cancer type obtained in step b), and summarizing and normalizing the feature values extracted for each cancer type to form a multidimensional feature vector representing the sample; d) inputting the feature vector obtained in step c) into a pre-trained multi-class machine learning model, outputting a probability score of the sample for each cancer type in the group, and inferring the tissue of origin from the probability scores.
- 2. A tissue origin inference device based on cancer-specific chromatin accessibility markers, comprising: a sequencing data acquisition module for providing cell-free DNA sequencing data of a sample to be tested, wherein the data has undergone removal of adapters and low-quality sequences, alignment to a reference genome, and removal of duplicate reads; a marker screening module for obtaining, for a predetermined group of cancer types, a set of cancer-specific chromatin accessibility markers for each cancer type; a feature vector construction module for extracting, for the sample to be tested, values of at least one fragmentomics feature on all genomic regions defined by the cancer-specific chromatin accessibility markers of each cancer type obtained by the marker screening module, and summarizing and normalizing the feature values extracted for each cancer type to form a multidimensional feature vector representing the sample; and a traceability analysis module for inputting the feature vector obtained by the feature vector construction module into a pre-trained multi-class machine learning model, outputting a probability score of the sample for each cancer type in the group, and inferring the tissue of origin from the probability scores.
- 3. The device of claim 2, wherein the cell-free DNA sequencing data is derived from shallow whole-genome sequencing of plasma samples at a per-sample sequencing depth of 5X; and the group of cancer types includes breast cancer, cervical cancer, colorectal cancer, esophageal cancer, gastric cancer, head and neck cancer, kidney cancer, hepatobiliary cancer, lung cancer, melanoma, ovarian cancer, pancreatic cancer, prostate cancer, sarcoma, thyroid cancer, urothelial cancer, and endometrial cancer.
- 4. The device of claim 2, wherein the method for obtaining the cancer-specific chromatin accessibility markers comprises: i. for each cancer type and the healthy control group, merging the single-sample data of multiple samples of the same disease type to obtain high-depth representative sample data with a depth of not less than 100X, and analyzing the corresponding cell-free DNA sequencing data to identify a set of candidate chromatin accessibility regions; ii. for a target cancer type, combining the candidate chromatin accessibility regions of all other non-target cancer types and the healthy control group to form a background set, and merging and de-duplicating adjacent intervals in the background set that are less than 20bp apart on the genome; iii. removing from the candidate chromatin accessibility regions of the target cancer type any region whose overlap ratio with the background set exceeds 0.5, the overlap ratio being defined as the total number of bases shared by the target candidate region and the background set divided by the base length of the target candidate region itself; wherein the identification of candidate chromatin accessibility regions in step i) is performed by scanning the genome with a sliding window of length 200bp and step 20bp and calculating the fragment dispersion index FDI for each window that passes basic quality filtering, the basic quality filtering conditions being that the sequencing coverage of the window is greater than 0, that the window does not fall in a dark region of the genome formed by repetitive sequences or low-complexity regions, and that the average mappability score of all bases in the window is not less than 0.9; the fragment dispersion index FDI is defined as the product of the fragment end dispersion index EDI and the standard deviation of fragment coverage std(coverage), i.e., FDI = EDI × std(coverage); the fragment end dispersion index EDI measures the spatial dispersion of cell-free DNA fragment ends within a region and is calculated by a formula (not reproduced in this text), wherein: n is the total number of cell-free DNA fragment ends in the region; j is the end index, running from 1 to n within the region; w is a preset local bin width with a value of 20bp; x is a preset constant with a value of 0.5; and C_{j,x×w} is the total number of ends in the bin of width w centered on the j-th end.
- 5. The device of claim 2, wherein the identification of candidate chromatin accessibility regions further comprises a statistical test step that combines global and local significance tests to screen for windows with significantly elevated FDI values; the global significance test comprises: a. normalizing the FDI values of all windows that pass basic quality filtering to obtain FDI_norm; b. fitting a Beta distribution to the FDI_norm values of all windows on a per-chromosome scale, and calculating the right-tail cumulative probability of each window's FDI_norm value as its global significance p_global; c. applying Benjamini-Hochberg multiple-testing correction to all p_global values to control the false discovery rate (FDR), and retaining windows that satisfy both p_global < 0.05 and FDR < 0.05 as globally significant candidate windows; the statistical test step further comprises a local significance test that performs the following on each globally significant candidate window: a. taking all windows within 5kb upstream and downstream of the center of the candidate window as the local background, approximating the FDI values of the local background with a normal distribution, and calculating its mean μ and standard deviation σ; b. calculating the right-tail local significance of the candidate window relative to the local background as p_local = 1 - Φ((FDI - μ)/σ), wherein Φ is the cumulative distribution function of the standard normal distribution; c. retaining windows with p_local < 0.05; and finally, the statistical test step comprises a region integration step in which windows that pass both the global and local significance tests and lie no more than 200bp apart on the genome are merged to form the final candidate chromatin accessibility regions.
- 6. The device of claim 2, wherein the fragmentomics features include the fragment dispersion index FDI, the orientation-aware cfDNA fragmentation feature OCF, the transcription start site neighborhood coverage TSS, and the nucleosome-depleted region coverage NDR; and the construction of the feature vector comprises calculating the 4 features for each identified cancer-specific chromatin accessibility marker interval, grouping the intervals by cancer type, and calculating the arithmetic mean of each of the 4 features within each cancer type, thereby obtaining a single value representing each cancer type in each feature dimension; the resulting 4-dimensional mean-feature vectors of all cancer types together form the feature vector.
- 7. The device of claim 6, wherein the orientation-aware cfDNA fragmentation feature OCF is obtained by calculating the directional difference or ratio between the numbers of cfDNA fragment 5' ends mapped to the positive and negative strands of the genome within a specific region, wherein a higher OCF value indicates higher accessibility.
- 8. The device of claim 6, wherein the transcription start site neighborhood coverage TSS is obtained by taking the center of each cancer-specific chromatin accessibility marker region as an anchor point, calculating the average cfDNA coverage depth within a 2kb window spanning 1kb upstream and downstream of it, and performing RPKM normalization, wherein a lower TSS value indicates higher accessibility; and wherein the nucleosome-depleted region coverage NDR is obtained by calculating the average cfDNA coverage depth within a 500bp window spanning 250bp upstream and downstream of the center of each cancer-specific chromatin accessibility marker region, and calculating its ratio to the total coverage depth of the TSS interval computed for the same marker region, wherein a lower NDR value indicates higher accessibility.
- 9. The device of claim 2, wherein the multi-class machine learning model is constructed based on the H2O AutoML framework, and wherein inferring the tissue of origin further comprises outputting a Top-2 cancer type prediction result.
- 10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of claim 1.
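The background-set filtering recited in claim 4 (merge background intervals less than 20bp apart, then drop target regions whose overlap ratio exceeds 0.5) can be illustrated with a minimal sketch. The function names, the half-open `(start, end)` interval representation, and the restriction to a single chromosome are assumptions made for illustration and do not come from the patent text.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # assumed half-open [start, end) on one chromosome


def merge_intervals(intervals: List[Interval], max_gap: int = 20) -> List[Interval]:
    """Merge overlapping intervals and intervals separated by less than max_gap bp."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] < max_gap:
            # Close enough to the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlap_ratio(region: Interval, background: List[Interval]) -> float:
    """Bases of `region` covered by a merged (non-overlapping) background,
    divided by the length of `region` itself."""
    start, end = region
    covered = sum(
        max(0, min(end, b_end) - max(start, b_start)) for b_start, b_end in background
    )
    return covered / (end - start)


def specific_markers(
    target: List[Interval],
    others: List[List[Interval]],
    max_gap: int = 20,
    max_ratio: float = 0.5,
) -> List[Interval]:
    """Keep target regions whose overlap ratio with the background is at most max_ratio."""
    background = merge_intervals([iv for regions in others for iv in regions], max_gap)
    return [r for r in target if overlap_ratio(r, background) <= max_ratio]
```

Here `others` would hold the candidate regions of every non-target cancer type plus the healthy control group; a genome-wide version would apply the same logic per chromosome.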
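Parts of the statistical test step in claim 5 can be sketched with the Python standard library alone: Benjamini-Hochberg adjustment of the p_global values, the local right-tail test against a ±5kb background, and merging of surviving windows no more than 200bp apart. The Beta fitting that produces p_global is not shown, since it would need a numerical library; the p_global values are assumed to be given. Keying windows by their center coordinate, excluding the candidate window from its own background, using the window's FDI value in the z-score, and all function names are assumptions of this sketch.

```python
from statistics import NormalDist, mean, pstdev
from typing import Dict, List, Tuple


def benjamini_hochberg(p_values: List[float]) -> List[float]:
    """Return BH-adjusted p-values (FDR estimates) in the original order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


def local_p_value(fdi_by_center: Dict[int, float], center: int, flank: int = 5000) -> float:
    """p_local = 1 - Phi((FDI - mu) / sigma) against windows within +/- flank bp.
    Assumption: the candidate window itself is left out of the background."""
    background = [
        value
        for pos, value in fdi_by_center.items()
        if pos != center and abs(pos - center) <= flank
    ]
    if len(background) < 2:
        return 1.0  # not enough background to test; treat as not significant
    mu, sigma = mean(background), pstdev(background)
    if sigma == 0:
        return 1.0
    return 1.0 - NormalDist().cdf((fdi_by_center[center] - mu) / sigma)


def merge_windows(windows: List[Tuple[int, int]], max_gap: int = 200) -> List[Tuple[int, int]]:
    """Merge significant windows whose gap on the genome is at most max_gap bp."""
    merged: List[Tuple[int, int]] = []
    for start, end in sorted(windows):
        if merged and start - merged[-1][1] <= max_gap:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

A window would be kept only if its p_global and adjusted value are both below 0.05 and its `local_p_value` is below 0.05, after which `merge_windows` forms the candidate regions.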
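The feature vector construction of claim 6 amounts to a per-cancer-type arithmetic mean of four per-marker features, concatenated in a fixed order. The sketch below assumes a hypothetical input layout (a dictionary from cancer type to a list of per-marker feature dictionaries) and a hypothetical function name; the normalization mentioned in claim 1 is not shown.

```python
from statistics import fmean
from typing import Dict, List

FEATURES = ("FDI", "OCF", "TSS", "NDR")  # the four features named in claim 6


def build_feature_vector(
    marker_features: Dict[str, List[Dict[str, float]]],
    cancer_types: List[str],
) -> List[float]:
    """Concatenate, in a fixed cancer-type order, the mean of each feature
    over that cancer type's marker intervals (4 values per cancer type)."""
    vector: List[float] = []
    for cancer in cancer_types:
        markers = marker_features[cancer]
        for name in FEATURES:
            vector.append(fmean(m[name] for m in markers))
    return vector
```

With the 17 cancer types listed in claim 3, this layout would give a 68-dimensional vector per sample; passing `cancer_types` explicitly keeps the column order identical between training and prediction.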
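The coverage features of claim 8 can be illustrated on a per-base depth array. The sketch below assumes a hypothetical list `depth` indexed by genomic position on one chromosome, and it takes the NDR ratio between the mean depth of the central 500bp window and the mean depth of the 2kb window, which is one possible reading of the claim's ratio to the TSS interval coverage. The RPKM helper uses the standard definition (reads per kilobase of region per million mapped reads); how the patent applies it to average depth is not specified in the text.

```python
from typing import Sequence


def mean_depth(depth: Sequence[float], center: int, half_width: int) -> float:
    """Average per-base depth in [center - half_width, center + half_width)."""
    window = depth[max(0, center - half_width): center + half_width]
    return sum(window) / len(window)


def rpkm(reads_in_region: int, region_length_bp: int, total_mapped_reads: int) -> float:
    """Standard RPKM: reads per kilobase of region per million mapped reads."""
    return reads_in_region * 1e9 / (region_length_bp * total_mapped_reads)


def ndr_ratio(depth: Sequence[float], center: int) -> float:
    """Assumed NDR reading: central 500bp mean depth over 2kb mean depth.
    A lower value suggests a deeper coverage dip, i.e. higher accessibility."""
    return mean_depth(depth, center, 250) / mean_depth(depth, center, 1000)
```

For a marker whose center sits in a coverage dip, `ndr_ratio` falls below 1, in line with the claim's statement that lower NDR values indicate higher accessibility.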
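The Top-2 output of claim 9 is a ranking of the per-cancer-type probability scores. A minimal sketch, assuming the classifier's scores are available as a plain dictionary (the H2O AutoML calls that would produce them are not shown, and the function name is hypothetical):

```python
from typing import Dict, List, Tuple


def top_k_origins(probabilities: Dict[str, float], k: int = 2) -> List[Tuple[str, float]]:
    """Return the k cancer types with the highest probability scores."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]
```

Reporting the two highest-scoring types rather than one lets a downstream reader see when the runner-up is close to the top prediction.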
Description
Tissue origin deducing method and device based on cancer specific chromatin accessibility marker
Technical Field
The invention relates to a method for identifying cancer-specific chromatin accessibility markers based on sWGS cfDNA, a tissue origin inference device, and a computer storage medium, belonging to the technical field of gene detection.
Background
Chromatin accessibility refers to the degree to which DNA, in the context of three-dimensional chromatin folding within the cell, can be accessed and bound by transcription factors, RNA polymerase, and various regulatory complexes. It is essentially determined by nucleosome occupancy and positioning, histone modifications, and the dynamics of remodeling factors: open regions (such as promoters and enhancers) have sparse nucleosomes and are readily contacted by proteins, whereas closed regions have densely packed nucleosomes and restricted binding. In recent years this concept has been carried over to plasma cfDNA research. Because nucleosome protection and regulatory protein binding leave a "footprint" in the fragmentation pattern, cfDNA retains open-chromatin information from its tissue of origin. Building on the physical principle that open regions are more easily cut by endonucleases, which yields shorter and more dispersed fragment ends, researchers have combined fragment length distributions, end motifs, and nucleosome footprints to reconstruct tissue-specific accessibility maps noninvasively, and have used histone-modification methods such as cfChIP-seq to trace regulatory activity from peripheral blood. There is substantial evidence that active chromatin fragments in plasma are highly correlated with annotated promoters/enhancers, RNA Pol II pausing sites, and gene expression levels, that they distinguish the open-chromatin profiles of multiple tissues, and that they indicate tumor or organ origin.
Footprint analysis based on nucleosome occupancy/spacing has also been used for disease typing and lesion discrimination and has been validated in real clinical settings. Compared with signals such as mutation, copy number, and methylation, accessibility markers are functional (they reflect gene regulatory activity), specific (sensitive to tissue/organ open-chromatin profiles), and real-time (suitable for dynamic monitoring), and they offer greater interpretability and application potential in tissue tracing, early screening, and evaluation of treatment response and recurrence.
Tissue-of-origin (TOO) inference for cancer is a key step in early screening, diagnosis, and determining the origin of metastases. Current approaches rely mainly on three classes of signal: copy number/mutation, methylation, and fragmentomics. The first two are limited in sensitivity or cost at low tumor burden, while fragmentomics is susceptible to coverage fluctuation and sequencing bias, and its background modeling and reproducibility still need improvement.
Disclosure of Invention
The patent provides a low-cost engineering workflow for shallow WGS. It characterizes the open state with a composite index of end dispersion and coverage fluctuation, screens robust candidate segments through global and local statistical tests, merges neighboring windows to generate accessibility features that can be used directly for learning and modeling, and then feeds the resulting openness signals into a tissue-tracing model to achieve multi-cancer origin determination and clinical monitoring.
The scheme reduces the dependence on high-depth targeted capture, suppresses sequencing and spatial bias, improves signal stability and statistical reliability, is scalable and practical to deploy, and is suitable for clinical scenarios such as early cancer screening, identification of unknown primary sites, evaluation of treatment response, and early warning of recurrence; it has novelty and practical value. The method constructs a composite index FDI = EDI × std(coverage) from fragment end dispersion and coverage fluctuation; scans the whole genome with a sliding window, filtering on coverage, dark regions, and mappability; fits a Beta distribution to the normalized FDI to compute global significance with BH correction; performs a local normal test on the ±5 kb neighborhood; retains only windows significant in both tests; merges neighboring windows into candidate regions; screens cancer-specific candidate regions by removing intervals shared across cancer types/tissues; extracts a multidimensional accessibility feature set; and builds a machine learning model, thereby reducing cost and improving accuracy and deployability.
A method of tissue origin inference based on cancer specific chromatin accessibility markers, comprising the steps of: a) Providing free DNA sequencing data of a sample to be tested, wherein the data is subjected to the processes of removing the linker and low quality sequence