
CN-122024836-A - Cancer marker screening method and system combining co-methylation module and cell component association analysis

CN122024836A

Abstract

The invention discloses a cancer marker screening method and system combining a co-methylation module and cell component association analysis, relating to the technical field of bioinformatics. The method comprises: obtaining DNA methylation sequencing data of a cancer tissue sample group and a control tissue sample group and preprocessing the data to generate a methylation matrix; obtaining differential sites by differential methylation analysis and merging adjacent sites into differential methylation fragments according to an adjacent merging rule; constructing a co-methylation network based on the correlation of fragment methylation levels and dividing it into modules to obtain module characteristic values; estimating cell components using methylation reference profiles and/or single-cell data; screening target modules according to the rule that the module characteristic value is correlated with the cancer cell proportion and not correlated with the non-cancer cell proportions; and outputting a cancer cell-specific candidate fragment set. This technical scheme suppresses interference from cell mixing and infiltration, improves the specificity and stability of the markers, and improves cross-cohort consistency as well as typing and diagnostic performance.
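The screening rule summarized above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patent specifies a correlation test with a significance threshold (claim 7 gives p < 0.05) but does not name the test, so a two-sided permutation test on the Pearson correlation is used here as a stand-in. The function names, the toy module values, and the toy cell proportions are all hypothetical.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def perm_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation (assumed test)."""
    rng = random.Random(seed)
    observed = abs(pearson(x, y))
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if abs(pearson(x, y_perm)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

def select_target_modules(module_values, cancer_prop, other_props, alpha=0.05):
    """Keep modules correlated with the cancer cell proportion (p < alpha)
    and not correlated with any non-cancer cell type proportion (p >= alpha)."""
    selected = []
    for name, values in module_values.items():
        if perm_pvalue(values, cancer_prop) >= alpha:
            continue
        if all(perm_pvalue(values, p) >= alpha for p in other_props.values()):
            selected.append(name)
    return selected

# Hypothetical toy data: 8 cancer tissue samples.
cancer_prop = [0.20, 0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
other_props = {
    "immune": [0.15, 0.05, 0.10, 0.20, 0.20, 0.10, 0.05, 0.15],
    "fibroblast": [0.10, 0.20, 0.20, 0.10, 0.10, 0.20, 0.20, 0.10],
}
module_values = {
    "M1": [-0.33, -0.27, -0.14, -0.06, 0.04, 0.17, 0.24, 0.35],  # tracks cancer cells
    "M2": [0.05, -0.16, -0.04, 0.14, 0.15, -0.06, -0.14, 0.06],  # tracks immune cells
}

targets = select_target_modules(module_values, cancer_prop, other_props)
print(targets)
```

In this toy data set only `M1` passes both conditions; `M2` follows the immune proportion and is rejected at the first condition. A parametric test could be substituted for the permutation test without changing the structure of the rule.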

Inventors

  • REN QIAOLING
  • PAN ZILONG
  • Deng Yingqing

Assignees

  • 杭州六次元基因科技有限公司

Dates

Publication Date
20260512
Application Date
20260128

Claims (10)

  1. A cancer marker screening method combining a co-methylation module and cell component association analysis, comprising: obtaining DNA methylation sequencing data of a cancer tissue sample group and a control tissue sample group, and preprocessing the DNA methylation sequencing data to obtain a CpG sites × samples methylation matrix; performing differential methylation analysis on the cancer tissue sample group and the control tissue sample group based on the methylation matrix to obtain a differential methylation site set, and merging the differential methylation site set according to a preset adjacent merging rule to obtain a differential DNA methylation fragment set; constructing a co-methylation network based on the methylation level correlation of the differential DNA methylation fragment set in the cancer tissue sample group, and dividing the co-methylation network into modules to obtain at least one co-methylation module and a module characteristic value of each co-methylation module; estimating the cell components of the cancer tissue sample group using cell type-specific methylation reference profile data and/or single-cell sequencing data to obtain a cancer cell proportion and one or more non-cancer cell type proportions for each cancer tissue sample; determining a target co-methylation module according to a preset screening rule based on the correlation of the module characteristic value of each co-methylation module with the cancer cell proportion and with the one or more non-cancer cell type proportions, wherein the preset screening rule comprises that the correlation between the module characteristic value and the cancer cell proportion meets a preset correlation judgment threshold, and the correlation between the module characteristic value and each of the one or more non-cancer cell type proportions does not meet the preset correlation judgment threshold; and determining the differential DNA methylation fragments contained in the target co-methylation module as cancer cell-specific differential DNA methylation fragments and outputting them as a cancer DNA methylation marker candidate set.
  2. The method of claim 1, wherein the DNA methylation sequencing data is at least one of whole-genome bisulfite sequencing (WGBS) data and reduced representation bisulfite sequencing (RRBS) data; and the preprocessing of the DNA methylation sequencing data comprises: performing quality control and filtering on the raw sequencing data, aligning the filtered reads to a reference genome, calculating the methylation level of CpG sites, and filtering the CpG sites according to a coverage threshold to generate the methylation matrix.
  3. The method of claim 1, wherein the differential methylation analysis uses one of a t-test, a Wilcoxon test, a beta-binomial model, and the MOABS algorithm, and the differential methylation site set is obtained under the condition that the absolute value of the methylation level difference meets a preset difference threshold, wherein the preset difference threshold is 0.2.
  4. The method of claim 1, wherein the preset adjacent merging rule comprises merging consecutive adjacent differential methylation sites into one differential DNA methylation fragment when adjacent differential methylation sites are no more than 200 bp apart on the reference genome and the number of consecutive adjacent differential methylation sites is no less than 3.
  5. The method of claim 1, wherein dividing the co-methylation network into modules comprises: calculating a methylation level correlation coefficient matrix for the differential DNA methylation fragment set in the cancer tissue sample group; converting the correlation coefficient matrix into an adjacency matrix based on a preset soft threshold and calculating a topological overlap matrix; and performing hierarchical clustering on the topological overlap matrix and applying dynamic tree cutting to obtain the co-methylation modules.
  6. The method of claim 5, wherein the module characteristic value is the first principal component of the methylation level matrix, in the cancer tissue sample group, of the differential DNA methylation fragments within the corresponding co-methylation module.
  7. The method of claim 1, wherein the cell component estimation uses the cell deconvolution algorithm UXM_deconv and outputs a cancer cell proportion and at least one non-cancer cell type proportion; and the preset correlation judgment threshold is a correlation significance threshold, wherein the module characteristic value and the cancer cell proportion meet the preset correlation judgment threshold when the correlation test p-value is less than 0.05, and the module characteristic value and the at least one non-cancer cell type proportion do not meet the preset correlation judgment threshold when the correlation test p-value is not less than 0.05.
  8. The method of claim 1, further comprising validating the cancer DNA methylation marker candidate set after outputting it, the validating comprising at least one of: extracting the methylation levels of the differential DNA methylation fragments corresponding to the cancer DNA methylation marker candidate set as features, constructing a machine learning classification model for distinguishing the cancer tissue sample group from the control tissue sample group, and validating the machine learning classification model using an external validation sample set to obtain a diagnostic performance evaluation result; and cross-validating the candidate set against a set of cell type-specific differential DNA methylation fragments derived from cell type-specific methylation reference profile data or single-cell sequencing data to verify the cross-cohort consistency and cell specificity of the candidate set.
  9. A cancer marker screening system combining a co-methylation module and cell component association analysis, characterized in that the system is configured to perform the cancer marker screening method combining a co-methylation module and cell component association analysis of any one of claims 1 to 8, and comprises: a sample data processing unit for obtaining DNA methylation sequencing data of the cancer tissue sample group and the control tissue sample group, and preprocessing the DNA methylation sequencing data to obtain a CpG sites × samples methylation matrix; a tissue difference analysis unit for performing differential methylation analysis on the cancer tissue sample group and the control tissue sample group based on the methylation matrix to obtain a differential methylation site set, and merging the differential methylation site set according to a preset adjacent merging rule to obtain a differential DNA methylation fragment set; a co-methylation dividing unit for constructing a co-methylation network based on the methylation level correlation of the differential DNA methylation fragment set in the cancer tissue sample group, and dividing the co-methylation network into modules to obtain at least one co-methylation module and a module characteristic value of each co-methylation module; a cell component estimation unit for estimating the cell components of the cancer tissue sample group using cell type-specific methylation reference profile data and/or single-cell sequencing data to obtain a cancer cell proportion and at least one non-cancer cell type proportion for each cancer tissue sample; a correlation feature screening unit for determining a target co-methylation module according to a preset screening rule based on the correlation of the module characteristic value of each co-methylation module with the cancer cell proportion and with the at least one non-cancer cell type proportion, wherein the preset screening rule comprises that the correlation between the module characteristic value and the cancer cell proportion meets a preset correlation judgment threshold, and the correlation between the module characteristic value and the at least one non-cancer cell type proportion does not meet the preset correlation judgment threshold; and a marker output unit for determining the differential DNA methylation fragments contained in the target co-methylation module as cancer cell-specific differential DNA methylation fragments and outputting them as a cancer DNA methylation marker candidate set.
  10. The cancer marker screening system of claim 9, further comprising a marker validation unit for validating the cancer DNA methylation marker candidate set, the validation comprising at least one of: obtaining DNA methylation sequencing data of an external validation sample set, extracting the methylation levels of the differential DNA methylation fragments corresponding to the cancer DNA methylation marker candidate set as features, constructing a machine learning classification model for distinguishing the cancer tissue sample group from the control tissue sample group, and validating the machine learning classification model using the external validation sample set to obtain a diagnostic performance evaluation result; and obtaining a set of cell type-specific differential DNA methylation fragments based on cell type-specific methylation reference profile data or single-cell sequencing data, and cross-validating the cancer DNA methylation marker candidate set against the set of cell type-specific differential DNA methylation fragments to verify the cross-cohort consistency and cell specificity of the candidate set.
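The adjacent merging rule of claim 4 (sites no more than 200 bp apart, at least 3 consecutive sites per fragment) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name, the `(chromosome, position)` input format, and the reading of "separated by" as the difference between site coordinates are assumptions.

```python
def merge_sites(sites, max_gap=200, min_sites=3):
    """Merge differential CpG sites into differential methylation fragments.

    sites: iterable of (chromosome, position) tuples.
    Returns a list of (chromosome, start, end, n_sites) tuples.
    """
    fragments = []
    run = []  # current run of consecutive nearby sites on one chromosome

    def close_run():
        # Keep the run only if it contains enough sites.
        if len(run) >= min_sites:
            fragments.append((run[0][0], run[0][1], run[-1][1], len(run)))

    for chrom, pos in sorted(sites):
        # Start a new run on a chromosome change or when the gap is too large.
        if run and (chrom != run[-1][0] or pos - run[-1][1] > max_gap):
            close_run()
            run = []
        run.append((chrom, pos))
    close_run()
    return fragments

# Hypothetical differential sites.
sites = [
    ("chr1", 100), ("chr1", 250), ("chr1", 420),   # gaps 150 and 170: one fragment
    ("chr1", 900), ("chr1", 1000),                 # only 2 sites: discarded
    ("chr2", 50), ("chr2", 120), ("chr2", 300), ("chr2", 310),
]
fragments = merge_sites(sites)
print(fragments)
# [('chr1', 100, 420, 3), ('chr2', 50, 310, 4)]
```

Sorting first means the input does not need to be ordered, and a chromosome change always closes the current run so that fragments never span chromosomes.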

Description

Cancer marker screening method and system combining co-methylation module and cell component association analysis

Technical Field

The invention relates to the technical field of bioinformatics, in particular to a cancer marker screening method combining a co-methylation module and cell component association analysis and a cancer marker screening system combining a co-methylation module and cell component association analysis.

Background

The occurrence and development of cancer are accompanied by marked epigenetic abnormalities. Among these, DNA methylation changes appear early, are relatively stable, and can be detected in both tissue samples and body fluid-derived DNA, and they have been widely used in cancer molecular typing, early detection, auxiliary diagnosis, and similar scenarios. The prior art generally constructs methylation profiles at the level of CpG sites or genomic fragments from whole-genome or reduced representation methylation sequencing data, and obtains candidate differential regions through differential methylation analysis between cancer tissue samples and control tissue samples, which then serve as potential markers for subsequent modeling and validation. However, in a tissue sample-based methylation marker screening workflow, cancer tissues often contain, in addition to cancer cells, multiple non-cancer cell components such as immune cells, fibroblasts, and endothelial cells. Differences in cell composition among samples can introduce a significant "component mixing" effect, so that some differential methylation signals reflect changes in cell proportions rather than methylation abnormalities of the cancer cells themselves. This reduces the cancer cell specificity and cross-cohort consistency of candidate markers and hinders their robust application in typing and early screening.
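The "component mixing" effect can be made concrete with a small worked example. The sketch below assumes the common linear mixture model, in which the bulk methylation level of a region is the proportion-weighted mean of the cell type-level methylation; the patent does not state this model explicitly, and all numbers and names are hypothetical.

```python
def bulk_methylation(proportions, cell_type_levels):
    """Bulk methylation of one region as a proportion-weighted mean
    of per-cell-type methylation levels (assumed linear mixture model)."""
    return sum(proportions[c] * cell_type_levels[c] for c in proportions)

# A region that is hypomethylated in immune cells only. Cancer cells and
# normal epithelial cells have the SAME methylation level here.
levels = {"epithelial_or_cancer": 0.80, "immune": 0.10}

control = {"epithelial_or_cancer": 0.90, "immune": 0.10}  # little infiltration
tumor = {"epithelial_or_cancer": 0.50, "immune": 0.50}    # heavy infiltration

control_level = bulk_methylation(control, levels)  # about 0.73
tumor_level = bulk_methylation(tumor, levels)      # about 0.45
difference = tumor_level - control_level           # about -0.28

print(round(control_level, 2), round(tumor_level, 2), round(difference, 2))
```

The absolute difference of about 0.28 exceeds the 0.2 difference threshold given in claim 3, so this region would be called differential even though the cancer cells themselves are unchanged. The signal comes entirely from immune infiltration, which is the kind of false candidate the module screening rule is meant to exclude.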
In this regard, the prior art has attempted to mitigate this heterogeneity interference through cell type-resolved sequencing or correction strategies based on cell composition estimation, but limitations remain in cost and usability. For example, single-cell DNA methylation sequencing can characterize methylation differences at cellular resolution, but its experimental and data processing costs are high, making it difficult to directly replace the conventional large-cohort screening workflow. When cell components are removed from methylation signals based on cell deconvolution results or similar approaches, regression residuals may be negative or corrected methylation levels may fall outside a reasonable range, so that the results are hard to interpret and insufficiently stable, which affects reliable screening and validation of candidate markers.

Disclosure of Invention

In view of the above problems, the invention provides a cancer marker screening method and system combining a co-methylation module and cell component association analysis. By combining differential methylation analysis with cell component estimation, it addresses the insufficient accuracy and specificity of cancer marker screening caused by cell component mixing in the prior art, and by constructing a co-methylation network and dividing it into modules, it can identify methylation modules related to cancer cell-specific changes. In addition, by incorporating cell component deconvolution analysis, the influence of non-cancer cells such as immune cells and fibroblasts can be further excluded, so that the screened markers have higher cancer cell specificity and the accuracy of cancer marker screening is improved. The method can provide reliable candidate sets for subsequent marker validation and clinical application, and has strong operability and broad application prospects.
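The network construction step referred to above is spelled out in claim 5 as: correlation matrix, soft-threshold adjacency matrix, topological overlap matrix. A minimal pure-Python sketch of those three computations follows. The patent gives neither the soft-threshold power nor the overlap formula, so the unsigned adjacency |r|^beta with beta = 6 and the standard WGCNA-style topological overlap definition are assumptions here; hierarchical clustering and dynamic tree cutting are omitted, and the fragment profiles are hypothetical.

```python
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def soft_threshold_adjacency(profiles, beta=6):
    """Unsigned adjacency a_ij = |cor(i, j)| ** beta (beta is an assumed value)."""
    n = len(profiles)
    return [[1.0 if i == j else abs(pearson(profiles[i], profiles[j])) ** beta
             for j in range(n)] for i in range(n)]

def topological_overlap(adj):
    """TOM_ij = (sum_u a_iu * a_uj + a_ij) / (min(k_i, k_j) + 1 - a_ij),
    with u ranging over nodes other than i and j, and k_i the connectivity of i."""
    n = len(adj)
    k = [sum(adj[i][u] for u in range(n) if u != i) for i in range(n)]
    tom = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            shared = sum(adj[i][u] * adj[u][j] for u in range(n) if u not in (i, j))
            tom[i][j] = (shared + adj[i][j]) / (min(k[i], k[j]) + 1.0 - adj[i][j])
    return tom

# Hypothetical methylation levels of 4 fragments across 6 cancer tissue samples.
profiles = [
    [0.10, 0.20, 0.30, 0.40, 0.50, 0.60],  # fragment 0
    [0.15, 0.22, 0.35, 0.41, 0.55, 0.58],  # fragment 1, co-varies with 0
    [0.12, 0.25, 0.28, 0.45, 0.48, 0.65],  # fragment 2, co-varies with 0
    [0.50, 0.20, 0.60, 0.30, 0.55, 0.25],  # fragment 3, unrelated pattern
]
adj = soft_threshold_adjacency(profiles)
tom = topological_overlap(adj)
print(round(tom[0][1], 3), round(tom[0][3], 3))
```

Fragments 0 to 2 end up with high mutual overlap while fragment 3 stays near zero, so clustering on the dissimilarity 1 - TOM would be expected to group the first three into one module. In practice this step is usually run with an established network analysis package rather than hand-written loops.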
To achieve the above object, the present invention provides a cancer marker screening method combining a co-methylation module and cell component association analysis, comprising: obtaining DNA methylation sequencing data of a cancer tissue sample group and a control tissue sample group, and preprocessing the DNA methylation sequencing data to obtain a CpG sites × samples methylation matrix; performing differential methylation analysis on the cancer tissue sample group and the control tissue sample group based on the methylation matrix to obtain a differential methylation site set, and merging the differential methylation site set according to a preset adjacent merging rule to obtain a differential DNA methylation fragment set; constructing a co-methylation network based on the methylation level correlation of the differential DNA methylation fragment set in the cancer tissue sample group, and dividing the co-methylation network into modules to obtain at least one co-methylation module and a module characteristic value of each co-methylation module; estimating the cellular