CN-121983122-A - Group epigenetic analysis method and system based on consensus peak identification

CN121983122ACN 121983122 ACN121983122 ACN 121983122ACN-121983122-A

Abstract

The invention provides a population epigenetic analysis method and system based on consensus peak identification, which belong to the technical field of bioinformatics, and the method comprises the steps of S1, identifying the consensus peak from a multi-sample peak file based on a position deviation threshold value and a minimum sample proportion, S2, calculating the coverage depth of the consensus peak based on fragments/reading segments to obtain an original peak intensity matrix, S3, eliminating GC preference through GC content partition local correction, S4, carrying out cross-sample normalization by adopting DESeq2, TMM and other methods, and S5, automatically removing implicit batch effects by utilizing latent variable analysis. The system correspondingly comprises five functional modules. The invention solves the problems of inconsistent peak coordinates, incomparable signals and large technical deviation interference in the multiple sample epigenetic data through an integrated flow, realizes the full-automatic generation from the original data to the peak intensity matrix with high quality and strong comparability, and remarkably improves the analysis efficiency and accuracy of large-scale researches of the group ChIP-seq, the ATAC-seq and the like.

Inventors

LI XINGWANG
GUO MINGYUE
Yan Jiapei
XUE ZHIFEI
GUO LIANGLIANG
ZENG HAOWEN

Assignees

华中农业大学

Dates

Publication Date: 20260505
Application Date: 20251223

Claims (10)

1. A population epigenetic analysis method based on consensus peak identification, comprising: S1, reading enrichment peak area files of all samples, scanning peak areas after sorting according to chromosomes, judging and merging peaks belonging to the same enrichment area according to a preset position deviation threshold value, screening out areas which appear stably in a group based on a preset minimum sample proportion, and generating a consensus peak set and a sample list in which all consensus peaks appear; S2, for each region in the consensus peak set, reading a sequence comparison file of a corresponding sample, calculating the coverage depth of the region based on a fragment or a reading segment according to the sequencing library type, and calculating the intensity index of each sample in each consensus peak based on the coverage depth to form an original peak intensity matrix; S3, extracting genome sequences of each consensus peak region and calculating GC content of the genome sequences, dividing the GC content into a plurality of intervals, generating correction factors based on the relation between local statistics and global statistics of peak intensities in each GC content interval, and correcting the original peak intensity matrix to obtain a GC corrected matrix; S4, performing scale correction on the GC corrected matrix by adopting a selected normalization method to obtain a normalized peak intensity matrix; S5, carrying out data transformation on the normalized peak intensity matrix, estimating the number of potential variables of the implicit batch effect, removing the influence represented by the potential variables through a regression model, carrying out inverse transformation and post-processing, and outputting the finally corrected peak intensity matrix.
2. The method according to claim 1, wherein in the step S1, the positional shift threshold is set to a value ranging from 50 to 150 base pairs, and the minimum sample ratio is set to a value ranging from 0.1 to 0.3.
3. The method of claim 1, wherein in step S2, the intensity index comprises at least one of maximum coverage, average coverage, coverage median, peak area, segment coverage area calculated based on paired read template length for double ended sequencing data, and coverage area calculated based on read length for single ended sequencing data.
4. The method according to claim 1, wherein in the step S3, the GC content is uniformly divided into 50to 200 intervals, and the correction factor is determined according to the ratio of the local average value of the peak intensities in each GC interval to the global average value of all the peak intensities.
5. The method according to claim 1, wherein in the step S4, the normalization method includes at least one of a DESeq2 size factor method, a TMM method, a CPM method, a TPM method, and an upper quartile method.
6. The method according to claim 1, wherein the step S5 comprises performing asinh transformation on the data matrix, automatically estimating the number of latent variables by utilizing latent variable analysis, constructing a linear model comprising a known biological variable design matrix and the latent variables, performing regression correction, and performing sinh inverse transformation and minimum translation on the residual.
7. The method according to claim 1, wherein in the step S1, for each finally determined consensus peak, the start-stop coordinates thereof are the average of the start-stop coordinates of all the combined sample peak areas.
8. The method of claim 1, further comprising the step of visualizing a quality assessment of the final corrected peak intensity matrix, including generating at least one of a peak intensity profile, an inter-sample correlation heat map, a principal component analysis map, a MA map.
9. A population epigenetic analysis system for carrying out the method of any one of claims 1 to 8, comprising: the peak intensity calculation module is used for executing the step S2 to generate an original peak intensity matrix; the GC correction module is used for executing the step S3 and generating a GC corrected matrix; The normalization module is used for executing the step S4, providing a plurality of normalization methods and outputting a normalization matrix; And the batch effect correction module is used for executing the step S5, removing the implicit batch effect and outputting a final matrix.
10. The system of claim 9, further comprising a visualization module for performing said step S6 to generate a data quality assessment chart.

Description

Group epigenetic analysis method and system based on consensus peak identification Technical Field The invention relates to the technical field of bioinformatics, in particular to a population epigenetic analysis method and system based on consensus peak identification. Background With the development of high-throughput sequencing technology, the research of group epigenetic genetics based on technologies such as ChIP-seq, ATAC-seq and the like is increasing, and aims to analyze the variation rule of chromatin states in the group and the association of the variation rule with the characteristics. The core of such studies is to construct a "Peak" (Peak) that is comparable across samples and accurate in signalA sample "matrix". However, prior art flows face a number of challenges in processing multi-sample data. Firstly, because of factors such as experimental noise and comparison difference, the start-stop coordinates of the independently identified enrichment peaks (Peak) of different samples often deviate on the genome, so that the Peak correspondence relationship across samples is difficult to directly establish. Simple coordinate merging strategies fail to accurately identify the true common peak (ConsensusPeak). Second, quantification of peak intensities (e.g., coverage) is systematically disturbed by a variety of technical factors such as sequencing depth, library type (single ended SE or double ended PE), GC content preference (GC-bias), and implicit batch effect (BatchEffect). Uncorrected peak intensity matrices fail to reflect true biological differences, severely impacting the accuracy of subsequent differential analysis, clustering, and epigenetic Quantitative Trait Locus (QTL) mapping. Existing tools focus on a single link of the data processing flow (such as peak detection, normalization or batch correction), and lack an integrated, automated solution from shared peak definition, accurate signal quantization, systematic bias correction to batch effect removal. Researchers often need to manually connect a plurality of independent tools in series, the flow is tedious and error-prone, and consistency of parameters and logics among the steps is difficult to ensure, and the requirements of large-scale group research on data quality and analysis efficiency cannot be met. Therefore, there is a need to develop an integrated analysis method and system that can systematically solve the above problems. Disclosure of Invention The invention aims to solve at least one technical problem in the prior art, and provides a population epigenetic analysis method and system based on consensus peak identification. In a first aspect, an embodiment of the present invention provides a population epigenetic analysis method based on consensus peak identification, including: S1, reading enrichment peak area files of all samples, scanning peak areas after sorting according to chromosomes, judging and merging peaks belonging to the same enrichment area according to a preset position deviation threshold value, screening out areas which appear stably in a group based on a preset minimum sample proportion, and generating a consensus peak set and a sample list in which all consensus peaks appear; S2, for each region in the consensus peak set, reading a sequence comparison file of a corresponding sample, calculating the coverage depth of the region based on a fragment or a reading segment according to the sequencing library type, and calculating the intensity index of each sample in each consensus peak based on the coverage depth to form an original peak intensity matrix; S3, extracting genome sequences of each consensus peak region and calculating GC content of the genome sequences, dividing the GC content into a plurality of intervals, generating correction factors based on the relation between local statistics and global statistics of peak intensities in each GC content interval, and correcting the original peak intensity matrix to obtain a GC corrected matrix; S4, performing scale correction on the GC corrected matrix by adopting a selected normalization method to obtain a normalized peak intensity matrix; S5, carrying out data transformation on the normalized peak intensity matrix, estimating the number of potential variables of the implicit batch effect, removing the influence represented by the potential variables through a regression model, carrying out inverse transformation and post-processing, and outputting the finally corrected peak intensity matrix. Further, in the step S1, the value range of the position offset threshold is 50 to 150 base pairs, and the value range of the minimum sample ratio is 0.1 to 0.3. Further, in the step S2, the intensity index includes at least one of a maximum coverage, an average coverage, a coverage median, and a peak area, a segment coverage area is calculated based on a pair-wise read template length for double-ended sequencing data, and a coverage area is calculated based on a read le