CN-121983119-A - Gene sequencing data management method for whole genome methylation sequencing
Abstract
The invention discloses a gene sequencing data management method for whole genome methylation sequencing, relating to the technical field of gene sequencing data management. The method comprises: obtaining raw gene sequencing data and preprocessing it to generate a regional methylation matrix; calculating a technical difference metric and a phenotype difference metric for each region unit based on the regional methylation matrix and sample metadata, and identifying technical methylation bias domains; classifying the technical methylation bias domains into steady-state technical methylation bias domains and event-driven technical methylation bias domains according to their occurrence frequencies across a plurality of batches; constructing technical methylation bias hyper-domains based on linear adjacency and co-bias relationships, and classifying them; masking the regions of steady-state technical methylation bias hyper-domains and applying normalization correction to event-driven technical methylation bias hyper-domains to generate a corrected regional methylation matrix; and calculating a sample quality index based on the corrected matrix and performing quality classification.
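As an illustration only, the bias-domain identification and frequency-based classification steps summarized above might be sketched in Python as follows. The function names, the use of the range of group means as the difference metric, and the definition of the occurrence frequency as flagged batches divided by analyzed batches are assumptions made for this sketch; the text does not fix any of them.

```python
from statistics import mean


def group_means(levels, groups):
    """Mean regional methylation level per group; None marks a missing value."""
    buckets = {}
    for level, group in zip(levels, groups):
        if level is not None:
            buckets.setdefault(group, []).append(level)
    return {g: mean(v) for g, v in buckets.items()}


def spread(levels, groups):
    """Assumed difference metric: range (max - min) of the group means."""
    means = list(group_means(levels, groups).values())
    return max(means) - min(means) if len(means) > 1 else 0.0


def is_bias_domain(levels, tech_groups, pheno_groups, tech_threshold, pheno_limit):
    """Flag one region unit: technical metric above the threshold while the
    phenotype metric stays at or below its upper limit."""
    return (spread(levels, tech_groups) > tech_threshold
            and spread(levels, pheno_groups) <= pheno_limit)


def classify_by_frequency(flagged_batches, total_batches, steady_threshold, event_lower):
    """Classify one region unit from its bias occurrence frequency F(r).

    F(r) is assumed here to be flagged_batches / total_batches, with
    0 < event_lower < steady_threshold.
    """
    freq = flagged_batches / total_batches
    if freq >= steady_threshold:
        return "steady_state"
    if freq >= event_lower:
        return "event_driven"
    return "unclassified"
```

Under these assumptions, a region unit flagged in 9 of 10 batches with a steady-state threshold of 0.8 would be labeled steady-state, while one flagged in 3 of 10 batches with a lower limit of 0.2 would be labeled event-driven.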
Inventors
- FENG JINYAO
- CHEN MENGYA
- WANG MINGXU
- Qi Xianjia
Assignees
- 上海旭燃生物科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20251216
Claims (9)
- 1. A method of managing gene sequencing data for whole genome methylation sequencing, comprising the steps of: acquiring raw gene sequencing data of whole genome methylation sequencing, and preprocessing it to generate a regional methylation matrix, wherein the regional methylation matrix represents the regional methylation levels of a plurality of samples on a plurality of region units; calculating a technical difference metric and a phenotype difference metric of each region unit based on the regional methylation matrix and sample metadata, identifying technical methylation bias domains according to a preset technical bias determination threshold and a phenotype difference upper limit, and classifying the technical methylation bias domains into steady-state technical methylation bias domains and event-driven technical methylation bias domains based on the occurrence frequencies of the technical methylation bias domains in a plurality of batches; constructing technical methylation bias hyper-domains based on linear adjacency and co-bias relationships of the technical methylation bias domains on the genome, and classifying the technical methylation bias hyper-domains into steady-state technical methylation bias hyper-domains and event-driven technical methylation bias hyper-domains; for the event-driven technical methylation bias domains, dividing the samples into an event sample set and a control sample set according to the sample metadata, calculating an offset, and applying normalization correction to the regional methylation levels in the event sample set to generate a corrected regional methylation matrix; and, based on the corrected regional methylation matrix, calculating a sample quality index of each sample and classifying the samples according to a quality threshold.
- 2. The method for managing gene sequencing data for whole genome methylation sequencing according to claim 1, wherein acquiring and preprocessing the raw gene sequencing data of whole genome methylation sequencing comprises: receiving sequence read files divided by sample from a sequencing platform, the sequence read files being in FASTQ format; allocating a unique sample identifier to each sample based on the sample numbers pre-configured in the sequencing task, and establishing a one-to-one correspondence between sample identifiers and sequence read files to form a raw gene sequencing data set organized by sample; automatically acquiring the basic sample information and production information associated with each sample from the sequencing platform log and writing them into a sample metadata table, so that the sample metadata table establishes an index relationship with the raw gene sequencing data set; reading the base quality values of each sequence read, calculating an average base quality score, and discarding sequence reads whose average base quality score is smaller than a preset minimum average quality score; trimming consecutive low-quality bases at both ends of the retained reads, and discarding a read when its trimmed length is smaller than a preset length threshold; performing adapter sequence matching and trimming on the retained reads based on a pre-configured set of sequencing adapter sequences, and discarding trimmed reads shorter than the length threshold, to obtain quality-controlled sequencing data; aligning the quality-controlled sequencing data to an indexed reference genome of the sequenced species to generate an alignment result data set that records the alignment coordinates of each sequence read; and, when a sequence read has the same start and end alignment coordinates as another sequence read in the alignment result data set, marking it as a duplicate read and excluding it from subsequent analysis.
- 3. The gene sequencing data management method for whole genome methylation sequencing according to claim 2, wherein generating the regional methylation matrix comprises: predefining a set of region units on the reference genome, each region unit corresponding to a continuous genomic interval and having a unique region identifier; dividing the whole genome, starting from the start position of each chromosome, into non-overlapping fixed-length windows of a preset window length of L base pairs as candidate region units; reading functional annotation information covering CpG islands, CpG shores and promoter regions, marking candidate region units that overlap a functional annotation region as functional region units, and marking the rest as background region units; traversing, based on the alignment result data set, each read base aligned to a cytosine position of the reference genome, recording a methylated count when the read base is C and an unmethylated count when the read base is T; accumulating the site-level methylated and unmethylated counts onto the region units to obtain the regional methylated count and the regional coverage of each sample on each region unit; taking the ratio of the regional methylated count to the regional coverage as the regional methylation level of the sample on the region unit, and marking the regional methylation level as missing when the regional coverage is less than a preset threshold; and constructing the regional methylation matrix with the regional methylation levels as the matrix elements.
- 4. The gene sequencing data management method for whole genome methylation sequencing according to claim 1, wherein calculating the technical difference metric and the phenotype difference metric of each region unit based on the regional methylation matrix and the sample metadata comprises: recording in the sample metadata table, for each sample, at least a sample number, a sample type, a tissue source, a batch number, a sequencing platform model number, a reagent batch number, a library construction protocol, a sequencing run time and operator information; dividing the samples according to the sample grouping attributes recorded in the sample metadata table to obtain a sample grouping set G, the sample grouping attributes comprising technical attributes and biological attributes; selecting from the sample grouping set G the groupings obtained based on the batch number, the sequencing platform model number or the reagent batch number to form a technical evaluation grouping set; selecting from the sample grouping set G the groupings obtained based on the sample type or the tissue source to form a phenotype evaluation grouping set; calculating, for each region unit, the average regional methylation level of each technical grouping in the technical evaluation grouping set and the average regional methylation level of each phenotype grouping in the phenotype evaluation grouping set; and constructing the technical difference metric based on the differences in average regional methylation level between different technical groupings, and the phenotype difference metric based on the differences in average regional methylation level between different phenotype groupings.
- 5. The method of claim 4, wherein identifying the technical methylation bias domains comprises: configuring a technical bias determination threshold and a phenotype difference upper limit for the region units; when the technical difference metric of a region unit is greater than the technical bias determination threshold and its phenotype difference metric is less than or equal to the phenotype difference upper limit, determining that the region unit has a significant technical methylation bias in the current sample set and labeling it as a technical methylation bias domain; and creating a region bias labeling table, and writing into it the region identifier of each region unit labeled as a technical methylation bias domain, the corresponding bias score, and the primary sample grouping attribute that causes the bias.
- 6. The method according to claim 1, wherein classifying the technical methylation bias domains into steady-state technical methylation bias domains and event-driven technical methylation bias domains based on their occurrence frequencies in the plurality of batches comprises: repeatedly performing the step of identifying technical methylation bias domains while processing a plurality of sequencing batches or projects, and accumulating the technical methylation bias domain labeling results of each region unit across the different batches; counting, over all processed batches, the number of batches in which the region unit is labeled as a technical methylation bias domain and the total number of batches analyzed, and calculating the bias occurrence frequency F(r), where r is the identifier of the region unit RU; presetting a steady-state frequency threshold and an event-driven frequency lower limit, the event-driven frequency lower limit being greater than 0 and less than the steady-state frequency threshold; when the bias occurrence frequency F(r) is greater than or equal to the steady-state frequency threshold, marking the region unit r as a steady-state technical methylation bias domain; when F(r) lies between the event-driven frequency lower limit and the steady-state frequency threshold, marking the region unit r as an event-driven technical methylation bias domain; and when F(r) is less than the event-driven frequency lower limit, recording the region unit r as an unclassified technical bias domain.
- 7. The gene sequencing data management method for whole genome methylation sequencing of claim 1, wherein constructing the technical methylation bias hyper-domains based on the linear adjacency and co-bias relationships of the technical methylation bias domains on the genome comprises: taking all region units labeled as technical methylation bias domains as nodes; judging, from the chromosome positions and interval boundaries of the region units, whether any two region units are located on the same chromosome with a physical distance smaller than a preset adjacency threshold, and if so, considering that a linear adjacency relationship exists between them, and otherwise that it does not; constructing a bias intensity vector for each technical methylation bias domain from its regional methylation levels across all samples, and calculating the Pearson correlation coefficient between the bias intensity vectors of any two technical methylation bias domains; when the correlation coefficient is larger than a preset correlation threshold, considering that the two technical methylation bias domains have a co-bias relationship; and clustering the technical methylation bias domains according to the linear adjacency and co-bias relationships, each resulting cluster being taken as one technical methylation bias hyper-domain.
- 8. The method of claim 1, wherein classifying the technical methylation bias hyper-domains into steady-state technical methylation bias hyper-domains and event-driven technical methylation bias hyper-domains and masking or normalizing the regional methylation levels comprises: counting, for each technical methylation bias hyper-domain, the number of steady-state technical methylation bias domains and the number of event-driven technical methylation bias domains it contains; marking the hyper-domain as a steady-state technical methylation bias hyper-domain when the proportion of steady-state technical methylation bias domains is greater than or equal to a preset steady-state proportion threshold, and otherwise as an event-driven technical methylation bias hyper-domain; when a region unit belongs to a steady-state technical methylation bias hyper-domain, masking the regional methylation level of that region unit; when a region unit belongs to an event-driven technical methylation bias hyper-domain, dividing the samples into an event sample set and a control sample set according to the batch number, sequencing platform model number or reagent batch number associated with that hyper-domain, calculating the control average methylation level and the event average methylation level of the region unit in the control sample set and the event sample set respectively, taking their difference as an offset, and applying normalization correction to the regional methylation levels, on that region unit, of the samples belonging to the event sample set; and, after masking and normalization correction have been completed for all samples and all region units, constructing the corrected regional methylation matrix with the corrected regional methylation levels as matrix elements, and recording in the matrix flag information indicating whether each region unit is masked and whether event-driven normalization correction has been applied.
- 9. The method for gene sequencing data management for whole genome methylation sequencing of claim 1, wherein calculating the sample quality index of each sample based on the corrected regional methylation matrix and classifying the samples according to the quality threshold comprises: counting, for each sample, the number of region units labeled before correction as technical methylation bias domains, steady-state technical methylation bias domains and event-driven technical methylation bias domains, and their proportion among all region units, and forming sample quality indicators in combination with quality control indicators such as the alignment rate, the unique alignment rate and the duplicate read ratio; counting the number of effective region units and the total number of region units of each sample in the corrected regional methylation matrix, an effective region unit being a region unit that is not marked as masked for the sample and whose methylation value is not marked as missing, and representing the effective coverage of the sample by the ratio of the number of effective region units to the total number of region units; taking the number of effective region units of the sample as the numerator and the product of the total number of region units and one plus the residual bias mean deviation of the sample as the denominator, and obtaining the sample quality index by division; marking the sample as a suspect sample or a sample to be retested when the sample quality index is smaller than a preset quality threshold, and as a quality-qualified sample when the sample quality index is not smaller than the preset quality threshold; and writing the sample quality index and the corresponding quality grade of the sample into a sample quality labeling table.
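The offset-based correction of claim 8 and the quality index of claim 9 can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the claimed implementation: the function names are hypothetical, the correction is assumed to subtract the event-minus-control offset, the clipping of corrected levels to [0, 1] is an added assumption, and the residual bias mean deviation is taken as a precomputed input because its calculation is not given in the claims.

```python
from statistics import mean


def correct_event_samples(levels, is_event):
    """Correct one region unit: shift event-sample levels by the offset
    (event mean minus control mean). None marks a missing or masked value."""
    control = [v for v, e in zip(levels, is_event) if not e and v is not None]
    event = [v for v, e in zip(levels, is_event) if e and v is not None]
    if not control or not event:
        return list(levels)  # nothing to compare against, leave unchanged
    offset = mean(event) - mean(control)
    # Assumption: subtract the offset and clip to the valid range [0, 1].
    return [min(1.0, max(0.0, v - offset)) if e and v is not None else v
            for v, e in zip(levels, is_event)]


def sample_quality_index(values, masked, residual_bias):
    """Quality index of one sample:
    effective units / (total units * (1 + residual bias mean deviation))."""
    n_total = len(values)
    if n_total == 0:
        raise ValueError("sample has no region units")
    n_effective = sum(1 for v, m in zip(values, masked)
                      if v is not None and not m)
    return n_effective / (n_total * (1.0 + residual_bias))


def quality_grade(index, threshold):
    """Grade a sample against the preset quality threshold."""
    return "qualified" if index >= threshold else "suspect_or_retest"
```

In this sketch a sample with 2 effective region units out of 4 and a residual bias mean deviation of 0.25 has a quality index of 2 / (4 × 1.25) = 0.4, so with a threshold of 0.5 it would be marked for review or retesting.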
Description
Gene sequencing data management method for whole genome methylation sequencing
Technical Field
The invention relates to the technical field of gene sequencing data management, in particular to a gene sequencing data management method for whole genome methylation sequencing.
Background
Whole genome methylation sequencing is increasingly used in epigenetic studies, but its data analysis suffers from significant technical bias. The bias comes mainly from non-biological factors such as differences between sequencing platforms, experimental batch effects and changes in the batch number of library construction reagents, and it can systematically distort the measurement of cytosine methylation levels. The prior art mainly controls technical variation by standardizing the experimental workflow or by simple statistical correction, but these approaches have clear limitations. On the one hand, experimental standardization cannot completely eliminate systematic bias, especially when the bias follows different patterns in specific genomic regions, such as being persistently present or occurring sporadically. On the other hand, traditional statistical correction methods rest on global assumptions and therefore have difficulty identifying and distinguishing complex bias patterns that are region-specific and batch-dependent. In addition, the prior art lacks mechanisms for systematically identifying, classifying and dynamically managing technical bias, and cannot establish a closed-loop management system running from bias identification through quality evaluation to strategy optimization. As a result, in practical applications technical bias may be misjudged as a real biological signal, or valuable biological signals may be lost through over-correction, which seriously affects the accuracy and reliability of downstream analysis. The present invention proposes a solution to the above-mentioned problems.
Disclosure of Invention
In order to overcome the above-mentioned drawbacks of the prior art, embodiments of the present invention provide a method for managing gene sequencing data for whole genome methylation sequencing, so as to solve the problems set forth in the background art. To achieve this purpose, the present invention provides the following technical solution: a method of gene sequencing data management for whole genome methylation sequencing, comprising the steps of: acquiring raw gene sequencing data of whole genome methylation sequencing, and preprocessing it to generate a regional methylation matrix, wherein the regional methylation matrix represents the regional methylation levels of a plurality of samples on a plurality of region units; calculating a technical difference metric and a phenotype difference metric of each region unit based on the regional methylation matrix and sample metadata, identifying technical methylation bias domains according to a preset technical bias determination threshold and a phenotype difference upper limit, and classifying the technical methylation bias domains into steady-state technical methylation bias domains and event-driven technical methylation bias domains based on the occurrence frequencies of the technical methylation bias domains in a plurality of batches; constructing technical methylation bias hyper-domains based on linear adjacency and co-bias relationships of the technical methylation bias domains on the genome, and classifying the technical methylation bias hyper-domains into steady-state technical methylation bias hyper-domains and event-driven technical methylation bias hyper-domains; for the event-driven technical methylation bias domains, dividing the samples into an event sample set and a control sample set according to the sample metadata, calculating an offset, and applying normalization correction to the regional methylation levels in the event sample set to generate a corrected regional methylation matrix; and, based on the corrected regional methylation matrix, calculating a sample quality index of each sample and classifying the samples according to a quality threshold.
In a preferred embodiment, acquiring and preprocessing the raw gene sequencing data of whole genome methylation sequencing comprises: receiving sequence read files divided by sample from a sequencing platform, the sequence read files being in FASTQ format; allocating a unique sample identifier to each sample based on the sample numbers pre-configured in the sequencing task, and establishing a one-to-one correspondence between sample identifiers and sequence read files to form a raw gene sequencing data set organized by sample; automatically obtaining the basic sample information and production information associated with each sample from the sequencing platform log and writing them into a sample metadata table, so that the sample metadata table establishes an index relationship with the raw gene s