CN-121393548-B - Mutation background noise filtering calculation method based on second-generation sequencing data


Abstract

The invention discloses a mutation background noise filtering calculation method based on second-generation sequencing data. It relates to the technical field of high-throughput sequencing data analysis and aims to solve the problems of the prior art: dependence on a baseline, poor adaptability, and a high false-positive rate. The algorithm first preprocesses the sequencing data to obtain a pileup file, classifies sites according to sequence context, calculates error rates, constructs a layered background noise model to test mutations, combines the test results with the sample type to distinguish real mutations from background noise, and simultaneously evaluates sample quality. The method does not require a baseline to be constructed in advance, is applicable to both hybrid capture and amplicon sequencing, improves the specificity of mutation detection and the consistency between replicate samples so that results are more reliable, and is suitable for scenarios such as tumor gene testing.

Inventors

HAN TIANCHENG
DU BO
HUANG YU
LIU DAN
FAN RUI
TAN YUNTAO
FAN XINPING
CHEN JINXIANG
LUO LEI
CHEN WEIZHI

Assignees

臻和（北京）科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251023

Claims (9)

1. A mutation background noise filtering calculation method based on second-generation sequencing data, characterized by comprising the following steps: S1, preprocessing the sequencing data of a sequencing sample to obtain a pileup file of the sample at the sites of each capture interval or amplicon interval; S2, classifying potential noise sites according to the base type of the current site, the sequence context of the mutation position, the site depth and the sequencing direction, and calculating, for each class of sites, the global background noise error rate of each base error type and the error rate at a specific depth, wherein the specific depth interval is (d-w/2, d+w/2) and w is a preset depth interval width; S3, constructing a first-layer background noise model for each error type of each base, estimating a first-layer error rate, and performing a first-layer test on the sample mutations based on the first-layer background noise model; S4, counting, for each error type, the number of potential noise sites that pass the first-layer test and, if that number is greater than or equal to a preset threshold v, constructing a second-layer background noise model for the error type and performing a second-layer test on the matching mutations based on the second-layer background noise model; and S5, classifying single-base mutations and multi-base mutations according to the test results, distinguishing real mutations from background noise, and evaluating the quality of the sample.
2. The mutation background noise filtering calculation method based on second-generation sequencing data, characterized in that, in S1, the preprocessing comprises: S1.1, aligning the base sequences obtained by sequencing to a human genome reference sequence to obtain an original bam file; S1.2, if the sequencing sample is from capture sequencing, de-duplicating the original bam file, and if the sequencing sample is from amplicon sequencing, not de-duplicating; S1.3, performing local realignment on the bam file to obtain a processed bam file; S1.4, using the processed bam file, counting at each base site within the capture interval the number of positive-strand reads and negative-strand reads with base quality > BQ and mapping quality > MQ, and outputting, as the pileup file, the positive-strand and negative-strand support numbers of each base type at that site that satisfy base quality > BQ and mapping quality > MQ.
3. The mutation background noise filtering calculation method based on second-generation sequencing data, characterized in that S2 specifically comprises the following steps: S2.1, base classification: the base type of the current site has 4 possibilities, namely A, T, C and G; the sequence context comprises the n bases upstream and the m bases downstream of the current site; each of the n upstream positions has four possibilities (A, T, C, G), giving 4^n types, and each of the m downstream positions has four possibilities (A, T, C, G), giving 4^m types; combining the base types of the current site, its upstream bases and its downstream bases gives a total of 4^(n+m+1) base types, wherein n and m are preset positive integers; S2.2, screening potential noise sites: the positive-strand data and negative-strand data are merged and the mutation frequency and support number are calculated; if the mutation frequency is greater than or equal to f and the support number is greater than or equal to S, wherein f is a preset mutation frequency threshold and S is a preset mutation support number threshold, the site is judged to be a potential real mutation site and is excluded, and the remaining sites are potential noise sites; S2.3, calculating the global background noise error rate of each error type for each of the 4^(n+m+1) base types, wherein the global background noise error rate is the ratio of the total error support number to the total depth; S2.4, calculating the error rate of each error type at a specific depth d for each of the 4^(n+m+1) base types, wherein the error rate at the specific depth is the ratio of the total error support number to the total depth within the specific depth interval, the specific depth interval is (d-w/2, d+w/2), and w is the preset depth interval width.
4. The mutation background noise filtering calculation method based on second-generation sequencing data, wherein S3 specifically comprises: selecting one of a depth-independent constant model, a depth-dependent power function model, a depth-dependent exponential model or a threshold model as the first-layer background noise model, estimating the parameters of each model formula by the least squares method or the maximum likelihood method, and selecting the optimal formula and parameters among them; calculating the first-layer error rate e for all mutations of the sample with the selected optimal formula and parameters, and performing a first-layer test; the first-layer test is a binomial test whose null hypothesis is that the mutation support number follows the binomial distribution Binomial(D, e), and p values are calculated for the positive strand and the negative strand, wherein e is the first-layer error rate estimated by the first-layer background noise model and D is the site depth of the corresponding sequencing strand.
5. The mutation background noise filtering calculation method based on second-generation sequencing data of claim 4, wherein in S3 the first-layer background noise model is one of a depth-independent constant model, a depth-dependent power function model, a depth-dependent exponential model or a threshold model, and wherein: the depth-independent constant model is e = g, where g is the global background noise error rate calculated in S2.3; the depth-dependent power function model has two forms [formulas not preserved in the source text], in one of which the first-layer error rate increases with decreasing depth and in the other of which the first-layer error rate decreases with decreasing depth, a and b being the model parameters; the depth-dependent exponential model has a form [formula not preserved in the source text] in which the first-layer error rate increases with decreasing depth, and an alternative form [formula not preserved in the source text], where a and b are the model parameters, g is the global background noise error rate calculated in S2.3, and D is the median depth of the sample; the form of the threshold model depends on the site depth: when the site depth is ≥ maxD, e = g, where g is the global background noise error rate calculated in S2.3; when the site depth is less than maxD, the depth-dependent power function model or the depth-dependent exponential model is used; the power function model then has two forms [formulas not preserved in the source text], one in which the first-layer error rate increases with decreasing depth and one in which it decreases with decreasing depth, and the exponential model likewise has two forms [formulas not preserved in the source text], one in which the first-layer error rate increases with decreasing depth and one in which it decreases with decreasing depth, where b is the model parameter and g and maxD are constant terms.
6. The mutation background noise filtering calculation method based on second-generation sequencing data, characterized in that, in S4, the second-layer background noise model is constructed from the potential noise sites that pass the first-layer test; its model form is the same as that of the first-layer background noise model; its parameters are estimated by the maximum likelihood method and iteratively optimized until the number of potential noise sites passing the test is lower than the threshold v; and the second-layer test is a binomial test that calculates p values for the positive strand and the negative strand.
7. The mutation background noise filtering calculation method based on second-generation sequencing data, characterized in that, in S5, the rule for classifying single-base mutations is as follows: for amplicon sequencing samples, a mutation is considered a real mutation when it meets the following conditions and is otherwise considered background noise: if no second-layer model was constructed for the error type of the mutation, it is required that the positive-strand p value < threshold or the negative-strand p value < threshold; if a second-layer model was constructed for the error type of the mutation, it is required that the positive-strand p value < threshold and the positive-strand second-layer p value < threshold, or that the negative-strand p value < threshold and the negative-strand second-layer p value < threshold; for hybrid capture sequencing samples, a mutation is considered a real mutation when it meets the following conditions and is otherwise considered background noise: if no second-layer model was constructed for the error type of the mutation, it is required that the positive-strand p value < threshold and the negative-strand p value < threshold; if a second-layer model was constructed for the error type of the mutation, it is required that the positive-strand p value < threshold and the positive-strand second-layer p value < threshold, and that the negative-strand p value < threshold and the negative-strand second-layer p value < threshold.
8. The mutation background noise filtering calculation method based on second-generation sequencing data, characterized in that, in S5, the rule for classifying multi-base mutations is that the multi-base mutation is decomposed into a plurality of single-base mutations; if all of the single-base mutations are judged to be background noise, the multi-base mutation is background noise; otherwise, the multi-base mutation is a real mutation.
9. The mutation background noise filtering calculation method based on second-generation sequencing data as set forth in claim 8, wherein S5 further comprises a sample quality evaluation step, specifically comprising: calculating the overall background noise error rate and the detection limit of each error type, wherein overall background noise error rate = max(average first-layer error rate, second-layer error rate) and detection limit = max(first-layer minimum supporting reads number, second-layer minimum supporting reads number) / sample bit depth; and merging and evaluating the complementary error types, the sample quality being judged abnormal if the detection limit exceeds a preset threshold.
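
The site classification and error-rate calculation of claim 3 (S2.1, S2.3 and S2.4) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, the tuple layout of `sites` and the example values are hypothetical, and the strand is kept in the class key because S2 also classifies sites by sequencing direction.

```python
from collections import defaultdict

BASES = "ATCG"


def n_context_classes(n, m):
    """Number of base types formed by n upstream bases, the current base and m downstream bases."""
    return len(BASES) ** (n + m + 1)


def context_key(ref_seq, pos, n, m):
    """Return the (n + m + 1)-base context around 0-based position pos, or None near sequence ends."""
    if pos - n < 0 or pos + m >= len(ref_seq):
        return None
    return ref_seq[pos - n : pos + m + 1]


def error_rates(sites, d=None, w=None):
    """Total error support / total depth per (context, strand, alt base).

    sites: iterable of (context, strand, depth, {alt_base: support}) for
    potential noise sites (hypothetical layout, not from the patent).
    If d and w are given, only sites whose depth lies in the open
    interval (d - w/2, d + w/2) contribute.
    """
    err = defaultdict(int)
    depth_sum = defaultdict(int)
    for ctx, strand, depth, alt_counts in sites:
        if d is not None and not (d - w / 2 < depth < d + w / 2):
            continue
        depth_sum[(ctx, strand)] += depth
        for alt, support in alt_counts.items():
            err[(ctx, strand, alt)] += support
    return {k: v / depth_sum[k[:2]] for k, v in err.items() if depth_sum[k[:2]] > 0}


# Made-up example data: three potential noise sites on the positive strand.
sites = [
    ("ACG", "+", 1000, {"T": 2}),
    ("ACG", "+", 3000, {"T": 2, "A": 1}),
    ("TCA", "+", 500, {"G": 5}),
]
global_rates = error_rates(sites)                  # S2.3: all depths pooled
window_rates = error_rates(sites, d=1000, w=400)   # S2.4: depth in (800, 1200)
```

Pooling the support counts and depths before dividing, rather than averaging per-site rates, follows the wording "ratio of the total error support number to the total depth".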

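The strand-wise binomial test of claims 4 and 6 and the classification rules of claims 7 and 8 can be sketched in Python as follows. This is a sketch under assumptions: the claims do not state whether the test is one-sided, so an upper-tail p value P(X >= k) is assumed here; the threshold value `alpha`, the function names and the dictionary layout of the p values are hypothetical.

```python
import math


def binom_upper_tail(k, depth, e):
    """P(X >= k) for X ~ Binomial(depth, e), with each term computed in log space."""
    if not 0.0 < e < 1.0:
        raise ValueError("e must lie strictly between 0 and 1")
    log_e, log_q = math.log(e), math.log1p(-e)
    terms = []
    for i in range(k, depth + 1):
        log_pmf = (math.lgamma(depth + 1) - math.lgamma(i + 1)
                   - math.lgamma(depth - i + 1) + i * log_e + (depth - i) * log_q)
        terms.append(math.exp(log_pmf))
    return min(1.0, math.fsum(terms))


def is_true_mutation(sample_type, p1, p2=None, alpha=0.05):
    """Apply the single-base rule of claim 7.

    p1: first-layer p values, {"+": p, "-": p}.
    p2: second-layer p values in the same layout, or None when no
        second-layer model was built for this error type.
    alpha: p value threshold (assumed value; the claims only say "threshold").
    """
    def strand_passes(strand):
        ok = p1[strand] < alpha
        if p2 is not None:
            ok = ok and p2[strand] < alpha
        return ok

    passes = [strand_passes("+"), strand_passes("-")]
    if sample_type == "amplicon":
        return any(passes)   # either strand is sufficient
    if sample_type == "hybrid_capture":
        return all(passes)   # both strands are required
    raise ValueError("unknown sample type: " + sample_type)


def is_true_multi_base(single_base_calls):
    """Claim 8: background noise only if every component single-base mutation is noise."""
    return any(single_base_calls)


# Made-up example: 12 supporting reads at depth 2000 on each strand, e = 0.001.
p_plus = binom_upper_tail(12, 2000, 0.001)
p_minus = binom_upper_tail(12, 2000, 0.001)
call = is_true_mutation("hybrid_capture", {"+": p_plus, "-": p_minus})
```

Summing the tail directly from k to the depth avoids the loss of precision that `1 - P(X < k)` would suffer for very small p values.
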
Description

Mutation background noise filtering calculation method based on second-generation sequencing data

Technical Field

The invention relates to the technical field of high-throughput sequencing data analysis, in particular to a mutation background noise filtering calculation method based on second-generation sequencing data.

Background

Existing background noise filtering techniques for sequencing data fall into two main directions. The first optimizes the experimental workflow, for example by using molecular tags: reads carrying the same molecular tag are analyzed together, and part of the noise is removed when the reads are de-duplicated. However, molecular tags place extremely high demands on sequencing depth, and the experimental and sequencing costs are high. The second estimates the noise level during analysis after de-duplication and filters single-base mutations with a model. Fitting the background noise pattern is the main difficulty for model-based filtering, because different sequencers have different background noise patterns, and sample quality and the experimental workflow also affect those patterns. Some model-based filtering methods construct a baseline: the background noise error rate is estimated in advance from a batch of baseline samples, and single-base mutations in a sequencing sample are filtered using the estimated error rate. This approach requires a certain number of baseline samples and is therefore costly. Whenever the sequencer or the experimental workflow changes, baseline samples must be collected again and the baseline rebuilt; otherwise the baseline error rate no longer reflects the actual condition of the sequencing sample, mutation filtering becomes too loose or too strict, and the accuracy of the detection results suffers.
Even within the same sample, the background noise error rate may differ between sequence contexts and between depths. Other methods do not rely on a baseline; they build the model directly within the sample, learn the noise patterns under different sequence contexts, and filter sites according to the learned patterns. However, the currently published noise estimation and modeling methods of this kind do not handle these differences comprehensively, so the filtering algorithm performs poorly at some sites. Furthermore, in amplicon sequencing the amplification efficiency differs between PCR templates, so some reads containing false signals are amplified more readily than others. As a result, the background noise of amplicon sequencing varies far more between sites than that of capture sequencing, and some model-based filtering methods cannot be used for amplicon sequencing.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a background noise filtering model algorithm for single-base mutations (SNVs) that does not require a baseline to be constructed in advance, places no restriction on sample type, and can be applied generally to hybrid capture sequencing or amplicon sequencing data, improving the specificity of reported SNV results and the consistency between replicate samples while maintaining a certain sensitivity. The algorithm can also be extended to multi-base substitution (Complex) mutations.
To achieve the above purpose, the invention provides a mutation background noise filtering calculation method based on second-generation sequencing data, which comprises the following steps: S1, preprocessing the sequencing data of a sequencing sample to obtain a pileup file of the sample at the sites of each capture interval or amplicon interval; S2, classifying the sites according to the base type of the current site, the sequence context of the mutation position, the site depth and the sequencing direction, and calculating, for each class of sites, the global background noise error rate of each base error type and the error rate at a specific depth; S3, constructing a first-layer background noise model for each error type of each base, estimating a first-layer error rate, and performing a first-layer test on the sample mutations based on the first-layer background noise model; S4, counting, for each error type, the number of potential noise sites that pass the first-layer test and, if that number is greater than or equal to a preset threshold v, constructing a second-layer background noise model for the error type and performing a second-layer test on the matching mutations based on the second-layer background noise model; and S5, according to the detection result, the single base mutation and the multi-base mutation are quali