CN-116994647-B - Method for constructing model for analyzing mutation detection result

Abstract

The invention provides a method for constructing a model for analyzing a mutation detection result. The method comprises: obtaining a positive sequencing data set of definite positive mutation sites and a negative sequencing data set of definite negative mutation sites; extracting features of the mutation sites from the positive sequencing data set and the negative sequencing data set, respectively; and constructing the model using the features obtained in the previous step, wherein the features comprise at least one of an AD0 value, an AD1 value, an AF0 value, an AF1 value, a GT value, a DP value, a GQ value, an MQ value and a QUAL value. The model obtained by the method can accurately predict whether positive variation data is a false positive and can further give the genotype of the variation site, which helps locate possible variations faster and more accurately and reduces the cost and turnaround time of orthogonal experiments.

Inventors

TANG FEI
WANG ZHONGHUA
SUN JUAN
PENG ZHIYU

Assignees

天津华大基因科技有限公司
天津华大医学检验所有限公司

Dates

Publication Date: 20260512
Application Date: 20220425

Claims (8)

1. A method for constructing a model for analyzing a mutation detection result, comprising:
acquiring a positive sequencing data set of definite positive variation sites and a negative sequencing data set of definite negative variation sites;
extracting features of variant sites from the positive sequencing data set and the negative sequencing data set, respectively; and
constructing a model using the features obtained in the previous step;
wherein the features comprise at least one of:
AD0 value: the depth of the first allele in the variant locus genotype;
AD1 value: the depth of the second allele in the variant locus genotype;
AF0 value: the frequency of the first allele in the variant locus genotype;
AF1 value: the frequency of the second allele in the variant locus genotype;
GT value: a single numerical value;
DP value: the sequencing depth value;
GQ value: the quality value of the variant locus genotype;
MQ value: the mapping quality of the mutation site;
QUAL value: the quality value of the probability of a mutation site;
wherein the positive sequencing data set and the negative sequencing data set are obtained by:
acquiring a sequencing data set;
comparing the sequencing data set with reference data using GATK software to obtain a candidate positive variation data set; and
analyzing and processing the candidate positive variation data set to obtain the positive sequencing data set of definite positive variation sites and the negative sequencing data set of definite negative variation sites;
wherein the analyzing and processing comprises:
performing standard clinical interpretation on the candidate positive variation data set to obtain a potentially pathogenic variation data set; and
performing orthogonal test analysis on the potentially pathogenic variation data set to obtain the positive sequencing data set of definite positive variation sites and the negative sequencing data set of definite negative variation sites;
wherein the positive sequencing data set comprises an SNV variation type data set and an INDEL variation type data set, each of which comprises a homozygous genotype data set and a heterozygous genotype data set; and
wherein the model is a random forest classification model.
2. The method of claim 1, wherein the reference data is the human reference genome hg19.
3. The method of claim 1, wherein the random forest classification model has a threshold of 0.95±0.05.
4. A method of analyzing a mutation detection result, comprising: acquiring a candidate positive variation data set; and analyzing the candidate positive variation data set using a model obtained by the method for constructing a model for analyzing a mutation detection result according to any one of claims 1 to 3, so as to predict whether positive variation data in the candidate positive variation data set is a false positive and/or to predict the genotype of a variation site.
5. The method of claim 4, wherein when the confidence of the candidate positive variation data is below a threshold of the model, performing an orthogonal test analysis on the candidate positive variation data to predict whether the positive variation data in the candidate positive variation data set is a false positive.
6. A construction apparatus for a model for analyzing a mutation detection result, comprising:
an acquisition module adapted to acquire a positive sequencing data set of definite positive variation sites and a negative sequencing data set of definite negative variation sites;
an extraction module adapted to extract features of variant sites from the positive sequencing data set and the negative sequencing data set, respectively; and
a construction module adapted to construct a model using the features obtained by the extraction module;
wherein the features comprise at least one of:
AD0 value: the depth of the first allele in the variant locus genotype;
AD1 value: the depth of the second allele in the variant locus genotype;
AF0 value: the frequency of the first allele in the variant locus genotype;
AF1 value: the frequency of the second allele in the variant locus genotype;
GT value: a single numerical value;
DP value: the sequencing depth value;
GQ value: the quality value of the variant locus genotype;
MQ value: the mapping quality of the mutation site;
QUAL value: the quality value of the probability of a mutation site;
wherein the acquisition module comprises:
a sequencing data set acquisition module adapted to acquire a sequencing data set;
a comparison processing module adapted to compare the sequencing data set with reference data using GATK software to obtain a candidate positive variation data set; and
an analysis processing module adapted to analyze and process the candidate positive variation data set to obtain the positive sequencing data set of definite positive variation sites and the negative sequencing data set of definite negative variation sites;
wherein the analysis processing module comprises:
a standard clinical interpretation module adapted to perform standard clinical interpretation of the positive variation data to obtain potentially pathogenic variation data; and
an orthogonal test analysis sub-module adapted to perform orthogonal test analysis on the potentially pathogenic variation data to obtain the positive sequencing data set of definite positive variation sites and the negative sequencing data set of definite negative variation sites; and
wherein the model is a random forest classification model.
7. An executable storage medium having stored thereon computer program instructions which, when run on a processor, cause the processor to perform the method of analysing a variation detection result according to claim 4 or 5.
8. An electronic device, comprising: the executable storage medium of claim 7; and a processor configured to execute the computer program instructions to implement the method of analyzing a mutation detection result according to claim 4 or 5.
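
Claims 3 and 5 together describe a confidence threshold below which a candidate call is sent to orthogonal testing. A minimal Python sketch of that routing step follows. The names Call and triage are hypothetical, the confidence values are made up for illustration, and the default of 0.95 is taken from the range stated in claim 3; how the confidence is produced by the random forest model is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Call:
    """One candidate positive variant with the model's confidence (hypothetical type)."""
    site: str
    confidence: float  # assumed to lie in [0, 1]


def triage(calls, threshold=0.95):
    """Split calls by confidence; low-confidence calls go to orthogonal testing."""
    accepted, needs_orthogonal = [], []
    for call in calls:
        if call.confidence >= threshold:
            accepted.append(call)          # model prediction is used directly
        else:
            needs_orthogonal.append(call)  # e.g. confirm by Sanger sequencing
    return accepted, needs_orthogonal


# Illustrative input only; these are not values from the patent.
calls = [Call("chr1:12345", 0.99), Call("chr2:67890", 0.80), Call("chr3:13579", 0.95)]
accepted, needs_orthogonal = triage(calls)
```

Treating a confidence exactly equal to the threshold as accepted is a design choice of this sketch; claim 5 only states that calls below the threshold undergo orthogonal test analysis.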

Description

Method for constructing model for analyzing mutation detection result

Technical Field

The present invention relates to the field of biology. Specifically, the present invention relates to a method for constructing a model for analyzing a mutation detection result.

Background

Clinical next generation sequencing (cNGS) is widely used for the molecular diagnosis of patients with genetic diseases. However, known NGS procedures have random and systematic errors in the sequencing, alignment, and variant calling steps. Because reported variants can affect patient care and treatment, the American College of Medical Genetics and Genomics (ACMG) and the College of American Pathologists (CAP) suggest orthogonal validation of reported variants to reduce the risk of false positive results. Sanger sequencing has been the major technique for molecular diagnosis of genetic diseases. However, as demonstrated by the growth of public databases such as ClinVar and OMIM, the total number of candidate variants for clinical reporting is steadily increasing, which doubles the cost and turnaround time of testing and makes it increasingly impractical to validate every variant. Thus, as the burden of orthogonal testing grows, there is an increasingly pressing need to use a machine learning model, trained on a large amount of known data, to identify false positive variants in cNGS data.
Existing approaches have several problems. Orthogonal experiments such as Sanger sequencing add substantial cost and turnaround time. The features used by existing models are mostly Boolean flag values, which lose information compared with unmodified quantitative indicators. False positive variant calls are relatively rare in the training sets of existing models, so the confidence intervals of some false positive capture rates (particularly for SNVs) may be wide. Because of cost, existing models either lack sufficient clinical data, or are deliberately made complex to suit many scenarios at the expense of confidence, or reach sufficient confidence at a higher risk of overfitting; the scenarios to which they apply are also limited. Thus, current methods for predicting false positive variants remain to be studied.

Disclosure of Invention

The present invention aims to solve, at least to some extent, at least one of the technical problems existing in the prior art. To this end, in one aspect, the present invention proposes a method for constructing a model for analyzing a mutation detection result.
According to an embodiment of the invention, the method comprises: obtaining a positive sequencing data set of definite positive mutation sites and a negative sequencing data set of definite negative mutation sites; extracting features of the mutation sites from the positive sequencing data set and the negative sequencing data set, respectively; and constructing a model using the features obtained in the previous step. The features comprise at least one of an AD0 value, an AD1 value, an AF0 value, an AF1 value, a GT value, a DP value, a GQ value, an MQ value and a QUAL value, wherein the AD0 value is the depth of the first allele in the mutation site genotype; the AD1 value is the depth of the second allele in the mutation site genotype; the AF0 value is the frequency of the first allele in the mutation site genotype; the AF1 value is the frequency of the second allele in the mutation site genotype; the GT value is a single numerical value (specifically 0, 1, 2 or 3); the DP value is the sequencing depth value; the GQ value is the quality value of the mutation site genotype; the MQ value is the mapping quality of the mutation site; and the QUAL value is the quality value of the probability of the mutation site.
Mutation detection analysis software can generate dozens of feature parameters. The inventors compared and analyzed these parameters, screened out a set of them, and, taking the selected parameters as attributes, built a machine learning model on data sets of definite positive and definite negative mutation sites. The resulting model can accurately predict whether positive mutation data is a false positive and can further give the genotype of the mutation site, which helps locate possible mutations faster and more accurately and reduces the cost and turnaround time of orthogonal experiments.
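To make the feature set above concrete, the following Python sketch derives the nine values from one single-sample VCF data line of the kind GATK typically writes. Several points are assumptions made for illustration and are not stated in the source: the function names extract_features and encode_gt, the field layout (GT, AD, DP and GQ in the sample column, MQ in INFO), the computation of AF0 and AF1 as allele depth divided by the summed AD, and the mapping of genotypes onto 0, 1, 2 and 3. The example record is invented, and the sketch expects a called diploid genotype.

```python
import re


def encode_gt(alleles):
    """Map a diploid genotype to one number (hypothetical encoding)."""
    a, b = sorted(alleles)
    if b == 0:
        return 0  # homozygous reference, e.g. 0/0
    if a == 0:
        return 1  # heterozygous with the reference allele, e.g. 0/1
    if a == b:
        return 2  # homozygous alternate, e.g. 1/1
    return 3      # two different alternate alleles, e.g. 1/2


def extract_features(vcf_line):
    """Build the nine-feature dict from one tab-separated VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    qual, info, fmt, sample = cols[5], cols[7], cols[8], cols[9]
    # INFO holds ';'-separated KEY=VALUE pairs; flag entries have no '='.
    info_map = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    # FORMAT names the ':'-separated fields of the sample column.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    alleles = [int(a) for a in re.split(r"[/|]", sample_map["GT"])]
    depths = [int(d) for d in sample_map["AD"].split(",")]
    total = sum(depths)
    # Depth of the first and second allele named in the genotype.
    ad0, ad1 = depths[alleles[0]], depths[alleles[1]]
    return {
        "AD0": ad0,
        "AD1": ad1,
        "AF0": ad0 / total if total else 0.0,  # assumed: depth / summed AD
        "AF1": ad1 / total if total else 0.0,
        "GT": encode_gt(alleles),
        "DP": int(sample_map["DP"]),
        "GQ": int(sample_map["GQ"]),
        "MQ": float(info_map["MQ"]),
        "QUAL": float(qual),
    }


# Invented example record, not data from the patent.
example = ("chr1\t12345\t.\tA\tG\t812.77\tPASS\tAC=1;AF=0.5;MQ=60.00\t"
           "GT:AD:DP:GQ:PL\t0/1:18,22:40:99:841,0,620")
features = extract_features(example)
```

Feature dictionaries like this, one per site in the positive and negative data sets, could then be arranged into rows for training a classifier; that training step is outside this sketch.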
In another aspect of the invention, a method of analyzing a mutation detection result is provided. According to an embodiment of the invention, the method comprises the steps of obtaining a candidate positive variation data set, and analyzing the candidate positive variation data set by using a machine learning model obtained by the method for constructing the model for analyzing variation detection results so as to predict whether positive variation data in the candidate positive variation data set are false positive and/or genotyp