EP-4738368-A1 - APPARATUS FOR IDENTIFYING TRUE MUTATION OR SEQUENCING NOISE IN CTDNA SEQUENCING DATA, COMPUTER-READABLE STORAGE MEDIUM, AND USE
Abstract
The disclosure provides an apparatus, a computer-readable storage medium, and applications for identifying true mutations or sequencing noise in ctDNA sequencing data. The disclosure constructs locus-level noise estimates from a dataset of healthy individuals and determines true mutation signals by statistically testing the difference between the estimated noise levels and the mutation frequencies observed in actual tumor samples. Compared to the prior art, which employs a unified positive threshold for all loci, the present disclosure utilizes a positive threshold specific to each locus, making it more sensitive in distinguishing low-frequency true mutation signals from noise. Moreover, it does not require data from matched normal control samples of tumor patients, thereby reducing experimental and sequencing costs. It can be applied in the preparation of products for screening or auxiliary screening of cancer patients, as well as products for personalized cancer therapy.
Inventors
- CHE, Dongxue
- YAN, CHENG
- YANG, YUFEI
- YANG, Quanyu
- DU, Changshi
Assignees
- Genetron Health (Beijing) Co, Ltd.
Dates
- Publication Date
- 20260506
- Application Date
- 20240726
Claims (16)
- An apparatus for detecting the noise level in ctDNA sequencing data, wherein the apparatus comprises the following modules: A1) a ctDNA sequencing data processing module: configured to obtain ctDNA sequencing data of healthy subjects, perform quality control on the data, and align the data with the human reference genome to generate alignment files; A2) a mutation frequency detection module: configured to obtain, based on the alignment files, the sequencing depth at each locus, the number of reads supporting single-nucleotide variant at each locus, and the number of reads supporting insertion or deletion at each locus in the ctDNA sequencing data, and to calculate, for each locus, (i) the single-nucleotide variant frequency based on the sequencing depth and the number of reads supporting single-nucleotide variant, and (ii) the insertion or deletion frequency based on the sequencing depth and the number of reads supporting insertion or deletion; A3) a single-nucleotide variant noise level model construction module: configured to obtain the single-nucleotide variant noise level at each locus based on the grouping information of single-nucleotide variant feature at each locus and the single-nucleotide variant frequency at each locus; wherein the grouping information of single-nucleotide variant feature at each locus includes the trinucleotide sequence context and the alignment feature at each locus; A4) an insertion or deletion noise level model construction module: configured to obtain the insertion or deletion noise level at each locus based on the grouping information of insertion or deletion feature at each locus and the insertion or deletion frequency at each locus; wherein the grouping information of insertion or deletion feature at each locus includes the repetitive sequence element information and the segmental duplication information at each locus; A5) a noise level output module: configured to output the noise levels of the ctDNA sequencing data, which include the SNV noise level at each locus and/or the InDel noise level at each locus.
- The apparatus according to claim 1, wherein the single-nucleotide variant noise level model construction module in A3) comprises the following modules: A3-1) a single-nucleotide variant characteristic integration and grouping module: configured to group loci that share the same trinucleotide sequence context and the same alignment feature together, thereby forming groups with identical SNV feature; A3-2) a single-nucleotide variant prior background noise level acquisition module: configured to calculate the average single-nucleotide variant frequency of each group with identical SNV feature using formula 1, based on the mutation frequencies of all loci within the group with identical SNV feature and the group with identical SNV feature described in A3-1): $\lambda_{tm} = \frac{\sum \mu_{ptm}}{n}$ (formula 1); in formula 1, $\lambda_{tm}$ represents the average single-nucleotide variant frequency of all loci within the group with identical SNV feature, $\mu_{ptm}$ represents the single-nucleotide variant frequency at each locus within the group with identical SNV feature described in A3-1), and $n$ represents the number of loci within the group with identical SNV feature described in A3-1); A3-3) a single-nucleotide variant noise level acquisition module: configured to obtain the single-nucleotide variant noise level at each locus using formulas 2 and 3, based on the average single-nucleotide variant frequency of the group with identical SNV feature to which each locus belongs and the single-nucleotide variant frequency at each locus: $\varepsilon_p = \omega_1 \lambda_{tm} + (1 - \omega_1)\mu_p$ (formula 2) and $\omega_1 = \frac{\lambda_{tm}}{\lambda_{tm} + \mu_p}$ (formula 3); in formulas 2 and 3, $p$ represents a certain locus within the range of the sequencing panel; $\lambda_{tm}$ represents the average single-nucleotide variant frequency of all loci within the group with identical SNV feature to which locus $p$ belongs; $\mu_p$ represents the single-nucleotide variant frequency at locus $p$; $\omega_1$ represents the weight; $\varepsilon_p$ represents the single-nucleotide variant noise level at locus $p$.
- The apparatus according to claim 1, wherein the insertion or deletion mutation noise level model construction module in A4) comprises the following modules: A4-1) an insertion or deletion characteristic integration and grouping module: configured to group loci that share the same repetitive sequence element information and the same segmental duplication information together, thereby forming groups with identical InDel feature; A4-2) an insertion or deletion mutation prior background noise level acquisition module: configured to calculate the average insertion or deletion frequency of each group with identical InDel feature using formula 4, based on the insertion or deletion mutation frequency of all loci within the group with identical InDel feature and the group with identical InDel feature described in A4-1): $\lambda_{rsl} = \frac{\sum \mu_{prsl}}{m}$ (formula 4); in formula 4, $\lambda_{rsl}$ represents the average insertion or deletion frequency of all loci within the group with identical InDel feature, $\mu_{prsl}$ represents the insertion or deletion mutation frequency at each locus within the group with identical InDel feature described in A4-1), and $m$ represents the number of loci within the group with identical InDel feature described in A4-1); A4-3) an insertion or deletion mutation noise level acquisition module: configured to obtain the insertion or deletion noise level at each locus using formulas 5 and 6, based on the average insertion or deletion mutation frequency of the group with identical InDel feature to which each locus belongs and the insertion or deletion frequency at each locus: $\omega_2 = \frac{\lambda_{rsl}}{\lambda_{rsl} + \mu_{pl}}$ (formula 5) and $\varepsilon_{pl} = \omega_2 \lambda_{rsl} + (1 - \omega_2)\mu_{pl}$ (formula 6); in formulas 5 and 6, $p$ represents a certain locus within the range of the sequencing panel; $\omega_2$ represents the weight; $\lambda_{rsl}$ represents the average insertion or deletion mutation frequency of all loci within the group with identical InDel feature to which locus $p$ belongs; $\mu_{pl}$ represents the insertion or deletion mutation frequency at locus $p$; $\varepsilon_{pl}$ represents the insertion or deletion mutation noise level at locus $p$.
- An apparatus for identifying whether the mutations in ctDNA sequencing data are true mutations or sequencing noise, comprising the following modules: C1) a ctDNA sequencing data quality control and alignment module: configured to obtain ctDNA sequencing data of the sample to be tested, perform quality control, and align the data with the human reference genome to generate alignment files; C2) a ctDNA sequencing data mutation analysis module: configured to obtain the sequencing depth, the number of reads supporting mutation, and the mutation frequency at each locus based on the alignment files, thereby obtaining mutations in the ctDNA sequencing data of the sample to be tested; C3) a statistical testing module: configured to determine whether the mutations are true mutations or sequencing noise using statistical hypothesis testing methods based on the mutations and noise levels; the mutations are SNVs and/or InDels; the noise levels are SNV noise levels and/or InDel noise levels, and the noise levels are obtained using the apparatus described in any one of claims 1-3.
- The apparatus according to claim 4, wherein the statistical hypothesis test calculates the statistical significance level of the hypothesis test by using formulas 7 and 8: $\lambda_p = d_p \cdot \delta_p$ (formula 7) and $P(k_p, \lambda_p) = 1 - \sum_{j=0}^{k_p} \frac{e^{-\lambda_p} \lambda_p^{j}}{j!}$ (formula 8); in formulas 7 and 8, $p$ represents a certain locus within the range of the sequencing panel; $\delta_p$ is the noise level at locus $p$; $d_p$ is the sequencing depth at locus $p$; $\lambda_p$ is the expected number of reads supporting noise mutation observed at locus $p$; $k_p$ is the number of reads supporting mutation at locus $p$ in a sample to be tested; $j$ represents the value of the number of reads supporting mutation, $j \in (0, k_p)$; $P(k_p, \lambda_p)$ represents the statistical P-value when $k_p$ is greater than or equal to $\lambda_p$.
- A method for detecting the noise levels in ctDNA sequencing data, comprising the following steps: B1) processing ctDNA sequencing data: obtaining ctDNA sequencing data of healthy subjects, performing quality control, and aligning the data with the human reference genome to generate the alignment files; B2) performing detection of mutation frequency: based on the alignment files, obtaining the sequencing depth, the number of reads supporting single-nucleotide variant, and the number of reads supporting insertion or deletion at each locus in the ctDNA sequencing data; and, at each locus, calculating the single-nucleotide variant frequency based on the sequencing depth and the number of reads supporting single-nucleotide variant, and calculating the insertion or deletion mutation frequency based on the sequencing depth and the number of reads supporting insertion or deletion; B3) constructing a single-nucleotide variant noise level model to obtain the single-nucleotide variant noise level at each locus based on the grouping information of SNV feature at each locus and the single-nucleotide variant frequency at each locus, wherein the grouping information of SNV feature at each locus includes the trinucleotide sequence context and the alignment feature; B4) constructing an insertion or deletion mutation noise level model to obtain the insertion or deletion noise level at each locus based on the grouping information of InDel feature at each locus and the insertion or deletion frequency at each locus; wherein the grouping information of InDel feature at each locus includes the repetitive sequence element information and the segmental duplication information; B5) outputting the noise levels in the ctDNA sequencing data, which include the SNV noise level at each locus and/or the InDel noise level at each locus.
- The method according to claim 6, wherein the constructing single-nucleotide variant noise level model in B3) comprises the following steps: B3-1) integrating and grouping single-nucleotide variant groups based on the trinucleotide sequence context and alignment feature at each locus, to group loci with the same trinucleotide sequence context and the same alignment feature together, thereby forming groups with identical SNV feature; B3-2) acquiring the prior background noise level of single-nucleotide variant to calculate the average single-nucleotide variant frequency of the group with identical SNV feature using formula 1, based on the single-nucleotide variant frequency of all loci within the group with identical SNV feature and the group with identical SNV feature described in B3-1): $\lambda_{tm} = \frac{\sum \mu_{ptm}}{n}$ (formula 1); in formula 1, $\lambda_{tm}$ represents the average single-nucleotide variant frequency of all loci within the group with identical SNV feature, $\mu_{ptm}$ represents the single-nucleotide variant frequency at each locus within the group with identical SNV feature described in B3-1), and $n$ represents the number of loci within the group with identical SNV feature described in B3-1); B3-3) acquiring the single-nucleotide variant noise level at each locus using formulas 2 and 3, based on the average single-nucleotide variant frequency of the group with identical SNV feature to which each locus belongs and the single-nucleotide variant frequency at each locus: $\varepsilon_p = \omega_1 \lambda_{tm} + (1 - \omega_1)\mu_p$ (formula 2) and $\omega_1 = \frac{\lambda_{tm}}{\lambda_{tm} + \mu_p}$ (formula 3); in formulas 2 and 3, $p$ represents a certain locus within the range of the sequencing panel; $\lambda_{tm}$ represents the average single-nucleotide variant frequency of all loci within the group with identical SNV feature to which locus $p$ belongs; $\mu_p$ represents the single-nucleotide variant frequency at locus $p$; $\omega_1$ represents the weight; $\varepsilon_p$ represents the single-nucleotide variant noise level at locus $p$.
- The method according to claim 6 or 7, wherein the constructing insertion or deletion noise level model in B4) comprises the following steps: B4-1) integrating and grouping insertion or deletion groups based on the repetitive sequence element information and the segmental duplication information at each locus, to group loci with the same repetitive sequence element information and the same segmental duplication information together, thereby forming groups with identical InDel feature; B4-2) acquiring the prior background noise level of insertion or deletion to calculate the average insertion or deletion frequency of the group with identical InDel feature using formula 4, based on the insertion or deletion frequency at each locus of the group with identical InDel feature and the group with identical InDel feature described in B4-1): $\lambda_{rsl} = \frac{\sum \mu_{prsl}}{m}$ (formula 4); in formula 4, $\lambda_{rsl}$ represents the average insertion or deletion frequency of all loci within the group with identical InDel feature, $\mu_{prsl}$ represents the insertion or deletion frequency at each locus within the group with identical InDel feature described in B4-1), and $m$ represents the number of loci within the group with identical InDel feature described in B4-1); B4-3) acquiring the insertion or deletion noise level at each locus using formulas 5 and 6, based on the average insertion or deletion frequency of the group with identical InDel feature to which each locus belongs and the insertion or deletion frequency at each locus: $\omega_2 = \frac{\lambda_{rsl}}{\lambda_{rsl} + \mu_{pl}}$ (formula 5) and $\varepsilon_{pl} = \omega_2 \lambda_{rsl} + (1 - \omega_2)\mu_{pl}$ (formula 6); in formulas 5 and 6, $p$ represents a certain locus within the range of the sequencing panel; $\omega_2$ represents the weight; $\lambda_{rsl}$ represents the average insertion or deletion frequency of all loci within the group with identical InDel feature to which locus $p$ belongs; $\mu_{pl}$ represents the insertion or deletion frequency at locus $p$; $\varepsilon_{pl}$ represents the insertion or deletion noise level at locus $p$.
- A method for identifying whether the mutations in ctDNA sequencing data are true mutations or sequencing noise, comprising the following steps: D1) acquiring the ctDNA sequencing data of the samples to be tested, performing quality control and aligning the data with the human reference genome to generate alignment files; D2) performing mutation analysis on ctDNA sequencing data to obtain the sequencing depth, the number of reads supporting mutation, and the mutation frequency at each locus based on the alignment files, thereby obtaining the mutations in the ctDNA sequencing data of the sample to be tested; D3) performing statistical testing analysis to determine whether the mutations are true mutations or sequencing noise using a statistical hypothesis testing method based on the mutations and the noise levels; the mutations are SNVs and/or InDels; the noise levels are SNV noise levels and/or InDel noise levels, and the noise levels are obtained using the method described in any one of claims 6-8.
- The method according to claim 9, wherein the statistical hypothesis test calculates a statistical significance level of the hypothesis test by using formulas 7 and 8: $\lambda_p = d_p \cdot \delta_p$ (formula 7) and $P(k_p, \lambda_p) = 1 - \sum_{j=0}^{k_p} \frac{e^{-\lambda_p} \lambda_p^{j}}{j!}$ (formula 8); in formulas 7 and 8, $p$ represents a certain locus within the range of the sequencing panel; $\delta_p$ is the noise level at locus $p$; $d_p$ is the sequencing depth at locus $p$; $\lambda_p$ is the expected number of reads supporting noise mutation observed at locus $p$; $k_p$ is the number of reads supporting mutation at locus $p$ in a sample to be tested; $j$ represents the value of the number of reads supporting mutation, $j \in (0, k_p)$; $P(k_p, \lambda_p)$ represents the statistical P-value when $k_p$ is greater than or equal to $\lambda_p$.
- A computer-readable storage medium storing a computer program, wherein the computer-readable storage medium is a computer-readable storage medium for detecting a noise level in ctDNA sequencing data; the computer program enables a computer to execute steps of the method described in any one of claims 6-8.
- A computer-readable storage medium storing a computer program, wherein the computer-readable storage medium is a computer-readable storage medium for identifying whether the mutations in ctDNA sequencing data are true mutations or sequencing noise; the computer program enables a computer to execute steps of the method described in claim 9 or 10.
- Any one of the following uses of the apparatus described in any one of claims 1-5 or the method described in any one of claims 6-10: E1) use in preparing products for screening or auxiliary screening of cancer patients; E2) use in developing or preparing products for personalized cancer therapy.
- Any one of the following uses of the apparatus described in claim 4 or 5: E1) use in preparing products for screening or auxiliary screening of cancer patients; E2) use in developing or preparing products for personalized cancer therapy.
- Any one of the following uses of the method described in any one of claims 6-8: E1) use in preparing products for screening or auxiliary screening of cancer patients; E2) use in developing or preparing products for personalized cancer therapy.
- Any one of the following uses of the method described in claim 9 or 10: E1) use in preparing products for screening or auxiliary screening of cancer patients; E2) use in developing or preparing products for personalized cancer therapy.
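The noise-level calculation in formulas 1 to 6 of the claims can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation. The function name `estimate_noise_levels`, the locus identifiers, the feature-key tuples, and the example frequencies are all hypothetical. The sketch assumes that one aggregated variant frequency per locus has already been computed from the healthy-subject data, and it sets the noise level to zero when both the group mean and the locus frequency are zero, a case the formulas leave undefined.

```python
from collections import defaultdict


def estimate_noise_levels(freqs, groups):
    """Return a per-locus noise level from per-locus frequencies and feature groups.

    freqs:  {locus: variant frequency observed in healthy subjects}
    groups: {locus: hashable feature key, e.g. (trinucleotide context, alignment feature)}
    """
    # Collect the frequencies of all loci that share the same feature key.
    by_group = defaultdict(list)
    for locus, mu in freqs.items():
        by_group[groups[locus]].append(mu)

    # Formula 1 (SNV) / formula 4 (InDel): group mean of the locus frequencies.
    group_mean = {key: sum(vals) / len(vals) for key, vals in by_group.items()}

    noise = {}
    for locus, mu in freqs.items():
        lam = group_mean[groups[locus]]
        if lam + mu == 0:
            # Assumption: the weight is undefined here, so report zero noise.
            noise[locus] = 0.0
            continue
        weight = lam / (lam + mu)                        # formula 3 / formula 5
        noise[locus] = weight * lam + (1 - weight) * mu  # formula 2 / formula 6
    return noise


# Hypothetical inputs: three loci in one SNV feature group, one locus in another.
freqs = {"chr1:100": 1e-4, "chr1:200": 3e-4, "chr1:300": 2e-4, "chr2:50": 0.0}
groups = {
    "chr1:100": ("ACG", "unique"),
    "chr1:200": ("ACG", "unique"),
    "chr1:300": ("ACG", "unique"),
    "chr2:50": ("TTA", "unique"),
}
noise = estimate_noise_levels(freqs, groups)
for locus, level in noise.items():
    print(locus, level)
```

Substituting the weight into formula 2 (or formula 6) gives (λ² + μ²)/(λ + μ), so the resulting noise level always lies between the average of the group mean and the locus frequency and the larger of the two.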
Description
TECHNICAL FIELD
The disclosure relates to an apparatus, a computer-readable storage medium, and uses for identifying true mutations or sequencing noise in ctDNA sequencing data, within the field of bioinformatics.
BACKGROUND
Liquid biopsy technology, as a branch of in vitro diagnostics, enables the analysis and diagnosis of diseases such as cancer by sampling cerebrospinal fluid, saliva, blood, urine, etc., and can, to a certain extent, circumvent the influence of tissue heterogeneity. Currently, the detection of circulating tumor DNA (ctDNA) in blood represents the primary research direction within liquid biopsy. ctDNA refers to DNA fragments released by tumor cells that carry tumor mutation information. With the advancement of high-throughput technologies, the objective of ctDNA NGS data detection is to analyze specific mutated genes within tumor cells and subsequently provide personalized therapy plans for cancer patients. Increasing sequencing throughput and exponentially growing data volumes pose significant challenges to mutation detection techniques. The difficulty in analyzing ctDNA data lies in the often low frequency of tumor mutations, which makes it crucial to distinguish low-frequency tumor mutations from sequencing noise within large volumes of data. Existing detection technologies assume uniform noise levels across loci and employ a unified threshold standard to differentiate mutations from noise, which may reduce sensitivity or specificity at certain loci and even result in false-negative or false-positive calls. Moreover, current mutation detection techniques require matched normal control samples from cancer patients, which directly increases costs.
SUMMARY
The present disclosure provides an apparatus and method for detecting noise levels in ctDNA sequencing data, primarily utilizing locus-based noise levels to detect mutations.
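To illustrate why a locus-based noise level matters, the sketch below applies the Poisson test defined in formulas 7 and 8 of the claims to two loci with identical depth and read support but different noise levels. It is a minimal sketch, not the applicant's implementation: the function name `poisson_p_value` and the depth, read-count, and noise values are hypothetical. The direct summation is only suitable for a modest expected count; for a very large expected count, `math.exp(-lam)` underflows to zero and the result is no longer meaningful.

```python
import math


def poisson_p_value(k, depth, noise_level):
    """Evaluate formula 8: 1 - sum_{j=0}^{k} e^{-lam} * lam^j / j!."""
    lam = depth * noise_level      # formula 7: expected number of noise reads
    term = math.exp(-lam)          # j = 0 term of the sum
    cdf = term
    for j in range(1, k + 1):
        term *= lam / j            # builds e^{-lam} * lam^j / j! incrementally
        cdf += term
    return max(0.0, 1.0 - cdf)     # clamp tiny negative rounding errors


# Hypothetical loci: same depth and same supporting reads, different noise levels.
quiet = poisson_p_value(k=5, depth=10000, noise_level=1e-5)  # expected count 0.1
noisy = poisson_p_value(k=5, depth=10000, noise_level=5e-4)  # expected count 5
print(f"quiet locus P-value: {quiet:.3g}")
print(f"noisy locus P-value: {noisy:.3g}")
```

With these illustrative inputs, five supporting reads are highly significant at the quiet locus but unremarkable at the noisy one, a distinction that a single frequency threshold shared by all loci could not make.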
This disclosure constructs background noise levels at loci using datasets from healthy individuals and determines true mutation signals through statistical tests comparing the estimated noise levels with the observed mutation frequencies in actual tumor samples. Compared to existing technologies that employ a uniform positive cutoff value across all loci, the present disclosure utilizes locus-specific positive cutoff values, making it more sensitive in distinguishing low-frequency mutation signals from noise. The technology of the present disclosure can detect SNVs and InDels in ctDNA data from single tumor samples without requiring data from paired normal samples of cancer patients, thereby reducing experimental and sequencing costs.
Technical Problem
The technical problem to be addressed by the present disclosure is how to accurately identify or detect whether the mutations in ctDNA sequencing data are true mutations or sequencing noise.
Technical Solution
In order to address the aforementioned technical problem, the present disclosure first provides an apparatus for detecting noise levels in ctDNA sequencing data, which may comprise the following modules:
A1) ctDNA Sequencing Data Processing Module: configured to obtain or receive ctDNA sequencing data from healthy subjects, perform quality control, and align the data with the human reference genome to generate alignment files;
A2) Mutation Frequency Detection Module: configured to obtain the sequencing depth at each locus, the number of reads supporting single-nucleotide variant at each locus, and the number of reads supporting insertion or deletion at each locus in the ctDNA sequencing data based on the alignment files. For each locus, the module calculates: (i) the single-nucleotide variant frequency based on the sequencing depth and the number of reads supporting single-nucleotide variant, and (ii) the insertion or deletion frequency based on the sequencing depth and the number of reads supporting insertion or deletion;
A3) Single-Nucleotide Variant Noise Level Model Construction Module: configured to obtain the single-nucleotide variant noise level at each locus based on the grouping information of single-nucleotide variant feature at each locus and the single-nucleotide variant frequency at each locus; wherein the grouping information of single-nucleotide variant feature at each locus includes the trinucleotide sequence context and the alignment feature at each locus;
A4) Insertion or Deletion Noise Level Model Construction Module: configured to obtain the insertion or deletion noise level at each locus based on the grouping information of insertion or deletion feature at each locus and the insertion or deletion frequency at each locus; wherein the grouping information of insertion or deletion feature at each locus includes the repetitive sequence element information and the segmental duplication information at each locus;
A5) Noise Level Output Module: configured to output the noise levels of the ctDNA sequencing data, wh