
CN-119091965-B - Method for detecting fusion gene based on DNA-based sequencing data

CN119091965B

Abstract

The invention relates to the technical field of gene detection and discloses a method for detecting fusion genes based on DNA-based sequencing data. The method comprises the following steps: S1, obtaining DNA sequencing data and filtering it with the software fastp to obtain filtered sequencing data; S2, detecting with the software Genefuse to obtain a first fusion gene result and related parameters; S3, detecting with the software Factera to obtain a second fusion gene result and related parameters; S4, detecting with the software Arriba to obtain a third fusion gene result and related parameters; S5, obtaining a fusion gene filtering model, taking the first, second and third fusion gene results and their related parameters as input data, calculating the true positive fusion genes, and outputting the results. The invention constructs a fusion gene screening standard based on multiple software tools, avoids the false positives or false negatives of a single tool's detection results, and improves the detection rate and accuracy of fusion genes.
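
The multi-software screening idea can be illustrated with a minimal sketch. It assumes each tool's calls have already been parsed into sets of (gene 1, gene 2) pairs; the function name `tally_fusion_calls`, the input layout, the order-insensitive pair matching, and the example gene pairs are all hypothetical and come neither from the patent nor from any tool's output format. Pairs reported by two or more tools are accepted directly, while single-tool pairs are set aside for the further checks described in the claims.

```python
from collections import defaultdict


def tally_fusion_calls(calls_by_tool):
    """Split fusion gene pairs by how many tools reported them.

    calls_by_tool maps a tool name to a set of (gene1, gene2) tuples.
    Returns (accepted, needs_review): pairs reported by two or more tools,
    and pairs reported by exactly one tool, each mapped to the sorted list
    of tools that reported them.
    """
    tools_by_pair = defaultdict(set)
    for tool, pairs in calls_by_tool.items():
        for gene1, gene2 in pairs:
            # Assumption of this sketch: A-B and B-A count as the same pair.
            key = tuple(sorted((gene1, gene2)))
            tools_by_pair[key].add(tool)

    accepted = {p: sorted(t) for p, t in tools_by_pair.items() if len(t) >= 2}
    needs_review = {p: sorted(t) for p, t in tools_by_pair.items() if len(t) == 1}
    return accepted, needs_review


# Illustrative input only; these calls are made up for the example.
example = {
    "Genefuse": {("EML4", "ALK"), ("KIF5B", "RET")},
    "Factera": {("ALK", "EML4")},
    "Arriba": {("CD74", "ROS1")},
}
accepted, needs_review = tally_fusion_calls(example)
print(accepted)
print(needs_review)
```

Keeping the list of reporting tools alongside each pair makes it straightforward to route a single-tool call to the check that matches the tool that produced it.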

Inventors

  • BAO QIAN
  • LI PANSHAN
  • WANG WENLING

Assignees

  • 杭州洛兮医学检验实验室有限公司

Dates

Publication Date
20260508
Application Date
20240902

Claims (5)

  1. A method for detecting a fusion gene based on DNA-based sequencing data for non-disease diagnostic purposes, comprising the steps of:
     Step S1, acquiring DNA sequencing data, and performing data filtering by using software fastp to acquire filtered sequencing data;
     Step S2, performing fusion gene detection on the filtered sequencing data by using software Genefuse to obtain a first fusion gene result and first related parameter information;
     Step S3, performing fusion gene detection on the filtered sequencing data by using software Factera to obtain a second fusion gene result and second related parameter information;
     Step S4, performing fusion gene detection on the filtered sequencing data by using software Arriba to obtain a third fusion gene result and third related parameter information;
     Step S5, obtaining a fusion gene filtering model, taking the first fusion gene result and first related parameters and the second fusion gene result and second related parameters as input data, calculating a true positive fusion gene through the fusion gene filtering model, and outputting a fusion gene result;
     wherein the fusion gene filtering model in step S5 proceeds as follows:
     Step S51, respectively obtaining the first fusion gene result and first related parameter information, the second fusion gene result and second related parameter information, and the third fusion gene result and third related parameter information, and obtaining a potential fusion gene table according to the fusion gene results and the related parameter information, wherein the potential fusion gene table comprises fusion gene pairs, fusion gene 1, fusion gene 2, detection software, sequence information and related parameters;
     Step S52, if the fusion gene pair is detected by two or more fusion gene detection software, adding the fusion gene pair to the fusion gene list, and acquiring related parameter information;
     Step S53, if the fusion gene pair is detected by only one fusion gene detection software, and the corresponding fusion gene 1 or fusion gene 2 is detected by a plurality of fusion gene detection software, extracting the corresponding fusion gene pair sequence information from the plurality of software, acquiring abnormal alignment sequences from the DNA sequencing data, aligning the abnormal alignment sequences to a reference genome by using software BWA, and determining the fusion gene pair according to alignment quality: if the alignment results of the plurality of abnormal alignment sequences are consistent, determining the fusion gene pair according to the alignment results, adding the fusion gene pair to the fusion gene list, and acquiring related parameter information; if the alignment results of the plurality of abnormal alignment sequences are inconsistent, determining the fusion gene pair according to the alignment results of the plurality of abnormal alignment sequences, adding the fusion gene pair to the fusion gene list, and acquiring related parameter information; and if the alignment results of the plurality of abnormal alignment sequences are inconsistent and the alignment quality is consistent, selecting the fusion gene pair according to base quality;
     if the total number of breakpoints is less than or equal to 20 and the number of low quality and extremely low quality bases is less than 15% of the length of the whole fusion gene pair, extracting an abnormal alignment sequence fastq file from the DNA sequencing data according to the fusion gene pair sequence, aligning and de-duplicating the abnormal alignment sequence fastq file by using samtools software and picard software to obtain an abnormal alignment sequence bam file, using Factera software and Arriba software respectively to detect the fusion gene from the abnormal alignment sequence bam file and the abnormal alignment sequence fastq file, and if the same fusion gene pair is detected by either software, considering the fusion gene pair to be a true positive fusion;
     if the fusion gene pair is detected only in the second fusion gene result and neither fusion gene 1 nor fusion gene 2 is detected in other fusion gene pairs, acquiring the related parameters of the second fusion gene result for judgment: if break_support is greater than or equal to 20 and break_depth and property_pair_support are both greater than 100, adding the fusion gene pair to the fusion gene list and acquiring related parameter information; if break_support is greater than or equal to 10 and less than 20, and break_depth, property_pair_support and total_depth are all greater than 100, extracting the sequence information of the fusion gene pair, extracting abnormal alignment sequences from the DNA sequencing data, using Genefuse software and Arriba software to detect the fusion gene pair from the abnormal alignment sequences, and if the same fusion gene pair is detected by either software, considering the fusion gene pair to be true positive and acquiring related parameter information;
     Step S56, if the fusion gene pair is detected only in the third fusion gene result and neither fusion gene 1 nor fusion gene 2 is detected in other fusion gene results, acquiring the related parameters of the third fusion gene result for judgment: if the confidence is high, split_read is greater than or equal to 20 and coverage is greater than or equal to 200, considering the fusion gene pair to be true positive, adding the fusion gene pair to the fusion gene list, and acquiring parameter information; if the confidence is medium or low, split_read is greater than or equal to 10 and less than 20, and coverage is greater than or equal to 200, extracting the sequence information of the fusion gene pair, extracting an abnormal alignment sequence fastq file from the DNA sequencing data according to that information, processing the abnormal alignment sequences by using software BWA and software picard to obtain an abnormal alignment sequence bam file, using software Genefuse and software Factera to detect fusion genes from the abnormal alignment sequence fastq file and the abnormal alignment sequence bam file, and if either software detects the same fusion gene pair, considering the fusion gene pair to be true positive, adding the fusion gene pair to the fusion gene list, and acquiring related parameter information; and
     Step S57, acquiring the fusion gene list and outputting a result.
  2. The method for detecting fusion genes based on DNA-based sequencing data according to claim 1, wherein the first related parameter information in step S2 includes the number of breakpoints and the base quality of the breakpoint sequence, wherein the number of breakpoints is divided into total number and unique number, and the base quality is divided into extremely high quality, medium quality, low quality and extremely low quality.
  3. The method for detecting fusion genes based on DNA-based sequencing data according to claim 1, wherein the second related parameter information in step S3 includes break_support, break_depth, property_pair_support and total_depth, wherein break_support is divided into break_support1 and break_support2.
  4. The method for detecting fusion genes based on DNA-based sequencing data according to claim 1, wherein the third related parameter information in step S4 includes split_read, coverage and confidence, wherein split_read is divided into split_read1 and split_read2, coverage is divided into coverage1 and coverage2, and confidence is divided into low, medium, and high.
  5. The method for detecting fusion genes based on DNA-based sequencing data according to claim 1, wherein in steps S53 to S56, the abnormal alignment sequences are extracted by aligning the obtained potential breakpoint sequence information against the filtered sequencing data and taking the sequences with an alignment consistency rate of more than 95% as the abnormal alignment sequences.
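
The threshold checks applied to fusion gene pairs reported only by Factera or only by Arriba can be sketched as two small decision functions. This is an illustration under stated assumptions, not the patented implementation: the function names and the return labels are hypothetical, each parameter is passed as a single number (the claims do not say how break_support1/break_support2 or split_read1/split_read2 are combined), the intermediate break_support range is taken to be 10 ≤ break_support < 20 as this sketch's reading of claim 1, and the label "unresolved" is a placeholder for cases the claims do not address.

```python
def classify_factera_only(break_support, break_depth, pair_support, total_depth):
    """Decision for a pair seen only in the second (Factera) result.

    pair_support stands for the parameter the claims call property_pair_support.
    Returns "accept", "recheck" (re-detect with the other tools) or "unresolved".
    """
    if break_support >= 20 and break_depth > 100 and pair_support > 100:
        return "accept"
    if (10 <= break_support < 20 and break_depth > 100
            and pair_support > 100 and total_depth > 100):
        return "recheck"
    # Not covered by the claims; placeholder label chosen for this sketch.
    return "unresolved"


def classify_arriba_only(confidence, split_read, coverage):
    """Decision for a pair seen only in the third (Arriba) result."""
    if confidence == "high" and split_read >= 20 and coverage >= 200:
        return "accept"
    if (confidence in ("medium", "low") and 10 <= split_read < 20
            and coverage >= 200):
        return "recheck"
    # Not covered by the claims; placeholder label chosen for this sketch.
    return "unresolved"


print(classify_factera_only(25, 150, 120, 90))
print(classify_arriba_only("low", 12, 250))
```

A "recheck" result corresponds to extracting the abnormal alignment sequences and running the other two tools on them, with the pair kept only if one of them reports the same fusion.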

Description

Method for detecting fusion gene based on DNA-based sequencing data

Technical Field

The invention relates to the technical field of gene detection, and in particular to a method for detecting fusion genes based on DNA-based sequencing data.

Background

A fusion gene is a new gene formed when all or part of the coding or non-coding sequences of two or more different genes are joined together through a mechanism such as genomic variation; it results from structural rearrangement of chromosomes. Fusion genes are closely related to the occurrence and development of tumors, and identifying the relevant fusion genes supports the diagnosis of biomarkers, the discovery of new therapeutic targets and the understanding of the molecular basis of tumor occurrence. At present, the fusion gene detection methods commonly used in clinical practice include immunohistochemistry, FISH, PCR and second-generation sequencing. Second-generation sequencing has a wide detection range and high detection speed, and can both verify whether genes are fused and accurately detect fusion breakpoints at the genomic and transcriptomic levels, which addresses problems of conventional detection methods such as missed detections and the inability to identify the fusion partner gene. Fusion gene detection by second-generation sequencing includes DNA-based and RNA-based second-generation sequencing methods.
DNA-based fusion gene detection aligns paired-end sequencing reads to the genome and evaluates whether the distance and orientation of each read pair are consistent with the library construction information, so as to judge whether a fusion gene is present; commonly used software includes Genefuse and Factera. RNA-based fusion gene detection includes a read alignment method and an assembly alignment method. The read alignment method identifies fusion events by searching for discordant reads and reads covering breakpoints; the assembly alignment method first assembles transcripts and then aligns them to a reference genome to identify fusion transcripts consistent with chromosomal rearrangement. Commonly used software includes Arriba.

Genefuse is a DNA-based sequencing tool that detects fusion genes directly from fastq files. It finds reads whose left and right parts map well to two different genes but which cannot be mapped in full to the reference genome, treats them as supporting reads, and analyzes each supporting read to determine whether a fusion gene is present. However, the software focuses only on clinically significant genes and has weak detection capability for unknown fusion genes.

Factera is a software tool for finding fusion genes from DNA sequencing data, mainly used for detecting translocation, inversion and deletion fusion types. The original fastq file must first be aligned and processed to obtain a bam file, which is then used as the software's input; by searching for improperly paired reads, the software clusters similar exons into different groups, finds breakpoints and locates fusion genes.
However, the software depends on the bam file, its detection is not sensitive enough, and it lacks a function for visualizing detected fusions. Arriba is a fusion gene detection tool for RNA-based sequencing data that can detect inversion and duplication fusion types and is fast and sensitive, but it has difficulty detecting deletion-type fusions. Thus, each software has shortcomings in detecting fusion genes, and relying on a single software may yield false positives or false negatives.

Disclosure of Invention

Based on the above problems, the invention provides a method for detecting fusion genes based on DNA-based sequencing data, which constructs a screening standard based on multiple software tools and uses the current mainstream fusion gene detection software for filtering and screening, thereby effectively avoiding the false positives or false negatives of single-software detection results.

A method for detecting a fusion gene based on DNA-based sequencing data, comprising the steps of:
Step S1, acquiring DNA sequencing data, and performing data filtering by using software fastp to acquire filtered sequencing data;
Step S2, performing fusion gene detection on the filtered sequencing data by using software Genefuse to obtain a first fusion gene result and first related parameter information;
Step S3, performing fusion gene detection on the filtered sequencing data by using software Factera to obtain a second fusion gene