CN-122024842-A - Fusion transcript identification method, device, system and medium based on transcriptome multiple comparison data
Abstract
The application provides a fusion transcript identification method, device, system and medium based on transcriptome multiple alignment data, in which an end-to-end analysis framework oriented toward multi-mapping reads is constructed to detect fusion genes. The system comprises a paired-read preliminary screening module based on a loose mapping strategy, a candidate fusion construction module based on exon structure, a remapping module based on an enhanced reference transcriptome, a transcript abundance estimation module based on a probabilistic generative model, a fusion scoring module based on fusion-site-specific support, and a false positive suppression module based on coverage consistency and biological filtering. The application removes the dependence of traditional fusion gene detection methods on uniquely mapped reads and, for the first time, proposes multiple alignment analysis of the ambiguously mapped reads that are otherwise discarded because of sequence homology. High sensitivity and a low false positive rate have been verified on both simulated data and real tumor samples, and the approach offers clear technical advancement, clinical applicability and extensibility.
Inventors
- CHEN GUANGQUAN
- DU BIN
Assignees
- Shanghai First Maternity and Infant Hospital (上海市第一妇婴保健院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-11-21
Claims (12)
- 1. A fusion transcript identification system based on transcriptome multiple alignment data, comprising: a paired-read preliminary screening module for independently aligning and screening the end reads of each read pair in the RNA-Seq data based on a loose mapping strategy; a candidate fusion construction module for enumerating, for each pair of potential fusion genes and based on the exon structure, the fusion transcript structures that satisfy multiple biological constraints, thereby generating multiple candidate fusion transcript sequences; a remapping module for integrating all the candidate fusion transcript sequences into the original RefSeq database to form an enhanced reference database, remapping the original reads to the enhanced reference database, and setting the maximum number of mapping positions reported for each read pair; a transcript abundance estimation module for iteratively optimizing the maximum likelihood abundance estimate of each transcript by an expectation-maximization algorithm based on a probabilistic generative model, so as to achieve a globally optimal allocation of the multi-mapping reads; a fusion scoring module for scoring the read pairs spanning the fusion junction site after the expectation-maximization algorithm has converged; and a false positive suppression module for performing double filtering based on coverage consistency and biological logic to suppress false positive results.
- 2. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 1, wherein the alignment and screening process of said paired-read preliminary screening module comprises: aligning the reads to a reference transcriptome in single-end mode using a short-read alignment tool; setting loose parameters so that each read reports its non-unique high-confidence mapping positions; selecting the read pairs whose two ends map to different genes; and filtering out false positive results using genomic coordinates.
- 3. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 2, wherein said paired-read preliminary screening module employs Bowtie as the short-read alignment tool, wherein: the Bowtie alignment tool aligns the reads to the RefSeq reference transcriptome in single-end mode; and/or the loose parameters set by the Bowtie alignment tool for read alignment comprise any one or more of the following: the length of the region at the front end of the read that is used to initialize the alignment, together with the maximum number of mismatches allowed in that region; the maximum threshold for the sum of the quality values at all mismatched positions; a forced check of every base of the read; a requirement to report all valid alignments of a read that meet the conditions; and the maximum number of valid alignments allowed for a read.
- 4. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 1, wherein said candidate fusion construction module performs the following steps: constructing a directed graph composed of the exons of the upstream gene and the downstream gene; adding fusion edges pointing from every upstream exon to every downstream exon; traversing all possible paths of the directed graph; verifying three preset biological constraints on each potential fusion structure obtained by the traversal; and retaining only the fusion transcript structures that satisfy all the constraints simultaneously.
- 5. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 4, wherein said three preset biological constraints are as follows: condition 1: the potential fusion structure contains all exons covered by the reads; condition 2: the order of the internal exons of the upstream and downstream genes is preserved, so that the path extends continuously from the start exon of the upstream gene to the fusion point and then from the fusion point of the downstream gene to the terminal exon; condition 3: the insert length inferred from the read pairs does not exceed an upper quantile of the insert size distribution of the experimental library, with a preset proportion of outliers allowed.
- 6. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 1, wherein said transcript abundance estimation module is configured to perform the following steps: constructing a probabilistic generative model in which the generation of each read pair is assumed to involve hidden variables and observed variables, the hidden variables comprising the source transcript, the start position of the upstream read and the end position of the downstream read; defining a likelihood function that quantifies the joint probability of all read pairs under the generative process by integrating transcript abundance, start position uniformity, insert size distribution and a sequencing error model; and iterating the abundance estimate of each transcript with the likelihood function by an expectation-maximization algorithm until the convergence conditions are met, so as to obtain a globally optimal allocation of the ambiguously mapped reads.
- 7. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 1, wherein the read pairs spanning a fusion junction site that are scored by said fusion scoring module are read pairs in which at least one read covers both the 3' sequence of the upstream gene and the 5' sequence of the downstream gene, directly spanning the fusion breakpoint.
- 8. The fusion transcript identification system based on transcriptome multiple alignment data according to claim 1, wherein: the first filtering of the false positive suppression module comprises calculating the physical coverage depth at the fusion junction site, quantitatively comparing it with the average coverage depth of the main body regions of the upstream and downstream parent genes, and, if the coverage depth at the fusion site is lower than a preset percentage of the coverage depth of the main bodies of the upstream and downstream genes, judging the fusion junction site to be a false positive result and filtering it out; and the second filtering of the false positive suppression module comprises filtering out fusion events whose read sequences perfectly match the U1-U7 spliceosomal RNAs.
- 9. A method for identifying fusion transcripts based on transcriptome multiple alignment data, comprising: independently aligning and screening the end reads of each read pair in the RNA-Seq data based on a loose mapping strategy; enumerating, for each pair of potential fusion genes and based on the exon structure, the fusion transcript structures that satisfy multiple biological constraints, thereby generating multiple candidate fusion transcript sequences; integrating all the candidate fusion transcript sequences into the original RefSeq database to form an enhanced reference database, aligning the original reads to the enhanced reference database, and setting the maximum number of mapping positions reported for each read; iteratively optimizing the maximum likelihood abundance estimate of each transcript by an expectation-maximization algorithm based on a probabilistic generative model, so as to achieve a globally optimal allocation of the multi-mapping reads; scoring the read pairs spanning the fusion junction site after the expectation-maximization algorithm has converged; and performing double filtering based on coverage consistency and biological logic to suppress false positive results.
- 10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method for identifying fusion transcripts based on transcriptome multiple alignment data according to claim 9.
- 11. A computer program product comprising computer program code which, when run on a computer, causes the computer to implement the method for fusion transcript identification based on transcriptome multiple alignment data according to claim 9.
- 12. A computer device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the method for identifying fusion transcripts based on transcriptome multiple alignment data according to claim 9.
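Illustrative note (not part of the claims): the expectation-maximization allocation of multi-mapping reads recited in claims 1, 6 and 9 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation. The function name `em_abundance`, the input format, the iteration cap and the tolerance are all hypothetical. Each read pair is assumed to arrive with a precomputed conditional likelihood per compatible transcript, standing in for the start position, insert size and sequencing error terms of the generative model. The transcript labels A1-B1 and A2-B2 are placeholders for a true fusion and a homologous pseudo-candidate.

```python
from typing import Dict, List


def em_abundance(
    read_likelihoods: List[Dict[str, float]],
    max_iter: int = 1000,   # assumed iteration cap
    tol: float = 1e-9,      # assumed convergence tolerance
) -> Dict[str, float]:
    """Estimate relative transcript abundances from multi-mapping read pairs.

    read_likelihoods[i][t] is the assumed, precomputed probability of
    observing read pair i given that it originates from transcript t.
    """
    transcripts = sorted({t for read in read_likelihoods for t in read})
    if not transcripts:
        raise ValueError("no read pair is compatible with any transcript")

    # Start from a uniform abundance vector.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}

    for _ in range(max_iter):
        # E-step: split each read pair across its compatible transcripts
        # in proportion to abundance times conditional likelihood.
        counts = dict.fromkeys(transcripts, 0.0)
        for read in read_likelihoods:
            weights = {t: theta[t] * p for t, p in read.items()}
            total = sum(weights.values())
            if total == 0.0:
                continue  # read pair cannot be explained; skip it
            for t, w in weights.items():
                counts[t] += w / total

        assigned = sum(counts.values())
        if assigned == 0.0:
            raise ValueError("no read pair could be assigned")

        # M-step: the new abundance is the fraction of assigned read pairs.
        new_theta = {t: counts[t] / assigned for t in transcripts}
        delta = max(abs(new_theta[t] - theta[t]) for t in transcripts)
        theta = new_theta
        if delta < tol:
            break
    return theta


# Three read pairs map only to candidate A1-B1; one maps equally well to
# A1-B1 and to the homologous candidate A2-B2.
reads = [{"A1-B1": 1.0}] * 3 + [{"A1-B1": 0.5, "A2-B2": 0.5}]
theta = em_abundance(reads)
print(theta)
```

In this toy input the ambiguous read pair is progressively reassigned to the candidate that also has unique support, which is the behavior the global allocation is intended to provide.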
Description
Fusion transcript identification method, device, system and medium based on transcriptome multiple comparison data
Technical Field
Since the start of the 21st century, the rapid development of high-throughput sequencing technology has thoroughly reshaped the research paradigm of oncology, and RNA sequencing (RNA-Seq) has become a core tool for the systematic discovery of driver genomic variations by virtue of its unbiased nature, full transcriptome coverage and quantitative accuracy.
Background
Among the many tumor-associated variations, fusion genes are attracting increasing interest in both basic research and clinical translation because of their key role in tumorigenesis, progression and therapeutic response. Fusion genes typically arise when chromosomal structural rearrangements (e.g., translocations, inversions, deletions or amplifications) break and rejoin genes, forming chimeric transcripts with aberrant function. Classical cases such as BCR-ABL1 in chronic myelogenous leukemia, TMPRSS2-ERG in prostate cancer and EML4-ALK in non-small cell lung cancer not only reveal the molecular pathogenesis of tumors but have also directly driven the development of targeted drugs (such as imatinib and crizotinib), becoming a model for precision medicine.
In recent years, with the continuing decrease in sequencing cost and the progress of analysis algorithms, the scope of fusion gene detection has expanded rapidly from hematological tumors to various solid tumors, especially sarcomas (such as high-grade endometrial stromal sarcomas and uterine inflammatory myofibroblastic tumors). More and more novel fusion events have been reported, such as ESRRA-C11orf20, MALAT1-GLI1, and fusions involving kinase genes such as FGFR2, NTRK and ALK, and some of them have been shown to have diagnostic, prognostic or therapeutic value. However, while RNA-Seq provides unprecedented opportunities for fusion gene detection, data analysis still faces serious technical challenges. One of the core bottlenecks is the highly repetitive and homologous nature of the transcriptome: a large number of genes belong to multigene families or have functionally redundant paralogs or expressed processed pseudogenes, so that a significant proportion of sequencing reads cannot be uniquely mapped to the reference genome or transcriptome. Such ambiguously mapping ("fuzzy mapping") reads are often discarded outright in conventional analysis pipelines, or only their "best" mapping location is retained. Although this strategy reduces the risk of false positives, it comes at a high cost in sensitivity: studies have shown that in typical RNA-Seq data up to 40% of potential fusion events can only be detected through ambiguously mapping reads. In sarcomas in particular, many driver fusions involve highly homologous kinase domains (e.g., the FGFR family, the ROS1/NTRK family) with very high sequence similarity in the breakpoint regions, so the supporting reads are inherently multi-mapping.
If unique mapping is enforced, a large number of clinically significant fusion variants are inevitably missed, which severely limits the completeness of molecular typing and the accessibility of targeted therapy. Currently, the mainstream fusion gene detection tools (e.g., TopHat-Fusion, deFuse, STAR-Fusion) commonly employ a "unique mapping first" strategy, whose core logic relies on the explicit, unambiguous placement of reads that span the fusion junction (junction-spanning reads) or of discordant paired-end reads (discordant read pairs). Such methods perform well on fusions that are highly expressed and lie in unique sequence regions, but perform poorly on fusions mediated by homologous genes. More seriously, when ambiguous mapping is allowed, a read pair supporting the true fusion A1-B1 may also map to the homologous genes A2 and B2, so that multiple pseudo-fusion candidates such as A1-B2, A2-B1 and A2-B2 are wrongly derived and the number of false positives grows explosively. Therefore, how to make full use of ambiguously mapping reads to improve detection sensitivity while effectively suppressing the false positive noise caused by multi-mapping combinations is a key, long-standing technical problem in the field of fusion gene detection. Although studies have attempted to address this problem by introducing long-read sequencing (e.g., PacBio, Nanopore) or by combining evidence of structural variation at the DNA level, these protocols are either limited by high cost, insufficient throughput, or are difficult to genera