CN-122012684-A - Genetic variation and chimera detection method, system and kit

Abstract

The invention provides a method, a system and a kit for detecting genetic variation and chimeric variation. The method comprises the steps of (a) extracting high molecular weight DNA from a biological sample, (b) labeling the high molecular weight DNA with a linker containing a UMI sequence to form a UMI-labeled DNA library, (c) performing targeted enrichment of the UMI-labeled DNA library using a probe directed at a target locus to obtain an enriched long-fragment DNA library, (d) performing long-read sequencing on the enriched long-fragment DNA library to obtain sequencing data, and (e) performing bioinformatic analysis on the sequencing data, wherein the analysis comprises UMI-based error correction and variant detection. The method is used for simultaneously detecting structural variation and low-frequency chimeric variation in a target genomic region.

Inventors

SUN HAIRUI
HE YIHUA
HAO XIAOYAN
FAN JIAQI
LIU RUIMIN

Assignees

Beijing Anzhen Hospital, Capital Medical University

Dates

Publication Date: 20260512
Application Date: 20260212

Claims (10)

1. A targeted long-read sequencing method integrating unique molecular identifiers (UMIs), characterized in that the method comprises the following steps: (a) extracting high molecular weight DNA from a biological sample; (b) labeling the high molecular weight DNA with a linker comprising a UMI sequence to form a UMI-labeled DNA library; (c) performing targeted enrichment of the UMI-labeled DNA library using a probe directed at a target locus to obtain an enriched long-fragment DNA library; (d) performing long-read sequencing on the enriched long-fragment DNA library to obtain sequencing data; and (e) performing bioinformatic analysis on the sequencing data, including UMI-based error correction and variant detection; wherein the method is used for simultaneously detecting structural variation and low-frequency chimeric variation in a target genomic region.
2. The sequencing method of claim 1, wherein in step (a), the biological sample is selected from the group consisting of peripheral blood, saliva, amniotic fluid, chorionic villi, tumor tissue, skin tissue, semen, and combinations thereof; preferably, the fragment lengths of the high molecular weight DNA are mainly distributed above 20 kb, more preferably above 40 kb; preferably, the extraction uses a magnetic bead method or a dedicated kit; preferably, a quality control step is also included to ensure that the DNA purity meets A260/280>1.5 and A260/230>1.8.
3. The method according to claim 1, wherein in step (b), the UMI sequence is a random nucleotide sequence of 10-30 bp, preferably 18 bp, in length, and the labeling comprises end repair, dA-tailing and linker ligation.
4. The method according to claim 1, wherein in step (c), the probe is a biotinylated capture probe covering the full-length region, introns, exons and flanking regulatory regions of the target gene; preferably, the enrichment employs liquid-phase hybridization capture or a CRISPR-Cas system-based targeted cleavage method; preferably, the hybridization time is 2-24 hours, more preferably 4-16 hours; preferably, in step (d), the long-read sequencing employs a high-fidelity sequencing platform comprising a PacBio or Oxford Nanopore system, with a sequencing depth of >300x, preferably >500x, yielding read lengths of thousands to tens of thousands of bases.
5. The sequencing method of claim 1, wherein in step (e), the bioinformatic analysis comprises grouping reads by UMI sequence and performing multiple sequence alignment within each group to generate molecular consensus sequences, followed by detection of structural variants, SNVs, indels and chimeric variants, preferably using statistical models to calculate the confidence of the variants, supporting detection of chimeric variants with allele frequencies as low as 0.5%; preferably, the method further comprises a false positive filtering step in which candidate variants are scored and validated using a machine learning tool; preferably, the target locus comprises a gene associated with a genetic disorder, such as TSC1/TSC2 or similar genes, and the method is useful for diagnosing a genetic disorder, including but not limited to a rare disorder, a tumor-associated genetic variation, or a chimerism-related disorder.
6. A kit comprising a high molecular weight DNA extraction reagent, a linker comprising a UMI sequence, a capture probe for a target locus, streptavidin magnetic beads, and a sequencing reagent; preferably, the kit further comprises spectrophotometric or electrophoresis reagents for quality control, and bioinformatic analysis software or scripts.
7. A system for targeted long-read sequencing incorporating unique molecular identifiers, the system comprising: a DNA extraction module for extracting high molecular weight DNA from a biological sample; a library construction module for labeling the DNA with UMI linkers; an enrichment module for targeted enrichment of the UMI-labeled DNA library; a sequencing module for long-read sequencing; and an analysis module for UMI-based bioinformatic processing and variant detection; preferably, the analysis module comprises a processor and a storage medium storing instructions for performing UMI-based error correction, variant detection, and statistical model calculation; preferably, the system further comprises a verification module for verifying candidate variants by ddPCR or targeted ONT sequencing.
8. A computer-implemented bioinformatic analysis method for processing targeted long-read sequencing data, the method comprising: (a) grouping the sequencing reads based on the UMI sequence; (b) performing multiple sequence alignment on each UMI group to generate a molecular consensus sequence; (c) aligning the consensus sequence to a reference genome; (d) detecting structural variants, SNVs, indels and chimeric variants; and (e) evaluating variant confidence using a statistical model, supporting the detection of low-frequency chimeric variation; preferably, the structural variation detection uses a combination of algorithms, such as Sniffles, pbsv or cuteSV, and the chimeric variation detection is based on UMI-supported read counts and uses a binomial or beta-binomial distribution to construct a background error rate model for calculating the confidence of low-frequency variants; preferably, the method further comprises a machine learning filtering step to score the candidate variants.
9. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the bioinformatics analysis method of claim 8.
10. Use of the sequencing method of any one of claims 1 to 5, or the kit of claim 6, for the preparation of a reagent or device for diagnosis of a genetic disorder; preferably, the genetic disorder includes, but is not limited to, tuberous sclerosis, cancer-related genetic variation or chimerism-related disease, and the use enables detection of complex structural variations and low-frequency chimeric variations.
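
Steps (a) and (b) of claim 8 (grouping reads by UMI and building a molecular consensus) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. It assumes exact UMI matching and reads that are already aligned to equal length, and it replaces the multiple sequence alignment named in the claim with a simple per-column majority vote. The function names and the `min_reads` threshold are hypothetical.

```python
from collections import Counter, defaultdict


def group_by_umi(reads):
    """Group (umi, sequence) pairs by exact UMI match.

    Assumption: UMIs are compared exactly; a real pipeline would
    usually also tolerate sequencing errors inside the UMI itself.
    """
    groups = defaultdict(list)
    for umi, seq in reads:
        groups[umi].append(seq)
    return dict(groups)


def consensus(seqs, min_reads=2):
    """Return a per-column majority-vote consensus, or None.

    Assumption: the sequences are already aligned and of equal length,
    which stands in for the multiple sequence alignment step.
    Groups with fewer than min_reads reads (hypothetical threshold)
    yield None because a single read cannot be error-corrected.
    """
    if len(seqs) < min_reads:
        return None
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    # Most common base in each alignment column.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))


# Toy input: three reads from one molecule (one read carries an error)
# and a single read from another molecule.
reads = [
    ("AAC", "ACGT"),
    ("AAC", "ACGT"),
    ("AAC", "ACTT"),
    ("GGT", "TTTT"),
]
groups = group_by_umi(reads)
consensus_by_umi = {umi: consensus(seqs) for umi, seqs in groups.items()}
```

In this toy input the isolated error in the third read is outvoted by the other two reads from the same molecule, while the singleton group is left without a consensus.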

Description

Genetic variation and chimera detection method, system and kit

Technical Field

The invention relates to the technical field of gene detection and molecular diagnosis, in particular to a high-sensitivity, high-precision method for detecting complex structural variation, repetitive sequence abnormalities and low-frequency chimeric variation in the human genome, which is particularly suitable for diagnosing difficult cases of genetic diseases such as tuberous sclerosis (TSC).

Background

Currently, genetic diagnosis of hereditary diseases relies mainly on next-generation sequencing (NGS) technology based on short-read sequencing, and on multiplex ligation-dependent probe amplification (MLPA) for detecting large-fragment copy number variations (CNVs).

Short-read NGS panel sequencing is the most widely used technique in clinical practice. The exons and adjacent splice regions of a target gene are captured with designed probes or amplified by multiplex PCR and then subjected to high-throughput sequencing (typical read length 150-300 bp). The technique performs well in detecting single nucleotide variations (SNVs) and small insertions/deletions (indels). MLPA complements NGS: by ligating and amplifying specific probes, it efficiently detects large-fragment deletions or duplications (i.e., CNVs) at the exon level.

To detect low-proportion chimeric variants, the prior art uses high-depth short-read sequencing (e.g., >1000x depth) combined with unique molecular identifiers (UMIs) for error correction. This approach improves the detection of low-frequency SNVs and indels to a certain extent.
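
To make concrete why error correction matters for low-frequency variants, the following Python sketch applies a plain binomial background-error model of the kind named in claim 8. It computes the probability that background errors alone would produce at least the observed number of variant-supporting molecules. This is a sketch under stated assumptions, not the patented statistical model: the function name `binom_tail` and both error rates are hypothetical values chosen for illustration, and no beta-binomial overdispersion is modeled.

```python
from math import exp, lgamma, log, log1p


def binom_tail(k, n, p):
    """Return P(X >= k) for X ~ Binomial(n, p), with 0 < p < 1.

    Terms are computed in log space so that large n does not overflow.
    """
    if k <= 0:
        return 1.0
    total = 0.0
    for i in range(k, n + 1):
        log_pmf = (
            lgamma(n + 1) - lgamma(i + 1) - lgamma(n - i + 1)
            + i * log(p) + (n - i) * log1p(-p)
        )
        total += exp(log_pmf)
    return min(total, 1.0)


# Hypothetical scenario: 1000 molecules cover a site and 5 of them
# support the variant, i.e. an allele frequency of 0.5%.
n_molecules, n_support = 1000, 5

# Assumed per-base background error rates (illustrative only).
raw_error = 0.01          # uncorrected reads
corrected_error = 0.0001  # UMI consensus sequences

p_raw = binom_tail(n_support, n_molecules, raw_error)
p_corrected = binom_tail(n_support, n_molecules, corrected_error)
```

Under these assumed rates, five supporting molecules are entirely consistent with noise when the background error is 1% (`p_raw` is close to 1), but very unlikely to arise from noise once the background error is reduced to 0.01% (`p_corrected` is below one in a million).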
Long-read sequencing (LRS) technologies, represented by PacBio and Oxford Nanopore, produce read lengths of thousands to tens of thousands of bases and can in theory address many limitations of short-read sequencing, particularly in structural variation detection. The use of LRS for whole-genome sequencing to diagnose rare diseases has been reported.

Although the above techniques have achieved great clinical success, significant technical bottlenecks remain for complex genetic etiologies, leaving about 10-15% of patients with a definite clinical diagnosis without a definitive genetic diagnosis (known as "no mutation identified", NMI). The main limitations of existing short-read NGS and MLPA technologies are as follows. First, complex structural variations (SVs) cannot be detected effectively: short reads cannot span the breakpoints of SVs such as inversions, translocations and complex rearrangements of large fragments, so these SVs can only be inferred from indirect signals (such as changes in read depth or the proportion of split reads), which leads to high false positive and false negative rates and leaves the exact structure of complex SVs unresolved. Second, MLPA cannot detect copy-number-neutral SVs such as inversions and balanced translocations. Third, in highly repetitive regions of the genome (such as tandem repeat expansions and pseudogene regions), short reads cannot be aligned uniquely, forming sequencing "black holes" in which variants are missed. Although high-depth short-read sequencing combined with UMIs improves detection of low-frequency SNVs and indels, the short read length itself is unchanged, so the diagnostic difficulties caused by complex SVs and repetitive sequences remain unsolved.
Existing long-read sequencing applications also have shortcomings. Whole-genome long-read sequencing (WGS-LRS) is costly and unsuitable for large-scale cohort screening or routine clinical testing. Applying LRS to targeted sequencing raises the technical problem of efficiently enriching high-quality long DNA fragments, because conventional hybridization capture methods are designed mainly for short DNA fragments and are inefficient for long ones. In addition, LRS has an appreciable error rate, so without sufficient error correction its sensitivity and specificity for directly detecting low-frequency chimeric variation are limited. The prior art therefore still has significant shortcomings, and there is a need for a novel gene detection technique or method that overcomes the above blind spots while remaining suitable for routine clinical diagnosis.

Disclosure of Invention

In view of the defects and shortcomings of the prior art, the purpose of this scheme is to provide a novel gene detection method with high sensitivity, high accuracy, high efficiency and controllable cost. The method aims to achieve, in a single assay, both accurate resolution of complex genetic variation and ultra-sensitive identification of chimeric variation by creatively in