CN-115831222-B - Three-generation sequencing-based whole genome structure variation identification method

Abstract

The embodiment of the invention discloses a whole-genome structural variation (SV) identification method based on third-generation sequencing. The method comprises: aligning the sequencing data to be analyzed to a preset reference genome, then sorting the resulting alignment file and building an index; analyzing the base alignment of each sequencing read and performing SV identification to determine whether the read contains a variant DNA fragment signal; correcting erroneous and overlapping variant DNA fragment signals and outputting the corrected sequences; aligning the corrected sequences to the preset reference genome, performing SV identification, and outputting the identification result; and genotyping the sequencing data according to the identification result. For high-error third-generation sequencing reads, the invention can improve detection accuracy and sensitivity and reduce alignment limitations, improve the identification accuracy of complex or longer variant DNA fragments, accurately distinguish closely spaced adjacent variant DNA fragments, and accurately genotype the variant DNA fragments.

Inventors

HU JIANG
WANG YANG
WANG DEPENG

Assignees

北京希望组生物科技有限公司

Dates

Publication Date: 20260508
Application Date: 20221220

Claims (2)

1. A whole-genome structural variation identification method based on third-generation sequencing, characterized by comprising the following steps:
S11, aligning the sequencing data to be analyzed to a preset reference genome, and sorting the resulting alignment file and building an index;
S21, analyzing the base alignment of each sequencing read and performing SV identification to determine whether the read contains a variant DNA fragment signal;
S31, correcting erroneous and overlapping variant DNA fragment signals;
S41, outputting the corrected sequence, aligning the corrected sequence to the preset reference genome, performing SV identification, and outputting an identification result;
S51, genotyping the sequencing data according to the identification result.
In S31, correcting the erroneous and overlapping variant DNA fragment signals includes:
S311, calculating the alignment depth of the data in windows of 500 bp; if the alignment depth of a window is far greater than the average depth, filtering out the variant DNA fragment signals caused by secondary alignments in that window, thereby reducing false variant DNA fragment signals caused by alignment errors;
S312, clustering the overlapping variant DNA fragment signals;
S313, for each cluster, calculating the consistency among its signals; if the consistency is high, outputting the median values and deleting the cluster;
S314, further merging clusters: iterating over every pair of clusters, and merging the two clusters into one if the sequencing reads contained in their signals overlap and the number of overlapping sequencing reads exceeds two;
S315, filtering the clusters;
S316, outputting one type of variant DNA fragment for each cluster.
In S31, correcting the erroneous and overlapping variant DNA fragment signals further includes:
S317, filtering the clusters repeatedly until no clusters remain.
In S312, clustering the overlapping variant DNA fragment signals includes:
S3121, sorting the list of variant DNA fragment signals in ascending order by chromosome, start coordinate and end coordinate;
S3122, iterating through the list and comparing the current variant DNA fragment signal with the previous one; if the two signals are on the same chromosome and the distance between them is within 500 bp, adding the current signal to the cluster of the previous signal; otherwise, taking the current signal as the starting point of a new cluster.
In S313, calculating the consistency among the signals of a cluster, outputting the median values if the consistency is high, and deleting the cluster includes:
S3131, checking whether the types of the variant DNA fragment signals are consistent, and if not, skipping the cluster;
S3132, checking in turn the variance of the start coordinates, the variance of the end coordinates and the variance of the lengths of the variant DNA fragment signals, and skipping the cluster if any variance is greater than 50;
S3133, taking the median of the start coordinates as the start position of the variant DNA fragment, the median of the end coordinates as the end position of the variant DNA fragment, and the median of the lengths as the length of the variant DNA fragment, and outputting the variant DNA fragment;
S3134, deleting the cluster of the output variant DNA fragment from the cluster list.
In S315, filtering the clusters includes:
S3151, calculating the minimum start position and the maximum end position of the variant DNA fragment signals in each cluster as the signal interval of the cluster;
S3152, calculating the average depth of the signal interval of the cluster based on the depth information obtained in S311;
S3153, if the number of reads contained in the variant DNA fragment signals of the cluster is less than 30% of the average depth or less than 3, considering the cluster to be a variant DNA fragment signal caused by an excessively high error rate or by alignment errors, and deleting the cluster.
In S316, outputting one type of variant DNA fragment for each cluster includes:
S3161, calculating the minimum start position and the maximum end position of the variant DNA fragment signals in each cluster as the signal interval of the cluster;
S3162, detecting whether any sequencing read in the cluster spans the signal interval of the cluster;
S3163, if such a read exists, taking that sequencing read as the reference sequence of the cluster; if not, performing pairwise alignment of the sequencing reads of the cluster, constructing a graph and assembling, and outputting one of the longest contig sequences as the reference sequence of the cluster;
S3164, aligning all sequencing reads to the reference sequence of the cluster, filtering out the alignment results that contain variant DNA fragment signals, and correcting the reference sequence using the remaining consistent alignments;
S3165, outputting the corrected sequence.
In S316, outputting one type of variant DNA fragment for each cluster further includes:
S3166, re-aligning all sequencing reads in the cluster to the output variant DNA fragment, picking out the sequencing reads that show no variant DNA fragment signal, and deleting the signals contained in those sequencing reads from the cluster;
S3167, recording each sequencing read that shows no variant DNA fragment signal as a sequencing read supporting the variant DNA fragment contained in the corrected sequence.
In S51, genotyping the sequencing data according to the identification result includes:
S511, if two variant DNA fragments come from the same cluster and their supporting sequencing reads do not overlap, setting the two variant DNA fragments as heterozygous variant DNA fragments of different types;
S512, if the condition in S511 does not occur, calculating the ratio of the number of sequencing reads supporting the variant DNA fragment to the total number of sequencing reads spanning the variant DNA fragment signal; if the ratio is more than 80%, setting the variant DNA fragment as homozygous, otherwise setting it as heterozygous.
2. The whole-genome structural variation identification method based on third-generation sequencing according to claim 1, wherein in S21, analyzing the base alignment of each sequencing read to identify whether it contains a variant DNA fragment signal comprises:
S211, if the sequencing read contains an inserted or deleted fragment of 30 bp or more, storing the sequencing read and the interval position of the inserted or deleted fragment in a list;
S212, if the sequencing read contains a clipped alignment segment of 100 bp or more, storing the sequencing read and the interval position of the clipped segment in a list;
S213, if different parts of the sequencing read align to different positions (more than one alignment), storing the sequencing read and its multiple alignments in a list.
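
For illustration only, the signal extraction of S211 and S212 can be sketched in Python by walking a SAM-style CIGAR string. The function name `extract_signals`, the tuple layout, and the choice to parse the CIGAR with a regular expression are assumptions made for this sketch and do not come from the patent. Only the 30 bp and 100 bp thresholds are taken from the claim. Split alignments (S213) are not covered.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by an operation letter (SAM format).
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance the reference position

MIN_INDEL_BP = 30   # S211 threshold from the claim
MIN_CLIP_BP = 100   # S212 threshold from the claim

# Hypothetical record layout: (read_id, kind, ref_start, ref_end, length)
Signal = Tuple[str, str, int, int, int]


def extract_signals(read_id: str, ref_start: int, cigar: str) -> List[Signal]:
    """Collect large insertions, deletions and clips from one alignment."""
    signals: List[Signal] = []
    ref_pos = ref_start
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op == "I" and length >= MIN_INDEL_BP:
            # An insertion occupies a single point on the reference.
            signals.append((read_id, "INS", ref_pos, ref_pos, length))
        elif op == "D" and length >= MIN_INDEL_BP:
            signals.append((read_id, "DEL", ref_pos, ref_pos + length, length))
        elif op in "SH" and length >= MIN_CLIP_BP:
            # Soft or hard clip: record where the alignment breaks off.
            signals.append((read_id, "CLIP", ref_pos, ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return signals


if __name__ == "__main__":
    for signal in extract_signals("read1", 1000, "150S200M45I300M60D100M20S"):
        print(signal)
```

In this sketch the 20 bp trailing clip is ignored because it is below the 100 bp threshold, while the 150 bp leading clip, the 45 bp insertion and the 60 bp deletion are each stored with their reference interval.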
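
As a minimal sketch of the clustering in S3121 and S3122 and the consistency check in S3131 to S3133 of claim 1, the following Python code sorts signals, groups neighbours within 500 bp, and outputs median coordinates when a cluster is consistent. The `Signal` record, the function names, the definition of "distance" as the start of the current signal minus the end of the previous one, and the use of population variance are all assumptions of this sketch. The claim does not specify them.

```python
from dataclasses import dataclass
from statistics import median, pvariance
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Signal:  # hypothetical record for one variant DNA fragment signal
    chrom: str
    start: int
    end: int
    length: int
    sv_type: str
    read_id: str


def cluster_signals(signals: List[Signal], max_gap: int = 500) -> List[List[Signal]]:
    """S3121/S3122: sort, then chain each signal to the previous one if close."""
    ordered = sorted(signals, key=lambda s: (s.chrom, s.start, s.end))
    clusters: List[List[Signal]] = []
    for sig in ordered:
        if clusters:
            prev = clusters[-1][-1]
            # Assumption: distance = current start minus previous end.
            if sig.chrom == prev.chrom and sig.start - prev.end <= max_gap:
                clusters[-1].append(sig)
                continue
        clusters.append([sig])
    return clusters


def consensus(
    cluster: List[Signal], max_variance: float = 50.0
) -> Optional[Tuple[str, str, float, float, float]]:
    """S3131-S3133: return (chrom, type, start, end, length) or None to skip."""
    if len({s.sv_type for s in cluster}) != 1:
        return None  # S3131: mixed types (or empty cluster), skip
    starts = [s.start for s in cluster]
    ends = [s.end for s in cluster]
    lengths = [s.length for s in cluster]
    # S3132: skip if any of the three variances exceeds the threshold.
    if any(pvariance(values) > max_variance for values in (starts, ends, lengths)):
        return None
    first = cluster[0]
    # S3133: medians of start, end and length describe the output fragment.
    return (first.chrom, first.sv_type, median(starts), median(ends), median(lengths))


if __name__ == "__main__":
    demo = [
        Signal("chr1", 1000, 1050, 50, "DEL", "r1"),
        Signal("chr1", 1002, 1051, 49, "DEL", "r2"),
        Signal("chr1", 998, 1049, 51, "DEL", "r3"),
        Signal("chr1", 5000, 5000, 120, "INS", "r4"),
    ]
    for group in cluster_signals(demo):
        print(len(group), consensus(group))
```

Clusters for which `consensus` returns `None` would stay in the cluster list for the later merging, filtering and assembly steps (S314 to S316), which this sketch does not attempt.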
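
The cluster merging of S314 in claim 1 can be illustrated with the following Python sketch. It is a simplification: each cluster is represented only by the set of read identifiers that contributed signals to it, whereas a real implementation would carry the signals along. The function name `merge_clusters` and the repeat-until-stable loop are assumptions. Only the rule that two clusters are merged when they share more than two reads comes from the claim.

```python
from typing import List, Set


def merge_clusters(clusters: List[Set[str]], min_shared: int = 2) -> List[Set[str]]:
    """Merge clusters whose read sets share more than `min_shared` reads."""
    merged = [set(c) for c in clusters]  # copy so the input is not modified
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                if len(merged[i] & merged[j]) > min_shared:
                    merged[i] |= merged[j]   # absorb cluster j into cluster i
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan, since cluster i has grown
    return merged


if __name__ == "__main__":
    groups = [{"r1", "r2", "r3", "r4"}, {"r2", "r3", "r4", "r5"}, {"r9"}]
    print(merge_clusters(groups))
```

Restarting the scan after each merge keeps the logic simple at the cost of extra passes, which seems acceptable for a sketch because a merged cluster may newly overlap clusters that were already compared.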
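
The support filter of S3153 and the genotyping rules of S511 and S512 in claim 1 reduce to a few comparisons, sketched below in Python. The function names, the argument layout, and the string labels are hypothetical. The thresholds (30% of average depth, 3 reads, 80% support ratio) are those stated in the claim.

```python
from typing import Optional, Set


def keep_cluster(
    n_reads: int, mean_depth: float, min_fraction: float = 0.30, min_reads: int = 3
) -> bool:
    """S3153: drop a cluster supported by < 30% of mean depth or by < 3 reads."""
    return n_reads >= min_fraction * mean_depth and n_reads >= min_reads


def genotype(
    support: Set[str],
    n_spanning: int,
    sibling_support: Optional[Set[str]] = None,
    hom_ratio: float = 0.80,
) -> str:
    """S511/S512: call one variant DNA fragment homozygous or heterozygous.

    `sibling_support` is the supporting-read set of a second fragment from
    the same cluster, if there is one (hypothetical argument for this sketch).
    """
    # S511: two fragments from one cluster with disjoint supporting reads
    # are treated as heterozygous fragments of different types.
    if sibling_support is not None and support.isdisjoint(sibling_support):
        return "heterozygous"
    if n_spanning <= 0:
        raise ValueError("n_spanning must be positive")
    # S512: homozygous only if strictly more than 80% of spanning reads support it.
    return "homozygous" if len(support) / n_spanning > hom_ratio else "heterozygous"


if __name__ == "__main__":
    print(keep_cluster(12, 30.0))
    print(genotype({"r1", "r2", "r3", "r4", "r5"}, 10))
```

Note that a ratio of exactly 80% is called heterozygous here, because the claim requires the ratio to be "more than 80%" for a homozygous call.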

Description

Three-generation sequencing-based whole genome structure variation identification method

Technical Field

The invention relates to the technical field of genome variation identification, in particular to a whole-genome structural variation identification method based on third-generation sequencing.

Background

Structural variation (SV) generally refers to variation in DNA fragments of more than 50 bp in length. By type, structural variations can be divided into deletions, duplications, insertions, inversions, translocations, and the like. Structural variations affect the transcription and translation of genes in a variety of ways, thereby causing various genetic diseases. When a structural variation occurs in the coding region of a gene, the transcription and translation of the gene are changed; when it occurs in a non-coding region, it affects the regulatory function of gene expression regulatory elements through a position effect. With the rapid development of sequencing technology and the continuous decrease in sequencing costs, more and more structural variations have been found to be associated with human genetic diseases and even cancer. Examples include Down syndrome caused by trisomy of chromosome 21, cri-du-chat syndrome caused by deletion of the short arm of chromosome 5, and learning disability caused by a deletion at 17q21.31. Current SV detection methods based on second-generation short-read sequencing can be classified into the read-pair method, the read-depth method, the split-read method, and the sequence assembly method, or a combination of these methods.
However, because second-generation reads are short and many structural variations are far longer than the read length, not all structural variations can be detected effectively, and the results often have a high false-positive rate. Third-generation long-read sequencing (mainly PacBio single-molecule real-time sequencing and Oxford Nanopore sequencing), which has developed rapidly in the last two years, offers a way to improve the detection rate and accuracy of structural variation. Software currently used to identify SVs from third-generation sequencing data includes pbSV, cuteSV and Sniffles. Typically, the sequencing data are aligned to the genome, signals containing variant DNA fragments (SVs) are identified, the signals of all variant DNA fragments (SVs) are clustered, and finally, for each cluster, the variant DNA fragment (SV) signals are merged, averaged and output. This approach is strongly affected by sequencing errors and alignment errors. In particular, the boundaries of variant DNA fragments (SVs) are identified inaccurately, complex variant DNA fragments (SVs) cannot be identified, variant DNA fragments (SVs) of the same type but different lengths cannot be genotyped, closely spaced adjacent variant DNA fragments (SVs) cannot be accurately distinguished, and variant DNA fragments (SVs) longer than the sequencing reads cannot be accurately identified. Accordingly, there is a need for further development and advancement in the art.
Disclosure of Invention

In view of the above, an object of the embodiments of the present invention is to provide a whole-genome structural variation identification method based on third-generation sequencing, which can improve detection accuracy and sensitivity and reduce alignment limitations for high-error third-generation sequencing reads, improve the identification accuracy of complex or long variant DNA fragments (SVs) (exceeding the sequencing read length), accurately distinguish closely spaced adjacent variant DNA fragments (SVs), and accurately genotype the variant DNA fragments (SVs).

Embodiments of the present invention are implemented as follows: a whole-genome structural variation identification method based on third-generation sequencing, comprising:
S11, aligning the sequencing data to be analyzed to a preset reference genome, and sorting the resulting alignment file and building an index.
S21, analyzing the base alignment (CIGAR string) of each sequencing read and performing SV identification to determine whether the read contains a variant DNA fragment (SV) signal.
S31, correcting erroneous and overlapping variant DNA fragment signals.
S41, outputting the corrected (polished) sequence, aligning it to the preset reference genome, performing SV identification, and outputting an identification result.
S51, genotyping the sequencing data according to the identification result.

In a preferred embodiment of the present invention, in S21 of the above whole-genome structural variation identification method based on third-generation sequencing, the analyzing the base alignment (CIGAR string) of each sequencing read to identify whether the signal of the variant DNA fragment (SV) is include