CN-122024841-A - Method, system, equipment and medium for assembling third-generation sequencing data based on clustering and graph construction

Abstract

The application discloses a method, a system, equipment and a medium for assembling third-generation sequencing data based on clustering and graph construction, and belongs to the technical field of biological sequence processing. The method comprises: obtaining sequence similarity based on the third-generation sequencing data and clustering the sequences with a clustering algorithm to obtain different clusters; assembling the sequences within each cluster by a graph construction method to obtain intra-group consensus sequences; and combining all intra-group consensus sequences from the different clusters and assembling them by a graph construction method to obtain inter-group consensus sequences. Through the strategy of sequence clustering, intra-group assembly and inter-group assembly, the technical scheme of the application effectively reduces the complexity of assembly; through an optimal overlap graph algorithm and depth pruning, it ensures the accuracy and integrity of the consensus sequences, and thus has remarkable technical advantages.
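
The three-stage strategy described above can be sketched as a small orchestration function. This is only an illustration of the data flow, not the application's implementation: `hierarchical_assembly`, `cluster` and `assemble` are hypothetical names, and the toy stand-ins at the bottom (grouping by read length, de-duplicating) merely take the place of the similarity-based clustering and graph-based assembly that the application describes.

```python
from collections import defaultdict
from typing import Callable, Sequence

# Hypothetical interfaces: a clusterer splits reads into groups, and an
# assembler turns a group of sequences into one or more consensus sequences.
Clusterer = Callable[[Sequence[str]], list[list[str]]]
Assembler = Callable[[Sequence[str]], list[str]]


def hierarchical_assembly(reads: Sequence[str],
                          cluster: Clusterer,
                          assemble: Assembler) -> list[str]:
    """Cluster reads, assemble each cluster, then assemble the merged results."""
    clusters = cluster(reads)              # stage 1: sequence clustering
    intra = [consensus                     # stage 2: intra-group assembly
             for group in clusters
             for consensus in assemble(group)]
    return assemble(intra)                 # stage 3: inter-group assembly


# Toy stand-ins, for demonstration only (assumptions, not the real steps).
def cluster_by_length(reads: Sequence[str]) -> list[list[str]]:
    groups = defaultdict(list)
    for read in reads:
        groups[len(read)].append(read)
    return list(groups.values())


def deduplicate(seqs: Sequence[str]) -> list[str]:
    return sorted(set(seqs))


result = hierarchical_assembly(
    ["ACGT", "ACGT", "ACGTA", "ACGT"], cluster_by_length, deduplicate
)
print(result)
```

Because each cluster is assembled independently, the graph construction step only ever sees a small set of similar sequences at a time, which is presumably where the reduction in assembly complexity comes from.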

Inventors

ZHU JIANLONG
YU BIN
ZHANG LE
YU HENGYI

Assignees

欣基(杭州)生物科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260413

Claims (10)

1. A method for assembling third-generation sequencing data based on clustering and graph construction, characterized by comprising the following steps: sequence clustering, namely acquiring sequence similarity based on the third-generation sequencing data, and clustering the sequences by using a clustering algorithm to obtain different clusters; intra-group assembly, namely assembling the sequences in the same cluster based on a graph construction method to obtain an intra-group consensus sequence; and inter-group assembly, namely combining all the intra-group consensus sequences from different clusters, and assembling the intra-group consensus sequences based on a graph construction method to obtain an inter-group consensus sequence.
2. The method of claim 1, further comprising, prior to the sequence clustering step, the step of: sequence filtering, namely filtering out sequences with the length larger than a first preset threshold value and the length smaller than a second preset threshold value, wherein the first preset threshold and the second preset threshold are determined based on a range of full sequence lengths in the third-generation sequencing data.
3. The method of claim 2, wherein the sequence filtering step further comprises filtering out sequences having a length less than a third predetermined threshold.
4. The method of claim 1, wherein in the sequence clustering step, the sequence similarity is calculated based on strobemers by generating a randstrobe feature vector for each sequence, constructing an index of strobemer features, searching for candidate sequence pairs that share strobemers, and calculating bidirectional similarity only for the screened candidate pairs; and the clustering is performed in multiple stages, the similarity threshold being progressively lowered from stage to stage, and core points and boundary points being identified for each cluster in each stage.
5. The method according to any one of claims 1-4, further comprising, after the intra-group and/or inter-group assembly step, the following sequence trimming and polishing steps: sequence alignment and depth calculation, namely aligning a sequence before clustering with an intra-group consensus sequence or an inter-group consensus sequence by taking the sequence before clustering as a reference sequence, and calculating the depth of each base position; sequence trimming, namely trimming off regions with a depth lower than a fourth preset threshold value; and sequence polishing, namely correcting the base at each position based on alignment quality score, depth and/or number of bases.
6. The method of claim 5, wherein the trimming and polishing steps are performed in multiple rounds, the sequence trimmed and polished in the previous round being used as the reference sequence for the next round of alignment.
7. A system for assembling third-generation sequencing data based on clustering and graph construction, comprising the following modules: a data input module for obtaining the third-generation sequencing data; a sequence clustering module, connected with the data input module, for obtaining sequence similarity based on the third-generation sequencing data and clustering the sequences by using a clustering algorithm to obtain different clusters; an intra-group assembly module, connected with the sequence clustering module, for assembling sequences in the same cluster based on a graph construction method to obtain an intra-group consensus sequence; and an inter-group assembly module, connected with the intra-group assembly module, for combining all intra-group consensus sequences from different clusters and assembling the intra-group consensus sequences based on a graph construction method to obtain an inter-group consensus sequence.
8. The system of claim 7, further comprising a sequence trimming and polishing module coupled to the data input module and further coupled to the intra-group assembly module and/or the inter-group assembly module for performing the steps of: sequence alignment and depth calculation, namely aligning a sequence before clustering with an intra-group consensus sequence or an inter-group consensus sequence by taking the sequence before clustering as a reference sequence, and calculating the depth of each base position; sequence trimming, namely trimming off regions with a depth lower than a fourth preset threshold value; and sequence polishing, namely correcting the base at each position based on alignment quality score, depth and/or number of bases.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-6.
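
The candidate screening recited in claim 4 can be sketched as follows. This is a simplified illustration under stated assumptions, not the claimed implementation: the order-2 randstrobe construction here picks the second strobe by minimizing a hash of the two concatenated k-mers, the parameters `k`, `w_min` and `w_max` are arbitrary, and "bidirectional similarity" is assumed to mean the shared fraction of strobemers measured from each sequence's side. All function names are hypothetical.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

Strobemer = tuple[str, str]


def _h(s: str) -> int:
    """Deterministic 64-bit hash (Python's built-in hash() is salted per run)."""
    return int.from_bytes(hashlib.blake2b(s.encode(), digest_size=8).digest(), "big")


def randstrobes(seq: str, k: int = 5, w_min: int = 6, w_max: int = 12) -> set[Strobemer]:
    """Simplified order-2 randstrobes: the k-mer at i is linked to one k-mer
    chosen from the downstream window [i + w_min, i + w_max]."""
    out: set[Strobemer] = set()
    last = len(seq) - k  # last valid k-mer start position
    for i in range(last + 1):
        window = range(i + w_min, min(i + w_max, last) + 1)
        if len(window) == 0:  # too close to the end; real tools shrink the window
            continue
        first = seq[i:i + k]
        j = min(window, key=lambda p: _h(first + seq[p:p + k]))
        out.add((first, seq[j:j + k]))
    return out


def candidate_pairs(features: dict[str, set[Strobemer]]) -> set[tuple[str, str]]:
    """Index strobemers, then return every pair of reads sharing at least one."""
    index: dict[Strobemer, set[str]] = defaultdict(set)
    for name, strobes in features.items():
        for s in strobes:
            index[s].add(name)
    pairs: set[tuple[str, str]] = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


def bidirectional_similarity(a: set[Strobemer], b: set[Strobemer]) -> tuple[float, float]:
    """Fraction of a's strobemers found in b, and of b's found in a."""
    if not a or not b:
        return 0.0, 0.0
    shared = len(a & b)
    return shared / len(a), shared / len(b)


reads = {
    "r1": "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGACCATG",
    "r2": "ACGTTGCAAGCTTAGGCATCGATCCGTAAGCTTGACCATG",
    "r3": "T" * 40,
}
features = {name: randstrobes(seq) for name, seq in reads.items()}
pairs = candidate_pairs(features)
for x, y in sorted(pairs):
    print(x, y, bidirectional_similarity(features[x], features[y]))
```

Only pairs that come out of the index are scored, so reads with no shared strobemer (such as `r3` here) are never compared. Reporting the similarity in both directions lets a short read that is fully contained in a longer one be distinguished from two reads of similar length that only partly agree.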

Description

Method, system, equipment and medium for assembling third-generation sequencing data based on clustering and graph construction

Technical Field

The application relates to the technical field of biological sequence processing, in particular to a method, a system, equipment and a medium for assembling third-generation sequencing data based on clustering and graph construction.

Background

Amplicon sequencing is a key technique in molecular biology research: a region of a target gene is specifically amplified by PCR (polymerase chain reaction) so that the region can be sequenced and analyzed. It is commonly used to reveal the diversity of target genes in a sample; for example, amplification and detection of the 16S/18S/ITS rRNA genes can reveal the structure of an environmental microbial community. In clinical studies, specific genes in human samples (e.g., cancer tissues) can also be sequenced to reveal genetic variations associated with disease.

With the maturation of third-generation sequencing technologies, represented by PacBio HiFi and Oxford Nanopore, those skilled in the art can obtain thousands to tens of thousands of complete amplicon reads, with the aim of reconstructing from these raw sequencing reads a high-fidelity consensus sequence for each gene variant present in the sample. However, the field currently lacks high-fidelity assembly tools designed for such sequencing data (i.e., long reads, high depth, high similarity, and large length variation). Those skilled in the art generally have to adopt one of two existing technical routes, neither of which was designed for this purpose.

The first technical route is to use tools designed for whole-genome de novo assembly, such as Canu or Flye. The core algorithms of these tools (such as overlap-layout-consensus or A-Bruijn graphs) are optimized for processing random, low-depth genomic data, with the design goals of resolving repetitive sequences in the genome and extending contigs.
However, when faced with amplicon data (high depth, highly similar), these algorithms tend to falsely collapse similar variants (e.g., different alleles) into a single or chimeric consensus sequence. Furthermore, the filtering mechanisms of these tools are highly likely to falsely exclude valid variants in cancer that are too long (carrying large insertions) or too short (carrying large deletions), resulting in the loss of critical variant information.

The second technical route is to use conventional multiple sequence alignment tools (e.g., MUSCLE or MAFFT). This approach globally aligns all reads and then extracts consensus sequences from the alignment. It works well when the sample is pure and the total number of reads is small, but the computational complexity of the algorithm grows exponentially, or as a high-order polynomial, with the number of sequences, so the high-throughput reads generated by third-generation sequencing take an extremely long time to process. Moreover, when facing sequences of greatly varying length, multiple sequence alignment tools introduce a large number of artificial gaps, resulting in failed alignments or meaningless consensus sequences.

In the above prior-art routes, because the tools are mismatched to the task by design and the algorithms do not scale, existing methods suffer from low assembly efficiency, a high chimera rate, and failure to construct accurate consensus sequences when assembling large-scale third-generation amplicon data. An efficient assembly method designed specifically for this scenario is therefore needed.

Disclosure of Invention

In order to solve at least one of the above technical problems, the technical scheme adopted by the application is as follows.
The first aspect of the application provides a method for assembling third-generation sequencing data based on clustering and graph construction, which comprises the following steps: sequence clustering, namely acquiring sequence similarity based on the third-generation sequencing data, and clustering the sequences by using a clustering algorithm to obtain different clusters; intra-group assembly, namely assembling the sequences in the same cluster based on a graph construction method to obtain an intra-group consensus sequence; and inter-group assembly, namely combining all the intra-group consensus sequences from different clusters, and assembling the intra-group consensus sequences based on a graph construction method to obtain an inter-group consensus sequence.

In some embodiments of the application, the method further comprises, prior to the sequence clustering step, the step of: sequence filtering, namely filtering out sequences with the length larger than a first preset threshold value and the length smaller than a second preset threshold value, wherein the first preset threshold and the second preset threshold are determined based on a range of full sequence lengths in the thi