CN-121999863-A - Method and device for identifying structural variation of whole genome based on assembly

Abstract

The invention relates to an assembly-based whole-genome structural variation identification method and device. The method comprises: obtaining a query genome and aligning it to a reference genome to obtain an alignment file; preprocessing the alignment file; performing block and cluster analysis of the alignment records in the preprocessed alignment file to obtain a clustering result; performing variation identification within each cluster and between different clusters; based on the variation identification result, performing inversion identification on the structurally divergent regions in the alignment records through an iterative algorithm, and resetting breakpoints for re-clustering; obtaining the variation set of each haplotype from the variation identification result and the re-clustered structurally divergent regions; and performing variation merging and genotyping on the variation sets of all haplotypes to obtain a final variation result. Compared with the prior art, the method can accurately identify large-scale structural variation and supports cross-species variation identification, among other advantages.

Inventors

MAO YAFEI
ZHOU FEIFEI
HAN JUNMIN
ZHANG SHILONG

Assignees

Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date: 20260508
Application Date: 20251231

Claims (10)

1. An assembly-based whole-genome structural variation identification method, characterized by comprising the following steps: acquiring a query genome and aligning the query genome to a reference genome to obtain an alignment file; preprocessing the alignment file; performing block and cluster analysis of the alignment records according to the preprocessed alignment file to obtain a clustering result; performing variation identification within each cluster and between different clusters, respectively; according to the variation identification result, performing inversion identification on the structurally divergent regions in the alignment records through an iterative algorithm, and resetting breakpoints for re-clustering; acquiring the variation sets of all haplotypes according to the variation identification result and the re-clustered structurally divergent regions; and performing variation merging and genotyping on the variation sets of all haplotypes to obtain a final variation result.
2. The assembly-based whole-genome structural variation identification method according to claim 1, wherein the variation merging and genotyping specifically comprises: performing pairwise comparison of the variations in the variation sets of all haplotypes: if both variations are single nucleotide variations, judging whether their genomic positions and substituted bases are consistent, and if so, judging the two variations to be a homozygous single nucleotide variation of the diploid; if the genomic positions of the two variations overlap, calculating the edit distance, the k-mer similarity and the reciprocal overlap of the two variation sequences, and if the calculated edit distance, k-mer similarity and reciprocal overlap do not exceed the corresponding preset thresholds, judging the two variations to be a homozygous variation and merging them, otherwise judging them to be independent heterozygous variations; and if the variation sets contain the query genome of only one haplotype, directly using that variation set as the final variation result.
3. The assembly-based whole-genome structural variation identification method according to claim 1, wherein the preprocessing specifically comprises: if the alignment position of an alignment record in the alignment file is located in a preset complex genomic region, filtering out that alignment record; and if an alignment record in the alignment file is located on a homologous chromosome, retaining that alignment record and marking its haplotype source.
4. The assembly-based whole-genome structural variation identification method according to claim 1, wherein the alignment file is a PAF alignment file, or a file obtained by aligning the query genome to the reference genome using minimap2.
5. The assembly-based whole-genome structural variation identification method according to claim 1, wherein the block and cluster analysis of the alignment records specifically comprises: dividing the query genome and the reference genome into a plurality of contiguous, non-overlapping windows of fixed length, and clustering the alignment records within each window; and, during clustering, if the alignment direction of a segment of the query genome is recognized to be opposite to the sequence direction of the reference genome, reversing the alignment coordinates of that segment.
6. The assembly-based whole-genome structural variation identification method according to claim 1, wherein the variation identification within each cluster comprises identifying insertions, deletions, inversions, duplications, indels, single nucleotide variations and nested inversions.
7. The assembly-based whole-genome structural variation identification method according to claim 6, wherein the nested inversion identification specifically comprises: analyzing the breakage-and-rejoining patterns of the clustered sequences, and determining that a nested inversion is identified if nested breakage-and-rejoining patterns in opposite orientations are detected within an inversion region.
8. The assembly-based whole-genome structural variation identification method according to claim 1, wherein the variation identification between different clusters comprises identifying inversions, translocations, duplications, structurally divergent regions, and large-fragment deletions or duplications.
9. The assembly-based whole-genome structural variation identification method according to claim 1, wherein performing inversion identification on the structurally divergent regions in the alignment records through an iterative algorithm according to the variation identification result, and resetting breakpoints for re-clustering, specifically comprises: traversing each structurally divergent region in the alignment records, extracting the corresponding sequences of the query genome and the reference genome within that region, re-aligning them, and reclassifying the region into new structurally divergent regions according to the re-alignment result.
10. An assembly-based whole-genome structural variation identification device, comprising a memory and a processor, wherein the memory stores a computer program and the processor invokes the computer program to execute the steps of the method according to any one of claims 1 to 9.
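
The pairwise merging and genotyping decision of claim 2 can be illustrated with a minimal Python sketch. Everything below is an assumption made for illustration and is not taken from the patent: the `Variant` record, the function names, the choice of Jaccard similarity over k-mer sets, the normalization of the edit distance, and the threshold values are all hypothetical. The claim only states that the three measures are compared against preset thresholds; this sketch assumes the edit distance must stay at or below its threshold while the similarity and the overlap must reach theirs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record (not defined in the source)."""
    chrom: str
    start: int  # reference start, 0-based inclusive
    end: int    # reference end, exclusive
    kind: str   # e.g. "SNV", "INS", "DEL"
    seq: str    # substituted base or variant sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance using a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def kmer_similarity(a: str, b: str, k: int = 3) -> float:
    """Jaccard similarity of the k-mer sets (an assumed similarity measure)."""
    ka = {a[i:i + k] for i in range(len(a) - k + 1)}
    kb = {b[i:i + k] for i in range(len(b) - k + 1)}
    if not ka or not kb:  # at least one sequence is shorter than k
        return 1.0 if a == b else 0.0
    return len(ka & kb) / len(ka | kb)


def reciprocal_overlap(v1: Variant, v2: Variant) -> float:
    """Smaller of the two fractions of each interval covered by the other."""
    inter = min(v1.end, v2.end) - max(v1.start, v2.start)
    if v1.chrom != v2.chrom or inter <= 0:
        return 0.0
    return min(inter / (v1.end - v1.start), inter / (v2.end - v2.start))


def call_genotype(v1: Variant, v2: Variant, max_norm_edit: float = 0.2,
                  min_kmer_sim: float = 0.8, min_overlap: float = 0.5) -> str:
    """Return "hom" if the two haplotype variants merge, otherwise "het".

    The thresholds are illustrative defaults, not values from the patent.
    """
    if v1.kind == "SNV" and v2.kind == "SNV":
        same = (v1.chrom, v1.start, v1.seq) == (v2.chrom, v2.start, v2.seq)
        return "hom" if same else "het"
    overlap = reciprocal_overlap(v1, v2)
    if overlap == 0.0:
        return "het"
    norm_edit = edit_distance(v1.seq, v2.seq) / max(len(v1.seq), len(v2.seq), 1)
    similar = (norm_edit <= max_norm_edit
               and kmer_similarity(v1.seq, v2.seq) >= min_kmer_sim
               and overlap >= min_overlap)
    return "hom" if similar else "het"
```

For example, under these assumed thresholds, two deletions with identical sequences whose reference intervals are shifted by one base would be merged as homozygous, while two deletions at the same position with unrelated sequences would be reported as independent heterozygous variations.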

Description

Method and device for identifying structural variation of whole genome based on assembly

Technical Field

The invention relates to the technical field of genetic variation identification, and in particular to an assembly-based method and device for identifying whole-genome structural variation.

Background

Current assembly-based tools for identifying structural variations (SVs) are mainly PAV, SVIM-asm and SyRI. PAV jointly uses the alignment information between the reference genome and the query sequence, mining variations from CIGAR strings and from alignment fragmentation. It can identify single nucleotide variants (SNV), insertions (INS), deletions (DEL) and inversions (INV), but it falls short in run time and in identifying large-scale inversions, and it cannot identify other variation types such as structurally divergent regions (SDRs), duplications (DUP) and translocations (TRANS).

SVIM-asm analyzes a given sorted BAM file and detects five types of variation between the reference and query genomes: DEL, INS, tandem DUP, interspersed DUP and INV. However, SVIM-asm cannot identify SNVs or small insertions and deletions (InDels), and it performs poorly in cross-species inversion identification.

SyRI performs a systematic whole-genome comparison, first determining the large-scale structural relationships between the genomes and then finely identifying the various variations on that basis, but its cross-species variation identification performance is poor.

Therefore, in view of the shortcomings of existing tools in cross-species (e.g., cynomolgus monkey and human) SV identification, there is an urgent need for a whole-genome SV identification scheme that combines high accuracy, high sensitivity and high computational efficiency.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing an assembly-based whole-genome structural variation identification method and device that combine high accuracy, high sensitivity and high computational efficiency.

The aim of the invention can be achieved by the following technical scheme:

An assembly-based whole-genome structural variation identification method comprises the following steps:

acquiring a query genome and aligning the query genome to a reference genome to obtain an alignment file;
preprocessing the alignment file;
performing block and cluster analysis of the alignment records according to the preprocessed alignment file to obtain a clustering result;
performing variation identification within each cluster and between different clusters, respectively;
according to the variation identification result, performing inversion identification on the structurally divergent regions in the alignment records through an iterative algorithm, and resetting breakpoints for re-clustering;
acquiring the variation sets of all haplotypes according to the variation identification result and the re-clustered structurally divergent regions;
and performing variation merging and genotyping on the variation sets of all haplotypes to obtain a final variation result.
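
The first steps of the scheme above (reading alignment records, filtering them, and grouping them into fixed-length windows) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the field layout follows the standard 12 mandatory PAF columns, while the `PafRecord` type, the function names, the use of the target start coordinate as the binning key, the window size, and the way minus-strand query coordinates are flipped are all hypothetical choices made for this example.

```python
from collections import defaultdict
from typing import NamedTuple


class PafRecord(NamedTuple):
    """Subset of the mandatory PAF columns used in this sketch."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tstart: int
    tend: int


def parse_paf_line(line: str) -> PafRecord:
    """Parse one tab-separated PAF line (columns 1-9 are used here)."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                     f[5], int(f[7]), int(f[8]))


def in_complex_region(rec: PafRecord, regions: dict) -> bool:
    """True if the target interval overlaps any preset excluded region.

    `regions` maps a chromosome name to a list of (start, end) tuples;
    this representation is an assumption of the sketch.
    """
    return any(rec.tstart < end and start < rec.tend
               for start, end in regions.get(rec.tname, []))


def oriented_query_interval(rec: PafRecord) -> tuple:
    """Flip query coordinates for minus-strand records (one possible
    reading of 'reversing the alignment coordinates')."""
    if rec.strand == "-":
        return rec.qlen - rec.qend, rec.qlen - rec.qstart
    return rec.qstart, rec.qend


def bin_by_window(records, window: int = 1_000_000) -> dict:
    """Group records into contiguous, non-overlapping reference windows,
    keyed by (chromosome, window index of the target start)."""
    bins = defaultdict(list)
    for rec in records:
        bins[(rec.tname, rec.tstart // window)].append(rec)
    return dict(bins)
```

A caller would typically parse every line of the alignment file, drop the records for which `in_complex_region` is true, and pass the remainder to `bin_by_window` before clustering the records inside each window.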

Further, the variation merging and genotyping specifically comprises performing pairwise comparison of the variations in the variation sets of all haplotypes: if both variations are single nucleotide variations, judging whether their genomic positions and substituted bases are consistent, and if so, judging the two variations to be a homozygous single nucleotide variation of the diploid; if the genomic positions of the two variations overlap, calculating the edit distance, the k-mer similarity and the reciprocal overlap of the two variation sequences, and if the calculated edit distance, k-mer similarity and reciprocal overlap do not exceed the corresponding preset thresholds, judging the two variations to be a homozygous variation and merging them, otherwise judging them to be independent heterozygous variations; and if the variation sets contain the query genome of only one haplotype, directly using that variation set as the final variation result.

Further, the preprocessing specifically comprises: if the alignment position of an alignment record in the alignment file is located in a preset complex genomic region, filtering out that alignment record; and if an alignment record in the alignment file is located on a homologous chromosome, retaining that alignment record and marking its haplotype source.

Further, the alignment file is a PAF alignment file, or a file obtained by aligning the query genome to the reference genome using minimap2.

Further, the block and cluster analysis of the alignment records specifically comprises: dividing the query genome and the reference genome into a plurality of contiguous, non-overlapping windows of fixed length, and cl