CN-121983128-A - Multi-stage gene data accurate matching method and system

CN121983128ACN 121983128 ACN121983128 ACN 121983128ACN-121983128-A

Abstract

The application relates to a method and a system for precisely matching multi-stage gene data, which comprises the steps of selecting continuous K bases from a read sequence to be processed as search strings according to a preset initial length K, determining a matching interval of the search strings in a reference genome from a hash table which is built according to the length K in advance through a Kmer-Index algorithm, expanding and matching the search strings base by base through an FM-Index algorithm based on a preset FM Index in the matching interval until the number of matching positions with the reference genome is reduced to a preset matching frequency threshold value, expanding each search string obtained through base by base expansion and matching according to a single base comparison mode until mismatch occurs, and taking the search string with the largest matching base number as a finally determined seed sequence. The accurate matching scheme of the gene data can improve the seed searching performance and the memory use efficiency, and the accurate matching position of the seeds can be quickly searched under the condition of limited hardware resources.

Inventors

TAN GUANGMING
ZHANG ZHONGHAI

Assignees

中国科学院计算技术研究所

Dates

Publication Date: 20260505
Application Date: 20251231

Claims (10)

1. The multi-stage gene data accurate matching method is characterized by comprising the following steps of: according to the preset initial length K, selecting continuous K bases from a read sequence to be processed as a search string, and determining a matching interval of the search string in a reference genome from a hash table constructed according to the length K in advance through Kmer-Index algorithm; In the matching interval, performing base-by-base expansion and matching on the search string based on a pre-constructed FM Index through an FM-Index algorithm until the number of matching positions with a reference genome is reduced to a preset matching frequency threshold; For each search string obtained through base-by-base expansion and matching, expansion is carried out according to a single base comparison mode until mismatch occurs, and the search string with the largest number of matched bases is used as a finally determined seed sequence.
2. The method of claim 1, wherein the length K ranges between [10,20 ].
3. The method of claim 2, wherein the length K is 14, 15, 16 or 17.
4. The method of claim 1, wherein the matching interval is a first-to-last-occurrence position of the search string in the reference genome.
5. The method of claim 1, wherein the number of matches threshold N is set to a natural number less than 9.
6. The method of claim 5, wherein the number of matches threshold is set to 3 or 2.
7. The method of claim 1, wherein the hash table is comprised of key and value intervals, wherein: each linkage corresponds to one of all base sequences of length K in the reference genome; The value interval corresponding to each key is the first and last occurrence of that key in the reference genome.
8. The method of any of claims 1-7, further comprising pre-constructing an FM Index structure for an FM-Index algorithm based on a reference genome.
9. A multi-stage genetic data exact match system comprising: Kmer-Index processing module configured to select continuous K bases from a to-be-processed read sequence as a search string according to a preset initial length K, and determining a matching interval of the search string in a reference genome from a hash table constructed according to the length K in advance through Kmer-Index algorithm; The FM-Index processing module is configured to expand and match the search string base by base based on a pre-constructed FM Index through an FM-Index algorithm in the matching interval until the number of matching positions with a reference genome is reduced to a preset matching frequency threshold; The Direct-Index processing module is configured to expand each search string from the FM-Index processing module in a single base comparison mode until mismatch occurs, and the search string with the largest number of matched bases is used as a finally determined seed sequence.
10. A computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of claims 1 to 8.

Description

Multi-stage gene data accurate matching method and system Technical Field The present application relates to biological sequencing and genomic data analysis, and in particular to methods and systems for performing sequence alignment. Background With the continuous improvement of throughput of sequencing devices and the increase of accurate medical demands of people, how to rapidly and efficiently process genome data is an important point of research in the field of bioinformatics. The gene sequence alignment is a very critical step in the genome data analysis flow, which refers to the process of aligning the read sequence output by the sequencer to the reference genome. Currently mainstream genetic sequence alignment techniques generally follow a "seed-spread" pattern. The method comprises the steps of firstly, in a seed searching stage, rapidly finding short fragments (called seeds) which are highly similar and completely matched between a read sequence to be compared and a reference genome through an accurate matching technology, then, in an expansion stage, taking the positions of the seeds as cores, extending and expanding the seeds to two sides base by base, further matching (mismatch, insertion/deletion are allowed) through a non-accurate matching technology, calculating a matching score, and finally, taking the matching with the largest score as an output result. In this process, the seed lookup phase (i.e., the exact match phase) occupies more than half of the runtime. In the precise matching stage, a short segment with a fixed length is firstly cut from a read sequence to serve as a seed, and then a pre-constructed index structure (such as a hash table, a suffix array, an FM index and the like) is utilized to rapidly locate the precise matching position of the seed in a reference genome. The current mainstream exact matching technology is to improve the seed searching performance through Kmer-Index algorithm or FM-Index algorithm. The Kmer-Index algorithm is based on a hash table to perform a seed search, using as keys all substrings (k-mers) of fixed length k in the reference genome, whose list of positions in the genome is a value, to quickly locate the matching position of the seed. However, this method occupies a huge memory, and requires very high hardware resources, and usually requires a dedicated hardware accelerator. The FM-Index algorithm is based on compressed full text Index of Burows-Wheeler transformation (BWT) for seed search, which converts the character string matching problem into interval search of BWT matrix by utilizing BWT characteristics, and has small memory occupation, so that full genome comparison on a common server and even a personal computer is possible. However, due to the characteristic of irregular memory access, each character iteration involves multiple memory accesses in the search process, so that the search speed is limited by memory delay and bandwidth, and the search delay is high. Disclosure of Invention In this regard, the present application aims to provide an exact matching technique that can improve seed search performance and memory use efficiency, so as to quickly find the exact matching position of the seed under the condition of limited hardware resources. The object of the application is achieved by the following scheme: According to a first aspect of embodiments of the present application, a method for precisely matching multi-stage genetic data is provided. The method comprises the following steps of selecting continuous K bases from a read sequence to be processed according to a preset initial length K as search strings, determining a matching interval of the search strings in a reference genome from a hash table which is built according to the length K in advance through a Kmer-Index algorithm, expanding and matching the search strings base by base through an FM-Index algorithm based on a preset FM Index in the matching interval until the number of matching positions with the reference genome is reduced to a preset matching frequency threshold value, and expanding each search string obtained through base by base expansion and matching according to a single base comparison mode until mismatch occurs and taking the search string with the largest matching base number as a finally determined seed sequence. In the embodiment, under a short search string by utilizing a Kmer-Index algorithm through a multi-stage matching strategy, only Kmer indexes containing a small number of keys are needed to be constructed, the memory efficiency is extremely high, meanwhile, less memory resources are occupied, the FM-Index is utilized to greatly reduce the matching times in the corresponding matching interval in advance in a matching interval which is usually smaller than a reference genome and is determined by utilizing a Kmer-Index algorithm, the problem of low cache hit caused by irregular access is avoided, and the Direct-Index algorithm is utilized to rapidly