US-12620455-B2 - Merging duplicate marking to optimize computer operations for gene sequencing pipeline

US12620455B2US 12620455 B2US12620455 B2US 12620455B2US-12620455-B2

Abstract

In accordance with embodiments, a processing unit performs alignment of a short read (SR) against a reference genome sequence. The processing unit determines whether the SR is aligned. If the SR is not aligned, the processing unit receives the next SR and processes the next SR by repeating. If the SR is aligned, in response to the determination that the SR is aligned with the reference genome sequence at a first position in the reference genome sequence, the processing unit generates a new SR metadata entry corresponding to the SR. The processing unit finds a linked list in a SR metadata collection. The first position of the linked list in the SR metadata collection corresponds to the first position of the reference genome sequence where the SR is aligned. The processing unit performs duplicate marking based on the SR and the linked list.

Inventors

Meysam ROODI
Zahra LAK

Assignees

HUAWEI TECHNOLOGIES CO., LTD.

Dates

Publication Date: 20260505
Application Date: 20201208

Claims (12)

1 . A method comprising: performing, by at least one processing unit, alignment of a short read (SR) against a reference genome sequence, wherein the reference genome sequence comprises a first sequence of base pairs (bps), the reference genome sequence is at least a part of a full genome sequence, and the SR comprises a second sequence of bps; determining, by the at least one processing unit, that the SR is aligned with the reference genome sequence at a first position in the reference genome sequence; generating, by the at least one processing unit, a new SR metadata entry corresponding to the SR; identifying, by the at least one processing unit, a linked list in a SR metadata collection, a first position of the linked list in the SR metadata collection corresponding to the first position of the reference genome sequence where the SR is aligned, wherein the SR is in a pair end comprising the SR and a mate SR of the SR, the new SR metadata entry comprises an average quality score (AQS) of quality scores of the bps in the SR, and the new SR metadata entry further comprises a distance indication value indicating a distance between the first position of the linked list in the SR metadata collection and a second position of a second linked list in the SR metadata collection, the second linked list comprising a second SR metadata entry corresponding to the mate SR; performing, by the at least one processing unit, duplicate marking based on the SR and the linked list, wherein the performing the duplicate marking comprises: identifying, by the at least one processing unit, a group of SR metadata entries in the linked list having a same distance indication value as the new SR metadata entry, identifying, by the at least one processing unit, a head SR metadata entry in the group, wherein an AQS of the head SR metadata entry is the highest in the group, and marking, by the at least one processing unit, one of the new SR metadata entry or the head SR metadata entry as duplicate based on a comparison between the AQS of the head SR metadata entry and the AQS of the new SR metadata entry; inserting, by the at least one processing unit, the new SR metadata entry to the linked list such that SR metadata entries in the linked list are grouped based on distance indication values of the SR metadata entries in the linked list, the head SR metadata entry being positioned at a beginning of the group in the linked list, the inserting comprising: inserting the new SR metadata entry at the beginning of the group as a new head SR metadata entry of the group if the AQS of the new SR metadata entry is higher than the AQS of the head SR metadata entry, or inserting the new SR metadata entry after the head SR metadata entry if the AQS of the new SR metadata entry is less than the AQS of the head SR metadata entry; and generating, by the at least one processing unit, a binary alignment map (BAM) file based on the SR metadata collection without producing an intermediate sequence alignment map (SAM) file and without producing an intermediate unmarked BAM file.
2 . The method of claim 1 , wherein the marking comprises: if the AQS of the new SR metadata entry is higher than the AQS of the head SR metadata entry, marking, by the at least one processing unit, the head SR metadata entry as duplicate; and if the AQS of the new SR metadata entry is lower than the AQS of the head SR metadata entry, marking, by the at least one processing unit, the new SR metadata entry as duplicate.
3 . The method of claim 1 , wherein the SR metadata collection is an array of linked lists, and a size of the array is based on a size of the first sequence of bps in the reference genome sequence.
4 . The method of claim 1 , wherein the linked list includes a plurality of groups of SR metadata entries grouped based on distance indication values such that corresponding SR metadata entries in the linked list that have a same corresponding distance indication value are grouped together into a same corresponding group of the plurality of groups of SR metadata entries.
5 . The method of claim 4 , wherein only a corresponding current head SR metadata entry in each group of the plurality of groups of SR metadata entries is marked as non-duplicate.
6 . The method of claim 1 , further comprising: performing, by the at least one processing unit, alignment of a second SR against the reference genome sequence concurrently with the performing the duplicate marking based on the SR.
7 . A computing device comprising: at least one processor unit; a non-transitory computer readable storage medium storing programming for execution by the at least one processor unit, the programming including instructions to cause the computing device to perform operations including: performing alignment of a short read (SR) against a reference genome sequence, wherein the reference genome sequence comprises a first sequence of base pairs (bps), the reference genome sequence is at least a part of a full genome sequence, and the SR comprises a second sequence of bps; determining that the SR is aligned with the reference genome sequence at a first position in the reference genome sequence; generating a new SR metadata entry corresponding to the SR; identifying a linked list in a SR metadata collection, a first position of the linked list in the SR metadata collection corresponding to the first position of the reference genome sequence where the SR is aligned, wherein the SR is in a pair end comprising the SR and a mate SR of the SR, the new SR metadata entry comprises an average quality score (AQS) of quality scores of the bps in the SR, and the new SR metadata entry further comprises a distance indication value indicating a distance between the first position of the linked list in the SR metadata collection and a second position of a second linked list in the SR metadata collection, the second linked list comprising a second SR metadata entry corresponding to the mate SR; performing duplicate marking based on the SR and the linked list, wherein the performing the duplicate marking comprises: identifying a group of SR metadata entries in the linked list having a same distance indication value as the new SR metadata entry, identifying a head SR metadata entry in the group, wherein an AQS of the head SR metadata entry is the highest in the group, and marking one of the new SR metadata entry or the head SR metadata entry as duplicate based on a comparison between the AQS of the head SR metadata entry and the AQS of the new SR metadata entry; inserting the new SR metadata entry to the linked list such that SR metadata entries in the linked list are grouped based on distance indication values of the SR metadata entries in the linked list, the head SR metadata entry being positioned at a beginning of the group in the linked list, the inserting comprising: inserting the new SR metadata entry at the beginning of the group as a new head SR metadata entry of the group if the AQS of the new SR metadata entry is higher than the AQS of the head SR metadata entry, or inserting the new SR metadata entry after the head SR metadata entry if the AQS of the new SR metadata entry is less than the AQS of the head SR metadata entry; and generating a binary alignment map (BAM) file based on the SR metadata collection without producing an intermediate sequence alignment map (SAM) file and without producing an intermediate unmarked BAM file.
8 . The computing device of claim 7 , wherein the marking comprises: if the AQS of the new SR metadata entry is higher than the AQS of the head SR metadata entry, marking the head SR metadata entry as duplicate; and if the AQS of the new SR metadata entry is lower than the AQS of the head SR metadata entry, marking the new SR metadata entry as duplicate.
9 . The computing device of claim 7 , wherein the SR metadata collection is an array of linked lists, and a size of the array is based on a size of the first sequence of bps in the reference genome sequence.
10 . A non-transitory computer-readable medium having instructions stored thereon that, when executed by at least one processor unit, cause the at least one processor unit to perform operations, the operations comprising: performing alignment of a short read (SR) against a reference genome sequence, wherein the reference genome sequence comprises a first sequence of base pairs (bps), the reference genome sequence is at least a part of a full genome sequence, and the SR comprises a second sequence of bps; determining that the SR is aligned with the reference genome sequence at a first position in the reference genome sequence; generating a new SR metadata entry corresponding to the SR; identifying a linked list in a SR metadata collection, a first position of the linked list in the SR metadata collection corresponding to the first position of the reference genome sequence where the SR is aligned, wherein the SR is in a pair end comprising the SR and a mate SR of the SR, the new SR metadata entry comprises an average quality score (AQS) of quality scores of the bps in the SR, and the new SR metadata entry further comprises a distance indication value indicating a distance between the first position of the linked list in the SR metadata collection and a second position of a second linked list in the SR metadata collection, the second linked list comprising a second SR metadata entry corresponding to the mate SR; performing duplicate marking based on the SR and the linked list, wherein the performing the duplicate marking comprises: identifying a group of SR metadata entries in the linked list having a same distance indication value as the new SR metadata entry, identifying a head SR metadata entry in the group, wherein an AQS of the head SR metadata entry is the highest in the group, and marking one of the new SR metadata entry or the head SR metadata entry as duplicate based on a comparison between the AQS of the head SR metadata entry and the AQS of the new SR metadata entry; inserting the new SR metadata entry to the linked list such that SR metadata entries in the linked list are grouped based on distance indication values of the SR metadata entries in the linked list, the head SR metadata entry being positioned at a beginning of the group in the linked list, the inserting comprising: inserting the new SR metadata entry at the beginning of the group as a new head SR metadata entry of the group if the AQS of the new SR metadata entry is higher than the AQS of the head SR metadata entry, or inserting the new SR metadata entry after the head SR metadata entry if the AQS of the new SR metadata entry is less than the AQS of the head SR metadata entry; and generating a binary alignment map (BAM) file based on the SR metadata collection without producing an intermediate sequence alignment map (SAM) file and without producing an intermediate unmarked BAM file.
11 . The non-transitory computer-readable medium of claim 10 , wherein the marking comprises: if the AQS of the new SR metadata entry is higher than the AQS of the head SR metadata entry, marking the head SR metadata entry as duplicate; and if the AQS of the new SR metadata entry is lower than the AQS of the head SR metadata entry, marking the new SR metadata entry as duplicate.
12 . The non-transitory computer-readable medium of claim 10 , wherein the SR metadata collection is an array of linked lists, and a size of the array is based on a size of the first sequence of bps in the reference genome sequence.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS This patent application is a continuation of PCT Patent Application No. PCT/CN2020/078903 filed Mar. 12, 2020 and entitled “Merging Duplicate Marking to Optimize Computer Operations for Gene Sequencing Pipeline,” which claims the benefit of U.S. Provisional Patent Application No. 62/818,519 filed Mar. 14, 2019 and entitled “Merging Duplicate Marking to Optimize Computer Operations for Gene Sequencing Pipeline,” the applications of which are hereby incorporated by reference as if reproduced in their entirety. TECHNICAL FIELD The present disclosure relates to efficient string matching, and, in particular embodiments, to systems and methods for improving computer operations of a gene sequencing system. BACKGROUND Matching a large quantity of characters can be technical challenging. For example, gene sequencing involves matching a huge amount of characters in multiple sub-systems of a gene sequencing system. A conventional gene sequencing system could take days on a single “bare metal server” (e.g., a single tenant physical server) to process a single input gene and consume at least hundreds of Gigabytes (GB) of storage space. Thus, techniques for improving computer operations of the gene sequencing system, particularly in terms of performance and storage resource utilization, are desired. SUMMARY In accordance with embodiments, a processing unit receives a count table and an occurrence table for a reference genome sequence generated using a Burrows Wheeler Transform (BWT) algorithm. The reference genome sequence comprises a sequence of base pairs (bps). The processing unit stores a first part of the occurrence table in a first type of memory. The size of the first part of the occurrence table is determined based on a size of the first type of memory and a first number of bps of short reads (SRs) to be processed using the first type of memory. The processing unit receives a short read (SR) of a sample sequence. The SR comprises the first number of bps and a second number of bps. The processing unit performs alignment of the short read (SR) against the reference genome sequence using the count table and the occurrence table. To perform the alignment, the processing unit may perform alignment of the first number of bps in the SR against the sequence of bps of the reference genome sequence using the first part of the occurrence table stored in the first type of memory. In some embodiments, the processing unit may store a second part of the occurrence table in a second type of memory. The memory access time to the accessing the first part of the occurrence table in the first type of memory may be faster than the memory access time to the accessing the second part of the occurrence table stored in the second type of memory. The performing the alignment of the SR may further comprise performing alignment of the second number of bps in the SR against the sequence of base pairs of the reference genome sequence using the second part of the occurrence table stored in the second type of memory. In some embodiments, the size of the first part of the occurrence table may be determined based on s=(∑ i=0n-1⁢4i+1)×4. s is a minimum number of bytes in the first type of memory. n is the first number of bps of short reads (SRs) to be processed using the first type of memory. In some embodiments, the size of the first part of the occurrence table may be determined based on s=(∑ i=0n-1⁢4i+1)×4×2. s is a minimum number of bytes in the first type of memory. n is the first number of bps of short reads (SRs) to be processed using the first type of memory. In some embodiments, the occurrence table may comprise a plurality of rows. The number of the plurality of rows may equal the number of the sequence of bps in the reference genome sequence. A row of the occurrence table may comprise 4 entries of occurrence numbers corresponding to characters A, C, G, and T, respectively. In some embodiments, the processing unit may initialize a top pointer to 0 and initialize a bottom pointer to a size of the reference genome sequence. Each of the top pointer and bottom pointer points to an entry in the occurrence table. The processing unit may determine the first part of the occurrence table based on topnew=count (char)+occurrence (topold, char) and bottomnew=count (char)+occurrence (bottomold, char). Here, topnew is a new value of the top pointer, topold is an old value of the top pointer, bottomnew is a new value of the bottom pointer, bottomold is an old value of the bottom pointer, count (char) is a total count of characters that are smaller than character char, and occurrence (x, y) is an entry in the occurrence table indexed by row x and column y. In some embodiments, the first part of the occurrence table stored in the first type of memory may comprise n levels of entries. Each entry of an i-th level of entries may have a 14i memory access probability during the performing the alignment (1≤i≤n). The n le