US-12626781-B2 - DNA alignment using a hierarchical inverted index table
Abstract
System and method for constructing a hierarchical index table usable for matching a search sequence to reference data. The index table may be constructed to contain entries associated with an exhaustive list of all subsequences of a given length, wherein each entry contains the number and locations of matches of each subsequence in the reference data. The hierarchical index table may be constructed in an iterative manner, wherein entries for each lengthened subsequence are selectively and iteratively constructed based on the number of matches being greater than each of a set of respective thresholds. The hierarchical index table may be used to search for matches between a search sequence and reference data, and to perform misfit identification and characterization upon each respective candidate match.
Inventors
- Michael B. Doerr
- Jan D. Garmany
- Stephen V. Wood
- Daemon G. Anastas
- Martin A. Hunt
Assignees
- HyperX Holdings LLC
Dates
- Publication Date
- 20260512
- Application Date
- 20240801
Claims (20)
- 1 . A multiprocessor array (MPA) comprising interspersed processors and communication elements coupled to at least one non-transitory computer-readable memory medium, wherein the at least one non-transitory computer-readable memory medium comprises program instructions executable by the MPA to: a) store a hierarchical index table in at least one of the non-transitory computer-readable memory media, wherein the hierarchical index table is based on reference data, wherein the hierarchical index table comprises a plurality of entries at each of a plurality of levels, wherein the plurality of levels comprises a first level, wherein the hierarchical index table comprises a respective first level entry for each of a plurality of subsequences of the reference data having a first length, wherein for n={1, . . . , i}, the hierarchical index table further comprises n+1 level entries for respective level n entries in the hierarchical index table when matching criteria of the respective level n entry to the reference data exceeds a threshold, wherein the hierarchical index table comprises an n+1 level entry for each of a plurality of subsequences of the reference data having a length corresponding to the n+1 level; b) receiving input specifying a plurality of search sequences; and c) searching, in parallel by respective processors of the MPA, the hierarchical index table for one or more matches of a respective search sequence of the plurality of search sequences to the reference data, wherein said searching the hierarchical index table comprises iteratively searching for matches of subsections of the respective search sequence to subsequences in subsequent levels of the hierarchical index table.
- 2 . The MPA of claim 1 , wherein said plurality of subsequences of the reference data having the first length and said plurality of subsequences of the reference data having the length corresponding to the n+1 level each comprise exhaustive sets of subsequences of their respective lengths.
- 3 . The MPA of claim 2 , wherein the hierarchical index table further comprises, for each entry in each level, information specifying a number and a location of matches of the subsequence of the respective entry in the reference data.
- 4 . The MPA of claim 3 , wherein said searching for matches of the subsequences of the n+1 level is performed at locations associated with a corresponding entry in the n level.
- 5 . The MPA of claim 3 , wherein the data indicating the number of matches associated with each entry is stored in a first data structure, and the data indicating the location of the matches associated with each entry is stored in a second data structure, wherein the first and second data structures are each comprised within the hierarchical index table.
- 6 . The MPA of claim 1 , wherein the program instructions are further executable to: for each respective n level entry, store a pointer in at least one of the non-transitory computer-readable memory media that references n+1 level entries that correspond to the respective n level entry.
- 7 . The MPA of claim 1 , wherein the reference data comprises a reference genome and searching the reference data comprises aligning a short read (SR) to the reference genome.
- 8 . A non-transitory computer-readable memory medium comprising program instructions, wherein the program instructions are executable by at least one processor to: a) store a hierarchical index table in the non-transitory computer-readable memory media, wherein the hierarchical index table is based on reference data, wherein the hierarchical index table comprises a plurality of entries at each of a plurality of levels, wherein the plurality of levels comprises a first level, wherein the hierarchical index table comprises a respective first level entry for each of a plurality of subsequences of the reference data having a first length, wherein for n={1, . . . , i}, the hierarchical index table further comprises n+1 level entries for respective level n entries in the hierarchical index table when matching criteria of the respective level n entry to the reference data exceeds a threshold, wherein the hierarchical index table comprises an n+1 level entry for each of a plurality of subsequences of the reference data having a length corresponding to the n+1 level; b) receive input specifying a search sequence; and c) search, by the at least one processor, the hierarchical index table for one or more matches of the search sequence to the reference data, wherein said searching the hierarchical index table comprises iteratively searching for matches of subsections of the respective search sequence to subsequences in subsequent levels of the hierarchical index table.
- 9 . The non-transitory computer-readable memory medium of claim 8 , wherein said plurality of subsequences of the reference data having the first length and said plurality of subsequences of the reference data having the length corresponding to the n+1 level each comprise exhaustive sets of subsequences of their respective lengths.
- 10 . The non-transitory computer-readable memory medium of claim 9 , wherein the hierarchical index table further comprises, for each entry in each level, information specifying a number and a location of matches of the subsequence of the respective entry in the reference data.
- 11 . The non-transitory computer-readable memory medium of claim 10 , wherein said searching for matches of the subsequences of the n+1 level is performed at locations associated with a corresponding entry in the n level.
- 12 . The non-transitory computer-readable memory medium of claim 10 , wherein the data indicating the number of matches associated with each entry is stored in a first data structure, and the data indicating the location of the matches associated with each entry is stored in a second data structure, wherein the first and second data structures are each comprised within the hierarchical index table.
- 13 . The non-transitory computer-readable memory medium of claim 8 , wherein the program instructions are further executable to: for each respective n level entry, store a pointer in the non-transitory computer-readable memory medium that references n+1 level entries that correspond to the respective n level entry.
- 14 . The non-transitory computer-readable memory medium of claim 8 , wherein the reference data comprises a reference genome and searching the reference data comprises aligning a short read (SR) to the reference genome.
- 15 . The non-transitory computer-readable memory medium of claim 8 , wherein the program instructions are further executable to store one or more matches of the search sequence to the reference data in the non-transitory computer-readable memory medium.
- 16 . A method, comprising: a) storing a hierarchical index table in a non-transitory computer-readable memory media, wherein the hierarchical index table is based on reference data, wherein the hierarchical index table comprises a plurality of entries at each of a plurality of levels, wherein the plurality of levels comprises a first level, wherein the hierarchical index table comprises a respective first level entry for each of a plurality of subsequences of the reference data having a first length, wherein for n={1, . . . , i}, the hierarchical index table further comprises n+1 level entries for respective level n entries in the hierarchical index table when matching criteria of the respective level n entry to the reference data exceeds a threshold, wherein the hierarchical index table comprises an n+1 level entry for each of a plurality of subsequences of the reference data having a length corresponding to the n+1 level; b) receiving input specifying a search sequence; and c) searching the hierarchical index table for one or more matches of the search sequence to the reference data, wherein said searching the hierarchical index table comprises iteratively searching for matches of subsections of the respective search sequence to subsequences in subsequent levels of the hierarchical index table.
- 17 . The method of claim 16 , wherein said plurality of subsequences of the reference data having the first length and said plurality of subsequences of the reference data having the length corresponding to the n+1 level each comprise exhaustive sets of subsequences of their respective lengths, wherein the hierarchical index table further comprises, for each entry in each level, information specifying a number and a location of matches of the subsequence of the respective entry in the reference data.
- 18 . The method of claim 16 , further comprising: for each respective n level entry, storing a pointer in memory that references n+1 level entries that correspond to the respective n level entry.
- 19 . The method of claim 16 , wherein the reference data comprises a reference genome and searching the reference data comprises aligning a short read (SR) to the reference genome.
- 20 . The method of claim 16 , further comprising: storing one or more matches of the search sequence to the reference data in the non-transitory computer-readable memory medium.
Description
PRIORITY CLAIM This application is a continuation of U.S. patent application Ser. No. 18/114,065, titled “DNA Alignment using a Hierarchical Inverted Index Table”, filed Feb. 24, 2023, which is a continuation of U.S. patent application Ser. No. 15/331,239, titled “DNA Alignment using a Hierarchical Inverted Index Table”, filed Oct. 21, 2016, now issued as U.S. Pat. No. 11,594,301, which claims benefit of priority to U.S. Application No. 62/244,541 titled “DNA Alignment”, filed on Oct. 21, 2015, which are hereby incorporated by reference in their entirety as though fully and completely set forth herein. This application includes herewith a sequence listing titled “Sequence listing 5860-04513”, created on Mar. 6, 2024, and is 57,344 bytes, all of which is hereby incorporated by reference in its entirety. No new matter has been added. The claims in the instant application are different than those of the parent application or other related applications. The Applicant therefore rescinds any disclaimer of claim scope made in the parent application or any predecessor application in relation to the instant application. The Examiner is therefore advised that any such previous disclaimer and the cited references that it was made to avoid, may need to be revisited. Further, any disclaimer made in the instant application should not be read into or against the parent application or other related applications. FIELD OF THE INVENTION The application generally relates to mapping data patterns onto a reference data set, and more particularly to performing such data alignment or pattern matching in DNA sequencing and DNA alignment applications. DESCRIPTION OF THE RELATED ART Modern technology involves the collection and processing of increasingly large amounts of data. Applications and use cases of so called “Big Data” range from data mining, advertising, machine learning, and DNA sequencing, among many others. In many cases it is necessary to search for matches between a small sample of data and a much larger reference data set. As the size of the reference data set increases, the alignment of sample data to this reference data set (pattern matching) becomes an exponentially more computationally intensive task. An exemplary case of data alignment occurs in the field of DNA alignment. Living organisms are composed of cells and the operation and reproduction of cells is controlled by genetic information that is passed from one generation of cells to the next. Detailed knowledge of the genetic information of species and individual organisms holds great promise for more accurate life sciences supporting improved health care, agriculture, environmental management, and crime solving. One of the obstacles to realizing these benefits is the cost to sequence the genetic information of an organism. The technology to do this has improved dramatically over the last couple of decades so that the goal of reducing the cost to less than US $1000 per individual human appears to be achievable. However, there remain problems of completeness, accuracy, interpretation of the data, and reliable diagnosis of disease. The number of days to acquire genetic information from a bio-sample is also an obstacle to uses that require fast response, such as the suitability of a pharmaceutical that is known to have severe side effects for susceptible individuals for use by an emergency room patient. Accordingly, improved techniques and tools for data alignment, and DNA sequencing in particular, are desired. SUMMARY OF THE INVENTION Various embodiments are disclosed of a system and method for mapping a data pattern onto a significantly larger data set. In some embodiments, the larger data set may be a reference data set. In some embodiments, the larger data set may be the result of de novo sequencing, wherein a plurality of data patterns are used to construct a large data set that is self-consistent with the plurality of data patterns. Many of the embodiments presented herein relate to the specific use case of DNA alignment wherein the reference data set is a reference genome and the data pattern is a string of DNA bases derived from a short read (SR) of a DNA strand. However, the methods detailed herein are generally applicable to the problem of mapping any data pattern onto a larger data set. The description of the methods described herein in reference to DNA alignment is intended to facilitate exposition, and is not meant to be limiting in any way to the scope of the invention. Someone of skill in the art would easily see how the methods described herein may be applied to data alignment or pattern matching processes other than DNA alignment. In one embodiment, a hierarchical index table based on reference data may be generated. The hierarchical index table may comprise the locations in the reference data where each of a plurality of data segments are located. In computer science, an index table of this form may be called an inverted index table. The hierar