US-12626778-B2 - Accelerated hidden Markov models for genotype analysis
Abstract
Disclosed is a configuration for determining a genotyping label composition of a target individual using directed acyclic paths. The configuration includes receiving a phased genotype of the target individual, including a first haplotype and a second haplotype. The configuration initiates a full-ethnicity hidden Markov model (HMM) including nodes with a set of ethnicity labels. The first haplotype is input to determine a first subset of ethnicity labels that match the first haplotype. The second haplotype is input to determine a second subset of ethnicity labels that match the second haplotype. The first and second subsets of ethnicity labels are combined to create a candidate subset of ethnicity labels for the target individual. The configuration initiates a simplified HMM with nodes from the candidate subset of ethnicity labels. The phased genotype of the target individual is input to the simplified HMM to determine the genotyping label composition of the target individual.
Inventors
- Keith Daniel Noto
- James Parker Ferry
- Bryan Joseph Johnson
- Alisa Sedghifar
- Yong Wang
- Shiya Song
- Jeffrey Adrion
Assignees
- ANCESTRY.COM DNA, LLC
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-04-13
Claims (20)
- 1. A computer-implemented method comprising: receiving a phased genotype of a target individual, the phased genotype comprising a first haplotype and a second haplotype; initiating a full-ethnicity hidden Markov model (HMM), the full-ethnicity HMM comprising nodes that have a set of ethnicity labels; inputting the first haplotype to the full-ethnicity HMM to determine a first subset of ethnicity labels that match the first haplotype; inputting the second haplotype to the full-ethnicity HMM to determine a second subset of ethnicity labels that match the second haplotype; combining the first and second subsets of ethnicity labels as a candidate subset of ethnicity labels of the target individual; initiating a simplified HMM specific to the target individual, the simplified HMM comprising nodes that are simplified from the set of ethnicity labels to the candidate subset of ethnicity labels of the target individual; inputting the phased genotype of the target individual to the simplified HMM; and determining an ethnicity composition of the target individual using the simplified HMM.
- 2. The computer-implemented method of claim 1, wherein the nodes in the simplified HMM represent permutations of different first parent ethnicity labels, second parent ethnicity labels, and switch labels.
- 3. The computer-implemented method of claim 2, wherein the switch labels represent a phasing error, the phasing error representative of switching the first and second parent ethnicity labels from one node group to a next node group.
- 4. The computer-implemented method of claim 1, wherein the nodes in the full-ethnicity HMM each represent a haplotype ethnicity from the set of ethnicity labels.
- 5. The computer-implemented method of claim 1, wherein the phased genotype comprises cross-chromosome haplotypes, and wherein the first haplotype and the second haplotype both include a sequence that has a span of genetic loci in a plurality of chromosomes.
- 6. The computer-implemented method of claim 1, wherein receiving the phased genotype further comprises: dividing the phased genotype into a plurality of windows, each window comprising a set of single nucleotide polymorphisms (SNPs).
- 7. The computer-implemented method of claim 6, wherein determining the ethnicity composition further comprises: determining a path between the nodes in each window of the simplified HMM based on a likelihood of the phased genotype of the target individual traversing nodes along the path; counting a number of a particular label corresponding to a particular ethnicity label in the path; and determining an ethnicity composition of the target individual with respect to the particular ethnicity label based on the number of the particular label counted in the path.
- 8. The computer-implemented method of claim 7, wherein determining the likelihood further comprises: determining a label probability, a label switch probability, and a transition probability, the transition probability associated with a particular edge in the path and representing a likelihood of the first node connected by the path from one window transitioning to the second node connected by the path from another window; and connecting the nodes with edges, each edge corresponding to a determined transition probability.
- 9. The computer-implemented method of claim 6, wherein determining the ethnicity composition further comprises: displaying the likelihood of the target individual having the particular ethnic origin.
- 10. The computer-implemented method of claim 7, wherein displaying the likelihood of the target individual having the particular ethnic origin further comprises: determining a minimum label threshold value; filtering ethnicity labels below the determined minimum label threshold value; sorting the filtered ethnicity labels in descending order; determining a delta between the sum of the filtered ethnicity labels and a total number of ethnicity labels; and adding a predetermined value to one of the filtered ethnicity labels at a time until the delta is zero.
- 11. The computer-implemented method of claim 1, wherein the phased genotype is generated with a global phasing algorithm.
- 12. The computer-implemented method of claim 1, wherein the full-ethnicity HMM transition probabilities are determined by reference panels, each reference panel representative of a collection of genotypes from individuals with known ethnicities.
- 13. The computer-implemented method of claim 12, wherein one of the reference panels is an admixed panel, the admixed panel including genetic segments inherited from multiple ethnic origins.
- 14. A non-transitory computer readable medium storing computer code comprising instructions that, when executed by one or more processors, cause the one or more processors to perform steps comprising: receiving a phased genotype of a target individual, the phased genotype comprising a first haplotype and a second haplotype; initiating a full-ethnicity hidden Markov model (HMM), the full-ethnicity HMM comprising nodes that have a set of ethnicity labels; inputting the first haplotype to the full-ethnicity HMM to determine a first subset of ethnicity labels that match the first haplotype; inputting the second haplotype to the full-ethnicity HMM to determine a second subset of ethnicity labels that match the second haplotype; combining the first and second subsets of ethnicity labels as a candidate subset of ethnicity labels of the target individual; initiating a simplified HMM specific to the target individual, the simplified HMM comprising nodes that are simplified from the set of ethnicity labels to the candidate subset of ethnicity labels of the target individual; inputting the phased genotype of the target individual to the simplified HMM; and determining an ethnicity composition of the target individual using the simplified HMM.
- 15. The non-transitory computer readable medium of claim 14, wherein the nodes in the simplified HMM represent permutations of different first parent ethnicity labels, second parent ethnicity labels, and switch labels.
- 16. The non-transitory computer readable medium of claim 15, wherein the switch labels represent a phasing error, the phasing error representative of switching the first and second parent ethnicity labels from one node group to a next node group.
- 17. The non-transitory computer readable medium of claim 14, wherein the phased genotype comprises cross-chromosome haplotypes, and wherein the first haplotype and the second haplotype both include a sequence that has a span of genetic loci in a plurality of chromosomes.
- 18. The non-transitory computer readable medium of claim 14, wherein receiving the phased genotype further comprises: dividing the phased genotype into a plurality of windows, each window comprising a set of single nucleotide polymorphisms (SNPs).
- 19. The non-transitory computer readable medium of claim 18, wherein determining the ethnicity composition further comprises: determining a path between the nodes in each window of the simplified HMM based on a likelihood of the phased genotype of the target individual traversing nodes along the path; counting a number of a particular label corresponding to a particular ethnicity label in the path; and determining an ethnicity composition of the target individual with respect to the particular ethnicity label based on the number of the particular label counted in the path.
- 20. A system comprising: one or more processors; and a memory configured to store computer code comprising instructions, the instructions, when executed by one or more processors, cause the one or more processors to perform steps comprising: receiving a phased genotype of a target individual, the phased genotype comprising a first haplotype and a second haplotype; initiating a full-ethnicity hidden Markov model (HMM), the full-ethnicity HMM comprising nodes that have a set of ethnicity labels; inputting the first haplotype to the full-ethnicity HMM to determine a first subset of ethnicity labels that match the first haplotype; inputting the second haplotype to the full-ethnicity HMM to determine a second subset of ethnicity labels that match the second haplotype; combining the first and second subsets of ethnicity labels as a candidate subset of ethnicity labels of the target individual; initiating a simplified HMM specific to the target individual, the simplified HMM comprising nodes that are simplified from the set of ethnicity labels to the candidate subset of ethnicity labels of the target individual; inputting the phased genotype of the target individual to the simplified HMM; and determining an ethnicity composition of the target individual using the simplified HMM.
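By way of non-limiting illustration, the label counting recited in claim 7 and the display adjustment recited in claim 10 may be sketched as follows. The claims do not specify units, rounding, or which label receives each increment, so the sketch assumes whole-number percentages, rounding down before filtering, and increments handed out largest label first. The function names, the 2 percent threshold, and the increment of 1 are hypothetical.

```python
from collections import Counter


def count_labels(path):
    """Count how often each ethnicity label appears along a decoded path.

    path: one (first_parent_label, second_parent_label) pair per window.
    """
    counts = Counter()
    for first_label, second_label in path:
        counts[first_label] += 1
        counts[second_label] += 1
    return counts


def display_percentages(counts, min_percent=2, step=1, total=100):
    """Turn label counts into whole percentages that sum to `total`.

    Assumed procedure: floor each share, drop labels below the minimum
    threshold, sort in descending order, then add `step` to one kept
    label at a time until the remaining delta is zero.
    """
    n = sum(counts.values())
    floored = {label: (c * total) // n for label, c in counts.items()}
    kept = {label: p for label, p in floored.items() if p >= min_percent}
    ordered = sorted(kept.items(), key=lambda kv: (-kv[1], kv[0]))
    result = dict(ordered)
    if not result:
        return result
    names = [label for label, _ in ordered]
    delta = total - sum(result.values())
    i = 0
    while delta > 0:
        add = min(step, delta)
        result[names[i % len(names)]] += add
        delta -= add
        i += 1
    return result


# Hypothetical decoded path over 100 windows.
example_path = [("A", "C")] * 60 + [("B", "C")] * 37 + [("D", "C")] * 3
example_counts = count_labels(example_path)
example_display = display_percentages(example_counts)
```

In this toy input, label D falls below the assumed threshold and is dropped, and the two percentage points left over are assigned one at a time to the two largest remaining labels.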
Description
CROSS-REFERENCE TO RELATED APPLICATION
The present application claims the benefit of U.S. Provisional Patent Application No. 63/330,538 filed on Apr. 13, 2022, which is hereby incorporated by reference in its entirety.
FIELD
The disclosed embodiments relate to systems, methods, and/or computer-program products configured for determining parental ethnicity.
BACKGROUND
Although humans are, genetically speaking, almost entirely identical, small differences in human DNA are responsible for much of the variation between individuals. For example, a sequence variation at one position in DNA between individuals is known as a single-nucleotide polymorphism (SNP). Stretches of DNA inherited together from a single parent are referred to as haplotypes (e.g., one haplotype inherited from the mother and another haplotype inherited from the father). A subset of the SNPs in an individual's genome may be detected with SNP genotyping. Through SNP genotyping, the pair of alleles for a SNP at a given location in each haplotype may be identified. For example, a SNP may be identified as heterozygous (i.e., one allele of each type), homozygous (i.e., both alleles of the same type), or unknown. SNP genotyping identifies the pair of alleles for a given genotype, but does not identify which allele corresponds to which haplotype, i.e., SNP genotyping does not identify the homomorphic chromosome (of the homomorphic pair) to which each allele corresponds. This is partially because current physical sequencing techniques typically generate two signals at a heterozygous position. Thus, successful SNP genotyping produces an unordered pair of alleles, where each allele corresponds to one of two haplotypes. In general, most of the SNPs of a haplotype that correspond to a particular chromosome are sourced from a single chromosome from a parent. However, some of the SNPs from the haplotype may correspond to the parent's other homomorphic chromosome due to chromosomal crossover.
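For illustration only, the ambiguity described above can be made concrete by enumerating every haplotype pair that is consistent with a set of unordered SNP genotypes. The function name and the toy alleles below are hypothetical; the sketch only shows that each additional heterozygous position doubles the number of possible phasings, and it is not a phasing algorithm.

```python
from itertools import product


def consistent_phasings(genotypes):
    """Enumerate haplotype pairs consistent with unordered genotypes.

    genotypes: list of (allele, allele) pairs, one per SNP, where the
    order inside each pair carries no information.
    Returns a set of (haplotype, haplotype) tuples, sorted within each
    tuple so that swapping the two haplotypes is not counted twice.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    phasings = set()
    # Try both orientations at every heterozygous site.
    for flips in product([False, True], repeat=len(het_sites)):
        flip_at = dict(zip(het_sites, flips))
        hap1, hap2 = [], []
        for i, (a, b) in enumerate(genotypes):
            if flip_at.get(i, False):
                a, b = b, a
            hap1.append(a)
            hap2.append(b)
        phasings.add(tuple(sorted(("".join(hap1), "".join(hap2)))))
    return phasings


# Three heterozygous sites and one homozygous site (toy data).
toy_genotypes = [("A", "G"), ("C", "C"), ("T", "C"), ("G", "A")]
toy_phasings = consistent_phasings(toy_genotypes)
```

With three heterozygous sites there are 2^3 orientations, which collapse to four distinct unordered haplotype pairs; genotyping alone cannot say which of the four is the true one.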
Because the genetic information in a particular chromosome of an individual mostly corresponds to a single chromosome of a parent, sequences of SNPs tend to stay relatively intact across generations. Efforts to predict an individual's ethnicity have been limited by an inability to determine which allele corresponds to which haplotype (e.g., the homomorphic chromosome to which each allele corresponds). Genotyping is performed by physical sequencing that is unable to distinguish the order of a pair of alleles at a given SNP position. As such, there is currently no way to determine with confidence one or both parents' ethnicity, community, traits, etc. from an individual's genome without the parents' DNA.
SUMMARY
Disclosed herein are example embodiments that determine an ethnicity composition of a target individual. In one embodiment, a computer-implemented method receives a phased genotype of a target individual. The phased genotype includes a first haplotype and a second haplotype. The computer-implemented method may initiate a full-ethnicity hidden Markov model (HMM) including nodes that have a set of ethnicity labels. Each ethnicity label may represent a different ethnic origin. The computer-implemented method may input the first haplotype to the full-ethnicity HMM to determine a first subset of ethnicity labels that match the first haplotype. The computer-implemented method may input the second haplotype to the full-ethnicity HMM to determine a second subset of ethnicity labels that match the second haplotype. The computer-implemented method may combine the first and second subsets of ethnicity labels to create a candidate subset of ethnicity labels of the target individual. The computer-implemented method may initiate a simplified HMM specific to the target individual. The simplified HMM may include nodes that are simplified from the set of ethnicity labels to the candidate subset of ethnicity labels of the target individual.
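As a minimal sketch of the label-reduction steps described above, the following example decodes each haplotype separately over the full label set, takes the union of the decoded labels as the candidate subset, and builds a reduced state space from that subset. The disclosure does not state here how matching labels are selected, so Viterbi decoding with a uniform switch probability is an assumption, as are all function names, the toy log-likelihoods, and the representation of each simplified node as a (first label, second label, switch) tuple.

```python
import math
from itertools import product


def viterbi(emissions, stay_prob=0.9):
    """Most likely label per window for a single haplotype.

    emissions: one dict per window mapping label -> log-likelihood of
    that window under the label (hypothetical inputs, two or more labels).
    """
    labels = sorted(emissions[0])
    log_stay = math.log(stay_prob)
    log_switch = math.log((1.0 - stay_prob) / (len(labels) - 1))
    score = {label: emissions[0][label] for label in labels}
    back = []
    for window in emissions[1:]:
        new_score, pointers = {}, {}
        for label in labels:
            best_prev = max(
                labels,
                key=lambda p: score[p] + (log_stay if p == label else log_switch),
            )
            step = log_stay if best_prev == label else log_switch
            new_score[label] = score[best_prev] + step + window[label]
            pointers[label] = best_prev
        score = new_score
        back.append(pointers)
    path = [max(labels, key=lambda l: score[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def candidate_labels(hap1_emissions, hap2_emissions, stay_prob=0.9):
    """Union of the labels decoded for each haplotype."""
    first = set(viterbi(hap1_emissions, stay_prob))
    second = set(viterbi(hap2_emissions, stay_prob))
    return first | second


def simplified_states(candidates):
    """Nodes of the reduced model: (first label, second label, switch)."""
    ordered = sorted(candidates)
    return list(product(ordered, ordered, (False, True)))


def toy_emissions(truth, labels, hit=0.7, miss=0.1):
    """Fabricated log-likelihoods that favor the stated label per window."""
    return [{l: math.log(hit if l == t else miss) for l in labels} for t in truth]


ALL_LABELS = ["A", "B", "C", "D"]
hap1 = toy_emissions(["A", "A", "A", "B", "B", "B"], ALL_LABELS)
hap2 = toy_emissions(["C"] * 6, ALL_LABELS)
candidates = candidate_labels(hap1, hap2)
states = simplified_states(candidates)
full_states = simplified_states(ALL_LABELS)
```

In this toy case the candidate subset has three labels, so the paired state space shrinks from 32 nodes per window to 18. Because the paired state space grows with the square of the label count, the reduction becomes more pronounced as the full label set gets larger.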
The computer-implemented method may input the phased genotype of the target individual to the simplified HMM. The computer-implemented method may determine an ethnicity composition of the target individual using the simplified HMM. In another embodiment, a non-transitory computer-readable medium that is configured to store instructions is described. The instructions, when executed by one or more processors, cause the one or more processors to perform a process that includes steps described in the above computer-implemented methods or described in any embodiments of this disclosure. In yet another embodiment, a system may include one or more processors and a storage medium that is configured to store instructions. The instructions, when executed by one or more processors, cause the one or more processors to perform a process that includes steps described in the above computer-implemented methods or described in any embodiments of this disclosure.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 illustrates a diagram of a system environm