EP-4736169-A1 - VARIANT CALLING WITH METHYLATION-LEVEL ESTIMATION
Abstract
This disclosure describes methods, non-transitory computer-readable media, and systems that can simultaneously determine estimated methylation-level values for cytosine bases and genotype calls for a target genomic sample. The disclosed system can utilize a Bayesian method on a target genomic sample's nucleotide-read data to generate estimated methylation-level values that indicate genomic coordinates at which the target genomic sample comprises a reference cytosine base or a nucleobase that could be called as a cytosine. The disclosed system can estimate methylation-level values based on prior genotype probabilities and observed nucleobases at a genomic coordinate from a read pileup of a target genomic sample. Based on the estimated methylation-level values and base-call-quality metrics, the disclosed system may generate posterior genotype probabilities for the genomic sample at the genomic coordinate. Based on the posterior genotype probabilities, the disclosed system can generate a genotype call for the target genomic sample.
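The last two steps of the abstract (posterior genotype probabilities, then a genotype call) can be illustrated with a minimal Python sketch. It is not the disclosed variant call model: claim 9 describes a modified Hidden Markov model, whereas this sketch swaps in a simple per-read Bayesian pileup model that treats reads as independent. The function names, the `(base, quality, strand)` pileup tuples, the Phred-style error conversion, and the example priors are all assumptions made for illustration. Genotypes are written in reference (plus-strand) coordinates, and the conversion probabilities follow the relationships stated in claims 7 and 8.

```python
import math

def read_likelihood(base, qual, genotype, m, strand):
    """P(observed base | genotype), averaged over the two alleles.

    `qual` is assumed to be a Phred-scaled base-call quality and `m` an
    estimated methylation level in [0, 1] (both hypothetical inputs).
    """
    err = 10 ** (-qual / 10)  # base-call error probability
    total = 0.0
    for allele in genotype:
        # Expected base distribution before sequencing error.
        if strand == "+" and allele == "C":
            expected = {"C": 1.0 - m, "T": m}
        elif strand == "-" and allele == "G":
            expected = {"G": 1.0 - m, "A": m}
        else:
            expected = {allele: 1.0}
        p_true = expected.get(base, 0.0)
        # Either the base was read correctly, or an error produced it.
        total += 0.5 * (p_true * (1.0 - err) + (1.0 - p_true) * err / 3.0)
    return total

def genotype_posteriors(pileup, priors, m):
    """Bayes' rule over candidate genotypes, computed in log space."""
    log_post = {}
    for genotype, prior in priors.items():
        log_like = sum(
            math.log(read_likelihood(base, qual, genotype, m, strand))
            for base, qual, strand in pileup
        )
        log_post[genotype] = math.log(prior) + log_like
    peak = max(log_post.values())
    unnorm = {g: math.exp(v - peak) for g, v in log_post.items()}
    norm = sum(unnorm.values())
    return {g: v / norm for g, v in unnorm.items()}

def call_genotype(posteriors):
    """Pick the genotype with the highest posterior probability."""
    return max(posteriors, key=posteriors.get)

# Illustrative pileup: converted and unconverted plus-strand reads,
# plus minus-strand reads that show the unconverted reference base.
pileup = [("T", 30, "+")] * 7 + [("C", 30, "+")] * 3 + [("C", 30, "-")] * 10
priors = {"CC": 0.998, "CT": 0.001, "TT": 0.001}
posteriors = genotype_posteriors(pileup, priors, m=0.7)
print(call_genotype(posteriors))
```

In this toy pileup the plus-strand thymines alone could be mistaken for a C/T heterozygote. Modeling them as conversions at the estimated methylation level, together with the unconverted minus-strand reads, is what lets the sketch favor the homozygous genotype.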
Inventors
- BAYE, James
- ANDREWS, Daniel
- SCHEFFLER, Konrad Haarhoff
Assignees
- Illumina, Inc.
Dates
- Publication Date: 2026-05-06
- Application Date: 2024-06-26
Claims (20)
- 1. A method comprising: identifying, for a target genomic sample, nucleotide reads comprising one or more nucleobases converted by a methylation sequencing assay; determining an estimated methylation-level value for a cytosine base at a genomic coordinate based on prior genotype probabilities for the target genomic sample at the genomic coordinate and observed nucleobases at the genomic coordinate within the nucleotide reads; generating, utilizing a variant call model, posterior genotype probabilities for the target genomic sample at the genomic coordinate based on the estimated methylation-level value and base-call-quality metrics for the observed nucleobases; and generating, based on the posterior genotype probabilities, a genotype call that the target genomic sample comprises a predicted combination of nucleobases at the genomic coordinate.
- 2. The method of claim 1, further comprising generating a refined methylation-level value for the cytosine base at the genomic coordinate based on the posterior genotype probabilities and the observed nucleobases.
- 3. The method of claim 2, further comprising generating the refined methylation-level value by: determining genotype-specific methylation-level values corresponding with each possible genotype at the genomic coordinate based on the observed nucleobases; and weighting the genotype-specific methylation-level values based on the posterior genotype probabilities.
- 4. The method of claim 1, wherein generating the genotype call for the target genomic sample is further based on sequencing metrics corresponding to the nucleotide reads.
- 5. The method of claim 1, wherein the posterior genotype probabilities comprise a subset of posterior genotype probabilities for a cytosine base in a plus strand or a minus strand.
- 6. The method of claim 1, further comprising determining the estimated methylation-level value by: determining a probability of each nucleobase at the genomic coordinate on a plus strand and a minus strand; determining probabilities of the observed nucleobases at the genomic coordinate based on numbers of each observed nucleobase, the probability of each nucleobase at the genomic coordinate, and the prior genotype probabilities; and generating an estimated plus-strand-methylation-level value for a nucleobase at the genomic coordinate on the plus strand and an estimated minus-strand-methylation-level value for a nucleobase at the genomic coordinate on the minus strand by performing a Bayesian inversion on the probabilities of the observed nucleobases.
- 7. The method of claim 6, further comprising determining the probability of each nucleobase by: determining that a probability of a given thymine base on the plus strand approximates the estimated methylation-level value; and determining that a probability of a given cytosine base on the plus strand approximates one minus the estimated methylation-level value.
- 8. The method of claim 6, further comprising determining the probability of each nucleobase by: determining that a probability of a given adenine base on the minus strand approximates the estimated methylation-level value; and determining that a probability of a given guanine base on the minus strand approximates one minus the estimated methylation-level value.
- 9. The method of claim 1, wherein the variant call model comprises a Hidden Markov model (HMM) modified to receive an input based on the estimated methylation-level value and a base-call-quality metric for a corresponding observed nucleobase.
- 10. The method of claim 1, further comprising generating the genotype call based on determining a predicted combination of nucleobases corresponding to a highest posterior genotype probability.
- 11. The method of claim 1, wherein identifying the nucleotide reads comprising the one or more nucleobases converted by the methylation sequencing assay comprises identifying the nucleotide reads comprising thymine bases or uracil bases converted from cytosine bases by the methylation sequencing assay.
- 12. A system comprising: at least one processor; and a non-transitory computer-readable medium comprising instructions that, when executed by the at least one processor, cause the system to: identify, for a target genomic sample, nucleotide reads comprising one or more nucleobases converted by a methylation sequencing assay; determine an estimated methylation-level value for a cytosine base at a genomic coordinate based on prior genotype probabilities for the target genomic sample at the genomic coordinate and observed nucleobases at the genomic coordinate within the nucleotide reads; generate, utilizing a variant call model, posterior genotype probabilities for the target genomic sample at the genomic coordinate based on the estimated methylation-level value and base-call-quality metrics for the observed nucleobases; and generate, based on the posterior genotype probabilities, a genotype call that the target genomic sample comprises a predicted combination of nucleobases at the genomic coordinate.
- 13. The system of claim 12, further comprising instructions that, when executed by the at least one processor, cause the system to generate a refined methylation-level value for the cytosine base at the genomic coordinate based on the posterior genotype probabilities and the observed nucleobases.
- 14. The system of claim 13, further comprising instructions that, when executed by the at least one processor, cause the system to generate the refined methylation-level value by: determining genotype-specific methylation-level values corresponding with each possible genotype at the genomic coordinate based on the observed nucleobases; and weighting the genotype-specific methylation-level values based on the posterior genotype probabilities.
- 15. The system of claim 12, further comprising instructions that, when executed by the at least one processor, cause the system to generate the genotype call for the target genomic sample further based on sequencing metrics corresponding to the nucleotide reads.
- 16. The system of claim 12, wherein the posterior genotype probabilities comprise a subset of posterior genotype probabilities for a cytosine base in a plus strand or a minus strand.
- 17. The system of claim 12, further comprising instructions that, when executed by the at least one processor, cause the system to determine the estimated methylation-level value by: determining a probability of each nucleobase at the genomic coordinate on a plus strand and a minus strand; determining probabilities of the observed nucleobases at the genomic coordinate based on numbers of each observed nucleobase, the probability of each nucleobase at the genomic coordinate, and the prior genotype probabilities; and generating an estimated plus-strand methylation-level value for a nucleobase at the genomic coordinate on the plus strand and an estimated minus-strand methylation-level value for a nucleobase at the genomic coordinate on the minus strand by performing a Bayesian inversion on the probabilities of the observed nucleobases.
- 18. The system of claim 17, further comprising instructions that, when executed by the at least one processor, cause the system to determine the probability of each nucleobase by: determining that a probability of a given thymine base on the plus strand approximates the estimated methylation-level value; and determining that a probability of a given cytosine base on the plus strand approximates one minus the estimated methylation-level value.
- 19. The system of claim 17, further comprising instructions that, when executed by the at least one processor, cause the system to determine the probability of each nucleobase by: determining that a probability of a given adenine base on the minus strand approximates the estimated methylation-level value; and determining that a probability of a given guanine base on the minus strand approximates one minus the estimated methylation-level value.
- 20. The system of claim 12, wherein the variant call model comprises a Hidden Markov model (HMM) modified to receive an input based on the estimated methylation-level value and a base-call-quality metric for a corresponding observed nucleobase.
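The estimation steps recited in claims 6 to 8 (and mirrored in claims 17 to 19) can be sketched in Python as follows. This is a hedged illustration, not the claimed implementation: the function names, the uniform sequencing-error rate `err`, the uniform prior over the methylation level, and the grid-based posterior mean used as the "Bayesian inversion" are all assumptions. Genotypes are two-letter strings in reference (plus-strand) coordinates, and each strand is estimated separately from its own base counts.

```python
import math

BASES = "ACGT"

def base_probs(genotype, m, strand, err=0.01):
    """Probability of each observed base for a read from `strand`.

    `m` is a candidate methylation level; `err` is an assumed uniform
    sequencing-error rate (hypothetical, not taken from the claims).
    """
    probs = dict.fromkeys(BASES, 0.0)
    for allele in genotype:
        if strand == "+" and allele == "C":
            emitted = {"T": m, "C": 1.0 - m}   # relationship in claim 7
        elif strand == "-" and allele == "G":
            emitted = {"A": m, "G": 1.0 - m}   # relationship in claim 8
        else:
            emitted = {allele: 1.0}            # no conversion expected
        for base, p in emitted.items():
            probs[base] += 0.5 * p             # each allele yields half the reads
    # Mix in the error rate so that no base has zero probability.
    return {b: (1.0 - err) * probs[b] + err / 4.0 for b in BASES}

def estimate_methylation(counts, priors, strand, grid=101):
    """Posterior-mean methylation level for one strand.

    For each candidate level on a grid, the likelihood of the observed
    base counts is marginalized over genotypes using the prior genotype
    probabilities; a uniform prior over the level is assumed.
    """
    levels = [i / (grid - 1) for i in range(grid)]
    weights = []
    for m in levels:
        likelihood = 0.0
        for genotype, prior in priors.items():
            p = base_probs(genotype, m, strand)
            log_like = sum(n * math.log(p[b]) for b, n in counts.items())
            likelihood += prior * math.exp(log_like)
        weights.append(likelihood)
    total = sum(weights)
    return sum(m * w for m, w in zip(levels, weights)) / total

# Illustrative plus-strand counts at a site that is probably C/C.
priors = {"CC": 0.98, "CT": 0.01, "TT": 0.01}
plus_counts = {"C": 6, "T": 14}
print(round(estimate_methylation(plus_counts, priors, "+"), 3))
```

Marginalizing over genotypes before inverting is the point of the sketch: a thymine on the plus strand can come from a converted cytosine or from a true T allele, and the prior genotype probabilities decide how much of the thymine signal is attributed to methylation. The plain `math.exp` call would underflow at very high read depth, so a production version would keep the whole computation in log space.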
Description
VARIANT CALLING WITH METHYLATION-LEVEL ESTIMATION
CROSS-REFERENCE TO RELATED APPLICATIONS
[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/510,603, entitled “VARIANT CALLING WITH METHYLATION-LEVEL ESTIMATION,” filed June 27, 2023, which is incorporated herein by reference in its entirety.
BACKGROUND
[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining nucleobase calls for genomic samples. For instance, some existing sequencing machines and sequencing-data-analysis software (together “existing sequencing systems”) predict individual nucleobases within sequences by using conventional Sanger sequencing or sequencing-by-synthesis (SBS) methods. When using SBS, existing sequencing systems can monitor many thousands to millions of oligonucleotides being synthesized in parallel from templates to predict nucleobase calls for growing nucleotide reads. In many existing sequencing systems, a camera captures images of irradiated fluorescent tags incorporated into oligonucleotides. After capturing such images, some existing sequencing systems determine nucleobase calls for nucleotide reads corresponding to the oligonucleotides and send base-call data to a computing device with sequencing-data-analysis software, which aligns nucleotide reads with a reference genome. Based on differences between the aligned nucleotide reads and the reference genome, existing systems can further utilize a variant caller to identify variants of a genomic sample, such as single nucleotide variants (SNVs), insertions or deletions (indels), or other variants within the genomic sample.
[0003] In addition to improved genomic sequencing, biotechnology firms and research institutions have also improved methods of detecting methylation of cytosine bases at particular genomic regions (e.g., regions encoding or promoting genes) and detecting methylation of larger nucleotide fragments or whole genomes of a sample. For instance, some existing sequencing systems can use sequencing devices and corresponding sequencing-data-analysis software to identify when a methyl or hydroxymethyl group has been added to a cytosine base of a sample’s deoxyribonucleic acid (DNA), where the methylated cytosine base is often part of a cytosine-guanine-dinucleotide pair in a 5’-C-phosphate-G-3’ (CpG) configuration in mammals. For example, existing sequencing systems can detect methylated cytosines by (i) enzymatically converting methylated or unmethylated cytosine bases at CpG or other cytosine sites from a sample nucleotide fragment into uracil bases (e.g., dihydrouracil); (ii) determining base calls of nucleotide reads for the sample using a sequencing device, where the sequencing device detects the uracil bases as thymine bases during polymerase chain reaction (PCR) amplification; and (iii) comparing the base calls from the nucleotide reads to a reference genome or non-enzymatically converted nucleotide reads from the sample. Based on the comparison of nucleotide reads from the sample to a reference genome or the non-enzymatically converted nucleotide reads, existing sequencing systems can identify thymine bases from the nucleotide reads that do not match cytosine bases at CpG or other cytosine sites within the reference genome or the non-enzymatically converted nucleotide reads and thereby detect methylated cytosine bases in a sample nucleotide fragment.
[0004] Despite these recent advances, existing sequencing and methylation detection systems face several technical shortcomings.
Different types of methylation assays detect methylated cytosines by converting methylated or unmethylated cytosine bases into uracil bases and subsequently, in some cases, into thymine bases. As oligonucleotides extracted from the genomic sample are duplicated as part of the methylation sequencing assay, complementary strands reflect regions of cytosine-to-thymine substitutions by having adenines in place of guanines. While these conversions aid in the detection of methylation, the conversions may also negatively affect performance and accuracy of existing sequencing systems.
[0005] For example, existing sequencing and methylation detection systems cannot consistently generate accurate genotype calls when simultaneously calling genotype and methylation level during C-to-T conversion-based sequencing. Due in part to C-to-T conversions in methylation assays, existing sequencing systems often produce biased genotype calls and biased methylation-level estimates. Converted methylated or unmethylated cytosine bases often introduce noise into sequence data that, in turn, hinders accurate variant calling. Because of such conversions and noise in methylation assays, existing methylation detection systems often overestimate methylation levels for C/A, C/T, G/A, and G/T genotypes. Furthermore, existing sequencing systems frequently determine