EP-4736170-A1 - MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS

Abstract

This disclosure describes methods, non-transitory computer readable media, and systems that can utilize a machine-learning model to recalibrate genotype calls (e.g., variant calls) corresponding to germline variants and somatic mosaic variants. For instance, based on sequencing metrics for nucleotide reads of a genomic sample, the disclosed systems can utilize a variant-call-recalibration machine-learning model to generate genotype probabilities for variants within genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants. Further, the disclosed systems can generate genotype calls, such as variant calls corresponding to somatic mosaic variants, based on the generated genotype probabilities.
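As a rough illustration of the flow summarized above (sequencing metrics in, genotype probabilities out, then variant calls), the following minimal Python sketch uses a toy likelihood-based scoring function as a stand-in for the variant-call-recalibration machine-learning model. It is not a trained model and not the disclosed method. The metric names (`depth`, `alt_reads`, `mean_base_quality`), the genotype labels, the fixed mosaic allele fraction, and the `min_prob` threshold are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class SiteMetrics:
    """Hypothetical per-site sequencing metrics (field names are illustrative)."""
    chrom: str
    pos: int
    depth: int                # reads covering the site
    alt_reads: int            # reads supporting the alternate allele
    mean_base_quality: float  # Phred-scaled


# Assumed expected alt-allele fraction for the "mosaic" class (illustrative only).
MOSAIC_FRACTION = 0.10


def genotype_probabilities(m):
    """Stand-in for a trained model: map site metrics to genotype probabilities.

    Scores each genotype class with a binomial log-likelihood of the observed
    alt-read count under an expected allele fraction, then normalizes.
    """
    # Convert the Phred quality to an error rate and clamp it to a safe range.
    err = min(max(10 ** (-m.mean_base_quality / 10), 1e-6), 0.49)
    expected = {"hom_ref": err, "mosaic": MOSAIC_FRACTION, "het": 0.5, "hom_alt": 1 - err}
    ref_reads = m.depth - m.alt_reads
    log_lik = {
        g: m.alt_reads * math.log(p) + ref_reads * math.log(1 - p)
        for g, p in expected.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_lik.values())
    weights = {g: math.exp(v - top) for g, v in log_lik.items()}
    total = sum(weights.values())
    return {g: w / total for g, w in weights.items()}


def call_variants(sites, min_prob=0.9):
    """Emit a call for each site whose most probable genotype is a confident non-reference one."""
    calls = []
    for m in sites:
        probs = genotype_probabilities(m)
        genotype, prob = max(probs.items(), key=lambda kv: kv[1])
        if genotype == "hom_ref" or prob < min_prob:
            continue
        variant_type = "somatic_mosaic" if genotype == "mosaic" else "germline"
        calls.append({"chrom": m.chrom, "pos": m.pos, "genotype": genotype,
                      "probability": prob, "variant_type": variant_type})
    return calls


# Made-up example sites: balanced alt support, low alt support, no alt support.
sites = [
    SiteMetrics("chr1", 1000, depth=40, alt_reads=20, mean_base_quality=30.0),
    SiteMetrics("chr1", 2000, depth=100, alt_reads=8, mean_base_quality=30.0),
    SiteMetrics("chr1", 3000, depth=50, alt_reads=0, mean_base_quality=30.0),
]
calls = call_variants(sites)
```

In this sketch a single probability vector per site drives both kinds of call, which mirrors the abstract's description of one model producing genotype probabilities for candidate germline and candidate somatic mosaic variants. A real model would presumably consume many more sequencing metrics than the three assumed here.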

Inventors

  • Visvanath, Arun
  • PARNABY, Gavin, Derek
  • ROSSI, MASSIMILIANO
  • MEHIO, RAMI
  • CATREUX, SEVERINE
  • CHEN, WEI-TING
  • WANG, YINA
  • MURRAY, Lisa, Joy

Assignees

  • Illumina, Inc.

Dates

Publication Date: 2026-05-06
Application Date: 2024-06-28

Claims (20)

  1. A system comprising: at least one processor; and a non-transitory computer readable medium storing instructions that, when executed by the at least one processor, cause the system to: determine sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample; generate, utilizing a variant-call-recalibration machine-learning model and based on the sequencing metrics, genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants; and generate, for the genomic regions and based on the genotype probabilities, at least a first variant call corresponding to a germline variant in the genomic sample and at least a second variant call corresponding to a somatic mosaic variant in the genomic sample.
  2. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, within a sequencing data file, a germline-variant indicator identifying the first variant call as a germline variant; and generate, within the sequencing data file, a somatic-mosaic-variant indicator identifying the second variant call as a somatic mosaic variant.
  3. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to generate, within a sequencing data file, a variant indicator identifying the first variant call or the second variant call as a variant without an indication of a germline variant or a somatic mosaic variant.
  4. The system of claim 1, wherein the first variant call corresponds to a first genomic coordinate of the genomic sample and the second variant call corresponds to a second genomic coordinate of the genomic sample different than the first genomic coordinate.
  5. The system of claim 1, wherein the first variant call and the second variant call correspond to a same genomic coordinate of the genomic sample.
  6. The system of claim 1, wherein the genomic regions comprise one or more target genomic regions comprising one or more candidate somatic mosaic variants for which the variant-call-recalibration machine-learning model was trained to generate predicted genotype probabilities.
  7. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: generate, utilizing a germline-variant-call-recalibration machine-learning model and based on the sequencing metrics, additional genotype probabilities for germline variants within the genomic regions corresponding to the candidate germline variants; and generate, for the genomic regions and based on the additional genotype probabilities, one or more additional candidate variant calls corresponding to one or more germline variants in the genomic sample.
  8. The system of claim 7, further comprising instructions that, when executed by the at least one processor, cause the system to: compare, for a genomic coordinate for the second variant call, a genotype probability generated by the variant-call-recalibration machine-learning model with an additional genotype probability generated by the germline-variant-call-recalibration machine-learning model; and identify the second variant call as a somatic mosaic variant based on a comparison of the genotype probability and the additional genotype probability.
  9. The system of claim 1, wherein the variant-call-recalibration machine-learning model comprises a first machine-learning sub-model configured to generate a first type of genotype probabilities accounting for a set of candidate germline variants and a second machine-learning sub-model configured to generate a second type of genotype probabilities accounting for a set of candidate somatic mosaic variants.
  10. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: access, based on user input, sequencing data comprising sample nucleotide reads and synthetic nucleotide reads comprising modified nucleobases representing ground-truth somatic mosaic variants; determine the sequencing metrics for the sequencing data by determining sample-read-based sequencing metrics for the sample nucleotide reads and synthetic-read-based sequencing metrics for the synthetic nucleotide reads; and train the variant-call-recalibration machine-learning model to generate, based on the sample-read-based sequencing metrics and the synthetic-read-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants based on comparisons of variant calls and the ground-truth somatic mosaic variants.
  11. The system of claim 10, further comprising instructions that, when executed by the at least one processor, cause the system to generate the synthetic nucleotide reads by modifying existing nucleotide reads to include the ground-truth somatic mosaic variants at one or more variant allele frequencies representative of one or more somatic mosaic variants.
  12. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: identify an admixture of genomic samples that simulates variant-allele frequencies of ground-truth somatic mosaic variants and ground-truth germline variants; access a mixture of nucleotide reads comprising a first set of nucleotide reads from a first genomic sample of the admixture of genomic samples and a second set of nucleotide reads from a second genomic sample of the admixture of genomic samples; determine the sequencing metrics for the nucleotide reads by determining admixture-based sequencing metrics for the mixture of nucleotide reads; and train the variant-call-recalibration machine-learning model to generate, based on the admixture-based sequencing metrics, predicted genotype probabilities for somatic mosaic variants and germline variants based on comparisons of predicted variant calls with the ground-truth somatic mosaic variants and the ground-truth germline variants.
  13. The system of claim 1, further comprising instructions that, when executed by the at least one processor, cause the system to: receive an indication of a user selection of a variant-sensitivity option corresponding to detection of the candidate somatic mosaic variants; and execute the variant-call-recalibration machine-learning model to generate the genotype probabilities instead of a germline-variant-call-recalibration machine-learning model configured to generate a different type of genotype probabilities for candidate germline variants.
  14. The system of claim 1, wherein the variant-call-recalibration machine-learning model comprises one or more of a gradient boost decision tree or a random forest model.
  15. A non-transitory computer-readable medium storing instructions that, when executed by at least one processor, cause a computing device to: determine sequencing metrics for nucleotide reads corresponding to genomic regions of a genomic sample; generate, utilizing a variant-call-recalibration machine-learning model and based on the sequencing metrics, genotype probabilities for variants within the genomic regions corresponding to candidate germline variants and candidate somatic mosaic variants; and generate, for the genomic regions and based on the genotype probabilities, at least a first variant call corresponding to a germline variant in the genomic sample and at least a second variant call corresponding to a somatic mosaic variant in the genomic sample.
  16. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to: generate, within a sequencing data file, a germline-variant indicator identifying the first variant call as a germline variant; and generate, within the sequencing data file, a somatic-mosaic-variant indicator identifying the second variant call as a somatic mosaic variant.
  17. The non-transitory computer-readable medium of claim 15, further comprising instructions that, when executed by the at least one processor, cause the computing device to generate, within a sequencing data file, a variant indicator identifying the first variant call or the second variant call as a variant without an indication of a germline variant or a somatic mosaic variant.
  18. The non-transitory computer-readable medium of claim 15, wherein the first variant call corresponds to a first genomic coordinate of the genomic sample and the second variant call corresponds to a second genomic coordinate of the genomic sample different than the first genomic coordinate.
  19. The non-transitory computer-readable medium of claim 15, wherein the first variant call and the second variant call correspond to a same genomic coordinate of the genomic sample.
  20. The non-transitory computer-readable medium of claim 15, wherein the genomic regions comprise one or more target genomic regions comprising one or more candidate somatic mosaic variants for which the variant-call-recalibration machine-learning model was trained to generate predicted genotype probabilities.
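
Claim 11 describes generating synthetic nucleotide reads by modifying existing reads so that a ground-truth variant appears at a chosen variant allele frequency. A minimal sketch of that idea follows. It assumes reads are simple `(start, sequence)` pairs with ungapped alignment and that every covering read initially carries the reference base at the target position. The helper name `spike_in_snv` and its parameters are hypothetical, and a production pipeline would more plausibly operate on aligned read files and also handle base qualities.

```python
import random


def spike_in_snv(reads, pos, alt_base, target_vaf, seed=0):
    """Return modified copies of `reads` with `alt_base` written at `pos`.

    reads:      list of (start, sequence) tuples; 0-based start, ungapped
                alignment (a simplifying assumption).
    pos:        0-based reference position of the synthetic variant.
    target_vaf: desired fraction of covering reads that carry the alt base.

    Returns (new_reads, n_modified, n_covering).
    """
    rng = random.Random(seed)  # seeded so the synthetic data set is reproducible
    # Indices of reads that overlap the target position.
    covering = [i for i, (start, seq) in enumerate(reads)
                if start <= pos < start + len(seq)]
    n_modify = round(target_vaf * len(covering))
    chosen = set(rng.sample(covering, n_modify))

    new_reads = []
    for i, (start, seq) in enumerate(reads):
        if i in chosen:
            offset = pos - start
            # Substitute a single base; read length is unchanged.
            seq = seq[:offset] + alt_base + seq[offset + 1:]
        new_reads.append((start, seq))
    return new_reads, len(chosen), len(covering)


# Made-up input: 20 identical reads covering position 4, plus one read elsewhere.
example_reads = [(0, "ACGTACGTAC")] * 20 + [(20, "GGGG")]
synthetic_reads, n_modified, n_covering = spike_in_snv(
    example_reads, pos=4, alt_base="T", target_vaf=0.10
)
```

Because the modified position and allele are chosen by the caller, they can serve as the ground-truth label when comparing variant calls against the spiked-in variants during training, as claim 10 describes.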

Description

MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS

CROSS-REFERENCE TO RELATED APPLICATIONS

[0001] This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/511,605, entitled, “MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS,” filed on June 30, 2023, and U.S. Provisional Patent Application No. 63/607,446, entitled, “MACHINE-LEARNING MODEL FOR RECALIBRATING GENOTYPE CALLS CORRESPONDING TO GERMLINE VARIANTS AND SOMATIC MOSAIC VARIANTS,” filed on December 7, 2023. Each of the aforementioned applications is hereby incorporated by reference in its entirety.

BACKGROUND

[0002] In recent years, biotechnology firms and research institutions have improved hardware and software for sequencing nucleotides and determining germline variant calls for genomic samples. For instance, some existing nucleobase sequencing platforms determine individual nucleobases within sequences from germ cells by using conventional Sanger sequencing or by using sequencing-by-synthesis (SBS) methods. When using SBS, existing platforms can monitor millions to billions of nucleic acid polymers being synthesized in parallel to predict nucleobase calls from a larger base call dataset. For instance, a camera in many SBS platforms captures images of irradiated fluorescent tags incorporated into oligonucleotides for determining the nucleobase calls. After capturing such images, existing SBS platforms send base call data (or image-based data) to a computing device to apply sequencing data analysis software that determines a nucleobase sequence for a genomic sample or other nucleic acid polymer.
Based on differences between the aligned nucleotide reads and the reference genome, existing data analysis software can further utilize a variant caller to identify germline variants within the germline cells of a genomic sample, such as single nucleotide polymorphisms (SNPs), insertions or deletions (indels), and/or other structural variants, and genotype calls.

[0003] Despite these recent advances in sequencing and germline variant calling, existing nucleobase sequencing platforms and sequencing data analysis software (together and hereinafter, existing sequencing systems) often (a) limit variant calling to germline variants only and/or (b) cannot accurately detect both somatic mosaic variant calls and germline variant calls. For example, some existing systems utilize extensive statistical data analysis, such as Bayesian probabilistic modeling, to implement computational tools (e.g., Java-based tools) for identifying somatic mosaic and germline variant calls within existing sequence data. But such Bayesian-based systems require significant computation time, processing, and resources and can often result in multiple false positives in identifying somatic mosaic variants. Such limits and shortcomings also apply to state-of-the-art machine-learning-based sequencing systems. In both machine-learning-based and statistical or probabilistic models, existing sequencing systems exhibit the technical limits of (a) and (b) in part due to the nature of somatic mosaic variants. Germline variants of a genomic sample are inherited by the time of the sample’s zygote from parents and are present in the sample’s germ cells. By contrast, somatic mosaic variants typically constitute mutations that (i) were introduced after zygote formation during cell development (e.g., 1 of 4 early cells), but (ii) were not inherited from the given sample’s parents, and (iii) were not introduced by a form of cancer or tumor in the given sample.
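
To make the "1 of 4 early cells" example concrete, a back-of-the-envelope calculation follows. It assumes a diploid locus, a heterozygous mutation, and equal contribution of every early cell lineage to the sampled tissue, and it ignores sequencing error. These simplifications, the function names, and the read depth used are assumptions for illustration only and are not taken from the disclosure.

```python
from math import comb


def expected_vaf(mutant_cell_fraction, mutant_alleles_per_cell=1, ploidy=2):
    """Expected variant allele fraction given the fraction of cells carrying the mutation."""
    return mutant_cell_fraction * mutant_alleles_per_cell / ploidy


def prob_at_least_k_alt_reads(depth, vaf, k):
    """Binomial probability of seeing at least `k` alt-supporting reads among
    `depth` reads when each read independently carries the alt allele with
    probability `vaf` (sequencing error is ignored)."""
    p_fewer = sum(comb(depth, i) * vaf ** i * (1 - vaf) ** (depth - i)
                  for i in range(k))
    return 1.0 - p_fewer


# Mutation arising in 1 of 4 early cells, heterozygous in those cells.
mosaic_vaf = expected_vaf(1 / 4)    # 0.125
# Heterozygous germline variant: present in every cell.
germline_vaf = expected_vaf(1.0)    # 0.5

# Chance of at least 3 supporting reads at a hypothetical 30X depth.
p_detect_mosaic = prob_at_least_k_alt_reads(depth=30, vaf=mosaic_vaf, k=3)
p_detect_germline = prob_at_least_k_alt_reads(depth=30, vaf=germline_vaf, k=3)
```

Under these assumptions, a mutation that arises later in development affects a smaller fraction of cells, so `expected_vaf` shrinks proportionally and fewer reads at a given depth are expected to support the variant.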
Consequently, a relatively small proportion of a given sample’s cells include such somatic mosaic variants. Depending on when in development or which cell type a somatic mosaic variant has been introduced, the variant allele fraction of a somatic mosaic variant in a given sample’s cells can range from 10-50% to much smaller percentages, such as 0.1%.

[0004] In addition to the relatively low variant allele fraction of somatic mosaic variants, existing sequencing systems often lack computational models (or other mechanisms) for filtering noise during DNA sequencing. Consequently, as indicated above, existing sequencing systems cannot accurately determine both somatic mosaic variant calls and germline variant calls for a given sample. For instance, existing sequencing systems often determine false-positive somatic mosaic variant calls based on various noise sources common in DNA sequencing, such as sequence-specific errors (SSEs) induced by one or more of inverted repeats, homopolymers, or nucleotide context; uneven read depth or coverage across genomic regions of a reference genome, where certain genomic regions comprising somatic mosaic variants may lack read coverage (e.g., below 10X or 20X); sequencing platform-specific errors induced by, for example, barcode swapping or allele