US-12619875-B2 - Deep learning-based pathogenicity classifier for promoter single nucleotide variants (pSNVs)


Abstract

We disclose computational models that alleviate the effects of human ascertainment biases in curated pathogenic non-coding variant databases by generating pathogenicity scores for variants occurring in the promoter regions (referred to herein as promoter single nucleotide variants (pSNVs)). We train deep learning networks (referred to herein as pathogenicity classifiers) using a semi-supervised approach to discriminate between a set of labeled benign variants and an unlabeled set of variants that were matched to remove biases.

Inventors

Sofia KYRIAZOPOULOU PANAGIOTOPOULOU
Kai-How FARH

Assignees

ILLUMINA, INC.

Dates

Publication Date: 20260505
Application Date: 20231117

Claims (20)

1 . A system comprising: at least one processor; and a non-transitory computer readable medium comprising instructions that, when executed by the at least one processor, cause the system to: access a nucleotide sequence corresponding to a promoter region and comprising nucleotides at target base positions and nucleotide bases at flanking base positions that flank each target base position; generate, from the nucleotide sequence corresponding to the promoter region, alternative nucleotide sequences comprising nucleotide variants substituted for the nucleotides at the target base positions; provide, as an input to one or more layers of a trained pathogenicity classifier, encoded data representing the alternative nucleotide sequences corresponding to the promoter region; iteratively process, utilizing one or more blocks of the trained pathogenicity classifier in a single inference invocation, the encoded data representing the alternative nucleotide sequences comprising the nucleotide variants at the target base positions of the nucleotide sequence; generate, for the single inference invocation by the trained pathogenicity classifier, predictions that the nucleotide variants are benign or pathogenic at each target position of the target base positions of the nucleotide sequence based on the nucleotide bases at the flanking base positions; and classify, based on the predictions, a nucleotide variant as pathogenic.
2 . The system of claim 1 , wherein the trained pathogenicity classifier comprises a neural network.
3 . The system of claim 1 , further comprising instructions that, when executed by the at least one processor, cause the system to generate, by the trained pathogenicity classifier, the predictions by generating a pathogenicity likelihood score that each of three nucleotide variants at the target base positions of the alternative nucleotide sequences is benign or pathogenic.
4 . The system of claim 1 , further comprising instructions that, when executed by the at least one processor, cause the system to: access, for the target base positions of the nucleotide sequence corresponding to the promoter region, one or more of a protein binding affinity score, signal for deoxyribonucleic acid (DNA) methylation changes, a signal for histone modifications, a signal for noncoding ribonucleic acid (ncRNA) expression, a signal for chromatin structural changes, a signal for deoxyribonuclease (DNase), or a signal for histone 3 lysine 27 acetylation (H3K27ac); and generate, by the trained pathogenicity classifier, the predictions that the nucleotide variants are benign or pathogenic at each target base position of the nucleotide sequence based further on one or more of the protein binding affinity score, the signal for DNA methylation changes, the signal for histone modifications, the signal for histone modifications, the signal for ncRNA expression, the signal for chromatin structural changes, the signal for DNase, or the signal for H3K27ac.
5 . The system of claim 1 , further comprising instructions that, when executed by the at least one processor, cause the system to provide the nucleotide sequence corresponding to the promoter region as the input to the trained pathogenicity classifier by providing an encoded representation of the nucleotide sequence as input to the trained pathogenicity classifier.
6 . The system of claim 1 , further comprising instructions that, when executed by the at least one processor, cause the system to generate, by the trained pathogenicity classifier, the predictions by: determining a sequence motif within the nucleotide bases at the flanking base positions; and generate the predictions based on the sequence motif within the nucleotide bases at the flanking base positions.
7 . The system of claim 1 , further comprising instructions that, when executed by the at least one processor, cause the system to generate, by the trained pathogenicity classifier, the predictions by: determining a trinucleotide context for the nucleotide variants within the nucleotide bases at the flanking base positions; and generate the predictions based on the trinucleotide context within the nucleotide bases at the flanking base positions.
8 . The system of claim 1 , further comprising instructions that, when executed by the at least one processor, cause the system to: access the nucleotide sequence corresponding to the promoter region comprising single nucleotide variant (SNVs) at the target base positions; and generate, by the trained pathogenicity classifier, the predictions that the SNVs are benign or pathogenic at the target base positions.
9 . The system of claim 1 , wherein the nucleotide bases at the flanking base positions comprise reference bases from a reference genome.
10 . The system of claim 1 , wherein the trained pathogenicity classifier has been trained utilizing input promoter training sequences comprising one or more observed pSNVs sampled from pSNVs identified within a sample genomic dataset and one or more unobserved pSNVs sampled from a pool of substitutionally generated pSNVs at base positions for which pSNVs are not identified within the sample genomic dataset.
11 . A non-transitory computer readable medium comprising instructions that, when executed by at least one processor, cause a system to: access a nucleotide sequence corresponding to a promoter region and comprising nucleotides at target base positions and nucleotide bases at flanking base positions that flank each target base position; generate, from the nucleotide sequence corresponding to the promoter region, alternative nucleotide sequences comprising nucleotide variants substituted for the nucleotides at the target base positions; provide, as an input to one or more layers of a trained pathogenicity classifier, encoded data representing the alternative nucleotide sequences corresponding to the promoter region; iteratively process, utilizing one or more blocks of the trained pathogenicity classifier in a single inference invocation, the encoded data representing the alternative nucleotide sequences comprising the nucleotide variants at the target base positions of the nucleotide sequence; generate, for the single inference invocation by the trained pathogenicity classifier, predictions that the nucleotide variants are benign or pathogenic at each target position of the target base positions of the nucleotide sequence based on the nucleotide bases at the flanking base positions; and classify, based on the predictions, a nucleotide variant as pathogenic.
12 . The non-transitory computer readable medium of claim 11 , wherein the trained pathogenicity classifier comprises a neural network.
13 . The non-transitory computer readable medium of claim 11 , further comprising instructions that, when executed by the at least one processor, cause the system to generate, by the trained pathogenicity classifier, the predictions by generating a pathogenicity likelihood score that each of three nucleotide variants at the target base positions of the alternative nucleotide sequences is benign or pathogenic.
14 . The non-transitory computer readable medium of claim 11 , further comprising instructions that, when executed by the at least one processor, cause the system to: access, for the target base positions of the nucleotide sequence corresponding to the promoter region, one or more of a protein binding affinity score, signal for deoxyribonucleic acid (DNA) methylation changes, a signal for histone modifications, a signal for noncoding ribonucleic acid (ncRNA) expression, a signal for chromatin structural changes, a signal for deoxyribonuclease (DNase), or a signal for histone 3 lysine 27 acetylation (H3K27ac); and generate, by the trained pathogenicity classifier, the predictions that the nucleotide variants are benign or pathogenic at the each target base position of the nucleotide sequence based further on one or more of the protein binding affinity score, the signal for DNA methylation changes, the signal for histone modifications, the signal for histone modifications, the signal for ncRNA expression, the signal for chromatin structural changes, the signal for DNase, or the signal for H3K27ac.
15 . The non-transitory computer readable medium of claim 11 , further comprising instructions that, when executed by the at least one processor, cause the system to provide the nucleotide sequence corresponding to the promoter region as the input to the trained pathogenicity classifier by providing an encoded representation of the nucleotide sequence as input to the trained pathogenicity classifier.
16 . The non-transitory computer readable medium of claim 11 , further comprising instructions that, when executed by the at least one processor, cause the system to generate, by the trained pathogenicity classifier, the predictions by: determining a sequence motif within the nucleotide bases at the flanking base positions; and generate the predictions based on the sequence motif within the nucleotide bases at the flanking base positions.
17 . The non-transitory computer readable medium of claim 11 , further comprising instructions that, when executed by the at least one processor, cause the system to generate, by the trained pathogenicity classifier, the predictions by: determining a sequence motif within the nucleotide bases at the flanking base positions; and generate the predictions based on the sequence motif within the nucleotide bases at the flanking base positions.
18 . A computer-implemented method comprising: accessing a nucleotide sequence corresponding to a promoter region and comprising nucleotides at target base positions and nucleotide bases at flanking base positions that flank each target base position; generating, from the nucleotide sequence corresponding to the promoter region, alternative nucleotide sequences comprising nucleotide variants substituted for the nucleotides at the target base positions; providing, as an input to one or more layers of a trained pathogenicity classifier, encoded data representing the alternative nucleotide sequences corresponding to the promoter region; iteratively processing, utilizing one or more blocks of the trained pathogenicity classifier in a single inference invocation, the encoded data representing the alternative nucleotide sequences comprising the nucleotide variants at the target base positions of the nucleotide sequence; generating, for the single inference invocation by the trained pathogenicity classifier, predictions that the nucleotide variants are benign or pathogenic at each target position of the target base positions of the nucleotide sequence based on the nucleotide bases at the flanking base positions; and classifying, based on the predictions, a nucleotide variant as pathogenic.
19 . The computer-implemented method of claim 18 , wherein generating, by the trained pathogenicity classifier, the predictions comprises: determining a trinucleotide context for the nucleotide variants within the nucleotide bases at the flanking base positions; and generate the predictions based on the trinucleotide context within the nucleotide bases at the flanking base positions.
20 . The computer-implemented method of claim 18 , further comprising: accessing the nucleotide sequence corresponding to the promoter region comprising single nucleotide variants (SNVs) at the target base positions; and generating, by the trained pathogenicity classifier, the predictions that the SNVs are benign or pathogenic at the target base positions.
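The sequence-handling steps recited in the claims above (substituting variants at target base positions, encoding the alternative sequences, reading a trinucleotide context, and building a pool of unobserved pSNVs) can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the A/C/G/T one-hot channel order, the 0-based positions, and the toy reference sequence are all assumptions made for illustration, and no trained classifier is included.

```python
import random

BASES = "ACGT"  # assumed channel order for the one-hot encoding


def alternative_sequences(ref_seq, target_positions):
    """Yield (position, alt_base, alt_sequence) for each of the three
    possible single-base substitutions at every target position."""
    for pos in target_positions:
        ref_base = ref_seq[pos]
        for alt in BASES:
            if alt != ref_base:
                yield pos, alt, ref_seq[:pos] + alt + ref_seq[pos + 1:]


def one_hot(seq):
    """Encode a nucleotide string as a list of 4-element 0/1 rows."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]


def trinucleotide_context(ref_seq, pos):
    """Return the reference base at pos with one flanking base on each side."""
    if pos < 1 or pos > len(ref_seq) - 2:
        raise ValueError("position needs a flanking base on both sides")
    return ref_seq[pos - 1:pos + 2]


def unobserved_pool(ref_seq, observed):
    """Generate substitutions only at positions with no observed pSNV.

    observed is a set of (position, alt_base) pairs (hypothetical format).
    """
    observed_positions = {pos for pos, _ in observed}
    free = [p for p in range(len(ref_seq)) if p not in observed_positions]
    return [(pos, alt) for pos, alt, _ in alternative_sequences(ref_seq, free)]


ref = "ACGTAC"  # toy reference sequence, not real promoter data
alts = list(alternative_sequences(ref, [1, 2]))
encoded = [one_hot(alt_seq) for _, _, alt_seq in alts]
pool = unobserved_pool(ref, {(1, "A")})
sampled = random.Random(0).sample(pool, 4)  # seeded only for repeatability
```

In this sketch, `encoded` stands in for the "encoded data representing the alternative nucleotide sequences" that would be passed to a classifier, and `sampled` stands in for unobserved pSNVs drawn from the substitutionally generated pool.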

Description

PRIORITY APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 16/578,210, titled, “Deep Learning-Based Pathogenicity Classifier for Promoter Single Nucleotide Variants (pSNVs),” which claims priority to or the benefit of U.S. Provisional Patent Application No. 62/734,116, titled, “Deep Learning-Based Pathogenicity Classifier for Promoter Single Nucleotide Variants (pSNVs),” filed Sep. 20, 2018. The provisional application is hereby incorporated by reference for all purposes.
U.S. patent application Ser. No. 16/578,210 is a continuation-in-part of U.S. patent application Ser. No. 16/160,903, titled, “Deep Learning-Based Techniques for Training Deep Convolutional Neural Networks,” filed Oct. 15, 2018, which claims priority to or the benefit of U.S. Provisional Patent Application No. 62/573,144, titled, “Training a Deep Pathogenicity Classifier Using Large-Scale Benign Training Data,” filed Oct. 16, 2017; U.S. Provisional Patent Application No. 62/573,149, titled, “Pathogenicity Classifier Based On Deep Convolutional Neural Networks (CNNS),” filed Oct. 16, 2017; U.S. Provisional Patent Application No. 62/573,153, titled, “Deep Semi-Supervised Learning that Generates Large-Scale Pathogenic Training Data,” filed Oct. 16, 2017; and U.S. Provisional Patent Application No. 62/582,898, titled, “Pathogenicity Classification of Genomic Data Using Deep Convolutional Neural Networks (CNNs),” filed Nov. 7, 2017. The non-provisional and provisional applications are hereby incorporated by reference for all purposes.
FIELD OF THE TECHNOLOGY DISCLOSED
The technology disclosed relates to artificial intelligence type computers and digital data processing systems and corresponding data processing methods and products for emulation of intelligence (i.e., knowledge based systems, reasoning systems, and knowledge acquisition systems); and including systems for reasoning with uncertainty (e.g., fuzzy logic systems), adaptive systems, machine learning systems, and artificial neural networks. In particular, the technology disclosed relates to using deep neural networks such as deep convolutional neural networks for analyzing data.
BACKGROUND
The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.
Genetic variations can help explain many diseases. Every human being has a unique genetic code, and there are many genetic variants within a group of individuals. Most of the deleterious genetic variants have been depleted from genomes by natural selection. It is important to identify which genetic variations are likely to be pathogenic or deleterious. This will help researchers focus on the likely pathogenic genetic variants and accelerate the pace of diagnosis and cure of many diseases.
Modeling the properties and functional effects (e.g., pathogenicity) of variants is an important but challenging task in the field of genomics.
Despite the rapid advancement of functional genomic sequencing technologies, interpretation of the functional consequences of non-coding variants remains a great challenge due to the complexity of cell type-specific transcription regulation systems. In addition, a limited number of non-coding variants have been functionally validated by experiments. Previous efforts on interpreting genomic variants have mainly concentrated on variants in the coding regions. However, the non-coding variants also play an important role in complex diseases. Identifying the pathogenic functional non-coding variants from the massive neutral ones can be important in genotype-phenotype relationship research and precision medicine.
Furthermore, most of the known pathogenic non-coding variants reside in the promoter regions or conserved sites, causing ascertainment bias in the training set because easy or obvious cases known for pathogenic tendencies are likely to be enriched in labeled data sets relative to the entire population of the pathogenic non-coding variants. If left unaddressed, this bias in the labeled pathogenic data would lead to unrealistic model performance, as a model could achieve relatively high test set performance simply by predicting that all core variants are pathogenic and all others are benign. However, in the clinic, such a model would incorrectly classify pathogenic, non-core variants as benign at an unacceptably high rate.
Advances in biochemical technologies over the past decades have given rise to next generation sequencing (NGS) platforms that quickly produce genomic data at much lower costs than ever before. Such overwhelmingly large volumes of sequenced DNA r