US-12626782-B2 - Architectures for training neural networks using biological sequences, conservation, and molecular phenotypes


Abstract

The present disclosure provides methods and systems that can ascertain how genetic variants impact molecular phenotypes. Such methods and systems may use additional conservation information. In an aspect, the present disclosure provides a method for training a molecular phenotype neural network (MPNN), comprising: (a) providing a molecular phenotype neural network (MPNN) comprising one or more parameters; (b) providing a training data set comprising (i) a set of one or more inputs comprising biological sequences and (ii) for each input in the set of one or more inputs, a set of one or more molecular phenotypes corresponding to the input; (c) configuring the one or more parameters of the MPNN based on the training data set to minimize a total loss of the training data set, thereby training the MPNN; and (d) outputting the one or more parameters of the MPNN.

Inventors

Brendan Frey

Assignees

DEEP GENOMICS INCORPORATED

Dates

Publication Date: 20260512
Application Date: 20171213

Claims (20)

1. A method for training a neural network for processing a test biological sequence, the method comprising: (a) providing the neural network, wherein the neural network comprises at least one intermediate layer and is configured to process an input biological sequence to determine output data comprising: (i) a molecular phenotype corresponding to the input biological sequence, wherein the molecular phenotype comprises a numerical value which quantifies biological molecules of cells, and (ii) a conservation value corresponding to each element of a plurality of elements of the input biological sequence; (b) providing a training data set comprising: (i) a set of input biological sequences, and (ii) for each input biological sequence in the set of input biological sequences, label data comprising: (1) a molecular phenotype corresponding to the input biological sequence, and (2) a conservation value corresponding to each element of a plurality of elements of the input biological sequence; (c) using the training data set to configure a set of parameters of the neural network, such that a total loss of the training data set is minimized based at least in part on minimizing (i) a total loss of the molecular phenotypes and (ii) a total loss of the conservation values, thereby generating a trained neural network; and (d) processing the test biological sequence using the trained neural network, wherein the processing comprises providing the test biological sequence to the trained neural network, to determine a molecular phenotype corresponding to the test biological sequence and a conservation value corresponding to each test element of a plurality of test elements of the test biological sequence.
2. The method of claim 1, wherein the trained neural network comprises a single intermediate layer configured to determine the molecular phenotype and the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
3. The method of claim 1, wherein the trained neural network comprises a plurality of intermediate layers, wherein a last layer of the plurality of intermediate layers is configured to determine the molecular phenotype and the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
4. The method of claim 1, wherein the trained neural network comprises a plurality of intermediate layers, wherein a first layer of the plurality of intermediate layers is configured to determine the molecular phenotype corresponding to the test biological sequence, and wherein a second layer of the plurality of intermediate layers is configured to determine the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
5. The method of claim 1, wherein the molecular phenotype corresponding to the test biological sequence is determined based at least in part on the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence determined using the trained neural network.
6. The method of claim 1, wherein the input biological sequences comprise deoxyribonucleic acid (DNA) sequences, ribonucleic acid (RNA) sequences, or protein sequences.
7. The method of claim 1, wherein the input biological sequences comprise a genetic variant as compared to a reference genome, wherein the genetic variant comprises a substitution, an insertion, a deletion, or a combination thereof.
8. The method of claim 7, wherein the genetic variant is selected from the group consisting of a nucleotide variant, a single base substitution, a copy number variation (CNV), a single nucleotide variant (SNV), an insertion or deletion (indel), a fusion, a transversion, a translocation, an inversion, a duplication, an amplification, a truncation, and a combination thereof.
9. The method of claim 1, wherein the molecular phenotypes comprise a level or a percentage of transcripts that include an exon, a level or a percentage of transcripts that use an alternative splice site, a level or a percentage of transcripts that use an alternative polyadenylation site, an affinity of an RNA-protein interaction, an affinity of a DNA-protein interaction, a specificity of an RNA-binding protein, a specificity of a DNA-binding protein, a specificity of a microRNA-RNA interaction, a level of protein phosphorylation, a phosphorylation pattern, a distribution of proteins along a strand of DNA containing a gene, a number of copies of gene transcripts, a distribution of proteins along a transcript, a number of proteins, or a combination thereof.
10. The method of claim 1, wherein the input biological sequence is a nucleotide sequence.
11. The method of claim 1, wherein the element of the plurality of elements is a nucleotide.
12. A system for training a neural network for processing a test biological sequence, the system comprising: a data storage unit comprising a training data set comprising (i) a set of input biological sequences, and (ii) for each input biological sequence in the set of input biological sequences, label data comprising: (1) a molecular phenotype corresponding to the input biological sequence, and (2) a conservation value corresponding to each element of a plurality of elements of the input biological sequence; and one or more computer processors operatively coupled to the data storage unit, wherein the one or more computer processors are individually or collectively programmed to: (a) provide the neural network, wherein the neural network comprises at least one intermediate layer and is configured to process an input biological sequence to determine output data comprising: (i) a molecular phenotype corresponding to the input biological sequence, wherein the molecular phenotype comprises a numerical value which quantifies biological molecules of cells, and (ii) a conservation value corresponding to each element of a plurality of elements of the input biological sequence; (b) use the training data set to configure a set of parameters of the neural network, such that a total loss of the training data set is minimized at least in part by minimizing (i) a total loss of the molecular phenotypes and (ii) a total loss of the conservation values, thereby generating a trained neural network; and (c) provide the trained neural network with a test biological sequence, wherein the trained neural network is configured to determine a molecular phenotype corresponding to the test biological sequence and a conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
13. The system of claim 12, wherein the trained neural network comprises a single intermediate layer configured to determine the molecular phenotype and the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
14. The system of claim 12, wherein the trained neural network comprises a plurality of intermediate layers, wherein a last layer of the plurality of intermediate layers is configured to determine the molecular phenotype and the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
15. The system of claim 12, wherein the trained neural network comprises a plurality of intermediate layers, wherein a first layer of the plurality of intermediate layers is configured to determine the molecular phenotype corresponding to the test biological sequence, and wherein a second layer of the plurality of intermediate layers is configured to determine the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence.
16. The system of claim 12, wherein the one or more computer processors are individually or collectively programmed to further determine the molecular phenotype corresponding to the test biological sequence based at least in part on the conservation value corresponding to each test element of the plurality of test elements of the test biological sequence determined using the trained neural network.
17. The system of claim 12, wherein the input biological sequences comprise deoxyribonucleic acid (DNA) sequences, ribonucleic acid (RNA) sequences, or protein sequences.
18. The system of claim 12, wherein the input biological sequences comprise a genetic variant as compared to a reference genome, wherein the genetic variant comprises a substitution, an insertion, a deletion, or a combination thereof.
19. The system of claim 18, wherein the genetic variant is selected from the group consisting of a nucleotide variant, a single base substitution, a copy number variation (CNV), a single nucleotide variant (SNV), an insertion or deletion (indel), a fusion, a transversion, a translocation, an inversion, a duplication, an amplification, a truncation, and a combination thereof.
20. The system of claim 12, wherein the molecular phenotypes comprises a level or a percentage of transcripts that include an exon, a level or a percentage of transcripts that use an alternative splice site, a level or a percentage of transcripts that use an alternative polyadenylation site, an affinity of an RNA-protein interaction, an affinity of a DNA-protein interaction, a specificity of an RNA-binding protein, a specificity of a DNA-binding protein, a specificity of a microRNA-RNA interaction, a level of protein phosphorylation, a phosphorylation pattern, a distribution of proteins along a strand of DNA containing a gene, a number of copies of gene transcripts, a distribution of proteins along a transcript, a number of proteins, or a combination thereof.
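
The joint training recited in claims 1 and 12 can be pictured with a small sketch. The following is only an illustration under stated assumptions, not the patented implementation: the claims do not specify an encoding, a layer type, a loss function, or an optimizer, so the one-hot encoding, the single tanh intermediate layer (loosely following claims 2 and 13), the squared-error losses, plain stochastic gradient descent, and every name and number below (`one_hot`, `init_params`, `forward`, `total_loss`, `sgd_step`, `train`, `toy`) are hypothetical choices made for readability. The total loss is taken as the sum of a phenotype loss and a per-element conservation loss.

```python
import math
import random

ALPHABET = "ACGT"  # assumption: nucleotide inputs, as in claims 10 and 11

def one_hot(seq):
    """Flatten a nucleotide string into a list of 4 * len(seq) floats (0.0 or 1.0)."""
    return [1.0 if base == letter else 0.0 for base in seq for letter in ALPHABET]

def init_params(seq_len, hidden, rng):
    """Small random weights; all shapes are illustrative assumptions."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {
        "W1": mat(hidden, 4 * seq_len), "b1": [0.0] * hidden,  # shared intermediate layer
        "wp": [rng.uniform(-0.1, 0.1) for _ in range(hidden)], "bp": 0.0,  # phenotype output
        "Wc": mat(seq_len, hidden), "bc": [0.0] * seq_len,  # one conservation value per element
    }

def forward(params, x):
    """Return (hidden activations, phenotype value, per-element conservation values)."""
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(params["W1"], params["b1"])]
    phenotype = sum(w * hi for w, hi in zip(params["wp"], h)) + params["bp"]
    conservation = [sum(w * hi for w, hi in zip(row, h)) + b
                    for row, b in zip(params["Wc"], params["bc"])]
    return h, phenotype, conservation

def total_loss(params, data):
    """Sum over the data set of phenotype loss plus conservation loss (squared error)."""
    loss = 0.0
    for x, y_pheno, y_cons in data:
        _, p, c = forward(params, x)
        loss += (p - y_pheno) ** 2 + sum((ci - yi) ** 2 for ci, yi in zip(c, y_cons))
    return loss

def sgd_step(params, x, y_pheno, y_cons, lr):
    """One gradient step on a single example; both losses update the shared layer."""
    h, p, c = forward(params, x)
    dp = 2.0 * (p - y_pheno)                             # d(loss)/d(phenotype)
    dc = [2.0 * (ci - yi) for ci, yi in zip(c, y_cons)]  # d(loss)/d(conservation_j)
    dh = [dp * params["wp"][k] + sum(dc[j] * params["Wc"][j][k] for j in range(len(dc)))
          for k in range(len(h))]
    dz = [dh[k] * (1.0 - h[k] ** 2) for k in range(len(h))]  # back through tanh
    for k in range(len(h)):
        params["wp"][k] -= lr * dp * h[k]
        params["b1"][k] -= lr * dz[k]
        for i, xi in enumerate(x):
            params["W1"][k][i] -= lr * dz[k] * xi
    params["bp"] -= lr * dp
    for j in range(len(dc)):
        params["bc"][j] -= lr * dc[j]
        for k in range(len(h)):
            params["Wc"][j][k] -= lr * dc[j] * h[k]

def train(data, seq_len, hidden=8, epochs=300, lr=0.02, seed=0):
    """Configure the parameters by repeated passes over the training data set."""
    params = init_params(seq_len, hidden, random.Random(seed))
    for _ in range(epochs):
        for x, y_pheno, y_cons in data:
            sgd_step(params, x, y_pheno, y_cons, lr)
    return params

# Made-up toy examples: (encoded sequence, phenotype label, per-nucleotide conservation labels).
toy = [
    (one_hot("ACGT"), 0.8, [0.9, 0.1, 0.5, 0.2]),
    (one_hot("ACTT"), 0.3, [0.9, 0.1, 0.4, 0.2]),
    (one_hot("GCGT"), 0.6, [0.2, 0.1, 0.5, 0.2]),
]

if __name__ == "__main__":
    trained = train(toy, seq_len=4)
    _, phenotype, conservation = forward(trained, one_hot("ACGT"))
    print("total loss after training:", total_loss(trained, toy))
    print("phenotype:", phenotype, "conservation:", conservation)
```

In this sketch the two outputs share one intermediate layer, so the gradient of the conservation loss also shapes the features used for the phenotype output. A weighted sum of the two losses, or separate layers per output as in claims 4 and 15, would be equally plausible variations.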

Description

CROSS-REFERENCE

This application is a continuation-in-part claiming priority to PCT International Application PCT/CA2016/050689, filed Jun. 15, 2016, and U.S. Provisional Application No. 62/433,664, filed Dec. 13, 2016, each of which is entirely incorporated herein by reference.

BACKGROUND

Precision medicine, genetic testing, therapeutic development, drug target identification, patient stratification, health risk assessment and connecting patients with rare disorders can benefit from accurate information about how biological sequence variants are different or are similar in their molecular phenotypes. Biological sequence variants, also called variants, impact function by altering molecular phenotypes, which are aspects of biological molecules that participate in biochemical processes and in the development and maintenance of human cells, tissues, and organs. In the context of medicine and the identification and understanding of genetic variants that cause disease, exonic variants that change amino acids or introduce stop codons have traditionally been the primary focus. Yet, since variants may act by altering regulatory processes and changing a variety of molecular phenotypes, techniques that focus on relating genetic variants to changes in molecular phenotypes are valuable. Over the past decade, this has led to molecular phenotype-centric approaches that go beyond traditional exon-centric approaches. This change in approach is underscored by several observations: while evolution is estimated to preserve at least 5.5% of the human genome, exons account for only 1%; biological complexity often cannot be accounted for by the number of genes (e.g., balsam poplar trees have twice as many genes as humans); differences between organisms cannot be accounted for by differences between their genes (e.g., less than 1% of human genes are distinct from those of mice and dogs); and, increasingly, disease-causing variants have been found outside of exons.
Analyzing how variants impact molecular phenotypes is challenging. In traditional molecular diagnostics, an example workflow may be as follows: a blood or tissue sample is obtained from a patient; variants (mutations) are identified, such as by sequencing the genome, sequencing the exome, running a gene panel, or applying a microarray; the variants are manually examined for their potential impact on molecular phenotype (e.g., by a technician), using literature databases and internet search engines; and a diagnostic report is prepared. Manually examining the variants may be costly and prone to human error, which may lead to incorrect diagnosis and potential patient morbidity. Similar issues may arise in therapeutic design, where there is uncertainty about the potential targets and their molecular phenotype mechanisms. Insurance may be increasingly reliant on variant interpretation to identify disease markers and drug efficacy. Since the number of possible variants may be extremely large, evaluating them manually may be time-consuming, highly dependent on previous literature, and involve experimental data that has poor coverage and therefore can lead to high false negative rates, or “variants of uncertain significance.” Automating or semi-automating the analysis of variants and environmental contexts and their impact on molecular phenotypes and disease phenotypes is thus beneficial.

SUMMARY

As recognized herein, a key unmet need in precision medicine is the ability to automatically or semi-automatically analyze biological sequence variants by examining their impact on molecular phenotypes, such as, for example, determining associations between genetic variants and gross phenotypes, such as disease, using the molecular phenotypes induced by the genetic variants.
To do this, it may be beneficial to develop methods and systems that can ascertain how genetic variants impact molecular phenotypes, which are intermediate biochemical attributes within cells that impact gross phenotype. The present disclosure provides methods and systems that may advantageously use additional conservation information to increase performance and accuracy. In one aspect, a system for linking two or more biologically related variants derived from biological sequences is provided, the system comprising: one or more molecular phenotype neural networks (MPNNs), each MPNN comprising: an input layer configured to obtain one or more values digitally representing a variant in the two or more biologically related variants; one or more feature detectors, each configured to obtain input from at least one of: (i) one or more of the values in the input layer and (ii) an output of a previous feature detector; and an output layer comprising values representing a molecular phenotype for the variant, comprising one or more numerical elements obtained from one or more of the feature detectors; and a comparator linked to the output layer of each of the one or more MPNNs, the comparator configured to compare the molecula