
CN-122003714-A - Prediction of mRNA properties using large language transformer models

CN122003714A

Abstract

Methods, computer systems, and apparatus, including computer programs encoded on computer storage media, for predicting mRNA characteristics. The system obtains data representing a sequence of codons of an mRNA molecule, generates an input token vector by numerically encoding the sequence of codons, and generates an embedded feature vector by processing the input token vector using an embedded machine learning model having a first set of model parameters.

Inventors

  • S. Jager
  • Z. Bar-Joseph
  • S. Li
  • S. Noruzzad
  • S. Moayedpour

Assignees

  • Sanofi

Dates

Publication Date
2026-05-08
Application Date
2024-07-26
Priority Date
2024-05-16

Claims (20)

  1. A computer-implemented method for predicting a characteristic of an mRNA molecule, the method comprising: obtaining data representing an mRNA molecule comprising (i) a 5' untranslated region (UTR), (ii) a coding sequence (CDS), and (iii) a 3' UTR; generating a first input token vector by numerically encoding a nucleotide sequence of the 5' UTR of the mRNA molecule, generating a second input token vector by numerically encoding a codon sequence of the CDS of the mRNA molecule, and generating a third input token vector by numerically encoding a nucleotide sequence of the 3' UTR of the mRNA molecule; generating a first embedded feature vector by processing the first input token vector using a first embedded machine learning model, generating a second embedded feature vector by processing the second input token vector using a second embedded machine learning model, and generating a third embedded feature vector by processing the third input token vector using a third embedded machine learning model, wherein the first embedded machine learning model, the second embedded machine learning model, and the third embedded machine learning model have been trained using a first training process over a set of training mRNA sequences; generating a joint embedding by combining the first embedded feature vector, the second embedded feature vector, and the third embedded feature vector; and processing the joint embedding using a property-predicting machine learning model to generate an output that predicts one or more properties of the mRNA molecule, wherein the property-predicting machine learning model has been trained on a set of labeled training examples using a second training process.
  2. The method of claim 1, wherein: generating the first input token vector comprises mapping each nucleotide of the nucleotide sequence of the 5' UTR to a respective value and generating the first input token vector by concatenating the values; generating the second input token vector comprises mapping each codon of the sequence of codons of the CDS to a respective value and generating the second input token vector by concatenating the values; and generating the third input token vector comprises mapping each nucleotide of the nucleotide sequence of the 3' UTR to a respective value and generating the third input token vector by concatenating the values.
  3. The method of any preceding claim, wherein generating the joint embedding comprises: performing a first pooling operation on the first embedded feature vector to generate a first embedding; performing a second pooling operation on the second embedded feature vector to generate a second embedding; performing a third pooling operation on the third embedded feature vector to generate a third embedding; and concatenating the first embedding, the second embedding, and the third embedding to generate the joint embedding.
  4. The method of claim 3, wherein each of the first pooling operation, the second pooling operation, and the third pooling operation is an average pooling operation.
  5. The method of any preceding claim, wherein the first training process comprises: initializing values of parameters of a first machine learning model that includes the first embedded machine learning model, the second embedded machine learning model, and the third embedded machine learning model; and training the first machine learning model by minimizing a pre-training loss function that includes one or more pre-training losses defined for one or more pre-training tasks.
  6. The method of claim 5, wherein the one or more pre-training tasks comprise a masked language model (MLM) learning task for predicting one or more masked codons or nucleotides within a known mRNA molecule.
  7. The method of claim 6, wherein the one or more pre-training losses comprise an MLM loss function defined as $\mathcal{L}_{\mathrm{MLM}} = -\sum_{i \in M} \log p\left(x_i \mid x_{\setminus M}\right)$, wherein $x$ represents a sequence of tokens, $M$ represents the set of masked positions, and $p\left(x_i \mid x_{\setminus M}\right)$ represents the probability, predicted by the first machine learning model given the unmasked part $x_{\setminus M}$ of the input sequence, that token $x_i$ is present at masked position $i$.
  8. The method of any one of claims 5 to 7, wherein the one or more pre-training tasks comprise a homologous sequence prediction (HSP) task for predicting whether two training mRNA sequences belong to organisms in the same homology class.
  9. The method of claim 8, wherein the one or more pre-training losses comprise an HSP loss function defined as $\mathcal{L}_{\mathrm{HSP}} = -\left[y \log \hat{y} + (1 - y) \log(1 - \hat{y})\right]$, wherein $y$ is a ground-truth label indicating whether two input token sequences represent mRNA codon sequences belonging to the same homology class, and $\hat{y}$ represents the predicted probability that the two input token sequences represent mRNA codon sequences belonging to the same homology class.
  10. The method of any one of claims 5 to 9, wherein the pre-training loss function combines the MLM loss and the HSP loss.
  11. The method of any preceding claim, wherein the second training process comprises: initializing values of parameters of a second machine learning model that includes the property-predicting machine learning model; and training the second machine learning model by minimizing a downstream loss function that includes one or more prediction losses defined for one or more property prediction tasks.
  12. The method of any one of claims 5 to 10, wherein the pre-training loss function or the downstream loss function further comprises a contrastive loss that aims to maximize similarity between embeddings of different regions within the same mRNA sequence, while minimizing similarity between embeddings of regions from different mRNA sequences.
  13. The method of claim 12, wherein the contrastive loss comprises a first contrastive loss that aims to maximize similarity between embeddings of the 5' UTR and the CDS within the same mRNA sequence, while minimizing similarity between embeddings of the 5' UTR and the CDS from two different mRNA sequences.
  14. The method of claim 13, wherein the first contrastive loss is calculated as $\mathcal{L}_{1} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(u_i, v_i)/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(u_i, v_j)/\tau\right)}$, wherein $N$ is the batch size of a batch of training samples, $u_i$ and $v_i$ are the normalized embeddings generated for the 5' UTR and the CDS respectively, $\mathrm{sim}(\cdot,\cdot)$ is cosine similarity, and $\tau$ is a temperature parameter.
  15. The method of claim 12 or 13, wherein the contrastive loss comprises a second contrastive loss that aims to maximize similarity between embeddings of the 3' UTR and the CDS within the same mRNA sequence, while minimizing similarity between embeddings of the 3' UTR and the CDS from two different mRNA sequences.
  16. The method of claim 15, wherein the contrastive loss is calculated as a combined contrastive loss that combines the first contrastive loss and the second contrastive loss.
  17. The method of claim 16, wherein the combined contrastive loss is calculated as an average of the first contrastive loss and the second contrastive loss.
  18. The method of any preceding claim, wherein the one or more characteristics of the mRNA molecule include the level of expression of the mRNA molecule in a particular type of cell or tissue.
  19. The method of claim 18, wherein the mRNA molecule is a component of a vaccine and encodes one or more antigenic proteins of a target pathogen, and wherein the predicted characteristic of the mRNA molecule characterizes the expression level of the antigenic proteins of the target pathogen in the particular type of cell or tissue.
  20. The method of any preceding claim, wherein the one or more characteristics of the mRNA molecule comprise stability under one or more environmental conditions.
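
The exact loss formulas in the published claims were damaged in extraction; as a minimal numpy sketch of the standard formulations the claim text paraphrases (masked-language-model cross-entropy in claim 7, binary cross-entropy for homologous sequence prediction in claim 9, and an InfoNCE-style contrastive loss in claim 14), the three losses can be written as follows. All function names and shapes are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def mlm_loss(probs_at_masked: np.ndarray) -> float:
    """Claim 7 sketch: negative log-likelihood of the true tokens at the
    masked positions, given model-predicted probabilities for those tokens."""
    return float(-np.log(probs_at_masked).sum())

def hsp_loss(y_true: float, y_pred: float) -> float:
    """Claim 9 sketch: binary cross-entropy on the predicted probability
    that two sequences belong to the same homology class."""
    return float(-(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

def contrastive_loss(u: np.ndarray, v: np.ndarray, tau: float = 0.1) -> float:
    """Claim 14 sketch: InfoNCE over a batch of N paired embeddings,
    u (5' UTR) and v (CDS); matching pairs sit on the diagonal."""
    u = u / np.linalg.norm(u, axis=1, keepdims=True)   # normalize rows
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    sim = u @ v.T / tau                                # cosine similarity / temperature
    log_softmax = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))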

Description

Prediction of mRNA properties using large language transducer models Cross Reference to Related Applications The present application claims priority from U.S. provisional patent application No. 63/516,226 filed on day 28 of 7 in 2023, U.S. provisional patent application No. 63/648,338 filed on day 16 of 5 in 2024, european patent application EP24305758.5 filed on day 16 in 2024, and european patent application EP24306271.8 filed on day 26 in 2024, the disclosures of all of which are hereby incorporated by reference in their entirety. Technical Field The present description relates generally to predicting characteristics of mRNA molecules using machine learning models (such as large language transducer models). Background MRNA or messenger RNA is an RNA molecule that plays a key role in gene expression and protein synthesis. The main function of mRNA is to carry genetic instructions from DNA to the ribosome where proteins are synthesized. mRNA is typically single stranded and can be hundreds to thousands of nucleotides in length. Full-length mRNA sequences include the 5 'untranslated region (UTR), coding sequence (CDS), and 3' UTR. The 5' UTR is the non-coding sequence at the beginning of the mRNA molecule. The 3' UTR is a non-coding sequence located at the end of an mRNA molecule. CDS consists of a sequence of codons, each of which consists of three nucleotides specifying a particular amino acid or start or stop signal during protein synthesis. The codon sequence determines the order of amino acid assembly during translation. Although the 5 'and 3' UTRs are not translated, they can play an important role in mRNA stability, localization and translation regulation. A machine learning model is a computational model that learns patterns and relationships in data and then uses that knowledge to represent data in different spaces and make predictions or decisions on new data. 
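
Since the CDS is read in steps of three nucleotides, codon tokenization can be sketched as mapping each nucleotide triplet to an integer token id. The vocabulary layout and the reserved id 0 below are illustrative assumptions, not details from the patent.

```python
from itertools import product

# 64 possible codons -> ids 1..64; 0 is reserved for padding/unknown.
CODON_VOCAB = {"".join(c): i + 1 for i, c in enumerate(product("ACGU", repeat=3))}

def encode_cds(cds: str) -> list[int]:
    """Numerically encode a CDS as a vector of codon token ids."""
    if len(cds) % 3 != 0:
        raise ValueError("CDS length must be a multiple of 3")
    return [CODON_VOCAB[cds[i:i + 3]] for i in range(0, len(cds), 3)]

print(encode_cds("AUGGCCUAA"))  # start codon, alanine, stop codon -> [15, 38, 49]
```

The 5' and 3' UTRs would be tokenized analogously, but per nucleotide rather than per codon.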
Neural networks are machine learning models that employ one or more layers of nonlinear units to predict an output for a received input. The output of each hidden layer serves as an input to the next layer in the network (i.e., the next hidden layer or the output layer). Each layer of the network generates an output from the received input based on the current values of a respective set of parameters.

Disclosure of Invention

The present disclosure describes methods, computer systems, and apparatus, including computer programs encoded on computer storage media, for predicting a characteristic of an mRNA molecule. In one aspect, the present disclosure provides a predictive method for predicting one or more characteristics of an mRNA molecule. The method may be implemented by a system comprising one or more computers. Typically, the system generates a token representation by numerically encoding a codon sequence of an mRNA sequence, uses unsupervised learning with an embedded machine learning model (e.g., a large language model) to generate embedded features of the mRNA sequence, and further uses supervised learning to predict mRNA characteristics for downstream tasks. By pre-training a large language model, the system enables the model to generate high-performance embeddings that capture meaningful representations, codon interactions, and sequence-level patterns that are essential for understanding and predicting various mRNA properties in downstream tasks. Downstream tasks may include, for example, (1) predicting mRNA expression, (2) analyzing mRNA stability, and (3) predicting mRNA degradation. The two-step process, comprising pre-training and fine-tuning on downstream tasks, allows the generation of high-quality predictions of mRNA characteristics from limited labeled data.
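
The embed-pool-concatenate-predict pipeline described above and in claims 1, 3, and 4 can be sketched in a few lines of numpy. The lookup table standing in for the pre-trained embedding model, the dimensions, and the linear prediction head are all illustrative assumptions; a real system would use a transformer for the embedding step and a fine-tuned model for the head.

```python
import numpy as np

rng = np.random.default_rng(0)
EMB_DIM = 8
VOCAB = 70  # e.g. 64 codons + 4 nucleotides + special tokens (assumed layout)

# Stand-in for a pre-trained embedding model: per-token feature vectors.
embedding_table = rng.normal(size=(VOCAB, EMB_DIM))

def embed_and_pool(token_ids: list[int]) -> np.ndarray:
    """Per-token embeddings followed by average pooling (claims 3-4)."""
    return embedding_table[np.array(token_ids)].mean(axis=0)

def joint_embedding(utr5, cds, utr3) -> np.ndarray:
    """Concatenate the pooled region embeddings into one joint vector (claim 1)."""
    return np.concatenate(
        [embed_and_pool(utr5), embed_and_pool(cds), embed_and_pool(utr3)]
    )

# Property-prediction head: a single linear layer standing in for the
# fine-tuned downstream model.
w = rng.normal(size=3 * EMB_DIM)

def predict_property(utr5, cds, utr3) -> float:
    return float(joint_embedding(utr5, cds, utr3) @ w)

z = joint_embedding([1, 2, 3], [10, 11], [4])
print(z.shape)  # (24,)
```

Average pooling makes the joint embedding length independent of the region lengths, which is what lets a fixed-size prediction head consume variable-length mRNA sequences.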
To perform the predictive method, the system obtains data representing a codon sequence of an mRNA molecule, generates an input token vector by numerically encoding the codon sequence, and generates an embedded feature vector by processing the input token vector using an embedded machine learning model having a first set of model parameters. The first set of model parameters has been updated using a first training process applied to a first machine learning model that includes the embedded machine learning model. The first training process is performed on a dataset specifying known subsequences of mRNA molecules, and the first machine learning model is configured to perform one or more pre-training tasks. The system processes the embedded feature vector using a property prediction machine learning model to generate an output that predicts one or more properties of the mRNA molecule. The property prediction machine learning model has a second set of model parameters that have been updated, based on a plurality of training examples, using a second training process applied to a second machine learning model that includes the property prediction machine learning model. Each respective training sample includes (i) a respective training input specifying a representation of a respectiv