CN-122029604-A - Machine learning implemented estimation of differences between protein sequences

Abstract

The present disclosure provides a method that may include generating a training dataset comprising a pair of sample molecules. The pair of sample molecules includes a first molecule and a second molecule. A property calculation model is trained to determine a sequence difference comprising the difference between the amino acid sequence of the first molecule and the amino acid sequence of the second molecule. The property calculation model is further trained to generate a relative embedding that represents the sequence difference and associates it with the pair of sample molecules. The property calculation model is further trained to determine, based on the relative embedding, a property difference corresponding to the difference between the value of a property exhibited by the first molecule and the value of the property exhibited by the second molecule. The trained property calculation model may be applied to determine the property difference for a pair of input molecules.
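
As an illustration of the training dataset described above, the following minimal Python sketch pairs molecules that have known property values and labels each ordered pair with the difference between those values. The function name `make_pair_dataset`, the toy sequences and values, and the sign convention (first minus second) are assumptions made for illustration; they are not taken from the disclosure.

```python
from itertools import permutations


def make_pair_dataset(measured):
    """Build ordered pairs of molecules labeled with a property difference.

    `measured` maps an amino acid sequence to a known property value.
    Assumed convention: label = value(first) - value(second).
    """
    pairs = []
    for (seq_a, val_a), (seq_b, val_b) in permutations(measured.items(), 2):
        pairs.append((seq_a, seq_b, val_a - val_b))
    return pairs


# Hypothetical toy data: sequence -> measured property value.
measured = {"ACDK": 1.0, "ACDR": 1.5, "GCDK": 0.25}
dataset = make_pair_dataset(measured)

for seq_a, seq_b, delta in dataset:
    print(seq_a, seq_b, delta)
```

Because every molecule is paired with every other molecule, a set of n measured molecules yields n(n-1) ordered training pairs, so the same molecule appears in several pairs.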

Inventors

A. P. Liffel Law
Frey, N.C.
V. Grigoryevich
E. LI
LIANG WEIQING
LIN YAOYU
S. Vaslaki
P. H.O. Pinheiro
N. Tagasovska
A. M. Watkins
K. Chu

Assignees

Genentech, Inc.
F. Hoffmann-La Roche AG

Dates

Publication Date: 2026-05-12
Application Date: 2024-10-04
Priority Date: 2023-10-06

Claims (20)

1. A computer-implemented method, comprising: generating a training dataset comprising a pair of sample molecules, the pair of sample molecules comprising a first molecule and a second molecule; training a property calculation model based at least on the training dataset, wherein the property calculation model is trained to at least: determine a sequence difference between the first molecule and the second molecule, the sequence difference comprising a difference between an amino acid sequence of the first molecule and an amino acid sequence of the second molecule, generate a relative embedding representing the sequence difference and associating the sequence difference with the pair of sample molecules, and determine a property difference for the pair of sample molecules based at least on the relative embedding of the pair of sample molecules, wherein the property difference for the pair of sample molecules corresponds to a difference between a value of a property exhibited by the first molecule and a value of the property exhibited by the second molecule; receiving a pair of input molecules; and applying the property calculation model to determine the property difference for the pair of input molecules.
2. The method of claim 1, wherein the property comprises expression, affinity, specificity, developability, or biological activity.
3. The method according to any one of claims 1 to 2, wherein the pair of sample molecules is generated by at least pairing sample molecules from a set of sample molecules having known values for the property.
4. The method according to any one of claims 1 to 3, wherein generating the training dataset comprises: pairing a first sample molecule and a second sample molecule from a set of sample molecules to generate the pair of sample molecules, and pairing the first sample molecule with a third sample molecule from the set of sample molecules to generate another pair of sample molecules.
5. The method of any one of claims 1 to 4, wherein the property calculation model comprises a language model, and wherein training the property calculation model comprises training the language model to generate the relative embedding of the pair of sample molecules.
6. The method of claim 5, wherein the language model is trained to at least: generate, for the pair of sample molecules, a first embedding for the first sample molecule and a second embedding for the second sample molecule, and generate the relative embedding of each pair of sample molecules by determining at least a difference between the first embedding and the second embedding.
7. The method of claim 6, wherein the difference between the first embedding and the second embedding corresponds to the sequence difference comprising the difference between the amino acid sequence of the first molecule and the amino acid sequence of the second molecule.
8. The method of any one of claims 5 to 7, wherein the property calculation model further comprises a neural network coupled to the language model, and wherein training the property calculation model comprises training the neural network to determine the property difference for each pair of sample molecules based at least on the relative embedding of each pair of sample molecules.
9. The method of any one of claims 1 to 8, further comprising: identifying one or more mutations associated with improvement of the property based at least on the property difference of the pair of input molecules.
10. The method of claim 9, wherein the one or more mutations comprise a point mutation, wherein the pair of input molecules differ at a single position in each corresponding amino acid sequence.
11. The method of any one of claims 9 to 10, wherein the one or more mutations comprise a combination of a plurality of point mutations in which the pair of input molecules differ at a plurality of positions in each corresponding amino acid sequence.
12. The method of any one of claims 1 to 11, wherein each sample molecule comprises an antibody or a portion of the antibody.
13. The method of any one of claims 1 to 12, wherein each sample molecule comprises a variable region, an antigen binding region, a heavy chain and/or a light chain of an antibody.
14. A system, comprising: at least one data processor; and at least one memory storing instructions that, when executed by the at least one data processor, cause operations comprising the method according to any one of claims 1 to 13.
15. A non-transitory computer-readable medium storing instructions which, when executed by at least one data processor, cause operations comprising the method of any one of claims 1 to 13.
16. A method of identifying a mutation, comprising: applying a property calculation model to determine a difference between a value of a property exhibited by a first input molecule and a value of the property exhibited by a second input molecule; determining an amino acid sequence of the first input molecule and an amino acid sequence of the second input molecule; identifying a first mutation by comparing the amino acid sequence of the first input molecule with the amino acid sequence of the second input molecule; and identifying that the first mutation improves the value of the property based at least on the difference between the values of the property exhibited by the first input molecule and the second input molecule.
17. The method of claim 16, wherein the property comprises expression, affinity, specificity, developability, or biological activity.
18. The method of any one of claims 16 to 17, further comprising: applying the property calculation model to determine a difference between the values of the property exhibited by the first input molecule and a third input molecule; determining an amino acid sequence of the third input molecule; identifying a second mutation by comparing at least the amino acid sequence of the first input molecule with the amino acid sequence of the third input molecule; and identifying that the second mutation improves the value of the property based at least on the difference between the values of the property exhibited by the first input molecule and the third input molecule.
19. The method of claim 18, wherein the first mutation comprises one type of amino acid residue occupying a first position in the amino acid sequence of the first input molecule, and wherein the second mutation comprises the same type of amino acid residue occupying a second position in the amino acid sequence of the first input molecule.
20. The method of any one of claims 18 to 19, wherein the first mutation comprises a position in the amino acid sequence of the first input molecule that is occupied by an amino acid residue of a first type, and wherein the second mutation comprises the same position in the amino acid sequence of the first input molecule that is occupied by an amino acid residue of a second type.
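
The mutation identification steps recited in claims 16 to 20 can be sketched as follows. This is a minimal illustration only: it assumes equal-length, pre-aligned sequences, assumes that a higher property value is an improvement, and uses a hypothetical `toy_predict_delta` function as a stand-in for a trained property calculation model. None of these names or conventions come from the claims.

```python
def diff_positions(reference, variant):
    """List (1-based position, reference residue, variant residue) for each mismatch."""
    if len(reference) != len(variant):
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    return [
        (i + 1, r, v)
        for i, (r, v) in enumerate(zip(reference, variant))
        if r != v
    ]


def improving_mutations(reference, variants, predict_delta):
    """Return (mutations, delta) for variants predicted to improve on the reference.

    Assumed convention: predict_delta(variant, reference) estimates
    property(variant) - property(reference), and a positive value is better.
    """
    hits = []
    for variant in variants:
        delta = predict_delta(variant, reference)
        if delta > 0:
            hits.append((diff_positions(reference, variant), delta))
    # Largest predicted improvement first; ties keep their input order.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


def toy_predict_delta(seq_a, seq_b):
    """Hypothetical stand-in for a trained model: each lysine (K) adds one unit."""
    return float(seq_a.count("K") - seq_b.count("K"))


hits = improving_mutations(
    "ACDA", ["KCDA", "ACDK", "ACDG", "KCDK"], toy_predict_delta
)
for mutations, delta in hits:
    print(mutations, delta)
```

In this toy run the same residue type (K) is beneficial at two different positions, which corresponds to the situation recited in claim 19, and the double mutant ranks first as a combination of point mutations.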

Description

Machine learning implemented estimation of differences between protein sequences

Cross Reference to Related Applications

The present application claims priority to U.S. Provisional Application No. 63/588,671, entitled "MACHINE LEARNING ENABLED ESTIMATION OF DIFFERENCES BETWEEN PROTEIN SEQUENCES" and filed on October 6, 2023, the disclosure of which is incorporated herein by reference in its entirety.

Technical Field

The subject matter described herein relates generally to molecular design, and more particularly to a computational model for determining differences in properties exhibited by two different protein molecules and to techniques for enhancing molecular properties using the computational model.

Background

A molecule is a group of two or more atoms that are joined together by chemical bonds. Molecules form the smallest identifiable unit into which a pure substance can be divided while still retaining the composition and chemical properties of the substance. Various properties of a molecule, including its ability to function as a therapeutic agent, may depend on the composition and three-dimensional structure of the molecule. Macromolecules (also known as biopharmaceuticals, biologics, or biologicals) are molecules having molecular weights ranging between about 3,000 daltons and 150,000 daltons. Macromolecular drugs are often derivatives of natural human proteins that regulate many important cellular functions, such as enzymatic reactions, molecular trafficking, regulation and execution of many biological pathways, cell growth, proliferation, nutrient uptake, morphology, movement, and intercellular communication. Examples of therapeutic proteins include antibodies, chimeric antigen receptors (CARs), enzymes, hormones, and cytokines. A single macromolecule may typically have more than 1,300 amino acid residues joined by peptide bonds to form one or more polypeptides.
Because of their size and complexity, macromolecular drugs are produced recombinantly by engineered cells rather than chemically synthesized, as most small molecule drugs are. Furthermore, macromolecular therapeutic agents are often delivered by injection or infusion, as oral administration is ineffective. Development of macromolecular drugs may require designing one or more sequences of amino acid residues that bind a target (e.g., a protein or a nucleic acid) with sufficient specificity and that are free of undesirable traits such as immunogenicity, self-association, and instability.

Disclosure of Invention

Systems, methods, and articles of manufacture (including computer program products) for machine learning-implemented estimation of differences between protein molecules are provided. In one aspect, a system is provided that includes at least one data processor and at least one memory. The at least one memory may store instructions that, when executed by the at least one data processor, cause operations.
The operations may include: generating a training dataset comprising a pair of sample molecules, the pair of sample molecules comprising a first molecule and a second molecule; training a property calculation model based at least on the training dataset, wherein the property calculation model is trained to at least (i) determine a sequence difference between the first molecule and the second molecule, the sequence difference comprising a difference between an amino acid sequence of the first molecule and an amino acid sequence of the second molecule, (ii) generate a relative embedding representing the sequence difference and associating the sequence difference with the pair of sample molecules, and (iii) determine a property difference for the pair of sample molecules based at least on the relative embedding of the pair of sample molecules, wherein the property difference for the pair of sample molecules corresponds to a difference between a value of a property exhibited by the first molecule and a value of the property exhibited by the second molecule; receiving a pair of input molecules; and applying the property calculation model to determine the property difference for the pair of input molecules. In another aspect, a computer-implemented method for machine learning-implemented estimation of differences between protein molecules is provided.
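
The operations described above can be illustrated with a deliberately simplified Python sketch. It is not the disclosed model: a flattened per-position one-hot encoding stands in for a language-model embedding, a bias-free linear head trained by stochastic gradient descent stands in for the neural network, and all function names, toy sequences, and values are hypothetical. The only element taken from the text is that the relative embedding is formed as a difference between the two molecules' embeddings and is then mapped to a property difference.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(seq):
    """Toy stand-in for a language-model embedding: flattened one-hot per position."""
    vec = [0.0] * (len(seq) * len(AMINO_ACIDS))
    for pos, res in enumerate(seq):
        vec[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(res)] = 1.0
    return vec


def relative_embedding(seq_a, seq_b):
    """Difference of the two embeddings; nonzero only where the sequences differ."""
    return [a - b for a, b in zip(embed(seq_a), embed(seq_b))]


def predict_delta(weights, seq_a, seq_b):
    """Linear head: maps the relative embedding to a predicted property difference."""
    rel = relative_embedding(seq_a, seq_b)
    return sum(w * x for w, x in zip(weights, rel))


def train_head(pairs, epochs=200, lr=0.1):
    """Fit the linear head with plain SGD on squared error over the labeled pairs."""
    dim = len(relative_embedding(pairs[0][0], pairs[0][1]))
    weights = [0.0] * dim
    for _ in range(epochs):
        for seq_a, seq_b, target in pairs:
            rel = relative_embedding(seq_a, seq_b)
            err = sum(w * x for w, x in zip(weights, rel)) - target
            weights = [w - lr * err * x for w, x in zip(weights, rel)]
    return weights


# Hypothetical training pairs: (first sequence, second sequence, value(first) - value(second)).
pairs = [
    ("ACDK", "ACDR", -0.5),
    ("ACDR", "ACDK", 0.5),
    ("GCDK", "ACDK", -0.75),
    ("ACDK", "GCDK", 0.75),
]
weights = train_head(pairs)

# Input pair never seen together during training.
print(predict_delta(weights, "GCDK", "ACDR"))
```

Because the head has no bias term and acts on a difference of embeddings, swapping the two input molecules flips the sign of the prediction, which matches the intuition that the property difference of a pair is antisymmetric. A nonlinear network would not have this property automatically.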
The method may include generating a training data set including a pair of sample molecules, the pair of sample molecules including a first molecule and a second molecule, training a property calculation model based at least on the training data set, wherein the property calculation model is trained to determine at least a sequence difference between the first molecule and the second molecule, the sequence difference including a difference between an amino acid sequence of the first molecule and an amino acid sequence of the second molecule, generating a relative embedding representing and associating the sequence difference with the pair of sample molecules, and determining a property difference for