
CN-122003715-A - Method and device for discovering enzymes

CN 122003715 A

Abstract

Methods and apparatus for discovering enzymes are disclosed. The enzyme discovery method is performed by a computing device comprising a processor and a memory. The method may include: loading, by the processor, into the memory a first encoder trained on functional characteristics of an enzyme; loading, by the processor, into the memory a second encoder trained on structural characteristics of a compound; constructing, by the processor, a combining network that combines the first encoder and the second encoder through transfer learning; inputting, by the processor, amino acid sequences of candidate enzymes and graph data of a target compound into the first encoder and the second encoder, calculating reaction probabilities of reactions between the candidate enzymes and the target compound through the combining network, and storing the reaction probabilities in the memory; and selecting, by the processor, from the candidate enzymes according to the calculated reaction probabilities a final candidate enzyme for performing reactions on a biosynthetic pathway for producing the target compound, and storing in the memory biosynthetic pathway data that optimizes the biosynthetic pathway using the final candidate enzyme.

Inventors

  • Xu Chengyun
  • Lv Runjiu
  • Xu Xianning
  • Li Guihuang
  • Zhao Chengyin

Assignees

  • LG Chem, Ltd.

Dates

Publication Date
2026-05-08
Application Date
2025-07-17
Priority Date
2024-07-18

Claims (20)

  1. A method for searching for enzymes, performed by a computing device comprising a processor and a memory, the method comprising: loading, by the processor, into the memory a first encoder trained on functional features of an enzyme; loading, by the processor, into the memory a second encoder trained on structural features of a compound; constructing, by the processor, a combining network that combines the first encoder and the second encoder through transfer learning; inputting, by the processor, an amino acid sequence of a candidate enzyme and graph data of a target compound into the first encoder and the second encoder, calculating, by the processor, through the combining network a reaction probability that the candidate enzyme and the target compound will react, and storing the calculated reaction probability in the memory; and selecting, by the processor, from among candidate enzymes, based on the calculated reaction probability, a final candidate enzyme for performing a reaction on a biosynthetic pathway for producing the target compound, and storing, by the processor, in the memory biosynthetic pathway data that optimizes the biosynthetic pathway using the final candidate enzyme.
  2. The method of claim 1, wherein constructing the combining network comprises combining, by the processor, a first feature extracted from the first encoder and a second feature extracted from the second encoder into a single vector through a fully connected layer, and wherein calculating the reaction probability and storing it in the memory comprises non-linearly transforming, by the processor, the vector by applying a non-linear activation function to the fully connected layer to calculate the reaction probability.
  3. The method of claim 1, wherein the first encoder learns reaction characteristics of the enzyme through a task of predicting an Enzyme Commission (EC) number from the amino acid sequence of the enzyme, and performs multi-class prediction over the multiple EC number classes using contrastive learning.
  4. The method of claim 3, wherein the first encoder is trained to generate two positive samples by data augmentation, to maximize similarity between the positive samples through a contrastive loss function, and to minimize similarity between negative samples.
  5. The method of claim 1, wherein the second encoder maps the graph of the target compound to a continuous numerical latent space using a generative model and extracts features of substructure units, larger than individual atoms, in a junction tree.
  6. The method of claim 5, wherein the second encoder further comprises a network that predicts the energy of a reactant based on the mapped values in the latent space.
  7. The method of claim 1, wherein the first encoder comprises an enzyme function prediction model that predicts enzyme function by predicting functional annotations of a protein, the method further comprising: acquiring, by the processor, training data related to the functional characteristics of the enzyme; introducing, by the processor, a plurality of base losses for a plurality of activation layers corresponding respectively to a plurality of levels; applying, by the processor, data augmentation to the enzyme function prediction model and introducing, by the processor, an auxiliary loss; defining, by the processor, a total loss function based on the plurality of base losses and the auxiliary loss; and training, by the processor, the enzyme function prediction model based on the training data and the total loss function.
  8. The method of claim 7, wherein the plurality of base losses includes a first base loss through an Mth base loss, where M is an integer greater than or equal to 2, and wherein defining the total loss function comprises defining the total loss function L according to Equation 1 below: (Equation 1) L = Σ_{k=1}^{M} w_k·L_k + w_a·L_a, wherein, in Equation 1, L_k is the kth base loss (k being an integer greater than or equal to 1 and less than or equal to M) among the plurality of base losses, L_a is the auxiliary loss, w_k is a predetermined weight of the kth base loss, and w_a is a predetermined weight of the auxiliary loss.
  9. The method of claim 8, wherein the auxiliary loss is defined according to Equation 2 below: (Equation 2) L_a = −log( exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k≠i]·exp(sim(z_i, z_k)/τ) ), wherein, in Equation 2, z_i and z_j are the embedding vectors forming a positive pair, sim(z_i, z_j) is the cosine similarity of z_i and z_j, τ is a temperature parameter, N is the batch size, and 1[k≠i] is an indicator that equals 1 when k ≠ i.
  10. The method of claim 8, wherein the auxiliary loss is defined according to Equation 3 below: (Equation 3) L_a = −Σ_l λ_l·log( exp(sim(z_i^(l), z_j^(l))/τ) / Σ_{k=1}^{2N} 1[k≠i]·exp(sim(z_i^(l), z_k^(l))/τ) ), wherein, in Equation 3, λ_l is a per-level weight term over the multiple levels, z_i and z_j are the embedding vectors forming a positive pair, sim(z_i, z_j) is the cosine similarity of z_i and z_j, τ is a temperature parameter, N is the batch size, and 1[k≠i] is an indicator that equals 1 when k ≠ i.
  11. An apparatus for searching for enzymes, comprising: one or more non-transitory computer-readable media storing instructions; and one or more processors that perform operations by executing the instructions, wherein the operations comprise: loading into the one or more media a first encoder trained on functional features of an enzyme; loading into the one or more media a second encoder trained on structural features of a compound; constructing a combining network that combines the first encoder and the second encoder through transfer learning; inputting an amino acid sequence of a candidate enzyme and graph data of a target compound into the first encoder and the second encoder, calculating through the combining network a reaction probability that the candidate enzyme and the target compound will react, and storing the calculated reaction probability in the one or more media; and selecting, from among candidate enzymes, based on the calculated reaction probability, a final candidate enzyme for performing a reaction on a biosynthetic pathway for producing the target compound, and storing in the one or more media biosynthetic pathway data for optimizing the biosynthetic pathway using the final candidate enzyme.
  12. The apparatus of claim 11, wherein constructing the combining network comprises combining a first feature extracted from the first encoder and a second feature extracted from the second encoder into a single vector through a fully connected layer, and wherein calculating the reaction probability comprises non-linearly transforming the vector by applying a non-linear activation function to the fully connected layer to predict the reaction probability.
  13. The apparatus of claim 11, wherein the first encoder learns reaction characteristics of the enzyme through a task of predicting an Enzyme Commission (EC) number from the amino acid sequence of the enzyme, and performs multi-class prediction over the multiple EC number classes using contrastive learning.
  14. The apparatus of claim 13, wherein the first encoder is trained to generate two positive samples by data augmentation, to maximize similarity between the positive samples through a contrastive loss function, and to minimize similarity between negative samples.
  15. The apparatus of claim 11, wherein the second encoder maps the graph of the target compound to a continuous numerical latent space using a generative model and extracts features of substructure units, larger than individual atoms, in a junction tree.
  16. The apparatus of claim 15, wherein the second encoder further comprises a network that predicts the energy of a reactant based on the mapped values in the latent space.
  17. The apparatus of claim 11, wherein the first encoder comprises an enzyme function prediction model that predicts enzyme function by predicting functional annotations of a protein, the operations further comprising: acquiring training data related to the functional features of the enzyme; introducing a plurality of base losses for a plurality of activation layers corresponding respectively to a plurality of levels; applying data augmentation to the enzyme function prediction model and introducing an auxiliary loss; defining a total loss function based on the plurality of base losses and the auxiliary loss; and training the enzyme function prediction model based on the training data and the total loss function.
  18. The apparatus of claim 17, wherein the plurality of base losses includes a first base loss through an Mth base loss, where M is an integer greater than or equal to 2, and wherein defining the total loss function comprises defining the total loss function L according to Equation 1 below: (Equation 1) L = Σ_{k=1}^{M} w_k·L_k + w_a·L_a, wherein, in Equation 1, L_k is the kth base loss (k being an integer greater than or equal to 1 and less than or equal to M) among the plurality of base losses, L_a is the auxiliary loss, w_k is a predetermined weight of the kth base loss, and w_a is a predetermined weight of the auxiliary loss.
  19. The apparatus of claim 18, wherein the auxiliary loss is defined according to Equation 2 below: (Equation 2) L_a = −log( exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1[k≠i]·exp(sim(z_i, z_k)/τ) ), wherein, in Equation 2, z_i and z_j are the embedding vectors forming a positive pair, sim(z_i, z_j) is the cosine similarity of z_i and z_j, τ is a temperature parameter, N is the batch size, and 1[k≠i] is an indicator that equals 1 when k ≠ i.
  20. The apparatus of claim 18, wherein the auxiliary loss is defined according to Equation 3 below: (Equation 3) L_a = −Σ_l λ_l·log( exp(sim(z_i^(l), z_j^(l))/τ) / Σ_{k=1}^{2N} 1[k≠i]·exp(sim(z_i^(l), z_k^(l))/τ) ), wherein, in Equation 3, λ_l is a per-level weight term over the multiple levels, z_i and z_j are the embedding vectors forming a positive pair, sim(z_i, z_j) is the cosine similarity of z_i and z_j, τ is a temperature parameter, N is the batch size, and 1[k≠i] is an indicator that equals 1 when k ≠ i.
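The fusion described in claims 1–2 and 11–12 — two encoder feature vectors concatenated through a fully connected layer and passed through a non-linear activation to yield a reaction probability — can be sketched as follows. This is a minimal, illustrative sketch, not the patented implementation: the two encoders are replaced by hypothetical hashing stand-ins, and the fully connected layer uses random, untrained weights (per the claims, these would instead be learned through transfer learning on enzyme–compound reaction data).

```python
import math
import random

random.seed(0)
DIM = 8  # assumed embedding width shared by both encoders

def encode_enzyme(sequence):
    """Stand-in for the first encoder: hashes an amino acid sequence
    into a fixed-size, L2-normalized feature vector (hypothetical)."""
    vec = [0.0] * DIM
    for i, aa in enumerate(sequence):
        vec[(ord(aa) + i) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def encode_compound(edges):
    """Stand-in for the second encoder: hashes a molecular graph,
    given as an edge list, into a fixed-size feature vector."""
    vec = [0.0] * DIM
    for u, v in edges:
        vec[(u * 31 + v) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

# Fully connected layer: one output unit over the concatenated features.
# Random weights here; the patent would learn them via transfer learning.
W = [random.gauss(0.0, 0.5) for _ in range(2 * DIM)]
B = 0.0

def reaction_probability(sequence, edges):
    z = encode_enzyme(sequence) + encode_compound(edges)  # concatenation
    logit = sum(w * x for w, x in zip(W, z)) + B
    return 1.0 / (1.0 + math.exp(-logit))  # sigmoid -> value in (0, 1)

p = reaction_probability("MKTAYIAKQR", [(0, 1), (1, 2), (2, 3)])
```

Candidate enzymes would then be ranked by this probability and the top-scoring one selected as the final candidate for the biosynthetic pathway.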

Description

Method and device for discovering enzymes

Technical Field

Cross Reference to Related Applications

The present application claims priority to and the benefit of Korean Patent Application No. 10-2024-0095089, filed with the Korean Intellectual Property Office on July 18, 2024, and Korean Patent Application No. 10-2024-0100992, filed with the Korean Intellectual Property Office on July 30, 2024, the entire contents of which are incorporated herein by reference. The present disclosure relates to a method and apparatus for searching for enzymes, and more particularly, to a method and apparatus that search for candidate enzymes and select a final candidate enzyme.

Background

As global interest in sustainability increases, research on methods for mitigating climate change by reducing greenhouse gases is active. Bio-based materials can play an important role in reducing greenhouse gas emissions by replacing conventional fossil-fuel-based materials. Polylactic acid (PLA), one example of a bio-based material, is a kind of bioplastic and may be produced by converting carbohydrates such as glucose or starch from renewable plant sources such as corn, sugarcane, or potato into lactic acid by microbial fermentation, and then synthesizing and polymerizing lactide. Bio-based 1,4-butanediol (Bio-BDO), another example of a bio-based material, is 1,4-butanediol (1,4-BDO) generated from renewable biomass through a microbial fermentation process, without depending on fossil fuel. Bio-BDO can be used in a variety of applications, for example, conversion to tetrahydrofuran (THF) by dehydration.
An industrial strain is a microbial strain selected and optimized for large-scale production of industrially useful products or materials; it can be critical to the production of bio-based materials and plays a key role in converting biomass into various high-value-added products. Conventional methods for developing industrial strains include mutation induction, gene recombination, metabolic engineering, and adaptive evolution. However, efforts are underway to maximize research efficiency through data-based prediction and optimization using rapidly evolving artificial intelligence techniques.

Disclosure of Invention

Technical Problem

One problem to be solved by the present disclosure is to provide a method and a device for searching for enzymes that can search for enzymes capable of reacting with a specified target compound using artificial intelligence techniques, based on the functional characteristics of the enzymes and the structural characteristics of the compound. Another problem to be solved by the present disclosure is to provide a method and apparatus for searching for enzymes that can predict the functions of enzymes, thereby improving prediction accuracy for functional annotations having a hierarchical structure and mitigating the data imbalance problem.
Technical Solution

A method for searching for an enzyme according to an embodiment of the present disclosure includes: loading, by a processor, into a memory a first encoder trained on functional characteristics of the enzyme; loading, by the processor, into the memory a second encoder trained on structural characteristics of a compound; constructing, by the processor, a combining network that combines the first encoder and the second encoder through transfer learning; inputting, by the processor, an amino acid sequence of a candidate enzyme and graph data of a target compound into the first encoder and the second encoder, calculating, by the processor, through the combining network reaction probabilities of the candidate enzyme and the target compound, and storing the calculated reaction probabilities in the memory; and selecting, by the processor, from among the candidate enzymes, based on the calculated reaction probabilities, a final candidate enzyme for performing a reaction on a biosynthetic pathway for producing the target compound, and storing, by the processor, in the memory biosynthetic pathway data that optimizes the biosynthetic pathway using the final candidate enzyme. Constructing the combining network may include combining, by the processor, the first feature extracted from the first encoder and the second feature extracted from the second encoder into a single vector through the fully connected layer, and calculating the reaction probability to store it in the memory may include non-linearly transforming, by the processor, the vector by applying a non-linear activation function to the fully connected layer to calculate the reaction probability. The first encoder may learn reaction characteristics of the enzyme through a task of predicting an Enzyme Commission (EC) number from the amino acid sequence of the enzyme, and may perform multi-
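The loss construction described for the enzyme function prediction model — a weighted sum of per-level base losses plus an auxiliary contrastive loss of the NT-Xent form — can be sketched in pure Python. This is a schematic reading of the claims, not the patent's code: the pairing convention (indices 2i and 2i+1 form the positive pair produced by data augmentation) and the default temperature value are assumptions.

```python
import math

def total_loss(base_losses, base_weights, aux_loss, aux_weight):
    """Equation 1 as read from the claims: L = sum_k w_k*L_k + w_a*L_a."""
    return sum(w * l for w, l in zip(base_weights, base_losses)) + aux_weight * aux_loss

def nt_xent(embeddings, tau=0.5):
    """NT-Xent-style contrastive loss as read from the claims.

    `embeddings` holds 2N vectors where indices (2i, 2i+1) are the
    positive pair produced from one sample by data augmentation
    (assumed layout); `tau` is the temperature parameter."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)

    n2 = len(embeddings)
    total = 0.0
    for i in range(n2):
        j = i ^ 1  # index of i's augmented positive partner
        num = math.exp(cos(embeddings[i], embeddings[j]) / tau)
        den = sum(math.exp(cos(embeddings[i], embeddings[k]) / tau)
                  for k in range(n2) if k != i)  # indicator 1[k != i]
        total -= math.log(num / den)
    return total / n2

# Aligned positive pairs should incur a lower loss than mismatched pairs,
# which is the behavior the contrastive training objective rewards.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mismatched = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
```

Minimizing `nt_xent` pulls the two augmented views of the same enzyme together in embedding space while pushing other batch members apart, which is what lets the encoder separate EC number classes without explicit per-pair labels.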