US-12620453-B2 - Method, apparatus, and computer program for predicting interaction of compound and protein


Abstract

Provided are a method, an apparatus, and a computer program for predicting interaction between a compound and a protein. A method for predicting interaction between a compound and a protein according to some embodiments of the present disclosure may comprise the steps of: acquiring learning data composed of compound data for learning, protein data for learning, and interaction scores; constructing a deep-learning model by using the acquired learning data; and predicting interaction of a given compound and protein through the constructed deep-learning model. Because the deep-learning model is trained after excluding, from the amino acid sequences of the proteins for learning, the amino acid sequences associated with protein domains that negatively influence interactions, the interaction between a given compound and protein in the in vivo environment can be accurately predicted.

Inventors

Jin Woo Choi
Yi Rang KIM

Assignees

ONCOCROSS CO., LTD.

Dates

Publication Date: 20260505
Application Date: 20201214
Priority Date: 20200102

Claims (12)

1. A method for predicting interaction between a compound and a protein in a computing device, the method comprising: acquiring training data composed of compound data for learning, protein data for learning, and interaction scores; constructing a deep-learning model using the acquired training data, the deep-learning model including a first neural network receiving a 2-D image generated from the protein data for learning, a second neural network receiving the compound data for learning, and a third neural network receiving a computation result of the second neural network and predicting an interaction score; predicting interaction between a given compound and protein, wherein the protein data for learning includes amino acid sequences of the protein for learning, and the constructing includes generating first training data by excluding amino acid sequences associated with a first protein domain having a negative influence on the interaction in vivo from the amino acid sequences of the protein for learning; and training the deep-learning model using the first training data, wherein training the deep-learning model includes extracting a plurality of n-gram sequences from amino acid sequences of the first training data, generating the 2-D image by mapping the plurality of n-gram sequences on a 2-D plane formed by two axes corresponding to an amino acid type and an amino acid sequence and setting pixel values of the 2-D image based on the number of n-gram sequences appearing in the amino acid sequences, and performing the training by entering the 2-D image into the first neural network, extracting local sequence patterns from the 2-D image that affect the interaction with a compound, and backpropagating errors based on the difference of the predicted interaction between the compound for learning and the protein for learning output through the third neural network from the interaction scores.
2. The method of claim 1, wherein the first protein domain includes a transmembrane domain.
3. The method of claim 1, wherein the constructing includes generating second training data consisting of the compound data for learning, amino acid sequence data associated with the first protein domain, and a first interaction score; and training the deep-learning model using the second training data, wherein the first interaction score is determined based on the extent to which the first protein domain negatively affects the interaction.
4. The method of claim 1, wherein the constructing includes generating second training data consisting of the compound data for learning, amino acid sequence data associated with a second protein domain, and a first interaction score; and training the deep-learning model using the second training data, wherein the first interaction score is determined based on the extent to which the second protein domain positively affects the interaction.
5. The method of claim 4, wherein the second protein domain includes an extracellular domain.
6. The method of claim 1, wherein the first neural network includes a Recurrent Neural Network (RNN) layer and a neural network layer, wherein the RNN layer receives an n-gram vector extracted from an amino acid sequence and outputs an embedding vector of the corresponding amino acid sequence, and the neural network layer receives the embedding vector and performs neural computations.
7. The method of claim 1, wherein the constructing includes: selecting, from the acquired training data, a first plurality of proteins for learning whose interaction score with a specific compound for learning is equal to or greater than a threshold and a second plurality of proteins for learning whose interaction scores are equal to or less than the threshold; extracting a first common sequence by comparing amino acid sequences of the first plurality of proteins for learning; extracting a second common sequence by comparing amino acid sequences of the second plurality of proteins for learning; training the deep-learning model using second training data consisting of the first common sequence, the specific compound data for learning, and a first interaction score; and training the deep-learning model using third training data consisting of the second common sequence, the specific compound data for learning, and a second interaction score, wherein the first interaction score is set to a value higher than an average interaction score of the first plurality of proteins for learning, and the second interaction score is set to a value lower than an average interaction score of the first plurality of proteins for learning.
8. The method of claim 7, wherein the extracting the first common sequence includes: extracting candidate common sequences by comparing amino acid sequences of the first plurality of proteins for learning; acquiring a predicted interaction score between the candidate common sequences and the specific compound data for learning through the deep-learning model; and selecting a sequence for which the predicted interaction score is equal to or greater than a threshold among the candidate common sequences.
9. The method of claim 7, wherein the training the deep-learning model using the second training data includes: determining a sample weight based on a length of the first common sequence and a frequency of appearance of the first common sequence in the first plurality of proteins for learning; and training the deep-learning model based on the determined sample weight.
10. The method of claim 1, wherein the constructing includes: selecting a first protein for learning whose interaction score with a specific compound for learning is equal to or greater than a threshold and a second protein for learning whose interaction score is equal to or less than the threshold by analyzing the acquired training data; extracting a non-common sequence by comparing the amino acid sequence of the first protein for learning and the amino acid sequence of the second protein for learning; acquiring a predicted interaction score between the non-common sequence and the specific compound for learning through the deep-learning model and determining an interaction score for learning based on the predicted interaction score; and training the deep-learning model using second training data consisting of the non-common sequence, the specific compound data for learning, and the determined interaction score.
11. An apparatus for predicting interaction between a compound and a protein, the apparatus comprising: a memory storing one or more instructions; and a processor configured to perform, by executing the stored one or more instructions, an operation of acquiring training data consisting of compound data for learning, protein data for learning, and interaction scores, an operation of constructing a deep-learning model using the acquired training data, the deep-learning model including a first neural network receiving a 2-D image generated from the protein data for learning, a second neural network receiving the compound data for learning, and a third neural network receiving a computation result of the second neural network and predicting an interaction score, and an operation of predicting interaction between a given compound and protein through the constructed deep-learning model, wherein the protein data for learning includes amino acid sequences of the protein for learning, the constructing operation includes an operation of generating first training data by excluding amino acid sequences associated with a first protein domain having a negative influence on the interaction in vivo from the amino acid sequences of the protein for learning; and an operation of training the deep-learning model using the first training data, wherein training the deep-learning model includes extracting a plurality of n-gram sequences from amino acid sequences of the first training data, generating the 2-D image by mapping the plurality of n-gram sequences on a 2-D plane formed by two axes corresponding to an amino acid type and an amino acid sequence, and performing the training by entering the 2-D image into the first neural network, extracting local sequence patterns from the 2-D image that affect the interaction with a compound, and backpropagating errors based on the difference of the predicted interaction between the compound for learning and the protein for learning output through the third neural network from the interaction scores.
12. A computer program stored in a computer-readable recording medium, the computer program, being combined with a computing device, comprising: acquiring training data consisting of compound data for learning, protein data for learning, and interaction scores; constructing a deep-learning model using the acquired training data, the deep-learning model including a first neural network receiving a 2-D image generated from the protein data for learning, a second neural network receiving the compound data for learning, and a third neural network receiving a computation result of the second neural network and predicting an interaction score; and predicting interaction between a given compound and protein through the constructed deep-learning model, wherein the protein data for learning includes amino acid sequences of the protein for learning, the constructing includes generating first training data by excluding amino acid sequences associated with a first protein domain having a negative influence on the interaction in vivo from the amino acid sequences of the protein for learning; and training the deep-learning model using the first training data, wherein training the deep-learning model includes extracting a plurality of n-gram sequences from amino acid sequences of the first training data, generating the 2-D image by mapping the plurality of n-gram sequences on a 2-D plane formed by two axes corresponding to an amino acid type and an amino acid sequence, and performing the training by entering the 2-D image into the first neural network, extracting local sequence patterns from the 2-D image that affect the interaction with a compound, and backpropagating errors based on the difference of the predicted interaction between the compound for learning and the protein for learning output through the third neural network from the interaction scores.
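
For illustration only, the n-gram extraction and 2-D image generation recited in the independent claims might be sketched as follows. The claims do not fix the exact layout of the image, so this sketch assumes one possible reading: rows correspond to the 20 standard amino acid types, columns correspond to positions along the amino acid sequence, and the pixel at each n-gram's start position holds the number of times that n-gram appears in the sequence. The names `AMINO_ACIDS`, `extract_ngrams`, and `ngram_image`, and the default `n=3`, are hypothetical and do not come from the disclosure.

```python
from collections import Counter

# Assumption: the 20 standard amino acids, in one-letter code, define the row axis.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def extract_ngrams(sequence, n=3):
    """Return every overlapping n-gram of the sequence, in order of position."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


def ngram_image(sequence, n=3):
    """Build a 2-D count image (list of rows) from a sequence.

    Rows: amino acid type of the first residue of each n-gram.
    Columns: start position of the n-gram in the sequence.
    Pixel value: how many times that n-gram occurs in the whole sequence.
    """
    ngrams = extract_ngrams(sequence, n)
    counts = Counter(ngrams)
    row_of = {aa: row for row, aa in enumerate(AMINO_ACIDS)}
    image = [[0] * len(ngrams) for _ in AMINO_ACIDS]
    for col, gram in enumerate(ngrams):
        row = row_of.get(gram[0])
        if row is not None:  # skip non-standard residue codes such as 'X'
            image[row][col] = counts[gram]
    return image


if __name__ == "__main__":
    for row in ngram_image("ACACA", n=2)[:2]:
        print(row)
```

An image built this way could then be passed to a convolutional first neural network, which is one common way to pick up local patterns in a 2-D input; the choice of network is likewise an assumption here.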

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a national phase application of PCT Application No. PCT/KR2020/018235, filed on Dec. 14, 2020, which claims priority to Korean Patent Application Nos. 10-2020-0000318, filed on Jan. 2, 2020 and 10-2020-0169587, filed on Dec. 7, 2020. The entire disclosure of the applications identified in this paragraph is incorporated herein by reference.

TECHNICAL FIELD

The present disclosure relates to a method, an apparatus, and a computer program for predicting disease. More specifically, the present disclosure relates to a method for predicting the presence or extent of interaction between a given compound and protein using a deep-learning model, an apparatus for performing the method, and a computer program in which the method is implemented.

BACKGROUND ART

By using computational methods and bioinformatics, researchers may find new uses of existing compounds or predict the uses of new compounds. This approach is widely used in the discovery of new drugs. The discovery and development of new drugs takes considerable time and money and involves a complex process. Accordingly, in recent years, research has been actively carried out to combine disciplines from various fields such as bioinformatics, cheminformatics, computer science, and computer-aided drug discovery/design (CADD) to reduce the time required for the discovery and development of new drugs and to enhance the effects of new drugs. However, because the prior art employs a rule-based approach, it cannot make predictions in situations that lie beyond human recognition and for which no rule can be defined.

SUMMARY

The technical object to be achieved through some embodiments of the present disclosure is to provide a method for accurately predicting the presence or extent of interaction between a given compound and protein using a deep-learning model, an apparatus for performing the method, and a computer program in which the method is implemented.

Another technical object to be achieved through some embodiments of the present disclosure is to provide a method for accurately predicting the presence or extent of interaction between a compound and a protein in the in vivo environment using a deep-learning model, an apparatus for performing the method, and a computer program in which the method is implemented.

Technical objects of the present disclosure are not limited to those described above, and other technical objects not mentioned above may also be clearly understood from the descriptions given below by those skilled in the art to which the present disclosure belongs.

To achieve the technical object above, a method for predicting interaction between a compound and a protein in a computing device according to some embodiments of the present disclosure comprises: acquiring training data composed of compound data for learning, protein data for learning, and interaction scores; constructing a deep-learning model by using the acquired training data; and predicting interaction between a given compound and protein through the constructed deep-learning model, wherein the protein data for learning may include amino acid sequences of the protein for learning, and the constructing may include generating first training data by excluding amino acid sequences associated with a first protein domain having a negative influence on the interaction from the amino acid sequences of the protein for learning; and training the deep-learning model based on the first training data.

In some embodiments, the first protein domain may include a transmembrane domain.

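
As a minimal sketch of how the first training data might be generated, the following assumes that the residues belonging to the first protein domain (for example, a transmembrane domain) are already annotated as 0-based, half-open index ranges for each protein. The disclosure does not specify where such annotations come from or how they are encoded, so the record layout, the annotation dictionary, and the names `exclude_domains` and `build_first_training_data` are all hypothetical.

```python
def exclude_domains(sequence, domain_ranges):
    """Remove residues that fall inside any [start, end) domain range.

    Assumption: ranges are 0-based and half-open; overlapping ranges are allowed.
    """
    keep = [True] * len(sequence)
    for start, end in domain_ranges:
        for i in range(max(start, 0), min(end, len(sequence))):
            keep[i] = False
    return "".join(res for res, kept in zip(sequence, keep) if kept)


def build_first_training_data(records, annotations):
    """Return (compound, filtered_sequence, score) triples.

    records: iterable of (compound, protein_id, sequence, interaction_score).
    annotations: dict mapping protein_id to the list of ranges to exclude
                 (hypothetical format; proteins without an entry are kept whole).
    """
    first_training_data = []
    for compound, protein_id, sequence, score in records:
        ranges = annotations.get(protein_id, [])
        first_training_data.append(
            (compound, exclude_domains(sequence, ranges), score)
        )
    return first_training_data


if __name__ == "__main__":
    # Toy values for illustration only; not real compound or protein data.
    records = [("CCO", "P1", "MKTLLVAGG", 0.8)]
    annotations = {"P1": [(2, 5)]}
    print(build_first_training_data(records, annotations))
```

Keeping the exclusion step separate from the model means the same filtered records could feed any downstream encoder; that separation is a design choice of this sketch and is not stated in the disclosure.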
In some embodiments, the constructing includes generating second training data consisting of the compound data for learning, amino acid sequence data associated with the first protein domain, and a first interaction score; and training the deep-learning model using the second training data, wherein the first interaction score may be determined based on the extent to which the first protein domain negatively affects the interaction.

In some embodiments, the constructing includes: selecting, from the acquired training data, a first plurality of proteins for learning whose interaction score with a specific compound for learning is above a threshold and a second plurality of proteins for learning whose interaction score is below the threshold; comparing amino acid sequences of the first plurality of proteins for learning and extracting a first common sequence; comparing amino acid sequences of the second plurality of proteins for learning and extracting a second common sequence; training the deep-learning model using second training data consisting of the first common sequence, the specific compound data for learning, and a first interaction score; and training the deep-learning model using third training data consisting of the second common sequence, the specific compound data for learning, and a second intera