KR-20260065060-A - SYSTEM, METHOD AND PROGRAM FOR PREDICTING PROTEIN-LIGAND BINDING AFFINITY USING SE(3)-INVARIANT AND PHYSICS-BASED NETWORK MODEL
Abstract
One embodiment of the present invention provides a protein-ligand binding affinity prediction system comprising: a receiving unit that receives atomic information and three-dimensional coordinate data of a protein-ligand complex; a graph preprocessing module that performs a preprocessing operation based on the atomic information and the three-dimensional coordinate data to generate a graph structure; an SE(3)-invariance transformation module that performs an invariance transformation of the graph structure into an embedding layer and a hidden representation based on a first inductive bias so that the binding affinity of a substance or system in three-dimensional space is invariant with respect to rotation or translation; a physical knowledge-based network model that predicts the binding affinity using the embedding layer and the hidden representation as inputs; and an output unit that outputs the predicted binding affinity.
Inventors
- 박상현
- 최승연
- 서상민
Assignees
- 연세대학교 산학협력단
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-10-31
Claims (15)
- A protein-ligand binding affinity prediction system comprising: a receiving unit that receives atomic information and three-dimensional coordinate data of a protein-ligand complex; a graph preprocessing module that generates a graph structure by performing preprocessing based on the atomic information and the three-dimensional coordinate data; an SE(3)-invariance transformation module that transforms the graph structure into an embedding layer and a hidden representation based on a first inductive bias so that the binding affinity of a substance or system in three-dimensional space is invariant with respect to rotation or translation; a physical knowledge-based network model that predicts the binding affinity using the embedding layer and the hidden representation as inputs; and an output unit that outputs the predicted binding affinity.
- The protein-ligand binding affinity prediction system of claim 1, wherein the physical knowledge-based network model performs learning that minimizes a loss function, the learning including: (a) a step of calculating protein-ligand interactions using the hidden representation; (b) a step of predicting the binding affinity using the protein-ligand interactions; and (c) a step of calculating the loss function based on a second inductive bias that a binding state is formed at the point where the binding free energy of the protein-ligand complex is minimized.
- The protein-ligand binding affinity prediction system of claim 2, wherein the graph preprocessing module: defines each atom as a node based on the atomic information, with the physicochemical features of the atom included in each node; generates a position matrix of the atoms containing the three-dimensional coordinates; and defines edges based on the distances between atoms using the K-Nearest Neighbors (KNN) algorithm, connecting each atom to its neighboring atoms by edges to represent their interactions.
- The protein-ligand binding affinity prediction system of claim 3, wherein the SE(3)-invariance transformation module encodes the physicochemical features through Mathematical Formulas 1, 2, and 3 below to convert them into an embedding layer and a hidden representation, and updates the hidden representation through Mathematical Formulas 4, 5, 6, 7, and 8 below. [Mathematical Formula 1] [Mathematical Formula 2] [Mathematical Formula 3] Here, the initial hidden representation is formed by concatenation. Linear() represents a linear transformation function. R represents the real space, indicating that the corresponding vector or matrix consists of real numbers. N and M represent the number of atoms in the protein and the ligand, respectively. D_E represents the dimensionality of the physicochemical features of each atom. [Mathematical Formula 4] The quantities in Mathematical Formula 4 are: the hidden representation of the i-th node in the (l+1)-th embedding layer; the hidden representation of the i-th node in the l-th embedding layer; the Euclidean distance between atoms i and j; an edge defined based on the distance between atoms i and j using the K-Nearest Neighbors (KNN) graph algorithm; and the learnable parameters of the network model. [Mathematical Formula 5] [Mathematical Formula 6] [Mathematical Formula 7] [Mathematical Formula 8] Mathematical Formulas 5 to 8 use the query, key, and value matrices of the attention operation.
- The protein-ligand binding affinity prediction system of claim 4, wherein step (a) calculates a protein-ligand interaction matrix from the hidden representation through Mathematical Formula 9 below. [Mathematical Formula 9] The inputs to Mathematical Formula 9 are the final hidden representation of the atoms constituting the ligand and the final hidden representation of the atoms constituting the protein.
- The protein-ligand binding affinity prediction system of claim 5, wherein step (b) predicts the binding affinity based on the sum of the van der Waals interaction energies of the atom pairs, calculated through Mathematical Formula 10 below. [Mathematical Formula 10] The quantities in Mathematical Formula 10 are: the sum of the van der Waals interaction energies; the interaction strength or weight (a constant) between the i-th ligand atom and the j-th protein atom; and the van der Waals radius between the i-th ligand atom and the j-th protein atom.
- The protein-ligand binding affinity prediction system of claim 6, wherein step (c) calculates a loss function that is the sum of a term calculated through Mathematical Formula 12 below and a term calculated through Mathematical Formula 13 below. [Mathematical Formula 12] [Mathematical Formula 13] Here, y represents the binding affinity based on actual experimental data, and M represents the protein-ligand interaction matrix. The other quantities are the predicted binding affinity, a learnable parameter, and the radius of the ligand atom.
- A method for predicting protein-ligand binding affinity, comprising: (A) a step of receiving atomic information and three-dimensional coordinate data of a protein-ligand complex; (B) a graph preprocessing step of generating a graph structure by performing preprocessing based on the atomic information and the three-dimensional coordinate data; (C) an SE(3)-invariance transformation step of transforming the graph structure into an embedding layer and a hidden representation based on a first inductive bias so that the binding affinity of a substance or system in three-dimensional space is invariant with respect to rotation or translation; and (D) a step of predicting the binding affinity using a physical knowledge-based network model with the embedding layer and the hidden representation as inputs.
- The method for predicting protein-ligand binding affinity of claim 8, wherein the physical knowledge-based network model performs learning that minimizes a loss function, the learning including: (E) a step of calculating protein-ligand interactions using the hidden representation; (F) a step of predicting the binding affinity using the protein-ligand interactions; and (G) a step of calculating the loss function based on a second inductive bias that a binding state is formed at the point where the binding free energy of the protein-ligand complex is minimized.
- The method for predicting protein-ligand binding affinity of claim 9, wherein step (B): defines each atom as a node based on the atomic information, with the physicochemical features of the atom included in each node; generates a position matrix of the atoms containing the three-dimensional coordinates; and defines edges based on the distances between atoms using the K-Nearest Neighbors (KNN) algorithm, connecting each atom to its neighboring atoms by edges to represent their interactions.
- The method for predicting protein-ligand binding affinity of claim 10, wherein step (C) encodes the physicochemical features through Mathematical Formulas 14, 15, and 16 below to convert them into an embedding layer and a hidden representation, and updates the hidden representation through Mathematical Formulas 17, 18, 19, 20, and 21 below. [Mathematical Formula 14] [Mathematical Formula 15] [Mathematical Formula 16] Here, the initial hidden representation is formed by concatenation. Linear() represents a linear transformation function. R represents the real space, indicating that the corresponding vector or matrix consists of real numbers. N and M represent the number of atoms in the protein and the ligand, respectively. D_E represents the dimensionality of the physicochemical features of each atom. [Mathematical Formula 17] The quantities in Mathematical Formula 17 are: the hidden representation of the i-th node in the (l+1)-th embedding layer; the hidden representation of the i-th node in the l-th embedding layer; the Euclidean distance between atoms i and j; an edge defined based on the distance between atoms i and j using the K-Nearest Neighbors (KNN) graph algorithm; and the learnable parameters of the network model. [Mathematical Formula 18] [Mathematical Formula 19] [Mathematical Formula 20] [Mathematical Formula 21] Mathematical Formulas 18 to 21 use the query, key, and value matrices of the attention operation.
- The method for predicting protein-ligand binding affinity of claim 11, wherein step (E) calculates a protein-ligand interaction matrix from the hidden representation through Mathematical Formula 22 below. [Mathematical Formula 22] The inputs to Mathematical Formula 22 are the final hidden representation of the atoms constituting the ligand and the final hidden representation of the atoms constituting the protein.
- The method for predicting protein-ligand binding affinity of claim 12, wherein step (F) predicts the binding affinity based on the sum of the van der Waals interaction energies of the atom pairs, calculated through Mathematical Formula 23 below. [Mathematical Formula 23] The quantities in Mathematical Formula 23 are: the sum of the van der Waals interaction energies; the interaction strength or weight (a constant) between the i-th ligand atom and the j-th protein atom; and the van der Waals radius between the i-th ligand atom and the j-th protein atom.
- The method for predicting protein-ligand binding affinity of claim 13, wherein step (G) calculates a loss function that is the sum of a term calculated through Mathematical Formula 24 below and a term calculated through Mathematical Formula 25 below. [Mathematical Formula 24] [Mathematical Formula 25] Here, y represents the binding affinity based on actual experimental data, and M represents the protein-ligand interaction matrix. The other quantities are the predicted binding affinity, a learnable parameter, and the radius of the ligand atom.
- A computer-readable recording medium having recorded thereon a program for executing the method for predicting protein-ligand binding affinity according to any one of claims 8 to 14.
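As an illustrative sketch only, and not the claimed implementation, the KNN graph construction recited in claims 3 and 10 and the rotation/translation invariance targeted by the first inductive bias can be checked numerically. The function names `knn_edges` and `random_rigid_motion`, the choice k = 4, and the random toy coordinates are assumptions introduced for this example; the claims do not specify them.

```python
import numpy as np

def knn_edges(pos: np.ndarray, k: int):
    """For each atom i, connect it to its k nearest atoms j.

    pos is an (n, 3) position matrix. Returns an (n*k, 2) array of
    directed edges (i, j) and the matching Euclidean distances d_ij.
    """
    diff = pos[:, None, :] - pos[None, :, :]       # (n, n, 3) displacement vectors
    dist = np.linalg.norm(diff, axis=-1)           # (n, n) pairwise distances
    np.fill_diagonal(dist, np.inf)                 # an atom is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]         # indices of the k nearest atoms
    src = np.repeat(np.arange(pos.shape[0]), k)
    dst = nbrs.reshape(-1)
    return np.stack([src, dst], axis=1), dist[src, dst]

def random_rigid_motion(rng: np.random.Generator):
    """Sample a proper rotation (det = +1) and a translation vector."""
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))   # random orthogonal matrix
    if np.linalg.det(q) < 0:
        q[:, 0] = -q[:, 0]                         # flip one axis to remove the reflection
    return q, rng.normal(size=3)

rng = np.random.default_rng(0)
pos = rng.normal(size=(12, 3)) * 5.0               # toy atom coordinates (assumed data)

edges, d = knn_edges(pos, k=4)

# Rotate and translate the whole complex, then rebuild the graph.
R, t = random_rigid_motion(rng)
edges_moved, d_moved = knn_edges(pos @ R.T + t, k=4)

same_edges = set(map(tuple, edges.tolist())) == set(map(tuple, edges_moved.tolist()))
same_dists = np.allclose(np.sort(d), np.sort(d_moved))
print(same_edges, same_dists)
```

Because the edges and their lengths depend only on interatomic distances, any feature computed from them is unchanged when the whole complex is rotated or translated, which is the property the first inductive bias relies on.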
Description
System, method and program for predicting protein-ligand binding affinity using SE(3)-invariant and physics-based network model
The present invention relates to a system, method, and program for predicting protein-ligand binding affinity, and more specifically to a technology for predicting the binding affinity between a protein and a ligand by utilizing a network model based on geometric and physicochemical principles.
Predicting protein-ligand binding affinity (BA) is an essential process in drug screening and plays a crucial role in selecting drug candidates from among numerous molecular structures. With recent advancements in machine learning and deep learning, various methodologies for predicting protein-ligand binding affinity have been proposed. In particular, studies modeling the three-dimensional structure of proteins, such as AlphaFold, have significantly enhanced the potential for predicting binding affinity based on the 3D structure of protein-ligand complexes.
These approaches can be broadly classified into ML-based, CNN-based, and GNN-based methods. ML-based methods predict binding affinities using models such as Support Vector Regression (SVR) and Random Forest, with predefined rules based on interactions between protein and ligand atoms. However, these methods do not adequately reflect spatial correlations between atoms. Convolutional Neural Network (CNN)-based methods convert complexes into a 3D grid and predict binding affinities with a 3D CNN model, but empty voxels in the grid lead to computational inefficiency and wasted memory. In addition, their prediction performance may be unstable because they do not account for distance awareness or rotation invariance. Graph Neural Network (GNN)-based methods define the atoms of proteins and ligands as nodes in a graph and connect pairs of atoms within a specific distance as edges, then process the graph with a GNN model.
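To make the GNN-style graph construction described above concrete, the following is a minimal sketch that connects atom pairs lying within a distance cutoff. The function name `radius_edges`, the 5.0 cutoff, and the four toy coordinates are assumptions chosen for illustration and are not taken from the cited prior-art methods.

```python
import numpy as np

def radius_edges(pos: np.ndarray, cutoff: float) -> np.ndarray:
    """Return directed edges (i, j) for all atom pairs with distance <= cutoff."""
    dist = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    mask = (dist <= cutoff) & ~np.eye(len(pos), dtype=bool)   # drop self-loops
    return np.argwhere(mask)

# Toy coordinates (assumed): three nearby atoms and one far-away atom.
pos = np.array([
    [0.0, 0.0, 0.0],
    [1.5, 0.0, 0.0],
    [0.0, 4.0, 0.0],
    [10.0, 10.0, 10.0],
])

edges = radius_edges(pos, cutoff=5.0)
print(edges.tolist())
```

In this toy input the first three atoms are mutually connected, while the fourth atom lies beyond the cutoff of every other atom and receives no edges.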
Some studies predict binding affinities by directly inputting 3D coordinates, but these methods are sensitive to rotation and translation or struggle to handle geometric configurations not seen during training.
Meanwhile, although deep learning models have achieved great success in fields such as computer vision and natural language processing, they still struggle to yield interpretable information, particularly when they predict physically impossible values or fail to maintain physical consistency. This problem is even more pronounced in fields where data collection is difficult, such as physical chemistry. To address this, Physics-Informed Neural Networks (PINNs) are being actively researched. PINNs integrate physical laws and domain knowledge into network models so that they act as an inductive bias during the learning process. This ensures that the model implicitly satisfies these laws during training and inference, thereby securing model robustness even on noisy datasets and improving generalization performance.
However, existing PINN models consider only the connectivity information between proteins and ligands and fail to adequately model the geometric information between the two structures. As a result, existing models exhibit poor predictive performance on independent datasets, which may limit their practicality in the drug development process. For example, because it is difficult to accurately predict binding affinity for novel protein-ligand complexes, time and costs may increase during the early stages of drug development. Therefore, there is a need for a network model whose performance is invariant to geometric transformations and that more accurately reflects the interactions within protein-ligand complexes.
FIG. 1 is a schematic diagram of a protein-ligand binding affinity prediction system and method using a deep learning model based on SE(3) invariance and physical knowledge according to an embodiment of the present invention.
FIG. 2 is a block diagram illustrating the configuration of a protein-ligand binding affinity prediction system according to an embodiment of the present invention.
FIG. 3 is a reference diagram showing the learning process and binding affinity prediction method of a physical knowledge-based network model according to an embodiment of the present invention.
FIG. 4 is a reference diagram showing the results of a performance comparison in which each inductive bias of the SPIN model is removed in turn, according to an embodiment of the present invention.
FIG. 5 is a reference diagram showing the results of comparing ranking power in a virtual screening experiment using the SPIN model according to an embodiment of the present invention.
FIG. 6 is a reference diagram verifying the possibility of analyzing and interpreting protein-ligand interactions using the SPIN model according to an embodiment of the present invention.
FIG. 7 is a flowchart showing a method for predicting protein-ligand binding affinity accordin