CN-118522343-B - Protein-target affinity prediction method, system, storage medium and terminal based on small sample learning


Abstract

The invention discloses a protein-target affinity prediction method, system, storage medium and terminal based on small sample learning, belonging to the technical field of protein prediction. The method comprises: S1, extracting protein sequence features; S2, extracting SMILES code features of a drug target; S3, fusing the feature data extracted in step S1 and step S2 to obtain protein-target feature data; S4, constructing a protein-target affinity prediction model and training it with the protein-target feature data, wherein the protein-target feature data is divided into a training set, a validation set and a test set, the training set is dynamically supplemented with samples from the validation set during training, and supplementation stops when the number of samples in the validation set falls to a certain threshold; and S5, feeding the test set data into the trained protein-target affinity prediction model to obtain a protein-target affinity prediction result. The method can effectively mitigate the overfitting caused by too few training samples in protein-target affinity prediction, and offers a new perspective on resisting overfitting risk in small sample learning.

Inventors

ZHU QINSHENG
Hu Bangxun
LI XIAOYU
QIAN WEIZHONG
HUANG JUAN
Tian Xuwei
YU LIANHUI

Assignees

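University of Electronic Science and Technology of China (电子科技大学)
Kashgar Region Electronic Information Industry Technology Research Institute (喀什地区电子信息产业技术研究院)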

Dates

Publication Date: 20260512
Application Date: 20240418

Claims (9)

1. A protein-target affinity prediction method based on small sample learning, comprising the steps of: S1, extracting protein sequence features; S2, extracting SMILES code features of a drug target; S3, fusing the feature data extracted in step S1 and step S2 to obtain protein-target feature data; S4, constructing a protein-target affinity prediction model, and training the protein-target affinity prediction model using the protein-target feature data, wherein the protein-target feature data is divided into a training set, a validation set and a test set, dynamic training is performed during training by supplementing the training set with samples from the validation set, and supplementing the training set is stopped when the number of samples in the validation set falls to a certain threshold, and wherein step S4 further comprises: nesting an outer loop around the protein-target affinity prediction model, and training by cyclically training a shallow neural network; recording, in each cycle, the evaluation index parameters of the model at the last iteration, and determining whether the maximum number of cycles has been reached; if so, exiting the loop and selecting the best evaluation index parameters across all cycles as the parameters of the model; S5, feeding the test set data into the trained protein-target affinity prediction model to obtain a protein-target affinity prediction result.
2. The protein-target affinity prediction method based on small sample learning of claim 1, wherein the protein sequence feature extraction comprises: vectorizing the protein sequences using one-hot encoding, and then extracting their features using a convolutional neural network.
3. The protein-target affinity prediction method based on small sample learning of claim 1, wherein the drug target SMILES code feature extraction comprises: converting the SMILES code into a molecular object; and calling the corresponding functions in the RDKit library to extract molecular descriptor features and molecular geometric features of the drug target.
4. The method for predicting protein-target affinity based on small sample learning of claim 1, wherein the ratio of training set, validation set and test set is 3:1:1.
5. The protein-target affinity prediction method based on small sample learning of claim 1, wherein the dynamic training by supplementing the training set with the validation set comprises: in each iteration of the model, randomly moving s samples from the validation set into the training set.
6. The protein-target affinity prediction method based on small sample learning of claim 5, wherein stopping supplementing the training set when the number of samples in the validation set falls to a certain threshold comprises: stopping supplementing the training set when the number of samples in the validation set falls to 5% of the total number of samples.
7. A protein-target affinity prediction system based on small sample learning, comprising: a protein feature extraction module configured to extract protein sequence features; a target feature extraction module configured to extract features of the drug target SMILES code; a feature fusion module configured to fuse the protein features and the target features to obtain protein-target feature data; a prediction model construction and training module configured to construct a protein-target affinity prediction model and train it using the protein-target feature data, wherein the protein-target feature data is divided into a training set, a validation set and a test set, dynamic training is performed during training by supplementing the training set with samples from the validation set, and supplementing the training set is stopped when the number of samples in the validation set falls to a certain threshold; the module further recording, in each cycle, the evaluation index parameters of the model at the last iteration, determining whether the maximum number of cycles has been reached and, if so, exiting the loop and selecting the best evaluation index parameters across all cycles as the parameters of the model; and a protein-target affinity prediction module configured to feed the test set data into the trained protein-target affinity prediction model to obtain a protein-target affinity prediction result.
8. A computer storage medium having stored thereon computer instructions which, when executed, perform the relevant steps of a small sample learning based protein-target affinity prediction method according to any one of claims 1-6.
9. A terminal comprising a memory and a processor, the memory having stored thereon computer instructions that when executed by the processor perform the relevant steps of a small sample learning-based protein-target affinity prediction method according to any one of claims 1-6.
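
Claims 4 to 6 fix the 3:1:1 split, the per-iteration transfer of s samples, and the 5% stopping threshold. The following is a minimal sketch of that bookkeeping only, not the patented implementation. The function names `split_3_1_1` and `supplement_training_set` are hypothetical. The sketch also assumes that the 5% threshold is computed on the total sample count and rounded up, and that the last transfer is clamped so the validation set stops exactly at the threshold; the claims do not specify either detail.

```python
import math
import random


def split_3_1_1(samples, seed=0):
    """Shuffle the samples and split them into train/validation/test at 3:1:1."""
    rng = random.Random(seed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n_train = len(shuffled) * 3 // 5
    n_val = len(shuffled) // 5
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def supplement_training_set(train, val, total, s, rng, min_val_percent=5):
    """Move up to s randomly chosen validation samples into the training set.

    Nothing is moved once the validation set has shrunk to min_val_percent
    of the total sample count (assumption: rounded up, final move clamped).
    Both lists are modified in place; the number of moved samples is returned.
    """
    floor = math.ceil(total * min_val_percent / 100)
    n_move = min(s, len(val) - floor)
    if n_move <= 0:
        return 0
    for _ in range(n_move):
        train.append(val.pop(rng.randrange(len(val))))
    return n_move
```

Calling `supplement_training_set` once per training iteration with 100 samples and s = 4 shrinks the validation set from 20 to 16, 12, 8 and finally 5 samples, after which further calls move nothing. The test set is never touched.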

Description

Protein-target affinity prediction method, system, storage medium and terminal based on small sample learning

Technical Field

The invention relates to the technical field of protein prediction, in particular to a protein-target affinity prediction method, system, storage medium and terminal based on small sample learning.

Background

In protein-target affinity prediction, having too few available samples is a common problem. Protein-target affinity prediction requires a large amount of affinity data for known protein-ligand complexes to train the model and learn from it the pattern of protein-ligand interactions. However, experimental determination of protein-ligand complexes is expensive, time-consuming and challenging, so the set of available training data is relatively small.

Too few samples cause two problems. First, because the training set is small, the model may focus too much on noise and special samples in the training set and fail to capture the true characteristics of the data. Increasing the number of iterations so that the model learns better is not a suitable remedy, because more training iterations cause the model to memorize more details of the training set, which further exacerbates overfitting and reduces the generalization ability of the model on new data. Second, because the number of samples available for training is limited, neural network models are not adequately trained and overfit easily.

Thus, there are currently two major drawbacks in small sample learning problems such as protein-target affinity prediction. The first is overfitting due to an excessive number of iterations, and the second is overfitting due to insufficient available training data.

Disclosure of Invention

The invention aims to overcome the problems in existing protein-target affinity prediction and provides a protein-target affinity prediction method, system, storage medium and terminal based on small sample learning. Without significantly increasing the time complexity or space complexity of the original algorithm, the invention can effectively resist overfitting risk and offers a new perspective on resisting overfitting risk in small sample learning problems.

The aim of the invention is achieved by the following technical scheme:

In a first aspect, a protein-target affinity prediction method based on small sample learning is provided, comprising the steps of:

S1, extracting protein sequence features;
S2, extracting SMILES code features of a drug target;
S3, fusing the feature data extracted in step S1 and step S2 to obtain protein-target feature data;
S4, constructing a protein-target affinity prediction model, and training the protein-target affinity prediction model using the protein-target feature data, wherein the protein-target feature data is divided into a training set, a validation set and a test set;
S5, feeding the test set data into the trained protein-target affinity prediction model to obtain a protein-target affinity prediction result.

As a further improvement of the present invention, step S4 further includes: nesting an outer loop around the protein-target affinity prediction model, and training by cyclically training a shallow neural network; and recording, in each cycle, the evaluation index parameters of the model at the last iteration, determining whether the maximum number of cycles has been reached and, if so, exiting the loop and selecting the best evaluation index parameters across all cycles as the parameters of the model.
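
The outer loop described for step S4 can be read in more than one way. The following is a minimal sketch of one reading, not the patented implementation. It assumes that each cycle trains a fresh model from a new random initialization for a small number of iterations, that the evaluation index is mean squared error on a held-out set, and that a toy one-variable linear model stands in for the shallow neural network. The names `train_shallow`, `mse` and `cyclic_training` are hypothetical.

```python
import random


def train_shallow(train_data, n_iters, lr, rng):
    """Stand-in for one short training run: fit y = w * x + b by SGD."""
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    for _ in range(n_iters):
        x, y = rng.choice(train_data)
        err = (w * x + b) - y
        w -= lr * err * x
        b -= lr * err
    return w, b


def mse(params, data):
    """Evaluation index (assumed here): mean squared error, lower is better."""
    w, b = params
    return sum(((w * x + b) - y) ** 2 for x, y in data) / len(data)


def cyclic_training(train_data, eval_data, max_cycles=10, n_iters=50,
                    lr=0.05, seed=0):
    """Run max_cycles short training runs and keep the best final state."""
    rng = random.Random(seed)
    history = []
    for _ in range(max_cycles):
        # Record the score and parameters reached at the last iteration
        # of this cycle.
        params = train_shallow(train_data, n_iters, lr, rng)
        history.append((mse(params, eval_data), params))
    # After the maximum number of cycles, select the best recorded cycle.
    best_score, best_params = min(history, key=lambda rec: rec[0])
    return best_params, best_score, [score for score, _ in history]
```

Because each cycle is short, no single run has many iterations in which to memorize the training set, and the selection step simply keeps whichever short run scored best on the held-out data.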

As a still further improvement of the present invention, the protein sequence feature extraction includes: vectorizing the protein sequences using one-hot encoding, and then extracting their features using a convolutional neural network.

As a further improvement of the present invention, the drug target SMILES code feature extraction includes: converting the SMILES code into a molecular object; and calling the corresponding functions in the RDKit library to extract molecular descriptor features and molecular geometric features of the drug target.

As a further improvement of the present invention, the ratio of the training set, the validation set and the test set is 3:1:1.

As a further improvement of the present invention, the dynamic training by supplementing the training set with the validation set includes: in each iteration of the model, randomly moving s samples from the validation set into the training set.

As a still further improvement of the present invention, stopping supplementing the training set when the number of samples in the validation set falls to a certain threshold includes: stopping supplementing the training set when the number of samples in the validation set falls to 5% of the total number of samples.

In a second aspect, there is provided a small sa