CN-121999913-A - Drug and target interaction prediction method based on drug sequence descriptor
Abstract
The invention discloses a drug-target interaction prediction method based on a drug sequence descriptor, belonging to the technical field of drug target prediction. The invention uses two descriptors of the drug SMILES sequence. Adding both descriptors lets the model predict more accurately, because both carry task-relevant structural and chemical information. For protein sequences, the invention adds features from the large language model ESM-2 and features from a BERT large language model. Because the weights of the ESM-2 and BERT models are derived from tens of thousands of protein sequences, their output vectors have strong representational capability. The protein features and drug features derived from the source sequences interact through a purpose-designed cross-attention layer with shared weights, which further improves the prediction accuracy of the model. Experiments show that the model designed by the invention performs better on multiple metrics across multiple data sets.
Inventors
- LU KUN
- HU WEI
- PAN XUEJUAN
- ZHAO XINGQIANG
- GAO YUN
- WANG BING
Assignees
- Anqing Normal University (安庆师范大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20260116
Claims (10)
- 1. A method for predicting drug-target interactions based on a drug sequence descriptor, comprising the following steps:
  S1, data preprocessing: encoding the drug SMILES sequence and the protein sequence through preset dictionary-type data;
  S2, acquiring drug sequence descriptors and large-language-model protein sequence features: characterizing the drug SMILES sequence with two drug molecule descriptors, and extracting features of the protein sequence with two large protein language models;
  S3, generating embedding vectors with an embedding layer: mapping the discrete data encoded in step S1 to a continuous vector space through embedding layers, converting the discrete data into high-dimensional vectors, and then performing a permute operation to convert the feature dimensions, to obtain processed drug features and protein features;
  S4, feature extraction: performing hierarchical feature extraction on the drug features and the protein features output in step S3 through three one-dimensional convolution layers each, and converting the feature dimensions through a permute operation to fit the subsequent attention mechanism;
  S5, feature interaction: using a cross-attention mechanism with shared weights to let the features extracted in step S4 interact, wherein the drug-side key vector and value vector are calculated from the protein features, the protein-side key vector and value vector are calculated from the drug features, and the weights of the first fully connected layers of the multi-head attention layers on the two sides are shared;
  S6, feature stitching: adjusting the dimensions of the features after the interaction in step S5, regulating the proportion of the original features and the interaction features through selective forgetting coefficients, applying a max pooling layer and reducing the dimensions, processing the reduced drug features and protein features separately with fully connected layers, concatenating them, and inputting the result into an MLP to obtain a prediction result;
  S7, model training, verification and evaluation: dividing the data set, selecting a loss function to train the interaction prediction model that executes steps S1-S6, evaluating the model with preset evaluation indexes, and outputting a prediction result.
- 2. The method for predicting drug-target interactions based on drug sequence descriptors according to claim 1, wherein in step S1 the coding correspondence of the dictionary-type data (Char → Data) is as follows:
  Drug SMILES sequence coding: #→29, %→30, )→31, (→1, +→32, -→33, 5→37, 7→38, 9→39, =→40, @→8, /→34, .→2, 1→35, 0→3, 3→36, 2→4, 4→5, 6→6, 8→7, B→9, E→43, D→10, G→44, F→11, I→45, H→12, K→46, M→47, L→13, C→42, O→48, f→23, N→14, P→15, S→49, R→16, U→50, T→17, W→51, V→18, Y→52, [→53, y→64, Z→19, ]→54, \→20, a→55, c→56, b→21, e→57, d→22, g→58, i→59, h→24, m→60, l→25, o→61, n→26, s→62, r→27, u→63, t→28, A→41;
  Protein sequence coding: A→1, C→2, B→3, E→4, D→5, G→6, F→7, I→8, H→9, K→10, M→11, L→12, O→13, N→14, Q→15, P→16, S→17, R→18, U→19, T→20, W→21, V→22, Y→23, X→24, Z→25;
  where Char refers to a specific character in the drug SMILES sequence or protein sequence, and Data refers to the unique numeric code assigned to that character.
- 3. The method according to claim 1, wherein in step S2 the two drug molecule descriptors are the RDKit d molecular fingerprint and the PubChemPy molecular fingerprint, wherein the RDKit d molecular fingerprint characterizes the topological structure of the molecule, namely the connection pattern of atoms and bonds, and is obtained by traversing each atom in the molecule and considering the subgraph paths within a set range, and the PubChemPy molecular fingerprint is a binary vector of length 881 bits in which each bit corresponds to a predefined chemical substructure or chemical property and is therefore directly interpretable.
- 4. The method for predicting interaction between a drug and a target based on a drug sequence descriptor according to claim 1, wherein in step S2 the two large protein language models are an ESM-2 model and a ProteinBERT model, wherein the ESM-2 model is a Transformer-based 6-layer pretrained model whose training objective, masked language modeling, learns the dependencies among amino acids: the mask M comprises 15% of the positions i in a sequence x, and the model predicts the identity of each masked amino acid x_i from the surrounding context; the ProteinBERT model adopts a pretraining scheme that combines language modeling with Gene Ontology annotation prediction, and its architecture comprises local and global representations, wherein the global representation is obtained through a global attention layer and represents features including protein structure and biophysical properties, and the local representation is formed by convolution kernels of different sizes operating in parallel as a multi-branch convolution, thereby modeling local features of different scales in proteins.
- 5. The method according to claim 1, wherein in step S3 the embedding layer outputs protein features with dimensions (N, 1000, 64) and drug features with dimensions (N, 100, 64), and after the permute(0, 2, 1) operation the dimensions of the protein features and the drug features are (N, 64, 1000) and (N, 64, 100), respectively, where N is the number of samples.
- 6. The method for predicting interaction between a drug and a target based on a drug sequence descriptor according to claim 5, wherein in step S4 the convolution kernel sizes of the three one-dimensional convolution layers for drug feature extraction are 4, 6 and 8, respectively, the convolution kernel sizes of the three one-dimensional convolution layers for protein feature extraction are 4, 8 and 12, respectively, a PReLU activation function is used during feature extraction to retain part of the negative gradient information, after hierarchical feature extraction the drug feature dimension is (N, 160, 85) and the protein feature dimension is (N, 160, 979), and a permute(2, 0, 1) operation is then performed to obtain the drug feature map and the target protein feature map that are input to the attention mechanism.
- 7. The method of predicting drug-target interactions based on drug sequence descriptors of claim 6, wherein in step S5 the cross-attention mechanism with shared weights satisfies the following: the query vector $Q_d$, key vector $K_d$ and value vector $V_d$ for the drug are calculated as $Q_d = X_d W_Q$, $K_d = X_p W_K$, $V_d = X_p W_V$; the query vector $Q_p$, key vector $K_p$ and value vector $V_p$ for the protein are calculated as $Q_p = X_p W_Q$, $K_p = X_d W_K$, $V_p = X_d W_V$; wherein $X_d$ and $X_p$ are the drug feature map and the target protein feature map input to the attention mechanism, $W_Q$, $W_K$ and $W_V$ are the weight matrices of the first fully connected layer of the multi-head attention layer, and the $W_Q$, $W_K$ and $W_V$ used on the drug side and on the protein side are identical; after cross-attention extraction and a permute operation, the resulting features keep the same dimensions as the features before interaction and are denoted as the output features $X_d'$ and $X_p'$.
- 8. The method of predicting drug-target interactions based on drug sequence descriptors of claim 7, wherein in step S6 the output drug feature and the output protein feature are adjusted by permute operations so that their feature dimensions are the same as those of the drug feature and the protein feature obtained by the hierarchical feature extraction of step S4.
- 9. The method of predicting drug-target interactions based on drug sequence descriptors of claim 8, wherein in step S6 the original features and the interaction features of the drug and of the protein are regulated through selective forgetting coefficients according to the following formula: [formula missing in the source text]; wherein W is a selective forgetting coefficient; the regulated features are processed by the max pooling layer, and the pooled features are then reduced in dimension with the squeeze method to obtain the drug feature and the protein feature with dimensions (N, 160), according to the following formula: [formula missing in the source text].
- 10. The method of predicting drug-target interactions based on drug sequence descriptors of claim 9, wherein in step S6 the prediction result is expressed as follows: [formula missing in the source text]; wherein the MLP comprises four fully connected layers, the number of corresponding neurons is 1024, 512, 2, linear denotes a fully connected layer, concat denotes the concatenation operation, ESM denotes the feature vector from the ESM-2 large language model, BERT denotes the feature vector from the ProteinBERT large language model, and the remaining two inputs are the drug and protein interaction features obtained after the cross-attention mechanism.
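The dictionary encoding of claim 2 can be sketched as a simple lookup followed by padding. This is a minimal illustration, not the patented implementation: the helper name `encode`, the padding value 0, and truncation of over-long sequences are assumptions, while the code values come from claim 2 and the fixed lengths 100 and 1000 from claim 5.

```python
# Characters listed in order of their code (1, 2, 3, ...) as given in claim 2.
SMILES_ORDER = "(.02468@BDFHLNPRTVZ\\bdfhlnrt#%)+-/13579=ACEGIKMOSUWY[]acegimosuy"
PROTEIN_ORDER = "ACBEDGFIHKMLONQPSRUTWVYXZ"

SMILES_DICT = {ch: code for code, ch in enumerate(SMILES_ORDER, start=1)}
PROTEIN_DICT = {ch: code for code, ch in enumerate(PROTEIN_ORDER, start=1)}


def encode(sequence, table, max_len, pad=0):
    """Map each character to its integer code, then pad or truncate.

    Assumptions (not stated in the claims): padding uses 0, sequences
    longer than max_len are truncated, unknown characters raise KeyError.
    """
    codes = [table[ch] for ch in sequence[:max_len]]
    return codes + [pad] * (max_len - len(codes))


drug_codes = encode("CC(=O)O", SMILES_DICT, max_len=100)
protein_codes = encode("MKT", PROTEIN_DICT, max_len=1000)
print(drug_codes[:8])     # first codes of the encoded SMILES string
print(protein_codes[:4])  # first codes of the encoded protein
```

The resulting integer lists are the kind of discrete input that an embedding layer, as in step S3, would map to continuous vectors.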
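The masking step of the masked language modeling objective described in claim 4 can be illustrated as follows. This sketch only shows how 15% of the positions might be selected and hidden; the function name `mask_sequence`, the `<mask>` token string, the rounding rule and the fixed random seed are assumptions, and no model is trained here.

```python
import random


def mask_sequence(sequence, mask_rate=0.15, mask_token="<mask>", seed=0):
    """Hide a fraction of positions and remember the original residues.

    Assumes a non-empty sequence. Returns the masked token list and a
    dict {position: original residue} that a model would have to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(mask_rate * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    tokens = list(sequence)
    targets = {i: tokens[i] for i in positions}
    for i in positions:
        tokens[i] = mask_token
    return tokens, targets


sequence = "MKTAYIAKQRQISFVKSHFS"  # 20 residues, illustrative only
masked_tokens, targets = mask_sequence(sequence)
print(masked_tokens)
print(targets)
```

A model trained with this objective is asked to recover each entry of `targets` from the unmasked context in `masked_tokens`.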
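The sequence lengths in claims 5 and 6 can be checked with simple arithmetic. The sketch below assumes stride 1 and no padding in every convolution layer (the claims do not state these settings); the helper names are hypothetical. Under that assumption an input of length 100 shrinks to 85 after kernels 4, 6 and 8, and an input of length 1000 shrinks to 979 after kernels 4, 8 and 12.

```python
def conv1d_out_len(length, kernel, stride=1, padding=0):
    """Output length of a 1-D convolution (standard formula, dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1


def stacked_len(length, kernels):
    """Apply several convolutions in sequence and return the final length."""
    for kernel in kernels:
        length = conv1d_out_len(length, kernel)
    return length


def permute(shape, order):
    """Reorder a shape tuple the way a tensor permute would."""
    return tuple(shape[i] for i in order)


drug_len = stacked_len(100, (4, 6, 8))        # drug branch, input length 100
protein_len = stacked_len(1000, (4, 8, 12))   # protein branch, input length 1000

# Shapes before and after permute(2, 0, 1); "N" stands for the batch size.
drug_shape = permute(("N", 160, drug_len), (2, 0, 1))
protein_shape = permute(("N", 160, protein_len), (2, 0, 1))
print(drug_len, protein_len)
print(drug_shape, protein_shape)
```

After permute(2, 0, 1) the sequence axis comes first, which is the layout a sequence-first attention layer would expect.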
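The shared-weight cross-attention of step S5 and claim 7 can be sketched in plain Python. This is a simplified, single-head illustration under stated assumptions: it uses scaled dot-product attention, omits the multi-head split and any output projection, and all function names are hypothetical. What it does show is the pattern in the claims: each side forms its query from its own features and its key and value from the other side, and both sides use the same three weight matrices.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(q k^T / sqrt(d)) v."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v)


def shared_cross_attention(drug, protein, w_q, w_k, w_v):
    """Drug queries attend to protein keys/values and vice versa,
    with the same w_q, w_k, w_v on both sides (weight sharing)."""
    drug_out = attention(matmul(drug, w_q), matmul(protein, w_k), matmul(protein, w_v))
    protein_out = attention(matmul(protein, w_q), matmul(drug, w_k), matmul(drug, w_v))
    return drug_out, protein_out


identity = [[1.0, 0.0], [0.0, 1.0]]          # toy weights, 2 channels
drug = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 drug positions
protein = [[0.5, 2.0]]                       # 1 protein position
drug_out, protein_out = shared_cross_attention(drug, protein, identity, identity, identity)
print(drug_out)
print(protein_out)
```

Each output keeps the sequence length of its own side, which is consistent with the claim that feature dimensions are unchanged by the interaction.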
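The regulation and pooling of claim 9 can be illustrated with a small sketch. The exact regulation formula is not available in this text, so the convex combination used below, `w * original + (1 - w) * interacted`, is purely an assumption chosen for illustration, as are the function names. The max pooling over the sequence axis, which reduces each channel to a single value, follows the description in the claim.

```python
def selective_mix(original, interacted, w):
    """Blend original and interaction features element-wise.

    ASSUMPTION: a convex combination controlled by one coefficient w.
    The source only says that a selective forgetting coefficient W
    regulates the proportion of the two feature sets.
    """
    return [
        [w * o + (1.0 - w) * i for o, i in zip(row_o, row_i)]
        for row_o, row_i in zip(original, interacted)
    ]


def global_max_pool(features):
    """Max over the sequence axis: (channels, length) -> (channels,)."""
    return [max(channel) for channel in features]


# Toy example for one sample: 2 channels, sequence length 3.
original = [[1.0, 4.0, 2.0], [0.0, -1.0, 3.0]]
interacted = [[3.0, 0.0, 2.0], [2.0, 1.0, -1.0]]

mixed = selective_mix(original, interacted, w=0.5)
pooled = global_max_pool(mixed)
print(mixed)
print(pooled)
```

With 160 channels instead of 2, the pooled vector per sample would have 160 entries, matching the (N, 160) dimensions given in claim 9.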
Description
Drug and target interaction prediction method based on drug sequence descriptor
Technical Field
The invention relates to the technical field of drug target prediction, in particular to a drug and target interaction prediction method based on a drug sequence descriptor.
Background
Drug development is a key research direction in biomedical science, and pharmaceutical enterprises invest heavily in related exploration every year. Identifying drug-target interactions (DTI) is central to the drug development process. Traditional methods rely on laboratory chemistry and biology experiments to verify DTI, which is cumbersome and costly. Predicting DTI accurately by computational methods can significantly narrow the scope of experimental verification, improve research and development efficiency, save expenditure, and even reveal unknown mechanisms of action between a drug and its target. In recent years, deep learning and machine learning techniques have been widely used in bioinformatics, having shown excellent performance in fields such as natural language processing and computer vision. Meanwhile, gene sequencing technology keeps advancing, sequencing costs keep falling, and the scale of the biological data generated grows exponentially. This provides an important basis for using machine learning and deep learning methods to mine potentially valuable information in the data and to promote the development of biotechnology. A number of DTI-related datasets, such as DrugBank, KIBA and Davis, have been made public. Using these data, researchers have developed a variety of DTI prediction methods based on machine learning and deep learning, providing a new technological approach for drug discovery.
Good fingerprints encode key chemical and physical properties of the molecule, such as hydrophobicity, hydrogen bond donor/acceptor capability, aromaticity and the number of rotatable bonds. These properties are directly related to whether the molecule is able to bind to the target protein (i.e., whether it is "drug-like"). ESM representations, in turn, naturally contain evolutionary constraint information without requiring multiple sequence alignments, and this information is critical to understanding the functional regions and binding sites of proteins.
Disclosure of Invention
The technical problem to be solved by the invention is how to encode the drug sequence features and the protein sequence features, let the two kinds of features interact once representative features have been obtained, narrow the gap between different features, find the information they have in common, and make predictions automatically by machine so as to reduce laboratory research costs. To this end, a drug and target interaction prediction method based on a drug sequence descriptor is provided.
The invention solves the above technical problem through the following technical solution, which comprises the following steps:
S1, data preprocessing: encoding the drug SMILES sequence and the protein sequence through preset dictionary-type data;
S2, acquiring drug sequence descriptors and large-language-model protein sequence features: characterizing the drug SMILES sequence with two drug molecule descriptors, and extracting features of the protein sequence with two large protein language models;
S3, generating embedding vectors with an embedding layer: mapping the discrete data encoded in step S1 to a continuous vector space through embedding layers, converting the discrete data into high-dimensional vectors, and then performing a permute operation to convert the feature dimensions, to obtain processed drug features and protein features;
S4, feature extraction: performing hierarchical feature extraction on the drug features and the protein features output in step S3 through three one-dimensional convolution layers each, and converting the feature dimensions through a permute operation to fit the subsequent attention mechanism;
S5, feature interaction: using a cross-attention mechanism with shared weights to let the features extracted in step S4 interact, wherein the drug-side key vector and value vector are calculated from the protein features, the protein-side key vector and value vector are calculated from the drug features, and the weights of the first fully connected layers of the multi-head attention layers on the two sides are shared;
S6, feature stitching: adjusting the dimensions of the features after the interaction in step S5, regulating the proportion of the original features and the interaction features through selective forgetting coefficients, applying a max pooling layer, reducing the dimensions, res