CN-115713970-B - Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network


Abstract

The invention discloses a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network. The method first extracts global features of a protein sequence with a Transformer-Encoder, then extracts multi-scale local features from those global features with a multi-scale convolutional neural network, and finally fuses the extracted features to output the probability that the protein sequence is a transcription factor. By combining a Transformer-Encoder built on a multi-layer, multi-head attention mechanism with a multi-scale convolutional neural network, the method identifies with high precision whether an unknown protein sequence is a transcription factor. Because it needs only the protein sequence as input, it makes this judgment quickly and greatly improves the efficiency of protein annotation.

Inventors

LIU JUAN
YANG ZHIHUI

Assignees

Wuhan University (武汉大学)

Dates

Publication Date: 20260508
Application Date: 20221116

Claims (7)

1. A transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network, characterized by comprising the following steps:
  Step 1, constructing a training set: collecting protein sequences from a protein database, and labeling each protein sequence as a transcription factor or a non-transcription factor according to the corresponding protein annotation information. Said step 1 comprises the sub-steps of:
    1.1 selecting from the protein database the protein sequences that do not contain the nonstandard amino acids B, O, U and Z, to form a data set S1;
    1.2 removing from S1 the sequences whose length exceeds 1000, retaining only sequences of length less than or equal to 1000, and padding sequences shorter than 1000 with zeros to length 1000, to obtain a protein sequence data set S2;
    1.3 according to the GO annotation information of each protein in the protein database, assigning each protein sequence in S2 the label '1' (transcription factor) or '0' (non-transcription factor), to obtain a training data set $S = \{(X_i, c_i) \mid i = 1, \ldots, N\}$, wherein $X_i$ represents the $i$-th protein sequence in the data set, $c_i$ is the label of $X_i$, $c_i \in \{0, 1\}$, and $N$ is the size of $S$.
  Step 2, constructing a network structure: constructing a transcription factor prediction model as a network structure combining a Transformer-Encoder and a multi-scale convolutional neural network, wherein the Transformer-Encoder is used for obtaining the global features of the $i$-th protein sequence $X_i$, and the multi-scale convolutional neural network is used for carrying out transcription factor prediction and identification based on these global features. In said step 2, each element of a protein sequence $X_i$ represents the amino acid at the corresponding position, and the specific steps of obtaining the global features of $X_i$ with the Transformer-Encoder are as follows:
    2.1 obtaining the embedding vectors of $X_i$ by an embedding operation, the specific embedding method being as follows:
      2.1.1 first randomly initializing an embedding for each amino acid type, and then generating the corresponding vector for each amino acid in the sequence according to its amino acid type;
      2.1.2 extracting the positional information of the amino acids in the protein sequence using positional coding, wherein the positional coding identifies the amino acids at different positions in the protein by sine and cosine functions, and the positional coding of the amino acid at position $pos$ is $PE(pos, 2k) = \sin\left(pos / 10000^{2k/d}\right)$ and $PE(pos, 2k+1) = \cos\left(pos / 10000^{2k/d}\right)$, wherein $pos$ represents the position of an amino acid in the protein sequence, $d$ represents the dimension of the embedding vector, and $k$ is a natural number;
      2.1.3 combining each amino acid embedding with its corresponding positional coding to obtain the embedding vectors of the protein sequence;
    2.2 after obtaining the embedding vectors of the protein sequence $X_i$, taking them as the input of the Transformer-Encoder, computing the attention score between every two amino acids by means of an attention mechanism, and multiplying the attention scores with the embedding vectors to obtain the global features of the whole protein sequence $X_i$.
  Step 3, training the prediction model: training the network constructed in step 2 with the training set obtained in step 1 to obtain a trained transcription factor prediction model.
  Step 4, predicting the transcription factor: predicting whether an unknown protein sequence is a transcription factor by using the prediction model obtained in step 3, and outputting the prediction result.
2. The method of claim 1, wherein in step 1.3, if the GO annotation of a protein contains the GO term "transcription factor", or contains both of the GO terms "transcription regulation" and "DNA binding", the protein sequence is a transcription factor and is assigned "1"; otherwise, the protein sequence is a non-transcription factor and is assigned "0".
3. The method according to claim 1, wherein the network structure in step 2 comprises a Transformer-Encoder structure and a multi-scale convolutional neural network structure connected in series; the Transformer-Encoder structure retains only the Encoder part of the Transformer and is formed by stacking 6 Encoder blocks, each Encoder block containing 12 attention heads, and the Transformer-Encoder is used for extracting global features from an input protein sequence; the multi-scale convolutional neural network comprises four parallel convolutional sub-networks with convolution kernels of different sizes, two fully connected layers and an output layer, wherein the convolutional layers perform one-dimensional convolution operations, one for each kernel size, to obtain a plurality of convolutional features at different scales, a pooling layer pools each of the convolutional features to obtain features of reduced dimension, the pooled features are spliced and sent to the fully connected layers, and the prediction result calculated by the fully connected layers is output by the output layer.
4. The method of claim 3, wherein in step 2, each convolutional sub-network is composed of a convolutional layer, a normalization layer, a [...] layer and a [...] layer; wherein the output of the [...]-th sub-network is calculated as follows: [...], wherein x is the input of the convolutional layer; and the spliced output of the four sub-networks is as follows: [...].
5. The method of claim 1, wherein in step 2, the multi-scale convolutional neural network uses one-dimensional convolution kernels of different sizes to extract features of local protein sequence segments of different lengths.
6. The method of claim 5, wherein in step 2, the length of each convolution kernel is the same as the embedding dimension of the protein, the width is set within the range 4 to 20, and the widths of the convolution kernels are, in order, 4, 8, 12 and 16.
7. The method according to claim 1, wherein in step 2, the transcription factor prediction and identification based on the global features comprises: performing convolution operations of different scales on the global features of the protein sequence $X_i$ to obtain local features of $X_i$, one for each convolution kernel; splicing all the local features and finally inputting them into a fully connected layer; wherein the loss function is cross entropy, with the specific formula $L = -\left[\, y \log p + (1 - y) \log(1 - p) \,\right]$, wherein $p$ is the probability predicted by the model that the sample is a transcription factor, and $y$ is the sample label, taking the value 1 if the sample is a positive example and 0 otherwise; for all protein sequences, the final output probability is obtained by the SoftMax function, and the predicted class of protein $i$ is calculated from it by the following formula: [...]; if the predicted class is 1, the protein sequence $X_i$ is a transcription factor, otherwise it is a non-transcription factor.
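The training-set construction of claim 1 (sub-steps 1.1 to 1.3) and the labeling rule of claim 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the integer vocabulary, the function names, and matching GO terms by case-insensitive substring on the term names are hypothetical choices that the claims do not specify.

```python
MAX_LEN = 1000                      # length cap from sub-step 1.2
EXCLUDED = set("BOUZ")              # nonstandard residues named in sub-step 1.1
# Hypothetical vocabulary: every other uppercase letter gets an index >= 1,
# so that 0 stays free for zero padding.
VOCAB = {aa: i + 1 for i, aa in enumerate(
    c for c in "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if c not in EXCLUDED)}


def encode(seq):
    """Return a zero-padded list of MAX_LEN indices, or None if filtered out.

    Assumes the sequence is written in uppercase one-letter codes.
    """
    if len(seq) > MAX_LEN or any(c in EXCLUDED for c in seq):
        return None
    ids = [VOCAB[c] for c in seq]
    return ids + [0] * (MAX_LEN - len(ids))


def label(go_terms):
    """Rule of claim 2: 1 = transcription factor, 0 = non-transcription factor."""
    terms = [t.lower() for t in go_terms]

    def has(key):
        return any(key in t for t in terms)

    if has("transcription factor"):
        return 1
    if has("transcription regulation") and has("dna binding"):
        return 1
    return 0


def build_training_set(records):
    """records: iterable of (sequence, go_terms); returns S = [(X_i, c_i), ...]."""
    dataset = []
    for seq, go_terms in records:
        x = encode(seq)
        if x is not None:
            dataset.append((x, label(go_terms)))
    return dataset
```

For example, `build_training_set([("MKV", ["DNA binding", "transcription regulation"])])` yields one entry whose sequence is padded to length 1000 and whose label is 1.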
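The sine and cosine positional coding of sub-step 2.1.2 in claim 1 can be illustrated as follows. The sketch assumes the standard sinusoidal form from the original Transformer (base 10000, sine on even dimensions, cosine on odd dimensions) and an even embedding dimension d. Combining the position code with the embedding by element-wise addition in `add_positions` is likewise an assumption, and both function names are hypothetical.

```python
import math


def positional_encoding(seq_len, d, base=10000.0):
    """Return a seq_len x d table; row `pos` encodes position `pos`."""
    pe = [[0.0] * d for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d // 2):
            angle = pos / base ** (2 * k / d)
            pe[pos][2 * k] = math.sin(angle)       # even dimensions
            pe[pos][2 * k + 1] = math.cos(angle)   # odd dimensions
    return pe


def add_positions(embeddings):
    """Sub-step 2.1.3 (assumed to be addition): embedding + position code."""
    pe = positional_encoding(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(row, pe_row)]
            for row, pe_row in zip(embeddings, pe)]
```

Position 0 always receives the code `[0, 1, 0, 1, ...]`, and later positions receive distinct codes because each pair of dimensions oscillates at a different frequency.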
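Sub-step 2.2 of claim 1 describes computing an attention score between every pair of amino acids and multiplying the scores with the embedding vectors. A minimal single-head sketch of that idea is given below. It assumes scaled dot-product attention and, for brevity, uses the embeddings directly as queries, keys and values; the learned projections, the 12 heads and the 6 stacked blocks of claim 3 are deliberately left out.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """x: list of L embedding vectors of length d. Returns (weights, features)."""
    d = len(x[0])
    weights = []
    for q in x:
        # Score of this amino acid against every amino acid in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in x]
        weights.append(softmax(scores))
    # Each output row is an attention-weighted sum of all embedding vectors.
    features = [[sum(w * v[j] for w, v in zip(row, x)) for j in range(d)]
                for row in weights]
    return weights, features
```

Because every output row mixes information from all positions, the result depends on the whole sequence, which is the sense in which the claims call these features global.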
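The multi-scale convolution of claims 3, 5 and 6 can be sketched as four parallel branches whose kernels span the full embedding dimension and have widths 4, 8, 12 and 16. The code below is a simplified illustration: it assumes global max pooling, omits the normalization and other layers of claim 4, and uses random weights from the hypothetical helper `random_kernels` in place of learned ones. The input sequence must contain at least 16 positions.

```python
import random

KERNEL_WIDTHS = (4, 8, 12, 16)   # widths given in claim 6


def conv1d_max(features, kernel):
    """Slide one kernel (width x d) along the sequence, then max-pool."""
    width = len(kernel)
    responses = []
    for start in range(len(features) - width + 1):
        window = features[start:start + width]
        responses.append(sum(f * w
                             for f_row, w_row in zip(window, kernel)
                             for f, w in zip(f_row, w_row)))
    return max(responses)


def multi_scale_features(features, kernels_by_width):
    """Concatenate the pooled responses of every kernel from every branch."""
    fused = []
    for width in KERNEL_WIDTHS:
        for kernel in kernels_by_width[width]:
            fused.append(conv1d_max(features, kernel))
    return fused


def random_kernels(d, per_width=2, seed=0):
    """Hypothetical stand-in for learned weights."""
    rng = random.Random(seed)
    return {w: [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(w)]
                for _ in range(per_width)]
            for w in KERNEL_WIDTHS}
```

With two kernels per width, the fused vector has 8 entries, which would then be passed to the fully connected layers.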
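The cross entropy loss and the SoftMax output of claim 7 can be illustrated with plain Python. The sketch assumes a two-class output whose second entry is the transcription factor class and a decision threshold of 0.5. The threshold and the function names are assumptions, since the decision formula is not legible in this text.

```python
import math


def softmax(logits):
    """Numerically stable softmax over the output layer's raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, p, eps=1e-12):
    """Binary cross entropy for label y in {0, 1} and predicted probability p."""
    p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def predict(logits, threshold=0.5):
    """Return (p, c): p = probability of class 1, c = predicted class."""
    p = softmax(logits)[1]
    return p, 1 if p > threshold else 0
```

A confident correct prediction gives a small loss, for example `cross_entropy(1, 0.9) < cross_entropy(1, 0.1)`, while an uninformative prediction of 0.5 costs `log(2)`.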

Description

Transcription factor identification method based on Transformer-Encoder and multi-scale convolutional neural network

Technical Field

The invention relates to the field of protein function annotation, and in particular to a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network. Because a transcription factor is a protein with a special function, the invention belongs to the application of deep learning to protein function annotation.

Background

A transcription factor is a protein molecule with a special structure that regulates gene expression. Transcription factors regulate the expression of a target gene by binding specifically to a DNA sequence, promoting or inhibiting the transcription of particular DNA into RNA.

Traditionally, transcription factors are identified by biochemical experiments, which are time-consuming and expensive and cannot be applied on a large scale. Homology search with BLAST cannot determine whether a protein with low homology to the known proteins in the database is a transcription factor. Prediction methods based on traditional machine learning can recognize transcription factors from protein structure or sequence information, but the features related to transcription factors must be designed by hand, which requires strong domain knowledge, and the prediction precision is not high. Deep learning has the advantage of learning features directly from protein sequences, but most existing methods build their prediction models on convolutional neural networks.
Because of the limited extent of convolution kernels, such methods can learn feature representations automatically but capture only local features, that is, relationships between amino acids that are close together. They cannot learn global features, that is, relationships between amino acids that are far apart, and this limits the prediction accuracy of the model.

Disclosure of Invention

In view of these technical problems, the invention provides a transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network, which extracts global and local information from a protein sequence at the same time and automatically obtains a comprehensive feature representation of transcription factors, thereby further improving prediction accuracy.

The technical scheme provided by the invention is as follows. A transcription factor identification method based on a Transformer-Encoder and a multi-scale convolutional neural network comprises the following steps:
  Step 1, constructing a training set: collecting protein sequences from a protein database, and labeling each protein sequence as a transcription factor or a non-transcription factor according to the corresponding protein annotation information;
  Step 2, constructing a network structure: constructing a transcription factor prediction model as a network structure combining a Transformer-Encoder and a multi-scale convolutional neural network, wherein the Transformer-Encoder is used for obtaining the global features of the $i$-th protein sequence $X_i$, and the multi-scale convolutional neural network is used for carrying out transcription factor prediction and identification based on these global features;
  Step 3, training the prediction model: training the network constructed in step 2 with the training set obtained in step 1 to obtain a trained transcription factor prediction model;
  Step 4, predicting the transcription factor: predicting whether an unknown protein sequence is a transcription factor by using the prediction model obtained in step 3, and outputting the prediction result.

Further, step 1 includes the following sub-steps:
  1.1 selecting from the protein database the protein sequences that do not contain the nonstandard amino acids B, O, U and Z, to form a data set S1;
  1.2 removing from S1 the sequences whose length exceeds 1000, retaining only sequences of length less than or equal to 1000, and padding sequences shorter than 1000 with zeros to length 1000, to obtain a protein sequence data set S2;
  1.3 according to the GO annotation information of each protein in the protein database, assigning each protein sequence in S2 the label '1' (transcription factor) or '0' (non-transcription factor), to obtain a training data set $S = \{(X_i, c_i) \mid i = 1, \ldots, N\}$, wherein $X_i$ represents the $i$-th protein sequence in the data set, $c_i$ is the label of $X_i$, $c_i \in \{0, 1\}$, and $N$ is the size of $S$.

Further, in step 1.3, if the GO annotation of the protein contains the GO term "transcription factor", or contains both "transcription regulation" and "DN