CN-121983141-A - Protein sequence fluorescence intensity prediction method, system, equipment and storage medium


Abstract

The application belongs to the field of protein fluorescence intensity prediction, and particularly relates to a protein sequence fluorescence intensity prediction method, system, equipment and storage medium. The method comprises the following steps: extracting features from a training data set, combining them with random noise, and training a generator and a discriminator to obtain a generative adversarial model; using the generative adversarial model to supplement the data and obtain a prediction data set; extracting semantic features from the prediction data set, performing local pattern enhancement on the hidden layer features, and fusing the results to obtain final features; inputting the final features into a prediction network and training it to obtain an optimal prediction network; and inputting an actual protein sequence into the optimal prediction network to obtain the fluorescence intensity corresponding to that sequence. The application improves the accuracy of predicting the fluorescence intensity of protein sequences.

Inventors

XIANG HUAIJIN
XIONG WEI

Assignees

South China University of Technology (华南理工大学)

Dates

Publication Date: 20260505
Application Date: 20260407

Claims (10)

1. A protein sequence fluorescence intensity prediction method, characterized by comprising the following steps: acquiring a labeled protein sequence training data set, wherein the training data set comprises amino acid sequences of proteins and corresponding fluorescence intensity values; inputting the training data set into a protein language model to obtain a semantic representation vector, and fusing the semantic representation vector with the fluorescence intensity value to obtain a condition vector; setting random noise, splicing the random noise with the condition vector to obtain a spliced vector, inputting the spliced vector into a generator to obtain generated data, and inputting the generated data and the condition vector into a discriminator to obtain a generated data score and a condition vector score; calculating a generator loss according to the generated data, calculating a discriminator loss according to the generated data score and the condition vector score, obtaining optimal generator parameters and optimal discriminator parameters when the discriminator loss and the generator loss are minimum, and updating the generator and the discriminator with the optimal generator parameters and the optimal discriminator parameters to obtain an optimal generator and an optimal discriminator, wherein the optimal generator and the optimal discriminator form a generative adversarial model; setting a target protein sequence set, inputting the target protein sequence set into the generative adversarial model to obtain a target fluorescence intensity value set, forming a supplementary data set from the target protein sequence set and the corresponding target fluorescence intensity value set, and taking the supplementary data set and the training data set together as a prediction data set; inputting the prediction data set into a protein language model to obtain hidden layer features of the last layer; extracting CLS features and average pooling features from the hidden layer features, and splicing the CLS features and the average pooling features to obtain semantic features; performing nonlinear projection on the semantic features to obtain nonlinear projection features, performing local pattern enhancement on the hidden layer features to obtain specific features, and fusing the nonlinear projection features and the specific features to obtain final features; inputting the final features into a prediction network to obtain a prediction result, obtaining a loss value according to the prediction result and the fluorescence intensity value, obtaining optimal prediction network parameters when the loss value is minimum, and updating the prediction network with the optimal prediction network parameters to obtain an optimal prediction network; and inputting an actual protein sequence into the optimal prediction network to obtain the fluorescence intensity corresponding to the actual protein sequence.
2. The method for predicting protein sequence fluorescence intensity according to claim 1, wherein calculating the generator loss according to the generated data comprises: calculating a discriminator reward and a generator reward for the generated data, and summing the discriminator reward and the generator reward to obtain an expected reward; calculating an average reward corresponding to the generated data; and calculating the generator loss according to the average reward and the expected reward.
3. The method of claim 1, wherein fusing the semantic representation vector with the fluorescence intensity value to obtain a condition vector comprises: fusing the semantic representation vector with the fluorescence intensity value to obtain a condition vector expressed as: $c = [\,h \,;\, E(y)\,] \in \mathbb{R}^{d_h + d_y}$, wherein $c$ is the condition vector, $h$ is the semantic representation vector, $E$ is a learnable embedding projection layer, $[\,\cdot\,;\,\cdot\,]$ denotes splicing, $\mathbb{R}$ is the set of real numbers, $d_y$ is the dimension after the fluorescence intensity value $y$ is converted into a vector, and $d_h$ is the dimension of the semantic representation vector.
4. The method for predicting the fluorescence intensity of a protein sequence according to claim 1, wherein setting random noise, splicing the random noise with the condition vector to obtain a spliced vector, inputting the spliced vector into a generator to obtain generated data, and inputting the generated data and the condition vector into a discriminator to obtain a generated data score and a condition vector score comprises: setting random noise, and splicing the random noise with the condition vector to obtain a spliced vector, wherein the spliced vector is expressed as: $\tilde{z} = [\,z \,;\, c\,]$, wherein $\tilde{z}$ is the spliced vector, $z$ is the random noise, and $c$ is the condition vector; inputting the spliced vector into the generator to obtain a data matrix; sampling each position in the data matrix to obtain a sampling result; and calculating the Gumbel-Softmax distribution of the sampling result to obtain the generated data, and inputting the generated data and the condition vector into the discriminator to obtain the generated data score and the condition vector score.
5. The method for predicting the fluorescence intensity of a protein sequence according to claim 1, wherein performing nonlinear projection on the semantic features to obtain nonlinear projection features, performing local pattern enhancement on the hidden layer features to obtain specific features, and fusing the nonlinear projection features and the specific features to obtain final features comprises: performing nonlinear projection on the semantic features to obtain nonlinear projection features, wherein the nonlinear projection features are expressed as: $p = W_2\,\sigma(W_1 s + b_1) + b_2$, wherein $p$ is the nonlinear projection feature, $\sigma$ is the activation function, $W_1$, $b_1$, $W_2$ and $b_2$ are learnable parameters, and $s$ is the semantic feature; processing the hidden layer features through a lightweight convolutional network to perform local pattern enhancement and obtain specific features, wherein the specific features are expressed as: $F_{\mathrm{conv}} = \mathrm{Conv1D}(H;\, k,\, n_{\mathrm{out}})$, $f_{\mathrm{spec}} = \mathrm{GMP}(F_{\mathrm{conv}})$, wherein $F_{\mathrm{conv}}$ is the one-dimensional convolution output, $\mathrm{Conv1D}$ is the one-dimensional convolution operation, $H$ is the hidden layer features, $k$ is the convolution kernel size, $n_{\mathrm{out}}$ is the number of channels output by the convolutional layer, $\mathrm{GMP}$ is the global max pooling operation, and $f_{\mathrm{spec}}$ is the specific feature; and fusing the nonlinear projection features and the specific features to obtain the final features.
6. The method for predicting protein sequence fluorescence intensity according to claim 1, wherein extracting CLS features and average pooling features from the hidden layer features, and splicing the CLS features and the average pooling features to obtain semantic features comprises: extracting CLS features and average pooling features from the hidden layer features; and splicing the CLS features and the average pooling features to obtain semantic features, wherein the semantic features are expressed as: $s = [\,h_{\mathrm{CLS}} \,;\, h_{\mathrm{mean}}\,]$, wherein $s$ is the semantic feature, $h_{\mathrm{CLS}}$ is the CLS feature, $h_{\mathrm{mean}}$ is the average pooling feature, and $[\,\cdot\,;\,\cdot\,]$ denotes splicing.
7. The method for predicting the fluorescence intensity of a protein sequence according to claim 1, wherein inputting the final features into a prediction network to obtain a prediction result, obtaining a loss value according to the prediction result and the fluorescence intensity value, obtaining optimal prediction network parameters when the loss value is minimum, and updating the prediction network with the optimal prediction network parameters to obtain the optimal prediction network comprises: inputting the final features into the prediction network to obtain a prediction result; obtaining a main task loss according to the prediction result and the fluorescence intensity value; obtaining a protein class according to the final features; obtaining a class loss according to the protein class and the prediction result; performing a weighted summation of the main task loss and the class loss to obtain the loss value; and obtaining the optimal prediction network parameters when the loss value is minimum, and updating the prediction network with the optimal prediction network parameters to obtain the optimal prediction network.
8. A protein sequence fluorescence intensity prediction system, comprising: an acquisition module, used for acquiring a labeled protein sequence training data set, wherein the training data set comprises amino acid sequences of proteins and corresponding fluorescence intensity values; a fusion module, used for inputting the training data set into a protein language model to obtain a semantic representation vector, and fusing the semantic representation vector with the fluorescence intensity value to obtain a condition vector; a setting module, used for setting random noise, splicing the random noise with the condition vector to obtain a spliced vector, inputting the spliced vector into a generator to obtain generated data, and inputting the generated data and the condition vector into a discriminator to obtain a generated data score and a condition vector score; a generative adversarial model training module, used for calculating a generator loss according to the generated data, calculating a discriminator loss according to the generated data score and the condition vector score, obtaining optimal generator parameters and optimal discriminator parameters when the discriminator loss and the generator loss are minimum, and updating the generator and the discriminator with the optimal generator parameters and the optimal discriminator parameters to obtain an optimal generator and an optimal discriminator, wherein the optimal generator and the optimal discriminator form a generative adversarial model; a supplementing module, used for setting a target protein sequence set, inputting the target protein sequence set into the generative adversarial model to obtain a target fluorescence intensity value set, forming a supplementary data set from the target protein sequence set and the corresponding target fluorescence intensity value set, and taking the supplementary data set and the training data set together as a prediction data set; a hidden layer feature extraction module, used for inputting the prediction data set into a protein language model to obtain hidden layer features of the last layer; a splicing module, used for extracting CLS features and average pooling features from the hidden layer features, and splicing the CLS features and the average pooling features to obtain semantic features; an enhancement module, used for performing nonlinear projection on the semantic features to obtain nonlinear projection features, performing local pattern enhancement on the hidden layer features to obtain specific features, and fusing the nonlinear projection features and the specific features to obtain final features; a prediction model training module, used for inputting the final features into a prediction network to obtain a prediction result, obtaining a loss value according to the prediction result and the fluorescence intensity value, obtaining optimal prediction network parameters when the loss value is minimum, and updating the prediction network with the optimal prediction network parameters to obtain an optimal prediction network; and a prediction module, used for inputting an actual protein sequence into the optimal prediction network to obtain the fluorescence intensity corresponding to the actual protein sequence.
9. A terminal device comprising a memory and a processor, characterized in that the memory stores a computer program capable of running on the processor, and the processor, when loading and executing the computer program, performs the method according to any one of claims 1 to 7.
10. A computer readable storage medium having a computer program stored therein, wherein the computer program, when loaded and executed by a processor, performs the method according to any one of claims 1 to 7.
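As a rough illustration of the sampling step recited in claim 4, the following Python sketch treats each row of the generator's data matrix as unnormalized scores over the 20 standard amino acids and draws a relaxed one-hot sample per position with the Gumbel-Softmax trick. The vocabulary `AMINO_ACIDS`, the temperature values, and the helper names `gumbel_softmax` and `sample_sequence` are assumptions made for this example; the source does not specify them.

```python
import math
import random

# Assumed vocabulary: the 20 standard amino acids (not specified in the source).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot sample for one row of logits."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    # Numerically stable softmax over the perturbed scores.
    peak = max(noisy)
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def sample_sequence(data_matrix, temperature=0.5, rng=random):
    """Sample every position of the matrix and decode the argmax residues."""
    soft_rows = [gumbel_softmax(row, temperature, rng) for row in data_matrix]
    residues = "".join(
        AMINO_ACIDS[max(range(len(row)), key=row.__getitem__)]
        for row in soft_rows
    )
    return soft_rows, residues
```

Lower temperatures push each row closer to a one-hot vector, while higher temperatures keep the sample smoother; in a framework with automatic differentiation the soft rows would be what the discriminator receives.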
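The feature construction recited in claims 5 and 6 can be sketched in plain Python on a toy hidden-state matrix. This is only a minimal sketch under stated assumptions: the CLS vector is taken to be token 0, mean pooling runs over all tokens, the activation is ReLU, the convolution uses no padding, and fusion is done by concatenation. None of these choices, nor the helper names and toy weights, are given in the source.

```python
def mean_pool(hidden):
    """Average the token vectors over the sequence axis."""
    return [sum(column) / len(hidden) for column in zip(*hidden)]


def semantic_feature(hidden):
    """Splice the CLS vector (assumed to be token 0) with the mean-pooled vector."""
    return list(hidden[0]) + mean_pool(hidden)


def project(semantic, w1, b1, w2, b2):
    """Two-layer nonlinear projection with ReLU (assumed activation)."""
    inner = [max(0.0, sum(w * x for w, x in zip(row, semantic)) + b)
             for row, b in zip(w1, b1)]
    return [sum(w * a for w, a in zip(row, inner)) + b
            for row, b in zip(w2, b2)]


def conv1d_global_max(hidden, kernels):
    """Unpadded 1-D convolution along the sequence, then global max pooling.

    kernels[c][j][d] is the weight for output channel c, window offset j,
    hidden dimension d. Returns one value per output channel.
    """
    pooled = []
    for kernel in kernels:
        width = len(kernel)
        responses = []
        for start in range(len(hidden) - width + 1):
            window = hidden[start:start + width]
            responses.append(sum(w * x
                                 for k_row, h_row in zip(kernel, window)
                                 for w, x in zip(k_row, h_row)))
        pooled.append(max(responses))
    return pooled


def fuse(projected, specific):
    """Fuse by concatenation (an assumption; the fusion operator is not given)."""
    return list(projected) + list(specific)


# Toy data: 3 tokens with hidden size 2; all weights are hypothetical.
hidden = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
semantic = semantic_feature(hidden)                      # [1.0, 2.0, 3.0, 4.0]
projected = project(semantic,
                    w1=[[1.0, 0.0, 0.0, 0.0], [0.0, -1.0, 0.0, 0.0]],
                    b1=[0.0, 0.0],
                    w2=[[2.0, 3.0]],
                    b2=[0.5])                            # [2.5]
specific = conv1d_global_max(hidden, [[[1.0, 0.0], [0.0, 1.0]]])  # [9.0]
final = fuse(projected, specific)                        # [2.5, 9.0]
```

The semantic branch summarizes the whole sequence, while the convolutional branch responds to short local windows, which is why the two are kept separate until the final fusion.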
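The weighted loss recited in claim 7 can be illustrated with a short Python sketch. The claim does not name the individual loss functions, so mean squared error for the main task, cross-entropy for the class loss against hypothetical class labels, and the weight `class_weight=0.1` are all assumptions made for this example.

```python
import math


def mse(predictions, targets):
    """Mean squared error between predicted and measured intensities."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def cross_entropy(class_probs, class_labels):
    """Average negative log-probability assigned to the labeled class."""
    eps = 1e-12  # guard against log(0)
    return -sum(math.log(max(probs[label], eps))
                for probs, label in zip(class_probs, class_labels)) / len(class_labels)


def total_loss(predictions, targets, class_probs, class_labels, class_weight=0.1):
    """Weighted sum of the main task loss and the class loss."""
    main = mse(predictions, targets)
    auxiliary = cross_entropy(class_probs, class_labels)
    return main + class_weight * auxiliary
```

A small class weight keeps the regression objective dominant while still letting the class signal shape the shared features; the appropriate value would have to be tuned on validation data.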

Description

Protein sequence fluorescence intensity prediction method, system, equipment and storage medium

Technical Field

The invention belongs to the field of protein fluorescence intensity prediction, and particularly relates to a protein sequence fluorescence intensity prediction method, system, equipment and storage medium.

Background

Protein fluorescent labeling technology is an indispensable core tool in fields such as cell imaging, functional research and drug screening. By covalently or non-covalently binding fluorescent groups (such as green fluorescent protein GFP, FITC, cy, cy5 and the like) to target proteins, it enables visualization, localization, quantification and interaction analysis of the dynamic behavior of proteins in living cells or tissues. As life science research moves toward high-throughput, dynamic and quantitative approaches, the requirements on the accuracy, stability and predictability of protein fluorescent labeling are becoming increasingly stringent. In recent years, artificial intelligence, and pre-trained language models in particular, has shown great potential in biological sequence analysis, providing a new paradigm for automatically learning functional and structural features from massive protein sequence data. However, protein fluorescence intensity prediction requires a large amount of labeled data, and labeling protein sequences with fluorescence intensities can only be performed manually. The labeled data therefore grows slowly and remains small in volume, which limits what a prediction model can learn and keeps prediction accuracy low.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a protein sequence fluorescence intensity prediction method, system, equipment and storage medium in which the protein sequence data are enriched by a generative adversarial network, so that the trained model can predict fluorescence intensity more accurately.
A method for predicting fluorescence intensity of a protein sequence, comprising: acquiring a labeled protein sequence training data set, wherein the training data set comprises amino acid sequences of proteins and corresponding fluorescence intensity values; inputting the training data set into a protein language model to obtain a semantic representation vector, and fusing the semantic representation vector with the fluorescence intensity value to obtain a condition vector; setting random noise, splicing the random noise with the condition vector to obtain a spliced vector, inputting the spliced vector into a generator to obtain generated data, and inputting the generated data and the condition vector into a discriminator to obtain a generated data score and a condition vector score; calculating a generator loss according to the generated data, calculating a discriminator loss according to the generated data score and the condition vector score, obtaining optimal generator parameters and optimal discriminator parameters when the discriminator loss and the generator loss are minimum, and updating the generator and the discriminator with the optimal generator parameters and the optimal discriminator parameters to obtain an optimal generator and an optimal discriminator, wherein the optimal generator and the optimal discriminator form a generative adversarial model; setting a target protein sequence set, inputting the target protein sequence set into the generative adversarial model to obtain a target fluorescence intensity value set, forming a supplementary data set from the target protein sequence set and the corresponding target fluorescence intensity value set, and taking the supplementary data set and the training data set together as a prediction data set; inputting the prediction data set into a protein language model to obtain hidden layer features of the last layer; extracting CLS features and average pooling features from the hidden layer features, and splicing the CLS features and the average pooling features to obtain semantic features; performing nonlinear projection on the semantic features to obtain nonlinear projection features, performing local pattern enhancement on the hidden layer features to obtain specific features, and fusing the nonlinear projection features and the specific features to obtain final features; inputting the final features into a prediction network to obtain a prediction result, obtaining a loss value according to the prediction result and the fluorescence intensity value, obtaining optimal prediction network parameters when the loss value is minimum, and updating the prediction network with the optimal prediction network parameters to obtain an optimal prediction network; and inputting an actual protein sequence into the optimal prediction network to obtain the fluorescence intensity corresponding to the actual protein sequence. Optionally, calculating the generator loss