CN-121983129-A - Entity knowledge driven language model based method for identifying succinylation sites

Abstract

The invention belongs to the interdisciplinary field of bioinformatics and computational biology and relates to a method for identifying succinylation sites based on an entity-knowledge-driven language model. The method first fuses amino acid composition (AAC), One-Hot encoding and residue-distance-based features (DR) to construct a multi-dimensional composite feature set. It then couples a LUKE model enhanced with entity knowledge and a RoBERTa model optimized for robustness, each with a bidirectional long short-term memory network (BiLSTM); contextual dynamics are modeled by extracting deep features from the pre-trained models, and the final prediction model is obtained through fine-tuning and an ensemble strategy. Finally, the invention verifies the biophysical and chemical plausibility of the model's decisions by combining t-SNE feature visualization with SHAP interpretability analysis, achieving accurate identification of protein succinylation sites.

Inventors

ZUO YUN
YAN HONGJIN
DU PEI
CAO YUJIE
ZHU HAORAN

Assignees

Jiangnan University (江南大学)

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (8)

1. A method for identifying succinylation sites based on an entity-knowledge-driven language model, characterized by comprising the following steps:
First step, data set collection and preprocessing: using the dbPTM database and the UniProt database as data sources, obtaining experimentally verified positive samples of succinylated sites and initial negative samples of non-succinylated sites; extending each sample sequence bidirectionally to 101 positions centered on lysine K; and applying mirror-image complementation and padding with the special symbol 'X' to samples of insufficient length;
Second step, multidimensional feature extraction and fusion: extracting AAC features, One-Hot features and DR features in parallel from the preprocessed 101-position sequence, and fusing them to form a composite feature set;
Third step, model construction and optimization: constructing a LUKE semantic encoding branch and a RoBERTa semantic encoding branch, fine-tuning each pre-trained model, extracting the layer-12 output of the fine-tuned model as the input to a BiLSTM network, and obtaining the prediction probability of each branch through a fully connected layer and a Sigmoid activation function;
Fourth step, model evaluation: evaluating the performance of the model by cross-validation and on an independent test set;
Fifth step, interpretability analysis: analyzing the model output with SHAP and, in combination with t-SNE feature visualization, verifying the biophysical plausibility of the model's decisions.
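The window construction in the first step can be illustrated with a minimal Python sketch. The mirroring rule used here is an assumption, since the claim text does not reproduce the formula: a position that falls outside the protein is filled with the residue at the same distance on the opposite side of the central K, and with 'X' when that residue does not exist either. The function name `extract_window` and its parameters are hypothetical.

```python
def extract_window(sequence: str, center: int, half_width: int = 50, pad: str = "X") -> str:
    """Return a (2 * half_width + 1)-position window centred on sequence[center].

    Assumed mirroring rule (not taken from the patent text): an out-of-range
    position is replaced by the residue reflected across the central lysine;
    if the reflected position is also out of range, the pad symbol is used.
    """
    if sequence[center] != "K":
        raise ValueError("center must index a lysine (K)")
    n = len(sequence)
    window = []
    for offset in range(-half_width, half_width + 1):
        pos = center + offset
        if 0 <= pos < n:
            window.append(sequence[pos])
            continue
        mirrored = center - offset  # same distance, opposite side of the centre
        window.append(sequence[mirrored] if 0 <= mirrored < n else pad)
    return "".join(window)


if __name__ == "__main__":
    # Lysine at the N-terminus: the missing left flank is mirrored from the right.
    print(extract_window("KAC", center=0, half_width=2))
```

With the default `half_width=50` the function always returns 101 positions, which matches the window length stated in the claim.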
2. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 1, wherein the first step comprises the following specific operations:
1.1 Data source and initial screening: taking the dbPTM database as the succinylation site data source, obtaining experimentally verified positive samples of lysine succinylation sites and the corresponding protein entries, and obtaining the complete amino acid sequence of each protein entry from the UniProt database; setting a window radius r=10 centered on lysine K and constructing positive sample sequence fragments of length 2r+1, expressed in standardized form as: [formula omitted];
1.2 Positive sample sequence extension and padding: when the central site K is close to either end of the protein sequence and the window cannot be extended equally on both sides, mirror-image complementation is adopted, with the mirror-image complementation relationship as follows: [formula omitted]; samples whose length is still less than 101 positions after mirror-image complementation are padded to 101 positions with the special symbol 'X';
1.3 Negative sample construction and screening: selecting, from the complete protein sequences, the sequence segments used for positive sample extension that have not undergone padding; then, centered on lysine K, cutting out 101-position sequence fragments extending forward and backward as the negative sample candidate set; and screening out of the candidate set the fragments whose similarity to any positive sample sequence is higher than 80% and the fragments containing known succinylation site characteristic motifs, thereby obtaining the negative sample set;
1.4 Sample balancing: down-sampling the negative sample set with the NearMiss-1 algorithm: for each majority-class sample, determining its k nearest minority-class neighbors and calculating the average distance from the majority-class sample to these k neighbors; sorting the majority-class samples by average distance in ascending order; and selecting the top-ranked majority-class samples to form the down-sampled majority-class subset, thereby balancing the numbers of positive and negative samples.
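The NearMiss-1 selection rule in step 1.4 can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the patented implementation: samples are assumed to be numeric feature vectors, distances are assumed to be Euclidean, and the function name `nearmiss1` and its parameters are hypothetical.

```python
import math
from typing import List, Sequence


def nearmiss1(
    majority: Sequence[Sequence[float]],
    minority: Sequence[Sequence[float]],
    n_keep: int,
    k: int = 3,
) -> List[int]:
    """Return indices of the n_keep majority samples kept by NearMiss-1.

    Each majority sample is scored by its mean distance to its k nearest
    minority samples; the samples with the smallest scores are retained.
    Euclidean distance is an assumption made for this sketch.
    """
    if not minority:
        raise ValueError("minority set must not be empty")
    k = min(k, len(minority))
    scored = []
    for idx, x in enumerate(majority):
        nearest = sorted(math.dist(x, m) for m in minority)[:k]
        scored.append((sum(nearest) / k, idx))
    scored.sort()  # ascending mean distance, ties broken by index
    return [idx for _, idx in scored[:n_keep]]


if __name__ == "__main__":
    positives = [(0.0, 0.0), (1.0, 0.0)]
    negatives = [(0.0, 1.0), (5.0, 5.0), (2.0, 0.0), (10.0, 10.0)]
    # Keep as many negatives as there are positives.
    print(nearmiss1(negatives, positives, n_keep=len(positives), k=2))
```

Setting `n_keep` to the number of positive samples gives the balanced data set described in the claim.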
3. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 1, wherein the second step comprises the following specific operations:
2.1 Amino acid composition (AAC) feature extraction: given an input protein sequence, mapping it to a 21-dimensional vector whose j-th component is the frequency of the j-th amino acid [formula omitted];
2.2 One-Hot encoding feature extraction: defining a mapping that maps each residue in the sequence to a 21-dimensional binary vector, so that the sequence is represented as an L × 21 sparse matrix;
2.3 Residue-distance-based (DR) feature extraction: from the positions at which the j-th amino acid occurs in the sequence, computing the average position, the normalized mean position and the distance residue feature [formulas omitted];
2.4 Feature fusion: concatenating the AAC, One-Hot and DR features to form a composite feature vector that serves as the model input.
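The three feature types can be sketched in Python as below. Several details are assumptions because the formulas are not reproduced in the claim text: the 21-symbol alphabet is taken to be the 20 standard amino acids plus the padding symbol 'X', unknown residues are mapped to 'X', the DR value is taken to be the normalized mean position (with 0.0 for absent amino acids), and the One-Hot matrix is flattened before concatenation. All function names are hypothetical.

```python
from typing import List

# Assumed alphabet: 20 standard amino acids plus the padding symbol 'X'.
ALPHABET = "ACDEFGHIKLMNPQRSTVWYX"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def _idx(residue: str) -> int:
    # Residues outside the alphabet are treated as 'X' (an assumption).
    return INDEX.get(residue, INDEX["X"])


def aac(seq: str) -> List[float]:
    """21-dimensional amino acid composition: count of each symbol / length."""
    counts = [0] * len(ALPHABET)
    for residue in seq:
        counts[_idx(residue)] += 1
    return [c / len(seq) for c in counts]


def one_hot(seq: str) -> List[List[int]]:
    """L x 21 binary matrix with a single 1 per row."""
    rows = []
    for residue in seq:
        row = [0] * len(ALPHABET)
        row[_idx(residue)] = 1
        rows.append(row)
    return rows


def dr(seq: str) -> List[float]:
    """Assumed DR feature: mean 1-based position of each symbol, divided by L."""
    sums = [0] * len(ALPHABET)
    counts = [0] * len(ALPHABET)
    for pos, residue in enumerate(seq, start=1):
        sums[_idx(residue)] += pos
        counts[_idx(residue)] += 1
    return [(s / c) / len(seq) if c else 0.0 for s, c in zip(sums, counts)]


def composite_features(seq: str) -> List[float]:
    """Concatenate AAC, flattened One-Hot and DR into one vector."""
    flat = [float(v) for row in one_hot(seq) for v in row]
    return aac(seq) + flat + dr(seq)


if __name__ == "__main__":
    print(len(composite_features("K" * 101)))
```

Under these assumptions a 101-position window yields 21 + 101 × 21 + 21 values.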
4. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 1, wherein the third step comprises the following specific operations:
3.1 Construction of the LUKE semantic encoding branch: tokenizing the protein sequence, annotating the functional-domain or structural-domain information corresponding to the amino acid residues, inputting the result to the LUKE model and fine-tuning it; extracting the layer-12 output of the fine-tuned LUKE model as the time-series input to a bidirectional long short-term memory network (BiLSTM), wherein: forward hidden state update: [formula omitted]; backward hidden state update: [formula omitted]; concatenation of the hidden states at the same time step: [formula omitted]; the concatenated hidden state is input to a fully connected layer, and the LUKE semantic encoding branch prediction probability is obtained through Sigmoid activation;
3.2 Construction of the RoBERTa semantic encoding branch: inputting the protein sequence to the RoBERTa model and fine-tuning it, extracting the layer-12 output of the fine-tuned RoBERTa model as the BiLSTM input, and obtaining the RoBERTa semantic encoding branch prediction probability through a fully connected layer with Sigmoid activation;
3.3 Parameter optimization: fine-tuning of the LUKE and RoBERTa models uses the binary cross-entropy loss function BCELoss and the AdamW optimizer, with learning-rate decay via a LinearLR learning-rate scheduler, to update the model parameters;
3.4 Model integration and output: taking a probability-weighted average of the prediction probabilities of the LUKE semantic encoding branch and the RoBERTa semantic encoding branch to obtain the final prediction probability: [formula omitted]; wherein [conditions omitted]; and outputting the succinylation site recognition result according to the final prediction probability.
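The probability-weighted averaging in step 3.4 can be sketched as follows. Because the weighting formula and its conditions are not reproduced in the claim text, this sketch assumes two non-negative weights that sum to 1 and a decision threshold of 0.5; the names `ensemble_probability`, `is_succinylated`, `w_luke` and `threshold` are hypothetical.

```python
def ensemble_probability(p_luke: float, p_roberta: float, w_luke: float = 0.5) -> float:
    """Weighted average of the two branch probabilities.

    Assumption: the weights are w_luke and (1 - w_luke), so they are
    non-negative and sum to 1. The actual constraint is not given in the text.
    """
    if not 0.0 <= w_luke <= 1.0:
        raise ValueError("w_luke must lie in [0, 1]")
    return w_luke * p_luke + (1.0 - w_luke) * p_roberta


def is_succinylated(p_final: float, threshold: float = 0.5) -> bool:
    """Assumed decision rule: predict a succinylation site when p >= threshold."""
    return p_final >= threshold


if __name__ == "__main__":
    p = ensemble_probability(p_luke=0.8, p_roberta=0.4, w_luke=0.5)
    print(round(p, 3), is_succinylated(p))
```

Because the weights form a convex combination, the fused probability always lies between the two branch probabilities.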
5. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 1, wherein the fourth step comprises the following specific operations: a dual validation strategy of ten-fold cross-validation and independent test set validation is adopted, and existing mainstream prediction methods are introduced as comparison baselines; six core evaluation metrics are selected, namely accuracy, precision, recall, F1 score, Matthews correlation coefficient and area under the ROC curve, covering positive and negative sample identification capability, overall prediction accuracy, adaptability to class imbalance and probabilistic discrimination.
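The six metrics named above have standard definitions, which the following Python sketch computes from true labels and predicted probabilities. The function name `evaluate`, the 0.5 threshold and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration; the area under the ROC curve is computed through its pairwise-ranking interpretation.

```python
import math
from typing import Dict, Sequence


def evaluate(y_true: Sequence[int], y_prob: Sequence[float], threshold: float = 0.5) -> Dict[str, float]:
    """Compute accuracy, precision, recall, F1, MCC and ROC AUC for binary labels."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        # Undefined ratios are reported as 0.0 (a convention chosen here).
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

    # AUC as the probability that a random positive outranks a random negative.
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in pos for b in neg)

    return {
        "accuracy": ratio(tp + tn, len(y_true)),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
        "auc": ratio(wins, len(pos) * len(neg)),
    }


if __name__ == "__main__":
    print(evaluate([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

In a ten-fold cross-validation these values would be computed per fold and then averaged.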
6. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 1, wherein the fifth step comprises the following specific operations:
5.1 t-SNE feature visualization analysis: obtaining the high-dimensional semantic feature vectors generated by the pre-trained large language model for the sample sequences to be tested, inputting them to the t-SNE dimensionality reduction algorithm to obtain a two-dimensional projection, and visually presenting and comparing the positive and negative samples in that projection to characterize the separability of the two classes in the semantic feature space;
5.2 SHAP interpretability analysis: computing statistics on the SHAP values to obtain the importance of each residue and residue combination, and outputting, according to that importance, the key residues and key residue combinations that contribute most to the succinylation site recognition result.
7. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 1, wherein the parameters of the t-SNE dimensionality reduction algorithm are set as follows: [parameter values omitted]; and t-SNE dimensionality reduction projection is performed separately on the high-dimensional semantic features extracted by the pre-trained large language model and on the traditional One-Hot encoding features, so as to compare the clustering of positive and negative samples and the differences in inter-class boundaries in two-dimensional space under the different feature representations.
8. The method for identifying succinylation sites based on an entity-knowledge-driven language model according to claim 6, wherein the SHAP interpretability analysis specifically comprises: constructing multiple groups of background data sets of different sizes, with the background sample size ranging from 100 to 1000 in steps of 100; calculating, for each group, the mean absolute SHAP values of amino acid residues and residue pairs; selecting the Top-30 important residues and Top-30 important residue pairs with the highest contribution under each background data set; performing frequency statistics on the Top-30 important residue pairs obtained from the multiple background data sets; and, based on the frequency statistics, outputting the high-frequency key residue pairs as candidate key sequence features that form the basis of the model's discrimination.
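The frequency statistics over the Top-30 lists can be sketched in Python as below. The sketch assumes the mean absolute SHAP values have already been computed for each background size and are supplied as dictionaries keyed by residue pair; the SHAP computation itself is not shown, and the names `top_k` and `pair_frequencies` are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List


def top_k(scores: Dict[Hashable, float], k: int = 30) -> List[Hashable]:
    """Return the k keys with the largest mean absolute SHAP value."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [key for key, _ in ranked[:k]]


def pair_frequencies(scores_by_background: Dict[int, Dict[Hashable, float]], k: int = 30) -> Counter:
    """Count how often each residue pair enters the Top-k across background sets."""
    counts: Counter = Counter()
    for scores in scores_by_background.values():
        counts.update(top_k(scores, k))
    return counts


if __name__ == "__main__":
    # Toy values for two background sizes; the claim uses sizes 100 to 1000 in steps of 100.
    toy = {
        100: {("K", "E"): 0.9, ("K", "R"): 0.5, ("A", "G"): 0.1},
        200: {("K", "E"): 0.8, ("A", "G"): 0.6, ("K", "R"): 0.2},
    }
    print(pair_frequencies(toy, k=2).most_common())
```

Pairs that appear in the Top-k list for most background sizes are the high-frequency candidates the claim refers to; with the claimed settings there are ten background sets, so the maximum count is 10.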

Description

Entity knowledge driven language model based method for identifying succinylation sites

Technical Field

The invention belongs to the interdisciplinary field of bioinformatics and computational biology and relates to a method for identifying succinylation sites based on an entity-knowledge-driven language model.

Background

Lysine succinylation is a key protein post-translational modification (PTM) that plays a multidimensional regulatory role in life activities. Through fine regulation of cell cycle progression, metabolic signal transduction and other core biological pathways, it is closely related to the mechanisms underlying various human diseases such as malignant tumors and neurodegenerative diseases. In particular, during infection by SARS-CoV-2 and other viruses, dynamic changes in succinylation at the host-virus interface can bidirectionally regulate the viral replication cycle and the host immune response, providing a new perspective for analyzing pathogen-host interaction mechanisms; accurate identification of these sites has therefore become an important direction in biomedical research.

However, current research still faces several core problems that urgently need to be solved:
1. Traditional mass spectrometry identification has inherent throughput limitations and sequence-preference bottlenecks; it struggles to capture the high heterogeneity of succinylation when processing massive proteomics data, and its high detection cost and complex operation cannot meet the needs of large-scale site identification;
2. Although existing computational prediction methods have made some progress, they still have obvious shortcomings: some rely on traditional machine learning models such as random forests and SVMs, which have limited feature expression capability, can integrate only shallow sequence features, and struggle to mine deep semantic associations and long-range dependencies in protein sequences;
3. Some models do not effectively address the imbalance between positive and negative samples, so model training is easily dominated by majority-class samples, the ability to recognize minority-class succinylation sites is insufficient, and generalization performance is limited;
4. Existing methods generally lack joint capture of local context features and global semantic features, are mostly "black box" models, cannot explain the biochemical mechanism behind a prediction, and struggle to meet the research requirements of both accuracy and interpretability.

Therefore, the prior art has significant weaknesses in the accuracy, robustness, interpretability and large-scale applicability of succinylation site identification. An efficient prediction technique is needed that integrates multidimensional features, addresses sample imbalance, jointly models global and local features and is interpretable, in order to meet the urgent need for precise identification of lysine succinylation sites in biomedicine, virology and related fields.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a method for identifying succinylation sites based on an entity-knowledge-driven language model.
The technical scheme of the invention is as follows: The data used in the present invention are derived from the internationally recognized protein post-translational modification database (dbPTM) and the universal protein knowledge base (UniProt). To address the imbalance between positive and negative samples that is common in protein succinylation site data, the invention adopts, at the sample construction stage, an optimization strategy combining sequence extension and intelligent sampling: positive samples are first extended bidirectionally to 101 positions centered on lysine (K); samples of insufficient length are filled by mirror-image complementation and the special symbol 'X'; negative samples are screened from the complete sequences; and the NearMiss down-sampling algorithm is used to balance the sample distribution, improving the generalization performance and robustness of the model. On this basis, the invention provides an ensemble learning framework named EKROSuc, in which amino acid composition (AAC), One-Hot encoding and residue-distance-based features (DR) are fused to construct a multi-dimensional composite feature set; a LUKE model enhanced with entity knowledge and a RoBERTa model optimized for robustness are each coupled with a bidirectional long short-term memory network (BiLSTM); contextual dynamics are modeled by extracting deep features from the pre-trained models; and the final prediction model is obtained through fine-tuning and an ensemble strategy. Finally, the invention verifies the biop