
CN-121983130-A - Anticancer peptide identification method and system based on multi-feature fusion and double-layer integrated learning

CN121983130A

Abstract

The invention discloses an anticancer peptide identification method and system based on multi-feature fusion and double-layer integrated learning. The method comprises: S1, data preprocessing; S2, feature extraction and feature fusion based on protein language models; S3, handcrafted feature extraction; S4, dimension reduction of the high-dimensional features; S5, inputting each feature vector of the multi-source feature set into a corresponding XGBoost classifier for training and prediction, each classifier outputting a preliminary probability that the peptide sequence is an anticancer peptide, and combining the preliminary probabilities output by all classifiers into a probability feature vector; S6, inputting the probability feature vector obtained from the first layer into a K-nearest-neighbor (KNN) classifier and a soft voting classifier simultaneously, the KNN classifier outputting a first prediction probability and the soft voting integrator outputting a second prediction probability; S7, computing the arithmetic mean of the first and second prediction probabilities as the final prediction probability and comparing it with a preset threshold to judge whether the peptide sequence is an anticancer peptide or not.

Inventors

  • CHEN DONG
  • LI YANJUAN

Assignees

  • 衢州学院 (Quzhou University)

Dates

Publication Date
2026-05-05
Application Date
2025-12-05

Claims (10)

  1. An anticancer peptide identification method based on multi-feature fusion and double-layer integrated learning, characterized by comprising the following steps: Step S1, data preprocessing: selecting a data set for model training and preprocessing the data in the data set; Step S2, feature extraction and feature fusion based on a protein language model: extracting from pre-trained protein language models distributed representations that capture high-level semantic and structural information of a peptide sequence, and fusing the extracted features to obtain high-dimensional features; Step S3, handcrafted feature extraction: computing physicochemical, compositional and evolutionary information of the peptide sequence to form a feature set complementary to the deep features; Step S4, feature selection and dimension reduction: performing dimension reduction on the generated high-dimensional features; Step S5, constructing the first classification layer: inputting each feature vector of the multi-source feature set into a corresponding XGBoost classifier for training and prediction, each XGBoost classifier outputting a preliminary probability that the peptide sequence is an anticancer peptide; Step S6, constructing the second classification layer: inputting the probability feature vector obtained from the first layer into a K-nearest-neighbor (KNN) classifier and a soft voting classifier simultaneously, the KNN classifier outputting a first prediction probability and the soft voting integrator outputting a second prediction probability by computing the arithmetic mean of the probability feature vector; Step S7, result judgment: computing the arithmetic mean of the first and second prediction probabilities as the final prediction probability and comparing it with a preset threshold; the peptide sequence is judged to be an anticancer peptide if the final prediction probability is greater than or equal to the threshold, and a non-anticancer peptide otherwise.
  2. The method according to claim 1, wherein step S1 is specifically as follows: the training set adopts the published ACP500 data set, comprising 250 experimentally verified anticancer peptides as positive samples and 250 antibacterial peptides without anticancer activity as negative samples; the independent test set adopts the published ACP164 data set, comprising 82 positive samples and 82 negative samples; for feature encodings requiring a fixed length, the sequences are uniformly processed to a length of 20 amino acids: for sequences longer than 20, the 10 N-terminal and 10 C-terminal amino acids are retained, and sequences shorter than 20 are padded at the C-terminus with a specific character.
  3. The method of claim 2, wherein the protein-language-model features in step S2 comprise feature vectors extracted from the ProtBert, ProtBert-BFD, ProtAlbert and ProtXLNet models, and step S2 is specifically as follows: ProtBert feature extraction: using a ProtBert model trained on the UniRef100 data set, the 1024-dimensional final-hidden-layer output vectors at all amino acid positions of the input peptide sequence are taken and their element-wise average is computed to obtain a 1024-dimensional global feature vector; ProtBert-BFD feature extraction: a 1024-dimensional feature vector is obtained with a ProtBert-BFD model trained on the BFD-100 data set; ProtAlbert feature extraction: a 4096-dimensional feature vector is obtained after mean pooling with a ProtAlbert model trained on the UniRef100 data set; ProtXLNet feature extraction: a 1024-dimensional feature vector is obtained by mean pooling with a ProtXLNet model trained on the UniRef100 data set; deep feature fusion: the four feature vectors from ProtBert, ProtBert-BFD, ProtAlbert and ProtXLNet are concatenated to form the combined high-dimensional feature vector prot_merge.
  4. The method according to claim 3, wherein in step S3 the features are extracted with protein-sequence-based methods, comprising AAINDEX, 188D, BIT20, BIT21, DDE, BLOSUM62 and CKSAAGP features, specifically as follows: AAINDEX feature extraction: 531 amino acid indices without missing values are obtained from the AAindex database; for a peptide sequence of length L, each amino acid is mapped to a numerical value under each index to form L values, and a 531×L-dimensional sparse feature vector is formed by computing statistics of the L values or by directly concatenating the values of all positions in order; 188D feature extraction, comprising: 20-dimensional amino acid composition: the occurrence frequency of the 20 standard amino acids in the peptide sequence is computed to form a 20-dimensional vector (f1, f2, ..., f20), where fi = Ni / L and Ni is the count of the i-th amino acid; 168-dimensional physicochemical properties: for 8 key physicochemical properties (secondary structure, hydrophobicity, charge, polarizability, normalized van der Waals volume, solvent accessibility, polarity and surface tension), the 20 amino acids are divided into 3 classes per property, and for a given peptide sequence a 3-dimensional composition vector, a 3-dimensional transition vector and a 15-dimensional distribution vector are computed per property; the two parts are combined into the 188-dimensional feature vector 188D; BIT20 feature extraction: one-hot encoding is adopted, each of the 20 standard amino acids being encoded as a 20-bit binary vector in which exactly one bit is 1 and the rest are 0; if the peptide length differs from the preset length L, the sequence is normalized (the first 10 N-terminal and last 10 C-terminal amino acids are taken when the length exceeds L, and the sequence is padded when the length is below L), and the one-hot vectors of all amino acids are concatenated in order to obtain the 20×L = 400-dimensional binary feature vector BIT20; BIT21 feature extraction: based on 7 physicochemical properties, the 20 amino acids are divided into 21 groups, each amino acid is encoded as a 21-bit binary vector according to the group to which it belongs, and after length normalization the 21-bit codes of all amino acids are concatenated to obtain the 21×L = 420-dimensional feature vector BIT21; DDE feature extraction, based on dipeptide deviation from expected mean, specifically: the dipeptide composition is computed as Dc(r,s) = N_rs / (N - 1), where N_rs is the count of the dipeptide formed by amino acid types r and s and N is the length of the protein or peptide, generating 400 possible dipeptide combinations; the theoretical mean is computed as Tm(r,s) = (Cr / CN) × (Cs / CN), where CN = 61 and Cr, Cs are the numbers of codons encoding amino acids r and s respectively; the theoretical variance is computed as Tv(r,s) = Tm(r,s) × (1 - Tm(r,s)) / (N - 1); finally the DDE value is computed as DDE(r,s) = (Dc(r,s) - Tm(r,s)) / sqrt(Tv(r,s)); DDE values are computed for all 400 dipeptides to form a 400-dimensional feature vector; BLOSUM62 feature extraction: with N = 20 as the sequence length, each amino acid of the length-normalized peptide sequence is replaced by its corresponding 20-dimensional row vector in the BLOSUM62 matrix; CKSAAGP feature extraction: the 20 amino acids are grouped into 5 classes g1 to g5 according to their physicochemical properties, and the composition of k-spaced amino acid group pairs is computed; for each k value (k = 0, 1, 2, 3, 4, 5) the occurrence frequency of all 25 possible pairs g1g1, g1g2, ..., g5g5 in the sequence is computed to obtain a 25-dimensional vector, and the 6 vectors are concatenated into a 25×6 = 150-dimensional feature vector; the frequency is N_g1g1 / N_total, where N_g1g1 is the number of g1g1 group pairs in the protein and N_total is the total number of k-spaced group pairs; for a protein of length P, N_total equals P-1, P-2, P-3, P-4, P-5 and P-6 for k = 0, 1, 2, 3, 4 and 5 respectively.
  5. The method according to claim 4, wherein step S4 is specifically as follows: dimension reduction algorithm selection: the analysis of variance (ANOVA) F value is used to evaluate the correlation between each feature and the target variable and to score each feature independently; feature importance ranking and selection: using the SelectKBest function of the Scikit-learn library with f_classif configured as the scoring function, the ANOVA F value of each feature is computed and all features are ranked from high to low by F value; determination of the optimal feature subset dimension K: a K value balancing model performance and complexity is determined by experimental verification, and the 1200 features with the highest F values are selected to form the dimension-reduced feature vectors Prot-1200 and AAINDEX-1200; final feature set determination: the final feature set for model training and prediction contains eight feature vectors: AAINDEX-1200, Prot-1200, 188D, f_bit20, f_bit21, f_dde, f_blosum62 and f_cksaagp.
  6. The method according to claim 5, wherein in step S5 the multi-source feature set comprises the eight feature vectors and eight XGBoost classifiers are trained accordingly; first-layer (base) classifier selection: XGBoost is selected as the base learner of this layer; model construction strategy: one XGBoost classifier is trained independently on each of the eight feature vectors, yielding eight XGBoost base classifiers in total; output form: each XGBoost base classifier is configured to output a probability value, i.e. the probability Pi that the given peptide sequence is an anticancer peptide, where i = 1, 2, ..., 8; probability feature vector construction: the probability values P1 to P8 output by the eight first-layer base classifiers are combined into the 8-dimensional probability feature vector prob = [P1, P2, P3, P4, P5, P6, P7, P8].
  7. The method of claim 6, wherein in step S6 the first prediction probability is generated by the K-nearest-neighbor classifier searching for the K nearest samples in the training-set probability space and predicting the class of the current sample from the labels of those neighbors.
  8. The method according to claim 7, wherein in step S6 the second prediction probability is generated as follows: the eight probability values P1 to P8 output by the first layer are arithmetically averaged to give the second prediction probability.
  9. The method according to claim 8, wherein step S7 is specifically as follows: the arithmetic mean of the first prediction probability and the second prediction probability is computed as the final prediction probability; the classification decision threshold is set to 0.5, and the input peptide sequence is judged to be an anticancer peptide if the final prediction probability is greater than or equal to 0.5, and a non-anticancer peptide otherwise.
  10. An anticancer peptide identification system based on multi-feature fusion and double-layer integrated learning for performing the method according to any one of claims 1-9, comprising the following modules: a data preprocessing module for selecting a data set for model training and preprocessing the data in the data set; a protein-language-model feature extraction and fusion module for extracting from pre-trained protein language models distributed representations that capture high-level semantic and structural information of a peptide sequence, and fusing the extracted features to obtain high-dimensional features; a feature extraction module for computing physicochemical, compositional and evolutionary information of the peptide sequence to form a feature set complementary to the deep features; a feature selection and dimension reduction module for performing dimension reduction on the generated high-dimensional features; a first-layer classification construction module for inputting each feature vector of the multi-source feature set into a corresponding XGBoost classifier for training and prediction, each XGBoost classifier outputting a preliminary probability that the peptide sequence is an anticancer peptide; a second-layer classification construction module for inputting the probability feature vector obtained from the first layer into a K-nearest-neighbor classifier and a soft voting classifier simultaneously, the KNN classifier outputting a first prediction probability and the soft voting integrator outputting a second prediction probability by computing the arithmetic mean of the probability feature vector; and a result judging module for computing the arithmetic mean of the first and second prediction probabilities as the final prediction probability and comparing it with a preset threshold, the peptide sequence being judged an anticancer peptide if the final prediction probability is greater than or equal to the threshold, and a non-anticancer peptide otherwise.
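The fixed-length preprocessing of claim 2 (truncate to the 10 N-terminal plus 10 C-terminal residues, or pad at the C-terminus) can be sketched as follows. This is a minimal illustration; the pad character "X" is an assumption, since the claim only mentions "a specific character":

```python
def normalize_length(seq: str, target_len: int = 20, pad_char: str = "X") -> str:
    """Truncate or pad a peptide sequence to a fixed length.

    Longer sequences keep the 10 N-terminal and 10 C-terminal residues;
    shorter ones are padded at the C-terminus (the pad character "X" is
    an assumption -- the claim only says "a specific character").
    """
    if len(seq) > target_len:
        half = target_len // 2
        return seq[:half] + seq[-half:]
    return seq + pad_char * (target_len - len(seq))
```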
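The mean pooling and deep feature fusion of claim 3 can be illustrated with dummy per-residue hidden states standing in for the real ProtBert, ProtBert-BFD, ProtAlbert and ProtXLNet outputs; the fused prot_merge vector is 1024 + 1024 + 4096 + 1024 = 7168-dimensional:

```python
import numpy as np

def mean_pool(residue_states: np.ndarray) -> np.ndarray:
    """Element-wise average over residue positions: (L, d) -> (d,)."""
    return residue_states.mean(axis=0)

L = 12  # peptide length; hidden sizes per model follow claim 3
bert   = mean_pool(np.random.rand(L, 1024))   # ProtBert (dummy states)
bfd    = mean_pool(np.random.rand(L, 1024))   # ProtBert-BFD
albert = mean_pool(np.random.rand(L, 4096))   # ProtAlbert
xlnet  = mean_pool(np.random.rand(L, 1024))   # ProtXLNet

# Deep feature fusion: concatenate the four global vectors into prot_merge
prot_merge = np.concatenate([bert, bfd, albert, xlnet])
```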
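The DDE computation of claim 4 can be sketched directly from its four formulas, using the standard genetic-code codon counts (CN = 61):

```python
import math
from itertools import product

AA = "ACDEFGHIKLMNPQRSTVWY"
# Number of codons encoding each amino acid in the standard genetic code
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
          "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
          "T": 4, "V": 4, "W": 1, "Y": 2}
CN = 61  # total number of sense codons

def dde(seq: str) -> list:
    """400-dimensional Dipeptide Deviation from Expected mean vector."""
    n = len(seq)
    counts = {a + b: 0 for a, b in product(AA, repeat=2)}
    for i in range(n - 1):
        counts[seq[i:i + 2]] += 1
    feats = []
    for rs, c in counts.items():
        dc = c / (n - 1)                                  # Dc(r,s)
        tm = (CODONS[rs[0]] / CN) * (CODONS[rs[1]] / CN)  # Tm(r,s)
        tv = tm * (1 - tm) / (n - 1)                      # Tv(r,s)
        feats.append((dc - tm) / math.sqrt(tv))           # DDE(r,s)
    return feats
```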
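Claim 5 uses Scikit-learn's SelectKBest with f_classif; for two classes the same ANOVA F score can be computed directly with NumPy, as in this hedged sketch (the toy data and a k of 1 stand in for the real features and k = 1200):

```python
import numpy as np

def anova_f(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Two-class one-way ANOVA F value per feature (cf. f_classif)."""
    g0, g1 = X[y == 0], X[y == 1]
    n0, n1, n = len(g0), len(g1), len(X)
    grand = X.mean(axis=0)
    ss_between = n0 * (g0.mean(0) - grand) ** 2 + n1 * (g1.mean(0) - grand) ** 2
    ss_within = ((g0 - g0.mean(0)) ** 2).sum(0) + ((g1 - g1.mean(0)) ** 2).sum(0)
    return (ss_between / 1.0) / (ss_within / (n - 2))  # df_between = 2 - 1

def select_k_best(X: np.ndarray, y: np.ndarray, k: int) -> np.ndarray:
    """Indices of the k features with the highest F values."""
    return np.sort(np.argsort(anova_f(X, y))[::-1][:k])

# Toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.0, 5.0], [0.1, 1.0], [-0.1, 9.0],
              [1.0, 4.0], [1.1, 2.0], [0.9, 7.0]])
y = np.array([0, 0, 0, 1, 1, 1])
```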
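The first-layer probability feature vector of claim 6 can be sketched as follows; the eight base classifiers here are hypothetical stand-ins (real ones would be trained xgboost.XGBClassifier models queried via predict_proba), and the feature dimensions follow claim 5:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the eight trained XGBoost base classifiers:
# each maps a feature vector to a probability P(anticancer) via a sigmoid.
base_classifiers = [
    (lambda x, b=b: 1.0 / (1.0 + np.exp(-(x.mean() + 0.1 * b))))
    for b in range(8)
]

# One dummy feature vector per feature type, dimensions as in claim 5
dims = (1200, 1200, 188, 400, 420, 400, 400, 150)
feature_vectors = [rng.random(d) for d in dims]

# First-layer output: 8-dimensional probability feature vector prob = [P1..P8]
prob = np.array([clf(x) for clf, x in zip(base_classifiers, feature_vectors)])
```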
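The second layer and the final decision (claims 7-9) can be sketched end-to-end: a KNN estimate in the 8-dimensional probability space, the soft-vote arithmetic mean of P1-P8, their average, and the 0.5-threshold decision. The training vectors below are dummy values:

```python
import numpy as np

def knn_prob(train_probs, train_labels, query, k=5):
    """First prediction probability: fraction of positive labels among
    the k nearest training samples in the 8-dim probability space."""
    dists = np.linalg.norm(train_probs - query, axis=1)
    return train_labels[np.argsort(dists)[:k]].mean()

def predict(train_probs, train_labels, query, threshold=0.5, k=5):
    p_knn = knn_prob(train_probs, train_labels, query, k)  # first probability
    p_vote = query.mean()                 # second probability: mean of P1..P8
    p_final = (p_knn + p_vote) / 2.0      # arithmetic mean of the two
    return "ACP" if p_final >= threshold else "non-ACP"

# Dummy training probability vectors: 5 positives near 0.9, 5 negatives near 0.1
train_probs = np.vstack([np.full((5, 8), 0.9), np.full((5, 8), 0.1)])
train_labels = np.array([1] * 5 + [0] * 5)
```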

Description

Anticancer peptide identification method and system based on multi-feature fusion and double-layer integrated learning

Technical Field

The invention belongs to the technical fields of bioinformatics, computational biology and artificial intelligence, relates to peptide sequence function prediction technology, and in particular relates to a technical scheme that achieves high-precision, high-robustness intelligent identification and classification of anticancer peptides (ACPs) by integrating protein-language-model deep features with various physicochemical and evolutionary features under a double-layer ensemble learning framework.

Background

Cancer is one of the leading causes of death worldwide. Traditional therapies such as chemotherapy and radiotherapy lack specificity: while killing cancer cells they also seriously damage normal cells, cause various side effects, and are prone to inducing drug resistance. Anticancer peptides (ACPs), a class of small polypeptides derived from organisms or obtained by artificial design, can specifically target and kill cancer cells through mechanisms such as disrupting cancer cell membranes, inducing apoptosis and inhibiting tumor angiogenesis, while offering advantages such as low toxicity and low drug resistance; they have therefore become an important candidate direction for developing novel anticancer drugs. However, screening and validating ACPs in the vast peptide sequence space by wet-lab experiments is a time-consuming, costly and throughput-limited task, which greatly limits the discovery speed and application of ACPs. Therefore, developing an efficient identification method that rapidly and accurately predicts the anticancer activity of a peptide from its sequence is of great significance for accelerating anticancer peptide discovery and drug development.

In the prior art, computational methods based on machine learning have been widely used for ACP prediction. These methods can be broadly divided into two categories. The first category comprises anticancer peptide prediction techniques based on traditional machine learning. For example, the ACPred-FL model proposed by Wei et al. uses feature representation learning and a Support Vector Machine (SVM) for classification, and the ACP-GBDT model constructed by Li et al. is based on the Gradient Boosting Decision Tree (GBDT) algorithm, combining AAIndex and SVMprot-188D features. In addition, methods based on random forests, LightGBM (LGBM), sparse representation, multi-feature fusion and ensemble learning have also been reported. The core of such methods is reliance on expert knowledge for feature engineering, i.e. the manual design and extraction of numerical features reflecting the biological properties of peptide sequences. Although interpretable, their performance is bounded by the representational capacity of the designed features, and it is difficult for them to capture complex, deep semantic information and long-range dependencies in the sequence. The second category comprises anticancer peptide prediction techniques based on deep learning. For example, Wu et al. proposed a model based on a multi-kernel Convolutional Neural Network (CNN) and an attention mechanism, Sun et al. developed a hybrid deep learning network fusing a fully connected network and a Recurrent Neural Network (RNN), Lee et al. constructed a model based on contrastive learning, and Kilimei et al. explored protein language models (e.g., ESM) based on the Transformer architecture. Such methods can automatically learn sequence features, but they typically require large amounts of data, their training is computationally expensive, and they carry a risk of overfitting.

Chinese patent publication No. CN116935951B discloses a method and system for identifying anticancer peptides based on an attention mechanism and multi-granularity features, and publication No. CN117292742B discloses an anticancer peptide identification method and system. In view of the above prior art, the following technical problems need to be solved: (1) one-sidedness of feature characterization: a single type of feature (whether manual features or features learned automatically by deep learning) can hardly describe all the key information dimensions that determine the anticancer activity of a peptide; (2) limitations of the model architecture: a simple single model or a shallow ensemble has limited capability to mine complex feature patterns and limited generalization; (3) processing efficiency and effectiveness of high-dimensional features: features generated by protein language models and the like contain much redundancy and noise, and direct use can cause the "curse of dimensionality".