CN-121983112-A - Protein database searching and de novo sequencing model and training method thereof
Abstract
The invention provides a protein database search and de novo sequencing model and a training method thereof. The model comprises a peptide sequence encoder, a mass spectrum encoder, a joint-modality scorer and a peptide-length-aware decoder. The method trains the model for multiple rounds on a training set in which each sample includes mass spectrum data and the true peptide sequence. In each round, the method retrieves a candidate peptide set according to the mass spectrum data; encodes each peptide sequence in the set with the peptide sequence encoder to obtain a feature representation of each peptide; encodes the mass spectrum data with the mass spectrum encoder to obtain a feature representation of the mass spectrum; for the search task, processes the mass spectrum feature representation and each peptide feature representation with the joint-modality scorer to obtain a matching score between each peptide sequence and the mass spectrum data; for the de novo sequencing task, generates a peptide sequence from the mass spectrum feature representation with the peptide-length-aware decoder; and constructs a total loss function based on the two tasks to update the model parameters.
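As an illustration of how a total loss over the two tasks could be assembled, the following is a minimal sketch under stated assumptions, not the patented implementation. It assumes the search task is trained with a softmax cross-entropy over the candidate matching scores and the de novo task with a per-residue negative log-likelihood; the function names `search_loss`, `sequence_loss` and `total_loss` and the weights `w_search` and `w_denovo` are hypothetical and do not appear in the source.

```python
import math


def search_loss(scores, positive_index):
    """Softmax cross-entropy over candidate matching scores (assumed form).

    Minimising it raises the score of the positive candidate and lowers
    the scores of the negative candidates retrieved from the database.
    """
    m = max(scores)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[positive_index]


def sequence_loss(token_probs, true_sequence):
    """Mean negative log-likelihood of the true residues (assumed form).

    token_probs[i] maps each residue letter to its predicted probability
    at position i of the generated peptide.
    """
    eps = 1e-12  # avoids log(0) when a residue received no probability
    nll = [-math.log(token_probs[i].get(aa, 0.0) + eps)
           for i, aa in enumerate(true_sequence)]
    return sum(nll) / len(nll)


def total_loss(scores, positive_index, token_probs, true_sequence,
               w_search=1.0, w_denovo=1.0):
    """Weighted sum of the search loss and the de novo loss (hypothetical weights)."""
    return (w_search * search_loss(scores, positive_index)
            + w_denovo * sequence_loss(token_probs, true_sequence))


if __name__ == "__main__":
    scores = [2.0, 0.5, -1.0]  # index 0 is the positive example
    probs = [{"P": 0.9, "E": 0.1}, {"E": 0.8, "K": 0.2}]
    print(total_loss(scores, 0, probs, "PE"))
```

In this sketch a single backward pass through the summed loss would update both encoders, the scorer and the decoder together, which is one plausible reading of "a total loss function based on the two tasks".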
Inventors
- ZHAO JIALE
- ZHANG XIN
- LIAO YUCHENG
- E Weinan
- ZHANG WEIJIE
- WEN HAN
- CHI HAO
- MAO PENGZHI
- WANG KAIFEI
- LI YIMING
- PENG YAPING
- Chen Ranfei
- LU SHUQI
- JI XIAOHONG
- DING JIAXIANG
Assignees
- Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Dates
- Publication Date
- 20260505
- Application Date
- 20251201
Claims (11)
- 1. A training method for a protein database search and de novo sequencing model, characterized in that the model comprises a peptide sequence encoder, a mass spectrum encoder, a joint-modality scorer and a peptide-length-aware decoder, and the training method comprises: obtaining a training set comprising a plurality of samples, each sample comprising mass spectrum data and the true peptide sequence of that mass spectrum data; and training the model for a plurality of rounds using the training set to obtain a trained model, wherein each training round comprises the following steps: S1, searching a protein database for candidate peptides according to the mass spectrum data of each sample in the current round and constructing a candidate peptide set, wherein the candidate peptide set comprises a plurality of retrieved candidate peptide sequences serving as negative examples of the sample, candidate peptide sequences serving as positive examples of the sample, and the true peptide sequence; S2, encoding each peptide sequence in the set with the peptide sequence encoder to obtain a peptide feature representation of each peptide, and encoding the mass spectrum data with the mass spectrum encoder to obtain a mass spectrum feature representation; S3, for the search scoring task, processing the mass spectrum feature representation and each peptide feature representation in the set with the joint-modality scorer to obtain a mass spectrum modality embedding and a modality embedding of each peptide, and obtaining a matching score between each peptide sequence and the mass spectrum data from the mass spectrum modality embedding and the peptide modality embeddings; S4, for the de novo sequencing task, generating a peptide sequence from the mass spectrum feature representation with the peptide-length-aware decoder; S5, guiding the update of the model parameters with a constructed total loss function, wherein the optimization objectives of the total loss function comprise maximizing the matching score between the mass spectrum data and each peptide sequence serving as a positive example, minimizing the matching score between the mass spectrum data and each peptide sequence serving as a negative example, and minimizing the difference between the true peptide sequence and the generated peptide sequence.
- 2. The training method of claim 1, wherein the joint-modality scorer comprises a first self-attention network, a first cross-attention network and a first feed-forward neural network, and wherein S3 comprises: processing the input feature representation of each peptide through the first self-attention network to obtain an internal context embedding of each peptide; obtaining, through the first cross-attention network, an alignment embedding between each peptide and the mass spectrum from the input mass spectrum feature representation and the internal context embedding of each peptide; processing the alignment embedding between each peptide and the mass spectrum through the first feed-forward neural network to obtain the modality embedding of each peptide and the mass spectrum modality embedding; and obtaining, through the joint-modality scorer, the matching score between each peptide sequence in the set and the mass spectrum data from the peptide modality embeddings and the mass spectrum modality embedding.
- 3. The training method of claim 1, wherein the peptide-length-aware decoder comprises a second self-attention network, a second cross-attention network and a second feed-forward neural network, and wherein, in S4, the peptide sequence is generated over a plurality of iterations to obtain the final generated peptide sequence, each iteration comprising: predicting, with the model, the peptide sequence length corresponding to the mass spectrum data from the mass spectrum feature representation of the sample obtained in S2, and initializing the peptide sequence according to the predicted length through the second self-attention network; processing the peptide sequence generated in the previous iteration, the input mass spectrum feature representation and the initialized peptide sequence through the second cross-attention network to obtain a peptide vector whose update is guided by the mass spectrum features; and generating the peptide sequence of the current iteration from the updated peptide vector through the second feed-forward neural network.
- 4. The training method of claim 3, wherein the total loss function comprises a peptide length loss function and a peptide sequence loss function, wherein constructing the peptide length loss function comprises: constructing, from the true peptide sequence length and the predicted length, a peptide length loss function that measures the difference between the predicted peptide sequence length and the true peptide sequence length; and wherein constructing the peptide sequence loss function comprises: constructing, from the true peptide sequence and the generated peptide sequence, a peptide sequence loss function that measures the difference between the generated peptide sequence and the true peptide sequence.
- 5. The training method of claim 1, wherein the total loss function further comprises a mass spectrum loss function and a spectral peak loss function of the mass spectrum, wherein constructing the mass spectrum loss function comprises: predicting, with the model, the corresponding mass spectrum data from the peptide feature representation of each peptide sequence in the candidate peptide set of the sample obtained in S1; and constructing, from the predicted mass spectrum data and the mass spectrum data of the sample, a mass spectrum loss function that measures the difference between them; and wherein constructing the spectral peak loss function comprises: predicting, with the model, spectral peak data of the mass spectrum data from the mass spectrum feature representation of the sample obtained in S2, wherein the spectral peak data comprise the number of amino acids associated with each spectral peak and the type of the last amino acid associated with each spectral peak; and constructing, from the true spectral peak data and the predicted spectral peak data, a spectral peak loss function that measures the difference between the predicted spectral peak data and the true spectral peak data in the mass spectrum data of the sample.
- 6. The training method of claim 1, wherein the total loss function further comprises a peptide-spectrum feature matching loss function, the construction of which comprises: constructing the peptide-spectrum feature matching loss function from the feature matching values computed between the mass spectrum feature representation of each sample and the peptide feature representation of each peptide sequence in the candidate peptide set of that sample; wherein the optimization objective comprises maximizing the feature matching value between the mass spectrum feature representation of a sample and the peptide feature representation of the candidate peptide sequence serving as the positive example of that sample, and minimizing the feature matching value between the mass spectrum feature representation of the sample and the peptide feature representation of every other peptide sequence in the current round, so as to optimize the value of the peptide-spectrum feature matching loss function.
- 7. A method for de novo sequencing of low-abundance modified peptides, wherein the method is applied to predicting the peptide sequences of low-abundance modified peptides, wherein, for the low-abundance modified peptides, the proportion of peptides carrying the modification type among all peptides is less than or equal to a preset ratio, the method comprising: obtaining target mass spectrum data of a low-abundance modified peptide and a model obtained by the method of any one of claims 1 to 6, wherein the model comprises a peptide sequence encoder, a mass spectrum encoder, a joint-modality scorer and a peptide-length-aware decoder; encoding the target mass spectrum data with the mass spectrum encoder to obtain a mass spectrum feature representation, predicting a plurality of peptide lengths from the mass spectrum feature representation, and generating a plurality of peptide sequences of different lengths with the peptide-length-aware decoder based on the mass spectrum feature representation and the plurality of peptide lengths; encoding each of the peptide sequences of different lengths with the peptide sequence encoder to obtain a peptide feature representation of each peptide sequence, obtaining a matching score between each generated peptide sequence and the target mass spectrum data with the joint-modality scorer from the peptide feature representations and the mass spectrum feature representation, and predicting the corresponding mass spectrum data from the peptide feature representation of each peptide sequence; calculating, for each generated peptide sequence, the similarity between its predicted mass spectrum data and the target mass spectrum data, and its number of missing fragmentation sites; and filtering the generated peptide sequences according to the matching score, the similarity and the number of missing fragmentation sites to obtain the peptide sequence of the low-abundance modified peptide.
- 8. A method for de novo sequencing of high-abundance modified peptides, wherein the method is applied to predicting the peptide sequences of high-abundance modified peptides, wherein, for the high-abundance modified peptides, the proportion of peptides carrying the modification type among all peptides is greater than a preset ratio, the method comprising: obtaining target mass spectrum data of high-abundance modified peptides and a model obtained by the method of any one of claims 1 to 6, wherein the model comprises a peptide sequence encoder, a mass spectrum encoder, a joint-modality scorer and a peptide-length-aware decoder; encoding the target mass spectrum data with the mass spectrum encoder to obtain a mass spectrum feature representation, predicting a plurality of peptide lengths from the mass spectrum feature representation, and generating a plurality of peptide sequences of different lengths with the peptide-length-aware decoder based on the mass spectrum feature representation and the plurality of peptide lengths; selecting a preset number of variable modifications based on the plurality of peptide sequences of different lengths, and combining the peptide sequences of different lengths into a protein file; and searching a protein database using the target mass spectrum data, the preset number of variable modifications and the protein file to obtain the peptide sequences of the high-abundance modified peptides.
- 9. A protein database search and screening method based on end-to-end scoring, characterized by comprising the following steps: obtaining target mass spectrum data of high-abundance modified peptides, and searching a protein database for candidate peptide sequences according to the target mass spectrum data to obtain a peptide set comprising a plurality of peptide sequences; obtaining a model obtained by the method of any one of claims 1 to 6, encoding each peptide sequence in the peptide set with the peptide sequence encoder of the model to obtain a feature representation of each peptide, and encoding the target mass spectrum data with the mass spectrum encoder of the model to obtain a mass spectrum feature representation; processing the mass spectrum feature representation and each peptide feature representation in the set with the joint-modality scorer of the model to obtain a matching score between each peptide sequence and the mass spectrum data; and selecting, according to the matching scores, a peptide sequence with a higher matching score from the peptide set as the peptide sequence of the high-abundance modified peptide.
- 10. A computer-readable storage medium having stored thereon a computer program executable by a processor to perform the steps of the method of any one of claims 1 to 9.
- 11. An electronic device, comprising: one or more processors; and a memory for storing executable instructions; wherein the one or more processors are configured to implement the steps of the method of any one of claims 1 to 9 by executing the executable instructions.
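The filtering step of claim 7 can be illustrated with a minimal sketch under stated assumptions; it is not the patented procedure. It assumes that a fragmentation site counts as missing when neither its singly charged b ion nor its singly charged y ion lies within a tolerance of an observed peak, that spectral similarity is a cosine similarity between binned intensity vectors, and that filtering uses fixed thresholds. The function names, the tolerance `tol` and all threshold values are hypothetical; the residue masses are standard monoisotopic values.

```python
import math

# Standard monoisotopic residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_mzs(peptide):
    """Return (b, y) singly charged m/z pairs for each of the len-1 backbone sites."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    total = sum(masses)
    sites, prefix = [], 0.0
    for m in masses[:-1]:
        prefix += m
        b = prefix + PROTON                    # N-terminal fragment
        y = (total - prefix) + WATER + PROTON  # complementary C-terminal fragment
        sites.append((b, y))
    return sites


def missing_sites(peptide, observed_mzs, tol=0.02):
    """Count sites with neither a b nor a y ion within tol of an observed peak."""
    def seen(mz):
        return any(abs(mz - o) <= tol for o in observed_mzs)
    return sum(1 for b, y in fragment_mzs(peptide) if not (seen(b) or seen(y)))


def cosine(u, v):
    """Cosine similarity between two equally binned intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def filter_candidates(candidates, min_score=0.5, min_similarity=0.7, max_missing=2):
    """Keep candidates passing all three thresholds, best matching score first.

    Each candidate is a dict with keys 'sequence', 'score', 'similarity', 'missing'.
    """
    kept = [c for c in candidates
            if c["score"] >= min_score
            and c["similarity"] >= min_similarity
            and c["missing"] <= max_missing]
    return sorted(kept, key=lambda c: c["score"], reverse=True)
```

For the tripeptide `GAK`, for example, an observed peak at m/z 147.1128 matches the y1 ion of lysine, so only the first of its two fragmentation sites would be counted as missing.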
Description
Protein database searching and de novo sequencing model and training method thereof
Technical Field
The invention relates to the field of bioinformatics, in particular to computational proteomics and deep learning, and more particularly to a protein database search and de novo sequencing model and a training method thereof.
Background
Proteomics has become a key area of modern life science research, and tandem mass spectrometry (MS/MS) is widely established as a fundamental analytical tool. Database search engines are currently the primary tools for peptide and protein identification. Traditional search engines such as SEQUEST, MaxQuant, pFind and AlphaPept rely on pre-stored sequence databases for peptide-spectrum matching, and typically assess the resulting matches with simple machine learning models built on a limited set of hand-crafted features. To improve the identification coverage of MS/MS data, open search engines such as Open-pFind and MSFragger were developed to identify peptides produced by nonspecific enzymatic digestion or carrying unknown modifications. In addition, de novo sequencing offers an alternative approach that infers the peptide sequence directly, without reference to a proteome database, and plays an important role in applications such as neoantigen discovery. Over the past two decades, tools such as pNovo, PEAKS and PepNovo have been introduced successively, but their overall accuracy and robustness remain significantly limited in complex biological samples.
In recent years, deep learning techniques have been widely used in proteomic data analysis. Researchers have developed a variety of models to predict key peptide properties, such as theoretical mass spectra and liquid chromatography retention times, and have integrated these predictions into machine learning frameworks to optimize peptide-spectrum match scoring. For example, DeepSearch has demonstrated the potential of end-to-end scoring, with performance comparable to conventional search engines such as MaxQuant. Meanwhile, to address the challenge of de novo sequencing, deep learning models such as DeepNovo, PointNovo, Casanovo, GraphNovo and π-PrimeNovo have been developed successively, improving sequence inference through better algorithmic structures. In general, the prior art has advanced the analysis of standardized samples, but database searching and de novo sequencing are still each built on isolated modules, deep learning applications focus on optimizing subtasks, and the global performance bottleneck of the core identification task has not been systematically resolved. That is, although the prior art has achieved certain results, the following objective problems and disadvantages remain:
First, the data sets currently used to train deep learning models are mainly derived from traditional database searches. These data strictly assume complete trypsin digestion (meaning that all cleavages are tryptic, with few or no other cleavage types such as AspN cleavage) and contain only a limited number of common modification types. As a result, the data sets cannot fully reflect the real characteristics of mass spectrometry data and, in particular, can hardly cover nonspecific proteolysis or unexpected peptide modifications, which severely hampers the efficient identification of such complex peptides.
Second, database searching and de novo sequencing are inherently unified in mass spectrum analysis, but few studies have proposed a unified framework that jointly handles peptide-spectrum match assessment and quality control for both. Although fields such as computer vision and protein design have improved generalization and reduced overfitting through multitask and multimodal learning (for example BLIP and Uni-Mol), such methods have not been systematically applied in proteomics, so the tasks remain split, resources are wasted and synergistic effects go unused.
Finally, traditional de novo sequencing methods lack a reliable quality control mechanism: their output often contains a large number of false positives and requires manual verification. Existing deep learning models have improved sequence inference accuracy, but their performance drops sharply on modified-peptide-enriched data because the search space grows rapidly, and they cannot provide a systematic filtering strategy. As a result, the identification rate is low in high-value applications such as immunopeptidomics, and the authenticity of the results is difficult to verify.
In general, the existing MS/MS identification paradigm has the problem that database searching and de novo protein sequencing are each run mainly as isolated modules, namely a search engine and de novo sequencing are re