CN-121983142-A - Antibacterial peptide length self-adaptive generation method and device based on pairing learning framework


Abstract

The invention discloses an antibacterial peptide length self-adaptive generation method and device based on a pairing learning framework, belonging to the field of computational biology. The method comprises: constructing a lead set, a target set and a mapping table, wherein the lead set comprises a plurality of amino acid sequences, the target set comprises known antibacterial peptide sequences, and the mapping table explicitly defines the pairing relation between the lead set and the target set together with the length constraint rules; training a sequence conversion model on paired training data formed from the lead set, the target set and the mapping table; and, for any input test lead sequence, generating candidate antibacterial peptide sequences with antibacterial potential using the sequence conversion model. The invention decouples the antibacterial peptide generation framework from the traditional data-model binary structure and derives a distinctive data enhancement method. By adjusting the length mapping rule in the mapping table, the length of the generated antibacterial peptide sequence can be adjusted adaptively, which addresses the problem that current state-of-the-art generation models cannot generate long antibacterial peptides in large numbers.
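The relation between the lead set, the target set and the mapping table can be illustrated with a minimal Python sketch. It is not the implementation disclosed in the patent: the names `MappingEntry`, `rule_holds` and `make_training_pairs`, and the rule labels `"compress"`, `"extend"` and `"equal"`, are hypothetical, and the sequences are toy strings rather than verified antibacterial peptides. The three length comparisons follow the rules listed in claim 2.

```python
from dataclasses import dataclass

# Hypothetical labels for the three length constraint rules of claim 2.
RULES = {
    "compress": lambda lead, target: len(lead) >= len(target),
    "extend": lambda lead, target: len(lead) <= len(target),
    "equal": lambda lead, target: len(lead) == len(target),
}


@dataclass(frozen=True)
class MappingEntry:
    """One row of the mapping table M (illustrative layout)."""
    lead_index: int    # index into the lead set L
    target_index: int  # index into the target set S
    rule: str          # key of RULES that this pair must satisfy


def rule_holds(rule, lead, target):
    """Return True if the (lead, target) pair satisfies the named rule."""
    return RULES[rule](lead, target)


def make_training_pairs(leads, targets, mapping):
    """Turn L, S and M into (lead, target) pairs, rejecting rule violations."""
    pairs = []
    for entry in mapping:
        lead = leads[entry.lead_index]
        target = targets[entry.target_index]
        if not rule_holds(entry.rule, lead, target):
            raise ValueError(f"pair {entry} violates rule {entry.rule!r}")
        pairs.append((lead, target))
    return pairs


# Toy data: placeholder strings, not verified antibacterial peptides.
leads = ["GKL", "AVWKR"]
targets = ["KKLLKKLLKK", "GLFKKLAGKLLKK"]
mapping = [MappingEntry(0, 0, "extend"), MappingEntry(1, 1, "extend")]
pairs = make_training_pairs(leads, targets, mapping)
```

In this sketch, switching the rule stored in the mapping table (for example from `"extend"` to `"compress"`) changes which pairs are accepted as training data, which is one plausible way the length behaviour of a downstream sequence conversion model could be steered.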

Inventors

LIU YONGQIANG
CHEN GAOXIANG
SUN WENHUI
QIU JIEZHONG
HU JIE
FENG LINQING
ZHANG NING

Assignees

Zhejiang Lab (之江实验室)

Dates

Publication Date: 20260505
Application Date: 20260407

Claims (10)

1. An antibacterial peptide length self-adaptive generation method based on a pairing learning framework, characterized by comprising the following steps: 1) constructing a lead set L, a target set S and a mapping table M, wherein the lead set L comprises n elements, each of which is an amino acid sequence, the target set S comprises n elements, each of which is a known antibacterial peptide sequence, and the mapping table M explicitly defines the pairing relation and the length constraint rule between each element in the lead set L and the corresponding element in the target set S; 2) training a sequence conversion model on paired training data formed from the lead set L, the target set S and the mapping table M, so that the sequence conversion model learns a mapping relation from lead sequences to target antibacterial peptide sequences, wherein the mapping relation internalizes the length constraint rule in the mapping table M; 3) inputting any test lead sequence into the trained sequence conversion model, which generates a candidate antibacterial peptide sequence with antibacterial potential, wherein the length of the generated candidate antibacterial peptide sequence is governed by the length of the test lead sequence and the length constraint rule internalized by the model.
2. The method according to claim 1, wherein the length constraint rules defined in the mapping table M comprise at least one of: a length compression rule, wherein the length of the lead set sequence is not less than the length of its paired target antibacterial peptide sequence; a length extension rule, wherein the length of the lead set sequence is not greater than the length of its paired target antibacterial peptide sequence; a length equality rule, wherein the length of the lead set sequence is equal to the length of its paired target antibacterial peptide sequence.
3. The method according to claim 2, wherein the length constraint rule defined in the mapping table M is the length extension rule, i.e. the length of the lead set sequence is not greater than the length of its paired target antibacterial peptide sequence.
4. The method of claim 1, wherein step 2) further comprises expanding the paired training data by data enhancement, wherein the target set S is kept unchanged and the lead set L is expanded, so that the same antibacterial peptide sequence in the target set S is paired with a plurality of different lead sequences in the lead set L, thereby increasing the amount of paired training data.
5. The method of claim 4, wherein expanding the lead set L comprises randomly generating a new amino acid sequence as a new lead sequence, or partially masking an antibacterial peptide sequence in the target set S and using the masked sequence as a new lead sequence.
6. The method of claim 1, wherein the sequence conversion model is a neural network model based on the Transformer architecture.
7. The method of claim 1, wherein the amino acid sequences in the lead set are sequences randomly assembled from the set of 20 standard amino acids, or arbitrary protein sequence fragments derived from non-antibacterial peptides, and wherein the known antibacterial peptide sequences in the target set comprise only the 20 standard amino acids.
8. An antibacterial peptide length self-adaptive generation device based on a pairing learning framework, which is characterized by comprising one or more processors and a graphics processor GPU; the one or more processors are configured to execute instructions to implement the method of any of claims 1 to 7.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the method of any of claims 1 to 7.
10. An electronic device, comprising: a memory for storing a computer program; and one or more processors and a graphics processor (GPU) for executing the computer program stored in the memory to implement the method of any of claims 1 to 7.
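The data enhancement described in claims 4 and 5 can be sketched as follows. This is an illustrative example under stated assumptions, not the claimed implementation: the function names `random_lead`, `masked_lead` and `augment`, the mask symbol `X`, the 30% mask fraction and the alternation between the two lead types are all hypothetical choices, and the target strings are toy placeholders. Lead lengths are chosen so that every pair satisfies the length extension rule of claim 3.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "X"  # hypothetical mask symbol; the source does not specify one


def random_lead(max_len, rng):
    """Randomly assemble a lead of length 1..max_len (claim 5, first option)."""
    length = rng.randint(1, max_len)
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def masked_lead(target, mask_fraction, rng):
    """Partially mask a target sequence (claim 5, second option).

    Assumption: masked residues are replaced in place, so the lead keeps
    the target's length.
    """
    n_mask = max(1, int(len(target) * mask_fraction))
    positions = set(rng.sample(range(len(target)), n_mask))
    return "".join(MASK if i in positions else aa for i, aa in enumerate(target))


def augment(targets, leads_per_target, rng):
    """Pair each unchanged target with several different leads (claim 4)."""
    pairs = []
    for target in targets:
        for k in range(leads_per_target):
            if k % 2 == 0:
                lead = random_lead(len(target), rng)
            else:
                lead = masked_lead(target, 0.3, rng)
            pairs.append((lead, target))
    return pairs


rng = random.Random(0)  # fixed seed so the sketch is reproducible
targets = ["KKLLKKLLKK", "GLFKKLAGKLLKK"]  # toy strings, not verified peptides
pairs = augment(targets, 4, rng)
```

Because only the lead side is generated, the target sequences are never altered, so the number of training pairs grows without modifying any known antibacterial peptide.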

Description

Antibacterial peptide length self-adaptive generation method and device based on pairing learning framework

Technical Field

The invention belongs to the field of bioinformatics and artificial-intelligence-driven computational biology, and particularly relates to an antibacterial peptide length self-adaptive generation method and device based on a pairing learning framework.

Background

Resistance to antibacterial drugs has significantly increased the global mortality burden caused by microbial infection and has become a long-standing, serious problem in the pharmaceutical field. Antibacterial peptides (AMPs), composed of short amino acid sequences, can kill bacteria through various mechanisms such as disrupting bacterial cell membranes, immunomodulation, binding specific targets, or interfering with metabolic processes. AMPs have demonstrated the ability to inhibit microbial pathogens, with broad-spectrum antimicrobial activity and a low risk of drug resistance, and are therefore considered strong candidates for the development of novel alternatives to antibiotic therapy, helping to alleviate the threat that drug-resistant pathogens pose to patients' lives and to the economy. There is thus a growing need to discover and design highly potent antimicrobial peptides.

Currently, various strategies exist to discover novel antimicrobial peptides. For example, experimental bioactivity methods (such as chromatographic separation and fluorescence screening) offer higher precision but are time-consuming, tedious and costly, and are difficult to apply at scale. Traditional bioinformatics tools, represented by BLAST, excel at sequence alignment and can be used to identify antibacterial peptides. However, their homology-based filtering mechanism limits the diversity of results and tends to narrow the scope of exploration.
In contrast, artificial intelligence strategies and multifunctional tools can significantly accelerate and broaden the discovery of antimicrobial peptides. To date, a variety of artificial-intelligence-driven antimicrobial peptide design models have been proposed, which fall into two branches: "recognition models" and "generation models". A recognition model is designed to determine whether a given peptide sequence has antibacterial activity. For example, Ma et al. (2022) used three antibacterial peptide predictors, based on attention, LSTM and BERT, to mine novel antibacterial peptides from the human gut microbiome, and Wang et al. (2025) proposed EvoGradient and mined novel antimicrobial peptides from the human oral microbiome. A generation model aims to generate novel candidate antibacterial peptide sequences with potential therapeutic efficacy; examples include HydrAMP (2023), pepdiffusion (2025) and AMP-designer (2025). The generated candidate peptides are usually screened iteratively by multiple recognition models or other filtering strategies (such as physicochemical properties and structural characteristics) to meet requirements for activity, specificity and safety, and are then verified experimentally. The sharp reduction in cost and the shortening of the development cycle make the optimization of antibacterial peptide generation models a key link in the anti-infection field.

However, existing antimicrobial peptide generation models still face the limitation of data scarcity. For example, the antimicrobial peptide database APD3 contains only 6301 antimicrobial peptides (as of December 2025), far fewer than data sets in the computer vision field (the ImageNet data set (Deng et al. 2009) contains more than 14 million labeled samples).
Meanwhile, data enhancement for antibacterial peptide sequences lacks rigor: a small change to a single sequence can dramatically alter its function, and whether the altered sequence is active must ultimately be verified by wet-lab experiments. Amplifying the antimicrobial peptide training data set by directly altering sequences is therefore not viable. The shortage of training data limits the generalization of the generation model, which in turn limits the accuracy of the sequences it generates. Performing data enhancement without changing the original antibacterial peptide sequences, so as to improve the generalization of the generation model, is thus a major challenge.

On the other hand, data scarcity also skews the data distribution. Figure 1 shows the sequence length distribution in the antimicrobial peptide data set APD3; only 348 sequences are longer than 60 residues. The length of an antimicrobial peptide sequence affects its inherent functional properties. The