CN-121999858-A - Bidirectional reversible conversion method and system between peptide molecule SMILES and sequence expression
Abstract
The invention discloses a bidirectional reversible conversion method and system between peptide molecule SMILES and a sequence expression. A new sequence description grammar is defined that preserves information on special peptide bonds and amino-acid-specific modifications. The method comprises: a topology identification algorithm based on main-chain atom indexing and adjacency traversal; integrated detection and encoding of end groups and topology; a residue identification algorithm that combines main-chain cleavage with template library matching and is compatible with any standard or non-standard amino acid residue; an extensible end group library/monomer template library with an automatic increment mechanism; automatic identification and sequence annotation of S-S disulfide bonds; a high-fidelity assembly algorithm based on HELM anchor points; and topology-aware cyclic peptide processing. The invention addresses the problems that the prior art cannot support complex polypeptide topologies, has poor reversibility and offers insufficient monomer library extensibility. It can be widely applied to scenarios such as quantitative structure-activity relationship (QSAR) model construction and large-scale polypeptide data cleaning, and has significant practicality and novelty.
Inventors
- TANG BOWEN
- ZHUANG YUAN
- XIE YI
Assignees
- 埃森生物有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260409
Claims (10)
- 1. A bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression, comprising: (1) a SMILES → sequence expression conversion flow, specifically: standardizing the input peptide molecule SMILES and constructing a molecular graph; searching the molecular graph with SMARTS patterns to obtain the sequential order of the residues in the molecule and the corresponding main-chain atom indices; determining, based on the main-chain atom indices and their sequential order, whether the peptide molecule has a head-to-tail main-chain ring structure and/or a covalent connection structure between side chains, and generating corresponding topology description information; searching the N atom of the first residue, the C atom of the last residue and their neighboring non-main-chain heavy atoms in the molecular graph to obtain a candidate atom subset for the end groups, and matching it sequentially against an end group library; identifying the positions of the peptide bonds based on the main-chain atom indices, cleaving the peptide bonds to obtain residue structure fragments, normalizing the structure of each residue structure fragment, and determining the corresponding residue sequence representation by structure matching against a monomer template library; locating and identifying the disulfide bond connections between residues based on the atomic adjacency of the residue structure fragments in the molecular graph, and generating disulfide bond annotation information in the form of residue index pairs; and integrating the end group structure representation, the residue sequence representation, the topology description information and the disulfide bond annotation information to generate a sequence expression containing the residue sequence and topology constraint information; (2) a sequence expression → SMILES conversion flow, specifically: parsing the sequence expression to obtain the residue sequence representation, the topology description information and the specific connection annotation information; retrieving the corresponding molecular fragments from the monomer template library based on the residue sequence, and splicing the main chain according to the residue sequence and the topology description information to construct a molecular graph containing the main-chain structure; deleting the leaving atoms or groups from the molecular graph so that each amino acid takes its correct structural form in the polypeptide; reconstructing the disulfide bonds between the corresponding residues in the molecular graph, or forming a specific connection through a linker, based on the specific connection annotation information; and encoding the constructed molecular graph to generate the SMILES corresponding to the input sequence expression.
- 2. The bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression according to claim 1, wherein searching the molecular graph with SMARTS patterns to obtain the sequential order of the residues in the molecule and the corresponding main-chain atom indices is specifically: a SMARTS pattern is defined as [N;$(NCC(=O))][C;$(C(N)C=O)]-[C;$(C=O)]; the molecular graph is searched with this pattern, which targets the repeating N-Cα-C(=O) structure of the peptide main chain; the residue order is established from the main-chain atom indices; and adjacency traversal is performed along the peptide bond direction, thereby obtaining the sequential order of the residues in the molecule and the corresponding main-chain atom indices.
- 3. The bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression according to claim 1, wherein performing topology determination on the molecular graph and generating the topology description information specifically comprises: comprehensively analyzing the connection between the main-chain N and the main-chain C and the breaking/re-forming of side-chain linkages, determining whether a head-to-tail main-chain ring structure and/or a covalent connection structure between side chains exists, and generating the corresponding topology description information; whether the peptide is a cyclic peptide is determined by detecting whether a direct chemical bond exists between the N atom of the first residue and the C atom of the last residue, and whether the peptide has a side-chain binding structure is determined by identifying a special connecting group on a side chain and the positions of the residues it connects.
- 4. The bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression according to claim 1, wherein obtaining the candidate atom subset for the end groups and matching it sequentially against the end group library is specifically: cutting the obtained end group candidate atom set out as sub-molecules, standardizing the sub-molecules into standard end group SMILES, and then matching them against the end group library; when an end group candidate atom set cannot be matched, its structure is directly encoded as the end group structure representation; the end group library comprises a core end group library and an extension end group library compatible with HELM, and the standard end group SMILES is matched against the core end group library first and, if that fails, against the extension end group library.
- 5. The bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression according to claim 1, wherein the structure normalization comprises removing the dummy atoms left by cleavage, restoring the C-terminal carboxylic acid form, and retaining side chains and modifying groups; and wherein the structure matching against the monomer template library adopts a two-stage matching strategy: exact structure matching against the entries in the monomer template library is attempted first, and if it fails, similarity matching against the entries in the monomer template library is performed using Morgan fingerprints.
- 6. The bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression according to claim 1, further comprising an extensible end group library/monomer template library and an automatic increment mechanism: if an unknown side-chain combination or end group structure with a well-defined structure is detected in the SMILES → sequence expression conversion flow, a new extension entry is automatically generated, the related information is recorded, and the new entry and its recorded information are added to the extension library of the monomer template library or to the extension end group library of the end group library.
- 7. The bidirectional reversible conversion method between peptide molecule SMILES and a sequence expression according to claim 1, wherein retrieving the corresponding molecular fragments from the monomer template library based on the residue sequence and splicing the main chain according to the residue sequence and the topology description information is specifically: each monomer template entry uses HELM-style SMILES with atom map numbers to define its main-chain anchor points; the anchor points and leaving groups of each molecular fragment are parsed; and main-chain splicing is carried out by a HELM-style fragment fusion operation driven by the anchor point mapping.
- 8. The method according to claim 1, wherein, when the topology is determined to be a cyclic peptide: in the SMILES → sequence expression conversion flow, one peptide bond is broken and end groups are added so that the molecule becomes a linear peptide, after which the conversion is performed; and in the sequence expression → SMILES conversion flow, a linear SMILES is first generated for the sequence without end groups, after which the end groups are deleted and the peptide bond is re-formed to produce the cyclic peptide.
- 9. A bidirectional reversible conversion system between peptide molecule SMILES and a sequence expression, characterized in that the system is used for implementing the method of any one of claims 1 to 8 and comprises a preprocessing module, a main chain recognition module, a topology determination module, an end group detection module, a residue recognition module, a disulfide bond processing module, a sequence splicing module, a sequence parsing module, a fragment assembly module and a terminal processing module; the preprocessing module is used for SMILES standardization and molecular graph construction; the main chain recognition module is used for searching the main-chain sequence; the topology determination module is used for identifying the topology; the end group detection module is used for end group identification and end group library matching; the residue recognition module is used for residue segmentation and monomer template library matching; the disulfide bond processing module is used for disulfide bond recognition, annotation and reconstruction; the sequence splicing module is used for generating the sequence expression in the SMILES → sequence expression flow; the sequence parsing module is used for splitting the sequence expression into its components and metadata; the fragment assembly module is used for splicing molecular fragments in the sequence expression → SMILES flow; the terminal processing module is used for leaving atom deletion and end group backfill; the SMILES generation module is used for disulfide bond reconstruction and SMILES output.
- 10. The bidirectional reversible conversion system between peptide molecule SMILES and a sequence expression according to claim 9, wherein the system further comprises a Web/script interaction module; the Web/script interaction module provides a Web interface built on FastAPI with a static front end, together with a batch processing script, wherein the Web interface supports four conversion modes of linear/binding/cyclopeptide-binding, and the batch processing script automatically adds sequence, topology and disulfide bond information to large-scale CSV data.
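The topology check of claim 3 and the disulfide annotation step of claim 1 can be illustrated on a toy graph. The following is a minimal sketch, not the patented implementation: it assumes the molecular graph is already available as plain Python data, with the main-chain atom indices supplied by a prior backbone search. The function names, the data layout, the 0-based residue indices and the simplified Cys-Gly-Cys example (oxygens and hydrogens omitted) are all illustrative assumptions.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Atom = Tuple[str, int]   # (element symbol, index of the owning residue)
Bond = FrozenSet[int]    # unordered pair of atom indices


def bond(a: int, b: int) -> Bond:
    """Build an order-independent bond key."""
    return frozenset((a, b))


def is_head_to_tail_cyclic(bonds: Set[Bond],
                           backbone: List[Tuple[int, int, int]]) -> bool:
    """True if the first residue's N is bonded directly to the last residue's C.

    `backbone` holds the (N, C-alpha, C) atom indices of each residue in order.
    """
    first_n, last_c = backbone[0][0], backbone[-1][2]
    return bond(first_n, last_c) in bonds


def find_disulfides(atoms: Dict[int, Atom],
                    bonds: Set[Bond]) -> List[Tuple[int, int]]:
    """Return sorted residue index pairs that are joined by an S-S bond."""
    pairs = set()
    for b in bonds:
        i, j = tuple(b)
        (elem_i, res_i), (elem_j, res_j) = atoms[i], atoms[j]
        if elem_i == elem_j == "S" and res_i != res_j:
            pairs.add((min(res_i, res_j), max(res_i, res_j)))
    return sorted(pairs)


# Toy Cys-Gly-Cys graph (assumption: only N, C and S atoms are kept).
elements = "N C C C S N C C N C C C S".split()
residues = [0] * 5 + [1] * 3 + [2] * 5
atoms = {i: (e, r) for i, (e, r) in enumerate(zip(elements, residues))}
backbone = [(0, 1, 2), (5, 6, 7), (8, 9, 10)]
bonds = {bond(a, b) for a, b in [
    (0, 1), (1, 2), (1, 3), (3, 4),          # residue 0: N, CA, C, CB, SG
    (2, 5), (5, 6), (6, 7),                  # peptide bond + residue 1
    (7, 8), (8, 9), (9, 10), (9, 11), (11, 12),  # peptide bond + residue 2
    (4, 12),                                 # S-S bond between residues 0 and 2
]}

print(is_head_to_tail_cyclic(bonds, backbone))
print(find_disulfides(atoms, bonds))
```

Adding a bond between atom 0 and atom 10 to `bonds` turns the same toy molecule into a head-to-tail cyclic peptide, which is the case the first function is meant to flag.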
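The library lookups in claims 4 to 6 (core library first, extension library second, two-stage matching, automatic increment) can be sketched as follows. This is a hedged illustration only. The class `TemplateLibrary`, its method names, the `X1`, `X2`, ... symbols for new entries and the 0.9 threshold are hypothetical, and the way the two-stage match and the increment step are chained is an assumption. To keep the example free of third-party dependencies, character bigrams of the SMILES string stand in for the Morgan fingerprints named in claim 5; bigrams are not chemically meaningful and would be replaced by real fingerprints in practice.

```python
from typing import Dict, FrozenSet, Optional, Tuple


def bigrams(smiles: str) -> FrozenSet[str]:
    """Stand-in fingerprint: the set of character bigrams of a SMILES string."""
    return frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))


def tanimoto(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


class TemplateLibrary:
    """Core library plus an extension library that grows automatically."""

    def __init__(self, core: Dict[str, str]) -> None:
        self.core = dict(core)               # normalized SMILES -> symbol
        self.extension: Dict[str, str] = {}  # entries added at run time

    def exact(self, smiles: str) -> Optional[str]:
        # Stage 1: exact match, core library first, extension library second.
        if smiles in self.core:
            return self.core[smiles]
        return self.extension.get(smiles)

    def similar(self, smiles: str, threshold: float) -> Optional[str]:
        # Stage 2: best similarity match, accepted only above the threshold.
        query = bigrams(smiles)
        best: Tuple[float, Optional[str]] = (0.0, None)
        for known, symbol in {**self.core, **self.extension}.items():
            score = tanimoto(query, bigrams(known))
            if score > best[0]:
                best = (score, symbol)
        return best[1] if best[0] >= threshold else None

    def resolve(self, smiles: str, threshold: float = 0.9) -> str:
        symbol = self.exact(smiles) or self.similar(smiles, threshold)
        if symbol is None:
            # Automatic increment: register the unknown structure.
            symbol = f"X{len(self.extension) + 1}"
            self.extension[smiles] = symbol
        return symbol


# Toy core library: glycine and L-alanine in free amino acid form.
lib = TemplateLibrary({"NCC(=O)O": "G", "C[C@H](N)C(=O)O": "A"})
print(lib.resolve("NCC(=O)O"))          # exact hit in the core library
print(lib.resolve("NC(CC1CC1)C(=O)O"))  # unknown: stored as a new entry
print(lib.extension)
```

Keeping the extension library separate from the core library means the curated entries are never overwritten, while structures first seen during SMILES → sequence conversion remain resolvable on later runs.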
Description
Bidirectional reversible conversion method and system between peptide molecule SMILES and sequence expression
Technical Field
The invention belongs to the technical field of computer-aided drug design, and particularly relates to a bidirectional reversible conversion method and system between peptide molecule SMILES and a sequence expression.
Background
In modern drug development, peptide compounds have become important drug candidates for treating various diseases (such as cancers, metabolic diseases and infectious diseases) owing to advantages such as high specificity and low toxicity. With the development of high-throughput screening, combinatorial chemistry and computational biology, massive amounts of polypeptide structure and activity data have been generated. SMILES, a compact text representation of molecular structure, is widely used for storage in molecular databases and as input to computational chemistry software, while polypeptide sequences (such as FASTA format or custom extended sequences) are an intuitive way to describe polypeptide composition and connectivity and form the basis for sequence alignment, machine learning model training and similar tasks. Accurate, stable and reversible conversion between SMILES and sequence is the bridge connecting polypeptide structure information with sequence information, and a precondition for subsequent data mining and model construction.
The structural diversity of polypeptide molecules gives them unique advantages in drug development, but also poses challenges for their structural representation and data processing. SMILES and sequence are two important molecular representations, and their interconversion is one of the fundamental problems in polypeptide informatics.
At present, some conversion tools and methods exist in the field, but they have obvious shortcomings when processing complex polypeptide structures and struggle to meet actual research and development requirements. For example:
1. Existing mainstream conversion tools, such as the simple sequence interface provided by RDKit and some online conversion websites (e.g., Peptide Calculator, SMILES Translator), typically support only limited types of polypeptides. Specifically, they support only the 20 natural amino acids and their D-isomers and lack support for the growing number of unnatural amino acids (such as modified and synthetic amino acids); they accept only FASTA or simple one-dimensional sequences as input and cannot process sequences carrying complex information such as end group or side-chain modifications; they do not support arbitrary monomer extension and cannot accurately identify polypeptide molecules containing modifying components such as PEG chains, fatty chains or glycosyl groups; and their handling of special topologies such as cyclic peptides and binding peptides is severely inadequate, with the topology information often ignored outright or parsed incorrectly. For example, the PeptideToSmiles function of RDKit can only handle linear polypeptides composed of natural amino acids and generates incorrect SMILES for polypeptides with simple end modifications such as N-terminal acetylation (Ac) and C-terminal amidation (NH2), while online conversion sites mostly target basic linear polypeptides, so when disulfide-containing or cyclic peptide structures are input, the conversion results often lose key topology information.
2. For SMILES-to-sequence conversion, most existing schemes rely on "fuzzy matching" or substructure matching algorithms based on residue fragment libraries.
The core idea of such methods is to split the SMILES string into sub-fragments and compare them with a preset amino acid fragment library to determine the corresponding amino acid residues. However, this approach has significant limitations: (1) Poor topology identification: such methods cannot reliably identify whether a polypeptide molecule is a linear or a cyclic peptide, and struggle to determine topology information such as whether N/C-terminal end group modifications or side-chain binding structures are present. For example, for head-to-tail cyclic peptides, existing methods often fail to recognize the cyclic linkage of the backbone and erroneously resolve them as linear peptides; for polypeptides containing side-chain binding, the linker structure of the binding moiety is often misidentified as an amino acid side chain or discarded outright. (2) Lacking strict backbone level resolution, existing methods focus on f