JP-7857363-B2 - Method and apparatus for predicting the structure of protein complexes


Inventors

朱昆睿
▲劉▼▲リー▼行
方▲暁▼敏
▲張▼肖男
何径舟

Assignees

Beijing Baidu Netcom Science Technology Co., Ltd. (ベイジンバイドゥネットコムサイエンステクノロジーカンパニーリミテッド)

Dates

Publication Date: 2026-05-12
Application Date: 2024-08-27
Priority Date: 2023-11-08

Claims (20)

A method for predicting the structure of a protein complex, comprising: obtaining initial coordinates of each amino acid residue in a target protein complex, and obtaining target residue pair features, first multiple sequence alignment features, and second multiple sequence alignment features of each protein monomer in the target protein complex; and inputting the initial coordinates of each amino acid residue and the target residue pair features, the first multiple sequence alignment features, and the second multiple sequence alignment features of each protein monomer into an N-stage folding repeat network layer, predicting, with the N-stage folding repeat network layer, the torsion angle, the residue-level positional transformation, and the monomer-chain-level positional transformation of each amino acid residue, obtaining target coordinates of each amino acid residue, and thereby obtaining a predicted structure of the protein complex; wherein the target residue pair features include template features and amino acid sequence pair features of the corresponding protein monomer, the first multiple sequence alignment features are obtained by regularizing the target multiple sequence alignment features of the corresponding protein monomer, the second multiple sequence alignment features are obtained by mapping the target multiple sequence alignment features of the corresponding protein monomer, and N is an integer greater than 1.
The method for predicting the structure of a protein complex according to claim 1, wherein predicting with the N-stage folding repeat network layer and obtaining the target coordinates of each amino acid residue comprises: inputting the initial coordinates, the target residue pair features, and the second multiple sequence alignment features into the first-stage folding repeat network layer, and predicting the residue-level positional transformation and the monomer-chain-level positional transformation of each amino acid residue, to obtain target residue code 1 and candidate positional transformation 1 of the first-stage folding repeat network layer; inputting the target residue pair features, together with target residue code m-1 and candidate positional transformation m-1 of the (m-1)-th-stage folding repeat network layer, into the m-th-stage folding repeat network layer, and predicting the residue-level positional transformation and the monomer-chain-level positional transformation of each amino acid residue, to obtain target residue code m and candidate positional transformation m of the m-th-stage folding repeat network layer, where m takes values from 2 to N; and using the N-th-stage folding repeat network layer to perform side-chain and torsion angle prediction on the first multiple sequence alignment features and target residue code N of the N-th-stage folding repeat network layer, to obtain the side-chain torsion angle of each amino acid residue, and obtaining the target coordinates of each amino acid residue based on the side-chain torsion angle of each amino acid residue and candidate positional transformation N of the N-th-stage folding repeat network layer.
The method for predicting the structure of a protein complex according to claim 2, wherein inputting the initial coordinates, the target residue pair features, and the second multiple sequence alignment features into the first-stage folding repeat network layer, predicting the residue-level positional transformation and the monomer-chain-level positional transformation of each amino acid residue, and obtaining target residue code 1 and candidate positional transformation 1 of the first-stage folding repeat network layer comprises: applying, with the first-stage folding repeat network layer, an invariant point attention mechanism and mapping processing to the initial coordinates, the target residue pair features, and the second multiple sequence alignment features, to obtain target residue code 1; performing residue-level positional transformation prediction on target residue code 1 to obtain first positional transformation 1 of each amino acid residue, and performing monomer-chain-level positional transformation prediction on target residue code 1 to obtain second positional transformation 1 of each amino acid residue; and performing a position update based on first positional transformation 1, second positional transformation 1, and the initial coordinates, to obtain candidate positional transformation 1 of the first-stage folding repeat network layer.
The method for predicting the structure of a protein complex according to claim 2, wherein inputting the target residue pair features, together with target residue code m-1 and candidate positional transformation m-1 of the (m-1)-th-stage folding repeat network layer, into the m-th-stage folding repeat network layer, predicting the residue-level positional transformation and the monomer-chain-level positional transformation of each amino acid residue, and obtaining target residue code m and candidate positional transformation m of the m-th-stage folding repeat network layer comprises: applying an invariant point attention mechanism and mapping processing to candidate positional transformation m-1 of the (m-1)-th-stage folding repeat network layer, as input to the m-th-stage folding repeat network layer, and to the target residue pair features, to obtain target residue code m; performing residue-level positional transformation prediction on target residue code m to obtain first positional transformation m of each amino acid residue, and performing monomer-chain-level positional transformation prediction on target residue code m to obtain second positional transformation m of each amino acid residue; and obtaining candidate positional transformation m of the m-th-stage folding repeat network layer based on first positional transformation m and second positional transformation m.
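The stage-by-stage update recited in the claims above can be illustrated with a minimal sketch. It assumes that each positional transformation is a rigid transform (a rotation R and a translation t), that the residue-level transformation is composed in the residue's local frame, and that the monomer-chain-level transformation is then applied to the whole chain. None of these choices is stated in the claims. The learned predictors are replaced here by precomputed lists, and all names (Rigid, compose, run_stages) are hypothetical.

```python
from typing import Tuple

Vec = Tuple[float, float, float]
Mat = Tuple[Vec, Vec, Vec]
Rigid = Tuple[Mat, Vec]  # (R, t), acting on a point x as R @ x + t

IDENTITY_ROT: Mat = ((1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0))
IDENTITY: Rigid = (IDENTITY_ROT, (0.0, 0.0, 0.0))


def mat_vec(m: Mat, v: Vec) -> Vec:
    """Multiply a 3x3 matrix by a 3-vector."""
    return tuple(sum(m[i][k] * v[k] for k in range(3)) for i in range(3))


def mat_mul(a: Mat, b: Mat) -> Mat:
    """Multiply two 3x3 matrices."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3))
        for i in range(3)
    )


def compose(outer: Rigid, inner: Rigid) -> Rigid:
    """Transform equivalent to applying `inner` first and `outer` second."""
    (r_out, t_out), (r_in, t_in) = outer, inner
    moved = mat_vec(r_out, t_in)
    return mat_mul(r_out, r_in), tuple(m + t for m, t in zip(moved, t_out))


def run_stages(initial_frames, chain_ids, residue_updates, chain_updates):
    """Apply N stages of residue-level and chain-level updates.

    residue_updates[m][i] is the (hypothetical) first positional
    transformation of residue i at stage m; chain_updates[m][c] is the
    second positional transformation shared by all residues of chain c.
    """
    frames = list(initial_frames)
    for res_upd, chain_upd in zip(residue_updates, chain_updates):
        frames = [
            # Assumed order: local residue update, then whole-chain update.
            compose(chain_upd[chain], compose(frame, res_upd[i]))
            for i, (frame, chain) in enumerate(zip(frames, chain_ids))
        ]
    return frames
```

In this sketch every residue of a chain receives the same chain-level transform, so one chain can be moved relative to another without disturbing its internal geometry.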
The method for predicting the structure of a protein complex according to claim 3, wherein performing residue-level positional transformation prediction on the target residue code of each amino acid residue to obtain the first positional transformation of each amino acid residue comprises: mapping the target residue code of each amino acid residue based on a backbone update algorithm, to obtain the first positional transformation of each amino acid residue.
The method for predicting the structure of a protein complex according to claim 3, wherein performing monomer-chain-level positional transformation prediction on the target residue code of each amino acid residue to obtain the second positional transformation of each amino acid residue comprises: for the amino acid residues, dividing two or more adjacent amino acid residues into different monomer chains based on the target residue codes of the amino acid residues; and, for any one monomer chain, averaging the target residue codes of the amino acid residues in that monomer chain to obtain a chain-level candidate residue code, and mapping the candidate residue code based on a multilayer neural network structure to obtain the second positional transformation of each amino acid residue in the monomer chain.
The method for predicting the structure of a protein complex according to claim 6, wherein the multilayer neural network structure includes a three-layer linear network, and mapping the candidate residue code based on the multilayer neural network structure to obtain the second positional transformation of each amino acid residue in the monomer chain comprises: inputting the candidate residue code into a first linear network and mapping it to obtain a first transformation representation; inputting the first transformation representation into a second linear network and mapping it to obtain a second transformation representation; and inputting the first transformation representation and the second transformation representation into a third linear network and mapping them to obtain the second positional transformation of each amino acid residue in the monomer chain.
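The chain-level averaging and three-layer linear mapping recited in the two claims above might look roughly like the following sketch. Several points are assumptions rather than statements from the claims: chain membership is supplied directly as chain_ids, the first and second transformation representations are combined by concatenation before the third linear network, and no activation function is used between layers. The weights in any real system would be learned; the function names are hypothetical.

```python
def linear(weight, bias, x):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def chain_mean_codes(codes, chain_ids):
    """Average the residue codes of each monomer chain (chain-level candidate code)."""
    sums, counts = {}, {}
    for code, chain in zip(codes, chain_ids):
        acc = sums.setdefault(chain, [0.0] * len(code))
        for k, value in enumerate(code):
            acc[k] += value
        counts[chain] = counts.get(chain, 0) + 1
    return {chain: [v / counts[chain] for v in acc] for chain, acc in sums.items()}


def chain_transform_params(chain_code, layer1, layer2, layer3):
    """Map a chain-level code through three linear networks.

    Each layer is a (weight, bias) pair. The third layer sees both the first
    and second representations; concatenation is an assumption of this sketch.
    """
    first = linear(*layer1, chain_code)
    second = linear(*layer2, first)
    return linear(*layer3, first + second)  # list concatenation, not addition
```

The output vector would then be interpreted as the parameters of the chain's positional transformation and shared by every residue in that chain.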
The method for predicting the structure of a protein complex according to claim 1, wherein obtaining the target residue pair features of each protein monomer in the target protein complex comprises: obtaining template features of each protein monomer and constructing amino acid sequence pair features of each protein monomer; inputting the template features of each protein monomer into a linear network for mapping, and adding the result to the pair features of each protein monomer to obtain candidate residue pair features; and inputting the candidate residue pair features into a pre-configured encoder for encoding, to obtain the target residue pair features of each protein monomer.
The method for predicting the structure of a protein complex according to claim 8, wherein obtaining the template features of each protein monomer comprises: matching the target amino acid sequence of each protein monomer against a plurality of first amino acid sequences in a protein structure database to obtain a second amino acid sequence whose similarity is greater than a predetermined threshold; and extracting the distances between the coordinates of the amino acid residues in the second amino acid sequence and using them as the template features of each protein monomer.
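As an illustration of the distance-based template features in the claim above, the following sketch computes a pairwise distance matrix from one coordinate per residue. Which atom represents a residue (for example the alpha carbon) is not stated in the claim and is an assumption here; the database search and similarity threshold are omitted, and the function name is hypothetical.

```python
import math


def template_distance_features(coords):
    """Return the matrix of Euclidean distances between residue coordinates.

    coords holds one (x, y, z) point per residue of the matched template
    sequence; entry [i][j] is the distance between residues i and j.
    """
    return [[math.dist(a, b) for b in coords] for a in coords]
```

The result is symmetric with a zero diagonal, so it depends only on the template's internal geometry and not on how the template is placed in space.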
The method for predicting the structure of a protein complex according to claim 8, wherein constructing the amino acid sequence pair features of each protein monomer comprises: inputting the amino acid sequence of each protein monomer into two pre-configured linear networks to obtain candidate sequence coding features; adding one empty dimension to each of the candidate sequence coding features, in different directions, to obtain a first sequence coding feature and a second sequence coding feature; and adding the first sequence coding feature and the second sequence coding feature to obtain the pair features of each protein monomer.
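The "empty dimension in different directions" construction in the claim above corresponds to a broadcast sum: one per-residue encoding is viewed as a column, the other as a row, and their sum yields a feature for every residue pair. A minimal sketch follows, taking the outputs of the two linear networks as given inputs; the function and argument names are hypothetical.

```python
def pair_features(first_code, second_code):
    """Build pair features z[i][j][k] = first_code[i][k] + second_code[j][k].

    first_code and second_code are the per-residue outputs of the two linear
    networks, each a list of L channel vectors. Conceptually the first is
    reshaped to (L, 1, C) and the second to (1, L, C) before the addition.
    """
    return [
        [[a + b for a, b in zip(row_i, row_j)] for row_j in second_code]
        for row_i in first_code
    ]
```

For a sequence of length L with C channels, the result has shape (L, L, C), which matches the shape needed for a residue pair representation.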
The method for predicting the structure of a protein complex according to claim 1, wherein obtaining the first multiple sequence alignment features and the second multiple sequence alignment features of each protein monomer in the target protein complex comprises: searching a plurality of gene sequence databases, based on the target amino acid sequence of each protein monomer, to obtain homologous sequences of each protein monomer; performing multiple sequence alignment on the homologous sequences of each protein monomer to obtain candidate multiple sequence alignment features of each protein monomer; inputting the candidate multiple sequence alignment features of each protein monomer into a pre-configured encoder for encoding, to obtain the target multiple sequence alignment features of each protein monomer; and regularizing the target multiple sequence alignment features of each protein monomer to obtain the first multiple sequence alignment features of each protein monomer, and mapping the target multiple sequence alignment features of each protein monomer to obtain the second multiple sequence alignment features of each protein monomer.
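The final step of the claim above derives two feature sets from the same encoded alignment. The claim does not say what "regularizing" or "mapping" mean concretely. The sketch below assumes one plausible reading: layer normalization over the channel dimension for the first feature, and a linear projection for the second. Both choices, and all names, are assumptions for illustration only.

```python
import math


def layer_norm(row, eps=1e-5):
    """Normalize one channel vector to zero mean and (near) unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]


def linear(weight, bias, x):
    """Affine map y = W x + b on plain Python lists."""
    return [sum(w * v for w, v in zip(w_row, x)) + b for w_row, b in zip(weight, bias)]


def msa_features(target_msa, weight, bias):
    """Derive the first (normalized) and second (projected) MSA features.

    target_msa[s][i] is the channel vector for aligned sequence s at residue
    position i, as produced by the encoder.
    """
    first = [[layer_norm(cell) for cell in seq] for seq in target_msa]
    second = [[linear(weight, bias, cell) for cell in seq] for seq in target_msa]
    return first, second
```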
A protein complex structure prediction device, comprising: an acquisition module configured to obtain initial coordinates of each amino acid residue in a target protein complex, and to obtain target residue pair features, first multiple sequence alignment features, and second multiple sequence alignment features of each protein monomer in the target protein complex; and a structure prediction module configured to input the initial coordinates of each amino acid residue and the target residue pair features, the first multiple sequence alignment features, and the second multiple sequence alignment features of each protein monomer into an N-stage folding repeat network layer, and to predict, with the N-stage folding repeat network layer, the torsion angle, the residue-level positional transformation, and the monomer-chain-level positional transformation of each amino acid residue, thereby obtaining target coordinates of each amino acid residue and a predicted structure of the protein complex; wherein the target residue pair features include template features and amino acid sequence pair features of the corresponding protein monomer, the first multiple sequence alignment features are obtained by regularizing the target multiple sequence alignment features of the corresponding protein monomer, the second multiple sequence alignment features are obtained by mapping the target multiple sequence alignment features of the corresponding protein monomer, and N is an integer greater than 1.
The protein complex structure prediction device according to claim 12, wherein the structure prediction module is further configured to: input the initial coordinates, the target residue pair features, and the second multiple sequence alignment features into the first-stage folding repeat network layer, and predict the residue-level positional transformation and the monomer-chain-level positional transformation of each amino acid residue, to obtain target residue code 1 and candidate positional transformation 1 of the first-stage folding repeat network layer; for the m-th-stage folding repeat network layer, input the target residue pair features, together with target residue code m-1 and candidate positional transformation m-1 of the (m-1)-th-stage folding repeat network layer, into the m-th-stage folding repeat network layer, and predict the residue-level positional transformation and the monomer-chain-level positional transformation of each amino acid residue, to obtain target residue code m and candidate positional transformation m of the m-th-stage folding repeat network layer, where m takes values from 2 to N; and use the N-th-stage folding repeat network layer to perform side-chain and torsion angle prediction on the first multiple sequence alignment features and target residue code N of the N-th-stage folding repeat network layer, to obtain the side-chain torsion angle of each amino acid residue, and obtain the target coordinates of each amino acid residue based on the side-chain torsion angle of each amino acid residue and candidate positional transformation N of the N-th-stage folding repeat network layer.
The protein complex structure prediction device according to claim 13, wherein the structure prediction module is further configured to: apply, with the first-stage folding repeat network layer, an invariant point attention mechanism and mapping processing to the initial coordinates, the target residue pair features, and the second multiple sequence alignment features, to obtain target residue code 1; perform residue-level positional transformation prediction on target residue code 1 to obtain first positional transformation 1 of each amino acid residue, and perform monomer-chain-level positional transformation prediction on target residue code 1 to obtain second positional transformation 1 of each amino acid residue; and perform a position update based on first positional transformation 1, second positional transformation 1, and the initial coordinates, to obtain candidate positional transformation 1 of the first-stage folding repeat network layer.
The protein complex structure prediction device according to claim 13, wherein the structure prediction module is further configured to: apply an invariant point attention mechanism and mapping processing to candidate positional transformation m-1 of the (m-1)-th-stage folding repeat network layer, as input to the m-th-stage folding repeat network layer, and to the target residue pair features, to obtain target residue code m; perform residue-level positional transformation prediction on target residue code m to obtain first positional transformation m of each amino acid residue, and perform monomer-chain-level positional transformation prediction on target residue code m to obtain second positional transformation m of each amino acid residue; and obtain candidate positional transformation m of the m-th-stage folding repeat network layer based on first positional transformation m and second positional transformation m.
The protein complex structure prediction device according to claim 14, wherein the structure prediction module is further configured to map the target residue code of each amino acid residue based on a backbone update algorithm, to obtain the first positional transformation of each amino acid residue.
The protein complex structure prediction device according to claim 14, wherein the structure prediction module is further configured to: for the amino acid residues, divide two or more adjacent amino acid residues into different monomer chains based on the target residue codes of the amino acid residues; and, for any one monomer chain, average the target residue codes of the amino acid residues in that monomer chain to obtain a chain-level candidate residue code, and map the candidate residue code based on a multilayer neural network structure to obtain the second positional transformation of each amino acid residue in the monomer chain.
The protein complex structure prediction device according to claim 17, wherein the multilayer neural network structure includes a three-layer linear network, and the structure prediction module is further configured to: input the candidate residue code into a first linear network and map it to obtain a first transformation representation; input the first transformation representation into a second linear network and map it to obtain a second transformation representation; and input the first transformation representation and the second transformation representation into a third linear network and map them to obtain the second positional transformation of each amino acid residue in the monomer chain.
The protein complex structure prediction device according to claim 12, wherein the acquisition module is further configured to: obtain template features of each protein monomer and construct amino acid sequence pair features of each protein monomer; input the template features of each protein monomer into a linear network for mapping, and add the result to the pair features of each protein monomer to obtain candidate residue pair features; and input the candidate residue pair features into a pre-configured encoder for encoding, to obtain the target residue pair features of each protein monomer.
The protein complex structure prediction device according to claim 19, wherein the acquisition module is further configured to: match the target amino acid sequence of each protein monomer against a plurality of first amino acid sequences in a protein structure database to obtain a second amino acid sequence whose similarity is greater than a predetermined threshold; and extract the distances between the coordinates of the amino acid residues in the second amino acid sequence and use them as the template features of each protein monomer.

Description

This disclosure relates to the field of artificial intelligence, and in particular to technologies such as natural language processing and biocomputing.
A protein complex is a stable macromolecular assembly formed by the interaction of two or more protein molecules. Protein complexes play crucial roles in biological functions such as enzymatic reactions, cell signaling, metabolic regulation, and gene expression. The function of a protein is largely determined by its spatial structure, so techniques that predict the three-dimensional (tertiary) structure of a protein from the amino acid types along its chain (primary structure) have very high research value in the life sciences. Accurately predicting protein structures, improving the efficiency of protein complex structure prediction, and supporting various biological applications has therefore become an important research direction.
The drawings are provided to aid understanding of this solution and do not limit the scope of this disclosure. They comprise, in order: a flowchart of a method for predicting the structure of a protein complex according to one embodiment of the present disclosure; a second such flowchart; a structural diagram of a protein complex structure prediction method according to one embodiment of the present disclosure; a third such flowchart; a second such structural diagram; a structural diagram of a protein complex structure prediction device according to one embodiment of the present disclosure; and a block diagram of an electronic device that implements the method of the embodiments of this disclosure.
The following description, taken together with the drawings, sets out exemplary embodiments of the present disclosure. It includes various details of those embodiments to aid understanding, and these details should be regarded as merely illustrative. Those skilled in the art will therefore recognize that various changes and modifications can be made to the embodiments without departing from the scope and spirit of the disclosure. Likewise, for clarity and brevity, descriptions of well-known functions and structures are omitted below.
The embodiments described herein relate to artificial intelligence technologies such as computer vision and deep learning.
Artificial intelligence (AI) is a scientific field that researches and develops theories, methods, techniques, and application systems for simulating and extending human intelligence.
Natural language processing (NLP) is an important area of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and it draws on linguistics, computer science, and mathematics. Research in this field deals with natural language, that is, the language people use in everyday life, and is therefore closely related to linguistics, but there is an important difference: NLP does not study natural language in general, but rather computer systems, particularly software systems, that can effectively carry out natural language communication. It is thus a part of computer science.
Biocomputing refers to a new mode of computing developed by exploiting the information processing mechanisms inherent in biological systems. Biocomputing research covers two aspects, devices and systems. It uses ordered systems built from organic (or biological) materials at the molecular scale to provide, through physical and chemical processes at the molecular level, basic units for detecting, processing, transmitting, and storing information.
The method and apparatus for predicting the structure of protein complexes according to this disclosure are described below with reference to the drawings.
Figure 1 is a flowchart of a method for predicting the structure of a protein complex according to one embodiment of the present disclosure. As shown in Figure 1, the method includes steps S101 to S102.
In S101, the initial coordinates of each amino acid residue in the target protein complex are obtained, together with the target residue pair features, first multiple sequence alignment features, and second multiple sequence alignment features of each protein monomer in the target protein complex.
A protein complex has multiple protein monomers, and each protein monomer has one amino acid sequence. When amino acids join to form a peptide bond, one molecule of water is lost, so the amino acid unit in a polypeptide or protein is an amino acid residue. In the embodiment