CN-115713972-B - Graphormer algorithm-based protein sequence design method, Graphormer algorithm-based protein sequence design device and storage medium


Abstract

The invention relates to a protein sequence design method, device and storage medium based on the Graphormer algorithm. The method comprises: representing a protein structure as a graph, with individual amino acids as nodes and the connections between amino acids as edges; extracting initial edge features and initial node features of the protein; splicing the initial node features with a random matrix, adjusting the dimension through a linear layer, and adding position coding information to obtain node features, which serve as the input of a GPD model; splicing the initial edge features by matrix concatenation and passing the result through two linear layers to obtain edge features, which are embedded into an attention matrix of the GPD model; constructing and training the GPD model for fixed-backbone protein sequence design; and designing the protein sequence based on the GPD model, the node features and the edge features. Compared with the prior art, the invention achieves a higher sequence recovery rate and higher diversity among the designed sequences.
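
The node feature construction summarized above (splice with a random matrix, linear layer, position coding) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names, all dimensions, the 3-class secondary structure alphabet, the 21-token sequence vocabulary, the sinusoidal form of the position coding, and the randomly initialized stand-in weights are hypothetical choices that the source does not specify.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    """Sinusoidal position coding (assumed form; the source only says 'position coding')."""
    pos = np.arange(length)[:, None]
    i = np.arange(dim)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def build_node_features(dihedrals, ss_ids, seq_ids, centrality,
                        seed=0, embed_dim=8, noise_dim=8, model_dim=32):
    """Splice per-residue features with a seeded random matrix, then project.

    dihedrals:  (n, 3) backbone dihedral angles in radians
    ss_ids:     (n,) predicted secondary structure class ids (assumed 3 classes)
    seq_ids:    (n,) initial sequence token ids (assumed 21 tokens)
    centrality: (n,) amino acid centrality values
    """
    n = dihedrals.shape[0]
    # Stand-ins for trained parameters (embedding tables and linear layer).
    params = np.random.default_rng(12345)
    ss_table = params.normal(size=(3, embed_dim))
    seq_table = params.normal(size=(21, embed_dim))
    # Normally distributed random matrix generated from the random number seed.
    noise = np.random.default_rng(seed).normal(size=(n, noise_dim))
    spliced = np.concatenate(
        [np.sin(dihedrals), np.cos(dihedrals),
         ss_table[ss_ids], seq_table[seq_ids],
         centrality[:, None], noise],
        axis=1,
    )
    # Linear layer adjusts the spliced width to the model dimension.
    weight = params.normal(size=(spliced.shape[1], model_dim)) / np.sqrt(spliced.shape[1])
    bias = np.zeros(model_dim)
    return spliced @ weight + bias + sinusoidal_positions(n, model_dim)
```

In this sketch, changing `seed` changes only the random matrix, so the same backbone yields different node features and, downstream, potentially different designed sequences.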

Inventors

CHEN HAIFENG
WEI TING

Assignees

Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date: 20260505
Application Date: 20221118

Claims (4)

1. A protein sequence design method based on the Graphormer algorithm, characterized by comprising the following steps:
S1, representing a protein structure as a graph, wherein individual amino acids are taken as nodes and the connections between amino acids are taken as edges;
S2, extracting initial edge features of the protein, wherein the initial edge features comprise a distance matrix, a displacement vector, a rotation quaternion and a residue shortest path;
S3, extracting initial node features of the protein, wherein the initial node features comprise dihedral angles, a predicted secondary structure, amino acid centrality and an initial protein sequence code, and both the initial node features and the initial edge features are invariant to translation and rotation;
S4, splicing the initial node features of the protein with a random matrix, adjusting the dimension through a linear layer, and adding position coding information to obtain node features, the node features being used as the input of a GPD model;
wherein splicing the initial node features of the protein with a random matrix specifically comprises: calculating a dihedral angle sine matrix and a dihedral angle cosine matrix respectively; inputting the predicted secondary structure and the initial protein sequence code into an embedding layer to obtain a secondary structure embedding matrix and an initial protein sequence code embedding matrix; generating a normally distributed random matrix based on a random number seed; and splicing the dihedral angle sine matrix, the dihedral angle cosine matrix, the secondary structure embedding matrix, the initial protein sequence code embedding matrix, an amino acid centrality matrix and the random matrix;
S5, splicing the initial edge features of the protein by matrix concatenation and passing the result through two linear layers to obtain edge features, and embedding the edge features into an attention matrix of the GPD model;
S6, constructing and training the GPD model for fixed-backbone protein sequence design, wherein the GPD model comprises 6 identical Graphormer modules, a linear layer and a softmax which are sequentially connected, each Graphormer module is built on a Graphormer block, and the Graphormer block comprises the attention matrix;
the Graphormer module comprises a Graphormer block, a first regularization module, a feedforward module and a second regularization module which are sequentially connected, wherein the input of the first regularization module comprises the output of the Graphormer block and the initial node features, and the input of the second regularization module comprises the output of the feedforward module and the output of the first regularization module;
the processing performed by the Graphormer block comprises: passing the input node features through three linear layers to obtain Q, K and V matrices respectively; multiplying the Q matrix and the K matrix to obtain a result matrix; inputting the result matrix and the edge features into the attention matrix after softmax processing; multiplying the output of the attention matrix by the V matrix; and passing the product through one linear layer to obtain the output of the Graphormer block;
S7, designing a protein sequence based on the GPD model, the node features and the edge features.
2. The protein sequence design method based on the Graphormer algorithm according to claim 1, wherein the GPD model is trained by minimizing a loss function with the Adam optimizer, the loss function being categorical cross-entropy.
3. A protein sequence design device based on the Graphormer algorithm, comprising a memory, a processor and a program stored in the memory, wherein the processor implements the method according to any one of claims 1-2 when executing the program.
4. A storage medium having a program stored thereon, wherein the program, when executed, implements the method according to any one of claims 1-2.
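
The Graphormer block recited in claim 1 (Q, K and V from three linear layers, edge features embedded into the attention matrix, multiplication by V, one output linear layer) can be illustrated with the single-head sketch below. It is one possible reading under stated assumptions, not the patented implementation: adding the edge term to the scaled scores before the softmax follows the usual Graphormer convention and is an interpretation of the claim wording; the ReLU between the two edge linear layers, the scaling by the square root of the key width, the 9-value per-pair edge layout (1 distance, 3 displacement, 4 quaternion, 1 shortest path), and all function and weight names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_bias(edge_feats, w1, w2):
    """Two linear layers map (n, n, d_edge) pair features to one scalar per residue pair."""
    hidden = np.maximum(edge_feats @ w1, 0.0)  # ReLU between the layers is an assumption
    return (hidden @ w2)[..., 0]               # shape (n, n)

def graphormer_block(x, bias, wq, wk, wv, wo):
    """Single-head attention over residues with an additive edge term.

    x:    (n, d) node features
    bias: (n, n) edge term from edge_bias
    Returns the block output (n, d) and the attention matrix (n, n).
    """
    q, k, v = x @ wq, x @ wk, x @ wv          # three linear layers
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaling is an assumption
    attn = softmax(scores + bias, axis=-1)    # edge features enter the attention matrix
    return (attn @ v) @ wo, attn              # multiply by V, then one linear layer

# Toy usage with random stand-ins for trained weights.
rng = np.random.default_rng(0)
n, d, d_edge, d_hidden = 6, 16, 9, 8
x = rng.normal(size=(n, d))
pair_feats = rng.normal(size=(n, n, d_edge))
w1 = rng.normal(size=(d_edge, d_hidden))
w2 = rng.normal(size=(d_hidden, 1))
wq, wk, wv, wo = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(4))

bias = edge_bias(pair_feats, w1, w2)
out, attn = graphormer_block(x, bias, wq, wk, wv, wo)
```

Because the edge term is added inside the softmax in this reading, a strongly negative value for a residue pair suppresses attention between those two residues, while each row of the attention matrix still sums to 1.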

Description

Graphormer algorithm-based protein sequence design method, Graphormer algorithm-based protein sequence design device and storage medium

Technical Field

The invention relates to the technical field of protein sequence design model construction, and in particular to a protein sequence design method, device and storage medium based on the Graphormer algorithm.

Background

De novo protein design aims to design proteins with specific structures or functions. Protein design is a core problem in protein engineering; for example, it is used to improve the catalytic efficiency of enzymes and the affinity of antibodies. Protein design involves two key tasks: protein backbone design and fixed-backbone protein sequence design. Fixed-backbone protein sequence design aims to design an amino acid sequence that folds into a specific protein backbone structure; specifically, the designed sequence must fold into the desired structure and must also have a specific function. This task is also known as the inverse protein folding problem.

Methods for fixed-backbone protein sequence design fall into two categories: protein sequence design based on classical energy functions and protein sequence design based on deep learning. Energy-function-based methods, such as the widely used Rosetta family of methods, search over combinations of sequence and conformation to minimize the energy function of the target structure. These methods rely not only on an accurate definition of the protein energy function but also on the efficiency of the sampling algorithm, and their accuracy and computation speed still need to be improved.

With the rapid development of deep learning, deep-learning-based protein sequence design has achieved good results in recent years. It can provide rapid and accurate protein design and has brought about a revolution in the field. The Po-Ssu Huang laboratory constructed a 3D CNN model that predicts residue types and rotamer dihedral angles in an autoregressive manner. ProteinSolver encodes nodes as amino acid types and edges as distances between amino acids, and treats sequence design as a constraint satisfaction problem. Structure Transformer generalized the Transformer to graph-based encoding of protein three-dimensional structure. The ESM-IF1 model was trained using 12 million structures predicted by AlphaFold. ProteinMPNN extends Structure Transformer by adding a virtual Cβ atom and decoding in random order instead of forward order.

The approaches described above aim to increase the sequence recovery of the model while ignoring diversity among the designed sequences. As a result, the designed sequences cover too little of sequence space and tend to closely resemble the native sequence, particularly in the protein core. Ideally, the designed sequences should cover a wide region of protein sequence space with high sequence diversity. Some current methods also employ hyperparameters to increase the diversity and variability of the designed sequences. For example, Structure Transformer and ABACUS-R construct a biased distribution by means of hyperparameters (temperature T in Structure Transformer and alpha in ABACUS-R). However, the hyperparameter values differ between methods, which makes the choice highly subjective.

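The trade-off between sequence recovery and diversity discussed above can be made concrete with a small sketch of temperature-scaled sampling, the generic mechanism behind a temperature hyperparameter. It is an illustration only and does not reproduce Structure Transformer, ABACUS-R or the GPD model: the function names, the 20-letter alphabet ordering and the toy logits are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues

def sample_sequence(logits, temperature, rng):
    """Sample one residue per position from softmax(logits / temperature)."""
    scaled = logits / temperature
    scaled = scaled - scaled.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scaled)
    probs /= probs.sum(axis=-1, keepdims=True)
    picks = [rng.choice(len(AMINO_ACIDS), p=p) for p in probs]
    return "".join(AMINO_ACIDS[i] for i in picks)

def sequence_recovery(designed, native):
    """Fraction of positions where the designed residue equals the native residue."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    return sum(a == b for a, b in zip(designed, native)) / len(native)

# Toy logits that favor a made-up native sequence at every position.
native = "MKTAYIAKQR"
logits = np.zeros((len(native), len(AMINO_ACIDS)))
for pos, residue in enumerate(native):
    logits[pos, AMINO_ACIDS.index(residue)] = 10.0

rng = np.random.default_rng(0)
cold = sample_sequence(logits, temperature=0.01, rng=rng)  # near-greedy decoding
warm = sample_sequence(logits, temperature=5.0, rng=rng)   # flatter distribution
print(sequence_recovery(cold, native), sequence_recovery(warm, native))
```

A low temperature concentrates probability on the top-scoring residue, which favors recovery; a high temperature flattens the distribution, which favors diversity. The appropriate value has to be chosen by hand, which is the subjectivity the passage above points out.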
Disclosure of Invention

The invention aims to provide a protein sequence design method, device and storage medium based on the Graphormer algorithm that maintain a high sequence recovery rate while improving diversity among the designed sequences.

The aim of the invention can be achieved by the following technical scheme:

A protein sequence design method based on the Graphormer algorithm comprises the following steps:
S1, representing a protein structure as a graph, wherein individual amino acids are taken as nodes and the connections between amino acids are taken as edges;
S2, extracting initial edge features of the protein;
S3, extracting initial node features of the protein;
S4, splicing the initial node features of the protein with a random matrix, adjusting the dimension through a linear layer, and adding position coding information to obtain node features, the node features being used as the input of a GPD model;
S5, splicing the initial edge features of the protein by matrix concatenation and passing the result through two linear layers to obtain edge features, and embedding the edge features into an attention matrix of the GPD model;
S6, constructing and training a GPD model for fixed-backbone protein sequence design, wherein the GPD model co