EP-4738367-A1 - PREDICTION OF PROTEIN STRUCTURE ENSEMBLES

Abstract

A computing system (10) for predicting protein structure ensembles includes processing circuitry (18) configured to, in a first training phase, ingest a synthetic dataset of protein sequences, perform structure-based clustering on the synthetic dataset to produce clusters of protein structures, filter the clusters of protein structures, and train a diffusion model (66) on training pairs. In a second training phase, the processing circuitry (18) receives a predicted protein structure for an input training protein sequence from the diffusion model (66), and compares the predicted protein structure to a corresponding training protein structure from a molecular dynamics simulation. In a third training phase, the processing circuitry (18) receives a predicted value for a property of sampled protein structures, compares the predicted value to an actual value of the property, and backpropagates the diffusion model (66) with the difference. The diffusion model (66) estimates a denoised protein structure corresponding to the given protein sequence in fifteen or fewer denoising steps.
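
As a non-limiting illustration, the comparison performed in the third training phase may be sketched as follows. The sketch assumes that the property is the mean distance between two amino acids over the sampled structures (one of the properties named in the claims), and that the loss is the squared difference. The source states only that a difference is calculated and backpropagated to minimize a loss function, so the loss form is an assumption. The function names and the coordinate layout are hypothetical, and the gradient step itself is omitted because it would require an automatic differentiation framework.

```python
import math


def mean_residue_distance(structures, i, j):
    """Average Euclidean distance between residues i and j over an ensemble.

    `structures` is a list of sampled structures; each structure is a list of
    (x, y, z) tuples, one per residue (a hypothetical, simplified layout).
    """
    total = 0.0
    for coords in structures:
        total += math.dist(coords[i], coords[j])
    return total / len(structures)


def property_loss(structures, i, j, actual_value):
    """Compare the ensemble property to its actual value.

    Returns the predicted value, the signed difference, and a squared-error
    loss. The squared-error form is an assumption made for this sketch.
    """
    predicted = mean_residue_distance(structures, i, j)
    difference = predicted - actual_value
    return predicted, difference, difference ** 2


if __name__ == "__main__":
    # Two toy "sampled structures", each with two residues.
    ensemble = [
        [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
        [(0.0, 0.0, 0.0), (0.0, 0.0, 3.0)],
    ]
    print(property_loss(ensemble, 0, 1, actual_value=3.5))
```

In an actual training loop, the difference would be propagated back through the sampling procedure to update the parameters of the diffusion model; that part is not shown here.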

Inventors

FOONG, YUE KWANG
Noe, Frank
SCHNEUING, ARNE
YANG, Soojung
JIMENEZ LUNA, JOSE SALVADOR
CLEGG, SARAH
ABDIN, Osama
Gastegger, Michael
XIE, YU
HEMPEL, Tim
SATORRAS, VICTOR GARCÍA
VEELING, BASTIAAN SJOUKE

Assignees

Microsoft Technology Licensing, LLC

Dates

Publication Date: 20260506
Application Date: 20251024

Claims (15)

A computing system (10) for predicting protein structure ensembles, comprising: a computing device (14) including processing circuitry (18) configured to execute instructions (28) using portions of associated memory (22) to implement a protein structure ensemble prediction model (30), the processing circuitry (18) being configured to: in a first training phase, ingest a synthetic dataset of protein sequences, identify protein sequences in the synthetic dataset having structurally heterogeneous predictions, perform structure-based clustering on the identified protein sequences based on the structurally heterogeneous predictions to produce clusters of predicted protein structures, filter the clusters of predicted protein structures to remove disordered predicted protein structures and clusters comprising a single predicted protein structure, generate training pairs for a diffusion model (66) included in the protein structure ensemble prediction model (30), and train the diffusion model (66) on the training pairs; in a second training phase, sample a training protein sequence and a corresponding training protein structure from a molecular dynamics simulation, corrupt the corresponding training protein structure, input the training protein sequence and the corrupted version of the corresponding training protein structure into the diffusion model (66), receive, from the diffusion model (66), a predicted uncorrupted protein structure corresponding to the input training protein sequence, and compare the predicted uncorrupted protein structure from the diffusion model (66) to the corresponding training protein structure sampled from the molecular dynamics simulation; and in a third training phase, instruct the diffusion model (66) to sample a plurality of structures for a given protein sequence, receive, from the diffusion model (66), a predicted value for a property of a distribution of the plurality of sampled structures, compare the predicted value of the property from the diffusion model (66) to an actual value of the property, calculate a difference between the predicted value of the property and the actual value of the property, and backpropagate the diffusion model (66) with the calculated difference to minimize a loss function, wherein the diffusion model (66) is configured to estimate a denoised protein structure corresponding to the given protein sequence in fifteen or fewer denoising steps.
The computing system according to claim 1, wherein the protein sequences having structurally heterogeneous predictions are identified via many-against-many sequence searching, and/or wherein the structure-based clustering is performed using a protein structure alignment server.
The computing system according to claim 1 or claim 2, wherein to generate the training pairs for the diffusion model, the processing circuitry is configured to: randomly select a predicted protein structure from a randomly selected cluster of predicted protein structures, and pair the randomly selected predicted protein structure with a protein sequence that corresponds to a predicted protein structure having a highest predicted local distance difference test value from within the randomly selected cluster.
The computing system according to any one of claims 1 to 3, wherein the property of the distribution of the plurality of sampled structures is one of a class, a value, and a tensor, and optionally, wherein the property is a value of a free energy difference between different metastable states or a mean value of a distance between two amino acids in a three-dimensional structure of the protein structure or wherein the value of the free energy difference includes a value of a free energy difference between folded and unfolded states of the protein structure.
The computing system according to any one of claims 1 to 4, wherein when the molecular dynamics simulation does not reach equilibrium within a simulation timeframe, a re-weighting procedure is performed over protein structures in the molecular dynamics simulation to approximate an equilibrium distribution, and during the second training phase, the corresponding training protein structure is sampled from the molecular dynamics simulation with a probability according to the re-weighted protein structures.
A computerized method (400) for training a model (30) to predict protein structure ensembles utilizing processing circuitry (18) and memory (22) of one or more computing devices (14), the method comprising: in a first training phase, ingesting (402) a synthetic dataset of protein sequences, identifying (404) protein sequences in the synthetic dataset having structurally heterogeneous predictions, performing (406) structure-based clustering on the identified protein sequences based on the structurally heterogeneous predictions to produce clusters of predicted protein structures, filtering (408) the clusters of predicted protein structures to remove disordered predicted protein structures and clusters comprising a single predicted protein structure, generating (410) training pairs for a diffusion model (66) included in the protein structure ensemble prediction model (30), and training (412) the diffusion model (66) on the training pairs; in a second training phase, sampling (414) a training protein sequence and a corresponding training protein structure from a molecular dynamics simulation, corrupting (416) the corresponding training protein structure, inputting (418) the training protein sequence and the corrupted version of the corresponding training protein structure into the diffusion model (66), receiving (420), from the diffusion model (66), a predicted uncorrupted protein structure corresponding to the input training protein sequence, and comparing (422) the predicted uncorrupted protein structure from the diffusion model (66) to the corresponding training protein structure sampled from the molecular dynamics simulation; and in a third training phase, instructing (424) the diffusion model (66) to sample a plurality of structures for a given protein sequence, receiving (426), from the diffusion model (66), a predicted value for a property of a distribution of the plurality of sampled structures, comparing (428) the predicted value of the property from the diffusion model (66) to an actual value of the property, calculating (430) a difference between the predicted value of the property and the actual value of the property, and backpropagating (432) the diffusion model (66) with the calculated difference to minimize a loss function, wherein the diffusion model (66) is configured to estimate a denoised protein structure corresponding to the given protein sequence in fifteen or fewer denoising steps.
The computerized method according to claim 6, further comprising: identifying the protein sequences having structurally heterogeneous predictions via many-against-many sequence searching, and/or performing the structure-based clustering with a protein structure alignment server.
The computerized method according to claim 6 or claim 7, further comprising: generating the training pairs for the diffusion model by: randomly selecting a predicted protein structure from a randomly selected cluster of predicted protein structures, and pairing the randomly selected predicted protein structure with a protein sequence that corresponds to a predicted protein structure having a highest predicted local distance difference test value from within the randomly selected cluster.
The computerized method according to any one of claims 6 to 8, wherein the property of the distribution of the plurality of sampled structures is one of a class, a value, and a tensor.
The computerized method according to claim 9, wherein the property is a value of a free energy difference between different metastable states or a mean value of a distance between two amino acids in a three-dimensional structure of the protein structure, and optionally, wherein the value of the free energy difference includes a value of a free energy difference between folded and unfolded states of the protein structure.
The computerized method according to any one of claims 6 to 10, further comprising: when the molecular dynamics simulation does not reach equilibrium within a simulation timeframe, performing a re-weighting procedure over protein structures in the molecular dynamics simulation to approximate an equilibrium distribution, and during the second training phase, sampling the corresponding training protein structure from the molecular dynamics simulation with a probability according to the re-weighted protein structures.
A computing system (10) for predicting protein structure ensembles, comprising: a computing device (14) including processing circuitry (18) configured to execute instructions (28) using portions of associated memory (22) to implement a protein structure ensemble prediction model (30), the processing circuitry (18) being configured to: in an inference phase, receive an input protein sequence (32); perform a search for protein sequence data (46) based on the input protein sequence (32); identify and retrieve a subset of the protein sequence data (46) having similarity to the input protein sequence (32); perform a multiple sequence alignment between the input protein sequence (32) and the subset of the protein sequence data (46) to produce multiple sequence alignment data (52); encode data from the multiple sequence alignment, the encoded data (62) including single representations (62A) corresponding to the multiple sequence alignment data (52); and input the encoded data (62) into a denoising diffusion model (66) to predict molecular properties and structural features of the input protein sequence (32).
The computing system according to claim 12, wherein the processing circuitry is further configured to: perform a search for protein structure data based on the input protein sequence; identify and retrieve candidate protein structure data for candidates having a sequence-structure relationship with the input protein sequence; pair the input protein sequence and candidate protein structure data to produce pair data; and encode data from the pairing of the input protein sequence and the candidate protein structure data, the encoded data including pair representations corresponding to the pair data.
The computing system according to claim 13, wherein the multiple sequence alignment data and the pair data from the pairing of the input sequence and the candidate protein structure data are input to a refinement model, and the refinement model outputs a joint latent representation as encoded data, the encoded data including the single representations corresponding to the multiple sequence alignment data and the pair representations corresponding to the pair data.
The computing system according to any one of claims 12 to 14, wherein the multiple sequence alignment between the input protein sequence and the subset of the protein sequence data is expressed as graph-structured data.
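
As a non-limiting illustration, one step of the second training phase recited in the claims above, including sampling from re-weighted molecular dynamics structures, may be sketched as follows. The sketch assumes that corruption is additive Gaussian noise on coordinates and that the comparison is a mean squared coordinate error; the claims specify neither. The `denoiser` callable stands in for the diffusion model (66), and all function names, parameters, and the coordinate layout are hypothetical.

```python
import random


def sample_frame(frames, weights, rng):
    """Draw one MD structure with probability proportional to its weight."""
    return rng.choices(frames, weights=weights, k=1)[0]


def corrupt(coords, noise_scale, rng):
    """Add Gaussian noise to every coordinate (an assumed form of corruption)."""
    return [tuple(c + rng.gauss(0.0, noise_scale) for c in atom) for atom in coords]


def denoising_loss(predicted, target):
    """Mean squared coordinate error between two structures of equal size."""
    total = 0.0
    count = 0
    for p_atom, t_atom in zip(predicted, target):
        for p, t in zip(p_atom, t_atom):
            total += (p - t) ** 2
            count += 1
    return total / count


def second_phase_step(sequence, frames, weights, denoiser, noise_scale, rng):
    """Sample, corrupt, denoise, and compare for one training example."""
    target = sample_frame(frames, weights, rng)   # re-weighted sampling
    noisy = corrupt(target, noise_scale, rng)     # corrupted structure
    predicted = denoiser(sequence, noisy)         # stand-in for the model
    return denoising_loss(predicted, target)


if __name__ == "__main__":
    rng = random.Random(0)
    frames = [
        [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
        [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    ]
    weights = [0.2, 0.8]  # hypothetical re-weighting factors

    def identity(seq, noisy):
        # A denoiser that returns its input is only as good as the noise is small.
        return noisy

    print(second_phase_step("AG", frames, weights, identity, 0.5, rng))
```

When the simulation has reached equilibrium, uniform weights would reduce `sample_frame` to ordinary uniform sampling.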

Description

BACKGROUND

Biomolecules, such as proteins and ribonucleic acids (RNA), are fundamental to gene expression, cellular functions, and biological processes. The ability to predict and manipulate different three-dimensional (3D) structures that biomolecules adopt and switch between, and the affinity with which they bind to other molecules, is of fundamental importance for advancing biological research, as well as for pharmaceutical and biotechnology industries. However, many biomolecular mechanisms cannot be directly observed via laboratory experiments. While molecular dynamics (MD) simulations can be used for certain molecular property simulations, such as dynamics in the folded protein state, protein folding and conformational changes, and utilized for industrial applications such as drug discovery, such MD simulations require sampling a huge and complex conformational space, thereby resulting in either impractical computational costs or uncontrollable inaccuracies.

SUMMARY

To address the issues discussed herein, a computing system for predicting protein structure ensembles is provided. According to one aspect, a computing system includes processing circuitry configured to execute instructions using portions of associated memory to implement a protein structure ensemble prediction model. In a first training phase, the processing circuitry is configured to ingest a synthetic dataset of protein sequences, identify protein sequences in the synthetic data having structurally heterogeneous predictions, perform structure-based clustering on the protein sequences based on the structurally heterogeneous predictions, filter the clustered protein sequences to remove disordered sequences and clusters having a single representative, generate training pairs for a diffusion model included in the protein structure ensemble prediction model, and train the diffusion model on the training pairs.
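
As a non-limiting illustration, the filtering and training-pair generation of the first training phase may be sketched as follows. The pairing rule follows the claims: a structure is drawn at random from a randomly selected cluster and paired with the sequence of the cluster member having the highest predicted local distance difference test (pLDDT) value. The `PredictedStructure` record, its field names, and the boolean `disordered` flag are hypothetical; the source does not state how disorder is detected.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedStructure:
    sequence: str      # sequence the structure was predicted for
    structure_id: str  # hypothetical handle to the predicted coordinates
    plddt: float       # predicted local distance difference test value
    disordered: bool   # hypothetical flag; detection method is not in the source


def filter_clusters(clusters):
    """Remove disordered structures, then drop clusters with fewer than two left."""
    kept = []
    for cluster in clusters:
        ordered = [s for s in cluster if not s.disordered]
        if len(ordered) > 1:
            kept.append(ordered)
    return kept


def make_training_pair(clusters, rng):
    """Return (sequence, structure_id) for one training example."""
    cluster = rng.choice(clusters)                  # randomly selected cluster
    structure = rng.choice(cluster)                 # randomly selected structure
    best = max(cluster, key=lambda s: s.plddt)      # highest-pLDDT member
    return best.sequence, structure.structure_id


if __name__ == "__main__":
    clusters = [
        [PredictedStructure("AAA", "s1", 90.0, False),
         PredictedStructure("AAB", "s2", 70.0, False)],
        [PredictedStructure("CCC", "s3", 80.0, False)],
    ]
    kept = filter_clusters(clusters)
    print(make_training_pair(kept, random.Random(0)))
```

Because the sequence always comes from the highest-pLDDT member while the structure varies, a single sequence is paired with several alternative structures from its cluster over the course of training.
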
In a second training phase, the processing circuitry is configured to sample a training protein sequence and a corresponding training protein structure from a molecular dynamics simulation, corrupt the corresponding training protein structure, input the training protein sequence and the corrupted version of the corresponding training protein structure into the diffusion model, receive a predicted uncorrupted protein structure corresponding to the input training protein sequence from the diffusion model, and compare the predicted uncorrupted protein structure from the diffusion model to the uncorrupted corresponding training protein structure sampled from the molecular dynamics simulation.

In a third training phase, the processing circuitry is configured to instruct the diffusion model to sample a plurality of structures for a given protein sequence, receive a predicted value for a property of a distribution of the plurality of sampled structures, compare the predicted value of the property from the diffusion model to an actual value of the property, calculate a difference between the predicted value of the property and the actual value of the property, and backpropagate the diffusion model with the calculated difference to minimize a loss function. The diffusion model is configured to estimate a denoised protein structure corresponding to the given protein sequence in fifteen or fewer denoising steps.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a schematic diagram of a computing system for predicting protein structure ensembles, according to one embodiment of the present disclosure.

FIGS. 2A and 2B show an inference phase for a protein structure ensemble prediction model in accordance with the computing system of FIG. 1.

FIGS. 3A to 3C show a training pipeline for a protein structure ensemble prediction model in accordance with the computing system of FIG. 1.

FIGS. 4A to 4C show a flowchart of a computerized method for training a model to predict protein structure ensembles, according to an example implementation of the present disclosure.

FIG. 5 shows an example computing environment according to which the embodiments of the present disclosure may be implemented.

DETAILED DESCRIPTION

Proteins and their complexes constitute the functional building blocks of life and are central to drug discovery and development. They are the workhorses in biotechnological processes such as gene editing, enzymatic catalysis, and the formation of biomaterials. Understanding how proteins work, and how their function is affected by introducing other molecules or changing their sequence,