CN-122029608-A - Conditional production of protein sequences


Abstract

A computing system (10) for conditional generation of protein sequences includes a processing circuit (18) that implements a denoising diffusion probabilistic model (26). In an inference phase, the processing circuit (18) receives an instruction (32) to generate a predicted protein sequence (64) having a target function, the instruction (32) including first condition information (34) and second condition information (36). The processing circuit (18) concatenates a first condition information embedding (40) generated by a first encoder (38) and a second condition information embedding (44) generated by a second encoder (42) to produce a concatenated condition information embedding (46). The processing circuit (18) samples noise from a distribution function (52) and combines the concatenated condition information embedding (46) with the sampled noise to produce a noisy concatenated input (56). The processing circuit (18) inputs the noisy concatenated input (56) to a denoising neural network (58) to generate a predicted sequence embedding (60), inputs the predicted sequence embedding (60) to a decoding neural network (62) to generate the predicted protein sequence (64), and outputs the predicted protein sequence (64).
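
The inference flow summarized in the abstract can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the patent: the toy encoders `encode_text` and `encode_structure`, the untrained linear map returned by `make_denoiser`, the argmax `decode` function, the embedding sizes, and the choice to combine the condition with the noise by per-position concatenation. A standard normal distribution is used for the noise, which the claims name as one option.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
EMB_DIM = len(AMINO_ACIDS)            # toy embedding width per residue


def encode_text(text, dim=4):
    """Hypothetical first encoder: maps text to a deterministic vector."""
    rng = random.Random(text)  # a str seed gives the same vector on every run
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def encode_structure(secondary_structure):
    """Hypothetical second encoder: helix/strand/coil fractions."""
    n = max(len(secondary_structure), 1)
    return [secondary_structure.count(c) / n for c in "HEC"]


def make_denoiser(in_dim, seed=0):
    """Untrained linear map standing in for the denoising neural network."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(EMB_DIM)]

    def denoise(noisy_input):
        # One output vector of width EMB_DIM per sequence position.
        return [[sum(w * x for w, x in zip(w_row, row)) for w_row in weights]
                for row in noisy_input]

    return denoise


def decode(seq_embedding):
    """Stand-in decoder: picks the highest-scoring residue per position."""
    return "".join(
        AMINO_ACIDS[max(range(EMB_DIM), key=lambda j: row[j])]
        for row in seq_embedding
    )


def generate(text_condition, structure_condition, seq_len=12, seed=0):
    rng = random.Random(seed)
    # Concatenate the two condition information embeddings.
    condition = encode_text(text_condition) + encode_structure(structure_condition)
    # Sample a noise embedding from a standard normal distribution.
    noise = [[rng.gauss(0.0, 1.0) for _ in range(EMB_DIM)] for _ in range(seq_len)]
    # Combine condition and noise (assumed here: per-position concatenation).
    noisy_input = [condition + row for row in noise]
    # Denoise to a predicted sequence embedding, then decode to residues.
    denoise = make_denoiser(in_dim=len(condition) + EMB_DIM)
    predicted_embedding = denoise(noisy_input)
    return decode(predicted_embedding)


sequence = generate("binds zinc", "HHHHEEEECCCC")
print(sequence)
```

Because the stand-in denoiser is untrained, the printed sequence has no biological meaning; the sketch only traces the order of the data-flow steps.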

Inventors

B. J. Whitman
E. J. Howitz
R. V. Kudry

Assignees

Microsoft Technology Licensing, LLC

Dates

Publication Date: 20260512
Application Date: 20240924
Priority Date: 20231026

Claims (20)

1. A computing system (10) for conditional generation of a protein sequence, the computing system (10) comprising a processing circuit (18) that executes instructions using a portion of an associated memory (22) to implement a denoising diffusion probabilistic model (26), wherein, in an inference phase, the processing circuit (18) is configured to: receive an instruction (32) to generate a predicted protein sequence (64) having a target function, the instruction (32) comprising first condition information (34) and second condition information (36) associated with the target function of the predicted protein sequence (64); concatenate a first condition information embedding (40) generated by a first encoder (38) and a second condition information embedding (44) generated by a second encoder (42) to produce a concatenated condition information embedding (46), the first condition information embedding (40) representing the first condition information (34) and the second condition information embedding (44) representing the second condition information (36); sample a noise embedding (54) from a distribution function (52); combine the concatenated condition information embedding (46) with the sampled noise embedding (54) to produce a noisy concatenated input (56); input the noisy concatenated input (56) to a denoising neural network (58) such that the denoising neural network (58) generates a predicted sequence embedding (60); input the predicted sequence embedding (60) to a decoding neural network (62) to generate the predicted protein sequence (64) based on the input predicted sequence embedding (60); and output the predicted protein sequence (64).
2. The computing system of claim 1, wherein the processing circuit inputs the predicted sequence embedding into a denoising loop in which, before the predicted sequence embedding is input to the decoding neural network: noise is added to the predicted sequence embedding to generate a noisy predicted sequence embedding; the noisy predicted sequence embedding is concatenated with the condition information embedding to generate a new noisy concatenated input; and the new noisy concatenated input is input to the denoising neural network to generate a new predicted sequence embedding.
3. The computing system of claim 2, wherein the denoising loop is repeated a predetermined number of times to generate a final predicted sequence embedding; the final predicted sequence embedding is input to the decoding neural network to generate a final predicted protein sequence; and the final predicted protein sequence is output.
4. The computing system of any one of claims 1 to 3, wherein, before the predicted sequence embedding is input to the decoding neural network to generate the predicted protein sequence, the predicted protein sequence is input to a sequence encoder to generate a clamped predicted sequence embedding; and the processing circuit inputs the clamped predicted sequence embedding into a denoising loop in which: noise is added to the clamped predicted sequence embedding to generate a noisy clamped predicted sequence embedding; the noisy clamped predicted sequence embedding is concatenated with the condition information embedding to generate a new noisy clamped concatenated input; and the new noisy clamped concatenated input is input to the denoising neural network to generate a new predicted sequence embedding.
5. The computing system of claim 4, wherein the denoising loop is repeated a predetermined number of times to generate a final predicted sequence embedding; the final predicted sequence embedding is input to the decoding neural network to generate a final predicted protein sequence; and the final predicted protein sequence is output.
6. The computing system of any one of claims 1 to 5, wherein the first condition information is selected from the group consisting of protein structure information, text information, chemical reaction information, and metadata associated with the input protein sequence; and the second condition information is different from the first condition information and is selected from the group consisting of protein structure information, text information, chemical reaction information, and metadata associated with the input protein sequence.
7. The computing system of any one of claims 1 to 6, wherein, during a training phase, the processing circuit is configured to: receive a text instruction to generate a predicted protein sequence, the text instruction comprising a training protein sequence and training condition information from two or more condition information categories; input the training protein sequence to a sequence encoder to convert the training protein sequence into a training sequence embedding; add noise sampled from the distribution function to the training sequence embedding to produce a noisy training sequence embedding; encode the training condition information via respective encoders to generate two or more training condition information embeddings; input the two or more training condition information embeddings to a feedforward neural network to generate a training condition information embedding; combine the training condition information embedding output from the feedforward neural network with the noisy training sequence embedding to produce a noisy training input; input the noisy training input to the denoising neural network to generate a predicted training sequence embedding; and input the predicted training sequence embedding to the sequence decoding neural network to generate a predicted training protein sequence based on the training protein sequence and the training condition information.
8. The computing system of claim 2, wherein a trained ranking model calculates a probability that the noisy predicted sequence embedding corresponds to a protein sequence having a higher level of the target function than another protein sequence embedding.
9. A method (500) for conditionally generating a protein sequence, the method (500) comprising, in an inference phase: receiving an instruction (32) to generate a predicted protein sequence (64) having a target function, the instruction (32) comprising first condition information (34) and second condition information (36) associated with the target function of the predicted protein sequence (64); concatenating a first condition information embedding (40) generated by a first encoder (38) and a second condition information embedding (44) generated by a second encoder (42) to produce a concatenated condition information embedding (46), the first condition information embedding (40) representing the first condition information (34) and the second condition information embedding (44) representing the second condition information (36); sampling a noise embedding (54) from a distribution function (52); combining the concatenated condition information embedding (46) with the sampled noise embedding (54) to produce a noisy concatenated input (56); inputting the noisy concatenated input (56) to a denoising neural network (58) such that the denoising neural network (58) generates a predicted sequence embedding (60); inputting the predicted sequence embedding (60) to a sequence decoding neural network (62) to generate the predicted protein sequence (64) based on the input predicted sequence embedding (60); and outputting the predicted protein sequence (64).
10. The method of claim 9, further comprising, before inputting the predicted sequence embedding to the decoding neural network, in a denoising loop: adding noise to the predicted sequence embedding to generate a noisy predicted sequence embedding; concatenating the noisy predicted sequence embedding with the condition information embedding to generate a new noisy concatenated input; and inputting the new noisy concatenated input to the denoising neural network to generate a new predicted sequence embedding.
11. The method of claim 10, further comprising: repeating the denoising loop a predetermined number of times to generate a final predicted sequence embedding; inputting the final predicted sequence embedding to the decoding neural network to generate a final predicted protein sequence; and outputting the final predicted protein sequence.
12. The method of any one of claims 9 to 11, further comprising: inputting the predicted protein sequence to a sequence encoder to generate a clamped predicted sequence embedding prior to inputting the predicted sequence embedding to the decoding neural network; and inputting the clamped predicted sequence embedding into a denoising loop, the denoising loop comprising: adding noise to the clamped predicted sequence embedding to generate a noisy clamped predicted sequence embedding; concatenating the noisy clamped predicted sequence embedding with the condition information embedding to generate a new noisy clamped concatenated input; and inputting the new noisy clamped concatenated input to the denoising neural network to generate a new predicted sequence embedding.
13. The method of claim 12, further comprising: repeating the denoising loop a predetermined number of times to generate a final predicted sequence embedding; inputting the final predicted sequence embedding to the decoding neural network to generate a final predicted protein sequence; and outputting the final predicted protein sequence.
14. The method of any one of claims 9 to 13, further comprising: selecting the first condition information from the group consisting of protein structure information, text information, chemical reaction information, and metadata associated with the input protein sequence; and selecting the second condition information from the group consisting of protein structure information, text information, chemical reaction information, and metadata associated with the target function of the input protein sequence, wherein the second condition information is different from the first condition information.
15. The method of any one of claims 9 to 14, further comprising, during a training phase: inputting a training protein sequence to a sequence encoder to convert the training protein sequence from raw text data into a training sequence embedding; adding noise sampled from a standard normal distribution to the training sequence embedding to produce a noisy training sequence embedding; encoding training condition information from two or more condition information categories via respective encoders to generate two or more training condition information embeddings; inputting the two or more training condition information embeddings to a feedforward neural network to generate a training condition information embedding; combining the training condition information embedding output from the feedforward neural network with the noisy training sequence embedding to produce a noisy training input; inputting the noisy training input to the denoising neural network to generate a predicted training sequence embedding; and inputting the predicted training sequence embedding to the sequence decoding neural network to generate a predicted training protein sequence based on the training protein sequence and the training condition information.
16. The computing system of any one of claims 1 to 8, wherein the distribution function is a standard normal distribution.
17. The computing system of any one of claims 1 to 8 and 16, wherein the denoising neural network includes a temperature hyperparameter and a predicted local distance difference test that is used as a reward function to change the sensitivity of the temperature hyperparameter.
18. The method of any one of claims 9 to 15, wherein the distribution function is a standard normal distribution.
19. The method of any one of claims 9 to 15 and 17, further comprising: including a temperature hyperparameter and a predicted local distance difference test in the denoising neural network, the predicted local distance difference test serving as a reward function to alter the sensitivity of the temperature hyperparameter.
20. A computing system (10) for conditional generation of a protein sequence, the computing system (10) comprising a processing circuit (18) that executes instructions using a portion of an associated memory (22) to implement a denoising diffusion probabilistic model (26), wherein, in an inference phase, the processing circuit (18) is configured to: receive an instruction (32) to implement a de novo protein design method to generate a predicted protein sequence (64) having a target function, the instruction (32) comprising first condition information (34) and second condition information (36) for the predicted protein sequence (64); encode the first condition information (34) for the protein sequence of interest using a first encoder (38) to produce a first condition information embedding (40); encode the second condition information (36) for the input protein sequence using a second encoder (42) to produce a second condition information embedding (44); concatenate the first condition information embedding (40) and the second condition information embedding (44) to produce a condition information embedding (46); sample a noise embedding (54) from a standard normal distribution (52); combine the condition information embedding (46) with the sampled noise embedding (54) to produce a noisy concatenated input (56); input the noisy concatenated input (56) to a denoising neural network (58) such that the denoising neural network (58) generates a predicted sequence embedding (60); input the predicted sequence embedding (60) to a decoding neural network (62) to generate the predicted protein sequence (64) based on the input predicted sequence embedding (60); input the predicted protein sequence (64B) to a sequence encoder (74) to generate a clamped predicted sequence embedding (128); input the clamped predicted sequence embedding (128) into a denoising loop in which: noise (122) is added to the clamped predicted sequence embedding (60A) to generate a noisy clamped predicted sequence embedding (124A); the noisy clamped predicted sequence embedding (124B) is concatenated with the condition information embedding (46) to generate a new noisy clamped concatenated input (126A); and the new noisy clamped concatenated input (126A) is input to the denoising neural network (58) to generate a new predicted sequence embedding (60B); repeat the denoising loop a predetermined number of times to generate a final predicted sequence embedding (60N); input the final predicted sequence embedding (60N) to a sequence decoding neural network (62) to generate a final predicted protein sequence (64A); and output the final predicted protein sequence (64A).
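
The denoising loop with clamping recited in the claims above can be sketched in Python as follows. This is an illustrative reading only. The one-hot `sequence_encoder`, the argmax `decoder`, the toy `denoiser`, the linearly decreasing noise scale, and the decision to clamp on every iteration are all assumptions introduced here; the claims do not fix any of them.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
DIM = len(AMINO_ACIDS)


def sequence_encoder(sequence):
    """Stand-in sequence encoder: one-hot vector per residue."""
    return [[1.0 if aa == a else 0.0 for a in AMINO_ACIDS] for aa in sequence]


def decoder(embedding):
    """Stand-in decoding network: argmax residue per position."""
    return "".join(
        AMINO_ACIDS[max(range(DIM), key=lambda j: row[j])] for row in embedding
    )


def denoiser(noisy_input, cond_dim):
    """Toy denoiser (assumption): strips the condition prefix from each row
    and adds a small condition-dependent bias to the remaining values."""
    out = []
    for row in noisy_input:
        cond, x = row[:cond_dim], row[cond_dim:]
        out.append([v + 0.1 * cond[j % cond_dim] for j, v in enumerate(x)])
    return out


def refine(condition, predicted_embedding, steps=5, seed=0, clamp=True):
    rng = random.Random(seed)
    for step in range(steps):
        if clamp:
            # Clamp: decode to a discrete sequence, then re-encode it.
            predicted_embedding = sequence_encoder(decoder(predicted_embedding))
        # Assumed schedule: the noise scale shrinks to zero on the last step.
        scale = 1.0 - (step + 1) / steps
        noisy = [[v + scale * rng.gauss(0.0, 1.0) for v in row]
                 for row in predicted_embedding]
        # Concatenate the condition embedding with each noisy position.
        noisy_input = [condition + row for row in noisy]
        predicted_embedding = denoiser(noisy_input, len(condition))
    # Decode the final predicted sequence embedding.
    return decoder(predicted_embedding)


condition = [0.2, -0.5, 0.9]              # hypothetical condition embedding
initial = sequence_encoder("MKTAYIAK")    # hypothetical starting prediction
final_sequence = refine(condition, initial, steps=5)
print(final_sequence)
```

Setting `clamp=False` gives the plain loop of claims 2 and 3, in which the predicted sequence embedding is re-noised directly without the decode and re-encode step.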

Description

Conditional production of protein sequences

Background

In the field of computational protein engineering, computer-based techniques have been developed that, given a set of conditions describing the function of a target molecule, identify protein sequences that produce a three-dimensional protein structure with the molecular properties needed to achieve that function. Because molecular properties can have a broad impact on the activity and function of molecules or substrates, tools for predicting optimized protein sequences are of great interest in a variety of fields, including drug design and drug discovery. However, as discussed below, the generation of protein sequences and protein structures still leaves room for improvement, particularly with respect to achieving a particular targeted functional capability of a protein.

Disclosure of Invention

To address the problems discussed herein, computing systems and methods for conditional generation of protein sequences are provided. In one aspect, a computing system includes processing circuitry that executes instructions using portions of associated memory to implement a denoising diffusion probabilistic model. In an inference phase, the processing circuitry is configured to receive instructions to generate a predicted protein sequence having a target function. The instructions include first condition information and second condition information associated with the target function of the predicted protein sequence. A first condition information embedding generated by a first encoder and a second condition information embedding generated by a second encoder are concatenated to produce a concatenated condition information embedding, wherein the first condition information embedding represents the first condition information and the second condition information embedding represents the second condition information.
Noise is sampled from a distribution function and combined with the concatenated condition information embedding to produce a noisy concatenated input. The noisy concatenated input is input to a denoising neural network such that the denoising neural network generates a predicted sequence embedding. The predicted sequence embedding is input to a decoding neural network to generate a predicted protein sequence based on the input predicted sequence embedding, and the predicted protein sequence is output.

This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

Drawings

FIG. 1 illustrates a schematic diagram of a computing system for conditional generation of protein sequences using a denoising diffusion probabilistic model, according to one embodiment of the present disclosure.
FIG. 2 illustrates a schematic diagram of a training phase of the computing system of FIG. 1.
FIG. 3 illustrates a schematic diagram of an inference phase of the computing system of FIG. 1.
FIG. 4 shows a schematic diagram of a denoising cycle that directs the computing system of FIG. 1 to generate an improved predicted protein sequence.
FIG. 5 shows a flowchart of a method for conditionally generating protein sequences according to an example embodiment of the disclosure.
FIG. 6 illustrates a schematic diagram of an example computing environment in which embodiments of the present disclosure may be implemented.

Detailed Description

The field of computational protein engineering spans several decades and includes computational protein design and computational protein optimization.
Given a set of conditions that describe an idealized protein function, the goal of computational protein design is to identify a protein sequence that fulfills that function. The goal of computational protein optimization is to identify protein sequences having improved activity, given a set of conditions describing an idealized protein function and a protein that already performs that function, where "activity" is defined as a quantifiable measure of how well a protein performs the function of interest. In some of the earliest methods, known as "rational design," researchers relied on attempts to model the relationship between protein sequences and their three-dimensional structures to design and optimize proteins for certain functions. While these rational design approaches have been successfully applied to the design of new proteins, rational design tools still have several limitations. Most notably, detailed knowledge about the underlying mechanisms that determine the function of a protein target is required, and such information is oft