US-12620460-B2 - Methods and systems for machine-learning based molecule generation and scoring

Abstract

A method for machine learning aided modeling of two interacting structures may include: (a) receiving an input structure comprising an interaction region; (b) generating a plurality of candidate structures using a first differentiable machine learning model; (c) docking one or more candidate structures of the plurality of candidate structures at the interaction region of the input structure using a second differentiable machine learning model to predict a docking geometry; (d) ranking the one or more candidate structures of the plurality of candidate structures docked in (c) using a third differentiable machine learning model to predict a score; and (e) backpropagating the score to (i) the first differentiable machine learning model to update the plurality of candidate structures or (ii) the second differentiable machine learning model to update the docking geometry.
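Steps (b) through (e) can be pictured with a deliberately tiny sketch. It is not the disclosed implementation: the three "models" below are hypothetical scalar functions (`generate`, `dock`, `score`), and the weight `w` and target value `ideal_x` are made-up constants, chosen only so that the chain rule can be written out by hand. The point is that, because every stage is differentiable, the gradient of the score reaches both the generator's input and the docking geometry.

```python
# Toy stand-ins for the three differentiable models in steps (b)-(d).
# Everything here is hypothetical and scalar-valued for readability.

def generate(z, w=2.0):
    """(b) 'Generator': maps a latent value z to a candidate descriptor x."""
    return w * z

def dock(x, pose):
    """(c) 'Docking model': mismatch between the candidate and a pose."""
    return pose - x

def score(x, mismatch, ideal_x=1.0):
    """(d) 'Scorer': higher is better; penalizes a poor candidate and a poor pose."""
    return -(x - ideal_x) ** 2 - mismatch ** 2

def gradients(z, pose, w=2.0, ideal_x=1.0):
    """(e) Backpropagate the score to the latent z and to the pose."""
    x = generate(z, w)
    m = dock(x, pose)
    ds_dx = -2.0 * (x - ideal_x)   # direct dependence of the score on x
    ds_dm = -2.0 * m               # dependence of the score on the mismatch
    # m = pose - x, so dm/dx = -1 and dm/dpose = +1; x = w * z, so dx/dz = w.
    ds_dz = (ds_dx + ds_dm * (-1.0)) * w
    ds_dpose = ds_dm * 1.0
    return ds_dz, ds_dpose

def optimize(z=0.0, pose=0.0, lr=0.05, steps=300):
    """Gradient ascent on the score with respect to both z and pose."""
    for _ in range(steps):
        gz, gp = gradients(z, pose)
        z += lr * gz       # (e)(i): update the generated candidate
        pose += lr * gp    # (e)(ii): update the docking geometry
    return z, pose
```

In this toy the score is maximized when the candidate descriptor equals `ideal_x` and the pose equals the candidate, so `optimize()` should drive both toward 1.0. A real system would replace the hand-written derivatives with automatic differentiation through learned models.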

Inventors

Kevin RYCZKO
Amit Kadan
Takeshi Yamazaki

Assignees

GOOD CHEMISTRY INC.

Dates

Publication Date: 2026-05-05
Application Date: 2024-09-19

Claims (20)

1. A method performed by one or more computers for training a latent vector of a generative machine learning model by a machine learning training technique, the method comprising: identifying one or more ligands for binding to a target molecule using the generative machine learning model by performing operations comprising: obtaining, at each of a plurality of optimization steps, a target structure defining a structure of at least a portion of the target molecule and an initial latent vector; processing, at each of the plurality of optimization steps, a model input comprising the target structure and the initial latent vector using the generative machine learning model to generate a predicted ligand structure of a ligand that is predicted to bind to the target molecule, comprising: iteratively modifying the initial latent vector in accordance with trained values of a set of generative machine learning model parameters to generate an output that defines the predicted ligand structure of the ligand; processing, at each of the plurality of optimization steps, a model input comprising data characterizing the predicted ligand structure using a scoring model to generate a predicted measure of affinity between the target molecule and the ligand having the predicted ligand structure; generating, at each of the plurality of optimization steps, a new latent vector, comprising: training the initial latent vector by the machine learning training technique by backpropagating gradients of the predicted measure of binding affinity through the scoring model and the generative machine learning model and into the initial latent vector; and providing, at each of the plurality of optimization steps, the new latent vector for processing by the generative machine learning model as a new initial latent vector; and outputting data defining the one or more ligands.
2. The method of claim 1, wherein at each optimization step, the iteratively modified versions of the initial latent vector that are generated by the generative machine learning model define noisy ligand structures.
3. The method of claim 1, wherein the structure of the portion of the target molecule is generated using a structure prediction machine learning model.
4. The method of claim 1, wherein the structure of at least the portion of the target molecule comprises a structure of an interaction region of the target molecule.
5. The method of claim 1, wherein the target molecule is a protein molecule.
6. The method of claim 1, wherein iteratively modifying the initial latent vector in accordance with trained values of the set of generative machine learning model parameters comprises denoising the initial latent vector.
7. The method of claim 6, wherein the denoising comprises reverse diffusing the initial latent vector.
8. The method of claim 6, wherein the target structure is fixed during the denoising.
9. The method of claim 6, wherein the target structure is movable during the denoising.
10. The method of claim 1, further comprising: processing, at each of a plurality of optimization steps, a model input comprising data characterizing the predicted ligand structure using an additional scoring model to generate a predicted measure of synthetic accessibility of the ligand; and wherein generating the new latent vector further comprises training the initial latent vector by the machine learning training technique by backpropagating gradients of the predicted measure of synthetic accessibility through the scoring model and the generative machine learning model and into the initial latent vector at each of the plurality of optimization steps.
11. The method of claim 1, further comprising: processing, at each of a plurality of optimization steps, a model input comprising data characterizing the predicted ligand structure using an additional scoring model to generate a predicted measure of feasibility of the ligand; and wherein generating the new latent vector further comprises training the initial latent vector by the machine learning training technique by backpropagating gradients of the predicted measure of feasibility through the scoring model and the generative machine learning model and into the initial latent vector at each of the plurality of optimization steps.
12. The method of claim 1, wherein the target molecule is a protein and wherein one or more of the ligands are active pharmaceutical compounds.
13. A computer-implemented system comprising: a digital processing device comprising: at least one processor, an operating system configured to perform executable instructions, a memory, and a computer program including instructions executable by the digital processing device to perform operations for training a latent vector of a generative machine learning model by a machine learning training technique, comprising: identifying one or more ligands for binding to a target molecule using the generative machine learning model by performing operations comprising: obtaining, at each of a plurality of optimization steps, a target structure defining a structure of at least a portion of the target molecule and an initial latent vector; processing, at each of the plurality of optimization steps, a model input comprising the target structure and the initial latent vector using the generative machine learning model to generate a predicted ligand structure of a ligand that is predicted to bind to the target molecule, comprising: iteratively modifying the initial latent vector in accordance with trained values of a set of generative machine learning model parameters to generate an output that defines the predicted ligand structure of the ligand; processing, at each of the plurality of optimization steps, a model input comprising data characterizing the predicted ligand structure using a scoring model to generate a predicted measure of affinity between the target molecule and the ligand having the predicted ligand structure; generating, at each of the plurality of optimization steps, a new latent vector, comprising: training the initial latent vector by the machine learning training technique by backpropagating gradients of the predicted measure of binding affinity through the scoring model and the generative machine learning model and into the initial latent vector; and providing, at each of the plurality of optimization steps, the new latent vector for processing by the generative machine learning model as a new initial latent vector; and outputting data defining the one or more ligands.
14. The computer-implemented system of claim 13, wherein at each optimization step, the iteratively modified versions of the initial latent vector that are generated by the generative machine learning model define noisy ligand structures.
15. The computer-implemented system of claim 13, wherein the structure of the portion of the target molecule is generated using a structure prediction machine learning model.
16. The computer-implemented system of claim 13, wherein the structure of at least the portion of the target molecule comprises a structure of an interaction region of the target molecule.
17. The computer-implemented system of claim 13, wherein the target molecule is a protein molecule.
18. The computer-implemented system of claim 13, wherein iteratively modifying the initial latent vector in accordance with trained values of the set of generative machine learning model parameters comprises denoising the initial latent vector.
19. The computer-implemented system of claim 18, wherein the denoising comprises reverse diffusing the initial latent vector.
20. The computer-implemented system of claim 18, wherein the target structure is fixed during the denoising.
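
The optimization loop recited in claims 1 and 13 can be illustrated with a minimal sketch, under loudly stated assumptions. The `denoise` function is a hypothetical linear stand-in for the generative model (its constant `alpha` plays the role of the trained parameters, and the target conditioning is held fixed as in claims 8 and 20). The `affinity` function is a hypothetical quadratic stand-in for the scoring model. Neither reflects the actual models of the disclosure. What the sketch does show is the claimed data flow: the model parameters stay frozen, the score gradient is backpropagated through the scorer and through every denoising step into the initial latent vector, and the updated latent becomes the initial latent of the next optimization step.

```python
def denoise(z, target, steps=5, alpha=0.8):
    """Toy 'generative model': iteratively modifies the latent z, pulling it
    toward the (fixed) target conditioning. alpha stands in for trained
    parameters; intermediate values of x are the 'noisy' structures."""
    x = list(z)
    for _ in range(steps):
        x = [alpha * xi + (1.0 - alpha) * ti for xi, ti in zip(x, target)]
    return x

def affinity(x, ideal):
    """Toy differentiable 'scoring model': higher means better predicted binding."""
    return -sum((xi - ii) ** 2 for xi, ii in zip(x, ideal))

def grad_wrt_latent(z, target, ideal, steps=5, alpha=0.8):
    """Backpropagate d(affinity)/d(output) through every denoising step into z."""
    x = denoise(z, target, steps, alpha)
    grad = [-2.0 * (xi - ii) for xi, ii in zip(x, ideal)]  # through the scorer
    for _ in range(steps):                                  # through the generator
        grad = [alpha * g for g in grad]                    # d x_{t+1} / d x_t = alpha
    return grad

def optimize_latent(z0, target, ideal, n_opt=100, lr=1.0):
    """Train the latent vector only; alpha (the 'model weights') never changes."""
    z = list(z0)
    for _ in range(n_opt):
        g = grad_wrt_latent(z, target, ideal)
        # The new latent vector is fed back as the next initial latent vector.
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z, denoise(z, target)
```

A hypothetical call such as `optimize_latent([0.0, 0.0], [1.0, -1.0], [2.0, 0.5])` returns the trained latent together with the structure it decodes to. In this toy the decoded structure approaches the `ideal` vector that the scorer rewards, even though the generator alone would pull it toward `target`.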

Description

CROSS-REFERENCE

This application is a continuation of International Application No. PCT/IB2024/056174, filed Jun. 25, 2024, which claims the benefit of U.S. Provisional Application No. 63/510,422, filed Jun. 27, 2023, and U.S. Provisional Application No. 63/648,851, filed May 17, 2024, each of which is incorporated herein by reference in its entirety.

BACKGROUND

Computational chemistry has become an established tool for the molecular and material discovery process in many areas of industry. Computational chemistry can provide accurate prediction of chemical phenomena and examination of molecular properties that may be inaccessible from experiment alone or that may require significant labor to measure. In an example application, computer-aided drug discovery (CADD) has the potential to be faster and less expensive than a laboratory-based drug discovery process.

Structure-based drug design (SBDD) paradigms can involve designing ligands with high binding affinities for a given 3-dimensional protein pocket. SBDD can involve finding a solution to an inverse design problem, where the desired properties (e.g., high binding affinity to a target protein, synthesizability, etc.) are known, but the design of a molecule with the desired properties is non-trivial. SBDD can comprise two steps. One step can be the sampling of a chemical space, and the other step can be scoring (or evaluating) the ability of the sampled compounds to satisfy the set of desired properties.

The sampling of chemical space can be performed in various ways. For drug discovery, for example, this can be performed by evaluating each entry of a large database of molecules (such as ZINC, Enamine, or GDB), collecting and ranking the results, and producing a shortlist of compounds to be screened in a laboratory. Although these databases can contain hundreds of billions of molecules, this is still a very small fraction of the drug-like chemical space, which is estimated to number anywhere between 10^20 and 10^60 molecules.

SUMMARY

Computer-aided design can accelerate drug discovery. Recent advances in scalable computing and generative chemistry have led to deep learning models that access uncharted chemical space for creating novel drug compounds. However, existing models may be limited in designing molecules that satisfy multiple desired physicochemical properties. In an aspect, the present disclosure provides a method that combines generative modeling (e.g., a diffusion model) with multi-objective optimization. In some embodiments, the latent variables of a generative model are guided to generate ligands while optimizing for a plurality of target properties. In some embodiments, the plurality of target properties can comprise affinity (e.g., binding affinity to a protein molecule of interest) and synthetic accessibility.

In a CADD method, the larger the chemical space that is explored, the higher the chance of discovering better materials. However, considering that the synthesizable chemical space is estimated to be on the order of 10^180, the scale of the problem is massive in terms of both time and computational cost. To expand the chemical space searched in CADD, machine learning (ML) approaches may be used to perform this exploration while managing computational cost.
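
Guiding latent variables toward several target properties at once can be sketched as gradient ascent on a weighted sum of differentiable scores. The example below is an illustration only and makes several assumptions that are not taken from the disclosure: `decode` is a hypothetical tanh "generator", `affinity` and `accessibility` are hypothetical quadratic proxies, and the weights `w_aff` and `w_sa` are arbitrary. The weighted-sum combination is itself an assumption; other multi-objective schemes are possible.

```python
import math

def decode(z):
    """Toy 'generator': squashes each latent coordinate into (-1, 1)."""
    return [math.tanh(zi) for zi in z]

def affinity(x, pocket):
    """Toy affinity proxy: best when the descriptor matches the pocket."""
    return -sum((xi - pi) ** 2 for xi, pi in zip(x, pocket))

def accessibility(x):
    """Toy synthetic-accessibility proxy: smaller descriptors score higher."""
    return -sum(xi ** 2 for xi in x)

def combined(z, pocket, w_aff=1.0, w_sa=0.5):
    """Weighted sum of the two objectives, evaluated on the decoded latent."""
    x = decode(z)
    return w_aff * affinity(x, pocket) + w_sa * accessibility(x)

def combined_grad(z, pocket, w_aff=1.0, w_sa=0.5):
    """Gradient of the weighted sum with respect to the latent z."""
    grads = []
    for xi, pi in zip(decode(z), pocket):
        d_dx = w_aff * (-2.0 * (xi - pi)) + w_sa * (-2.0 * xi)
        grads.append(d_dx * (1.0 - xi ** 2))  # d tanh(z)/dz = 1 - tanh(z)^2
    return grads

def guide(z0, pocket, w_aff=1.0, w_sa=0.5, lr=0.1, steps=500):
    """Gradient ascent on the latent; the 'generator' itself is never retrained."""
    z = list(z0)
    for _ in range(steps):
        g = combined_grad(z, pocket, w_aff, w_sa)
        z = [zi + lr * gi for zi, gi in zip(z, g)]
    return z
```

For these toy objectives the optimum has a closed form, x = w_aff * pocket / (w_aff + w_sa) per coordinate, which makes the trade-off visible: raising `w_sa` pulls the decoded descriptor away from the pocket-matching value and toward the "more accessible" origin.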
In an aspect, the present disclosure provides a method for machine learning aided modeling of two interacting structures, the method comprising: (a) receiving an input structure comprising an interaction region; (b) generating a plurality of candidate structures using a first differentiable machine learning model; (c) docking one or more candidate structures of the plurality of candidate structures at the interaction region of the input structure using a second differentiable machine learning model to predict a docking geometry; (d) ranking the one or more candidate structures of the plurality of candidate structures docked in (c) using a third differentiable machine learning model or differentiable scoring function to predict a score; and (e) propagating the score to (i) the first differentiable machine learning model to update the plurality of candidate structures or (ii) the second differentiable machine learning model to update the docking geometry. In some embodiments, the method further comprises outputting a list of the plurality of candidates updated in (e). In some embodiments, the input structure is a host molecule, and the plurality of candidate structures comprises a guest molecule. In some embodiments, the input structure is a macromolecule or a biomolecule, wherein the plurality of candidate structures comprises a ligand, and wherein the interaction region is an active site. In some embodiments, the macromolecule or the biomolecule is a protein. In some embodiments, the protein is an enzyme. In some embodiments, the macromolecule or the biomolecule is a protein, and the ligand is selected from the group consisting of another protein, a neurotransmitter, a toxin, a neurope