WO-2026093340-A1 - SYNTHETIC DATA GENERATION USING MATCHED MOLECULAR PAIRS

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for generating synthetic training examples for training a machine learning model. In one aspect, a method comprises: processing the data identifying a collection of molecules to identify a plurality of matched molecular pairs; processing the plurality of matched molecular pairs to generate data defining a set of transformation functions, wherein each transformation function is defined by at least: (i) a first molecular fragment; (ii) a second molecular fragment; (iii) an inclusion criterion defining a class of molecules; and (iv) a predicted change in the molecular property value resulting from replacing the first molecular fragment by the second molecular fragment in a molecule that satisfies the inclusion criterion; generating a plurality of synthetic training examples for training a machine learning model using the set of transformation functions.

Inventors

RICHARDS, SIMON JAMES
ROSE, HARRY

Assignees

ISOMORPHIC LABS LIMITED

Dates

Publication Date: 20260507
Application Date: 20251028
Priority Date: 20241031

Claims (20)

1. A method performed by one or more computers, the method comprising: obtaining data that identifies a collection of molecules and, for each molecule in the collection of molecules, a molecular property value for the molecule; processing the data identifying the collection of molecules to identify a plurality of matched molecular pairs; processing the plurality of matched molecular pairs to generate data defining a set of transformation functions, wherein each transformation function is defined by at least: (i) a first molecular fragment; (ii) a second molecular fragment; (iii) an inclusion criterion defining a class of molecules; and (iv) a predicted change in the molecular property value resulting from replacing the first molecular fragment by the second molecular fragment in a molecule that satisfies the inclusion criterion; generating a plurality of synthetic training examples for training a machine learning model using the set of transformation functions, wherein the machine learning model is configured to process data characterizing an input molecule to generate a predicted molecular property value for the input molecule; and training the machine learning model using the plurality of synthetic training examples by a machine learning training technique.
2. The method of claim 1, wherein each matched molecular pair includes a pair of molecules comprising: (i) a first molecule, and (ii) a second molecule that differs from the first molecule in that one molecular fragment in the first molecule is replaced by another molecular fragment in the second molecule.
3. The method of any preceding claim, wherein for each transformation function, a candidate molecule satisfies the inclusion criterion for the transformation function if: (i) the candidate molecule includes the first molecular fragment associated with the transformation function; and (ii) a chemical structure of a portion of the candidate molecule that includes the first molecular fragment and all atoms in the candidate molecule that are separated from the first molecular fragment by at most a threshold number of bonds in the candidate molecule matches a target chemical structure associated with the transformation function.
4. The method of claim 3, wherein the threshold number of bonds is at least two bonds.
5. The method of any one of claims 1-2, wherein for each transformation function, a candidate molecule satisfies the inclusion criterion for the transformation function if: (i) the candidate molecule includes the first molecular fragment associated with the transformation function; and (ii) a molecular fingerprint of the candidate molecule excluding the first molecular fragment matches a target molecular fingerprint associated with the transformation function.
6. The method of any preceding claim, wherein processing the plurality of matched molecular pairs to generate data defining the set of transformation functions comprises, for each transformation function: identifying a subset of the plurality of matched molecular pairs for which: (i) the first molecule in each matched molecular pair includes a same first fragment; (ii) the second molecule in each matched molecular pair differs from the corresponding first molecule in that the same first fragment is replaced by a same second fragment; (iii) the first molecule in each matched molecular pair satisfies a same inclusion criterion; and (iv) the subset of the plurality of matched molecular pairs includes at least two matched molecular pairs; and generating the transformation function based on the subset of the plurality of matched molecular pairs.
7. The method of claim 6, wherein processing the plurality of matched molecular pairs to generate data defining the set of transformation functions comprises, for each transformation function: determining, for each matched molecular pair in the subset of the plurality of matched molecular pairs that are used to generate the transformation function, a delta between the molecular property values of: (i) the first molecule, and (ii) the second molecule, in the matched molecular pair; and determining the predicted change in the molecular property value for the transformation function based on a measure of central tendency of the deltas.
8. The method of claim 7, wherein the measure of central tendency is a mean.
9. The method of any one of claims 7-8, wherein processing the plurality of matched molecular pairs to generate data defining the set of transformation functions further comprises, for each transformation function: determining that a measure of dispersion of the deltas does not exceed a maximum threshold.
10. The method of claim 9, wherein the measure of dispersion is a standard deviation.
11. The method of any preceding claim, wherein generating the plurality of synthetic training examples for training the machine learning model using the set of transformation functions comprises, for each of a plurality of original molecules in the collection of molecules: determining, for each transformation function, whether the original molecule satisfies the inclusion criterion of the transformation function; and in response to determining that the original molecule satisfies the inclusion criterion of a transformation function: generating a new molecule by replacing the first molecular fragment with the second molecular fragment in the original molecule; generating a molecular property value for the new molecule as a sum of: (i) the molecular property value for the original molecule, and (ii) the predicted change in the molecular property value that is specified by the transformation function; and generating a synthetic training example that includes: (i) a training input to the machine learning model that characterizes the new molecule, and (ii) a target output of the machine learning model that specifies the molecular property value of the new molecule.
12. The method of any preceding claim, wherein each synthetic training example includes: (i) a training input to the machine learning model that characterizes a molecule, and (ii) a target output of the machine learning model that specifies a target molecular property value of the molecule.
13. The method of claim 12, wherein training the machine learning model using the plurality of synthetic training examples by the machine learning training technique comprises, for each synthetic training example: training the machine learning model to reduce a discrepancy between: (i) a predicted molecular property value generated by processing the training input of the synthetic training example using the machine learning model, and (ii) the target molecular property value specified by the synthetic training example.
14. The method of any preceding claim, wherein the machine learning model comprises a neural network.
15. A computer-implemented method of predicting a molecular property value, comprising: obtaining data characterizing an input molecule; and processing the data characterizing the input molecule using a machine learning model that has been trained by the method of any of claims 1-14, to generate a predicted molecular property value for the input molecule.
16. A method performed by one or more computers, the method comprising: obtaining chemical reaction data that identifies a collection of chemical reactions; processing the chemical reaction data to generate data defining a set of transformation functions, wherein each transformation function is defined by at least: (i) a first molecular fragment, and (ii) a second molecular fragment, and wherein generating each transformation function comprises: determining that, for at least a threshold number of pairs of chemical reactions in the collection of chemical reactions: (a) at least one reactant in a first chemical reaction of the pair of chemical reactions includes the first molecular fragment; and (b) a second set of reactants of a second chemical reaction of the pair of chemical reactions differs from a first set of reactants of the first chemical reaction of the pair of chemical reactions in that each instance of the first molecular fragment in the first set of reactants is replaced by the second molecular fragment in the second set of reactants; generating a plurality of synthetic training examples for training a machine learning model using the set of transformation functions, wherein the machine learning model is configured to process data characterizing an input set of molecules to classify whether the input set of molecules is reactive; and training the machine learning model using the plurality of synthetic training examples by a machine learning training technique.
17. The method of claim 16, wherein each chemical reaction in the collection of chemical reactions further comprises a label defining a reaction type of the reaction, and wherein processing the chemical reaction data to generate data defining the set of transformation functions further comprises, for each reaction type: selecting a subset of the chemical reaction data corresponding with chemical reactions that have the label defining the reaction type; determining one or more transformation functions associated with the reaction type using the subset of the chemical reaction data.
18. The method of any one of claims 16-17, wherein the threshold number of pairs of chemical reactions is at least three.
19. The method of any one of claims 16-18, wherein each synthetic training example comprises: (i) a training input to the machine learning model that characterizes a set of molecules, and (ii) a target output of the machine learning model that identifies the set of molecules as being reactive.
20. The method of any one of claims 16-19, wherein generating the plurality of synthetic training examples for training the machine learning model using the set of transformation functions comprises, for each of a plurality of original chemical reactions in the collection of chemical reactions: determining, for each transformation function, whether the original chemical reaction has at least one reactant that includes the first molecular fragment of the transformation function; and in response to determining that the original chemical reaction has at least one reactant that includes the first molecular fragment of the transformation function: generating a new chemical reaction by replacing each instance of the first molecular fragment in a set of reactants and in a set of products of the original chemical reaction with the second molecular fragment of the transformation function; and generating a synthetic training example that includes: (i) a training input to the machine learning model that characterizes at least the set of reactants of the new chemical reaction, and (ii) a target output of the machine learning model that identifies the set of reactants as being reactive.
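
The reaction-based variant in claims 16-20 can be illustrated with a minimal sketch. It is not the applicant's implementation. It assumes a toy representation in which each molecule is a plain string, a fragment is a bracketed token such as "[Me]", and fragment replacement is plain substring replacement; a real system would operate on molecular graphs. The names swap, derive_reaction_transforms, synthesize_reactions, and example_reactions, and the explicit list of candidate fragments, are hypothetical.

```python
from itertools import permutations

def swap(molecules, first, second):
    """Replace every instance of `first` with `second` (toy substring swap)."""
    return tuple(m.replace(first, second) for m in molecules)

def derive_reaction_transforms(reactions, fragments, min_pairs=3):
    """Keep (first, second) when at least `min_pairs` reaction pairs support it.

    A pair is supported when one reaction has a reactant containing `first`
    and another reaction has the same reactants with `first` swapped for
    `second`. Each supporting first reaction is counted once (a simplification).
    """
    reactant_sets = {frozenset(reactants) for reactants, _ in reactions}
    transforms = []
    for first, second in permutations(fragments, 2):
        count = 0
        for reactants, _ in reactions:
            if any(first in m for m in reactants):
                if frozenset(swap(reactants, first, second)) in reactant_sets:
                    count += 1
        if count >= min_pairs:
            transforms.append((first, second))
    return transforms

def synthesize_reactions(reactions, transforms):
    """Apply each transform to reactants and products; label the result reactive."""
    known = {frozenset(reactants) for reactants, _ in reactions}
    synthetic = []
    for reactants, products in reactions:
        for first, second in transforms:
            if any(first in m for m in reactants):
                new_reactants = swap(reactants, first, second)
                # Skipping already-observed reactant sets is an assumption
                # of this sketch, not something the claims require.
                if frozenset(new_reactants) not in known:
                    new_products = swap(products, first, second)
                    synthetic.append((new_reactants, new_products, True))
    return synthetic

# Hypothetical toy data: (reactants, products) pairs.
example_reactions = [
    (("A[Me]", "X"), ("AX[Me]",)),
    (("A[Et]", "X"), ("AX[Et]",)),
    (("B[Me]", "X"), ("BX[Me]",)),
    (("B[Et]", "X"), ("BX[Et]",)),
    (("C[Me]", "X"), ("CX[Me]",)),
    (("C[Et]", "X"), ("CX[Et]",)),
    (("D[Me]", "Y"), ("DY[Me]",)),
]

if __name__ == "__main__":
    found = derive_reaction_transforms(example_reactions, ["[Me]", "[Et]"])
    for example in synthesize_reactions(example_reactions, found):
        print(example)
```

In this toy data, three reaction pairs support the "[Me]" to "[Et]" swap, which meets the threshold of claim 18, so the one reaction without an observed "[Et]" counterpart yields a single synthetic example labelled reactive.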

Description

Isomorphic Labs Limited F&R Ref.: 53672-0030W01 PCT Application

SYNTHETIC DATA GENERATION USING MATCHED MOLECULAR PAIRS

BACKGROUND

[0001] This specification relates to processing data using machine learning models.

[0002] Machine learning models receive an input and generate an output, e.g., a predicted output, based on the received input. Some machine learning models are parametric models and generate the output based on the received input and on values of the parameters of the model.

[0003] Some machine learning models are deep models that employ multiple layers of models to generate an output for a received input. For example, a deep neural network is a deep machine learning model that includes an output layer and one or more hidden layers that each apply a non-linear transformation to a received input to generate an output.

SUMMARY

[0004] This specification describes a system implemented as computer programs on one or more computers in one or more locations that can generate a dataset of synthetic training examples including a chemical structure and one or more molecular properties of the chemical structure, using matched molecular pairs. In this specification, a synthetic training example refers to a training example that is generated computationally rather than experimentally, e.g., an example that is predicted using computational techniques.

[0005] In this specification, a matched molecular pair refers to a pair of molecules that differ by a transformation, e.g., a structural modification. For example, the structural modification can involve a change to an atom, functional group, or substituent, e.g., an atom or group of atoms attached to the core structure of a molecule. As another example, the structural modification can involve a change to one or more atoms of the core structure of a molecule. Matched molecular pairs can be useful for determining the impact of a structural change to one or more molecular properties. As an example, the one or more molecular properties can be ADMET properties, e.g., absorption, distribution, metabolism, excretion, and toxicity characteristics of a chemical structure within the context of drug development.

[0006] In particular, the system can define a transformation function by identifying one or more matched molecular pairs that differ by a particular transformation, e.g., structural modification, and can determine the impact of the transformation function by aggregating the experimentally observed changes in the molecular property values between the respective molecules in each of the identified one or more matched molecular pairs as the predicted change in the molecular property values that results from the transformation.

[0007] The system can then identify candidate molecules that are candidates for the application of the particular transformation function based on the inclusion of a target chemical structure associated with the transformation function, and can apply the transformation specified by the transformation function to the candidate molecules. In particular, the system can modify each candidate molecule according to the transformation function to generate a synthetic training input molecule and can apply the determined impact in the molecular property for the transformation function to yield a predicted molecular property value for the synthetic training example.

[0008] More specifically, the system can identify a number of transformation functions using experimentally observed matched molecular pairs and can generate synthetic training examples using the transformation functions. The system can assemble a synthetic training dataset for training a machine learning model using the synthetic training examples, e.g., by a machine learning technique. As an example, the system can use the synthetic training examples to train a machine learning model to generate predicted molecular property values for an input molecule.
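
The flow described in paragraphs [0006] to [0008] can be illustrated with a minimal sketch, which is not the system's actual implementation. It assumes a toy representation in which a molecule is a (scaffold, environment, fragment) triple, with the environment field standing in for the inclusion criterion. Aggregation uses the mean, and transformations are kept only when at least two pairs support them and the standard deviation of the deltas stays under a threshold, as in claims 6 to 10. The names Mol, Transform, derive_transforms, synthesize, and example_data, and the default max_std value, are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from statistics import mean, pstdev
from typing import NamedTuple

class Mol(NamedTuple):
    scaffold: str     # remote part of the molecule (toy stand-in)
    environment: str  # structure near the fragment; toy inclusion criterion
    fragment: str     # the replaceable molecular fragment

class Transform(NamedTuple):
    first: str        # fragment to be replaced
    second: str       # replacement fragment
    environment: str  # inclusion criterion
    delta: float      # predicted change in the property value

def derive_transforms(data, min_pairs=2, max_std=0.5):
    """Build transformation functions from measured {Mol: property value} data."""
    # Matched pairs: same scaffold and environment, different fragment.
    by_rest = defaultdict(list)
    for mol in data:
        by_rest[(mol.scaffold, mol.environment)].append(mol)
    deltas = defaultdict(list)
    for mols in by_rest.values():
        for a, b in permutations(mols, 2):
            key = (a.fragment, b.fragment, a.environment)
            deltas[key].append(data[b] - data[a])
    transforms = []
    for (first, second, env), ds in deltas.items():
        # Require enough supporting pairs and a low dispersion of the deltas.
        if len(ds) >= min_pairs and pstdev(ds) <= max_std:
            transforms.append(Transform(first, second, env, mean(ds)))
    return transforms

def synthesize(data, transforms):
    """Apply each transformation to every molecule that satisfies its criterion."""
    synthetic = {}
    for mol, value in data.items():
        for t in transforms:
            if mol.fragment == t.first and mol.environment == t.environment:
                new_mol = mol._replace(fragment=t.second)
                # Skipping already-measured molecules is an assumption here.
                if new_mol not in data:
                    synthetic[new_mol] = value + t.delta
    return synthetic

# Hypothetical measurements: S3 has no fluorinated analogue yet.
example_data = {
    Mol("S1", "aryl", "H"): 1.0, Mol("S1", "aryl", "F"): 1.5,
    Mol("S2", "aryl", "H"): 2.0, Mol("S2", "aryl", "F"): 2.4,
    Mol("S3", "aryl", "H"): 3.0,
}

if __name__ == "__main__":
    print(synthesize(example_data, derive_transforms(example_data)))
```

With this toy data, the H to F replacement in the "aryl" environment has two supporting pairs with deltas of about 0.5 and 0.4, so the S3 molecule receives a synthetic fluorinated counterpart whose property value is its measured value plus the mean delta.
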
[0009] As another example, the system can identify transformations between reactants in a collection of reactions and generate synthetic training reactions using the transformations. In particular, the system can identify pairs of reactions in which one or more reactants are modified in the same way, e.g., in which at least one reactant in the first reaction and one reactant in the second reaction are matched molecular pairs, and can define a transformation function based on the transformation to the reactants. In this case, the system can train a machine learning model to classify whether an input set of molecules is reactive, e.g., whether an input set of molecules can react to form a reaction product (e.g., comprising one or more product molecules).

[0010] According to a first aspect there is provided a method for obtaining data that identifies a collection of molecules and, for each molecule in the collection of molecules, a molecular property value for the molecule, processing the data identifying the collection of molecules to identify