US-20260127442-A1 - PLATFORMS, SYSTEMS, AND METHODS FOR PERFORMANCE PREDICTION USING MACHINE LEARNING MODELS

US 20260127442 A1

Abstract

A method performed by one or more computers may comprise: receiving information about an entity; generating a set of representations based on the information about the entity; generating a set of edits of the entity based on the set of representations; and generating a performance prediction for each edit of the set of edits of the entity based on a pre-trained generalization machine learning model applied to each edit of the set of edits.

Inventors

  • John Ata Bachman
  • Nicholas Ruggero
  • Federico Vaggi
  • Chiam Yu Ng
  • Relly Brandman
  • Laura Barker
  • Lin Wang
  • Carl Hans Albach

Assignees

  • X DEVELOPMENT LLC

Dates

Publication Date
2026-05-07
Application Date
2025-12-29

Claims (20)

  1. A method performed by one or more computers, the method comprising: receiving information about an entity; generating a set of representations based on the information about the entity; generating a set of edits of the entity based on the set of representations; and generating a performance prediction for each edit of the set of edits of the entity based on a pre-trained generalization machine learning model applied to each edit of the set of edits.
  2. The method of claim 1, wherein the information about the entity further comprises information describing a plurality of edits to the entity.
  3. The method of claim 1, wherein the set of representations further comprises a set of embeddings based on the information about the entity, and the generating comprises processing the information about the entity using one or more embedding models, and each of the one or more embedding models: receives the information about the entity as input; and applies computational transformations to the input using a corresponding embedding model to generate a multi-dimensional vector representation for each of the set of edits, wherein each multi-dimensional vector representation generated by the one or more embedding models is added to the set of embeddings.
  4. The method of claim 3, wherein the performance prediction of the entity is based on inputting the set of embeddings to the pre-trained generalization machine learning model, wherein the pre-trained generalization machine learning model includes at least one neural network trained to predict performance of the entity based on training data including: information about a plurality of edits; and target data indicating a performance of each of the set of edits of the entity, wherein the pre-trained generalization machine learning model applies computational transformations to the set of embeddings to generate the performance prediction.
  5. The method of claim 3, wherein the one or more embedding models include two or more of a GenePT model, a Proteinfer model, a pFBA-PCA model, or a GO-PCA model, the method further comprising aggregating the multi-dimensional vector representations generated by the two or more embedding models to create the set of embeddings.
  6. The method of claim 3, wherein each token of the set of embeddings corresponds to an edit of the set of edits.
  7. The method of claim 3, wherein generating the set of embeddings occurs at prediction time.
  8. The method of claim 3, wherein generating the set of embeddings occurs prior to training, the method further comprising caching the generated embeddings for later use at prediction time.
  9. The method of claim 1, wherein the pre-trained generalization machine learning model comprises a first stage that generates a strain embedding characterizing the entity and a second stage that generates the performance prediction based on the strain embedding.
  10. The method of claim 9, wherein the first stage is one or more of a long short-term memory (LSTM) model, a transformer model, or a convolutional neural network (CNN) model.
  11. The method of claim 9, wherein the second stage is a multi-layer perceptron.
  12. The method of claim 1, further comprising receiving process condition information, wherein the pre-trained generalization machine learning model is trained to predict performance with respect to a set of process conditions as indicated by process inputs, and wherein generating the performance prediction for each edit of the set of edits of the entity is further based on process inputs corresponding to the process condition information.
  13. The method of claim 1, wherein the pre-trained generalization machine learning model is trained using a two-step process including: pre-training the generalization machine learning model using training data; and fine-tuning the generalization machine learning model using additional strain-specific data, wherein the training data is a larger data set than the strain-specific data.
  14. The method of claim 1, further comprising updating the pre-trained generalization machine learning model using an active learning process including: generating a set of candidate modifications; generating a corresponding performance prediction for each of the set of candidate modifications using the pre-trained generalization machine learning model; receiving experimental data associated with at least a portion of the set of candidate modifications; updating training data using the experimental data; and re-training the pre-trained generalization machine learning model using the updated training data.
  15. The method of claim 14, further comprising determining which portion of the set of candidate modifications to test via experiment based at least in part on an uncertainty quantification generated by the pre-trained generalization machine learning model.
  16. The method of claim 1, wherein the pre-trained generalization machine learning model is an ensemble of multiple pre-trained generalization machine learning models.
  17. The method of claim 1, wherein the information about the entity comprises information about a base strain of the entity.
  18. A system comprising: one or more computers; and one or more storage devices communicatively coupled to the one or more computers, wherein the one or more storage devices store instructions that, when executed by the one or more computers, cause the one or more computers to perform operations comprising: receiving information about an entity; generating a set of representations based on the information about the entity; generating a set of edits of the entity based on the set of representations; and generating a performance prediction for each edit of the set of edits of the entity based on a pre-trained generalization machine learning model applied to each edit of the set of edits.
  19. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations comprising: receiving information about an entity; generating a set of representations based on the information about the entity; generating a set of edits of the entity based on the set of representations; and generating a performance prediction for each edit of the set of edits of the entity based on a pre-trained generalization machine learning model applied to each edit of the set of edits.
  20. The non-transitory computer storage media of claim 19, wherein the information about the entity further comprises information describing a plurality of edits to the entity.
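The pipeline of claims 1, 3, and 9-11 (edits → per-model embeddings → aggregated embedding set → two-stage predictor) can be sketched as follows. This is a minimal illustrative sketch, not the claimed implementation: the hash-seeded embedding functions are hypothetical stand-ins for the embedding models named in claim 5 (e.g. GenePT, Proteinfer), mean pooling stands in for the LSTM/transformer/CNN first stage of claim 10, and a tiny untrained MLP stands in for the second stage of claim 11. All function names, dimensions, and edit strings are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the embedding models of claim 5: each maps an
# edit description to a fixed-size vector, deterministically per (model, edit).
def embed_model_a(edit: str, dim: int = 8) -> np.ndarray:
    seed = abs(hash(("a", edit))) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

def embed_model_b(edit: str, dim: int = 8) -> np.ndarray:
    seed = abs(hash(("b", edit))) % (2**32)
    return np.random.default_rng(seed).normal(size=dim)

def embed_edits(edits):
    """Aggregate per-model vectors by concatenation (one way to realize the
    aggregation step of claim 5)."""
    return np.stack([np.concatenate([embed_model_a(e), embed_model_b(e)])
                     for e in edits])

# Two-stage predictor (claims 9-11): stage 1 pools edit embeddings into a
# strain embedding; stage 2 is a small MLP mapping it to a scalar prediction.
def strain_embedding(edit_embeddings: np.ndarray) -> np.ndarray:
    return edit_embeddings.mean(axis=0)  # simple pooling in place of LSTM/CNN

def mlp_predict(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> float:
    return float(np.maximum(w1 @ x, 0) @ w2)  # one hidden ReLU layer

edits = ["knockout:geneA", "promoter_swap:geneB", "insertion:pathwayC"]
emb = embed_edits(edits)                          # shape (n_edits, 16)
w1, w2 = rng.normal(size=(4, 16)), rng.normal(size=4)
# One performance prediction per edit, as in claim 1.
preds = [mlp_predict(strain_embedding(emb[i:i + 1]), w1, w2)
         for i in range(len(edits))]
print(preds)
```

The per-edit predictions here are meaningless numbers (the MLP weights are random, not trained); the sketch only shows the data flow the claims describe.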

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of PCT Application No. PCT/US2025/031891, filed on Jun. 2, 2025, which claims priority to U.S. Provisional Patent Application No. 63/655,575, filed on Jun. 3, 2024, and U.S. Provisional Patent Application No. 63/803,471, filed on May 9, 2025. Each of the aforementioned earlier-filed applications is hereby incorporated by reference in its entirety.

BACKGROUND

Most synthetic biology work today is lab-driven, and hence capital-intensive, painstaking, expensive, and uncertain. However, the rapid development of AI models in general, as well as in pharma and specific segments within the life sciences, is poised to spur rapid innovation in AI-driven synthetic biology. Competition will emerge as AI, LLMs, and supporting technologies accelerate. These advancements could reduce barriers to entry, contributing to the emergence of a rapidly evolving research and development landscape and marketplace.

SUMMARY

Embodiments include an AI-guided synthetic biology development platform, systems, and methods substantially as shown and described. Embodiments include a method for providing an AI-guided synthetic biology development platform, systems, and methods substantially as shown and described.
In embodiments, a computer-implemented method for data integration in an AI-guided analytic platform for development of biologic synthesis processes may comprise: receiving, by a platform, biologic data from a plurality of databases, wherein the biologic data use different data formats and/or semantics; converting the received biologic data into at least one standardized data format to create an integrated dataset; processing the integrated dataset through at least one data normalization process to minimize batch-specific systemic variation; storing the normalized biologic data in a structured format that describes biologic components and their relationships to other components; applying at least one machine learning method to the normalized biologic data to generate at least one predictive model for synthetic biology design; and outputting at least one specification for biologic system design based on the at least one predictive model.

In embodiments, the data normalization processes used by the platform may include applying a Bayesian statistical model that incorporates prior knowledge about strain behavior, modeling different sources of variation including biological effects and technical factors, estimating strain performance while accounting for batch effects and other sources of systematic variability, batch effect correction, wherein a batch effect correction addresses systematic variations across at least one of a plurality of experimental runs, equipment, or operators, multi-modal data integration, or some other type of data normalization process. In embodiments, multi-modal data integration may include data relating to at least one of an enzyme level, a metabolite concentration, or a gene expression level.
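The batch-effect correction described above can be illustrated with a deliberately simple stand-in: normalizing each measurement against a per-run control sample to remove additive run-to-run offsets. The patent contemplates richer approaches (e.g. the Bayesian statistical model above); the data values, run names, and strain names here are invented toy data.

```python
# Toy titer measurements: the same strains observed across two experimental
# runs that differ by a systematic offset (a "batch effect").
batches = {
    "run1": {"strainA": 1.2, "strainB": 0.9, "control": 1.0},
    "run2": {"strainA": 1.7, "strainB": 1.4, "control": 1.5},
}

def normalize_to_control(batches):
    """Per-batch correction: express each measurement relative to the
    batch's control sample, removing additive run-to-run offsets.
    (A simple stand-in for the Bayesian/batch-effect models described above.)"""
    out = {}
    for run, samples in batches.items():
        ref = samples["control"]
        out[run] = {s: v - ref for s, v in samples.items() if s != "control"}
    return out

norm = normalize_to_control(batches)
print(norm)
# After correction, strainA reads ~+0.2 relative to control in both runs.
```

A control-relative shift is the crudest form of harmonization; scaling factors, spike-in standards, and hierarchical models (as the text describes) handle multiplicative and structured variation that subtraction cannot.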
In embodiments, data normalization processes used by the platform may include standardized nomenclature across different data sources, and quality control normalization, including flagging an anomalous data point and/or flagging a well or sample that failed during an experiment. In embodiments, data normalization processes used by the platform may include experiment normalization, such as experiment normalization to account for a variation across a plurality of experimental runs using a similar strain or condition. Experiment normalization used by the platform may implement a statistical method to minimize the impact of a technical variation, and/or may use a control sample and spike-in standard for validation.

In embodiments, data normalization processes used by the platform may include cross-platform data harmonization, including but not limited to data harmonization that standardizes data from a plurality of experimental platforms and setups. In embodiments, data normalization processes used by the platform may include time series data normalization, wherein the time series data normalization includes normalizing data relating to time-varying growth conditions, such as variations in a feed profile or fermentation parameter. In embodiments, data normalization processes used by the platform may include knowledge graph-based normalization, including but not limited to knowledge graph-based normalization that represents biological entities and relationships in a standardized format, that associates information across a plurality of experiments or organisms, and/or that integrates a plurality of biological data types.

In embodiments, a predictive model used by the platform may include, but is not limited to, a long short-term memory (LSTM) model.
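The active-learning loop of claims 14-15, with the model ensemble of claim 16 supplying the uncertainty quantification, can be sketched as follows. This is an illustrative sketch under stated assumptions: linear models fit on bootstrap resamples stand in for the ensemble of generalization models, ensemble prediction variance stands in for the uncertainty quantification, and all data, shapes, and function names are invented.

```python
import numpy as np

rng = np.random.default_rng(7)

# Toy training set: features of already-tested modifications and their
# measured performance (synthetic linear ground truth plus noise).
X = rng.normal(size=(30, 5))
y = X @ np.array([1.0, -0.5, 0.0, 2.0, 0.3]) + rng.normal(scale=0.1, size=30)

def fit_ensemble(X, y, n_members=10):
    """Ensemble stand-in (claim 16): linear least-squares models fit on
    bootstrap resamples; member disagreement quantifies uncertainty."""
    members = []
    for _ in range(n_members):
        idx = rng.integers(0, len(X), size=len(X))  # bootstrap resample
        w, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        members.append(w)
    return np.stack(members)

def select_for_experiment(members, candidates, k=3):
    """Rank candidate modifications by ensemble prediction variance and
    return the k most uncertain ones to test next (claim 15)."""
    preds = candidates @ members.T          # (n_candidates, n_members)
    uncertainty = preds.std(axis=1)
    return np.argsort(uncertainty)[::-1][:k]

members = fit_ensemble(X, y)
candidates = rng.normal(size=(20, 5))       # untested candidate modifications
chosen = select_for_experiment(members, candidates)
print(chosen)                               # indices of the 3 most uncertain
```

In the claimed loop, the selected candidates would be tested experimentally, the results folded into the training data, and the ensemble re-trained, repeating until predictions converge or the experimental budget is spent.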