US-20260128117-A1 - PLATFORMS, SYSTEMS, AND METHODS FOR PROTOTYPE AND SCALE

Abstract

A system may include a data collection facility that collects strain data including biological information for multiple biological strain candidates and receives assay data from experiments. A prototype prediction system generates initial fitness predictions using first AI models trained on historical performance data and identifies an initial strain candidate subset. A scale-up prediction system receives assay data for the initial candidates, analyzes this data using second AI models to generate scale-up performance predictions for bioreactor production conditions, and selects strain candidates for production based on these predictions.

Inventors

John Ata Bachman
Laura Barker
Relly Brandman
Federico Vaggi
Nicholas Ruggero
Carl Hans Albach
Chiam Yu Ng
Jeffrey David Orth
Lin Wang

Assignees

X DEVELOPMENT LLC

Dates

Publication Date: 2026-05-07
Application Date: 2025-12-10

Claims (20)

1. A platform for synthetic biology development, the platform comprising: a data collection facility configured to collect strain data for strain candidates and to receive experimental assay data from biological strain experiments, wherein the strain data comprises biological information for each of the strain candidates, a synthetic biology development system configured to: generate initial performance predictions for the strain candidates using a set of artificial intelligence models, wherein at least one of the set of artificial intelligence models is trained on strain performance data; identify an initial subset of the strain candidates based on the initial performance predictions; receive, from the data collection facility, assay data for the initial subset of the strain candidates; analyze the assay data and the strain data using the set of artificial intelligence models, wherein the set of artificial intelligence models generates scale-up performance predictions for predicting strain performance under bioreactor production conditions; and select at least one strain candidate for production based on the scale-up performance predictions.
2. The platform of claim 1, wherein the biological information comprises one or more of genetic edits, metabolic pathway data, or strain library information.
3. The platform of claim 1, wherein the assay data comprises one or more of yield data, titer data, productivity data, stability data, or growth rate data.
4. The platform of claim 1, wherein the set of artificial intelligence models comprises one or more of a convolutional neural network, a long-short term memory (LSTM) network, or a transformer neural network.
5. The platform of claim 1, wherein the set of artificial intelligence models comprises at least one artificial intelligence model that is trained using a training data set that includes correlations between plate assay data and data collected during bioreactor production.
6. The platform of claim 1, wherein the bioreactor production conditions comprise one or more of temperature profiles, pH setpoints, nutrient concentrations, dissolved oxygen levels, mixing speeds, gas flow rates, or nutrient feeding rates.
7. The platform of claim 1, wherein the synthetic biology development system is further configured to: continuously collect performance data during production of the selected at least one strain candidate; and update the scale-up performance predictions based on the continuously collected performance data.
8. The platform of claim 1, wherein the data collection facility is configured to receive the assay data for the initial subset of the strain candidates after the generation of the initial performance predictions, wherein the synthetic biology development system is further configured to re-train the set of artificial intelligence models using the assay data.
9. The platform of claim 1, wherein the synthetic biology development system is configured to generate embeddings that identify strain-specific sensitivities to process conditions that may affect performance at production scale.
10. The platform of claim 1, wherein the set of artificial intelligence models comprise at least one ensemble model configured to generate uncertainty estimates for the scale-up performance predictions.
11. The platform of claim 1, wherein the scale-up prediction system is configured to generate a digital twin simulation of at least one production facility, wherein the one or more second artificial intelligence models are configured to generate the scale-up performance predictions based on data from the digital twin simulation.
12. A method for synthetic biology development, the method comprising: collecting strain data for strain candidates, wherein the strain data comprises biological information for each of the strain candidates; generating initial performance predictions for the strain candidates using a set of artificial intelligence models, wherein at least one of the set of artificial models is trained on strain performance data; identifying an initial subset of the strain candidates based on the initial performance predictions; receiving assay data from plate assays of the initial subset of the strain candidates; processing the assay data and the strain data using the set of artificial intelligence models, wherein the processing comprises generating scale-up performance predictions for predicting strain performance under bioreactor production conditions; and selecting at least one strain candidate for production based on the scale-up performance predictions.
13. The method of claim 12, wherein the biological information comprises one or more of genetic edits, metabolic pathway data, or strain library information, and wherein the assay data comprises one or more of yield data, titer data, productivity data, stability data, or growth rate data.
14. The method of claim 12, wherein the set of artificial intelligence models comprises at least one artificial intelligence model that is trained using a training data set that includes correlations between plate assay data and data collected during bioreactor production.
15. The method of claim 12, further comprising: continuously collecting performance data during production of the selected at least one strain candidate; and updating the scale-up performance predictions based on the continuously collected performance data.
16. The method of claim 12, further comprising generating a digital twin simulation of at least one production facility, wherein the set of artificial intelligence models generates the scale-up performance predictions based on data from the digital twin simulation.
17. The method of claim 16, wherein the digital twin simulation comprises a simulation of one or more of equipment configurations, operational parameters, environmental conditions, process control settings, material flows, or quality measurements.
18. The method of claim 12, further comprising re-training the set of artificial intelligence models using the assay data received from the plate assays.
19. The method of claim 12, wherein processing the assay data comprises generating embeddings that identify strain-specific sensitivities to process conditions that may affect performance at production scale.
20. One or more non-transitory computer storage media storing instructions that when executed by one or more computers cause the one or more computers to perform operations for synthetic biology development, the operations comprising: collecting strain data for strain candidates, wherein the strain data comprises biological information for each of the strain candidates; generating initial performance predictions for the strain candidates using a set of artificial intelligence models, wherein at least one of the set of artificial models is trained on strain performance data; identifying an initial subset of the strain candidates based on the initial performance predictions; receiving assay data from plate assays of the initial subset of the strain candidates; processing the assay data and the strain data using the set of artificial intelligence models, wherein the processing comprises generating scale-up performance predictions for predicting strain performance under bioreactor production conditions; and selecting at least one strain candidate for production based on the scale-up performance predictions.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS
This application is a continuation of PCT Application No. PCT/US2025/031891, filed on Jun. 2, 2025, which claims priority to U.S. Provisional Patent Application No. 63/655,575, filed on Jun. 3, 2024, and U.S. Provisional Patent Application No. 63/803,471, filed on May 9, 2025. Each of the aforementioned earlier-filed applications is hereby incorporated by reference in its entirety.
BACKGROUND
Most synthetic biology work today is lab-driven, and hence capital intensive, painstaking, expensive, and uncertain. However, the rapid development of AI models, both in general and in pharma and other specific segments of the life sciences, is poised to spur rapid innovation in AI-driven synthetic biology. Competition will emerge as AI, LLMs, and supporting technologies accelerate. These advancements could reduce barriers to entry, contributing to the emergence of a rapidly evolving research and development landscape and marketplace.
SUMMARY
Embodiments include an AI-guided synthetic biology development platform, systems, and methods substantially as shown and described. Embodiments include a method for providing an AI-guided synthetic biology development platform, systems, and methods substantially as shown and described.
In embodiments, a computer-implemented method for data integration in an AI-guided analytic platform for development of biologic synthesis processes may comprise: receiving, by a platform, biologic data from a plurality of databases, wherein the biologic data use different data formats and/or semantics; converting the received biologic data into at least one standardized data format to create an integrated dataset; processing the integrated dataset through at least one data normalization process to minimize batch-specific systematic variation; storing the normalized biologic data in a structured format that describes biologic components and their relationships to other components; applying at least one machine learning method to the normalized biologic data to generate at least one predictive model for synthetic biology design; and outputting at least one specification for biologic system design based on the at least one predictive model.
In embodiments, the data normalization processes used by the platform may include: applying a Bayesian statistical model that incorporates prior knowledge about strain behavior; modeling different sources of variation, including biological effects and technical factors; estimating strain performance while accounting for batch effects and other sources of systematic variability; batch effect correction, wherein the batch effect correction addresses systematic variations across at least one of a plurality of experimental runs, equipment, or operators; multi-modal data integration; or some other type of data normalization process.
In embodiments, multi-modal data integration may include data relating to at least one of an enzyme level, a metabolite concentration, or a gene expression level.
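By way of non-limiting illustration, the batch effect correction and Bayesian strain performance estimation described above might be sketched as follows. The sketch is an assumption-laden simplification and not the platform's actual implementation: the record layout, the function names `correct_batch_effects` and `shrink_strain_estimates`, the use of a control strain to estimate each batch offset, and the normal prior with known noise variance are all hypothetical choices made for illustration, and the numeric values are made up.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (batch_id, strain_id, measured_titer). Values are illustrative only.
RECORDS = [
    ("batch_1", "control", 10.0),
    ("batch_1", "strain_a", 12.0),
    ("batch_1", "strain_b", 9.0),
    ("batch_2", "control", 14.0),
    ("batch_2", "strain_a", 16.0),
    ("batch_2", "strain_b", 13.0),
]


def correct_batch_effects(records, control_id="control"):
    """Remove an additive per-batch offset estimated from control wells.

    Assumption: every batch contains at least one control measurement and
    the batch effect is a constant shift within a batch.
    """
    controls = defaultdict(list)
    for batch, strain, value in records:
        if strain == control_id:
            controls[batch].append(value)
    if not controls:
        raise ValueError("no control measurements found")

    batch_means = {batch: mean(values) for batch, values in controls.items()}
    grand_mean = mean(list(batch_means.values()))

    corrected = []
    for batch, strain, value in records:
        if batch not in batch_means:
            raise ValueError(f"batch {batch!r} has no control measurement")
        offset = batch_means[batch] - grand_mean
        corrected.append((batch, strain, value - offset))
    return corrected


def shrink_strain_estimates(corrected, prior_mean, prior_var, noise_var):
    """Posterior mean per strain under a normal prior and known noise variance.

    This conjugate normal-normal update stands in for the richer Bayesian
    model described in the text; strains with few measurements are pulled
    more strongly toward the prior mean.
    """
    by_strain = defaultdict(list)
    for _, strain, value in corrected:
        by_strain[strain].append(value)

    estimates = {}
    for strain, values in by_strain.items():
        precision = 1.0 / prior_var + len(values) / noise_var
        weighted_sum = prior_mean / prior_var + sum(values) / noise_var
        estimates[strain] = weighted_sum / precision
    return estimates


if __name__ == "__main__":
    corrected = correct_batch_effects(RECORDS)
    estimates = shrink_strain_estimates(
        corrected, prior_mean=12.0, prior_var=4.0, noise_var=1.0
    )
    for strain, estimate in sorted(estimates.items()):
        print(strain, round(estimate, 3))
```

In this sketch, the two batches differ only by a constant shift, so after correction each strain has the same value in both batches. The posterior estimate for each strain then lies between the prior mean and the corrected sample mean, which is one simple way prior knowledge about strain behavior could temper noisy plate measurements.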
In embodiments, data normalization processes used by the platform may include standardized nomenclature across different data sources and quality control normalization, including flagging an anomalous data point and/or flagging a well or sample that failed during an experiment.
In embodiments, data normalization processes used by the platform may include experiment normalization, such as experiment normalization to account for a variation across a plurality of experimental runs using a similar strain or condition. Experiment normalization used by the platform may implement a statistical method to minimize the impact of a technical variation, and/or may use a control sample and a spike-in standard for validation.
In embodiments, data normalization processes used by the platform may include cross-platform data harmonization, including but not limited to data harmonization that standardizes data from a plurality of experimental platforms and setups.
In embodiments, data normalization processes used by the platform may include time series data normalization, wherein the time series data normalization includes normalizing data relating to time-varying growth conditions and/or normalizing data relating to variations in a feed profile or fermentation parameter.
In embodiments, data normalization processes used by the platform may include knowledge graph-based normalization, including but not limited to knowledge graph-based normalization that represents biological entities and relationships in a standardized format, knowledge graph-based normalization that associates information across a plurality of experiments or organisms, and/or knowledge graph-based normalization that integrates a plurality of biological data types.
In embodiments, a predictive model used by the platform may include, but is not limited to, a long-short term mem