KR-20260064652-A - Machine learning system and method for predicting blood-brain barrier permeability
Abstract
A machine learning system and method for predicting blood-brain barrier permeability are provided. The system acquires samples of data associated with molecules from various data sources, transforms the samples into structural representations, and generates multiple features, such as fingerprint representations, from the structural representations. The features are tested to determine whether blood-brain barrier permeability depends on them. The system analyzes the ratio of permeable to impermeable samples in the data samples and, if an imbalance between the sample types is detected, augments the samples with synthetic data to generate a balanced dataset. To generate a selected set of features for the balanced dataset, the system reduces the features used to train the machine learning model by utilizing techniques such as logistic regression. The system trains a machine learning model using the balanced dataset and uses the trained model to predict blood-brain barrier permeability for candidate molecules.
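The balancing step described above can be sketched as follows. This is a minimal illustration only, not the applicant's implementation: the claims name a k-nearest neighbor algorithm for generating synthetic minority-class samples, but the interpolation scheme (in the style of SMOTE), the function names `class_ratio` and `knn_oversample`, the default `k`, and the use of plain numeric feature vectors are all assumptions made for the example.

```python
import math
import random


def class_ratio(labels):
    """Return (n_permeable, n_impermeable) for binary labels (1 = permeable)."""
    permeable = sum(1 for label in labels if label == 1)
    return permeable, len(labels) - permeable


def knn_oversample(X, y, k=3, seed=0):
    """Add synthetic minority-class rows until both classes have equal counts.

    Assumption: each synthetic row is a random interpolation between a
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    permeable, impermeable = class_ratio(y)
    X_out, y_out = [list(row) for row in X], list(y)
    if permeable == impermeable:
        return X_out, y_out  # already balanced, nothing to add

    minority = 1 if permeable < impermeable else 0
    pool = [row for row, label in zip(X, y) if label == minority]
    if not pool:
        raise ValueError("minority class has no samples to interpolate from")

    for _ in range(abs(permeable - impermeable)):
        base = rng.choice(pool)
        # Sort minority samples by distance; position 0 is the base itself
        # (or an identical copy), so it is skipped.
        ranked = sorted(pool, key=lambda row: math.dist(base, row))
        neighbours = ranked[1:k + 1] or [base]
        other = rng.choice(neighbours)
        t = rng.random()
        X_out.append([a + t * (b - a) for a, b in zip(base, other)])
        y_out.append(minority)
    return X_out, y_out
```

Interpolating between binary fingerprint bits yields fractional values, so a real pipeline would need to decide whether to round them or to apply the step to continuous descriptors; the source does not say which.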
Inventors
- Kathad, Umesh
- McDermott, Joseph
- Fontenot, Richard
- Sharma, Panna
Assignees
- Lantern Pharma Inc.
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-03-14
- Priority Date: 2023-03-14
Claims (20)
- A system comprising: a memory that stores instructions; and a processor configured to execute the instructions, wherein the instructions configure the processor to: generate a plurality of features for at least one structural representation of at least one molecule, the plurality of features including at least one molecular fingerprint representation associated with the at least one molecule; perform a chi-square test on the plurality of features of the at least one structural representation to determine whether blood-brain barrier permeability depends on the at least one molecular fingerprint representation; determine a ratio of permeable samples to impermeable samples that are associated with the at least one molecule and comprise the at least one molecular fingerprint representation; augment, based on the ratio, a training dataset including the permeable samples and the impermeable samples by utilizing a k-nearest neighbor algorithm to generate synthetic data for a minority class of the permeable and impermeable samples until sample counts for the training dataset are balanced between the permeable samples and the impermeable samples, thereby generating a balanced training dataset; reduce the plurality of features utilized for the balanced training dataset by using logistic regression with least absolute shrinkage to generate a selected set of features for the balanced training dataset; train an ensemble meta-learner to predict blood-brain barrier permeability by utilizing the balanced training dataset together with the selected set of features; analyze a candidate molecule for blood-brain barrier permeability by utilizing the ensemble meta-learner; and generate, by utilizing the ensemble meta-learner, a prediction of whether the candidate molecule has blood-brain barrier permeability.
- A system according to claim 1, wherein the processor is further configured to generate the at least one structural representation of the at least one molecule by converting the three-dimensional structure of the at least one molecule into a string of symbols identifiable by the system.
- A system according to claim 1, wherein the plurality of features further include descriptors, graph embeddings, or a combination thereof.
- A system according to claim 1, wherein the processor is further configured to determine that blood-brain barrier permeability depends on the at least one molecular fingerprint representation based on the at least one molecular fingerprint representation having a p-value of less than 0.05.
- A system according to claim 1, wherein the processor is further configured to classify the permeable samples among the plurality of samples as permeable based on the fingerprints associated with the permeable samples meeting a threshold blood-brain permeability, and to classify the impermeable samples among the plurality of samples as impermeable based on the fingerprints associated with the impermeable samples falling below the threshold blood-brain permeability.
- A system according to claim 1, wherein the processor is further configured to determine that the impermeable samples are a minority class within the plurality of samples based on the permeable samples being more numerous than the impermeable samples.
- A system according to claim 1, wherein the processor is further configured to reduce the coefficients of certain features among the plurality of features to zero using the logistic regression, so as to exclude those features from the selected set of features.
- A system according to claim 1, wherein the processor is further configured to rank the features within the selected set of features in order of importance based on the absolute value of the coefficient of each feature within the selected set of features.
- A system according to claim 1, wherein the processor is further configured to generate the ensemble meta-learner from at least one base learner model that is trained on the balanced training dataset by utilizing the logistic regression, a deep neural network, or a combination thereof.
- A system according to claim 1, wherein the processor is further configured to determine a predicted probability of permeability for holdout validation samples not included in the balanced training dataset.
- A system according to claim 10, wherein the processor is further configured to utilize the predicted probability of permeability for the holdout validation samples as input to a logistic regression meta-learner ensemble model.
- A system according to claim 1, wherein the processor is further configured to select, as the ensemble meta-learner, the combination of base learner models having the highest area under the receiver operating characteristic curve.
- A method comprising: generating, by utilizing instructions from a memory that are executed by a processor, a plurality of features for at least one structural representation of at least one molecule, the plurality of features including at least one molecular fingerprint representation associated with the at least one molecule; performing, by utilizing the instructions from the memory that are executed by the processor, a chi-square test on the plurality of features of the at least one structural representation to determine whether blood-brain barrier permeability depends on the at least one molecular fingerprint representation; determining a ratio of permeable samples to impermeable samples that are associated with the at least one molecule and comprise the at least one molecular fingerprint representation; augmenting, based on the ratio, a training dataset including the permeable samples and the impermeable samples by utilizing a k-nearest neighbor algorithm to generate synthetic data for a minority class of the permeable and impermeable samples until sample counts for the training dataset are balanced between the permeable samples and the impermeable samples, thereby generating a balanced training dataset; reducing the plurality of features utilized for the balanced training dataset by using logistic regression with least absolute shrinkage to generate a selected set of features for the balanced training dataset; training an ensemble meta-learner to predict blood-brain barrier permeability by utilizing the balanced training dataset together with the selected set of features; analyzing a candidate molecule for blood-brain barrier permeability by utilizing the ensemble meta-learner; and generating, by utilizing the ensemble meta-learner and the instructions from the memory that are executed by the processor, a prediction of whether the candidate molecule has blood-brain barrier permeability.
- A method according to claim 13, further comprising the step of identifying a specific portion of the candidate molecule having blood-brain barrier permeability by utilizing the ensemble meta-learner.
- A method according to claim 13, further comprising the step of generating the ensemble meta-learner from at least one base learner model that is trained on the balanced training dataset by utilizing the logistic regression, a deep neural network, or a combination thereof.
- A method according to claim 13, further comprising the step of stopping the training of at least one base learner model used to generate the ensemble meta-learner at the epoch that yields the highest area under the receiver operating characteristic curve for the holdout samples.
- A method according to claim 13, further comprising the step of reducing the coefficients of certain features among the plurality of features to zero using the logistic regression, so as to exclude those features from the selected set of features.
- A method according to claim 13, further comprising the step of generating the at least one structural representation of the at least one molecule by converting a three-dimensional structure of the at least one molecule into a string of symbols.
- A method according to claim 13, further comprising the step of determining the correlation of blood-brain permeability between the at least one molecule and the at least one candidate molecule.
- A non-transitory computer-readable device comprising instructions that, when loaded and executed by a processor, cause the processor to: generate a plurality of features for at least one structural representation of at least one molecule, the plurality of features including at least one molecular fingerprint representation associated with the at least one molecule; perform a chi-square test on the plurality of features of the at least one structural representation to determine whether blood-brain barrier permeability depends on the at least one molecular fingerprint representation; determine a ratio of permeable samples to impermeable samples that are associated with the at least one molecule and comprise the at least one molecular fingerprint representation; augment, based on the ratio, a training dataset including the permeable samples and the impermeable samples by utilizing a k-nearest neighbor algorithm to generate synthetic data for a minority class of the permeable and impermeable samples until sample counts for the training dataset are balanced between the permeable samples and the impermeable samples, thereby generating a balanced training dataset; reduce the plurality of features utilized for the balanced training dataset by using logistic regression with least absolute shrinkage to generate a selected set of features for the balanced training dataset; train an ensemble meta-learner to predict blood-brain barrier permeability by utilizing the balanced training dataset together with the selected set of features; analyze a candidate molecule for blood-brain barrier permeability by utilizing the ensemble meta-learner; and generate, by utilizing the ensemble meta-learner, a prediction of whether the candidate molecule has blood-brain barrier permeability.
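Two of the claimed steps lend themselves to a short illustration: the chi-square dependence test with the p < 0.05 criterion, and feature reduction by logistic regression with least absolute shrinkage, followed by ranking on absolute coefficient value. The sketch below is an assumption-laden example rather than the claimed implementation: the 2x2 contingency formulation without continuity correction, the proximal-gradient solver, the hyperparameters `lam`, `lr`, and `steps`, and every function name are hypothetical choices made for the example.

```python
import math


def chi_square_2x2(bits, labels):
    """Chi-square test of independence between one fingerprint bit and the label.

    Returns (statistic, p_value) for a 2x2 table with 1 degree of freedom.
    """
    n = len(bits)
    table = [[0, 0], [0, 0]]
    for bit, label in zip(bits, labels):
        table[bit][label] += 1
    rows = [sum(table[0]), sum(table[1])]
    cols = [table[0][0] + table[1][0], table[0][1] + table[1][1]]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / n
            if expected == 0:
                return 0.0, 1.0  # constant bit or label: no evidence of dependence
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the tail probability is erfc(sqrt(stat / 2)).
    return stat, math.erfc(math.sqrt(stat / 2))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, t):
    """Shrink value toward zero by t; values within [-t, t] become exactly 0."""
    if abs(value) <= t:
        return 0.0
    return value - t if value > 0 else value + t


def l1_logistic(X, y, lam=0.1, lr=0.5, steps=500):
    """Fit L1-penalised logistic regression by proximal gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(steps):
        err = [sigmoid(sum(wj * xj for wj, xj in zip(w, row)) + b) - yi
               for row, yi in zip(X, y)]
        grad_w = [sum(e * row[j] for e, row in zip(err, X)) / n for j in range(d)]
        grad_b = sum(err) / n
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b  # the intercept is not penalised
    return w, b


def rank_selected(w, names):
    """Keep features with non-zero coefficients, ordered by absolute coefficient."""
    kept = [(name, wj) for name, wj in zip(names, w) if wj != 0.0]
    return sorted(kept, key=lambda item: abs(item[1]), reverse=True)
```

A feature would pass the screening step when `chi_square_2x2(bits, labels)[1] < 0.05`, and `rank_selected` returns only the features whose coefficients the penalty did not drive to zero. In practice a statistics or machine learning library would likely replace both hand-written routines.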
Description
Machine learning system and method for predicting blood-brain barrier permeability
Cross-Reference to Related Applications
This application claims priority to and the benefit of U.S. Provisional Patent Application No. 63/452,108, filed March 14, 2023, the entirety of which is incorporated herein by reference.
Technical Field
The present application relates to artificial intelligence technologies, machine learning technologies, blood-brain barrier permeability prediction technologies, molecular design technologies, and data analysis technologies, and more specifically, to machine learning systems and accompanying methods for predicting blood-brain barrier permeability.
The blood-brain barrier is a semipermeable membrane that effectively separates circulating blood from the extracellular cerebrospinal fluid in the human central nervous system. Blood-brain barrier permeability is the ability of various substances to pass through the barrier between the human bloodstream and brain tissue. The various cells of the blood-brain barrier prevent the passage of many types of molecules, such as those harmful to the brain. However, the blood-brain barrier allows certain substances, such as water, oxygen, and fat-soluble molecules, to cross, which enables the passage of essential nutrients.
Effectively improving blood-brain barrier permeability for drug delivery purposes is an increasingly sought-after goal. To this end, various pharmaceutical companies have adopted technical tools, such as software and artificial intelligence systems, to determine or predict the blood-brain barrier permeability of specific molecules under consideration as drugs. Although blood-brain barrier permeability predictions made from molecular structure using certain existing machine learning and deep learning methods have been shown to be somewhat accurate, current approaches suffer from several major flaws that reduce their applicability and utility.
For example, current state-of-the-art methods generate black-box models that cannot provide insight into why a molecule is predicted to be permeable or impermeable, which makes it nearly impossible to use the predictions as tools for improving blood-brain barrier permeability.
Based on at least the foregoing, there remains room for significant improvement to existing technologies and processes, as well as for the development of new technologies and processes that provide blood-brain barrier permeability prediction capabilities. For example, current technologies can be enhanced to provide better artificial intelligence model performance on validation data, more efficient use of computing resources while generating models and predictions, greater interpretability, and various other benefits. Such enhancements to methodologies and technologies can provide a greater understanding of which parts of molecules are correlated with blood-brain barrier permeability and, ultimately, which molecules are optimal candidates for treating various health conditions.
Systems and accompanying methods for predicting blood-brain barrier permeability are disclosed. In particular, the systems and methods utilize unique processes to generate machine learning models capable of effectively predicting whether a specific molecule under consideration possesses blood-brain barrier permeability while utilizing fewer computing resources and features. As a result, the machine learning models generated by the systems and methods are more robust and interpretable. The capabilities provided by the systems and methods also facilitate an understanding of how specific chemical structures of the molecule under consideration affect blood-brain barrier permeability, and of how molecular design can be improved or modified to enhance blood-brain barrier permeability.
Furthermore, the systems and methods provide unique model interpretation analyses that advance the chemical engineering of blood-brain-barrier-permeable therapeutics.
In certain embodiments, a system for predicting blood-brain barrier permeability is provided. In certain embodiments, the system may include a memory that stores instructions and a processor that executes the instructions to perform various operations. In certain embodiments, the processor may be configured to generate a plurality of features for one or more structural representations of one or more molecules. In certain embodiments, the plurality of features may include one or more molecular fingerprint representations associated with the one or more molecules, descriptors, graph embeddings, any other features, or a combination thereof. In certain embodiments, the processor may be configured to perform a chi-square test on the plurality of features of the one or more structural representations to determine whether blood-brain barrier permeability depends on the one or more molecular fingerprint representations. In c