CN-122024913-A - Molecular intelligent design method based on mathematical programming method and skeleton-group coupling
Abstract
A molecular intelligent design method based on a mathematical programming method and skeleton-group coupling. The method takes Bemis-Murcko skeletons and group fragments as structural building blocks, obtains a subset of promising skeletons through skeleton similarity retrieval, and constructs the candidate structural space. On this basis, fragment selection and counting for candidate molecules are modeled as a mixed integer nonlinear programming (MINLP) problem, in which structural feasibility constraints, property threshold constraints, and the scores or probabilities output by a deep learning model are introduced uniformly as objective functions or constraints. In addition, an improved SMILES-based structure generation mechanism provides a consistent bridge between the fragment set and the structure string representation during solving. Combined with a decomposition solving strategy, feasible fragment solutions are generated first, structure representations are then generated and screened by nonlinear property and model evaluation, and finally the molecular structures that satisfy the constraints and are optimal or near-optimal with respect to the target are output.
Inventors
- LIU QILEI
- ZHAO YUJING
Assignees
- Dalian University of Technology (大连理工大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260225
Claims (7)
- 1. A molecular intelligent design method based on a mathematical programming method and skeleton-group coupling, characterized by comprising the following steps: Step 1, establishing a molecular database; Step 2, establishing a database of Bemis-Murcko skeletons, namely extracting the Bemis-Murcko skeleton from each molecular structure in the molecular database by using the Bemis-Murcko algorithm in RDKit; Step 3, for the skeleton structure of the target molecule, searching the skeleton database with a skeleton-based similarity algorithm for a skeleton subset G1 similar to the target molecule skeleton, and at the same time selecting a set of commonly used groups to form a group set G2, wherein the skeleton subset G1 and the group set G2 together define the fragment space of candidate molecules, and candidate molecules are generated by combining skeleton fragments and group fragments; Step 4, designing promising candidate molecules, namely expressing the fragment combination selection problem of the candidate molecules as a mixed integer nonlinear programming (MINLP) model, with fragment selection and fragment counting as the core decision variables, structure and property constraints as the feasibility conditions, and the objective function driving a global optimization search over the fragment combinations of the candidate molecules, and identifying, by solving the MINLP model, the optimal candidate molecular structure with a high binding affinity probability among the feasible solutions, wherein the feasible solutions are generated by optimized combination of a molecular skeleton and groups, the combination process is restricted by the MINLP model constraints, and the objective function maximizes the molecular property to be optimized; the MINLP model is formulated as the objective function F_obj, which maximizes Property, subject to the constraints of Steps 5-8: Step 5, deep learning constraint, wherein general equation (1) represents the deep learning model for predicting the molecular property to be optimized; Step 6, molecular structure constraints, wherein general equation (2) represents the octet rule, the valence bond rule and the chemical complexity constraints, which ensure that the combination of skeleton and groups produces a structurally reasonable molecule; Step 7, molecular property constraints, wherein general equations (3) and (4) represent properties predicted from the groups and from the SMILES, respectively; Step 8, other constraints, wherein general equation (5) represents an improved SMILES-based isomer generation algorithm for automatically converting the fragment set of a candidate molecule, comprising a skeleton and groups, into the corresponding molecular SMILES strings; in the above general equations, F_obj is the objective function, Property is the molecular property to be optimized, i denotes a fragment involved in the candidate molecule, n_i denotes the number of occurrences of fragment i in the candidate molecule, s is the SMILES representation of the molecule, m is the type of structural constraint, each structural constraint m has a lower and an upper bound, p is a molecular property, k is the property type, and each property p_k has a lower and an upper bound; Step 9, solving the MINLP model with a decomposition-based solution algorithm, and, if no best candidate molecule satisfying all constraints is found, returning to Step 4 and relaxing the constraint ranges.
- 2. The molecular intelligent design method based on a mathematical programming method and skeleton-group coupling according to claim 1, wherein Step 1 specifically comprises: Step 1.1, collecting small-molecule structures with unique identifiers from public or private data sources, retrieving the corresponding structure record for each molecule from a structure database based on its unique identifier, deleting molecules for which no valid structure record can be obtained, and passing the remaining molecules on to further screening; Step 1.2, screening the molecules by their properties against a preset set of property thresholds, and obtaining the property information and isomeric SMILES strings of the molecules that pass; after applying the criteria of Steps 1.1 and 1.2, a molecular database is established containing the small molecules and their identifiers, structural representations, isomeric SMILES strings and property information.
- 3. The molecular intelligent design method based on a mathematical programming method and skeleton-group coupling according to claim 1, wherein in Step 3 the skeleton-based similarity algorithm combines six similarity metrics with four molecular representation methods to form 24 combinations; each combination is used to identify skeletons similar to the target molecule skeleton, the three most similar skeletons obtained from each combination are taken, and duplicate skeletons are removed to obtain the final subset of similar skeletons G1; the six similarity metrics are Tanimoto, Dice, Cosine, Sokal, Russel and Kulczynski, and the four molecular representation methods are topological fingerprints, MACCS keys, ECFP fingerprints and FCFP fingerprints.
- 4. The molecular intelligent design method based on a mathematical programming method and skeleton-group coupling according to claim 1, wherein Step 6 specifically comprises: the specific molecular structure constraints of the MINLP model are given by equations (6)-(12), among which one structural constraint means that a candidate molecule selects only one skeleton from the skeleton set, and others are structural constraints on chemical complexity; the quantities appearing in equations (6)-(12) are: the number of times fragment i is used in the candidate molecule; the number of bonds of fragment i; the set of fragments; the skeleton subset; the group set; a fragment subset equal to {-CH3, -CH2, -CH, CH2=CH-, -CH=CH-, CH2=C<, -CH=C<}; two fragment counts; and the valence of the fragments.
- 5. The molecular intelligent design method based on a mathematical programming method and skeleton-group coupling according to claim 1, wherein Step 7 specifically comprises: the property constraints used to screen candidate molecules in the MINLP model enter the decomposition solving process in linear form and in nonlinear form, respectively.
- 6. The molecular intelligent design method based on a mathematical programming method and skeleton-group coupling according to claim 1, wherein Step 8 specifically comprises: the SMILES string of a candidate molecule is the input to the models that predict properties such as Property, so a bridge must be established within the structural constraints to associate the SMILES string with the fragment set, otherwise the MINLP model cannot be solved; therefore an improved SMILES-based isomer generation algorithm is used to automatically convert the fragment set of a candidate molecule into the corresponding SMILES strings so that the MINLP model can be solved, and the algorithm can identify the candidate molecule isomers that share the same fragment set; the improved SMILES-based isomer generation algorithm is executed with the RWMol module of the RDKit library; a molecule is represented by a set of fragments comprising a skeleton and groups, this information is passed to the isomer generation algorithm, and the algorithm proceeds as follows: (1) selecting the skeleton as the seed and attaching groups as leaves to the skeleton according to a preset group-adding order, then checking whether " " is present in the seed-leaf structure; if so, adding another group to the seed-leaf structure according to the preset group-adding order; if not, proceeding to the next step; (2) checking whether all groups have been added to the seed-leaf structure; if not, deleting the seed-leaf structure and repeating step (1) with the next preset group-adding order; if so, proceeding to the next step; (3) checking whether a SMILES structure can be generated from the seed-leaf structure by the RDKit library; if not, repeating steps (1)-(2) with the next preset group-adding order; if so, saving the SMILES result and repeating steps (1)-(2) with the next preset group-adding order; after all preset group-adding orders have been tried, all possible SMILES structures are obtained.
- 7. The molecular intelligent design method based on a mathematical programming method and skeleton-group coupling according to claim 1, wherein Step 9 specifically comprises: Step 9.1, solving the MINLP optimization model with a decomposition-based solution algorithm, namely decomposing the MINLP model into one mixed integer linear programming (MILP) sub-problem and three nonlinear programming (NLP) sub-problems; Step 9.2, sub-problem 1 (MILP): first imposing the structural constraints of the octet rule, the valence bond rule and chemical complexity together with the linear property constraints, and generating N1 feasible solutions in GAMS using the BARON solver, wherein each feasible solution is a candidate molecule represented by a fragment set; Step 9.3, sub-problem 2 (NLP): applying the constraint of the improved SMILES-based isomer generation algorithm to generate N2 candidate molecule SMILES strings from the N1 fragment sets, where N2 ≥ N1; Step 9.4, sub-problem 3 (NLP): considering the nonlinear property constraints, calculating the corresponding properties of the N2 candidate SMILES strings with nonlinear property prediction models and eliminating the SMILES strings that do not satisfy the constraints; Step 9.5, sub-problem 4 (NLP): considering the deep learning model constraint, calculating the objective function Property with the nonlinear deep learning model and ranking the candidate SMILES strings by the objective function.
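The general MINLP form described in claim 1 can be sketched as follows. The equations themselves are not reproduced in this text, so the function names f_DL, g_m, p_k and h, the bound superscripts L and U, and the index sets K_group and K_SMILES are assumed notation, reconstructed only from the symbol definitions given in the claim; the original equations may differ in detail.

```latex
% Assumed notation: f_DL = deep learning model, g_m = structural constraint
% function, h = SMILES-based isomer generation, L/U = lower/upper bounds.
\begin{align}
\max_{n_i}\quad & F_{\mathrm{obj}} = \mathrm{Property} \nonumber\\
\text{s.t.}\quad
& \mathrm{Property} = f_{\mathrm{DL}}(s) \tag{1}\\
& g_m^{L} \le g_m(n_i) \le g_m^{U} \quad \forall m \tag{2}\\
& p_k^{L} \le p_k(n_i) \le p_k^{U} \quad \forall k \in K_{\mathrm{group}} \tag{3}\\
& p_k^{L} \le p_k(s) \le p_k^{U} \quad \forall k \in K_{\mathrm{SMILES}} \tag{4}\\
& s \in h(n_i) \tag{5}
\end{align}
```

Constraint (5) is written as set membership because, per claim 7, one fragment set may yield several isomer SMILES strings (N2 ≥ N1).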
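The skeleton retrieval of claim 3 (every fingerprint paired with every similarity metric, top three per pair, duplicates removed) can be sketched as below. This is a minimal illustration, not the patented implementation: the two toy fingerprints (`char_fp`, `bigram_fp`) and the two hand-written metrics are hypothetical stand-ins for the four fingerprint types and six metrics named in the claim, which would normally come from a cheminformatics library such as RDKit.

```python
from typing import Callable, Dict, FrozenSet, List

Fingerprint = FrozenSet[str]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity: |A & B| / |A | B|."""
    common = len(a & b)
    union = len(a) + len(b) - common
    return common / union if union else 0.0


def dice(a: Fingerprint, b: Fingerprint) -> float:
    """Dice similarity: 2 |A & B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def char_fp(smiles: str) -> Fingerprint:
    """Toy fingerprint (assumption): the set of characters in the string."""
    return frozenset(smiles)


def bigram_fp(smiles: str) -> Fingerprint:
    """Toy fingerprint (assumption): the set of adjacent character pairs."""
    return frozenset(smiles[i:i + 2] for i in range(len(smiles) - 1))


def similar_skeletons(
    target: str,
    library: List[str],
    fingerprinters: Dict[str, Callable[[str], Fingerprint]],
    metrics: Dict[str, Callable[[Fingerprint, Fingerprint], float]],
    top_n: int = 3,
) -> List[str]:
    """Collect the top_n skeletons of every (fingerprint, metric) pair,
    dropping duplicates while keeping first-seen order."""
    selected: List[str] = []
    for fp_fn in fingerprinters.values():
        target_fp = fp_fn(target)
        lib_fps = {s: fp_fn(s) for s in library}
        for metric in metrics.values():
            ranked = sorted(
                library,
                key=lambda s: metric(target_fp, lib_fps[s]),
                reverse=True,
            )
            for skeleton in ranked[:top_n]:
                if skeleton not in selected:
                    selected.append(skeleton)
    return selected
```

With four fingerprint functions and six metrics supplied in the two dictionaries, the same loop covers the 24 combinations of the claim.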
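The structural feasibility checks of claim 4 can be illustrated with a small sketch. Equations (6)-(12) are not reproduced in this text, so the code below is an assumption: it uses the single-skeleton rule stated in the claim, the classic group-contribution form of the octet rule for fragments joined in a tree (the bond count is half the sum of n_i v_i and also equals the sum of n_i minus 1, hence the sum of n_i (2 - v_i) equals 2), and a simple bound on the number of groups as a stand-in for the chemical complexity constraints. The fragment names and the treatment of a skeleton's attachment sites as its valence are hypothetical.

```python
from typing import Dict, Set


def is_structurally_feasible(
    counts: Dict[str, int],
    valence: Dict[str, int],
    skeletons: Set[str],
    min_groups: int,
    max_groups: int,
) -> bool:
    """Check a fragment-count vector against three assumed structural rules."""
    # Rule 1 (from the claim): the molecule uses only one skeleton.
    if sum(counts.get(f, 0) for f in skeletons) != 1:
        return False
    # Rule 2 (assumed octet-rule form): fragments joined as a tree leave no
    # free valence exactly when sum_i n_i * (2 - v_i) == 2.
    if sum(n * (2 - valence[f]) for f, n in counts.items()) != 2:
        return False
    # Rule 3 (assumed complexity bound): limit the number of attached groups.
    n_groups = sum(n for f, n in counts.items() if f not in skeletons)
    return min_groups <= n_groups <= max_groups
```

For example, a skeleton with two attachment sites plus two -CH3 groups passes the octet check (0 + 2 = 2), while the same skeleton with a single -CH3 leaves one site unfilled and is rejected.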
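The seed-leaf procedure of claim 6 can be sketched in plain Python string operations. This is an illustration under stated assumptions, not the RWMol-based implementation of the claim: open attachment sites are assumed to be marked with `*`, written as `(*)` on the skeleton; a leaf that still carries a `*` offers a further site after it is attached; and `is_valid` is a hypothetical hook where an RDKit parse check would go.

```python
from itertools import permutations
from typing import Callable, List, Sequence

OPEN_SITE = "*"  # assumption: marks an unfilled attachment point


def attach(seed: str, leaf: str) -> str:
    """Fill the first open site of the seed with the leaf fragment."""
    return seed.replace(OPEN_SITE, leaf, 1)


def generate_isomers(
    skeleton: str,
    groups: Sequence[str],
    is_valid: Callable[[str], bool] = lambda s: True,
) -> List[str]:
    """Try every group-adding order and keep the closed structures."""
    results = set()
    for order in set(permutations(groups)):
        seed = skeleton
        used = 0
        # Step (1): keep adding leaves while an open site remains.
        for leaf in order:
            if OPEN_SITE not in seed:
                break
            seed = attach(seed, leaf)
            used += 1
        # Step (2): discard the structure unless every group was added.
        if used != len(groups) or OPEN_SITE in seed:
            continue
        # Step (3): keep the string only if it passes the validity check.
        if is_valid(seed):
            results.add(seed)
    return sorted(results)
```

Without canonicalization, two different strings may describe the same molecule; a canonical SMILES step would be needed to remove such duplicates.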
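The four-stage decomposition of claim 7 can be sketched as a pipeline of callables. This is a simplified illustration only: brute-force enumeration stands in for the MILP solve that the claim performs with BARON in GAMS, and the four callables (`is_feasible_linear`, `to_smiles`, `nonlinear_ok`, `objective`) are hypothetical placeholders for the structural and linear property constraints, the isomer generation algorithm, the nonlinear property models and the deep learning model.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Counts = Dict[str, int]


def solve_by_decomposition(
    fragments: List[str],
    max_count: int,
    is_feasible_linear: Callable[[Counts], bool],
    to_smiles: Callable[[Counts], List[str]],
    nonlinear_ok: Callable[[str], bool],
    objective: Callable[[str], float],
) -> List[Tuple[float, str]]:
    """Return (objective, SMILES) pairs, best first."""
    # Sub-problem 1 (MILP stand-in): enumerate fragment-count vectors and
    # keep those passing the structural and linear property constraints.
    feasible_sets: List[Counts] = []
    for combo in product(range(max_count + 1), repeat=len(fragments)):
        counts = {f: n for f, n in zip(fragments, combo) if n > 0}
        if counts and is_feasible_linear(counts):
            feasible_sets.append(counts)
    # Sub-problem 2: convert each fragment set into one or more strings.
    smiles = [s for counts in feasible_sets for s in to_smiles(counts)]
    # Sub-problem 3: drop strings that violate nonlinear property bounds.
    screened = [s for s in smiles if nonlinear_ok(s)]
    # Sub-problem 4: score with the objective model and rank.
    return sorted(((objective(s), s) for s in screened), reverse=True)
```

If the returned list is empty, claim 1 prescribes returning to Step 4 and relaxing the constraint ranges before solving again.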
Description
Molecular intelligent design method based on mathematical programming method and skeleton-group coupling
Technical Field
The invention belongs to the fields of computer-aided molecular design, molecular informatics and optimization computing, and in particular relates to a molecular intelligent design algorithm that combines skeleton-group fragment representation, machine learning/deep learning property prediction and mathematical programming (mixed integer nonlinear programming, MINLP), together with a system implementation method thereof.
Background
Molecular structure design is a typical problem at the intersection of chemistry and information technology. Its goal is to screen or generate, from a vast chemical structure space and within acceptable time and computational resources, molecular structures that meet specified objectives. An objective may take the form of a property index (e.g., stability, solubility, hydrophobicity, polarity-related indices, synthetic feasibility indices), a structural constraint (e.g., connection sites, valence rules, structural complexity limits), or a score or probability output by a data-driven model (e.g., the activity or performance probability for a given task, or a composite score). Across scenarios such as pharmaceuticals, materials, catalysis, solvents and functional chemicals, molecular structure design generally shares two characteristics: (1) the structural space is discrete and subject to combinatorial explosion: when a "skeleton + groups/substituents" combination scheme is adopted, the number of candidates grows exponentially with the sizes of the skeleton set and the group set; (2) the constraints and objectives are diverse and coupled: there is often more than one design objective, several hard constraints (which must be satisfied) and soft constraints (which should be satisfied as far as possible) exist at the same time, and all of them are strongly coupled with the structure selection process.
In the prior art, one common route is to generate a large number of candidate molecules by rules or enumeration and then filter and rank them through property calculation or prediction. This route is relatively direct, but when the structural space is large and the constraints are numerous, it produces many candidates with invalid structures or with properties that clearly violate the thresholds, so that a large amount of computing resources is spent on invalid samples; moreover, structural and property constraints can only be checked after generation, and it is difficult to guarantee that structures are compliant as they are generated. Another route uses heuristic search or evolutionary algorithms to find better structures by iteratively improving an objective function. Although this reduces the enumeration scale to some extent, complex penalty terms or repair strategies are often needed to maintain feasibility in multi-constraint scenarios, the results are sensitive to hyperparameters and initial conditions, stability and repeatability are insufficient, and it is difficult to interpret a result as "optimal or near-optimal" with clear solution semantics. In recent years, data-driven generative models (such as those based on latent variable models, adversarial generation, reinforcement learning or diffusion models) have been used for molecular generation and optimization. They can learn the data distribution and generate structures, but under multiple objectives and multiple hard constraints they still commonly suffer from hard constraints that are difficult to guarantee, unstable generalization outside the training distribution, and a two-stage "generate first, verify later" workflow that is difficult to unify.
In engineering practice, to search controllably in a structural space subject to combinatorial explosion, candidate molecules are often expressed as the "selection and counting of skeleton and group fragments", that is, structural combinations are characterized by discrete decision variables. This fragment-based representation makes it convenient to introduce structural constraints such as connection sites, valence rules, skeleton uniqueness and complexity thresholds, and it fits naturally into discrete optimization frameworks such as integer programming. However, most property prediction models and task scoring models (especially deep learning models) require structural representations such as SMILES, molecular graphs or three-dimensional conformations as inputs, so a natural gap exists between the "fragment decision variable space" and the "structural representation space": an optimization model can choose combinations in the fragment space, but model evaluation must often rely on structure strings or graph representations to calculate properties or probability scores. If a bridging mechanism capable of strictly constraining and automatically gener