CN-121983172-A - Interpretable molecule optimization method based on large language model

Abstract

The invention discloses an interpretable molecular optimization method based on a large language model, relating to the technical field of computational chemistry. The method comprises the following steps: obtaining an original molecule set, standardizing it and screening it for valid molecules, and outputting a standardized molecule set and an optimization task test subset; generating multi-modal molecular embeddings for the standardized molecule set and outputting a multi-modal molecular embedding vector set; performing semantic embedding and clustering-based deduplication on an initial molecule editing rule set and outputting a deduplicated representative rule set; extracting high-frequency molecular fragments from the standardized molecule set and outputting a final molecule editing rule base; scoring and ranking rules for the specified target optimization attributes based on the optimization task test subset and a structured rule metadata table, and outputting a multi-attribute candidate rule ranking table; and performing molecular structure editing and evaluating whether the properties meet the target, based on a reference molecule to be optimized and a corrected editing rule set, and outputting a final optimized subset after iteration.

Inventors

WANG SHUANG
SONG ZIAN
YIN QIYAO
YU RUNFU

Assignees

China University of Petroleum (East China) (中国石油大学(华东))

Dates

Publication Date: 20260505
Application Date: 20260409

Claims (8)

1. An interpretable molecular optimization method based on a large language model, characterized by comprising the following steps:
S10, acquiring an original molecule set, performing standardization and valid-molecule screening on the original molecule set, and outputting a standardized molecule set and an optimization task test subset;
S20, performing multi-modal molecular embedding generation and initial LLM rule generation on the standardized molecule set, and outputting a multi-modal molecular embedding vector set and an initial molecule editing rule set;
S30, performing semantic embedding and clustering-based deduplication on the initial molecule editing rule set, outputting a deduplicated representative rule set, extracting high-frequency molecular fragments based on the standardized molecule set, and outputting a final molecule editing rule base;
S40, normalizing the data of the final molecule editing rule base and outputting a structured rule metadata table;
S50, scoring and ranking for the specified target optimization attributes based on the optimization task test subset and the structured rule metadata table, and outputting a multi-attribute candidate rule ranking table;
S60, based on the optimization task test subset and the multi-attribute candidate rule ranking table, and in combination with the standardized molecule set, performing a chemical validity check and screening for the optimal reference molecule, and outputting a reference molecule to be optimized;
S70, performing similar-molecule retrieval based on the reference molecule to be optimized, the standardized molecule set and the multi-modal molecular embedding vector set, and outputting a successful reference molecule;
S80, extracting a shared skeleton and splitting off unique fragments from the reference molecule to be optimized and the successful reference molecule, matching and mapping them against the final molecule editing rule base, and outputting a corrected editing rule set; and
S90, performing molecular structure editing and evaluating whether the properties meet the target, based on the reference molecule to be optimized and the corrected editing rule set, and outputting a final optimized subset after iteration.
2. The interpretable molecular optimization method based on a large language model according to claim 1, wherein the specific steps of performing standardization on the original molecule set and outputting the standardized molecule set are as follows: removing duplicate entries from the original molecule set, screening for structurally valid entries, and outputting a composite original molecule set; and performing molecular structure standardization, valence validity checking and removal of invalid structures on the composite original molecule set with the RDKit toolkit, and outputting the standardized molecule set.
3. The interpretable molecular optimization method based on a large language model according to claim 1, wherein the specific steps of performing multi-modal molecular embedding generation on the standardized molecule set and outputting the multi-modal molecular embedding vector set are as follows: performing 1D fragment embedding, 2D topology embedding and 3D conformation embedding on the standardized molecule set, then concatenating and fusing the 1D fragment, 2D topology and 3D conformation embeddings, and outputting the multi-modal molecular embedding vector set.
4. The interpretable molecular optimization method based on a large language model according to claim 1, wherein the specific steps of scoring and ranking for the specified target optimization attributes based on the optimization task test subset and the structured rule metadata table, and outputting the multi-attribute candidate rule ranking table, are as follows: performing a binary property judgment on the specified target optimization attributes and outputting a property-attribute mapping table; based on the optimization task test subset, the structured rule metadata table and the property-attribute mapping table, performing rule direction matching, a trigger-fragment existence check and a binary judgment of attribute influence, calculating and ranking rule property contribution scores, and outputting a single-attribute candidate rule ranking table; and merging rules based on the single-attribute candidate rule ranking table and the property-attribute mapping table, calculating the correct-direction proportion, the forward propulsion degree and the reverse deviation degree, obtaining and ranking comprehensive scores, and outputting the multi-attribute candidate rule ranking table.
5. The interpretable molecular optimization method based on a large language model according to claim 4, wherein the specific steps of performing, based on the optimization task test subset and the multi-attribute candidate rule ranking table and in combination with the standardized molecule set, the chemical validity check and the screening for the optimal reference molecule are as follows: based on the optimization task test subset and the multi-attribute candidate rule ranking table, and in combination with the standardized molecule set, performing structure editing with the RDKit toolkit and correcting entries whose editing failed; generating out-of-distribution structural variants with a ChatMol model to expand the search space, and outputting an edited candidate molecule set; and, based on the edited candidate molecule set, selecting the molecule with the largest forward change or the smallest disturbance in a single-attribute task, selecting the molecule with the highest comprehensive score in a multi-attribute task, and outputting the reference molecule to be optimized.
6. The interpretable molecular optimization method based on a large language model according to claim 5, wherein the specific step of selecting, based on the edited candidate molecule set, the molecule with the largest forward change or the smallest disturbance in a single-attribute task is as follows: within the edited candidate molecule set, selecting the rule that produces a change in the target direction with the largest magnitude of change; and, if no rule produces an improvement in the target direction, selecting from all candidates the one with the smallest absolute change; the selection process is defined by a formula [not reproduced in this text], whose symbols denote, respectively: a generic index over the candidate rules; the index of the finally selected optimal rule; the amount of change produced by the i-th rule on the target attribute; the subset of feasible rules, itself defined by a second formula [not reproduced in this text]; the desired direction of change of the target property; and the candidate rule set obtained from the current round of screening.
7. The interpretable molecular optimization method based on a large language model according to claim 6, wherein the specific step of selecting, based on the edited candidate molecule set, the molecule with the highest comprehensive score in a multi-attribute task is as follows: when multiple attributes need to be optimized simultaneously, there are a number of rules in total and each rule acts on a number of attributes, and the variation of the attributes is defined by a formula [not reproduced in this text]; given the target direction and the target variation of each attribute, three components are calculated for each candidate rule, namely the correct-direction proportion, the forward propulsion degree and the reverse deviation degree; the correct-direction proportion is defined by a formula [not reproduced in this text], in which K denotes the total number of attributes to be optimized, the variation term denotes the numerical change of the k-th attribute after the i-th rule is applied, and the indicator function returns 1 when its condition is satisfied and 0 otherwise; the forward propulsion degree is defined by a formula [not reproduced in this text], in which the threshold term denotes the target variation threshold of the k-th attribute; the reverse deviation degree is defined by a formula [not reproduced in this text]; and the rule with the highest final score is selected, as defined by a formula [not reproduced in this text].
8. The interpretable molecular optimization method based on a large language model according to claim 1, wherein the specific steps of extracting the shared skeleton and splitting off the unique fragments from the reference molecule to be optimized and the successful reference molecule are as follows: extracting the maximum common substructure of the reference molecule to be optimized and the successful reference molecule as the shared skeleton; and removing the shared skeleton from the reference molecule to be optimized and from the successful reference molecule respectively, and outputting the unique fragments of each.
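
As an illustration of the single-attribute selection described in claim 6, the following is a minimal Python sketch, not the patented implementation. The function name `select_rule`, the list `deltas` of per-rule property changes, and the encoding of the desired direction as +1 (increase) or -1 (decrease) are assumptions introduced here, since the original formula is not reproduced in this text. Ties are broken by taking the first candidate, which the claim does not specify.

```python
from typing import Sequence


def select_rule(deltas: Sequence[float], direction: int) -> int:
    """Return the index of the rule chosen by the single-attribute criterion.

    deltas[i] is the change that candidate rule i produces on the target
    attribute; direction is +1 if the attribute should increase and -1 if
    it should decrease (a hypothetical encoding, assumed for illustration).
    """
    if not deltas:
        raise ValueError("at least one candidate rule is required")
    if direction not in (1, -1):
        raise ValueError("direction must be +1 or -1")

    # Feasible rules: those whose change has the desired sign.
    feasible = [i for i, delta in enumerate(deltas) if direction * delta > 0]

    if feasible:
        # Among feasible rules, take the largest change in the target direction.
        return max(feasible, key=lambda i: abs(deltas[i]))

    # No rule improves the attribute: fall back to the smallest disturbance.
    return min(range(len(deltas)), key=lambda i: abs(deltas[i]))


# Rule 2 gives the largest increase when an increase is wanted.
best_up = select_rule([0.2, -0.5, 0.7], direction=1)
# No rule decreases the attribute, so the least disruptive rule (index 1) wins.
fallback = select_rule([0.3, 0.1], direction=-1)
```

Keeping the feasible subset explicit mirrors the two-stage wording of the claim: a direction filter first, then a magnitude comparison, with the minimum-disturbance fallback only when the filter leaves nothing.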

Description

Interpretable molecule optimization method based on large language model

Technical Field

The invention relates to the technical field of computational chemistry, in particular to an interpretable molecular optimization method based on a large language model.

Background

In recent years, artificial intelligence methods have demonstrated great value in many stages of drug discovery, including tasks such as molecular generation, property prediction, virtual screening, and lead compound optimization. Molecular optimization plays the key role of gradually advancing an initial candidate structure until it satisfies multiple druggability constraints. The process is not a one-time prediction or generation, but an iterative decision process involving repeated structure modification, evaluation and feedback.

Existing deep learning methods have achieved considerable success in molecular optimization. In particular, under specific targets or constraint conditions, a model can be effectively guided to converge towards the desired properties through end-to-end training, reinforcement learning or multi-objective loss design. However, such methods typically model the optimization task as an implicit mapping from molecular structure to target properties or reward signals, so adapting to different optimization goals often relies on relearning or tuning the model parameters. Under this paradigm, the motivation for a structural modification, its potential chemical effects, and the correction logic after a failure are hidden as a black box inside the model parameters. As a result, the optimization behavior, while effective at a numerical level, is difficult to directly control, interpret, or reuse across different tasks.

Disclosure of Invention

To solve the above problems, the present invention provides an interpretable LLM-guided molecular optimization framework, ELLM-MOM (Explainable Large Language Model-guided Molecular Optimization), shown in Fig. 1, which explicitly models the optimization process as a rule-based reasoning and structure editing flow. The method extracts structure-property priors from large language models (LLMs) and organizes them into reusable editing rules, evaluates rule applicability through intermediate physicochemical property mapping and binary inference, and iteratively corrects the optimization direction with a search-driven self-feedback mechanism, thereby allowing molecules to be improved iteratively without retraining the model.

The invention provides an interpretable molecular optimization method based on a large language model, which comprises the following steps:
S10, acquiring an original molecule set, performing standardization and valid-molecule screening on the original molecule set, and outputting a standardized molecule set and an optimization task test subset;
S20, performing multi-modal molecular embedding generation and initial LLM rule generation on the standardized molecule set, and outputting a multi-modal molecular embedding vector set and an initial molecule editing rule set;
S30, performing semantic embedding and clustering-based deduplication on the initial molecule editing rule set, outputting a deduplicated representative rule set, extracting high-frequency molecular fragments based on the standardized molecule set, and outputting a final molecule editing rule base;
S40, normalizing the data of the final molecule editing rule base and outputting a structured rule metadata table;
S50, scoring and ranking for the specified target optimization attributes based on the optimization task test subset and the structured rule metadata table, and outputting a multi-attribute candidate rule ranking table;
S60, based on the optimization task test subset and the multi-attribute candidate rule ranking table, and in combination with the standardized molecule set, performing a chemical validity check and screening for the optimal reference molecule, and outputting a reference molecule to be optimized;
S70, performing similar-molecule retrieval based on the reference molecule to be optimized, the standardized molecule set and the multi-modal molecular embedding vector set, and outputting a successful reference molecule;
S80, extracting a shared skeleton and splitting off unique fragments from the reference molecule to be optimized and the successful reference molecule, matching and mapping them against the final molecule editing rule base, and outputting a corrected editing rule set; and
S90, performing molecular structure editing and evaluating whether the properties meet the target, based on the reference molecule to be optimized and the corrected editing rule set, and outputting a final optimized subset after iteration.

In summary, the invention has at least the following beneficial effects: in the ELLM-MOM model, structure-property knowledge is induced from literature and databases and made explicit in the form of rules, assumptions are deduced and controllable structural editing