CN-122024914-A - Molecular generation and optimization method and system based on three-dimensional geometric perception

Abstract

The invention discloses a molecular generation and optimization method and system based on three-dimensional geometric perception, relating to the technical field of drug molecule generation. Starting from an original SMILES sequence, the method generates the corresponding three-dimensional dominant conformation, constructs a molecular graph containing the atom types, chemical bond types, and three-dimensional atomic coordinates, and packs the molecular graph into a composite tensor that serves as the input to a latent space generation model. Once pre-trained, the model can accurately model the spatial relationships among atoms, and it outputs a latent vector after latent space reparameterization. An actor-critic policy network, optimized by reinforcement learning, then outputs an action according to the latent vector to update the latent point, and the updated latent vector is decoded to obtain a SMILES sequence. The invention overcomes the limitation that two-dimensional representations cannot accurately describe the three-dimensional structure of a molecule, so that the molecular features learned by the model better match how molecules bind to target proteins in practical drug research and development.
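
As an illustrative sketch of the latent-space steps summarized above (not the patented implementation), the following Python shows reparameterized sampling from a predicted mean and variance, followed by an action-driven update of the latent point. The function names `reparameterize` and `update_latent`, the two-dimensional latent vector, and the additive update rule z' = z + a are assumptions made for illustration; the source states only that an action updates the latent point.

```python
import math
import random

def reparameterize(mean, variance, eps=None, rng=random):
    """Sample z = mean + sqrt(variance) * eps with eps ~ N(0, 1).

    Passing eps explicitly makes the sample deterministic, which is
    convenient for testing; by default eps is drawn from rng.
    """
    if eps is None:
        eps = [rng.gauss(0.0, 1.0) for _ in mean]
    return [m + math.sqrt(v) * e for m, v, e in zip(mean, variance, eps)]

def update_latent(z, action):
    """Assumed update rule: shift the latent point by the policy's action."""
    return [zi + ai for zi, ai in zip(z, action)]

# Hypothetical 2-D latent distribution, standing in for the encoder output.
z = reparameterize([1.0, 2.0], [4.0, 0.25], eps=[0.5, -1.0])
z_new = update_latent(z, [0.1, -0.1])
print(z, z_new)
```

Because the noise enters only through `eps`, the sample stays a differentiable function of the mean and variance, which is the usual reason for sampling this way in a variational autoencoder.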

Inventors

  • HUANG BINGDING
  • Qin Ningxin
  • ZHANG QI
  • ZENG BIN
  • LIU CHENYANG
  • Jiang Chenran

Assignees

  • Shenzhen Technology University (深圳技术大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (10)

  1. A molecular generation and optimization method based on three-dimensional geometric perception, the method comprising: acquiring an original SMILES sequence, and generating a molecular graph according to the three-dimensional dominant conformation corresponding to the original SMILES sequence, wherein the node features of the molecular graph reflect atom types, the edge attributes reflect chemical bond types, and the node coordinates reflect the three-dimensional coordinates of the atoms; modeling, by a pre-trained latent space generation model, the spatial relationships among atoms according to the composite tensor corresponding to the molecular graph, and obtaining a latent vector after latent space reparameterization; outputting, by an actor-critic policy network optimized by reinforcement learning, an action according to the latent vector to update the latent point, thereby obtaining an updated latent vector; and decoding the updated latent vector to obtain a SMILES sequence.
  2. The method of claim 1, wherein generating the molecular graph according to the three-dimensional dominant conformation corresponding to the original SMILES sequence comprises: performing standardization and discrete symbol encoding on the original SMILES sequence to obtain a fixed-length discrete tensor; determining the three-dimensional dominant conformation according to the fixed-length discrete tensor, and constructing molecular point cloud data according to the three-dimensional dominant conformation; and constructing the molecular graph according to the molecular point cloud data.
  3. The molecular generation and optimization method based on three-dimensional geometric perception according to claim 2, wherein the step of determining the three-dimensional dominant conformation according to the fixed-length discrete tensor and constructing molecular point cloud data according to the three-dimensional dominant conformation comprises: generating a three-dimensional initial conformation from the fixed-length discrete tensor by a distance geometry algorithm; performing geometry optimization and energy minimization on the three-dimensional initial conformation with the MMFF94 force field, and selecting the lowest-energy conformation as the three-dimensional dominant conformation; and extracting the three-dimensional Cartesian coordinates and atom type information of the atoms from the three-dimensional dominant conformation to construct the molecular point cloud data.
  4. The molecular generation and optimization method based on three-dimensional geometric perception according to claim 1, wherein the latent space generation model comprises: a geometry-aware graph encoder for modeling the spatial relationships among atoms according to the composite tensor corresponding to the molecular graph to obtain a global latent representation; a reparameterization module for predicting the mean and variance of the latent distribution from the global latent representation using a plurality of linear layers, and obtaining the latent vector by sampling via the reparameterization technique; a sequence generation decoder for decoding an input latent vector into a corresponding SMILES sequence; and an auxiliary property predictor for predicting molecular properties from an input latent vector to obtain predicted molecular properties.
  5. The molecular generation and optimization method based on three-dimensional geometric perception according to claim 4, wherein the pre-training step of the latent space generation model comprises: taking an original SMILES sequence used for training as an original training SMILES sequence; generating a training molecular graph according to the three-dimensional dominant conformation corresponding to the original training SMILES sequence; modeling, by the geometry-aware graph encoder, the spatial relationships among atoms according to the composite tensor corresponding to the training molecular graph to obtain a training global latent representation; predicting, by the reparameterization module, the mean and variance of the latent distribution from the training global latent representation using a plurality of linear layers, and sampling via the reparameterization technique to obtain a first training latent vector; decoding, by the sequence generation decoder, the first training latent vector into a corresponding SMILES sequence; predicting, by the auxiliary property predictor, molecular properties from the first training latent vector to obtain corresponding predicted molecular properties; calculating a reconstruction loss according to the original training SMILES sequence and the decoded SMILES sequence, and calculating a property loss according to the predicted molecular properties and the standard molecular properties corresponding to the first training latent vector; and pre-training the latent space generation model according to the reconstruction loss and the property loss to obtain the pre-trained latent space generation model.
  6. The molecular generation and optimization method based on three-dimensional geometric perception according to claim 1, wherein the reinforcement learning optimization step of the actor-critic policy network comprises: taking an original SMILES sequence used for training as an original training SMILES sequence, and taking the latent vector output by the pre-trained latent space generation model for the original training SMILES sequence as a second training latent vector; outputting, by the policy branch of the actor-critic policy network, an action according to the second training latent vector to update the latent point, thereby obtaining a training updated latent vector; evaluating, by the value branch of the actor-critic policy network, the effect of the action according to the training updated latent vector, and optimizing the policy branch; and decoding the training updated latent vector into a SMILES sequence, calculating a reward, collecting experience data consisting of latent vector states, actions, and rewards through multi-step interaction between latent vector states and actions, and optimizing the actor-critic policy network on the experience data by a proximal policy optimization (PPO) algorithm.
  7. The molecular generation and optimization method based on three-dimensional geometric perception according to claim 6, wherein, when the actor-critic policy network is optimized on the experience data by the proximal policy optimization algorithm, the probability ratio of the new policy to the old policy is clipped; and the composite reward function used to calculate the reward comprises a weighted fusion of a property reward term, a self-consistency reward term, and a latent space constraint penalty term.
  8. The molecular generation and optimization method based on three-dimensional geometric perception according to claim 1, wherein the step of decoding the updated latent vector to obtain a SMILES sequence further comprises: performing a validity check, standardization, and deduplication on the molecular structure of the SMILES sequence to obtain a target SMILES sequence; scoring and screening the target SMILES sequences obtained from a plurality of original SMILES sequences using a plurality of evaluation indexes, wherein the evaluation indexes comprise at least one of chemical validity rate, uniqueness, structural diversity, drug-likeness, and synthetic accessibility; and determining target candidate molecules according to the scoring and screening results.
  9. A molecular generation and optimization system based on three-dimensional geometric perception, the system comprising: a molecular characterization and preprocessing module for acquiring an original SMILES sequence and generating a molecular graph according to the three-dimensional dominant conformation corresponding to the original SMILES sequence, wherein the node features of the molecular graph reflect atom types, the edge attributes reflect chemical bond types, and the node coordinates reflect the three-dimensional coordinates of the atoms; a latent space module for modeling, by a pre-trained latent space generation model, the spatial relationships among atoms according to the composite tensor corresponding to the molecular graph, and obtaining a latent vector after latent space reparameterization; a reinforcement learning optimization module for outputting, by an actor-critic policy network optimized by reinforcement learning, an action according to the latent vector to update the latent point, thereby obtaining an updated latent vector; and a sequence generation decoder, located in the latent space generation model, for decoding the updated latent vector to obtain a SMILES sequence.
  10. A computer-readable storage medium storing a plurality of instructions adapted to be loaded and executed by a processor to implement the steps of the molecular generation and optimization method based on three-dimensional geometric perception according to any one of claims 1 to 8.
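
The clipping of the new-to-old policy probability ratio and the weighted composite reward recited in claim 7 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the clipped surrogate uses the standard PPO form min(r·A, clip(r, 1−ε, 1+ε)·A), while the clip range of 0.2, the weights `w_prop`, `w_cons`, and `w_pen`, and the choice to subtract the penalty term are hypothetical values that the claims do not specify.

```python
def clipped_surrogate(ratio, advantage, clip_eps=0.2):
    """Standard PPO per-sample objective: min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

def composite_reward(property_reward, consistency_reward, latent_penalty,
                     w_prop=1.0, w_cons=0.5, w_pen=0.1):
    """Weighted fusion of the three reward terms (weights are hypothetical)."""
    return (w_prop * property_reward
            + w_cons * consistency_reward
            - w_pen * latent_penalty)

# A ratio above 1 + eps earns no extra credit when the advantage is positive.
print(clipped_surrogate(1.5, 1.0))
# A ratio below 1 - eps is held at the bound when the advantage is negative.
print(clipped_surrogate(0.5, -1.0))
# Illustrative scores only; real terms would come from the decoded molecule.
print(composite_reward(0.8, 1.0, 2.0))
```

Taking the minimum of the clipped and unclipped terms means a single update gains nothing from moving the policy far from the one that collected the experience, which fits the stated aim of keeping the search close to the pre-trained latent distribution.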

Description

Molecular generation and optimization method and system based on three-dimensional geometric perception

Technical Field

The invention relates to the technical field of drug molecule generation, and in particular to a molecular generation and optimization method and system based on three-dimensional geometric perception.

Background

Molecular generation is a key technology in artificial intelligence-assisted drug design: a computational model automatically constructs small-molecule compounds with specific structures or properties. Compared with traditional drug research and development processes that rely on human experience and experimental screening, molecular generation methods analyze existing molecular data and establish a mapping between molecular structure and chemical properties, and can thereby efficiently generate potential candidate molecules within a vast chemical space.

Current molecular generation technology relies mainly on two-dimensional representations of molecules and realizes automatic molecular design through various deep learning models. By learning from existing molecular data, these methods establish a mapping between molecular structure and chemical properties so that new molecular structures can be generated in silico, and they can be combined with reinforcement learning or gradient-based optimization strategies to optimize specific molecular properties in a targeted way. In theory they can explore a wide chemical space, replace traditional experimental screening to some extent, and improve the efficiency of discovering drug lead compounds.

However, the two-dimensional representations on which existing molecular generation technology relies struggle to describe accurately the three-dimensional geometric conformation of a molecule and the spatial distribution of its pharmacophores. Because the three-dimensional structure of a molecule directly determines its binding mode with the target protein and its biological activity, the loss of spatial information limits the applicability and accuracy of the generated molecules in practical drug design tasks.

Molecules represented by discrete symbols lack a continuous, differentiable structure during generation, so reinforcement learning optimization is prone to large gradient variance, unstable training, and slow convergence. Syntax errors or invalid molecules that violate chemical valence rules often appear at the generation stage; this lowers generation efficiency, prevents subsequent steps such as molecular docking, property prediction, and virtual screening from proceeding normally, and harms the engineering usability and stability of the whole system.

Latent variable generative models based on the variational autoencoder are highly stochastic in sampling and decoding: the same latent representation may yield markedly different molecular structures over repeated decodings, so the generated results lack consistency and reproducibility, a stable and controllable generation process is hard to achieve, and the reliability of the model in practical applications suffers.

When reinforcement learning searches in a latent space that lacks effective constraints, the model parameters tend to drift gradually away from the pre-training distribution, so the generated samples become distorted or even invalid and the overall generation quality and reliability drop sharply. Existing methods generally lack constraint mechanisms for latent space stability and generation consistency, which further amplifies random fluctuation in the generated results. Stable, controllable, and chemically reasonable optimization of molecular properties is therefore difficult to achieve in the prior art.

Accordingly, there is a need for improvement and development in the art.

Disclosure of Invention

The technical problem to be solved by the invention is to provide, in view of the above defects in the prior art, a molecular generation and optimization method and system based on three-dimensional geometric perception, aiming to solve the problem that the two-dimensional representations relied on by existing molecular generation technology struggle to describe accurately the three-dimensional geometric conformation and pharmacophore spatial distribution of molecules, which results in low applicability and accuracy of the generated molecules in practical drug design tasks.

The technical solution adopted by the invention to solve this problem is as follows:

In a first aspect, an embodiment of the present invention provides a molecular generation and optimization method based on three-dimensional geometric perception, the method comprising: Acqu