CN-122024907-A - Molecular substructure attribute contribution degree calculation method and device

Abstract

The invention provides a method and a device for calculating the attribute contribution degree of molecular substructures, and relates to the field of molecular attribute prediction. The method comprises: constructing a substructure set of a target molecule; constructing a hierarchical molecular graph comprising atomic layer nodes, substructure layer nodes and molecular layer nodes, and generating node features and relation edge features; inputting the hierarchical molecular graph into an attribute prediction model to obtain a first predicted value; constructing a covering vector for a target substructure, and using the covering vector to gate out the node contributions related to the target substructure; inputting the covered sample into the same attribute prediction model to obtain a second predicted value; and calculating the contribution degree of the substructure to the target attribute from the difference between the first predicted value and the second predicted value. Compared with the prior art, the method improves prediction accuracy by modeling the atom, substructure and molecule granularities in a unified way and combining this with covering-based comparative inference.
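The covering comparison summarized above can be illustrated with a small sketch. It is not the patented model: `toy_readout` is a hypothetical stand-in for the graph-level readout of the attribute prediction model (here simply a gated weighted sum of per-node scores), and the node scores and weights are made-up numbers chosen for illustration.

```python
from typing import Iterable, List, Sequence


def toy_readout(node_scores: Sequence[float],
                node_weights: Sequence[float],
                mask: Sequence[int]) -> float:
    """Hypothetical graph-level readout: weighted sum of node scores,
    with each node weight gated (multiplied) by its mask entry."""
    return sum(m * w * s for m, w, s in zip(mask, node_weights, node_scores))


def contribution(node_scores: Sequence[float],
                 node_weights: Sequence[float],
                 target_nodes: Iterable[int]) -> float:
    """First predicted value (nothing covered) minus second predicted value
    (nodes of the target substructure covered)."""
    n = len(node_scores)
    covered = set(target_nodes)
    full_mask: List[int] = [1] * n
    sub_mask: List[int] = [0 if i in covered else 1 for i in range(n)]
    first = toy_readout(node_scores, node_weights, full_mask)
    second = toy_readout(node_scores, node_weights, sub_mask)
    return first - second


# Made-up example: four nodes, the target substructure covers nodes 1 and 3.
scores = [1.0, 2.0, -0.5, 4.0]
weights = [0.5, 0.25, 1.0, 0.25]
print(contribution(scores, weights, [1, 3]))  # 1.5
```

A positive difference means the prediction drops when the substructure is covered, so in this toy setting the substructure pushes the predicted value up. The normalization of the difference mentioned in the claims is left out because the source does not specify its form.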

Inventors

  • LUO HANYU
  • QIAN YING
  • DOU LIANG

Assignees

  • East China Normal University (华东师范大学)

Dates

Publication Date
2026-05-12
Application Date
2026-02-09

Claims (11)

  1. A method for calculating the attribute contribution degree of a molecular substructure, characterized by comprising the following steps: S1, constructing a substructure set of a target molecule; S2, constructing a hierarchical molecular graph and performing feature encoding, namely constructing a hierarchical molecular graph comprising atomic layer nodes, substructure layer nodes and molecular layer nodes, generating node features for atom-type nodes, substructure-type nodes and global-type nodes respectively, and performing edge feature encoding for three kinds of semantically different relation edges; S3, training a graph neural network on training data to obtain an attribute prediction model, and, during prediction, inputting the feature representation of the molecule into the attribute prediction model to obtain a prediction result for the target attribute value, recorded as a first predicted value; S4, constructing a covering vector to cover the weight contributions corresponding to a target substructure; S5, inputting the covered sample obtained in step S4 into the attribute prediction model, and using the covering vector to gate the weight vector in the graph-level readout stage to obtain a second predicted value; and S6, performing a difference operation on the first predicted value and the second predicted value to obtain a difference value, and normalizing the difference value to obtain a contribution degree value of the substructure.
  2. The method for calculating the attribute contribution degree of a molecular substructure according to claim 1, wherein constructing the substructure set comprises: S11, dividing the target molecule into fragments based on BRICS bond-breaking rules, and taking the atom index set corresponding to each fragment as a first substructure set; S12, obtaining the skeleton structure of the target molecule as a second substructure set according to the Bemis-Murcko scaffold rule for extracting the molecular core skeleton; and S13, obtaining a third substructure set based on a functional group identification rule.
  3. The method for calculating the attribute contribution degree of a molecular substructure according to claim 2, wherein dividing the target molecule into fragments based on BRICS bond-breaking rules comprises: performing BRICS fragmentation on the structural representation of the target molecule using the RDKit toolkit, resulting in a first substructure set consisting of a plurality of fragments.
  4. The method for calculating the attribute contribution degree of a molecular substructure according to claim 2, wherein obtaining the second substructure set comprises: obtaining the Murcko scaffold core structure of the target molecule; determining the skeleton atom set corresponding to the Murcko scaffold core structure in the original molecule; traversing the chemical bonds of the original molecule and identifying each bond that has exactly one end in the skeleton atom set as a breakpoint bond between the skeleton and a substituent; and dividing the atom set of the target molecule at the breakpoint bonds to obtain a second substructure set formed by a plurality of skeleton-related substructures.
  5. The method of claim 2, wherein obtaining the third substructure set based on the functional group identification rule comprises: identifying functional group substructures in the structural representation of the target molecule by predefined SMARTS matching rules, each substructure being represented by its atom index set as an element of the third substructure set.
  6. The method according to claim 1, wherein constructing the hierarchical molecular graph in step S2 comprises: S21, taking each atom in the target molecule as an atomic layer node, and generating a corresponding atom feature vector based on the element type, atomic number, formal charge, hybridization mode, aromaticity, number of hydrogen atoms and chirality information of the atom; S22, taking each substructure in the substructure set as a substructure layer node, wherein the substructure node features are obtained by aggregating the features of the atomic layer nodes it contains; S23, generating molecular layer node features, namely aggregating the features of the substructure layer nodes contained in the molecule to obtain the chemical semantic feature vector of the molecular layer node; S24, establishing atom-atom bonding edges between atomic layer nodes to represent the chemical bond connections between atoms, the edge features being determined by bond type, conjugation, ring membership and stereochemical information; S25, establishing a substructure-atom containing edge between each substructure layer node and each atomic layer node it contains, to represent the containment relation between the substructure and the atom; S26, establishing a substructure-molecule convergence edge between the molecular layer node and each substructure layer node, to represent the information convergence relation between the substructure unit and the whole molecule, wherein the substructure-atom containing edge and the substructure-molecule convergence edge correspond to different preset edge type codes.
  7. The method for calculating the attribute contribution degree of a molecular substructure according to claim 6, wherein the node feature vectors of the atomic layer nodes, the substructure layer nodes and the molecular layer node have the same dimension, each being obtained by concatenating a chemical semantic feature vector with a node type identification vector along the feature dimension, and the node type identification vector is used to distinguish whether the node belongs to the atomic layer, the substructure layer or the molecular layer.
  8. The method according to claim 1, wherein the attribute prediction model gates the node weights based on the covering vector in the graph-level readout stage, so as to suppress the contribution of the nodes corresponding to the target substructure to the graph-level representation.
  9. The method for calculating the attribute contribution degree of a molecular substructure according to claim 1, wherein constructing the covering vector comprises: constructing a covering vector smask corresponding to the substructure, the length of smask being equal to the number of nodes in the hierarchical molecular graph, the entries for nodes related to the target substructure being 0 and the entries for the remaining nodes being 1; and applying smask to cover the node contributions corresponding to the target substructure during the graph-level readout of the attribute prediction model.
  10. The method for calculating the attribute contribution degree of a molecular substructure according to claim 1, wherein the contribution degree score of the substructure is determined from the difference between the prediction outputs before and after covering, comprising: comparing the first predicted value obtained from the uncovered molecule with the second predicted value obtained from the covered sample, and calculating the difference between the first predicted value and the second predicted value; in a regression task, the first predicted value and the second predicted value are continuous numerical predictions of the target property; in a classification task, the first predicted value and the second predicted value are classification outputs, the classification outputs comprising any one of the classification scores, classification probabilities or classification decision values of the model; and the contribution score is obtained by normalizing the difference between the prediction outputs before and after covering.
  11. A molecular substructure contribution calculation apparatus, wherein the apparatus is configured to perform the method of any of claims 1-10, the apparatus comprising: a substructure dividing module for obtaining a substructure set of a molecule; a hierarchical graph construction and feature extraction module for constructing a hierarchical molecular graph comprising atomic layer nodes, substructure layer nodes and molecular layer nodes, and generating node features and edge features; an attribute prediction module for performing inference on the hierarchical molecular graph to obtain a first predicted value, and performing inference on the covered molecule to obtain a second predicted value; a covered sample construction module for constructing covered samples for the substructures in the substructure set, so that the information of the substructure in the hierarchical molecular graph is covered; and a contribution calculation module for determining a contribution score of the substructure according to the difference between the first predicted value and the second predicted value.
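The graph layout of claim 6 and the covering vector of claim 9 can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the node ordering (atoms first, then substructures, then one molecule node), the integer edge type codes, and the reading of "nodes related to the target substructure" as the substructure node plus its member atoms are all assumptions introduced here. The substructure atom index sets are taken as given inputs; in the claims they come from BRICS fragmentation, Murcko scaffold splitting and SMARTS matching.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Set, Tuple

# Hypothetical edge type codes; the claims only require that the kinds differ.
BOND, CONTAINS, CONVERGES = 0, 1, 2


@dataclass
class HierarchicalGraph:
    num_atoms: int
    num_subs: int
    # Each edge is (source node, destination node, edge type code).
    edges: List[Tuple[int, int, int]] = field(default_factory=list)

    @property
    def num_nodes(self) -> int:
        # Atom nodes + substructure nodes + one molecule-level node.
        return self.num_atoms + self.num_subs + 1


def build_graph(num_atoms: int,
                bonds: Sequence[Tuple[int, int]],
                substructures: Sequence[Set[int]]) -> HierarchicalGraph:
    """Assumed node order: atoms, then substructures, then the molecule node."""
    g = HierarchicalGraph(num_atoms, len(substructures))
    mol_node = g.num_nodes - 1
    for a, b in bonds:
        g.edges.append((a, b, BOND))                    # atom-atom bonding edge
    for k, atoms in enumerate(substructures):
        sub_node = num_atoms + k
        for a in sorted(atoms):
            g.edges.append((sub_node, a, CONTAINS))     # substructure-atom edge
        g.edges.append((sub_node, mol_node, CONVERGES))  # substructure-molecule edge
    return g


def covering_vector(g: HierarchicalGraph,
                    substructures: Sequence[Set[int]],
                    target: int) -> List[int]:
    """smask: one entry per node; 0 for the target substructure node and its
    member atoms (an assumed reading of 'related nodes'), 1 elsewhere."""
    smask = [1] * g.num_nodes
    for a in substructures[target]:
        smask[a] = 0
    smask[g.num_atoms + target] = 0
    return smask


# Toy input: three heavy atoms in a chain (0-1-2) with a made-up two-way split.
subs = [{0, 1}, {2}]
graph = build_graph(3, [(0, 1), (1, 2)], subs)
print(graph.num_nodes)                   # 6
print(covering_vector(graph, subs, 1))   # [1, 1, 0, 1, 0, 1]
```

When substructure sets overlap, an atom may belong to several substructures, so covering one substructure also zeroes atoms shared with others. The source does not say how this case is handled, and the sketch simply covers every member atom.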

Description

Molecular substructure attribute contribution degree calculation method and device

Technical Field

The invention relates to the technical field of molecular attribute prediction and machine learning, in particular to graph representation learning and the calculation of molecular substructure attribute contributions, and specifically to a method and a device for calculating the attribute contribution degree of molecular substructures.

Background Art

Molecular property prediction is one of the key tasks in modern drug design and development. It aims to predict properties of a molecule such as solubility, toxicity and biological activity from its known structural information. The traditional molecular screening process depends heavily on biochemical experiments and expert experience, is slow and costly, and significantly restricts the efficiency of new drug development. In recent years, with the development of deep learning, researchers have begun to model molecular structures with graph neural networks, treating atoms as nodes and chemical bonds as edges of a graph. This enables end-to-end prediction of whole-molecule properties and greatly improves the efficiency of virtual screening and molecular evaluation. Although whole-molecule property prediction plays an important role in molecular screening, in practical applications information at the whole-molecule level alone is still clearly insufficient. Some key properties of a molecule are often dominated by specific structural units within it, such as functional groups, heterocycles and branches. Fine-grained attribution analysis of the attribute contribution of specific substructures in a molecule is therefore important for tasks such as molecular generation, attribute optimization and model interpretability.
By identifying whether a given substructure influences the target property positively or negatively, the direction of molecular optimization can be guided more accurately, the success rate of generating molecules that satisfy specific property constraints is improved, and molecular design becomes more goal-directed. Some studies have begun to address the importance of substructures in molecular property prediction in an attempt to achieve finer-grained structure attribution, but current approaches still have many limitations in their modeling strategies. Some of them estimate the average contribution of a substructure by averaging the overall attribute values of all molecules in which that substructure appears. However, this strategy implicitly assumes that the effects of different substructures on molecular properties are independent and linearly additive, ignoring the synergy and context dependence between substructures in a specific chemical environment. In fact, molecular properties are often the result of several structures acting together, and an averaging strategy can hardly reflect how the function of a substructure changes across contexts; the resulting bias then affects structural modification decisions in molecular generation or optimization. Other methods analyze the inference process after model training is completed, using strategies such as gradient sensitivity and input perturbation, so as to identify key structural regions. Although these techniques improve model interpretability to some extent, they depend heavily on the gradient path and structural design of a specific model, are unstable across tasks and model architectures, lack generality, and are difficult to optimize jointly with the model training process, so they are not training-friendly.
Recent work has proposed dividing the graph structure into chemical sub-fragments and has made some progress in interpretation. However, such methods act only on the structural division at the input end; they fail to link the multi-granularity feature representations and hidden-layer information across the atom level, the substructure level and the whole-molecule level, and cannot model and reason about attribute contributions in a unified way. A unified molecular modeling method supporting multi-granularity fusion is therefore needed, one that expresses the hierarchical relationship among atoms, substructures and the whole molecule explicitly in its structure, and that provides explicit modeling and attribution support for substructure attributes in the model mechanism, so as to achieve accurate identification and interpretable prediction of structural attribute contributions and to further improve the goal-directedness and decision reliability of molecular generation and optimization.

Disclosure of Invention

The invention