CN-121983159-A - Intelligent prediction method and system for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization


Abstract

The invention discloses an intelligent prediction method and system for atmospheric reaction kinetic parameters that integrates multi-level quantum chemical characterization. Experimental measurement data are collected, and a descriptor table data set and a plurality of quantum chemical descriptors are derived from them. From these, a new feature set is obtained as the model input and a reaction rate constant data set as the model output. An optimal machine learning model is then obtained from this input and output and is used to predict the rate constant of the reaction between an organic compound to be tested and the hydroxyl radical. The method trains, in parallel, models that ensemble multiple decision trees, which effectively resolves the complex nonlinear relationship between molecular structure parameters and reaction kinetics. Compared with traditional regression models, it markedly improves the ability to analyze high-dimensional data such as interpretable quantum chemical descriptors, and it greatly reduces computational complexity while maintaining prediction accuracy.

Inventors

  • WU XIAOQING
  • LIU WENJING
  • WANG PENGFEI
  • Huo Wanli
  • Lv Laishui
  • CHEN JIALI

Assignees

  • China Jiliang University (中国计量大学)

Dates

Publication Date
2026-05-05
Application Date
2026-03-30

Claims (9)

  1. An intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization, characterized by comprising the following steps:
     S1, acquiring experimental measurement data for the reaction of organic compounds with the hydroxyl radical, and preprocessing the data to obtain preprocessed experimental measurement data;
     S2, extracting the SMILES structural formula and the reaction rate constant of each organic compound from the preprocessed experimental measurement data, obtaining a descriptor table data set from the SMILES structural formulas, and obtaining a reaction rate constant data set from the reaction rate constants;
     S3, extracting the molecular structural formula of each organic compound from the preprocessed experimental measurement data, and calculating a plurality of quantum chemical descriptors for each organic compound from its molecular structural formula, the quantum chemical descriptors comprising the local electron attachment energy, the maximum electrophilic Fukui index, and the frontier orbital energy gap;
     S4, performing feature selection on the descriptor table data set with a plurality of machine learning algorithms to obtain a feature subset for each machine learning algorithm, and adding all quantum chemical descriptors to each feature subset to obtain a new feature set for each machine learning algorithm;
     S5, taking each new feature set as input and the reaction rate constant data set as output, training the corresponding machine learning model, thereby obtaining an optimal machine learning model, and obtaining a prediction feature set according to the optimal machine learning model;
     S6, obtaining, from the molecular structural formula and the SMILES structural formula of an organic compound to be tested, the value of each feature in the prediction feature set for that compound, and inputting all feature values into the optimal machine learning model to predict the reaction rate constant of the compound;
     S7, invoking a SHAP explainer during prediction, calculating the SHAP value of each feature in the feature value set, and sorting the features in descending order of SHAP value to obtain the importance of each feature.
  2. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 1, wherein the preprocessing comprises cleaning abnormal invalid values and filling missing values; cleaning abnormal invalid values means identifying abnormal invalid values in the descriptor data and replacing them with null values, and filling missing values means filling missing values in the experimental measurement data with null values.
  3. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 1, wherein step S2 specifically comprises:
     S21, extracting the SMILES structural formula and the reaction rate constant of each organic compound from the preprocessed experimental measurement data;
     S22, generating descriptor data from the SMILES structural formula of each organic compound, and arranging the descriptor data of the organic compounds by row to obtain a preliminary descriptor table data set;
     S23, applying dimension reduction processing and then Z-score standardization to the preliminary descriptor table data set to obtain the descriptor table data set;
     S24, applying a logarithmic transformation to the reaction rate constant of each organic compound, and collecting the results to obtain the reaction rate constant data set.
  4. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 3, wherein the descriptor data of each organic compound comprises a plurality of molecular descriptors, each molecular descriptor is one feature, and each column of the descriptor table data set corresponds to one molecular descriptor.
  5. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 3, wherein the dimension reduction processing deletes all-zero columns and columns containing null values from the preliminary descriptor table data set, and then applies low-variance feature removal followed by high-correlation feature removal to the preliminary descriptor data.
  6. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 1, wherein step S4 specifically comprises:
     S41, taking the descriptor table data set as input and the reaction rate constant data set as output, using each of a plurality of machine learning models as the base model of recursive feature elimination, and performing feature selection on the descriptor table data set to obtain a feature subset for each machine learning algorithm;
     S42, treating each quantum chemical descriptor as a single feature column, and adding all three feature columns together to the feature subset of each machine learning algorithm to obtain the new feature set for that algorithm.
  7. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 1, wherein step S5 specifically comprises:
     S51, taking each new feature set as input and the reaction rate constant data set as output, and training the corresponding machine learning model to obtain each trained machine learning model and its evaluation index value;
     S52, selecting the optimal evaluation index value from all the evaluation index values, and taking the trained machine learning model corresponding to it as the optimal machine learning model;
     S53, taking the new feature set corresponding to the optimal machine learning model as the optimal new feature set, and taking all features contained in the optimal new feature set as the prediction feature set.
  8. The intelligent prediction method for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization according to claim 1, wherein step S6 specifically comprises:
     S61, obtaining the molecular structural formula and the SMILES structural formula of the organic compound to be tested;
     S62, obtaining, from the molecular structural formula and the SMILES structural formula, the value of each feature in the prediction feature set for the organic compound to be tested, thereby obtaining the feature value set of the compound;
     S63, inputting the feature value set of the organic compound to be tested into the optimal machine learning model for prediction to obtain the corresponding reaction rate constant.
  9. An intelligent prediction system for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization, using the method of any one of claims 1-8, comprising:
     a data acquisition and preprocessing module, which acquires experimental measurement data for the reaction of organic compounds with the hydroxyl radical and preprocesses the data to obtain preprocessed experimental measurement data;
     a data set construction module, which extracts the SMILES structural formula and the reaction rate constant of each organic compound from the preprocessed experimental measurement data, obtains a descriptor table data set from the SMILES structural formulas, and obtains a reaction rate constant data set from the reaction rate constants;
     a supplementary descriptor acquisition module, which extracts the molecular structural formula of each organic compound from the preprocessed experimental measurement data and calculates a plurality of quantum chemical descriptors for each organic compound from its molecular structural formula;
     a new feature set acquisition module, which performs feature selection on the descriptor table data set with a plurality of machine learning algorithms to obtain a feature subset for each machine learning algorithm, and then adds all quantum chemical descriptors to each feature subset to obtain a new feature set for each machine learning algorithm;
     a model training and prediction feature set acquisition module, which takes each new feature set as input and the reaction rate constant data set as output, trains the corresponding machine learning model, thereby obtains an optimal machine learning model, and obtains a prediction feature set according to the optimal machine learning model;
     a prediction deployment module, which obtains, from the molecular structural formula and the SMILES structural formula of an organic compound to be tested, the value of each feature in the prediction feature set for that compound, and inputs all feature values into the optimal machine learning model to predict the reaction rate constant of the compound;
     an interpretability module, which invokes a SHAP explainer during prediction, calculates the SHAP value of each feature in the feature value set, and sorts the features in descending order of SHAP value to obtain the importance of the features.
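
The data preparation described in claims 3 and 5 (deleting all-zero and null-containing columns, low-variance and high-correlation feature removal, Z-score standardization, and logarithmic transformation of the rate constants) can be illustrated with a minimal sketch. This is not the patent's implementation: the function names, the toy descriptor table, both thresholds, and the choice of base-10 logarithm are assumptions made for illustration only, since the claims do not specify them.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def reduce_and_standardize(table, var_threshold=1e-8, corr_threshold=0.95):
    """table maps descriptor name -> list of values; None marks a null value.

    Both thresholds are assumed values, not taken from the patent.
    """
    # Delete columns that contain nulls and columns that are entirely zero.
    cols = {
        name: vals
        for name, vals in table.items()
        if all(v is not None for v in vals) and any(v != 0 for v in vals)
    }
    # Low-variance feature removal.
    cols = {n: v for n, v in cols.items() if statistics.pvariance(v) > var_threshold}
    # High-correlation feature removal: keep a column only if it is not
    # strongly correlated with any column already kept.
    kept = []
    for name, vals in cols.items():
        if all(abs(pearson(vals, cols[k])) < corr_threshold for k in kept):
            kept.append(name)
    # Z-score standardization: zero mean, unit population standard deviation.
    standardized = {}
    for name in kept:
        mean = statistics.fmean(cols[name])
        std = statistics.pstdev(cols[name])
        standardized[name] = [(v - mean) / std for v in cols[name]]
    return standardized


def log_rate_constants(k_values):
    """Log-transform rate constants (base 10 is an assumption)."""
    return [math.log10(k) for k in k_values]


# Hypothetical toy table: four compounds, six descriptor columns.
table = {
    "MolWt": [16.0, 30.0, 44.0, 58.0],
    "MolWt_x2": [32.0, 60.0, 88.0, 116.0],  # perfectly correlated with MolWt
    "nRing": [0, 0, 0, 0],                  # all-zero column
    "LogP": [1.1, None, 1.8, 2.3],          # contains a null value
    "Const": [1.0, 1.0, 1.0, 1.0],          # zero variance
    "TPSA": [0.0, 20.2, 0.0, 17.1],
}
features = reduce_and_standardize(table)
print(list(features))  # ['MolWt', 'TPSA']
```

In this sketch the correlation filter keeps the first column of each highly correlated pair; other tie-breaking rules are equally plausible, and the patent text shown here does not state which one is used.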

Description

Intelligent prediction method and system for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization

Technical Field

The invention belongs to the interdisciplinary field of artificial intelligence and environmental chemistry, and in particular relates to an intelligent prediction method and system for atmospheric reaction kinetic parameters integrating multi-level quantum chemical characterization.

Background

The gas-phase reaction of volatile organic compounds (VOCs) with oxidants such as the hydroxyl radical (OH) is their primary removal pathway in the atmosphere. The bimolecular rate coefficient k_OH determines the atmospheric lifetime of VOCs and also affects the formation of secondary organic aerosols and ozone. Because experimental measurements are limited by instrumentation and high-level quantum chemical calculations are time-consuming and expensive, quantitative structure-activity relationship (QSAR) models have become a key way to balance accuracy and efficiency.

With the development of artificial intelligence, research in this field has shifted from early linear algorithms to nonlinear machine learning algorithms. Nonlinear algorithms show stronger generalization and robustness on large, diverse data sets, but their inherent "black box" nature makes their predictions less transparent. This matters especially in atmospheric chemistry, which requires a clear reaction mechanism and scientific basis. Traditional feature characterization schemes tend to focus on macroscopic physicochemical properties or two-dimensional topological structure and lack an in-depth characterization of the microscopic local electronic effects that govern the reaction rate, so models struggle to capture the underlying reaction behavior of complex molecules.
Therefore, there is an urgent need for a prediction framework that incorporates multi-level quantum chemical characterization: one that remedies the shortcomings of existing characterization schemes by introducing microscopic electronic features with clear physical meaning, and that achieves model interpretability while retaining the high accuracy of nonlinear algorithms.

Patent CN202410354797.9 discloses a machine-learning-based method for predicting the chlorination degradation rate of pollutants. It obtains second-order reaction rate constants of compound standards through chlorination degradation experiments, calculates molecular descriptors to build a base database, and trains a prediction model. This method has the following disadvantages: (1) only molecular descriptors are input to the prediction model, so an in-depth description of local electronic effects is lacking; (2) chlorination degradation is a typical electrophilic or oxidation reaction, but the model does not account for the electronic reactivity of the reaction sites; (3) the rate constants come from a specific chlorination system and operating conditions, so model predictions can be reliably extrapolated only to similar systems.

Patent CN202010380633.5 discloses a method for predicting the rate constants of the reaction between organic compounds and singlet oxygen in aqueous solution under different pH conditions. It collects rate constant data at different pH values, splits them into training and test sets, builds and validates a QSAR model by statistical regression, and takes the molecular structure to be tested as input to obtain a predicted rate constant, distinguishing between molecular and ionic states and providing a corresponding model for each.
This method has the following disadvantages: (1) it uses multiple linear regression, so when the underlying relationship is complex and nonlinear, the model's explanatory power and predictive reliability may be limited by the linearity assumption; (2) specific commercial software is required to obtain the descriptor data, which raises the computational barrier, is time-consuming, and hinders rapid high-throughput prediction.

Patent CN202311324827.3 discloses a graph-neural-network-based method for predicting the atmospheric oxidation reaction rate constants of volatile organic compounds. It converts the SMILES codes of the VOC and the oxidant into molecular graphs, inputs the graphs into a graph neural network, and outputs a predicted reaction rate constant. This method has the following disadvantages: (1) the extracted features lack physical meaning, and the "black box" nature of the network makes the mechanism opaque; (2) microscopic electronic effects are only vaguely characterized, which limits the ability to distinguish complex isomers.

Disclosure of Invention

To solve the problems described in the background, the invention provides an atmospheric reaction kinetic parameter intelligent predic