CN-121981335-A - Soil organic matter ecological toxicity prediction and risk assessment method based on QSAR-ML

Abstract

The application discloses a soil organic matter ecotoxicity prediction and risk assessment method based on QSAR-ML. The method comprises: obtaining environmental parameters, biological species characteristics, and molecular descriptors of target organic matter for a plurality of species; fusing and standardizing them to generate target multi-scale feature vectors corresponding to the target organic matter; inputting each target multi-scale feature vector into a trained ecotoxicity prediction model to obtain no-effect concentration values of the target organic matter for the plurality of species; fitting a species sensitivity distribution curve based on the no-effect concentration values and calculating a hazardous concentration value that protects a preset percentage of species; deriving an ecological safety threshold of the target organic matter in a specific environment from the hazardous concentration value; and assessing the ecological risk of the target organic matter in the specific environment according to the ecological safety threshold and the measured exposure concentration of the target organic matter. The method can improve the accuracy of toxicity prediction and risk assessment for organic pollutants.
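
The last three steps of the abstract (curve fitting, hazardous concentration, risk assessment) can be illustrated with a minimal sketch. It assumes a log-normal species sensitivity distribution, an optional assessment factor, and a risk quotient defined as exposure divided by threshold; the application does not specify any of these, and the function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def hazardous_concentration(noec_values, protect_fraction=0.95):
    """Fit a log-normal SSD to per-species no-effect concentrations.

    Returns HCp, the concentration at which (1 - protect_fraction) of
    species are expected to be affected (HC5 when protect_fraction=0.95).
    Assumption: log10(NOEC) is normally distributed across species.
    """
    logs = [math.log10(v) for v in noec_values]
    mu, sigma = mean(logs), stdev(logs)  # sample mean and standard deviation
    z = NormalDist().inv_cdf(1.0 - protect_fraction)  # negative for p < 0.5
    return 10 ** (mu + z * sigma)


def risk_quotient(exposure, hc, assessment_factor=1.0):
    """Ratio of measured exposure to the ecological safety threshold.

    Assumption: threshold = HCp / assessment_factor; a quotient above 1
    would indicate exposure exceeding the threshold.
    """
    threshold = hc / assessment_factor
    return exposure / threshold


# Hypothetical predicted no-effect concentrations for three species (mg/kg)
noec = [1.0, 10.0, 100.0]
hc5 = hazardous_concentration(noec)
rq = risk_quotient(exposure=0.5, hc=hc5)
```

In this sketch the hazardous concentration falls below the most sensitive species' value, so the derived threshold is conservative with respect to every species in the fitted set.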

Inventors

  • WANG XUEDONG
  • LIU YUAN

Assignees

  • Capital Normal University (首都师范大学)

Dates

Publication Date
2026-05-05
Application Date
2026-02-24

Claims (5)

  1. A soil organic matter ecotoxicity prediction and risk assessment method based on QSAR-ML, characterized by comprising the following steps: acquiring environmental parameters, biological species characteristics, and molecular descriptors of target organic matter for a plurality of species, wherein the molecular descriptors are calculated from the molecular structure of the target organic matter using density functional theory; fusing and standardizing the environmental parameters, biological species characteristics, and molecular descriptors of each target organic matter to generate corresponding target multi-scale feature vectors; inputting each target multi-scale feature vector into a trained ecotoxicity prediction model to obtain no-effect concentration values of the target organic matter for the plurality of species, wherein the trained ecotoxicity prediction model is the machine learning (ML) algorithm model selected as having the best performance, after internal cross-validation and external independent validation, from a plurality of ML algorithm models trained on organic matter sample data covering the plurality of species; fitting a species sensitivity distribution curve based on the plurality of no-effect concentration values, calculating a hazardous concentration value that protects a preset percentage of species, and deriving from the hazardous concentration value an ecological safety threshold of the target organic matter in the specific environment corresponding to its environmental parameters; and assessing the ecological risk of the target organic matter in the specific environment according to the ecological safety threshold and the measured exposure concentration of the target organic matter.
  2. The method of claim 1, wherein fusing and standardizing the environmental parameters, biological species characteristics, and molecular descriptors of each target organic matter to generate corresponding target multi-scale feature vectors comprises: aligning and concatenating the molecular descriptors, environmental parameters, and biological species characteristics corresponding to each target organic matter to form an initial multidimensional feature vector for each target organic matter; and applying a logarithmic transformation to the numerical features in each initial multidimensional feature vector to remove differences in scale and units, and numerically encoding the non-numerical features, to obtain the target multi-scale feature vector corresponding to each target organic matter.
  3. The method of claim 1, wherein the training process of the ecotoxicity prediction model comprises: acquiring toxicity effect values, environmental parameters, biological species characteristics, and molecular descriptors of organic matter samples covering a plurality of species; fusing and standardizing the environmental parameters, biological species characteristics, and molecular descriptors of the organic matter samples to generate a multi-scale feature matrix, and generating a toxicity effect value label vector corresponding to the multi-scale feature matrix based on the toxicity effect values of the organic matter samples; and training a plurality of machine learning algorithm models using the multi-scale feature matrix and the toxicity effect value label vector, and selecting the machine learning algorithm model with the best performance from the plurality of machine learning algorithm models through internal cross-validation and external independent validation to obtain the trained ecotoxicity prediction model.
  4. The method of claim 3, further comprising: calculating a leverage threshold according to the number of organic matter samples and the number of molecular descriptors corresponding to each organic matter sample; and determining the confidence level of the ecotoxicity prediction model based on the proportion of organic matter samples whose leverage value is less than or equal to the leverage threshold and whose standardized residual lies within the standardized residual threshold range.
  5. The method of claim 3, wherein the ecotoxicity prediction model comprises a random forest model, the method further comprising: calculating, based on the random forest model, an importance value of each feature of the organic matter samples for the prediction result, and ranking the features by importance value to obtain a global feature importance ranking list, wherein the features of an organic matter sample comprise any of its environmental parameters, biological species characteristics, and molecular descriptors; quantifying the interaction strength between each pair of candidate features according to the additional impurity reduction produced when the candidate feature pair splits jointly along a decision path, generating a feature interaction strength ranking list, and determining from that list the key feature pair with the highest interaction strength; generating a bivariate partial dependence plot based on the key feature pair to visualize the nonlinear influence of the pair's combined effect on toxicity; and, based on the analysis results of the global feature importance ranking list, the feature interaction strength ranking list, and the partial dependence plot, and in combination with physicochemical principles, deriving an interaction mechanism for the key feature pair and verifying the plausibility of the prediction result.
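
The confidence check in claim 4 resembles a Williams-plot applicability-domain test, and could be sketched roughly as follows. The claim does not state the threshold formula, so this sketch assumes the commonly used warning leverage h* = 3(p + 1)/n and a standardized residual limit of ±3. The residuals are simply scaled by their sample standard deviation, and all function names are hypothetical.

```python
import numpy as np


def leverage_threshold(n_samples, n_descriptors):
    # Assumed formula (not stated in the claim): h* = 3(p + 1) / n
    return 3.0 * (n_descriptors + 1) / n_samples


def leverages(X):
    """Diagonal of the hat matrix H = Xc (Xc^T Xc)^-1 Xc^T.

    Xc is the descriptor matrix with an intercept column prepended.
    """
    X = np.asarray(X, dtype=float)
    Xc = np.column_stack([np.ones(len(X)), X])
    xtx_inv = np.linalg.pinv(Xc.T @ Xc)
    # h_i = x_i^T (Xc^T Xc)^-1 x_i for every row x_i of Xc
    return np.einsum("ij,jk,ik->i", Xc, xtx_inv, Xc)


def domain_coverage(X, y_true, y_pred, resid_limit=3.0):
    """Fraction of samples within both the leverage and residual limits."""
    X = np.asarray(X, dtype=float)
    h = leverages(X)
    h_star = leverage_threshold(X.shape[0], X.shape[1])
    resid = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Simplification: scale residuals by their sample standard deviation
    std_resid = resid / resid.std(ddof=1)
    inside = (h <= h_star) & (np.abs(std_resid) <= resid_limit)
    return float(inside.mean())
```

A coverage value close to 1 would suggest that most samples lie inside the model's applicability domain; how the proportion maps to a stated confidence level is left open in the claim.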

Description

Soil organic matter ecological toxicity prediction and risk assessment method based on QSAR-ML

Technical Field

The application relates to the technical field of environmental pollutant risk assessment and machine learning, in particular to a soil organic matter ecotoxicity prediction and risk assessment method based on QSAR-ML.

Background

With the rapid development of the global chemical industry, the annual output of synthetic organic matter has exceeded 500 million tons, of which about 12% enters the soil environment through discharge of the industrial "three wastes" (waste gas, wastewater, and solid waste), agricultural non-point source emissions, urbanization, and other pathways. Organic pollution has become a core bottleneck restricting soil ecological safety and sustainable agricultural development. Controlling organic pollution is therefore necessary, and ecological risk assessment is the scientific basis for pollution control. Traditional ecotoxicity tests rely on laboratory bioassays, and acquiring full life-cycle toxicity data for a single compound is time-consuming and costly. More seriously, hundreds of thousands of organic contaminants are currently known, but less than 1% have the exposure data required for a complete retrospective risk assessment, so a large number of emerging contaminants cannot be incorporated into the risk assessment system because of missing data. For example, per- and polyfluoroalkyl substances (PFAS) are a widely occurring class of persistent contaminants, yet fewer than 1% of PFAS have been tested for acute or chronic toxicity, which severely limits the scientific soundness of risk management decisions.
Quantitative structure-activity relationship (QSAR) modeling provides an economical and efficient alternative for addressing this data scarcity. Traditional QSAR models are based on multiple linear regression (MLR), and their prediction accuracy drops markedly in cross-species and multi-exposure scenarios. Machine learning (ML) algorithms have transformed QSAR modeling and overcome bottlenecks of the traditional models. However, existing ML approaches consider only the relationship between a single influencing factor (such as the molecular structure descriptors of an organic pollutant) and the toxicity effect value, and cross-species toxicity extrapolation still relies on empirical assumptions, which limits the accuracy of toxicity prediction and risk assessment for organic pollutants.

Disclosure of Invention

In view of this, an embodiment of the application provides a soil organic matter ecotoxicity prediction and risk assessment method based on QSAR-ML, which comprises the following steps: obtaining environmental parameters, biological species characteristics, and molecular descriptors of target organic matter for a plurality of species; fusing and standardizing the environmental parameters, biological species characteristics, and molecular descriptors of each target organic matter to generate corresponding target multi-scale feature vectors; inputting each target multi-scale feature vector into a trained ecotoxicity prediction model to obtain no-effect concentration values of each target organic matter for the plurality of species, wherein the trained ecotoxicity prediction model is the machine learning algorithm model selected as having the best performance, after internal cross-validation and external independent validation, from a plurality of machine learning algorithm models trained on organic matter sample data covering the plurality of species; fitting a species sensitivity distribution curve based on the plurality of no-effect concentration values and calculating a hazardous concentration value that protects a preset percentage of species; deriving from the hazardous concentration value an ecological safety threshold of the target organic matter in the specific environment corresponding to its environmental parameters; and assessing the ecological risk of the target organic matter in the specific environment according to the ecological safety threshold and the measured exposure concentration of the target organic matter.

Optionally, fusing and standardizing the environmental parameters, biological species characteristics, and molecular descriptors of each target organic matter to generate corresponding multi-scale feature vectors comprises: aligning and concatenating the molecular descriptors, environmental parameters, and biological species characteristics corresponding to each target organic matter to form