CN-121983143-A - Groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning
Abstract
The invention relates to a groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning, which comprises the following steps: collecting groundwater microbial gene sequencing accession numbers reported in the literature together with the fluorescence emission spectrum data of the corresponding water samples; downloading the microbial gene sequencing data according to the obtained accession numbers; normalizing the emission spectrum data; selecting functional genes that reflect the nitrogen metabolism capability of groundwater and calculating their relative abundance; and constructing a machine learning model of groundwater nitrogen metabolism. The optimal model selected by the invention can accurately predict groundwater nitrogen metabolism, reveal the influence of DOM fluorescence properties on the groundwater ecosystem, and be used to predict the microbial nitrogen metabolism function of groundwater in unsurveyed areas.
Inventors
- WANG LONGFEI
- YANG YEZHI
- LI DIE
- ZOU YINA
- WANG ZIYI
Assignees
- Hohai University (河海大学)
Dates
- Publication Date: 20260505
- Application Date: 20260409
Claims (10)
- 1. A groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning, characterized by comprising the following steps: Step (1), collecting groundwater microbial gene sequencing accession numbers and the three-dimensional fluorescence spectrum data of the corresponding water samples; Step (2), downloading the groundwater microbial gene sequencing data according to the obtained accession numbers, processing the sequencing data with bioinformatics analysis software, performing species classification and annotation to obtain the relative abundance of microbial functional genes, and applying a logarithmic transformation to all relative abundance data; Step (3), selecting functional genes reflecting the nitrogen metabolism capability of the groundwater environment, the functional genes comprising denitrification genes, nitrite respiration genes, nitrate reduction genes and urea decomposition genes, wherein the relative abundance of the functional genes is used as the predicted variable and different feature groups are used as inputs, the feature groups being: feature group 1), spectral information obtained by analyzing the 3D-EEM with unfolded principal component analysis (U-PCA); feature group 2), spectral information obtained by analyzing the 3D-EEM with PARAFAC; and feature group 3), water quality data; Step (4), constructing 4 types of ML models and 2 types of automatic machine learning models, wherein the ML models comprise a K-nearest neighbor model KNN, an extreme gradient boosting model XGBoost, a random forest model RF and a genetic-algorithm-optimized support vector machine model GA-SVM, and the automatic machine learning models comprise a TPOT automatic machine learning model and an H2O automatic machine learning model; Step (5), ranking feature importance based on Shapley values, identifying the features that have a direct causal relationship with the microbial nitrogen metabolism function based on causal analysis, constructing the numerical response relationship between causal relationship pairs, and identifying the critical points at which the influence of the fluorescence features on the nitrogen metabolism function changes from positive to negative.
- 2. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein in step (1) the gene sequencing accession number is the accession number of 16S rRNA sequencing data.
- 3. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein step (1) is performed as follows: (1) searching the literature with the keywords groundwater microorganisms, high-throughput sequencing and DOM, and recording the literature that reports groundwater microbial gene sequencing accession numbers to obtain microbial sequencing data points; removing Raman and Rayleigh scattering from the 3D-EEM spectra with a Python program and bringing the spectra to the same size; (2) collecting the characteristic variables of the microbial sequencing data points, and predicting the relative abundance of the microbial functional genes with the FAPROTAX database.
- 4. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein in step (3), feature group 1) consists of the fluorescence intensities of the key DOM Ex/Em pairs identified by U-PCA, together with the fluorescence index FI, the freshness index β:α, the biological index BIX and the humification index HIX; feature group 2) consists of the DOM components identified by PARAFAC and their maximum fluorescence peak intensities, and the calculated DOM spectral indices FI, β:α, BIX and HIX; feature group 3) consists of water quality data including pH, conductivity, temperature, dissolved oxygen, dissolved organic carbon, chloride, nitrate, nitrite, sulfate, magnesium, calcium, potassium and sodium ions.
- 5. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 4, wherein in step (3), the U-PCA analysis unfolds the 3D-EEM spectrum data along the excitation-emission plane into a one-dimensional vector, extracts by singular value decomposition the first $k$ principal components whose cumulative variance contribution rate reaches 95%, identifies the excitation/emission wavelength positions with high weights in the loading of each principal component, screens the fluorescence intensities of the key fluorescence regions, and collates the fluorescence intensities of the key fluorescence regions of all samples to form the input data for predicting nitrogen metabolism, wherein the singular value decomposition formula is: $X = U S V^{T}$; wherein $X$ is the standardized $n \times m$ data matrix, $n$ is the number of samples, $m$ is the number of features, $U$ is the $n \times n$ left singular vector matrix, $S$ is the $n \times m$ diagonal matrix of singular values, and $V^{T}$ is the $m \times m$ transpose of the right singular vector matrix; to reach a cumulative variance contribution rate of 95%, the smallest $k$ is taken for which the sum of squares of the first $k$ singular values of $S$, as a fraction of the sum of squares of all singular values, is greater than or equal to 0.95: $\sum_{j=1}^{k} \sigma_j^{2} / \sum_{j=1}^{m} \sigma_j^{2} \ge 0.95$; wherein $\sigma_j$ is the $j$-th diagonal element of the singular value matrix $S$, and $m$ is the number of all singular values; $T_k = X V_k$; wherein $T_k$ is the $n \times k$ principal component matrix after dimension reduction, containing the first $k$ principal components whose cumulative variance contribution rate reaches 95%, and $V_k$ is the $m \times k$ right singular vector matrix, which serves as the loading matrix; the principal components corresponding to $V_k$ are superimposed and the data structure is reconstructed to identify the most critical fluorescence peaks.
- 6. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 4, wherein in step (3), the PARAFAC analysis decomposes the 3D-EEM data by the alternating least squares method, resolving the EEM data set of multiple samples into individual fluorescent components with specific excitation-emission spectral characteristics.
- 7. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein the model accuracy evaluation indices in step (4) are the coefficient of determination $R^{2}$ and the mean squared error MSE; the larger $R^{2}$ is, the better the model fits and the stronger the explanatory power of the independent variables for the dependent variable; the smaller the MSE is, the closer the model's predicted values are to the true values and the higher the prediction accuracy; $R^{2}$ is calculated as: $R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}{\sum_{i=1}^{n} (y_i - \bar{y})^{2}}$; wherein $n$ is the number of samples, $y_i$ is the observed value of the groundwater microbial functional abundance, $\hat{y}_i$ is the predicted value of the groundwater microbial functional abundance, and $\bar{y}$ is the mean of the observed values of the groundwater microbial functional abundance; the MSE is calculated as: $\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}$; wherein $y_i$ is the observed value of the groundwater microbial functional abundance and $\hat{y}_i$ is the predicted value of the groundwater microbial functional abundance.
- 8. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein in step (4), five-fold cross validation is used: the test-set samples are randomly divided into 5 subsets of comparable size, and each of the 4 ML models and 2 automatic machine learning models is then trained and evaluated 5 times to assess model accuracy.
- 9. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein the Shapley values in step (5) are calculated as: $\phi_i(f, x) = \sum_{S \in Q_i} \frac{(|S| - 1)!\,(M - |S|)!}{M!} \left[ f_{S}(x_{S}) - f_{S \setminus \{i\}}(x_{S \setminus \{i\}}) \right]$; wherein $\phi_i(f, x)$ is the Shapley value of feature $i$ in any model $f$ constructed based on data set $x$, $M$ is the total number of input features, $Q_i$ is the set of all feature combinations containing feature $i$, $|S|$ is the number of features in the feature combination $S$, and $f_{S}$ and $f_{S \setminus \{i\}}$ are the prediction models trained on $S$ and $S \setminus \{i\}$ respectively.
- 10. The groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning according to claim 1, wherein in step (5), double machine learning (DML) is adopted for the causal analysis, the optimal model obtained in step (3) is used as the base learner to calculate the average causal effect (ATE) of each fluorescence feature on nitrogen metabolism, and when the ATE is statistically significant the feature is considered to have a direct causal relationship with the result; the numerical response relationship between causal relationship pairs is constructed with individual conditional expectation (ICE) plots: all other features are held fixed, only one fluorescence feature is varied, and the output of the optimal model is observed to obtain a continuous response curve.
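The U-PCA step of claim 5 can be illustrated with a minimal Python sketch. It assumes NumPy is available, uses column centring as a stand-in for the unspecified standardization, and runs on synthetic data; the function name `upca_key_pairs` and the `top` parameter are hypothetical and do not come from the patent.

```python
import numpy as np

def upca_key_pairs(eems, ex, em, var_target=0.95, top=3):
    """Unfold EEMs, reduce them by SVD, and map heavy loadings back to Ex/Em pairs.

    eems: array of shape (n_samples, n_ex, n_em); ex, em: wavelength axes in nm.
    """
    n, n_ex, n_em = eems.shape
    X = eems.reshape(n, n_ex * n_em)            # unfold each EEM into one row
    X = X - X.mean(axis=0)                      # column-centre (assumed preprocessing)
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    ratio = np.cumsum(s ** 2) / np.sum(s ** 2)  # cumulative variance contribution
    # Smallest k whose cumulative contribution reaches the target (95% by default).
    k = min(int(np.searchsorted(ratio, var_target)) + 1, len(s))
    scores = X @ Vt[:k].T                       # n x k principal component matrix
    pairs = []
    for loading in Vt[:k]:
        # Largest absolute loadings mark the most influential Ex/Em positions.
        for flat in np.argsort(np.abs(loading))[::-1][:top]:
            i, j = divmod(int(flat), n_em)      # flat index -> (Ex index, Em index)
            pairs.append((float(ex[i]), float(em[j])))
    return k, scores, pairs

# Synthetic demo (not patent data): one Gaussian peak at Ex/Em = 350/450 nm
# whose amplitude varies across 8 samples.
ex = np.arange(250.0, 451.0, 10.0)
em = np.arange(300.0, 551.0, 10.0)
peak = np.exp(-((ex[:, None] - 350.0) ** 2 + (em[None, :] - 450.0) ** 2) / (2 * 30.0 ** 2))
eems = np.linspace(1.0, 2.0, 8)[:, None, None] * peak
k, scores, pairs = upca_key_pairs(eems, ex, em)
print(k, pairs[0])
```

Because the unfolding is a plain row-major reshape, every loading index converts back to a physical Ex/Em pair with a single `divmod`, which is what lets the reduced features keep their wavelength identity.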
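The evaluation in claims 7 and 8 can be sketched in plain Python as follows. The one-feature least-squares line is only a hypothetical stand-in for the KNN, XGBoost, RF, GA-SVM, TPOT and H2O models named in claim 1, the data are synthetic, and the function names are illustrative.

```python
import random

def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def five_fold_cv(fit, predict, X, y, seed=0):
    """Shuffle the samples, split them into 5 folds, then train and score 5 times."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::5] for i in range(5)]       # 5 subsets of comparable size
    results = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        pred = [predict(model, X[i]) for i in held_out]
        truth = [y[i] for i in held_out]
        results.append((r2_score(truth, pred), mse(truth, pred)))
    return results

# Hypothetical stand-in model: a one-feature least-squares line.
def fit_line(X, y):
    xs = [row[0] for row in X]
    mx, my = sum(xs) / len(xs), sum(y) / len(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(xs, y)) / sum((a - mx) ** 2 for a in xs)
    return slope, my - slope * mx

def predict_line(model, row):
    slope, intercept = model
    return slope * row[0] + intercept

# Synthetic, noise-free data: y = 2x + 1.
X = [[float(v)] for v in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
results = five_fold_cv(fit_line, predict_line, X, y)
```

Each entry of `results` is an `(R2, MSE)` pair for one held-out fold, so averaging the five pairs gives a single score per model for comparison.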
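For a small number of features, the Shapley values of claim 9 can be computed exactly by enumerating every feature combination that contains feature i. In the sketch below, `toy_value` is a hypothetical stand-in for "the prediction of a model trained on feature subset S"; it is not the patent's model, and real analyses normally rely on an approximating library rather than full enumeration.

```python
from itertools import combinations
from math import factorial

def shapley_values(value, n_features):
    """Exact Shapley values, summing over every feature subset that contains i."""
    M = n_features
    phi = [0.0] * M
    for i in range(M):
        others = [j for j in range(M) if j != i]
        for size in range(M):
            for rest in combinations(others, size):
                S = frozenset(rest) | {i}       # a combination containing feature i
                weight = factorial(len(S) - 1) * factorial(M - len(S)) / factorial(M)
                phi[i] += weight * (value(S) - value(S - {i}))
    return phi

# Hypothetical value function: additive contributions plus an interaction
# bonus when features 0 and 1 are both present.
def toy_value(S):
    base = {0: 3.0, 1: 1.0, 2: 0.5}
    bonus = 2.0 if {0, 1} <= S else 0.0
    return sum(base[j] for j in S) + bonus

phi = shapley_values(toy_value, 3)
print(phi)
```

The interaction bonus is split equally between features 0 and 1, and the values sum to the difference between the full-feature and empty-feature outputs, the efficiency property that makes Shapley values usable for ranking features.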
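The ICE construction in claim 10 and the critical-point search in claim 1 can be sketched as below. Reading "critical point" as the grid value where the local slope of the response curve turns from positive to negative is an assumption, and `toy_model` is a hypothetical stand-in for the optimal model.

```python
def ice_curve(predict, sample, feature, grid):
    """Vary one feature over a grid while holding all other features fixed."""
    curve = []
    for value in grid:
        x = list(sample)
        x[feature] = value
        curve.append(predict(x))
    return curve

def critical_points(grid, curve):
    """Grid values where the slope of the response turns from positive to negative."""
    slopes = [
        (curve[i + 1] - curve[i]) / (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    ]
    return [
        grid[i + 1]
        for i in range(len(slopes) - 1)
        if slopes[i] > 0 and slopes[i + 1] < 0
    ]

# Hypothetical model: the response rises with feature 0 up to 3.0, then falls.
def toy_model(x):
    return x[1] - (x[0] - 3.0) ** 2

grid = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
curve = ice_curve(toy_model, [1.0, 10.0], 0, grid)
turning = critical_points(grid, curve)
```

One curve is produced per sample; overlaying the curves of all samples shows whether the turning point is shared across the data set or specific to certain wells.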
Description
Groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning
Technical Field
The invention relates to a groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning, and belongs to the technical field of environmental science and technology.
Background
Groundwater is an indispensable natural resource, providing nearly half of the world's drinking water. Groundwater is vulnerable to contamination by nitrogen fertilizers, and nitrate and nitrite can be converted into carcinogenic nitrosamine compounds. Microbially mediated nitrogen transformation dominates groundwater nitrogen metabolism. Research shows that changes in the properties of dissolved organic matter (DOM) can affect the abundance of microbial genes involved in nitrogen metabolism and thereby alter nitrogen transformation in groundwater. Establishing a method that links DOM characteristics to the nitrogen metabolism function of groundwater microorganisms is therefore important for evaluating the ecological safety of groundwater.
Three-dimensional fluorescence spectroscopy (3D-EEM) is a common technique for rapidly assessing DOM properties, but the raw data contain thousands of excitation/emission (Ex/Em) pairs and suffer from noise and peak overlap. Parallel factor analysis (PARAFAC) can resolve independent fluorescent components, but it is limited by strict split-half validation requirements, sensitivity to initial values and high computational cost, and it struggles to meet the need for rapid screening of large sample sets.
In particular, groundwater sampling wells are spatially dispersed, the sample size of any single study is limited, and multi-source data are strongly heterogeneous, whereas PARAFAC inherently requires all samples to share the same fixed number of components and stable spectral profiles. When EEM data from different aquifers, seasons or study areas are combined, component spectra become cross-contaminated and region-specific signals are averaged out, making it difficult to extract robust feature parameters that are comparable across data sets. Principal component analysis (PCA) achieves rapid dimension reduction through singular value decomposition, requires neither a preset number of components nor complex validation, and adapts better to multi-source heterogeneous data. However, the traditional PCA method flattens EEM data into a one-dimensional vector, so the two-dimensional Ex/Em coupling is lost; moreover, a PCA loading is an abstract weight vector that cannot be mapped directly to a fluorescence region, which makes it difficult to extract DOM feature parameters with clear ecological meaning for subsequent modeling.
In recent years, groundwater microbiology studies have accumulated a large amount of sequencing data, and machine learning provides an efficient tool for analyzing the relationship between multiple variables and functional labels. However, using the raw high-dimensional EEM data directly as input brings the curse of dimensionality, noise interference and a risk of overfitting, while using abstract PCA scores or PARAFAC components as input loses the physical identity of the original Ex/Em pairs, so the specific fluorescent substances on which the model's predictions depend cannot be traced.
Therefore, a technical scheme combining rapid EEM dimension reduction with machine learning is needed, one that can extract interpretable DOM feature parameters comparable across data sets through an improved PCA method, use these parameters to accurately predict the nitrogen metabolism function of groundwater microorganisms, and provide efficient and reliable technical support for large-scale groundwater ecological monitoring.
Disclosure of Invention
To solve the above problems, the invention discloses a groundwater nitrogen metabolism function prediction and regulation method based on three-dimensional fluorescence spectrum and automatic machine learning, with the following specific technical scheme: the method comprises the following steps: Step (1), collecting groundwater microbial gene sequencing accession numbers and the three-dimensional fluorescence spectrum data of the corresponding water samples; Step (2), downloading the groundwater microbial gene sequencing data according to the obtained accession numbers, processing the sequencing data with bioinformatics analysis software, performing species classification and annotation to obtain the relative abundance of microbial functional genes, and carrying o