CN-122024891-A - Machine learning-based large-scale total nitrogen concentration prediction method and system

Abstract

The invention discloses a machine learning-based large-scale total nitrogen concentration prediction method and system. The method comprises: obtaining database construction information; sequentially performing site matching, watershed aggregation, time window creation and feature selection on the database construction information to obtain an effective feature data table; dividing the effective feature data table into a training set and a test set; training and optimizing an initial prediction model with the training set and the test set to obtain an optimal prediction model; and inputting real-time detection data of a target watershed or monitoring site into the optimal prediction model to output a corresponding total nitrogen concentration prediction result. The invention addresses the technical problems that prior-art total nitrogen prediction methods depend too strongly on local fine-grained data, transfer poorly to other regions, and have weak cross-regional prediction capability.

Inventors

CAI XITIAN
HUANG HAIJUN
WU XIAOLU

Assignees

Sun Yat-sen University (中山大学)

Dates

Publication Date: 20260512
Application Date: 20251231

Claims (10)

1. A machine learning-based large-scale total nitrogen concentration prediction method, comprising: obtaining database construction information, wherein the database construction information comprises a global public data set, and the global public data set comprises hydrological data, land use data, population data, nitrogen fertilizer application data, river basin attribute data and total nitrogen concentration monitoring data; sequentially performing site matching, drainage basin aggregation, time window creation and feature selection on the database construction information to obtain an effective feature data table, wherein the effective feature data table consists of a plurality of effective features; dividing the effective feature data table to obtain a training set and a test set, and training and optimizing an initial prediction model with the training set and the test set to obtain an optimal prediction model, wherein the initial prediction model is constructed in advance using an extreme gradient boosting tree (XGBoost) algorithm; and inputting real-time detection data of a target river basin or monitoring station into the optimal prediction model, and outputting a corresponding total nitrogen concentration prediction result.
2. The machine learning-based large-scale total nitrogen concentration prediction method of claim 1, wherein the sequentially performing site matching, drainage basin aggregation, time window creation and feature selection on the database construction information to obtain an effective feature data table comprises: performing site matching on the database construction information to obtain site coordinates; performing drainage basin aggregation on the database construction information to obtain drainage basin feature values; creating time windows for the continuous drainage basin feature values to obtain a dynamic sequence of the continuous drainage basin feature values; and performing feature selection on the site coordinates, the drainage basin feature values and the dynamic sequence of the continuous drainage basin feature values to obtain the effective feature data table.
3. The machine learning-based large-scale total nitrogen concentration prediction method of claim 2, wherein the performing drainage basin aggregation on the database construction information to obtain the drainage basin feature values comprises: determining the data type of the database construction information, wherein the data type comprises continuous variables and cumulative variables, the continuous variable data comprise the precipitation and air temperature variables in the hydrometeorological data, and the cumulative variable data comprise the flow variable in the river basin attribute data and the total nitrogen emission in the total nitrogen concentration monitoring data; if the database construction information is a continuous variable, calculating a drainage basin weighted average and taking the drainage basin weighted average as the drainage basin feature value; and if the database construction information is a cumulative variable, selecting and calculating either a drainage basin accumulated value or a drainage basin area-weighted accumulated value according to whether the database construction information is a density-related quantity, and taking the drainage basin accumulated value or the drainage basin area-weighted accumulated value as the drainage basin feature value.
4. The machine learning-based large-scale total nitrogen concentration prediction method of claim 2, wherein the creating time windows for the continuous drainage basin feature values to obtain the dynamic sequence of the continuous drainage basin feature values comprises: presetting a plurality of time windows of different scales for the continuous drainage basin feature values; for each time window, calculating the mean of the continuous drainage basin feature values over the k time periods counted back from the current time, wherein k is the size of the time window; and obtaining the dynamic sequence of the continuous drainage basin feature values based on the mean corresponding to each time window.
5. The machine learning-based large-scale total nitrogen concentration prediction method of claim 2, wherein the performing feature selection on the site coordinates, the drainage basin feature values and the dynamic sequence of the continuous drainage basin feature values to obtain the effective feature data table comprises: obtaining a preliminary feature set based on the site coordinates, the drainage basin feature values and the dynamic sequence of the continuous drainage basin feature values; evaluating variable importance on the preliminary feature set by a backward feature selection method to obtain the actual contribution of each feature in the preliminary feature set to the total nitrogen concentration prediction; screening a preset number of effective features based on the actual contribution of each feature to the total nitrogen concentration prediction; and integrating the screened effective features to obtain the effective feature data table.
6. The machine learning-based large-scale total nitrogen concentration prediction method according to claim 1, wherein the dividing the effective feature data table to obtain a training set and a test set, and training and optimizing an initial prediction model with the training set and the test set to obtain an optimal prediction model comprises: randomly dividing the effective feature data table into a training set and a test set according to a preset proportion, wherein the training set is used for training the initial prediction model and the test set is used for evaluating the performance of the initial prediction model; in response to an operation instruction to input the training set into the pre-constructed initial prediction model, automatically optimizing the hyperparameters of the initial prediction model through an automated machine learning optimization library to generate different hyperparameter combinations; based on the different hyperparameter combinations, training and evaluating the initial prediction model multiple times with five-fold cross-validation, and iteratively selecting an optimal hyperparameter combination with the objective of maximizing the model's coefficient of determination (R²); and retraining the initial prediction model with the optimal hyperparameter combination, inputting the test set into the retrained model, and evaluating its prediction performance to obtain the optimal prediction model.
7. The machine learning-based large-scale total nitrogen concentration prediction method as claimed in claim 1, wherein the inputting real-time detection data of a target river basin or monitoring station into the optimal prediction model and outputting a corresponding total nitrogen concentration prediction result comprises: acquiring real-time detection data of the target river basin or monitoring station; inputting the real-time detection data into the optimal prediction model, and outputting a total nitrogen concentration sequence through data preprocessing; and taking the total nitrogen concentration sequence as the total nitrogen concentration prediction result to complete the total nitrogen concentration prediction.
8. A machine learning-based large-scale total nitrogen concentration prediction system, the system comprising: an information acquisition module, used for acquiring database construction information, wherein the database construction information comprises a global public data set and historical detection data of a target river basin or monitoring station, and the global public data set comprises hydrological data, land use data, population data, nitrogen fertilizer application data, river basin attribute data and total nitrogen concentration monitoring data; a data preprocessing module, used for sequentially performing site matching, drainage basin aggregation, time window creation and feature selection on the database construction information to obtain an effective feature data table, wherein the effective feature data table consists of a plurality of effective features; a model training module, used for dividing the effective feature data table to obtain a training set and a test set, and training and optimizing an initial prediction model with the training set and the test set to obtain an optimal prediction model, wherein the initial prediction model is constructed in advance using an extreme gradient boosting tree (XGBoost) algorithm; and a concentration prediction module, used for inputting real-time detection data of the target river basin or monitoring station into the optimal prediction model and outputting a corresponding total nitrogen concentration prediction result.
9. A computer readable storage medium, characterized in that the storage medium has stored therein a computer program or instructions which, when executed by a communication device, implements the machine learning based large scale total nitrogen concentration prediction method according to any of claims 1-7.
10. A computer program product comprising a computer program or instructions which, when executed by a communication device, implements the machine learning based large scale total nitrogen concentration prediction method of any one of claims 1-7.
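The drainage basin aggregation of claim 3 and the time windows of claim 4 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names are hypothetical, and two readings of the claims are assumed: a density-related cumulative variable is multiplied by cell area before summing, and each window averages the k periods before the current one, excluding the current period.

```python
from statistics import fmean


def basin_weighted_mean(values, areas):
    """Area-weighted mean of a continuous variable (e.g. precipitation)
    over the grid cells that make up one drainage basin."""
    total_area = sum(areas)
    if total_area <= 0:
        raise ValueError("total basin area must be positive")
    return sum(v * a for v, a in zip(values, areas)) / total_area


def basin_accumulation(values, areas, is_density):
    """Accumulated value of a cumulative variable over one basin.

    Assumption: a density-related quantity (amount per unit area) is
    weighted by cell area before summing; otherwise values are summed.
    """
    if is_density:
        return sum(v * a for v, a in zip(values, areas))
    return sum(values)


def trailing_window_means(series, window_sizes):
    """For each window size k, the mean of the k periods before each
    time step. Entries are None until k earlier periods exist."""
    features = {}
    for k in window_sizes:
        features[k] = [
            fmean(series[t - k:t]) if t >= k else None
            for t in range(len(series))
        ]
    return features
```

For example, `trailing_window_means(monthly_precip, [1, 3, 12])` would give one lagged-mean column per window scale, which could then be joined to the static basin features before feature selection.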

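The model selection loop of claim 6 (random train/test split, five-fold cross-validation, choosing the hyperparameters that maximize the coefficient of determination, then retraining and scoring on the test set) can be sketched as follows. To stay self-contained, this sketch swaps in a one-feature ridge regression for the extreme gradient boosting tree model and a fixed candidate list for the automated optimization library. All names are hypothetical.

```python
import random
from statistics import fmean


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = fmean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def fit_ridge_1d(xs, ys, penalty):
    """Stand-in model: one-feature ridge regression (NOT a boosted tree).
    `penalty` plays the role of the tuned hyperparameter."""
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)
    return slope, y_bar - slope * x_bar


def predict(model, xs):
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def cross_val_r2(xs, ys, penalty, n_folds=5):
    """Mean validation R^2 over n_folds folds of the training data."""
    idx = list(range(len(xs)))
    scores = []
    for f in range(n_folds):
        val = idx[f::n_folds]  # every n_folds-th sample forms one fold
        val_set = set(val)
        train = [i for i in idx if i not in val_set]
        model = fit_ridge_1d([xs[i] for i in train],
                             [ys[i] for i in train], penalty)
        scores.append(r2_score([ys[i] for i in val],
                               predict(model, [xs[i] for i in val])))
    return fmean(scores)


def select_and_evaluate(xs, ys, candidates, test_fraction=0.2, seed=0):
    """Random split, pick the candidate with the highest CV R^2,
    retrain on the full training set, and score on the held-out test set."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(len(idx) * test_fraction))
    test, train = idx[:n_test], idx[n_test:]
    x_tr, y_tr = [xs[i] for i in train], [ys[i] for i in train]
    best = max(candidates, key=lambda p: cross_val_r2(x_tr, y_tr, p))
    model = fit_ridge_1d(x_tr, y_tr, best)
    test_r2 = r2_score([ys[i] for i in test],
                       predict(model, [xs[i] for i in test]))
    return best, test_r2
```

In a fuller setting the candidate list would be replaced by combinations proposed by a hyperparameter optimization library, and the ridge model by a gradient boosted tree regressor; the cross-validation and scoring steps keep the same structure.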
Description

Machine learning-based large-scale total nitrogen concentration prediction method and system

Technical Field

The invention relates to the field of total nitrogen concentration prediction, in particular to a large-scale total nitrogen concentration prediction method and system based on machine learning.

Background

With the growing global demand for water pollution control, total nitrogen concentration prediction plays a vital role in water quality monitoring, pollution early warning and river basin management. Developing a total nitrogen concentration prediction method that is accurate, efficient and widely applicable is a key challenge in current environmental data science. At present, prediction methods for total nitrogen concentration fall mainly into two groups: models based on physical mechanisms and data-driven machine learning models. Physical-mechanism models simulate the migration and transformation of nitrogen by establishing mathematical equations of biogeochemical processes and combining large amounts of hydrological, meteorological and underlying-surface data. Although their mechanisms are clear, such models have numerous parameters, are complex to calibrate, depend strongly on high-quality input data, and have limited computational efficiency in large-scale applications. Data-driven machine learning models (such as random forests, support vector machines and deep neural networks) make predictions by mining nonlinear relationships in the data, and offer high modeling efficiency and strong multi-feature processing capability. However, the machine learning methods in existing research focus on specific point sources or single watersheds, and model construction relies heavily on localized, high-frequency, fine-grained monitoring data (e.g., pH, dissolved oxygen and conductivity).

Such data are usually acquired through field monitoring or experimental measurement, which makes data acquisition costly and the workflow cumbersome when the model is extended to new applications. Meanwhile, because the training data mostly come from local long-term observation, the patterns learned by the model are strongly region-specific, and the model struggles to capture how nitrogen migration mechanisms differ under different climate, hydrological and land use conditions. As a result, its cross-regional, large-scale generalization capability and universality are seriously insufficient, which limits its value in practical river basin management. The core defect of prior-art total nitrogen prediction methods is therefore that the model depends too strongly on local fine-grained data; this dependence leads to poor transferability and weak cross-regional prediction capability, making low-cost, wide-coverage deployment difficult.

Disclosure of Invention

The invention provides a machine learning-based large-scale total nitrogen concentration prediction method and system, which address the technical problems that prior-art total nitrogen prediction methods depend too strongly on local fine-grained data, transfer poorly to other regions, and have weak cross-regional prediction capability.
To solve the above technical problems, an embodiment of the present invention provides a machine learning-based large-scale total nitrogen concentration prediction method, including: obtaining database construction information, wherein the database construction information comprises a global public data set, and the global public data set comprises hydrological data, land use data, population data, nitrogen fertilizer application data, river basin attribute data and total nitrogen concentration monitoring data; sequentially performing site matching, drainage basin aggregation, time window creation and feature selection on the database construction information to obtain an effective feature data table, wherein the effective feature data table consists of a plurality of effective features; dividing the effective feature data table to obtain a training set and a test set, and training and optimizing an initial prediction model with the training set and the test set to obtain an optimal prediction model, wherein the initial prediction model is constructed in advance using an extreme gradient boosting tree (XGBoost) algorithm; and inputting real-time detection data of the target river basin or monitoring station into the optimal prediction model, and outputting a corresponding total nitrogen concentration prediction result. According to the embodiment of the invention, the model input is constructed by integrating the multisource data such as the global open hydrological weather, the land utilization