CN-115795380-B - Flue gas acid making data cleaning and optimizing method based on isolated forest and weighted random forest

Abstract

The invention discloses a flue gas acid making data cleaning and optimizing method based on an isolated forest and a weighted random forest. The method analyzes the flue gas acid making desulfurization process and, drawing on a large amount of production monitoring data, applies maximum information coefficient analysis to process variables such as fan outlet O2 concentration, fan outlet flue gas temperature, primary dynamic wave scrubber inlet pressure, furnace pressure, fan inlet flow and converter inlet temperature, in order to obtain the key variables that influence indexes such as the SO2 conversion rate and the sulfuric acid yield. Next, the trend of the raw data of each key variable is analyzed, and abnormal values and outliers in the data set are identified and removed with the isolated forest algorithm, which yields a missing data set. Finally, a weighted random forest algorithm performs fitting prediction on the missing data set and compensates the missing data in it, so that the data of the flue gas acid making process are cleaned and optimized, with the aim of improving desulfurization efficiency and sulfuric acid yield.

Inventors

LI XIAOLI
LIU MINGHUA
ZHAO JINYUAN
LI GUIHAI
LIU ZHENGMING
WANG KANG

Assignees

北京工业大学
北京瑞太智联技术有限公司

Dates

Publication Date: 20260508
Application Date: 20221125

Claims (3)

1. A flue gas acid making data cleaning and optimizing method based on an isolated forest and a weighted random forest, characterized by comprising the following steps:

Step 1, in the process of preparing sulfuric acid from flue gas generated by smelting in a copper plant, a monitoring system for the flue gas acid making process monitors the production process and collects data in real time;

Step 2, based on the real-time monitoring and data acquisition in Step 1, a maximum information coefficient analysis method is adopted to analyze the correlation between the process variables and the productivity indexes, and key variables affecting the SO2 conversion rate and the sulfuric acid yield are obtained, wherein the key variables comprise the flue gas flow, the inlet temperature of each layer of the converter and the outlet pressure of the fan;

Step 3, according to the variation trend of the key variable data obtained in Step 2, an abnormal data identification model based on the isolated forest is designed, and outliers and abnormal values in the data set are identified and removed;

Step 4, a fitting prediction model based on a weighted random forest is established, fitting prediction is performed on the missing data set, and the missing data in the missing data set are compensated to obtain a valuable data set;

a missing data compensation model based on the weighted random forest is established, fitting prediction is carried out on the missing data set, the missing data in the missing data set are compensated, and the data of the flue gas acid making process are optimized;

a traditional random forest obtains its prediction result by averaging the outputs of all regression trees, namely the base learners, so that the prediction precision of the random forest is affected; the weighted random forest regression uses the mean absolute percentage error (MAPE) of the prediction on the out-of-bag data as an evaluation index to evaluate the prediction capability of each base learner and assigns the learner a weight accordingly:

[formulas omitted in source]

wherein MAPE is the mean absolute percentage error of the prediction of the random forest regression model on the out-of-bag flue gas acid making data, t is the number of out-of-bag data, y_i is the true value of the flue gas acid making data, f(x_i) is the regression prediction value of the random forest, MAPE_i is the mean absolute percentage error of the prediction of the ith regression tree, w_i is the weight of the ith regression tree, and n is the number of regression trees in the random forest algorithm; the larger the value of MAPE_i, the lower the prediction precision of the learner and the smaller the corresponding value of w_i, which means that the learner has less influence on the prediction result; the specific steps of the weighted random forest algorithm are as follows:

Step 1, a subsample matrix is drawn with replacement from the flue gas acid making data training matrix T to serve as the training sample of a regression tree, the size of the subsample matrix being the same as that of the training matrix;

Step 2, the feature dimension of each flue gas acid making data sample is M; a constant m is specified, with m << M; a subset of m features is randomly selected from the M features, and each time the regression tree splits, the optimal feature is selected from these m features;

Step 3, each tree is grown to the greatest extent, until the height limit of the tree is reached, without any pruning;

Step 4, when the ith regression tree is trained, the out-of-bag data are input into the regression tree as a test set, and the mean absolute percentage error MAPE_i of its prediction and its weight w_i are calculated;

Step 5, the above steps are repeated to complete the construction and training of n regression trees;

finally, the n weighted regression trees are integrated to obtain the weighted random forest, and the final model output is:

[formula omitted in source]

wherein w_i is the weight of the ith regression tree, and T_i is the prediction result of the ith regression tree.
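
The weighting scheme of claim 1 can be illustrated with a minimal Python sketch. The claim's formulas for MAPE, w_i and the final output do not appear in this text, so the sketch rests on stated assumptions: MAPE is taken in its standard form, (100/t) times the sum of |y_i - f(x_i)| / |y_i| over the t out-of-bag samples; the weights are assumed to be proportional to 1/MAPE_i and normalized to sum to 1, which satisfies the stated property that a larger MAPE_i gives a smaller w_i; and the forest output is assumed to be the weighted sum of the tree predictions. The patent's actual formulas may differ. All function names and the synthetic data are hypothetical.

```python
import random

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent; true values must be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)

def sse(values):
    """Sum of squared deviations from the mean, used as the split cost."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def grow_tree(X, y, m, rng):
    """Grow an unpruned regression tree; each split tries m randomly chosen features."""
    if len(y) < 2 or len(set(y)) == 1:
        return sum(y) / len(y)                          # leaf: mean of the targets
    best = None
    for j in rng.sample(range(len(X[0])), m):
        for thr in sorted({row[j] for row in X})[1:]:   # both sides stay non-empty
            lo = [i for i in range(len(y)) if X[i][j] < thr]
            hi = [i for i in range(len(y)) if X[i][j] >= thr]
            cost = sse([y[i] for i in lo]) + sse([y[i] for i in hi])
            if best is None or cost < best[0]:
                best = (cost, j, thr, lo, hi)
    if best is None:                                    # the chosen features are constant
        return sum(y) / len(y)
    _, j, thr, lo, hi = best
    return (j, thr,
            grow_tree([X[i] for i in lo], [y[i] for i in lo], m, rng),
            grow_tree([X[i] for i in hi], [y[i] for i in hi], m, rng))

def predict_tree(node, row):
    """Walk from the root to a leaf and return the leaf's mean target."""
    while isinstance(node, tuple):
        j, thr, lo, hi = node
        node = lo if row[j] < thr else hi
    return node

def fit_weighted_forest(X, y, n_trees=15, m=1, seed=0):
    """Train n_trees bootstrap trees and weight each by its out-of-bag MAPE."""
    rng = random.Random(seed)
    size = len(y)
    trees, errors = [], []
    while len(trees) < n_trees:
        bag = [rng.randrange(size) for _ in range(size)]    # sample with replacement
        in_bag = set(bag)
        oob = [i for i in range(size) if i not in in_bag]   # out-of-bag rows
        if not oob:
            continue
        tree = grow_tree([X[i] for i in bag], [y[i] for i in bag], m, rng)
        err = mape([y[i] for i in oob], [predict_tree(tree, X[i]) for i in oob])
        trees.append(tree)
        errors.append(max(err, 1e-9))                       # guard against division by zero
    inverse = [1.0 / e for e in errors]
    total = sum(inverse)
    weights = [v / total for v in inverse]      # ASSUMPTION: w_i proportional to 1/MAPE_i
    return trees, weights, errors

def predict_forest(trees, weights, row):
    """ASSUMPTION: forest output is the weighted sum of the tree predictions."""
    return sum(w * predict_tree(t, row) for t, w in zip(trees, weights))

# Hypothetical data: two process variables and a target with three removed readings.
data_rng = random.Random(1)
X = [[data_rng.uniform(0, 10), data_rng.uniform(0, 10)] for _ in range(80)]
y = [100 + 3 * a + 0.5 * b * b + data_rng.gauss(0, 0.5) for a, b in X]
missing = {5, 17, 42}                                       # rows treated as removed outliers
train = [i for i in range(len(y)) if i not in missing]
trees, weights, errors = fit_weighted_forest([X[i] for i in train], [y[i] for i in train])
filled = {i: predict_forest(trees, weights, X[i]) for i in missing}
```

In this sketch the rows listed in `missing` play the role of the missing data set, and `filled` holds the compensated values. Normalizing the weights keeps every compensated value inside the range of the observed targets, which is one reason this particular weight form was chosen for illustration.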
2. The flue gas acid making data cleaning and optimizing method based on the isolated forest and the weighted random forest according to claim 1, wherein the correlation between variables is analyzed by a maximum information coefficient analysis method, the maximum information coefficient between two variables being calculated from the mutual information between them, with the following calculation formulas:

[formulas omitted in source]

wherein the variable X is the SO2 conversion rate, the variable Y is each variable in the flue gas acid making process, I[X;Y] is the mutual information between the variable X and the variable Y, p(X,Y) is the joint probability of the variable X and the variable Y, p(X) is the probability distribution of the variable X, p(Y) is the probability distribution of the variable Y, MIC[X;Y] is the maximum information coefficient between the variable X and the variable Y, n is the data quantity, and B(n) is a variable whose size depends on the data quantity; the key variables affecting the SO2 conversion rate are thereby obtained, and the key variables affecting the sulfuric acid yield are obtained in the same way;

the steps of analyzing the correlation between variables by the maximum information coefficient analysis method are as follows:

Step 1, for given values of i and j, the scatter diagram formed by the variable X and the variable Y is divided into a grid of i columns and j rows, and the maximum mutual information value is obtained;

Step 2, the maximum mutual information value is normalized;

Step 3, the maximum of the mutual information values at different scales is selected as the MIC value;

according to this method, the correlations between the flue gas acid making process variables and the SO2 conversion rate and the sulfuric acid yield are analyzed, and the variables with larger correlation are extracted as the objects of data cleaning.
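
The three MIC steps of claim 2 can be sketched in Python as follows. The claim's formulas do not appear in this text, so the sketch assumes the commonly used definitions: mutual information I[X;Y] as the sum of p(x,y) log2(p(x,y) / (p(x) p(y))) over the grid cells, normalization by log2(min(i, j)), and a grid budget i*j < B(n) with B(n) = n^0.6. It also simplifies step 1: instead of searching for the grid placement that maximizes mutual information, it uses equal-frequency bins, so the result is only an approximation of MIC. All names and the synthetic data are hypothetical.

```python
import math
import random
from collections import Counter

def equal_frequency_bins(values, k):
    """Label each value with one of k bins that hold roughly equal point counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * k // len(values)
    return labels

def mutual_information(bx, by):
    """I[X;Y] in bits, estimated from the joint histogram of two label sequences."""
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(c / n * math.log2(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())

def mic_approx(x, y, exponent=0.6):
    """Approximate MIC: largest normalized mutual information over grids with i*j < B(n)."""
    budget = len(x) ** exponent                 # ASSUMPTION: B(n) = n^0.6
    best = 0.0
    for i in range(2, int(budget)):             # i columns
        for j in range(2, int(budget)):         # j rows
            if i * j >= budget:
                break
            mi = mutual_information(equal_frequency_bins(x, i), equal_frequency_bins(y, j))
            best = max(best, mi / math.log2(min(i, j)))   # step 2: normalize
    return best                                 # step 3: maximum over all scales

# Hypothetical data: one process variable and three candidate responses.
rng = random.Random(0)
flow = [rng.uniform(0, 1) for _ in range(300)]
mic_linear = mic_approx(flow, [2 * v + 1 for v in flow])           # linear dependence
mic_curve = mic_approx(flow, [(v - 0.5) ** 2 for v in flow])       # nonlinear dependence
mic_noise = mic_approx(flow, [rng.uniform(0, 1) for _ in flow])    # no dependence
```

Ranking the process variables by such a score and keeping the highest-scoring ones corresponds to extracting the key variables. The parabola case shows why a mutual-information measure is preferred here over a linear correlation coefficient, which would be close to zero for that relationship.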
3. The flue gas acid making data cleaning and optimizing method based on the isolated forest and the weighted random forest according to claim 1, wherein an abnormal data identification model based on the isolated forest is established, and outliers and abnormal values in the extracted key variable data set affecting the SO2 conversion rate and the sulfuric acid yield are identified and removed;

the isolated forest algorithm performs repeated binary partitioning of the sample points until each sample point, or a small number of sample points, is partitioned into its own region, wherein normal data usually need to be partitioned many times and are located in a high-density region; after the flue gas acid making data set is processed by the abnormal data identification model, regions of different density are formed, the region in which a datum lies is represented by its abnormal value score, and data with high scores are removed; the calculation method is as follows:

[formulas omitted in source]

wherein C(u) is the average path length of all data in the flue gas acid making data set, S(h_ij, u) is the abnormal value score of the flue gas acid making variable data, u is the number of samples of flue gas acid making data, h_ij is the path length of the flue gas acid making datum x_ij, [symbol omitted in source] is the Euler constant, and E(h_ij) is the average path length of the datum x_ij in the n isolated trees;

according to this calculation method, when the value of S(h_ij, u) is close to 0.5, it cannot be clearly determined whether the datum is an abnormal value in the flue gas acid making data set; when the value of S(h_ij, u) is close to 0, the datum is judged to be normal; when the value of S(h_ij, u) is close to 1, the datum is judged to be an abnormal value; data are removed from the flue gas acid making data set according to the abnormal value score of each datum; the steps of abnormal value identification and removal are as follows:

Step 1, samples of capacity n are randomly selected from the key variable data set extracted according to claim 2 to serve as the training set for training an isolated tree;

Step 2, a variable Q is randomly selected in the training set as the root node, and a cut point T is randomly selected within the value range of Q;

Step 3, samples whose value of the variable is greater than or equal to T are placed at the left node, and samples whose value is less than T are placed at the right node;

Step 4, Steps 2 and 3 are repeated for the data of the left node and the right node until an end condition is met, completing the establishment of the isolated forest model, wherein the end condition is any one of the following three conditions: 1) the tree reaches its maximum height; 2) the values of the corresponding feature of the samples at the node are all equal; 3) the node contains only one sample.
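
The tree-building steps and the scoring rule of claim 3 can be sketched in Python as follows. The algorithm is more widely known in English as the isolation forest. The claim's formulas do not appear in this text, so the sketch assumes the standard forms: C(u) = 2(ln(u - 1) + gamma) - 2(u - 1)/u with gamma the Euler constant, and S = 2^(-E(h)/C(u)). These match the claim's description of scores near 1 as abnormal and scores near 0 as normal. The tree count, subsample size, removal threshold and synthetic data are hypothetical choices, and end condition 2 is checked only for the variable chosen at that node.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def average_path_length(u):
    """C(u): expected path length of an unsuccessful binary-search-tree lookup among u points."""
    if u <= 1:
        return 0.0
    if u == 2:
        return 1.0
    return 2.0 * (math.log(u - 1) + EULER_GAMMA) - 2.0 * (u - 1) / u

def grow_tree(points, depth, max_depth, rng):
    """Split on a random variable Q at a random cut point T until an end condition holds."""
    if len(points) <= 1 or depth >= max_depth:      # end conditions 3 and 1
        return len(points)                          # leaf stores how many points remain
    q = rng.randrange(len(points[0]))
    lo, hi = min(p[q] for p in points), max(p[q] for p in points)
    if lo == hi:                                    # end condition 2 (chosen variable only)
        return len(points)
    t = rng.uniform(lo, hi)
    ge = [p for p in points if p[q] >= t]           # values >= T go to the left node
    lt = [p for p in points if p[q] < t]            # values < T go to the right node
    return (q, t, grow_tree(ge, depth + 1, max_depth, rng),
            grow_tree(lt, depth + 1, max_depth, rng))

def path_length(node, x):
    """Number of splits needed to reach a leaf, adjusted for points left unsplit there."""
    depth = 0
    while isinstance(node, tuple):
        q, t, ge, lt = node
        node = ge if x[q] >= t else lt
        depth += 1
    return depth + average_path_length(node)

def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Return S for every row: close to 1 means abnormal, well below 0.5 means normal."""
    rng = random.Random(seed)
    u = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(u))
    forest = [grow_tree(rng.sample(data, u), 0, max_depth, rng) for _ in range(n_trees)]
    c_u = average_path_length(u)
    return [2.0 ** (-sum(path_length(tree, x) for tree in forest) / n_trees / c_u)
            for x in data]

# Hypothetical data: 200 normal readings of two key variables plus one injected outlier.
data_rng = random.Random(42)
data = [[data_rng.gauss(100, 1), data_rng.gauss(50, 1)] for _ in range(200)]
data.append([130.0, 20.0])
scores = anomaly_scores(data)
# Removing high-score rows leaves gaps (None); the 0.7 threshold is a hypothetical choice.
cleaned = [None if s > 0.7 else row for row, s in zip(data, scores)]
```

The `None` entries in `cleaned` correspond to the missing data set that the weighted random forest of claim 1 is then asked to compensate.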

Description

Flue gas acid making data cleaning and optimizing method based on isolated forest and weighted random forest

Technical Field

The invention belongs to the field of data processing, and particularly relates to a flue gas acid making data cleaning and optimizing method based on an isolated forest and a weighted random forest.

Background

Nonferrous metals such as copper, lead, aluminum and magnesium are important strategic materials for the development of China's national economy and national defense industry, and are also raw materials for manufacturing equipment such as airplanes, rockets, missiles and computers. With the continuing acceleration of China's industrialization and the rapid development of the national economy, the demand of every industry for nonferrous metal resources has increased, so the production of nonferrous metals occupies an important place in China's industrial production. In nature, however, most nonferrous metal minerals exist in the form of sulfides, and a large amount of flue gas containing SO2 is generated during smelting. Discharging SO2-containing flue gas directly into the atmosphere causes a series of environmental problems such as air pollution and soil acidification, and SO2, as a class 3 carcinogen, also poses a great threat to human health. As environmental awareness grows, how to effectively control SO2 in flue gas has therefore become a problem that needs to be solved. Because the SO2 concentration in smelting flue gas is high and varies over a wide range, a mature flue gas desulfurization process, namely flue gas acid making, has been developed. The smelting flue gas acid making industry produces high-concentration sulfuric acid by recovering the SO2 in the flue gas.
Flue gas acid making is a complex, multivariable, strongly coupled nonlinear process. Its operating data are an important basis for state monitoring, operation optimization control, fault diagnosis and similar tasks in the flue gas acid making process, and are the information base for improving sulfuric acid production efficiency and production level. Because the operating environment of the process is complex, the equipment is numerous and the links are strongly coupled, the data obtained by the detection equipment can be seriously contaminated, and abnormal conditions such as missing data and outliers occur easily, which makes the analysis and processing of flue gas acid making data very difficult. Accurately removing outliers from the data and compensating the missing data is therefore of great significance for subsequent modeling and control of the flue gas acid making process. To address the difficulty of identifying the characteristics of abnormal values in a data set, many abnormal data identification methods have been proposed, based on the probability distribution, the density, or the distance between data; specific methods include the Pauta (3-sigma) criterion, the quartile method and DBSCAN clustering. However, analysis methods based on probability distribution are only suitable for data whose distribution characteristics are known, and abnormal value detection based on a clustering algorithm can only find global outliers, so abnormal features of local data are difficult to identify. For compensating a missing data set, widely adopted methods include interpolation, support vector machine regression and BP neural network fitting.
However, interpolation depends too heavily on the quality of the historical data and the adjacent data to compensate accurately for abnormal data in an arbitrary data set. Compensating the data with a neural network requires that the data used to train the network be valid, so other algorithms are needed to assist in judging their validity. In the actual flue gas acid making process, abnormal data contain not only abnormal characteristics of a single variable but also various characteristics of synchronous or asynchronous data across several variables, and the existing abnormal data compensation methods cannot compensate them effectively. Therefore, in the flue gas acid making data cleaning and optimizing method based on the isolated forest and the weighted random forest, the isolated forest rapidly and accurately identifies and removes abnormal data, and the weighted random forest uses ensemble learning over regression trees to fit and predict the trend of the data according to the relationships among the variables, so that the removed abnormal data can be effectively compensated. By identifying, removing and compensating the abnormal data, a valuable data set is obtained,