CN-121997007-A - Feature subset-based prediction model visual comparison and iterative optimization method and system


Abstract

The invention provides a feature subset-based prediction model visual comparison and iterative optimization method and system, belonging to the technical field of machine learning and data visualization. The method comprises: obtaining multivariate time series data and constructing a feature subset; constructing a multivariate time series prediction model from the selected feature subset using an ensemble learning method, and repeating this step to construct different models; computing the correlation between models across multiple dimensions, namely feature subset category, feature importance, algorithm weights, and model evaluation index; recording the user's exploration flow in a hierarchical node-link diagram, performing visual comparative analysis of the models, and optimizing the diagram layout; and obtaining an optimal feature subset after iteratively optimizing the feature subset. By applying visual analysis to the selection of the optimal feature subset, the method assists the user in iteratively selecting feature subsets and building models in combination with domain knowledge, so that different feature subsets and models can be better understood, diagnosed, and compared, which improves the accuracy and reliability of the prediction results.

Inventors

SHAN YUXIANG
GAO YANGHUA
JIN YONG

Assignees

China Tobacco Zhejiang Industrial Co., Ltd. (浙江中烟工业有限责任公司)

Dates

Publication Date: 20260508
Application Date: 20260114

Claims (11)

1. A feature subset-based prediction model visual comparison and iterative optimization method, performed on a computer device comprising a processor, a memory, and a display coupled to the processor, the method comprising: acquiring multivariate time series data of tobacco sales for a target terminal, wherein the multivariate time series data comprises at least historical sales flow feature data, market state feature data, and external environment feature data, and performing preprocessing and feature reconstruction to construct an initial feature subset; based on the initially selected feature subset, constructing a time series prediction model that fuses the XGBoost, LightGBM, and Pathformer algorithms, optimizing the model parameters with a batch gradient descent iterative algorithm, and using the root mean square error (RMSE) as the model evaluation index; receiving, in a graphical user interface, an adjustment instruction from a user for the feature subset and the algorithm weights, reconstructing or reselecting features based on the adjustment instruction to generate a new feature subset, and training a new prediction model based on the new feature subset and the adjusted algorithm weight parameters; for any two models in the exploration history, computing the similarity between the two models as a weighted fusion over four dimensions, namely feature subset category, feature importance, algorithm weights, and model evaluation index, to obtain a similarity value between the models; constructing a node-link diagram based on the similarity values and the model generation order to represent the model evolution path, wherein each node corresponds to a prediction model and each edge represents the model generation order, and guiding the iterative selection of subsequent feature subsets according to the node-link diagram; performing automatic layout optimization on the node-link diagram by establishing an integer linear programming (ILP) model that takes minimizing the number of edge crossings in the node-link diagram as its objective function, and solving it in combination with node levels, relative position variables, edge crossing variables, and transitivity constraints to obtain layout coordinates for the node positions and connecting lines; and iteratively optimizing the feature subset and outputting an optimal feature subset based on the exploration history and the interactive operations on the node-link diagram, wherein the feature reconstruction comprises feature transformation and feature combination, and the output weights of the XGBoost, LightGBM, and Pathformer algorithms are user-defined.
2. The feature subset-based prediction model visual comparison and iterative optimization method of claim 1, wherein said constructing a time series prediction model that fuses the XGBoost, LightGBM, and Pathformer algorithms based on the initially selected feature subset comprises: modeling the processed feature set with the XGBoost algorithm, appending the output of that model to the feature set as an additional feature to obtain an augmented feature set, and using the augmented feature set as the input of LightGBM to obtain a first prediction result; modeling the feature set with XGBoost to obtain a second prediction result; modeling the feature set with Pathformer to obtain a third prediction result; and performing weighted fusion of the three prediction results as the final output of the model.
3. The feature subset-based prediction model visual comparison and iterative optimization method of claim 2, wherein, during single-model training of XGBoost and LightGBM, 5-fold cross-validation is employed and the results are averaged to avoid overfitting.
4. The feature subset-based prediction model visual comparison and iterative optimization method of claim 1, wherein said computing, for any two models in the exploration history, the similarity between the models as a weighted fusion over the four dimensions of feature subset category, feature importance, algorithm weights, and model evaluation index comprises data reconstruction: storing the feature subset, model weights, and final RMSE index used by the model in a JSON file; the feature subset used by the final model is $F = \{f_1, f_2, \dots, f_n\}$; for each feature, calculating its importance to the model to obtain a feature importance dictionary $I = \{f_1: v_1, f_2: v_2, \dots, f_n: v_n\}$, wherein $v_i$ represents the feature importance of feature $f_i$; the model weights are represented by the set $R = \{r_1, r_2, r_3\}$, wherein $r_1$ is the weight of XGBoost in the fusion model, $r_2$ is the weight of LightGBM in the fusion model, and $r_3$ is the weight of LSTM in the fusion model; the model evaluation index is represented by $M = \{m\}$, wherein $m$ is the RMSE value of the model; fusing the feature subset, feature importance, model weights, and model index into a set $G$ according to formula (1) and representing it in JSON format: $G = \{F, I, R, M\}$ (1); and representing model A and model B according to formula (2) and formula (3), respectively: $G_A = \{F_A, I_A, R_A, M_A\}$ (2), $G_B = \{F_B, I_B, R_B, M_B\}$ (3).
5. The feature subset-based prediction model visual comparison and iterative optimization method of claim 4, wherein said computing, for any two models in the exploration history, the similarity between the models as a weighted fusion over the four dimensions of feature subset category, feature importance, algorithm weights, and model evaluation index further comprises: calculating the similarity of the model feature subsets: according to formula (4), the Jaccard similarity is used to measure the similarity between the feature subsets $F_A$ and $F_B$ of models A and B: $J(F_A, F_B) = \frac{|F_A \cap F_B|}{|F_A \cup F_B|}$ (4), wherein $|F_A \cap F_B|$ is the size of the intersection of $F_A$ and $F_B$, $|F_A \cup F_B|$ is the size of their union, and a larger value of $J(F_A, F_B)$ indicates a higher similarity between the two sets; calculating the similarity of the model feature importance: according to formula (5), the cosine similarity is used to compute the similarity between the feature importance sets $I_A$ and $I_B$: $\cos(I_A, I_B) = \frac{I_A \cdot I_B}{\|I_A\| \, \|I_B\|}$ (5), wherein $I_A \cdot I_B$ is the sum of the element-wise products of the two vectors, and $\|I_A\|$ and $\|I_B\|$ are the lengths of the vectors; if the features in $I_A$ and $I_B$ are not identical, the feature space is unified by mapping the elements of each set into a common feature space and filling missing features with 0; a larger value of $\cos(I_A, I_B)$ indicates a higher similarity; calculating the similarity of the model algorithm weights: according to formula (6) and formula (7), the Euclidean distance is used to compute the similarity of the weight sets $R_A$ and $R_B$: $d(R_A, R_B) = \sqrt{\sum_{i=1}^{3} (r_{A,i} - r_{B,i})^2}$ (6), wherein $r_{A,i}$ and $r_{B,i}$ are the algorithm weights of models A and B, and formula (7) normalizes the result to between 0 and 1; and calculating the similarity of the model evaluation indexes: according to formula (8) and formula (9), the absolute value of the difference, $|m_A - m_B|$ (8), is used to measure the similarity of $M_A$ and $M_B$, wherein $m_A$ and $m_B$ are the RMSE values of models A and B, and formula (9) normalizes the result to between 0 and 1.
6. The feature subset-based prediction model visual comparison and iterative optimization method of claim 5, wherein said computing, for any two models in the exploration history, the similarity between the models as a weighted fusion over the four dimensions of feature subset category, feature importance, algorithm weights, and model evaluation index further comprises computing the multidimensional similarity value of different models: according to formula (10), the similarity measures of the four dimensions are weighted and summed, and the magnitude of the similarity is normalized to the interval (0, 1): $S = w_1 S_F + w_2 S_I + w_3 S_R + w_4 S_M$ (10), wherein $S_F$, $S_I$, $S_R$, and $S_M$ are the similarities of the feature subset, feature importance, algorithm weights, and model evaluation index, respectively, and the weight coefficients $w_1$, $w_2$, $w_3$, $w_4$ must satisfy the constraint $w_1 + w_2 + w_3 + w_4 = 1$.
7. The feature subset-based prediction model visual comparison and iterative optimization method of claim 1, wherein the step of recording the result each time the user selects a feature subset to build a model, forming a historical exploration record from those results, and helping the user optimize the subsequent feature selection process by comparison with the historical models comprises: visualizing a single model as a model composite primitive comprising at least a first ring chart representing the weight proportion of each algorithm in the model, a second ring chart representing the RMSE index value of the model, and a surrounding radial histogram representing the importance of each feature; visually displaying the user's whole exploration flow in the form of a node-link diagram, wherein each node is a model composite primitive, and the connecting edges between nodes comprise parent-child relationship edges representing the order in which the models were generated and association relationship edges indicating that the comprehensive similarity between non-parent-child models exceeds a preset threshold; and performing automatic layout optimization on the node-link diagram by establishing an integer linear programming (ILP) model that takes minimizing the number of edge crossings in the diagram as its objective function, and solving it in combination with node levels, relative position variables, edge crossing variables, and transitivity constraints to obtain a clear layout with the minimum number of edge crossings.
8. The feature subset-based prediction model visual comparison and iterative optimization method of claim 7, wherein visually displaying the user's whole exploration flow in the form of a node-link diagram, in which each node is the model composite primitive and the connecting edges between nodes comprise parent-child relationship edges representing the order in which the models were generated and association relationship edges indicating that the comprehensive similarity between non-parent-child models exceeds a preset threshold, comprises: defining the graph: given a suitable hierarchical node-link graph $G$ comprising a node set $V$ and an edge set $E$, each node $v \in V$ is assigned to one of $K$ layers by a hierarchical allocation function, where $K$ is the total number of layers, ensuring that every edge in $E$ connects nodes of different layers; defining the constraints: given the graph $G$, a standard model is defined by position variables, cross constraints, and transitivity constraints, wherein: in the position variables, $x_{i,j}$ represents the relative position of nodes $i$ and $j$ in the same layer, with $x_{i,j} = 1$ if node $i$ is above node $j$ and $x_{i,j} = 0$ otherwise; in the cross variables, $c_{(u_1,v_1),(u_2,v_2)}$ represents whether the edges $(u_1, v_1)$ and $(u_2, v_2)$ in the graph cross, with $c_{(u_1,v_1),(u_2,v_2)} = 1$ if the two edges cross and $c_{(u_1,v_1),(u_2,v_2)} = 0$ otherwise; in the cross constraints, the cross variables are defined according to formula (12) and formula (13) so as to ensure that $c_{(u_1,v_1),(u_2,v_2)} = 1$ whenever edges $(u_1, v_1)$ and $(u_2, v_2)$ cross, wherein formula (12) ensures that the cross variable equals 1 when $u_1$ is above $u_2$ and $v_1$ is below $v_2$, and formula (13) ensures that the cross variable equals 1 when $u_1$ is below $u_2$ and $v_1$ is above $v_2$; and in the transitivity constraints, transitivity is applied directly to all node triples $(i, j, k)$ of the same layer according to formula (14) and formula (15), wherein if $x_{i,j} = 1$ and $x_{j,k} = 1$ then $x_{i,k} = 1$, and if $x_{i,j} = 0$ and $x_{j,k} = 0$ then $x_{i,k} = 0$.
9. The feature subset-based prediction model visual comparison and iterative optimization method of claim 8, wherein said establishing an integer linear programming (ILP) model comprises: combining the position variables, the cross constraints, and the transitivity constraints into an ILP model, wherein an assignment of the position variables and cross variables reproduces a layout of the corresponding graph, the sum of all cross variables equals the number of edge crossings in the graph, and the assignment with the minimum sum represents the layout with the fewest crossings; and minimizing the sum of all cross variables according to the objective function of formula (16): $\min \sum_{k} \sum_{e_1, e_2 \in E_k} c_{e_1, e_2}$ (16), wherein $E_k$ denotes the set of edges between layer $k$ and layer $k+1$, and the cross variable $c_{e_1, e_2}$ indicates whether edge $e_1$ and edge $e_2$ cross, taking the value 1 if they cross and 0 otherwise.
10. The feature subset-based prediction model visual comparison and iterative optimization method of claim 9, wherein said solving in combination with node levels, relative position variables, edge crossing variables, and transitivity constraints to obtain a clear layout with the minimum number of edge crossings comprises: encoding a given input graph as an ILP model using the cross constraints and transitivity constraints and passing it to a solver; and finding, by the solver, the assignment that minimizes the objective function, and processing that assignment to generate an optimized hierarchical node-link diagram, wherein the solving efficiency of the solver is accelerated by a symmetry-breaking method in which, for a first pair of nodes in the same layer, the relative order of the two nodes is assumed and the corresponding position variable is fixed accordingly, and, on this basis, the variable that occurs most frequently in the cross constraints is selected and fixed to 0.
11. A feature subset-based prediction model visual comparison and iterative optimization system, the system comprising: a data acquisition module for acquiring multivariate time series data of tobacco sales, including historical sales, inventory status, and market environment data; a model construction module for constructing, based on the feature subsets, a time series prediction model that fuses the XGBoost, LightGBM, and Pathformer algorithms; a similarity calculation module for calculating the multidimensional similarity between models of different iteration versions; a layout optimization module for calculating the topological layout of the model evolution graph using the ILP model according to any one of claims 7-10; and a feedback update module for responding to feature adjustment instructions and triggering model retraining until the optimal feature subset and model are obtained, so as to output the predicted tobacco sales.
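The weighted fusion of the three prediction results in claim 2, together with the RMSE index of claim 1, can be illustrated with a short sketch. This is only an illustration under stated assumptions: the function names `fuse_predictions` and `rmse` are hypothetical, the three forecasts are taken as already computed by the three algorithms, and the user-defined weights are rescaled to sum to 1, which the claims do not state.

```python
import math
from typing import List, Sequence


def fuse_predictions(predictions: Sequence[Sequence[float]],
                     weights: Sequence[float]) -> List[float]:
    """Weighted fusion of per-algorithm forecasts (hypothetical helper).

    predictions: one forecast series per algorithm, all of equal length.
    weights: one user-defined weight per algorithm.
    Assumption: weights are rescaled so that they sum to 1.
    """
    if len(predictions) != len(weights):
        raise ValueError("one weight is required per forecast series")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    horizon = len(predictions[0])
    if any(len(p) != horizon for p in predictions):
        raise ValueError("all forecast series must have the same length")
    scaled = [w / total for w in weights]
    return [sum(w * p[t] for w, p in zip(scaled, predictions))
            for t in range(horizon)]


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error between observed and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("series must be non-empty and of equal length")
    squared = sum((a - b) ** 2 for a, b in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Three illustrative forecasts over a two-step horizon (made-up numbers).
fused = fuse_predictions([[10.0, 20.0], [20.0, 40.0], [30.0, 60.0]],
                         [0.5, 0.25, 0.25])
```

Changing the weight list and recomputing `rmse` against held-out sales is the kind of adjustment the graphical interface in claim 1 exposes to the user.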
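The four-dimensional similarity of claims 4 to 6 can be sketched as follows. The record keys (`features`, `importance`, `weights`, `rmse`) are hypothetical stand-ins for the JSON fields, and the mapping `1 / (1 + d)` used to bring a distance into the range 0 to 1 is an assumption, since formulas (7) and (9) are not reproduced in the text above. Equal default weights summing to 1 are likewise an assumption.

```python
import math
from typing import Dict, Sequence, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two feature subsets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cosine(imp_a: Dict[str, float], imp_b: Dict[str, float]) -> float:
    """Cosine similarity of importances; missing features are filled with 0."""
    keys = sorted(set(imp_a) | set(imp_b))  # common feature space
    va = [imp_a.get(k, 0.0) for k in keys]
    vb = [imp_b.get(k, 0.0) for k in keys]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0


def to_similarity(distance: float) -> float:
    """ASSUMPTION: map a non-negative distance into (0, 1] via 1 / (1 + d)."""
    return 1.0 / (1.0 + distance)


def weight_similarity(r_a: Sequence[float], r_b: Sequence[float]) -> float:
    """Similarity of algorithm weights from their Euclidean distance."""
    d = math.sqrt(sum((x - y) ** 2 for x, y in zip(r_a, r_b)))
    return to_similarity(d)


def metric_similarity(m_a: float, m_b: float) -> float:
    """Similarity of RMSE values from their absolute difference."""
    return to_similarity(abs(m_a - m_b))


def model_similarity(a: dict, b: dict,
                     w: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> float:
    """Weighted sum of the four per-dimension similarities."""
    if len(w) != 4 or not math.isclose(sum(w), 1.0):
        raise ValueError("four weight coefficients summing to 1 are expected")
    parts = (
        jaccard(set(a["features"]), set(b["features"])),
        cosine(a["importance"], b["importance"]),
        weight_similarity(a["weights"], b["weights"]),
        metric_similarity(a["rmse"], b["rmse"]),
    )
    return sum(wi * si for wi, si in zip(w, parts))


# Two illustrative model records (made-up values).
model_a = {"features": ["sales_lag1", "holiday"],
           "importance": {"sales_lag1": 0.7, "holiday": 0.3},
           "weights": [0.5, 0.3, 0.2], "rmse": 1.0}
model_b = {"features": ["sales_lag1", "inventory"],
           "importance": {"sales_lag1": 0.6, "inventory": 0.4},
           "weights": [0.4, 0.4, 0.2], "rmse": 2.0}
score = model_similarity(model_a, model_b)
```

Pairs whose score exceeds a chosen threshold would be the candidates for the association relationship edges described in claim 7.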
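The objective in claims 8 to 10, the number of crossings between edges that join the same pair of layers, can be made concrete with a small sketch. It does not build or solve an ILP: an exhaustive search over node orderings stands in for the solver, which is only feasible for very small graphs. The function names and the example graph are hypothetical, and edges are assumed to be written as (node in upper layer, node in lower layer).

```python
from itertools import combinations, permutations, product
from typing import List, Sequence, Tuple

Edge = Tuple[str, str]


def count_crossings(layers: Sequence[Sequence[str]],
                    edges: Sequence[Edge]) -> int:
    """Count pairwise edge crossings for a given ordering inside each layer."""
    pos = {n: i for layer in layers for i, n in enumerate(layer)}
    level = {n: k for k, layer in enumerate(layers) for n in layer}
    total = 0
    for (u1, v1), (u2, v2) in combinations(edges, 2):
        # Only edges joining the same pair of layers can cross each other.
        if level[u1] != level[u2] or level[v1] != level[v2]:
            continue
        # The edges cross when their endpoints appear in opposite orders.
        if (pos[u1] - pos[u2]) * (pos[v1] - pos[v2]) < 0:
            total += 1
    return total


def best_layout(layers: Sequence[Sequence[str]],
                edges: Sequence[Edge]) -> Tuple[List[List[str]], int]:
    """Exhaustive stand-in for the ILP solver: try every ordering per layer."""
    best: List[List[str]] = [list(layer) for layer in layers]
    best_cost = count_crossings(best, edges)
    for candidate in product(*(permutations(layer) for layer in layers)):
        cost = count_crossings(candidate, edges)
        if cost < best_cost:
            best, best_cost = [list(layer) for layer in candidate], cost
    return best, best_cost


# Illustrative exploration history: m1 spawned m2 and m3, and so on.
layers = [["m1"], ["m2", "m3"], ["m4", "m5"]]
edges = [("m1", "m2"), ("m1", "m3"), ("m2", "m5"), ("m3", "m4")]
initial = count_crossings(layers, edges)
layout, crossings = best_layout(layers, edges)
```

An ILP formulation reaches the same minimum without enumerating every ordering, which is why the claims rely on a solver together with symmetry breaking once the exploration history grows.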

Description

Feature subset-based prediction model visual comparison and iterative optimization method and system

Technical Field

The invention relates to the technical field of machine learning and data visualization, and in particular to a method and a system for iteratively selecting feature subsets, comparing model performance, and making optimization decisions through an interactive visual interface on a computer device during the construction of a time series prediction model.

Background

Cigarette sales data is affected by seasonal fluctuations, holiday effects, macroeconomic policies, price changes, stock levels, and other complex factors, and exhibits significant nonlinearity, volatility, and hysteresis. In the digital transformation and lean supply chain management of the tobacco industry, high-precision sales demand prediction is a key basis for guiding cigarette production scheduling, logistics distribution, and retail customer ordering.

In the field of machine learning and data mining, feature engineering is a key step in constructing high-performance prediction models. However, when faced with the high-dimensional multivariate time series data generated by tobacco sales (including historical sales flow records, social inventory, tier structures, holiday factors, and so on), traditional feature selection methods often have difficulty capturing the complex coupling relations between variables. The prior art relies on manual trial and error to find the optimal feature combination, so the model iteration process is blind, inefficient, and difficult to trace.

In addition, a large number of candidate model versions are generated during the iterative optimization of the model. Existing analysis methods lack a quantitative means of computing the evolution relations among different models and cannot clearly present the specific path by which changes to the feature subset affect model accuracy (RMSE).
When the number of models increases, the topological structure of the model evolution relationship becomes extremely complex, which wastes computing resources and obscures the optimization direction. Therefore, there is an urgent need for a prediction model construction method that, in combination with the characteristics of tobacco marketing, can converge quickly to an optimal feature subset by quantifying model similarity and automatically optimizing the topology of the evolution path.

Disclosure of Invention

The embodiment of the invention aims to provide a feature subset-based prediction model visual comparison and iterative optimization method and system, which solve the technical problems of difficult feature selection and low iteration efficiency for tobacco sales prediction models, and improve both the prediction accuracy and the efficiency and effect of feature engineering and model optimization.

To achieve the above object, an embodiment of the present invention provides a feature subset-based prediction model visual comparison and iterative optimization method, the method being performed on a computer device comprising a processor, a memory, and a display connected to the processor, and comprising:

Acquiring multivariate time series data of tobacco sales for a target terminal, wherein the multivariate time series data comprises at least historical sales flow feature data, market state feature data, and external environment feature data, and performing preprocessing and feature reconstruction to construct an initial feature subset;

Based on the initially selected feature subset, constructing a time series prediction model that fuses the XGBoost, LightGBM, and Pathformer algorithms, optimizing the model parameters with a batch gradient descent iterative algorithm, and using the root mean square error (RMSE) as the model evaluation index;

Receiving, in a graphical user interface, an adjustment instruction from a user for the feature subset and the algorithm weights, reconstructing or reselecting features based on the adjustment instruction to generate a new feature subset, and training a new prediction model based on the new feature subset and the adjusted algorithm weight parameters;

For any two models in the exploration history, computing the similarity between the two models as a weighted fusion over four dimensions, namely feature subset category, feature importance, algorithm weights, and model evaluation index, to obtain a similarity value between the models;

Constructing a node-link diagram based on the similarity values and the model generation order to represent the model evolution path, wherein each node corresponds to a prediction model and each edge represents the model generation order, and guiding the iterative selection of subsequent feature subsets according to the node-link diagram;

Carrying out automatic layout optimization on the node-link diagram, establishing an integer linear programming ILP mod