CN-121480859-B - Construction method and system of multisource time sequence data fusion prediction model in BI analysis


Abstract

The invention provides a method and a system for constructing a multi-source time series data fusion prediction model in BI analysis, and relates to the technical field of enterprise-level business intelligence analysis. The method comprises: obtaining a multi-dimensional feature set through time series feature extraction and business feature extraction; dynamically fusing the time series features and the business features with an attention mechanism; and constructing and training a combined LSTM and XGBoost prediction model. The system comprises a data acquisition and preprocessing unit, a multi-dimensional feature extraction unit, a fusion prediction model construction unit, and a model verification and optimization unit. The invention addresses insufficient fusion of multi-source time series data, the difficulty models have in capturing complex spatio-temporal characteristics, and the lack of dynamic adaptive capability, and improves the accuracy and timeliness of enterprise-level BI analyses such as sales trend prediction and inventory optimization, thereby improving prediction precision.
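The combined LSTM and XGBoost model can be illustrated with a small sketch. Claim 1 states that the two outputs are integrated by weighted averaging with initial weights of 0.6 (LSTM) and 0.4 (XGBoost), adjusted according to validation error, but it does not give the adjustment rule. The inverse-RMSE rescaling below is therefore an assumption, as are all function names; the predictions are placeholder numbers rather than outputs of trained models.

```python
import math


def rmse(actual, predicted):
    """Root mean square error between two equally long sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def adjust_weights(rmse_lstm, rmse_xgb, prior=(0.6, 0.4)):
    """Rescale the prior weights by inverse validation RMSE.

    Assumption: the source only says the weights are adjusted according to
    validation error; inverse-error scaling is one plausible rule.
    """
    eps = 1e-12  # guards against division by zero for a perfect model
    raw_lstm = prior[0] / (rmse_lstm + eps)
    raw_xgb = prior[1] / (rmse_xgb + eps)
    total = raw_lstm + raw_xgb
    return raw_lstm / total, raw_xgb / total


def blend(pred_lstm, pred_xgb, weights):
    """Weighted average of the two models' predictions."""
    w_lstm, w_xgb = weights
    return [w_lstm * a + w_xgb * b for a, b in zip(pred_lstm, pred_xgb)]


if __name__ == "__main__":
    # Placeholder validation targets and model outputs (hypothetical values).
    y_val = [1.0, 2.0, 3.0, 4.0]
    lstm_val = [1.1, 2.1, 2.9, 4.2]
    xgb_val = [0.8, 2.3, 3.3, 3.6]

    weights = adjust_weights(rmse(y_val, lstm_val), rmse(y_val, xgb_val))
    fused = blend(lstm_val, xgb_val, weights)
    print("weights:", weights)
    print("MAE:", mae(y_val, fused), "RMSE:", rmse(y_val, fused))
```

When both models have the same validation error, this rule returns the prior weights unchanged; the weight of the more accurate model grows as its error shrinks relative to the other.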

Inventors

LIU GUANGJUN

Assignees

北京浩太同益科技发展有限公司

Dates

Publication Date: 20260508
Application Date: 20251112

Claims (10)

1. A method for constructing a multi-source time series data fusion prediction model in BI analysis, characterized by comprising the following steps:
Step 1, acquiring multi-source raw time series data based on the business requirements of enterprise-level business intelligence analysis, and preprocessing the multi-source raw time series data to obtain standardized multi-source time series data, wherein the multi-source raw time series data comprises enterprise internal business system data and external associated data;
Step 2, performing feature extraction on the standardized multi-source time series data to obtain a multi-dimensional feature set, wherein the feature extraction comprises time series feature extraction and business feature extraction, the time series feature extraction calculates trend, periodicity and stationarity features through a time series analysis algorithm, and the business feature extraction mines business index relevance features and dimension attribute features in combination with the product category, region and customer group business dimensions;
Step 3, constructing a multi-source feature fusion module and a prediction model framework, using an attention mechanism to calculate weights for the time series feature subset and the business feature subset to obtain a fusion feature vector, and inputting the fusion feature vector into a combined prediction model formed by an LSTM network and an XGBoost model for training to obtain an initial fusion prediction model;
wherein the LSTM network adopts a three-layer gated recurrent unit structure with 128 neurons per layer, the dropout rate is set to 0.2, the input sequence length is set to 30 time steps, the LSTM network parameters are initialized with the Xavier normal distribution, the number of training epochs is set to 100, the batch size is 32, an Adam optimizer is used for parameter updates, and a learning rate decay strategy is adopted; the XGBoost model adopts a gradient boosting decision tree framework, the maximum tree depth is set to 10, the learning rate is set to 0.1, and an early stopping rule is set during training that stops training when the validation set loss has not decreased for 10 consecutive rounds; the outputs of the LSTM network and the XGBoost model are integrated by weighted averaging, with the LSTM weight set to 0.6 and the XGBoost weight set to 0.4, and the weights are dynamically adjusted according to the error performance on the validation set; the training data is divided into a training set and a validation set at a ratio of 7:3, time series cross-validation is used to preserve temporal continuity, and an RMSE threshold of 0.05 is set in the model training stage as the performance acceptance criterion;
Step 4, verifying and optimizing the initial fusion prediction model: selecting an independent verification data set and calculating prediction error indexes, which comprise the mean absolute error and the root mean square error; if an error exceeds a preset threshold, adjusting the attention mechanism weight parameters and the model hyperparameters and repeating training until the error meets the precision requirement, to obtain a final prediction model.
2. The method of claim 1, wherein the preprocessing in step 1 includes data cleaning, format unification and missing value filling; the data cleaning specifically includes: calculating the deviation of each data point from the mean using a Z-score-based statistical method, and judging a data point to be an outlier and removing it when the absolute value of its Z-score is greater than 3; identifying and removing duplicate values by combining the timestamp and the primary key; and checking sales data against business rules, a negative sales value being regarded as an outlier; the format unification includes: converting the timestamps of the multi-source data into the standard ISO 8601 format and unifying the time zone to UTC; and, for data of different frequencies, aligning the low-frequency data to the high-frequency time axis using a sliding window aggregation method to guarantee consistency in the time dimension; the missing value filling includes: completing internal business system data by time series linear interpolation if the missing rate is lower than 10%, and inferring the missing values based on business rules, such as from period-over-period increments, if the missing rate is higher than 10%.
3. The method of claim 2, wherein step 1 further comprises a data quality evaluation process comprising: a data integrity evaluation, in which the variance of the sampling-point time intervals of each data source is calculated from the sampling-point time series, and a data interpolation completion process is triggered when the variance exceeds a threshold of 0.5; and a time series correlation verification, in which the Pearson correlation coefficients of adjacent data points are calculated, and if a correlation coefficient is lower than 0.8, the temporal continuity of the data is deemed insufficient and the data is smoothed with a sliding window having a window length of 7 days.
4. The construction method of claim 1, wherein the time series feature extraction in step 2 specifically includes: trend feature calculation using a sliding window method with a window length of 30 days, in which, for daily-frequency data, the linear regression slope of the data in each window is calculated as a trend strength index, and if the absolute value of the slope is greater than a threshold of 0.1, the data is determined to have a significant trend and the trend direction is recorded; the trend component is separated by Hodrick-Prescott filtering to calculate a trend contribution rate, and the data is regarded as strong-trend data when the contribution rate exceeds 60%; periodicity feature analysis, in which the periodic pattern of the data is detected based on the autocorrelation function, the autocorrelation coefficients of the time series data at different lag orders are calculated, and the lag order corresponding to the correlation coefficient peak is selected as the main period; and stationarity feature evaluation using the Augmented Dickey-Fuller (ADF) test, in which the ADF statistic is calculated, and if the ADF statistic is less than the critical value of -3.5 at the 1% significance level and the p-value is less than 0.01, the data is determined to be stationary; non-stationary data is differenced, and the resulting stationary data is calculated as a supplementary feature.
5. The construction method of claim 1, wherein the business feature extraction in step 2 comprises: business index relevance mining across the three business dimensions of product category, region and customer group, in which the Pearson correlation coefficients and the mutual information entropy between business indexes are calculated; a strong correlation is judged to exist if the absolute value of a correlation coefficient is greater than 0.7, nonlinear relevance is measured through the mutual information entropy, and a significant dependency is recorded if the entropy exceeds 0.5; the relevance features comprise horizontal relevance and vertical relevance; a business index relevance graph is constructed through a graph neural network, and node centrality features are extracted as relevance strength indexes; and dimension attribute feature extraction, in which data is aggregated by business dimension to generate the inventory turnover rate, sales share and coefficient of variation of gross margin for the product category dimension, the Gini coefficient and concentration index for the region dimension, and the customer activity, repurchase rate and high-value customer share for the customer group dimension.
6. The construction method of claim 1, wherein step 2 further comprises feature fusion and optimization: a multi-dimensional feature matrix is constructed from the time series feature subset and the business feature subset; Min-Max normalization is applied to continuous features to eliminate the influence of differing scales, and one-hot encoding is applied to discrete features; key features are selected by recursive feature elimination with a random forest as the base model, feature importance scores are calculated, the lowest-scoring 10% of features are removed, and the number of features is kept to no more than 50 dimensions; and the feature matrix is reduced in dimension by principal component analysis, retaining a variance contribution rate of 95%, to generate the final multi-dimensional feature set.
7. The construction method of claim 1, wherein the process by which the attention mechanism calculates weights for the time series feature subset and the business feature subset in step 3 comprises: a temporal attention layer calculates the importance weights of the time series features, each weight being determined jointly by the feature's variance and its mutual information entropy with the prediction target, such that features with larger variance and higher mutual information entropy obtain higher weights; a business attention layer calculates the importance weights of the business features, dynamically assigning weights based on the Pearson correlation coefficient between each feature and the business target value, the feature weight being increased by 20% if the absolute value of the correlation coefficient is greater than 0.6; the attention mechanism normalizes the weights with the normalized exponential (softmax) function so that the feature weights sum to 1; and after the weight calculation is completed, the time series features and the business features are fused by weighting according to the weights to form the fusion feature vector.
8. The construction method of claim 1, wherein in step 4 an independent verification data set for enterprise-level business intelligence is selected, the verification data set being independent of the training data set in time range and business dimension and covering a continuous time period after the training set; the verification data sources comprise enterprise internal business system data and external associated data, and the data is preprocessed in the same way as the training set; the mean absolute error and the root mean square error between the prediction results and the actual business target values are calculated; the preset precision thresholds are set dynamically according to the business scenario, the mean absolute error threshold being set to 0.05 and the root mean square error threshold to 0.08, and the mean absolute error threshold being set to 0.1 for an inventory turnover prediction task; the errors are calculated separately by business dimension, including for the product category dimension; if an error index exceeds its threshold, the attention mechanism weight parameters and the hyperparameters of the combined prediction model are adjusted, the adjustment strategy covering the number of hidden layer nodes of the LSTM and the tree depth of XGBoost; the weights of the time series feature subset are updated by gradient descent with a weight update step size of 0.01, the relative weight coefficient is set to be 0.01, the iteration error is set to be equal to the iteration error coefficient, and the iteration error is set to be equal to the iteration coefficient is set to be equal to the iteration coefficient when the iteration coefficient is equal to 3, and iteration coefficient is set to the iteration coefficient is equal to the iteration coefficient.
9. A system for constructing a multi-source time series data fusion prediction model in BI analysis, used for implementing the method of any one of claims 1-8, characterized by comprising a data acquisition and preprocessing unit, a multi-dimensional feature extraction unit, a fusion prediction model construction unit, and a model verification and optimization unit; the data acquisition and preprocessing unit is used for acquiring multi-source raw time series data and preprocessing it to obtain standardized data; the multi-dimensional feature extraction unit is used for performing feature extraction on the standardized data to obtain a multi-dimensional feature set; the fusion prediction model construction unit is used for fusing the features with an attention mechanism and obtaining an initial prediction model by training an LSTM model and an XGBoost model; and the model verification and optimization unit is used for optimizing the model with the verification data set to obtain a final prediction model.
10. The system of claim 9, wherein the data acquisition and preprocessing unit further comprises a data integrity evaluation module and a time series correlation verification module for guaranteeing data quality; the multi-dimensional feature extraction unit further comprises a feature fusion and optimization module for normalizing the feature matrix and reducing its dimension; the attention mechanism in the fusion prediction model construction unit is configured to calculate weights based on variances and correlation coefficients, the LSTM network is configured with a three-layer gated structure, and the XGBoost model is configured as a gradient boosting decision tree; and the model verification and optimization unit further comprises a parameter adjustment module and an iterative optimization module for dynamically adjusting parameters and controlling the training process.
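As an illustration of the business attention layer recited in claim 7, the following is a minimal sketch in plain Python. It assumes the raw score of a feature is the absolute Pearson correlation with the target, raised by 20% when it exceeds 0.6, and then normalized with softmax so the weights sum to 1. The claims do not state what the base score is, so that choice, the function names and the sample data are all hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)


def softmax(scores):
    """Normalized exponential function; subtracts the max for stability."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(features, target, threshold=0.6, boost=1.2):
    """One weight per feature column.

    Assumption: the base score is |r|; the 20% increase above the 0.6
    threshold and the softmax normalization follow claim 7.
    """
    scores = []
    for column in features:
        r = abs(pearson(column, target))
        scores.append(r * boost if r > threshold else r)
    return softmax(scores)


def fuse(features, weights):
    """Scale each feature column by its weight; one fused vector per time step."""
    steps = len(features[0])
    return [[w * col[t] for col, w in zip(features, weights)] for t in range(steps)]


if __name__ == "__main__":
    # Two hypothetical feature columns observed over four time steps.
    features = [[1.0, 2.0, 3.0, 4.0], [4.0, 1.0, 3.0, 2.0]]
    target = [2.0, 4.0, 6.0, 8.0]
    weights = attention_weights(features, target)
    print("weights:", weights)
    print("fused:", fuse(features, weights))
```

The temporal attention layer, which the claim bases on variance and mutual information entropy, would replace only the scoring step; the softmax normalization and the weighted fusion stay the same.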

Description

Construction method and system of multisource time sequence data fusion prediction model in BI analysis

Technical Field

The invention relates to the technical field of enterprise-level business intelligence analysis, and in particular to a method and a system for constructing a multi-source time series data fusion prediction model in BI analysis.

Background

With the deepening of enterprise digital transformation, business intelligence (BI) analysis systems have become a core tool for enterprises to improve operational efficiency, optimize resource allocation and support strategic decisions. In enterprise-level BI analysis, the fusion analysis of multi-source time series data is important to key business scenarios such as sales trend prediction, inventory optimization and risk early warning. The data comprises enterprise internal business system data (such as sales data and inventory data) and external associated data (such as industry trend data and macroeconomic data), and is characterized by diverse sources, heterogeneous frequencies and complex dimensions. However, conventional single-source prediction methods face significant challenges in processing multi-source time series data. First, multi-source data differs in acquisition frequency, data format and semantic dimension; for example, sales data is daily or weekly while macroeconomic data is monthly or quarterly, which makes data alignment difficult and fusion insufficient. Second, models have difficulty capturing the complex spatio-temporal characteristics and business coupling among multi-source data, such as the lagged correlation between sales data and logistics data and the cross-influence of user behavior data and promotional activities; these joint patterns are often ignored, which limits prediction accuracy.
In addition, existing models lack dynamic adaptive capability, cannot respond in time to rapid changes in the business environment, and are easily disturbed by noise, so their predictions deviate substantially, which affects the accuracy and timeliness of enterprise decisions. At present, multi-source time series data fusion prediction technology in enterprise-level BI analysis is still immature, and existing methods mostly rely on simple aggregation or a single model, so they cannot effectively solve problems such as heterogeneous data integration, feature mining and model generalization. Therefore, how to efficiently fuse multi-source time series data and construct a prediction model capable of capturing complex business patterns has become a technical problem to be solved by those skilled in the art.

Disclosure of Invention

The invention aims to provide a method and a system for constructing a multi-source time series data fusion prediction model in BI analysis, so as to solve the problems in the prior art that multi-source time series data fusion is insufficient, that models have difficulty capturing complex spatio-temporal characteristics, and that dynamic adaptive capability is lacking. The specific technical scheme is as follows:

The invention provides a method for constructing a multi-source time series data fusion prediction model in BI analysis, comprising the following steps:
Step 1, acquiring multi-source raw time series data based on the business requirements of enterprise-level business intelligence analysis, and preprocessing the multi-source raw time series data to obtain standardized multi-source time series data, wherein the multi-source raw time series data comprises enterprise internal business system data and external associated data;
Step 2, performing feature extraction on the standardized multi-source time series data to obtain a multi-dimensional feature set, wherein the feature extraction comprises time series feature extraction and business feature extraction, the time series feature extraction calculates trend, periodicity and stationarity features through a time series analysis algorithm, and the business feature extraction mines business index relevance features and dimension attribute features in combination with the product category, region and customer group business dimensions;
Step 3, constructing a multi-source feature fusion module and a prediction model framework, using an attention mechanism to calculate weights for the time series feature subset and the business feature subset to obtain a fusion feature vector, and inputting the fusion feature vector into a combined prediction model formed by an LSTM network and an XGBoost model for training to obtain an initial fusion prediction model;
Step 4, verifying and optimizing the initial fusion prediction model, selecting an independent verification data set to calculate prediction error indexes, wherein the prediction error indexes comprise the mean absolute error and the root mean