CN-121984897-A - Network digital twinning-oriented dual-view flow data sampling method and system

CN121984897ACN 121984897 ACN121984897 ACN 121984897ACN-121984897-A

Abstract

The invention discloses a network digital twin-oriented double-view flow data sampling method and system, which mainly solve the problems of difficult capturing of a critical path and low flow reconstruction accuracy caused by lack of frequency domain perception in a flow sampling method in the prior art. The method comprises the steps of collecting historical flow interaction data in a sliding window of a network node to form a third-order flow tensor, normalizing, extracting node frequency domain features, then smoothly updating to obtain a smooth feature vector, mapping the vector to determine node types, dividing all network source-destination OD (optical density) and flow types, calculating all OD (optical density) pair statistics lever scores, dynamically distributing all types of sampling budget according to the calculated scores, generating a sampling set and a binary sampling mask by adopting a deterministic and random combination strategy, collecting position flow data corresponding to the current moment based on the mask, constructing a tensor complement optimization model, solving the complement, and outputting the restored all network flow data. According to the invention, through the double-view synergy of the frequency domain characteristics and the statistical lever scores, the accurate coverage of the high-value flow path is realized, the sampling cost is reduced, and the flow reconstruction precision of the digital twin network is effectively improved. The method can be applied to measurement and reconstruction of large-scale network flow.

Inventors

MA YINGHONG
YUAN YUEYUE
DONG XU
JIAO YI
LIU QIN
LIU WEI

Assignees

西安电子科技大学

Dates

Publication Date: 20260505
Application Date: 20260205

Claims (10)

1. The network digital twinning-oriented double-view-angle flow data sampling method is characterized by comprising the following steps of: (1) Collecting historical flow interaction data of a network node in a sliding window to form a third-order flow tensor, and carrying out normalization processing on the third-order flow tensor to obtain a normalized flow tensor; (2) Extracting node frequency domain features from the normalized flow tensor, and carrying out smooth updating on the node frequency domain features to obtain a smooth feature vector; (3) Mapping the smooth feature vector, determining node category by combining the frequency domain feature, and dividing the whole network source-destination node OD pair into different flow categories; (4) Calculating the statistical lever scores of all OD pairs, dynamically distributing sampling budget of each flow category according to the scores, selecting partial OD pairs by adopting a strategy combining certainty and random to generate a sampling set Corresponding binary sample mask ; (5) And acquiring flow data at a position corresponding to the current moment based on the sampling mask, constructing a tensor complement optimization model, solving and complementing the observed data, and outputting the recovered whole network flow data.
2. The method of claim 1, wherein the normalization processing is performed on the third-order traffic tensor in (1), and the specific formula is: , Wherein, the Is the flow value after normalization, And Respectively, the minimum value and the maximum value of the flow data in the current sliding window, epsilon is the minimum value which is introduced to prevent the denominator from being zero.
3. The method of claim 1, wherein the extracting node frequency domain features in the normalized traffic tensor and the smoothing updating thereof in (2) comprises: (2a) For normalized flux tensor The first k time steps of (a) are historical data slices, and fast Fourier transform FFT is performed along the time dimension to obtain a frequency domain tensor For a pair of Is defined by the frequency slice of each of the frequency slices of the (a) And (3) performing truncated singular value decomposition: , Wherein, the 、、 A left singular vector matrix, a singular value matrix and a right singular vector matrix of the kth frequency slice respectively; (2b) Based on the decomposition result, respectively constructing instantaneous source node feature matrixes of current time step t And instantaneous destination node feature matrix Each row of the matrix represents a frequency domain feature vector of a node, the vector is formed by splicing singular vectors of the node on all frequency slices after weighting and logarithmic transformation, and the calculation formula is as follows: , Wherein, the 、、 A left singular vector matrix, a singular value matrix and a right singular vector matrix of the kth frequency slice respectively, The operation of the splice is indicated and, The value of the modulus is expressed, Representing a logarithmic transformation operation; (2c) Updating the feature vector using an exponentially weighted moving average mechanism: , Wherein the method comprises the steps of In order to smooth the coefficient of the coefficient, Representing the characteristic matrix smoothed at the current moment and respectively corresponding to the source node smoothing matrix And destination node smoothing matrix , Representing the instantaneous feature matrix calculated in step (2 b), Representing the smoothed feature matrix cached at the previous time.
4. The method of claim 1, wherein mapping the smoothed feature vector and classifying traffic classes in (3) comprises: (3a) Generating a random projection matrix compliant with a standard normal distribution Projecting the smooth feature vector of each node to a low-dimensional space, and mapping the smooth feature vector to a plurality of discrete hash buckets according to projection values to realize the preliminary clustering of the nodes with similar frequency domain fluctuation modes; (3b) Carrying out semantic analysis on the nodes in the hash bucket, and calculating the high-low frequency energy Ratio of the node feature vector: , Wherein, the For the magnitude of the eigenvector at the kth frequency component, And Respectively representing a preset high-frequency band set and a preset low-frequency band set; (3c) Marking a source node and a destination node as Z flow categories with different frequency domain characteristics according to the numerical interval of Ratio; (3d) And combining the source node category and the destination node category to obtain a traffic category set to which each OD pair of the whole network belongs.
5. The method of claim 1, wherein the calculating of the statistical leverage scores for all OD pairs in (4) comprises: (4a) Calculating source node lever score And destination node lever score : , Wherein, the Representing the feature vector of the ith source node of the kth time step, A feature vector representing a jth destination node of a kth time step; (4b) Calculate each OD pair Is a comprehensive statistical lever score of (1) : 。
6. The method of claim 1, wherein the dynamically allocating sampling budgets for each traffic class according to the statistical leverage score in (4) and selecting partial OD pairs using a deterministic and random combined strategy comprises: (4c) Statistics of mth traffic class set The sum of the statistical leverage scores of all the OD pairs contained in a list is defined as the weight of that class Counting the total number of OD pairs contained in the category and defining the total number as the capacity of the category ; (4D) Setting total sampling budget A according to weight of each category Accounting for the proportion of total weight of the whole network, calculating initial budget : ; (4E) Budgeting initial targets for each category And its capacity Comparison is performed: if the target budget for that class exceeds its capacity, execute (4 f), Otherwise, adopting an initial target budget; (4f) Truncating the final number of samples for which the target budget exceeds its capacity class to Recovering the overflowed budget part to a public residual budget pool, iteratively distributing the quota in the residual budget pool to the categories which do not reach the capacity limit according to the weight proportion again until all budgets are distributed, and obtaining the final sampling quantity of each category ; (4G) A mixed sampling strategy is adopted in each flow category, and according to a preset deterministic sampling proportion The final sampling number of the category Splitting into deterministic sampling portions Random sampling Satisfies the following conditions 。
7. The method of claim 6, wherein (4 g) is the final sample number for the class Splitting into deterministic sampling portions Random sampling The implementation is as follows: first, for all OD pairs in the category, the lever score is counted according to the combination Sorting in descending order, before selecting The OD pairs form a deterministic sampling set ; Then, the OD pairs contained in the deterministic sample set are removed from all OD pairs of the class, a residual candidate set is constructed, and random extraction is carried out from the residual candidate set The OD pairs form a random sampling set ; Finally, aggregate the total samples of the category All OD pairs contained in (a) are masked in binary sampling Corresponding position marks of 1, rest position marks of 0, and generates final sampling mask The formula is as follows: , Wherein, the Element values representing the sampling mask tensor at the kth time step, the ith source node, the jth destination node location, Is a collection of samples.
8. The method of claim 1, wherein the (5) is based on a sampling mask The method comprises the steps of collecting flow data at a position corresponding to the current moment, constructing a tensor complement optimization model, solving and complementing observation data, and realizing the steps of: (5a) Based on the low rank characteristic of the network traffic, constructing a convex optimization objective function targeting a minimization of the kernel norm: , Wherein, the For the traffic tensor to be recovered, Is the actual observation tensor that is used to determine, Representing the tensor kernel norm, i.e., the sum of the singular values of all the expansion matrices of the tensor, For the projection operator, defined as: the operator constrains the recovered tensor to be at the sample mask A corresponding position with a value of 1, the value of which must be consistent with the observed data; (5b) Introducing auxiliary tensors And lagrangian multiplier tensor Converting the constrained convex optimization objective function into an unconstrained augmented Lagrangian function form: ; Wherein, the In order to penalize the parameters, The inner product operation is represented by the equation, Representing the Frobenius norm, in this function, As the lagrangian multiplier tensor, To assist the tensor, for handling the kernel norm low rank constraint, The method comprises the steps of processing data consistency constraint for a traffic tensor to be recovered; (5c) Solving the augmented lagrangian function by using an alternate direction multiplier method: 5c1) Fix the first Low rank auxiliary tensor for wheel And lagrangian multiplier tensor By solving extremum problems To obtain the (k+1) th round : , Wherein, the A value representing a reserved non-sampled area; 5c2) Fixing the data consistency tensor for the k+1st round And the Lagrangian multiplier tensor of the kth round By solving extremum problems To obtain the (k+1) th round : , Wherein, the Representing the threshold value as Is a singular value threshold operator; 5c3) Computing data consistency tensors Sum and low rank auxiliary tensor Constraint residual error between the two, and taking the residual error as gradient direction of dual function to punish parameters Updating Lagrangian multiplier tensors for step size to maximize dual objective function : ; 5C4) And (5 c 1) to (5 c 3) are circularly executed, the relative error of the results of two adjacent iterations is calculated, the iteration is stopped until the error is smaller than a preset threshold value, and the finally converged tensor is output as the recovered whole network flow data.
9. A network digital twinning-oriented dual-view traffic data sampling system, comprising: The flow data acquisition module is used for monitoring a network port or reading a log file in real time, and collecting historical flow interaction data of a source node and a destination node in a sliding window in a network to obtain an initial third-order flow tensor; the preprocessing module is used for receiving the initial third-order flow tensor, and carrying out numerical scaling processing on the initial third-order flow tensor by adopting a maximum and minimum normalization method to obtain a normalized flow tensor; The frequency domain feature extraction module is used for carrying out fast Fourier transform on the normalized flow tensor along the time dimension, carrying out truncated singular value decomposition on the frequency slice, and extracting an instantaneous frequency domain feature vector reflecting the node flow fluctuation energy distribution; the frequency domain feature updating module is used for receiving the instantaneous frequency domain feature vector, and carrying out smooth updating on the feature vector by utilizing an exponential weighting moving average mechanism to obtain a smooth feature vector; The flow category dividing module is used for carrying out mapping based on local sensitive hash on the smooth feature vector, determining node categories by combining frequency domain energy bits, and dividing source-destination node pairs of the whole network into different flow category sets; The dual-view sampling decision module is used for calculating comprehensive statistical lever scores of all OD node pairs, dynamically distributing sampling budgets of flow categories by using a weighted water injection algorithm according to the scores, selecting part of OD pairs by adopting a mixed sampling strategy combining certainty with random in each flow category, and generating a sampling set Corresponding binary sample mask tensor ; A traffic tensor complement and reconstruction module for performing a binary sampling mask tensor based on the binary sampling mask tensor And acquiring sparse observation data at the current moment, constructing a tensor complement optimization model by utilizing the low-rank characteristic of the network traffic, and carrying out iterative solution by an alternate direction multiplier method to output full-network complete traffic data after recovery.
10. The system of claim 9, wherein the flow tensor completion and reconstruction module comprises: A sparse traffic acquisition sub-module for receiving the binary sample mask tensor According to the position indication with the value of 1, the actual flow data of the corresponding network OD node pair at the current moment is directionally acquired, and the non-sampling position is set to zero, so that a sparse observation tensor is obtained ; A model construction module for receiving the sparse observation tensor Constructing a minimum kernel norm tensor complement optimization model which takes the consistency of the observed data as a constraint; the model conversion sub-module is used for receiving the tensor complement optimization model and converting the model into an unconstrained augmented Lagrangian function form by introducing an auxiliary tensor and a Lagrangian multiplier; And the iteration solving sub-module is used for solving the augmented Lagrangian function by using an alternate direction multiplier method, and circularly executing data consistency tensor update, low-rank auxiliary tensor update and Lagrangian multiplier update until the algorithm converges, and outputting the restored whole-network complete flow data.

Description

Network digital twinning-oriented dual-view flow data sampling method and system Technical Field The invention belongs to the technical field of network measurement and digital twin, in particular to a flow data sampling and recovering method and a system, which can be applied to measurement and reconstruction of large-scale network flow. Background The digital twin network DTN is a network architecture that accurately maps the topology, state properties and operational behavior of a physical network to virtual space through real-time data interactions. The network flow matrix records flow interaction data between all source-destination node pairs OD in the network, and is a data base stone for constructing a high-fidelity digital twin model, carrying out refined flow engineering and detecting network abnormality. However, with the increase of the network scale, real-time continuous direct measurement is performed on all OD pairs of the whole network, and serious challenges such as high bandwidth overhead and high deployment cost are faced. Therefore, a sparse sampling technology is adopted to collect flow data of a small number of key OD pairs, and a matrix or tensor complement algorithm is utilized to infer the whole network flow, so that the method becomes a key path for realizing low-overhead and high-precision network sensing. Patent document with publication number CN110138614a discloses a method and a system for online network traffic anomaly detection based on tensor model. The method comprises the steps of acquiring current time data in real time through a fixed window, carrying out iterative training and factor updating on the current data by utilizing historical tensor data, and decomposing observation data into a low-rank normal part and a sparse abnormal part, so that the online detection of flow abnormality is realized. The method mainly focuses on separating abnormal traffic by using low-rank sparse decomposition, and lacks an active sampling decision mechanism aiming at a scene with limited sampling resources, so that the structural difference of different OD streams on a frequency domain cannot be fully utilized to guide the optimal configuration of the sampling resources when traffic data are processed. The patent document with the publication number of CN115225528A discloses a network traffic data distributed measurement scheduling method, a system and a medium, which introduces Jensen-Shannon divergence and variance to quantify the distribution change and fluctuation degree of network traffic in adjacent time windows, calculates the importance of each factor matrix row vector, constructs a linear equation set according to the importance, preferentially collects measurement points with high importance, and finally recovers the whole network traffic by using historical data and newly sampled data. Although the method relies on statistical difference in time domain to select sampling points, because the method fails to deeply mine fluctuation characteristics of flow data in frequency domain, fine sampling and budget allocation are difficult to realize for flows with the same statistical distribution but different frequency domain modes. Disclosure of Invention Aiming at the defects of the prior art, the invention provides a network digital twin-oriented double-view flow data sampling method and system so as to realize the fine sampling of different frequency domain mode flows and further improve the reconstruction accuracy of the flows in a digital twin network. In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps: 1. The network digital twinning-oriented double-view-angle flow data sampling method is characterized by comprising the following steps of: (1) Collecting historical flow interaction data of a network node in a sliding window to form a third-order flow tensor, and carrying out normalization processing on the third-order flow tensor to obtain a normalized flow tensor; (2) Extracting node frequency domain features from the normalized flow tensor, and carrying out smooth updating on the node frequency domain features to obtain a smooth feature vector; (3) Mapping the smooth feature vector, determining node category by combining the frequency domain feature, and dividing the whole network source-destination node OD pair into different flow categories; (4) Calculating the statistical lever scores of all OD pairs, dynamically distributing sampling budget of each flow category according to the scores, selecting partial OD pairs by adopting a strategy combining certainty and random to generate a sampling set Corresponding binary sample mask; (5) And acquiring flow data at a position corresponding to the current moment based on the sampling mask, constructing a tensor complement optimization model, solving and complementing the observed data, and outputting the recovered whole network flow data. Further, the mapping of the smoo