CN-121984898-A - Full-flow microsecond measurement method and system based on data plane micro burst measurement
Abstract
The invention discloses a full-flow microsecond measurement method and system based on data plane micro-burst measurement. The method comprises: performing time-slotted traffic statistics on a data flow in the data plane to obtain a current time slot traffic observation value and a previous time slot traffic observation value; computing a first-order difference of the two observation values to obtain a traffic rate variation; performing a lightweight chi-square test, by a logarithmic projection technique, based on the traffic rate variation and a historical accumulated variation to compute a burst score; comparing the burst score with a preset threshold and, if the burst score exceeds the threshold, generating a burst report and reporting it to the control plane; and, at the control plane, receiving the burst reports, filling in data between adjacent burst reports, and reconstructing a continuous traffic rate curve of the data flow. The invention resolves the conflict among communication overhead, hardware compatibility, and measurement granularity in conventional methods, thereby achieving high-fidelity, low-overhead microsecond telemetry.
Inventors
- XIE KUN
- CONG YINCHUAN
- WEN JIGANG
- ZHANG GUANGXING
Assignees
- Hunan University (湖南大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260409
Claims (10)
- 1. A full-flow microsecond measurement method based on data plane micro-burst measurement, characterized by comprising the following steps: in the data plane of a programmable switch, performing time-slotted traffic statistics on a data flow using a timestamped time sketch data structure to obtain a current time slot traffic observation value and a previous time slot traffic observation value; computing, in the data plane, a first-order difference based on the current time slot traffic observation value and the previous time slot traffic observation value to obtain a traffic rate variation; performing, in the data plane, a lightweight chi-square test by a logarithmic projection technique based on the traffic rate variation and a historical accumulated variation, and computing a burst score; comparing the burst score with a preset threshold and, if the burst score exceeds the preset threshold, generating a burst report containing a flow identifier, a timestamp, and a traffic observation value, and reporting the burst report to the control plane; and receiving burst reports at the control plane, filling in data between adjacent burst reports using a linear interpolation algorithm, and reconstructing a continuous traffic rate curve of the data flow.
- 2. The method of claim 1, wherein performing time-slotted traffic statistics on the data flow using the timestamped time sketch data structure to obtain the current time slot traffic observation value and the previous time slot traffic observation value specifically comprises: maintaining, in the data plane, a plurality of parallel sketch instances with timestamp fields, wherein each counter unit of each sketch instance is associated with a timestamp recording the time slot of its latest update; when a data packet arrives, hashing the flow identifier of the data packet to map it to the corresponding counter unit of each sketch instance, reading the timestamp of the counter unit, and comparing it with the current global time slot index; if the timestamp of the counter unit differs from the current global time slot index, determining that the data of the counter unit is out of date, resetting the count value of the counter unit to the length of the current data packet, and updating the timestamp to the current global time slot index; if the timestamp of the counter unit equals the current global time slot index, adding the length of the current data packet to the count value of the counter unit; and managing the roles of the plurality of sketch instances through a three-sketch rotation mechanism, and outputting the current time slot traffic observation value and the previous time slot traffic observation value.
- 3. The method of claim 2, wherein the sketch instances include an active sketch, a stable sketch, and a stale sketch, and wherein managing the roles of the sketch instances through the three-sketch rotation mechanism and outputting the current time slot traffic observation value and the previous time slot traffic observation value specifically comprises: assigning the sketch instance corresponding to the modulo result of the current global time slot index as the active sketch, and accumulating the traffic data of the current time slot in the active sketch; assigning the sketch instance corresponding to the modulo result of the current global time slot index minus one as the stable sketch, and storing and providing the traffic data of the previous time slot through the stable sketch; assigning the sketch instance corresponding to the modulo result of the current global time slot index minus two as the stale sketch, the stale sketch holding the traffic data of the time slot whose index is the current global time slot index minus two; in any time slot, reading the current accumulated count value of the data flow from the active sketch as the current time slot traffic observation value, and reading the historical count value of the data flow from the stable sketch as the previous time slot traffic observation value; and, when the current global time slot index increases, reassigning the roles of the sketch instances according to the modulo rule, so that the former stale sketch is cleared and becomes the new active sketch, the former active sketch becomes the stable sketch, and the former stable sketch becomes the stale sketch.
- 4. The method of claim 1, wherein computing the first-order difference in the data plane based on the current time slot traffic observation value and the previous time slot traffic observation value to obtain the traffic rate variation specifically comprises: reading the previous time slot traffic observation value and the current time slot traffic observation value in an arithmetic logic unit of the data plane; computing, in the arithmetic logic unit, the difference between the current time slot traffic observation value and the previous time slot traffic observation value, and taking the absolute value of the difference to obtain the traffic rate variation; and updating the traffic rate variation into a historical accumulated variation register corresponding to the data flow, the register providing the historical accumulated variation used when computing the burst score.
- 5. The method according to claim 4, wherein performing the lightweight chi-square test in the data plane by the logarithmic projection technique based on the traffic rate variation and the historical accumulated variation, and computing the burst score, comprises: multiplying the traffic rate variation by the current time slot sequence number to obtain a first product; subtracting the first product from the historical accumulated variation and taking the absolute value to obtain a numerator term; multiplying the historical accumulated variation by the current time slot sequence number minus one to obtain a second product; computing the square root of the second product to obtain a denominator term; dividing the numerator term by the denominator term to obtain a normalized burst score; and performing the statistical test on the normalized burst score in the data plane by the logarithmic projection technique to obtain the burst score as approximated under hardware constraints.
- 6. The method according to claim 5, wherein performing the statistical test on the normalized burst score in the data plane by the logarithmic projection technique to obtain the burst score as approximated under hardware constraints specifically comprises: pre-storing a logarithmic lookup table and an exponential lookup table in the data plane, wherein the logarithmic lookup table stores the logarithmic values corresponding to integer-domain values, and the exponential lookup table stores the approximate exponential results corresponding to logarithmic values; performing parameter splitting and lookup, using the logarithmic lookup table, on the numerator term containing the traffic rate variation, the current time slot sequence number, and the historical accumulated variation, to obtain a corresponding first logarithmic value; performing parameter splitting and lookup, using the logarithmic lookup table, on the denominator term containing the current time slot sequence number minus one, to obtain a corresponding second logarithmic value; in an arithmetic logic unit of the data plane, subtracting the second logarithmic value from the first logarithmic value to obtain a result logarithmic value; and using the result logarithmic value as an index to query the exponential lookup table and obtain the approximate burst score.
- 7. The method of claim 6, wherein the method further comprises: obtaining, at the control plane, data plane burst scores computed in the data plane for a preset number of samples, and obtaining CPU burst scores computed on a CPU for the same samples using the chi-square test formula; computing the mean square error over all samples based on the data plane burst scores and the CPU burst scores, the mean square error representing the overall numerical deviation of the approximate computation; computing an accuracy score over all samples by comparing the relative error between each data plane burst score and the corresponding CPU burst score and applying an exponential decay function, the accuracy score representing the sensitivity of the approximate computation to relative error and its fidelity; and evaluating and adjusting the configuration parameters of the logarithmic lookup table and the exponential lookup table according to the computed mean square error and accuracy score.
- 8. The method of claim 1, wherein comparing the burst score with the preset threshold and, if the burst score exceeds the preset threshold, generating the burst report containing the flow identifier, the timestamp, and the traffic observation value and reporting the burst report to the control plane specifically comprises: in the data plane, comparing the burst score with a preset detection threshold to obtain a threshold comparison result; if the threshold comparison result is that the burst score exceeds the detection threshold, determining that the current data flow exhibits a statistically significant burst, and triggering a report generation process; wherein the report generation process includes: extracting the current time slot traffic observation value from the active sketch and the previous time slot traffic observation value from the stable sketch; and combining the flow identifier of the data flow, the timestamp corresponding to the current time slot, the current time slot traffic observation value, and the previous time slot traffic observation value to assemble the burst report; and uploading the burst report to the control plane through a standard digest mechanism or mirror mechanism built into the programmable switch.
- 9. The method according to claim 1, wherein receiving the burst reports at the control plane, filling in the data between adjacent burst reports using the linear interpolation algorithm, and reconstructing the continuous traffic rate curve of the data flow specifically comprises: receiving, at the control plane, burst reports from the data plane through a digest message channel, and parsing each burst report to extract the flow identifier, the timestamp, and the corresponding traffic observation value; sorting the burst reports belonging to the same data flow by timestamp to form a time series of discrete burst report points; for any two temporally adjacent burst report points, computing the average rate-change slope between the two points based on their respective timestamps and traffic observation values; computing and filling in, by a linear interpolation formula using the average rate-change slope, the estimated traffic rate value at each moment between the two burst report points; and traversing all adjacent burst report points of the data flow, performing the filling operation, and connecting the report points and the filled points to generate the continuous traffic rate curve of the data flow.
- 10. A full-flow microsecond measurement system based on data plane micro-burst measurement, the system specifically comprising: a time sketch measurement unit, configured to perform, in the data plane of a programmable switch, time-slotted traffic statistics on a data flow using a timestamped time sketch data structure to obtain a current time slot traffic observation value and a previous time slot traffic observation value; a traffic rate difference unit, configured to compute, in the data plane, a first-order difference based on the current time slot traffic observation value and the previous time slot traffic observation value to obtain a traffic rate variation; an arithmetic approximation unit, configured to perform, in the data plane, a lightweight chi-square test by a logarithmic projection technique based on the traffic rate variation and a historical accumulated variation, and to compute a burst score; a burst filtering unit, configured to compare the burst score with a preset threshold and, if the burst score exceeds the preset threshold, to generate a burst report containing a flow identifier, a timestamp, and a traffic observation value and report the burst report to the control plane; and a data filling and reconstruction unit, configured to receive burst reports at the control plane, fill in data between adjacent burst reports using a linear interpolation algorithm, and reconstruct a continuous traffic rate curve of the data flow.
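The claimed steps can be illustrated with a minimal Python sketch. It is not the patented implementation, which targets a programmable switch pipeline, and every name in it (`TimeSketch`, `log2_fixed`, `exp2_fixed`, `burst_score_lut`, `reconstruct`) is hypothetical. The claims do not specify the following, so they are assumptions: the rotation is taken modulo 3, the sketch is queried count-min style (minimum across rows), CRC32 serves as the hash, the table sizes and fixed-point widths are arbitrary, and the "parameter splitting" of claim 6 is read as a most-significant-bit position plus a mantissa lookup. The score functions assume a historical accumulated variation greater than zero and a time slot sequence number greater than one.

```python
import math
import zlib

NUM_SKETCHES, ROWS, WIDTH = 3, 2, 1024   # sizes are illustrative assumptions
FRAC, MANT = 8, 6                        # fixed-point fraction bits, mantissa index bits

# Lookup tables that would be pre-installed in switch memory (claim 6).
LOG_LUT = [round(math.log2(1 + m / 2**MANT) * 2**FRAC) for m in range(2**MANT)]
EXP_LUT = [round(2 ** (f / 2**FRAC) * 2**FRAC) for f in range(2**FRAC)]


class TimeSketch:
    """Three timestamped sketches; slot % 3 selects the active one (claims 2-3)."""

    def __init__(self):
        # Each counter unit is [count, slot_of_last_update]; -1 means never written.
        self.cells = [[[[0, -1] for _ in range(WIDTH)] for _ in range(ROWS)]
                      for _ in range(NUM_SKETCHES)]

    @staticmethod
    def _index(flow_id, row):
        return zlib.crc32(f"{row}:{flow_id}".encode()) % WIDTH

    def update(self, flow_id, pkt_len, slot):
        for row, counters in enumerate(self.cells[slot % NUM_SKETCHES]):
            cell = counters[self._index(flow_id, row)]
            if cell[1] != slot:               # stale timestamp: reset lazily
                cell[0], cell[1] = pkt_len, slot
            else:                             # same slot: accumulate bytes
                cell[0] += pkt_len

    def query(self, flow_id, slot):
        """Count-min style estimate; cells last written in another slot read as 0."""
        values = []
        for row, counters in enumerate(self.cells[slot % NUM_SKETCHES]):
            count, stamp = counters[self._index(flow_id, row)]
            values.append(count if stamp == slot else 0)
        return min(values)


def log2_fixed(x):
    """Fixed-point log2 of a positive integer: MSB position plus mantissa lookup."""
    p = x.bit_length() - 1
    mant = (x >> (p - MANT)) if p >= MANT else (x << (MANT - p))
    return (p << FRAC) + LOG_LUT[mant & (2**MANT - 1)]


def exp2_fixed(y):
    """Inverse of log2_fixed; returns 2**(y / 2**FRAC) scaled by 2**FRAC."""
    e, frac = y >> FRAC, y & (2**FRAC - 1)
    return EXP_LUT[frac] << e if e >= 0 else EXP_LUT[frac] >> -e


def burst_score_exact(delta, hist, n):
    """Reference formula from claim 5: |hist - n*delta| / sqrt(hist * (n - 1))."""
    return abs(hist - n * delta) / math.sqrt(hist * (n - 1))


def burst_score_lut(delta, hist, n):
    """Integer-only approximation: subtract logarithms, then one exponent lookup."""
    num = abs(hist - n * delta)
    if num == 0:
        return 0.0
    log_den = (log2_fixed(hist) + log2_fixed(n - 1)) >> 1   # log of a square root
    return exp2_fixed(log2_fixed(num) - log_den) / 2**FRAC


def reconstruct(reports):
    """Control plane: linear interpolation between reports [(slot, value), ...].

    Assumes at most one report per slot for the flow.
    """
    points = sorted(reports)
    curve = {}
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        slope = (v1 - v0) / (t1 - t0)
        for t in range(t0, t1):
            curve[t] = v0 + slope * (t - t0)
    if points:
        curve[points[-1][0]] = points[-1][1]
    return curve


sketch = TimeSketch()
for slot, length in [(5, 100), (5, 50), (6, 900)]:
    sketch.update("flowA", length, slot)
cur, prev = sketch.query("flowA", 6), sketch.query("flowA", 5)
delta = abs(cur - prev)                  # first-order difference (claim 4)
print(burst_score_exact(delta, 1200, 10), burst_score_lut(delta, 1200, 10))
```

The lazy reset in `update` means no counter is ever cleared in bulk: a counter unit whose timestamp differs from the current slot is simply overwritten by the next packet that hashes to it. Comparing `burst_score_lut` against `burst_score_exact` over many samples is one way to perform the kind of accuracy evaluation described in claim 7.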
Description
Full-flow microsecond measurement method and system based on data plane micro-burst measurement
Technical Field
The invention relates to the technical field of computer network management and monitoring, and in particular to a full-flow microsecond measurement method and system based on data plane micro-burst measurement.
Background
With the spread of cloud computing, big data, and high-performance computing, link rates in modern data center networks have risen to 100 Gbps and beyond. In such high-speed networks, traffic is highly bursty, and micro-bursts tend to appear and disappear on microsecond (μs) timescales. Although very short, these fluctuations are enough to cause packet loss and added latency, severely degrading time-sensitive applications such as RDMA. Network visibility at microsecond granularity is therefore critical for fault diagnosis, congestion control, and resource scheduling.
Traditional network monitoring methods fall into two categories. The first relies on CPU-based periodic polling (such as SNMP) or sampling (such as sFlow/NetFlow); these methods usually run at millisecond or even second granularity, cannot capture microsecond-level transient events, and lose a great deal of detail through sampling. The second is sketch-based telemetry; although measurement is performed in the data plane, the control plane must pull the entire sketch from the switch at high frequency (for example, every few microseconds) to obtain fine-grained data.
However, the prior art suffers from the following fundamental conflicts:
1. Communication bandwidth bottleneck: per-flow microsecond-level monitoring of the whole network would consume a large amount of bandwidth (for example, on the order of 100 Gbps) if all measurement data were reported to the control plane in real time, which is unacceptable in a production environment.
2. Limited hardware resources: the data plane of a programmable switch (for example, Intel Tofino) processes packets quickly but is extremely resource-constrained. It supports neither floating-point operations nor loops, and its on-chip memory (SRAM, TCAM) is limited. This makes complex statistical detection algorithms difficult to run at line rate.
3. Difficulty of state maintenance: in continuous microsecond-level measurement, efficiently clearing or resetting the counters of the previous time slot without stalling the pipeline is a hard engineering problem.
Existing solutions such as μMon attempt to use wavelet-transform compression, but they often need to buffer data over a longer window, resulting in detection delays of several milliseconds, which cannot meet the requirements of real-time control. There is therefore a need for a telemetry method that captures microsecond transients, keeps communication overhead very low, and fits the hardware constraints of programmable switches.
Disclosure of Invention
The invention aims to provide a full-flow microsecond measurement method and system based on data plane micro-burst measurement, namely a microsecond full-flow measurement method built on a "Burst-on-Burst" concept (the BurstMon method). It resolves the conflict among communication overhead, hardware compatibility, and measurement granularity in existing methods, thereby achieving high-fidelity, low-overhead microsecond telemetry and solving at least one of the problems in the prior art.
In a first aspect, the present invention provides a full-flow microsecond measurement method based on data plane micro-burst measurement, the method specifically comprising: in the data plane of a programmable switch, performing time-slotted traffic statistics on a data flow using a timestamped time sketch data structure to obtain a current time slot traffic observation value and a previous time slot traffic observation value; computing, in the data plane, a first-order difference based on the current time slot traffic observation value and the previous time slot traffic observation value to obtain a traffic rate variation; performing, in the data plane, a lightweight chi-square test by a logarithmic projection technique based on the traffic rate variation and a historical accumulated variation, and computing a burst score; comparing the burst score with a preset threshold and, if the burst score exceeds the preset threshold, generating a burst report containing a flow identifier, a timestamp, and a traffic observation value, and reporting the burst report to the control plane; and receiving burst reports at the control plane, filling in data between adjacent burst reports using a linear interpolation algorithm, and reconstructing a continuous traffic rate curve of the data flow. In a second aspect, the present invention provides a full-flow microsecond measurement system