CN-122001796-A - Method and device for detecting network performance abnormality of phase sensing

Abstract

The invention discloses a phase-aware method and device for detecting network performance anomalies, and belongs to the technical field of network anomaly detection. The method comprises: obtaining sample time sequence data and a communication phase label; dividing the sample time sequence data into a training set and a verification set, and training a time sequence detection model; inputting the verification set into the trained time sequence detection model to obtain the anomaly score of each index in the verification set at each time point, so as to set anomaly judgment thresholds; collecting detection time sequence data and identifying its communication phases; inputting the detection time sequence data into the trained time sequence detection model to obtain the anomaly score of each index in the detection time sequence data at each time point; and comparing the anomaly score of each index in the detection time sequence data with the corresponding anomaly judgment threshold to determine an anomaly detection result. By distinguishing different communication phases and detecting each index independently, the invention addresses the problems of false alarms and missed detections in anomaly detection.
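
The thresholding and comparison steps summarized above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the anomaly scores have already been produced by some trained model, and the function names `fit_phase_thresholds` and `detect`, the integer phase labels, and the quantile level `q` are all hypothetical choices made for the example.

```python
import numpy as np

def fit_phase_thresholds(scores, phases, q=0.99):
    """Set one threshold per (communication phase, index) pair.

    scores: (T, M) anomaly scores on the verification set (T time points, M indexes).
    phases: (T,) integer phase label of each time point (assumed encoding).
    q:      quantile level used as the upper bound of normal scores (assumed value).
    """
    thresholds = {}
    for p in np.unique(phases):
        # Quantile over the time points of this phase only, separately per index.
        thresholds[int(p)] = np.quantile(scores[phases == p], q, axis=0)
    return thresholds

def detect(scores, phases, thresholds):
    """Flag (time point, index) pairs whose score exceeds the threshold
    of that index in the phase the time point belongs to."""
    limits = np.stack([thresholds[int(p)] for p in phases])  # shape (T, M)
    return scores > limits

# Toy data: phase 1 has much larger normal scores than phase 0.
val_scores = np.array([[0.1, 1.0], [0.2, 2.0], [5.0, 10.0], [6.0, 20.0]])
val_phases = np.array([0, 0, 1, 1])
thr = fit_phase_thresholds(val_scores, val_phases, q=1.0)

test_scores = np.array([[0.3, 1.5], [5.5, 25.0]])
test_phases = np.array([0, 1])
flags = detect(test_scores, test_phases, thr)
print(flags)  # [[ True False] [False  True]]
```

In this toy run the score 5.5 is not flagged because it falls in phase 1, where larger scores are normal; a single global threshold fitted on phase 0 would have flagged it.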

Inventors

  • ZHANG YIRAN
  • XIA HANG
  • ZHOU AO
  • WANG SHANGGUANG

Assignees

  • 北京邮电大学

Dates

Publication Date
20260508
Application Date
20260130

Claims (10)

  1. A phase-aware method for detecting network performance anomalies, characterized by comprising the following steps: acquiring sample time sequence data of a plurality of indexes changing over time in a network system and a communication phase label of the sample time sequence data, wherein the communication phase label is used for indicating the communication phase to which each time point belongs; dividing the sample time sequence data into a training set and a verification set, and training a time sequence detection model by using the training set and the verification set, so that the time sequence detection model learns a normal network behavior mode under each communication phase; inputting the verification set into the trained time sequence detection model to obtain the anomaly score of each index in the verification set at each time point, so as to set a corresponding anomaly judgment threshold for each index in each communication phase; collecting detection time sequence data in the network system, and identifying the communication phase to which each time point in the detection time sequence data belongs; inputting the detection time sequence data into the trained time sequence detection model to obtain the anomaly score of each index in the detection time sequence data at each time point; and comparing the anomaly score of each index at each time point in the detection time sequence data with the anomaly judgment threshold of the corresponding index in the corresponding communication phase, and determining an anomaly detection result of the detection time sequence data according to the comparison result.
  2. The method of claim 1, wherein training the time sequence detection model by using the training set and the verification set comprises: inputting the training set and the verification set into the time sequence detection model; reconstructing the training set through the time sequence detection model to obtain corresponding reconstruction data, and learning the data association relationship in the training set through a multi-head self-attention mechanism of the time sequence detection model; performing multiple rounds of iterative training on the time sequence detection model by using the training set, with minimization of a joint loss function formed by a reconstruction loss and an association difference loss as the training target, wherein the reconstruction loss represents the difference between the reconstruction data obtained by the time sequence detection model reconstructing the training set and the original data of the training set, and the association difference loss represents the difference between the data association relationship learned by the time sequence detection model from the training set and the data association relationship under a predefined normal network behavior mode; and after each round of iterative training, verifying the time sequence detection model by using the verification set, and obtaining the trained time sequence detection model when a preset iteration condition is reached.
  3. The method of claim 2, wherein inputting the verification set into the trained time sequence detection model to obtain the anomaly score of each index in the verification set at each time point comprises: inputting the verification set into the trained time sequence detection model; reconstructing the sample time sequence data in the verification set through the time sequence detection model to obtain corresponding reconstructed time sequence data, wherein the reconstructed time sequence data comprises reconstructed data of each index at each time point; for each index, extracting the original data of the index at each time point from the verification set, and extracting the reconstructed data of the index at each time point from the reconstructed time sequence data; calculating the difference between the original data and the reconstructed data of the index at each time point to obtain a reconstruction error of the index at each time point; and determining the anomaly score of the index at each time point according to the reconstruction error of the index at each time point.
  4. The method of claim 2, wherein inputting the verification set into the trained time sequence detection model to obtain the anomaly score of each index in the verification set at each time point comprises: inputting the verification set into the trained time sequence detection model; calculating, through the time sequence detection model and based on a multi-head self-attention mechanism, the data association strength between every two time points in the verification set to obtain a content association matrix, wherein each element in the content association matrix represents the data association strength between two time points; acquiring a predefined prior association matrix, wherein each element in the prior association matrix represents the expected data association strength between two time points in a normal network behavior mode; calculating the difference between the content association matrix and the predefined prior association matrix to obtain an association difference score of the verification set, wherein the association difference score characterizes the degree to which the verification set deviates from the normal network behavior mode; and decomposing the association difference score into independent scores of each index at each time point, and taking each independent score as the anomaly score of the corresponding index at the corresponding time point.
  5. The method according to claim 1, wherein setting the corresponding anomaly judgment threshold for each index in each communication phase comprises: dividing the anomaly scores of each index at each time point according to the communication phase of each time point in the verification set, to obtain an anomaly score subset corresponding to each index under each communication phase; and, for each index under each communication phase, applying a preset quantile statistical method to the anomaly score subset of the index to obtain the anomaly judgment threshold of the index in the communication phase, wherein the anomaly judgment threshold represents the upper bound of the normal fluctuation range of the anomaly score of the corresponding index under the corresponding communication phase.
  6. The method of claim 1, further comprising, after acquiring the sample time sequence data of the plurality of indexes changing over time in the network system and the communication phase label of the sample time sequence data: calculating standardization parameters of each index in each communication phase in the sample time sequence data according to the communication phase label, and standardizing the sample time sequence data by using the standardization parameters to obtain standardized sample time sequence data; dividing the standardized sample time sequence data into a training set and a verification set, and training the time sequence detection model by using the training set and the verification set, so that the time sequence detection model learns a normal network behavior mode under each communication phase; and, after collecting the detection time sequence data in the network system and identifying the communication phase to which each time point in the detection time sequence data belongs, further comprising: standardizing the detection time sequence data based on the standardization parameters to obtain standardized detection time sequence data; and inputting the standardized detection time sequence data into the trained time sequence detection model to obtain the anomaly score of each index in the detection time sequence data at each time point.
  7. The method of claim 6, wherein calculating the standardization parameters of each index in each communication phase in the sample time sequence data according to the communication phase label, and standardizing the sample time sequence data by using the standardization parameters to obtain the standardized sample time sequence data, comprises: determining the communication phase to which each time point in the sample time sequence data belongs according to the communication phase label; for each index in each communication phase, calculating the mean and variance of the index in the communication phase as the standardization parameters of the index in the communication phase; for each index at each time point in the sample time sequence data, standardizing the data of the index at the time point, according to the communication phase of the time point, by using the standardization parameters of the index in that communication phase, to obtain standardized data of the index at the time point; and combining the standardized data of all indexes at all time points to obtain the standardized sample time sequence data.
  8. A phase-aware network performance anomaly detection apparatus, comprising: an acquisition module, used for acquiring sample time sequence data of a plurality of indexes changing over time in a network system and a communication phase label of the sample time sequence data, wherein the communication phase label is used for indicating the communication phase to which each time point belongs; a training module, used for dividing the sample time sequence data into a training set and a verification set, and training a time sequence detection model by using the training set and the verification set, so that the time sequence detection model learns a normal network behavior mode under each communication phase; a threshold setting module, used for inputting the verification set into the trained time sequence detection model to obtain the anomaly score of each index in the verification set at each time point, so as to set a corresponding anomaly judgment threshold for each index in each communication phase; a collection module, used for collecting detection time sequence data in the network system and identifying the communication phase to which each time point in the detection time sequence data belongs; an input module, used for inputting the detection time sequence data into the trained time sequence detection model to obtain the anomaly score of each index in the detection time sequence data at each time point; and a comparison module, used for comparing the anomaly score of each index at each time point in the detection time sequence data with the anomaly judgment threshold of the corresponding index in the corresponding communication phase, and determining an anomaly detection result of the detection time sequence data according to the comparison result.
  9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and capable of running on the processor, wherein the computer program, when executed by the processor, carries out the steps of the method according to any one of claims 1-7.
  10. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
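
The per-phase standardization described in claims 6 and 7 can be sketched as below. This is an illustrative reading only: the claims specify the mean and variance of each index in each communication phase as the standardization parameters, while the z-score formula, the helper names `fit_phase_stats` and `standardize`, the integer phase labels, and the small `eps` guard against zero variance are assumptions made for the example.

```python
import numpy as np

def fit_phase_stats(data, phases, eps=1e-8):
    """Compute (mean, variance) of every index separately for each phase.

    data:   (T, M) sample time sequence data (T time points, M indexes).
    phases: (T,) integer phase label of each time point (assumed encoding).
    eps:    small constant added to the variance to avoid division by zero (assumption).
    """
    stats = {}
    for p in np.unique(phases):
        block = data[phases == p]
        stats[int(p)] = (block.mean(axis=0), block.var(axis=0) + eps)
    return stats

def standardize(data, phases, stats):
    """Standardize each time point with the parameters of its own phase.

    Raises KeyError if a phase appears that was not seen when fitting.
    """
    out = np.empty(data.shape, dtype=float)
    for p in np.unique(phases):
        mean, var = stats[int(p)]
        mask = phases == p
        out[mask] = (data[mask] - mean) / np.sqrt(var)
    return out

# Toy data: the two phases live on very different scales.
sample = np.array([[1.0, 10.0], [3.0, 30.0], [100.0, 0.0], [300.0, 4.0]])
sample_phases = np.array([0, 0, 1, 1])
stats = fit_phase_stats(sample, sample_phases)
z = standardize(sample, sample_phases, stats)
```

The same `stats` fitted on the sample data would then be reused to standardize the detection data, mirroring the claim that detection data is standardized "based on the standardization parameters". After standardization, both phases in the toy data map to values near -1 and 1 despite their different raw scales.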

Description

Method and device for detecting network performance abnormality of phase sensing

Technical Field

The invention belongs to the technical field of network anomaly detection, and particularly relates to a phase-aware method and device for detecting network performance anomalies.

Background

With the rapid development of distributed computing and large-scale model training, modern network systems exhibit pronounced phased operating characteristics. For example, in a large model training network, task execution typically involves multiple communication phases, such as computation, gradient aggregation, and parameter distribution, and network traffic patterns, load intensities, and performance index distributions differ significantly between the phases. The occurrence of performance anomalies such as network congestion is often closely related to a specific communication phase.

At present, various methods have been proposed in the field of time sequence anomaly detection. However, in a phased network scenario, the index distributions of different communication phases differ greatly. Because conventional methods cannot distinguish phase characteristics, they often misjudge the normal abrupt index changes caused by phase switching as anomalies, which leads to a high false alarm rate and reduced detection accuracy. Meanwhile, because the fluctuation intensities of the indexes differ, strongly fluctuating indexes tend to mask the anomaly signals of weakly fluctuating indexes, which further impairs the accurate identification of performance anomalies such as network congestion. Therefore, the related art has difficulty achieving accurate anomaly detection in highly structured phased networks with multi-modal traffic.

Disclosure of Invention

The embodiments of the invention aim to provide a phase-aware method and device for detecting network performance anomalies, which can solve the problems described in the background.

In order to solve the above technical problems, the invention is realized as follows:

In a first aspect, an embodiment of the present invention provides a phase-aware method for detecting network performance anomalies, including: acquiring sample time sequence data of a plurality of indexes changing over time in a network system and a communication phase label of the sample time sequence data, wherein the communication phase label is used for indicating the communication phase to which each time point belongs; dividing the sample time sequence data into a training set and a verification set, and training a time sequence detection model by using the training set and the verification set, so that the time sequence detection model learns a normal network behavior mode under each communication phase; inputting the verification set into the trained time sequence detection model to obtain the anomaly score of each index in the verification set at each time point, so as to set a corresponding anomaly judgment threshold for each index in each communication phase; collecting detection time sequence data in the network system, and identifying the communication phase to which each time point in the detection time sequence data belongs; inputting the detection time sequence data into the trained time sequence detection model to obtain the anomaly score of each index in the detection time sequence data at each time point; and comparing the anomaly score of each index at each time point in the detection time sequence data with the anomaly judgment threshold of the corresponding index in the corresponding communication phase, and determining an anomaly detection result of the detection time sequence data according to the comparison result.

Optionally, training the time sequence detection model by using the training set and the verification set includes: inputting the training set and the verification set into the time sequence detection model; reconstructing the training set through the time sequence detection model to obtain corresponding reconstruction data, and learning the data association relationship in the training set through a multi-head self-attention mechanism of the time sequence detection model; performing multiple rounds of iterative training on the time sequence detection model by using the training set, with minimization of a joint loss function formed by a reconstruction loss and an association difference loss as the training target, wherein the reconstruction loss represents the difference between the reconstruction data obtained by the time sequence detection model reconstructing the training set and the original data of the training set, and the association difference loss represents the difference between the data association relationship learned by the time sequence detection model from the training set and the data association relationship under a predefined normal network behavior mode; and after each round of