CN-120416100-B - Network communication monitoring method, device, electronic equipment and medium


Abstract

The embodiments of the invention relate to the technical field of network communication monitoring and provide a network communication monitoring method, a device, electronic equipment and a medium. The method comprises: collecting, in real time, meta information of each data block at the egress of a network card while the network card executes a collective communication operation; processing the meta information to generate an evolution rate sequence vector set of all queue pairs of all communication endpoints under the same collective communication operation; inputting the evolution rate sequence vector set into a fault detection model to obtain the predicted rate of each queue pair output by the fault detection model; comparing the predicted rate of each queue pair with the corresponding actual traffic rate to obtain a rate error; calculating the error mean and standard deviation over all queue pairs based on the rate error of each queue pair; judging whether network communication is abnormal based on the error mean and the standard deviation; and determining the position of the abnormal point based on the queue pair numbers. The network communication state is thereby monitored efficiently and in real time, and faulty nodes are identified rapidly.

Inventors

XIN QI
HU XIAOHE

Assignees

北京基流科技有限公司

Dates

Publication Date: 20260508
Application Date: 20250508

Claims (9)

1. A method for monitoring network communications, comprising: acquiring, in real time, meta information of each data block at a network card egress when the network card executes a collective communication operation, wherein the meta information comprises the size of each data block, a transmission time window and a queue pair number corresponding to each data block, the queue pair comprises a sending queue and a receiving queue, and the queue pair number is a unique identifier assigned to each queue pair and is used for identifying different communication channels; performing data processing on the meta information to generate an evolution rate sequence vector set of all queue pairs of all communication endpoints under the same collective communication operation; inputting the evolution rate sequence vector set into a fault detection model to obtain the predicted rate of each queue pair output by the fault detection model; comparing the predicted rate of each queue pair with the corresponding actual rate to obtain a rate error; calculating an error mean and a standard deviation over all the queue pairs based on the rate error of each queue pair; and performing network communication anomaly judgment based on the error mean and the standard deviation, and determining an abnormal point position based on the queue pair numbers; wherein the performing network communication anomaly judgment based on the error mean and the standard deviation, and determining an abnormal point position based on the queue pair numbers, comprises: when the error mean is greater than an error threshold, determining that network communication is abnormal, wherein the error threshold is dynamically adjusted from the error mean and the standard deviation in combination with a preset adjustment parameter; when network communication is determined to be abnormal, querying the queue pair number corresponding to the abnormal traffic, and determining the abnormal point position based on the queue pair number; and when the error mean is less than or equal to the error threshold, determining that network communication is normal.
2. The method of claim 1, wherein the performing data processing on the meta information to generate an evolution rate sequence vector set of all queue pairs of all communication endpoints under the same collective communication operation comprises: performing data cleaning on the meta information to obtain valid metadata; and normalizing the valid metadata, and aggregating the normalized valid metadata to generate the evolution rate sequence vector set of all queue pairs of all communication endpoints under the same collective communication operation.
3. The method according to claim 1 or 2, wherein the fault detection model is trained by: acquiring a sample evolution rate sequence vector set of all queue pairs created when the network card executes the collective communication operation; and inputting the sample evolution rate sequence vector set into a Transformer model for online learning, and stopping model training when a preset early-stopping condition is met, so as to obtain the fault detection model, wherein the preset early-stopping condition comprises no improvement in validation loss, no improvement in validation accuracy, or an increasing gap between the training loss and the validation loss.
4. The method of claim 3, wherein said inputting the sample evolution rate sequence vector set into a Transformer model for online learning comprises: inputting the sample evolution rate sequence vector set into the Transformer model, and, while the Transformer model learns from the sample evolution rate sequence vector set, using an adaptive Patch processing method that dynamically adjusts the Patch size according to the degree of data fluctuation in the time series, so as to optimize the prediction performance of the Transformer model.
5. The method of claim 4, wherein said using an adaptive Patch processing method that dynamically adjusts the Patch size according to the degree of data fluctuation in the time series, so as to optimize the prediction performance of the Transformer model, comprises: determining the data change rate of the time series according to the degree of data fluctuation in the time series; when the data change rate is greater than a first fluctuation threshold, capturing the changes in the sample evolution rate sequence vector set using a minimum Patch; when the data change rate is less than a second fluctuation threshold, capturing the changes in the sample evolution rate sequence vector set using a maximum Patch; and when the data change rate is greater than the second fluctuation threshold and less than the first fluctuation threshold, adaptively adjusting the Patch size using a preset formula.
6. A network communication monitoring apparatus, comprising: a data acquisition module, configured to acquire, in real time, meta information of each data block at a network card egress when the network card executes a collective communication operation, wherein the meta information comprises the size of each data block, a transmission time window and a queue pair number corresponding to each data block, the queue pair comprises a sending queue and a receiving queue, and the queue pair number is a unique identifier assigned to each queue pair and is used for identifying different communication channels; a prediction module, configured to input the evolution rate sequence vector set into a fault detection model to obtain the predicted rate of each queue pair output by the fault detection model; a comparison module, configured to compare the predicted rate of each queue pair with the corresponding actual rate to obtain a rate error; and a judging module, configured to perform network communication anomaly judgment based on the error mean and the standard deviation and to determine an abnormal point position based on the queue pair numbers, wherein the judging module is specifically configured to: determine that network communication is abnormal when the error mean is greater than an error threshold, the error threshold being dynamically adjusted from the error mean and the standard deviation in combination with a preset adjustment parameter; query the queue pair number corresponding to the abnormal traffic when network communication is determined to be abnormal, and determine the abnormal point position based on the queue pair number; and determine that network communication is normal when the error mean is less than or equal to the error threshold.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and running on the processor, wherein the processor implements the network communication monitoring method of any one of claims 1 to 5 when executing the computer program.
8. A non-transitory computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the network communication monitoring method according to any of claims 1 to 5.
9. A computer program product comprising a computer program which, when executed by a processor, implements the network communication monitoring method according to any one of claims 1 to 5.
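
The Patch-size selection recited in claims 4 and 5 can be illustrated with a minimal Python sketch. The claims do not disclose how the data change rate is computed or what the preset formula is, so both are assumptions here: the change rate is taken as the mean absolute step-to-step change relative to the mean level, and the band between the two thresholds uses linear interpolation between the maximum and minimum Patch. All function names, threshold values and Patch sizes are hypothetical.

```python
from statistics import fmean


def change_rate(series):
    """Assumed definition: mean absolute step change divided by mean level."""
    if len(series) < 2:
        return 0.0
    level = fmean(abs(x) for x in series)
    if level == 0:
        return 0.0
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    return fmean(steps) / level


def adaptive_patch_size(series, first_thr=0.5, second_thr=0.05,
                        min_patch=4, max_patch=32):
    """Pick a Patch size from the fluctuation of a rate sequence.

    first_thr is the upper (first) fluctuation threshold and second_thr the
    lower (second) one; the default values are illustrative only.
    """
    rate = change_rate(series)
    if rate > first_thr:
        return min_patch   # strong fluctuation: smallest Patch
    if rate < second_thr:
        return max_patch   # nearly flat: largest Patch
    # Assumed stand-in for the undisclosed preset formula: linear interpolation.
    frac = (rate - second_thr) / (first_thr - second_thr)
    return round(max_patch - frac * (max_patch - min_patch))
```

Under these assumptions a flat sequence receives the largest Patch, a strongly oscillating one the smallest, and anything in between a size that shrinks as the fluctuation grows.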

Description

Network communication monitoring method, device, electronic equipment and medium

Technical Field

The present invention relates to the field of network communication monitoring technologies, and in particular to a network communication monitoring method, device, electronic equipment, and medium.

Background

Existing C4 diagnostic techniques use collective communication library logs to check the network communication status in real time. However, the monitoring information collected in this way is complex, the anomaly detection process is complicated, fault detection can be delayed, and the real-time performance is inferior to that of some active detection methods. Another approach monitors the network communication state by inserting Compute Unified Device Architecture (CUDA) Event instrumentation points into a large language model (LLM) training framework, visualizing the three-dimensional parallel training process along a time axis, and locating a fault to a particular step of the training process in real time. This active monitoring mode affects model training efficiency and cannot identify the specific faulty node. In yet another approach, for the LLaMA (Large Language Model Meta AI) model, an extended NVIDIA Collective Communications Library (NCCL Extended) is designed to track network activity by sending small information packets that provide a failure snapshot. These extra packets slow down the transmission rate of the data stream and can even cause additional congestion. Therefore, how to monitor the network communication state efficiently and in real time and quickly identify faulty nodes has become an urgent problem to be solved.

Disclosure of Invention

The invention provides a network communication monitoring method, a device, electronic equipment and a medium, which address the untimely and inefficient detection of network communication anomalies in the prior art, monitor the network communication state efficiently and in real time, and rapidly identify faulty nodes.

The invention provides a network communication monitoring method, which comprises the following steps: acquiring, in real time, meta information of each data block at a network card egress when the network card executes a collective communication operation, wherein the meta information comprises a queue pair number corresponding to each data block, and the queue pair comprises a sending queue and a receiving queue; performing data processing on the meta information to generate an evolution rate sequence vector set of all queue pairs of all communication endpoints under the same collective communication operation; inputting the evolution rate sequence vector set into a fault detection model to obtain the predicted rate of each queue pair output by the fault detection model; comparing the predicted rate of each queue pair with the corresponding actual traffic rate to obtain a rate error; calculating an error mean and a standard deviation over all the queue pairs based on the rate error of each queue pair; and performing network communication anomaly judgment based on the error mean and the standard deviation, and determining the abnormal point position based on the queue pair numbers.
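
As a rough illustration of the steps above, the following Python sketch derives per-queue-pair rate sequences from block meta information and then applies the error-mean judgment. It is not the patented implementation. The fault detection model is left out and its predicted rates are supplied by the caller. The record layout (queue pair number, size in bytes, window start and end in seconds) is hypothetical. The threshold form (mean plus k times standard deviation of errors from windows assumed healthy) and the localization rule (queue pairs whose own error exceeds the threshold) are assumptions, since the text only states that the threshold is adjusted from the error mean and standard deviation with a preset adjustment parameter.

```python
from collections import defaultdict
from statistics import fmean, pstdev


def rate_sequences(blocks):
    """Group blocks by queue pair number and convert each block to a rate.

    `blocks` holds (qp_number, size_bytes, t_start, t_end) tuples, a
    hypothetical layout. Blocks with a non-positive time window are dropped
    as a simple form of data cleaning.
    """
    series = defaultdict(list)
    for qp, size, t_start, t_end in sorted(blocks, key=lambda b: b[2]):
        duration = t_end - t_start
        if duration > 0:
            series[qp].append(size / duration)  # bytes per second
    return dict(series)


def judge(actual, predicted, baseline_errors, k=3.0):
    """Return (abnormal, suspect_qps, mean_error, std_error, threshold).

    `actual` and `predicted` map queue pair numbers to rates; `k` plays the
    role of the preset adjustment parameter (assumed meaning).
    """
    errors = {qp: abs(predicted[qp] - actual[qp]) for qp in actual}
    values = list(errors.values())
    mean_err, std_err = fmean(values), pstdev(values)
    # Assumed threshold form, built from errors observed in healthy windows.
    threshold = fmean(baseline_errors) + k * pstdev(baseline_errors)
    if mean_err <= threshold:
        return False, [], mean_err, std_err, threshold
    # Assumed localization rule: queue pairs whose own error is above threshold.
    suspects = sorted(qp for qp, err in errors.items() if err > threshold)
    return True, suspects, mean_err, std_err, threshold
```

For example, with actual rates `{17: 95.0, 18: 96.0, 19: 40.0, 20: 94.0}`, predicted rates `{17: 96.0, 18: 95.0, 19: 97.0, 20: 95.0}` and baseline errors `[1.0, 1.5, 0.5, 1.0, 1.0]`, the error mean of 15.0 exceeds the threshold and queue pair 19 is reported as the suspect channel.
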

In one possible embodiment, the method further comprises: performing data cleaning on the meta information to obtain valid metadata; and normalizing the valid metadata, and aggregating the normalized valid metadata to generate the evolution rate sequence vector set of all queue pairs of all communication endpoints under the same collective communication operation.

In one possible embodiment, the method further comprises: when the error mean is greater than an error threshold, determining that network communication is abnormal, wherein the error threshold is dynamically adjusted from the error mean and the standard deviation in combination with a preset adjustment parameter; when network communication is determined to be abnormal, querying the queue pair number corresponding to the abnormal traffic, and determining the abnormal point position based on the queue pair number; and when the error mean is less than or equal to the error threshold, determining that network communication is normal.

In one possible embodiment, the fault detection model is trained through the following steps: acquiring a sample evolution rate sequence vector set of all queue pairs created when the network card executes the collective communication operation; and inputting the sample evolution rate sequence vector set into a Transformer model for online learning, and stopping model training when a preset early-stopping condition is met, so as to obtain the fault detection model, wherein the preset early-stopping condition comprises no improvement in validation loss, no improvement of verification accur