CN-122027523-A - Large model training monitoring method and system based on flow measurement
Abstract
The invention discloses a large model training monitoring method and system based on flow measurement, belonging to the field of network measurement. The method comprises: performing flow statistics on RDMA flows to obtain RDMA flow measurement data; determining communication characteristic information of each communication operator in the training process according to the known communication pattern of the communication library in the large model training process, the communication characteristic information comprising participating nodes and communication data volume; screening transport layer flow data corresponding to the participating nodes from the RDMA flow measurement data to construct a communication rate time sequence reflecting data transmission behavior during communication; analyzing the communication rate time sequence according to the communication data volume to deduce the execution time interval of the communication operator; and extracting rate data within the execution time interval from the communication rate time sequence to monitor the communication rate of the communication operator. The invention realizes microsecond-level rate monitoring of collective communication behavior during large model training and improves monitoring time precision.
Inventors
- TIAN CHEN
- MENG QINGKAI
- XIAO YIBO
Assignees
- Nanjing University (南京大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-06
Claims (9)
- 1. A large model training monitoring method based on flow measurement, characterized by comprising the following steps: performing flow statistics on RDMA flows to obtain RDMA flow measurement data; determining communication characteristic information of each communication operator in the training process according to the known communication pattern of the communication library in the large model training process, wherein the communication characteristic information comprises communication participation nodes and communication data volume; screening transport layer flow data corresponding to the communication participation nodes from the RDMA flow measurement data, and constructing a communication rate time sequence reflecting data transmission behavior in the communication process; and analyzing the communication rate time sequence according to the communication data volume, deducing an execution time interval of the communication operator, and extracting rate data within the execution time interval from the communication rate time sequence to monitor the communication rate of the communication operator.
- 2. The flow measurement-based large model training monitoring method according to claim 1, wherein the flow statistics on RDMA flows specifically comprise: performing a counting operation on RDMA flows passing through the network card, generating a corresponding trigger event when the transmission data volume of an RDMA flow within a preset time exceeds a preset threshold, performing window-by-window flow measurement on the triggered RDMA flow, and periodically collecting the generated RDMA flow measurement data.
- 3. The flow measurement-based large model training monitoring method according to claim 1, wherein analyzing the communication rate time sequence according to the communication data volume to deduce the execution time interval of the communication operator specifically comprises: according to the communication data volume of the communication operator, determining in the communication rate time sequence a time interval in which the accumulated transmission data volume reaches the communication data volume of the communication operator, determining the starting time of the time interval as the starting time point of the communication operator, and determining the ending time of the time interval as the ending time point of the communication operator.
- 4. The flow measurement-based large model training monitoring method according to claim 1, wherein determining the communication characteristic information of each communication operator in the training process according to the known communication pattern of the communication library in the large model training process specifically comprises: acquiring, by means of a function interception technique, the type information of the corresponding communication operator and the identifiers of the nodes participating in communication when a communication library function is called, wherein the function interception technique is realized by an eBPF-based system call or function tracing mechanism, or by an LD_PRELOAD-based dynamic link function replacement mechanism; and deducing the theoretical communication data volume of each communication operator according to the training configuration and the known behavior characteristics of the communication library.
- 5. A large model training monitoring system based on flow measurement, comprising: an RDMA flow measurement module for performing flow statistics on RDMA flows to obtain RDMA flow measurement data; and a communication operator monitoring module for determining communication characteristic information of each communication operator in the training process according to the known communication pattern of the communication library in the large model training process, wherein the communication characteristic information comprises communication participation nodes and communication data volume, screening transport layer flow data corresponding to the communication participation nodes from the RDMA flow measurement data, constructing a communication rate time sequence reflecting data transmission behavior in the communication process, analyzing the communication rate time sequence according to the communication data volume, deducing an execution time interval of the communication operator, and extracting rate data within the execution time interval from the communication rate time sequence to monitor the communication rate of the communication operator.
- 6. The flow measurement-based large model training monitoring system according to claim 5, wherein the RDMA flow measurement module comprises an aggregation layer, a measurement layer, and a collection layer; the aggregation layer is deployed in the data forwarding pipeline of the network card, performs a counting operation on RDMA flows passing through the network card, generates a corresponding trigger event when the transmission data volume of an RDMA flow within a preset time exceeds a preset threshold, and reports the event to a microprocessor on the network card; the measurement layer runs on the microprocessor of the network card and allocates, for each measured RDMA flow, independent storage space across a plurality of consecutive time windows for recording the data volume actually transmitted by the RDMA flow in each time window; and the collection layer is responsible for periodically collecting the RDMA flow measurement data generated by the measurement layer from the network card microprocessor.
- 7. The flow measurement-based large model training monitoring system according to claim 6, wherein, during collection of RDMA flow measurement data, the microprocessor of the network card checks whether there is an RDMA flow that has allocated measurement memory but performed no data transfer during the past collection period, and if so, determines that RDMA flow to be an inactive flow and reclaims its corresponding memory space.
- 8. The flow measurement-based large model training monitoring system according to claim 5, wherein determining the communication characteristic information of each communication operator in the training process according to the known communication pattern of the communication library in the large model training process specifically comprises: acquiring, by means of a function interception technique, the type information of the corresponding communication operator and the identifiers of the nodes participating in communication when a communication library function is called, wherein the function interception technique is realized by an eBPF-based system call or function tracing mechanism, or by an LD_PRELOAD-based dynamic link function replacement mechanism; and deducing the theoretical communication data volume of each communication operator according to the training configuration and the known behavior characteristics of the communication library.
- 9. The flow measurement-based large model training monitoring system according to claim 5, wherein analyzing the communication rate time sequence according to the communication data volume to deduce the execution time interval of the communication operator specifically comprises: according to the communication data volume of the communication operator, determining in the communication rate time sequence a time interval in which the accumulated transmission data volume reaches the communication data volume of the communication operator, determining the starting time of the time interval as the starting time point of the communication operator, and determining the ending time of the time interval as the ending time point of the communication operator.
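The measurement-and-deduction pipeline described in claims 2, 3, 6, and 9 can be illustrated with a minimal Python sketch: per-window byte counts from the network card form the rate time series, and the operator's execution interval is the span over which the accumulated bytes reach its communication data volume. The window length, function names, and the ring-AllReduce volume formula below are illustrative assumptions, not part of the claims.

```python
# Illustrative sketch (not the patented implementation) of interval deduction
# from a windowed RDMA byte-count time series, as described in claims 3 and 9.

def theoretical_allreduce_bytes(tensor_bytes: int, num_nodes: int) -> int:
    """One common model (ring AllReduce): each node transmits roughly
    2 * (N - 1) / N of the tensor size over the network."""
    return 2 * (num_nodes - 1) * tensor_bytes // num_nodes

def deduce_interval(windows, window_us, target_bytes):
    """Scan consecutive time windows of transmitted bytes; the operator is
    taken to start at the first window with traffic and to end in the window
    where the accumulated volume reaches target_bytes.

    Returns (start_us, end_us), or None if the target is never reached.
    """
    cum = 0
    start_idx = None
    for i, b in enumerate(windows):
        if start_idx is None and b > 0:
            start_idx = i                      # first window carrying traffic
        cum += b
        if start_idx is not None and cum >= target_bytes:
            return (start_idx * window_us, (i + 1) * window_us)
    return None
```

For example, with 100 µs windows carrying `[0, 0, 300, 500, 200, 0]` bytes and a 1000-byte operator, the deduced interval is 200–500 µs; the rate samples inside that interval are then what claim 1 extracts to monitor the operator's communication rate.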
Description
Large model training monitoring method and system based on flow measurement

Technical Field

The invention relates to the fields of large model training monitoring and network measurement, and in particular to a large model training monitoring method and system based on flow measurement.

Background

In recent years, with the continuous expansion of deep learning model scale, large model training has gradually evolved from a stand-alone environment to a training mode that relies on large-scale distributed clusters. In a typical large model training process, training tasks usually run on a cluster formed by a plurality of computing nodes, and frequent data exchange is performed between the nodes through a high-speed network. In particular, collective communication operations represented by communication operators such as AllReduce, AllGather, and ReduceScatter are becoming key basic mechanisms in parallel strategies such as Data Parallelism (DP), Tensor Parallelism (TP), and Expert Parallelism (EP). Therefore, accurately monitoring communication behavior during large model training, and in particular locating performance anomalies and slow nodes, is of great significance for guaranteeing training efficiency and improving resource utilization. In response to the above needs, academia and industry have proposed technical solutions for monitoring and analyzing the performance of large model training; these solutions mainly model and analyze the training process by collecting timing information at the application layer or the communication library layer. The CUDA-event-based training phase monitoring method MegaScale is a representative prior art, disclosed in the non-patent document "MegaScale: Scaling Large Language Model Training with Megatron-LM" (published at USENIX NSDI 2024).
According to this method, CUDA events are inserted into the training framework, and the execution times of a number of key code segments in the training process are collected, so that the time consumption of communication relative to computation can be analyzed and training performance bottlenecks evaluated. The method is mainly characterized in that: 1) the execution time of a code segment can be acquired at low overhead through the GPU-side CUDA event mechanism; 2) it can be used to analyze the overall time-consumption distribution of the different phases (e.g., forward, backward, and gradient synchronization) in model training. However, methods such as MegaScale do not give a uniform, precise definition of "key code segments"; their monitoring granularity typically stays at a coarse stage level and is difficult to refine further to the node level or the level of individual communication behaviors. In addition, such methods rely on explicitly inserting CUDA events into the training framework or application code, constituting an invasive monitoring mode with high deployment and maintenance costs. The communication-library-instrumentation-based training monitoring method Aegis is another typical prior art, disclosed in the non-patent document "Evolution of Aegis: Fault Diagnosis for AI Model Training Cloud Service in Production" (published at USENIX NSDI 2025). Aegis instruments a collective communication library (such as NCCL), records the starting and ending times of each communication operator, and reversely deduces the start and end times of the computation stages from this timing information, thereby realizing performance monitoring and scheduling optimization of the training process.
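The instrumentation idea behind Aegis and Holmes — wrap each collective call at the communication library layer, record its start and end times, and attribute the gaps between consecutive operators to computation — can be sketched in a few lines of Python. The wrapper and the recorded tuple layout below are hypothetical illustrations, not the published systems' interfaces.

```python
import time

# Hypothetical sketch of communication-library instrumentation: each collective
# call is wrapped so its start/end timestamps are logged; the gap between one
# operator's end and the next operator's start approximates a compute stage.

events = []  # (operator_name, start_seconds, end_seconds)

def instrumented(op_name, fn, *args, **kwargs):
    """Run a (stand-in) collective call and record its execution interval."""
    start = time.monotonic()
    result = fn(*args, **kwargs)
    events.append((op_name, start, time.monotonic()))
    return result

def compute_gaps(evts):
    """Durations between consecutive communication operators, which
    instrumentation-based monitors attribute to computation."""
    return [nxt[1] - prev[2] for prev, nxt in zip(evts, evts[1:])]
```

This illustrates both the strength the description notes (operator start/end times are directly observable) and the limitation (computation is only inferred indirectly from the gaps, at operator granularity).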
Similarly, Holmes, disclosed in the non-patent document "Holmes: Localizing Irregularities in LLM Training with Mega-scale GPU Clusters" (published at USENIX NSDI 2025), also diagnoses performance anomalies in distributed training by instrumenting collective communication operators at the communication library layer, collecting timing information of the communication operators, and combining this with statistical analysis methods. Methods based on communication library instrumentation share the following characteristics: 1) the monitoring point is located at the communication library layer and can acquire the starting and ending times of collective communication operators; 2) the execution of the computation stages is indirectly inferred from communication timing, so as to realize end-to-end monitoring of the training process. Although the above prior art has achieved monitoring of large model training processes to some extent, it has at least the following disadvantages in practical applications: (1) The monitoring granularity is coarse, making slow nodes difficult to locate. Large model training is a strongly synchronized process that relies heavily on collective communication. When a node in the cluster suffers reduced performance, other nodes also exhibit the same or similar communication start time and end