CN-122021799-A - Training data processing method and device, electronic equipment and storage medium

CN 122021799 A

Abstract

The invention provides a training data processing method, apparatus, electronic device, and storage medium in the technical field of artificial intelligence. The method comprises: reading training data from a remote distributed file system and storing it in the memory of a DPU; preprocessing the training data in the DPU memory; and writing the preprocessed training data directly from the DPU memory into the video memory of each local GPU. By using the DPU to read the training data, preprocess it, and write it directly into each GPU's video memory, the data-reading and preprocessing tasks conventionally executed by the CPU are migrated entirely to the DPU. The DPU thus forms a data shortcut from remote storage straight to GPU video memory, bypassing the host CPU and host-memory bottleneck and reducing the communication delay of the data-preparation stage of distributed training.
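
For orientation only, the pipeline the abstract describes can be outlined as a minimal C sketch. This is not from the patent; all three stage functions are hypothetical stubs standing in for DPU-side RDMA fetch, preprocessing, and direct-to-GPU transfer logic:

```c
/* Minimal outline (not from the patent) of the DPU-side pipeline the
 * abstract describes; the three stage functions are hypothetical stubs. */
#include <stdlib.h>

/* Stage 1: pull a shard of training data from the remote distributed
 * file system into DPU memory (stub). */
static size_t fetch_shard(void *dpu_buf, size_t cap) { (void)dpu_buf; return cap; }

/* Stage 2: preprocess (decode, augment, batch) in DPU memory (stub). */
static void preprocess(void *dpu_buf, size_t len) { (void)dpu_buf; (void)len; }

/* Stage 3: write the prepared batch straight into each local GPU's
 * video memory, bypassing the host CPU and host DRAM (stub). */
static void write_to_gpus(const void *dpu_buf, size_t len, int n_gpus)
{ (void)dpu_buf; (void)len; (void)n_gpus; }

int main(void)
{
    size_t cap = (size_t)64 << 20;          /* 64 MiB staging buffer */
    void *dpu_buf = malloc(cap);            /* lives in DPU DRAM     */
    if (!dpu_buf) return 1;

    size_t len = fetch_shard(dpu_buf, cap); /* remote storage -> DPU */
    preprocess(dpu_buf, len);               /* preprocessing on DPU  */
    write_to_gpus(dpu_buf, len, 8);         /* DPU -> GPU memory     */

    free(dpu_buf);
    return 0;
}
```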

Inventors

  • GUO SHAOYONG
  • HUANG LVCHAO
  • LIU TIANJI
  • YANG CHAO
  • YU ZHENQI
  • LI QINGFENG
  • CHEN JIEWEI
  • LI WENJING
  • XIU JIAPENG
  • YANG ZHENGQIU
  • WANG SHIQING
  • LI HAOSONG

Assignees

  • Beijing University of Posts and Telecommunications
  • State Grid Information & Telecommunication Group Co., Ltd.
  • Electric Power Research Institute of State Grid Liaoning Electric Power Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2025-12-09

Claims (10)

  1. A training data processing method, comprising: reading training data from a remote distributed file system and storing the training data in a memory of a DPU; preprocessing the training data in the memory of the DPU; and writing the preprocessed training data directly from the memory of the DPU into a video memory of each local GPU.
  2. The training data processing method according to claim 1, wherein reading training data from the remote distributed file system comprises: actively initiating a data read request to the remote distributed file system via RDMA to read the training data.
  3. The training data processing method according to claim 1, wherein writing the preprocessed training data directly from the memory of the DPU into the video memory of each local GPU comprises: writing the training data from the memory of the DPU into the video memory of each local GPU through a data pass-through technology.
  4. The training data processing method according to claim 1, further comprising, after the preprocessed training data is written directly from the memory of the DPU into the video memory of each local GPU: receiving local gradient data obtained by each local GPU based on the training data and storing the local gradient data in the memory of the DPU; aggregating the local gradient data in the memory of the DPU to obtain an aggregated gradient; synchronizing the aggregated gradient with the DPUs of the other distributed computing nodes to obtain a global gradient; and distributing the global gradient to the local GPUs.
  5. The training data processing method according to claim 4, wherein synchronizing the aggregated gradient with the DPUs of the other distributed computing nodes comprises: synchronizing the aggregated gradient to the DPU of each of the other distributed computing nodes via RDMA.
  6. The training data processing method according to claim 4, wherein distributing the global gradient to the local GPUs comprises: distributing the global gradient to each local GPU through a data pass-through technology.
  7. A training data processing device, comprising: a reading module configured to read training data from a remote distributed file system and store the training data in a memory of a DPU; a processing module configured to preprocess the training data in the memory of the DPU; and a writing module configured to write the preprocessed training data directly from the memory of the DPU into a video memory of each local GPU.
  8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the training data processing method according to any one of claims 1 to 6 when executing the computer program.
  9. A non-transitory computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the training data processing method according to any one of claims 1 to 6.
  10. A computer program product comprising a computer program which, when executed by a processor, implements the training data processing method according to any one of claims 1 to 6.
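
Claims 4 to 6 describe the gradient path: per-GPU gradients are collected into DPU memory, aggregated there, synchronized with peer DPUs over RDMA, and the global result is pushed back to the GPUs. A minimal C sketch of that control flow follows; collect_from_gpu, rdma_allreduce_peers, and push_to_gpu are hypothetical placeholders, not a real DPU API:

```c
/* Hypothetical sketch of the gradient path in claims 4-6. The three
 * helper functions are stubs standing in for GPU-to-DPU collection,
 * an RDMA-based inter-DPU allreduce, and the pass-through push back. */
#include <stddef.h>
#include <stdio.h>

#define N_GPUS   4
#define N_PARAMS 1024

static void collect_from_gpu(int gpu, float *dst, size_t n)
{
    (void)gpu;
    for (size_t i = 0; i < n; i++) dst[i] = 1.0f;  /* dummy gradient */
}
static void rdma_allreduce_peers(float *grad, size_t n) { (void)grad; (void)n; }
static void push_to_gpu(int gpu, const float *src, size_t n)
{ (void)gpu; (void)src; (void)n; }

static float local_grad[N_GPUS][N_PARAMS]; /* per-GPU gradients in DPU DRAM */
static float agg[N_PARAMS];                /* node-level aggregate          */

static void gradient_step(void)
{
    /* 1. Receive each local GPU's gradient into DPU memory (claim 4). */
    for (int g = 0; g < N_GPUS; g++)
        collect_from_gpu(g, local_grad[g], N_PARAMS);

    /* 2. Aggregate on the DPU's own cores (claim 4). */
    for (size_t i = 0; i < N_PARAMS; i++) {
        agg[i] = 0.0f;
        for (int g = 0; g < N_GPUS; g++)
            agg[i] += local_grad[g][i];
    }

    /* 3. Synchronize with peer DPUs over RDMA to form the global
     *    gradient in place (claim 5); stubbed here. */
    rdma_allreduce_peers(agg, N_PARAMS);

    /* 4. Distribute the global gradient back to each local GPU through
     *    the data pass-through path (claim 6). */
    for (int g = 0; g < N_GPUS; g++)
        push_to_gpu(g, agg, N_PARAMS);
}

int main(void)
{
    gradient_step();
    printf("agg[0] = %f\n", agg[0]);  /* 4.0 with the dummy gradients */
    return 0;
}
```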

Description

Training data processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular to a training data processing method, apparatus, electronic device, and storage medium.

Background

With the rapid growth of deep learning model parameters and training data set sizes, distributed parallel training has become the standard mode for model training in cloud data centers. In the distributed training process, however, the huge data flow and the communication overhead between nodes have become the main performance bottlenecks limiting training efficiency and system scalability. In a conventional distributed training process, the data stream typically follows a fixed, high-overhead path. In the data preparation stage, each training node acquires training data from a remote distributed file system (such as a data lake); the data is first loaded into the memory of the node's central processing unit (CPU), preprocessing operations such as data augmentation are executed by the CPU, and the data is finally transmitted to the video memory of the graphics processing unit (GPU) over the Peripheral Component Interconnect Express (PCIe) bus. This process not only frequently occupies CPU computing resources, but also incurs a large context-switching cost from repeatedly triggered hardware interrupts, and the repeated copying of data along the storage-CPU-GPU path causes severe contention for bus resources within the node, so the communication delay of distributed training is high.

Disclosure of Invention

The invention provides a training data processing method for solving the technical problem of high distributed training communication delay in the prior art. The method comprises the following steps: reading training data from a remote distributed file system and storing the training data in the memory of a DPU; preprocessing the training data in the DPU memory; and writing the preprocessed training data directly from the memory of the DPU into the video memory of each local GPU.

According to the training data processing method provided by the invention, reading the training data from a remote distributed file system comprises: actively initiating a data read request to the remote distributed file system via RDMA to read the training data. According to the training data processing method provided by the invention, writing the preprocessed training data directly from the memory of the DPU into the video memory of each local GPU comprises: writing the training data from the memory of the DPU into the video memory of each local GPU through a data pass-through technology.
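
The active RDMA read described above can be illustrated with the standard libibverbs API. This is a hedged sketch, not the patent's implementation: it assumes a reliable-connection queue pair to the storage server has already been established and that the remote buffer's address and rkey were exchanged out of band; that setup is elided.

```c
/* Hedged sketch: one-sided RDMA READ pulling a block of training data
 * from a remote storage server into a local (DPU) buffer via libibverbs.
 * Assumes `qp` is an already-connected RC queue pair and that
 * `remote_addr`/`rkey` were obtained out of band. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int rdma_read_block(struct ibv_pd *pd, struct ibv_qp *qp, struct ibv_cq *cq,
                    uint64_t remote_addr, uint32_t rkey, size_t len)
{
    /* Register a receive buffer in local (DPU) DRAM; the NIC needs
     * local-write access to land the read's payload here. */
    void *buf = malloc(len);
    if (!buf) return -1;
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr) { free(buf); return -1; }

    struct ibv_sge sge = {
        .addr   = (uint64_t)(uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 1,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_READ,   /* one-sided: no remote CPU work */
        .send_flags = IBV_SEND_SIGNALED,
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad = NULL;
    if (ibv_post_send(qp, &wr, &bad)) return -1;

    /* Busy-poll the completion queue until the read finishes. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0) { /* spin */ }
    if (wc.status != IBV_WC_SUCCESS) {
        fprintf(stderr, "RDMA READ failed: %s\n", ibv_wc_status_str(wc.status));
        return -1;
    }
    /* `buf` now holds the training data, fetched without involving the
     * remote host's CPU; hand it to the preprocessing stage. */
    return 0;
}
```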
According to the training data processing method provided by the invention, after the preprocessed training data is written directly from the memory of the DPU into the video memory of each local GPU, the method further comprises: receiving local gradient data obtained by each local GPU based on the training data and storing the local gradient data in the memory of the DPU; aggregating the local gradient data in the memory of the DPU to obtain an aggregated gradient; synchronizing the aggregated gradient with the DPUs of the other distributed computing nodes to obtain a global gradient; and distributing the global gradient to the local GPUs.

According to the training data processing method provided by the invention, synchronizing the aggregated gradient with the DPUs of the other distributed computing nodes comprises: synchronizing the aggregated gradient to the DPU of each of the other distributed computing nodes via RDMA. According to the training data processing method provided by the invention, distributing the global gradient to the local GPUs comprises: distributing the global gradient to each local GPU through a data pass-through technology.

The invention also provides a training data processing device comprising: a reading module configured to read training data from a remote distributed file system and store the training data in the memory of a DPU; a processing module configured to preprocess the training data in the DPU memory; and a writing module configured to write the preprocessed training data directly from the memory of the DPU into the video memory of each local GPU.

The invention also provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the training data processing method as described in any of the above when executing the computer program.
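
For the data pass-through write from DPU memory into GPU video memory that the method relies on, one known mechanism on commodity hardware is GPUDirect RDMA. The sketch below shows only the GPU-side preparation and assumes the nvidia-peermem kernel module is loaded so that libibverbs can register a CUDA device buffer; the patent itself does not name a specific mechanism.

```c
/* Hedged sketch of the GPU side of a "data pass-through" write: a GPU
 * buffer is registered with the RDMA NIC (GPUDirect RDMA, requiring the
 * nvidia-peermem kernel module), after which a peer -- here, the DPU --
 * can RDMA-WRITE preprocessed batches directly into GPU memory with no
 * staging in host DRAM. Error handling is abbreviated. */
#include <cuda_runtime.h>
#include <infiniband/verbs.h>
#include <stdio.h>

int expose_gpu_buffer(struct ibv_pd *pd, size_t len,
                      void **gpu_buf, struct ibv_mr **mr)
{
    /* Allocate the target batch buffer in GPU device memory. */
    if (cudaMalloc(gpu_buf, len) != cudaSuccess) return -1;

    /* Register the device pointer with the NIC; with GPUDirect RDMA the
     * verbs stack pins GPU pages instead of host pages. */
    *mr = ibv_reg_mr(pd, *gpu_buf, len,
                     IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE);
    if (!*mr) { cudaFree(*gpu_buf); return -1; }

    /* The address and rkey below are what the DPU needs in order to
     * target this GPU buffer with its RDMA WRITEs. */
    printf("addr=%p rkey=0x%x\n", *gpu_buf, (*mr)->rkey);
    return 0;
}
```

Once the buffer is registered, the DPU can target it with ordinary RDMA WRITE work requests, exactly as in the read sketch above but with IBV_WR_RDMA_WRITE and the GPU buffer's address and rkey as the destination.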