
CN-121792535-B - Data communication method, device, equipment, storage medium and program product under distributed training task

CN 121792535 B

Abstract

The application provides a data communication method, apparatus, device, storage medium, and program product for a distributed training task. The method comprises: starting a full-aggregation (all-gather) communication operation for data to be processed in the input gradient computation stage; during execution of the full-aggregation operation, writing the full-tensor data generated by aggregating the data to be processed into a preset data buffer, and performing a first matrix operation based on the aggregated full-tensor data to obtain a first gradient result; in the weight gradient computation stage, reading the full-tensor data aggregated by the full-aggregation operation from the preset data buffer; and, while executing a reduce-scatter communication operation on the first gradient result, performing a second matrix operation based on the full-tensor data read from the preset data buffer to obtain a second gradient result. The method and apparatus eliminate redundant communication in the weight gradient computation stage and achieve deep overlapping of computation tasks and communication tasks.
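For orientation, here is a minimal sketch of the core idea in PyTorch, assuming a tensor-parallel linear layer: the all-gather output is cached in a per-layer buffer during the input-gradient stage, the weight-gradient stage reads the cache instead of re-gathering, and the reduce-scatter of the input gradient runs asynchronously under the weight-gradient matmul. All names (WGRAD_BUFFER, layer_id) and tensor shapes are illustrative assumptions, not taken from the patent.

```python
import torch
import torch.distributed as dist

# "Preset data buffer": maps each layer to its cached full tensors.
WGRAD_BUFFER = {}

def _all_gather(shard, group):
    # Gather a [n, d] shard from every rank into one [world * n, d] full tensor.
    world = dist.get_world_size(group)
    full = torch.empty(world * shard.shape[0], shard.shape[1],
                       dtype=shard.dtype, device=shard.device)
    dist.all_gather_into_tensor(full, shard, group=group)
    return full

def input_grad_stage(layer_id, grad_out_shard, act_shard, weight, group):
    # Input-gradient stage: all-gather once, cache the full tensors for reuse.
    full_grad = _all_gather(grad_out_shard, group)
    full_act = _all_gather(act_shard, group)
    WGRAD_BUFFER[layer_id] = (full_grad, full_act)
    return full_grad @ weight  # first matrix operation (input gradient)

def weight_grad_stage(layer_id, input_grad, group):
    # Weight-gradient stage: read the cache (no second all-gather) and overlap
    # the weight-gradient matmul with an async reduce-scatter of the input gradient.
    full_grad, full_act = WGRAD_BUFFER[layer_id]
    world = dist.get_world_size(group)
    shard = torch.empty(input_grad.shape[0] // world, input_grad.shape[1],
                        dtype=input_grad.dtype, device=input_grad.device)
    handle = dist.reduce_scatter_tensor(shard, input_grad, group=group, async_op=True)
    wgrad = full_grad.t() @ full_act  # second matrix operation (weight gradient)
    handle.wait()  # communication completed under the matmul, not after it
    return wgrad, shard
```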

Inventors

  • HUANG QINGYANG

Assignees

  • 上海东方算芯科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-03-04

Claims (12)

  1. A data communication method under a distributed training task, the method comprising: starting a full-aggregation (all-gather) communication operation for data to be processed in an input gradient computation stage of the distributed training task; during execution of the full-aggregation communication operation, writing the full-tensor data generated by aggregating the data to be processed into a preset data buffer, and performing a first matrix operation based on the aggregated full-tensor data to obtain a first gradient result; in a weight gradient computation stage of the distributed training task, reading the full-tensor data aggregated by the full-aggregation communication operation from the preset data buffer, and masking, in the weight gradient computation stage, the full-aggregation communication operation that would otherwise be used to acquire the full-tensor data; and, during execution of a reduce-scatter communication operation on the first gradient result, performing a second matrix operation based on the full-tensor data read from the preset data buffer to obtain a second gradient result, wherein the first gradient result and the second gradient result are used for model training of a target model.
  2. The method of claim 1, wherein the full-aggregation communication operation consists of a plurality of slice transmission subtasks and the first matrix operation consists of a plurality of slice computation subtasks, and wherein writing the full-tensor data into the preset data buffer during execution of the full-aggregation communication operation and performing the first matrix operation to obtain the first gradient result comprises: performing the transmission subtask for a current data slice of the plurality of data slices; upon completion of the transmission subtask for the current data slice, storing the data of the current data slice into the preset data buffer; starting the computation subtask for the current data slice using the data of the current data slice to obtain a gradient computation result for the current data slice; and starting the transmission subtask for the next data slice of the plurality of data slices based on the gradient computation result (a minimal sketch of this slice-level pipelining appears after the claims).
  3. The method of claim 1, wherein, before starting the full-aggregation communication operation for the data to be processed in the input gradient computation stage of the distributed training task, the method further comprises: acquiring architecture parameters of the target model in an initialization stage of the distributed training task; determining a target capacity of the data buffer based on the architecture parameters; allocating memory space according to the target capacity and configuring the preset data buffer in the memory space; and establishing the preset mapping relationship between the preset data buffer and the target model (the buffer lifecycle of claims 3 to 6 is also sketched after the claims).
  4. The method of claim 3, wherein the architecture parameters of the target model include a batch size, a sequence length, and a hidden layer dimension, and determining the target capacity of the data buffer based on the architecture parameters comprises: determining the product of the batch size, the sequence length, and the hidden layer dimension; and taking that product as the target capacity of the preset data buffer.
  5. The method of claim 3, wherein allocating the memory space according to the target capacity to configure the preset data buffer and establishing the preset mapping relationship between the preset data buffer and the target model comprises: in the case that the target model includes a plurality of transformer layers, allocating, for each transformer layer, memory space according to the target capacity to configure the preset data buffer for that layer; and establishing the preset mapping relationship between the preset data buffer and the transformer layer of the target model.
  6. The method of claim 5, further comprising: releasing the preset data buffer of each transformer layer after the distributed training task is completed.
  7. The method according to any one of claims 1 to 6, wherein the aggregated full-tensor data comprises an output gradient and an input activation value, and writing the full-tensor data generated by aggregating the data to be processed into the preset data buffer comprises: invoking an input gradient computation operator to identify the preset mapping relationship of the target model; in response to the input gradient computation operator identifying the preset mapping relationship of the target model, detecting the aggregation state of the output gradient and of the input activation value during execution of the full-aggregation communication operation; when aggregation of the tensor data of the output gradient is completed, storing the aggregated output gradient into the preset data buffer; and when aggregation of the tensor data of the input activation value is completed, storing the aggregated input activation value into the preset data buffer.
  8. The method of claim 7, wherein performing the second matrix operation based on the full-tensor data read from the preset data buffer to obtain the second gradient result comprises: in response to a weight gradient computation operator identifying the preset mapping relationship and aggregation of the full-tensor data being completed, starting the reduce-scatter communication operation on the first gradient result, scattering the first gradient result across the data slices of the distributed training task; and, during execution of the reduce-scatter communication operation, performing the second matrix operation using the output gradient and the input activation value read from the preset data buffer to obtain the second gradient result.
  9. A data communication apparatus under a distributed training task, the apparatus comprising: an input gradient processing module configured to start a full-aggregation communication operation for data to be processed in an input gradient computation stage of the distributed training task, write the full-tensor data generated by aggregating the data to be processed into a preset data buffer during execution of the full-aggregation communication operation, and perform a first matrix operation based on the aggregated full-tensor data to obtain a first gradient result; and a weight gradient processing module configured to read the full-tensor data aggregated by the full-aggregation communication operation from the preset data buffer in a weight gradient computation stage of the distributed training task, mask the full-aggregation communication operation used to acquire the full-tensor data in the weight gradient computation stage, and, during execution of a reduce-scatter communication operation on the first gradient result, perform a second matrix operation based on the full-tensor data read from the preset data buffer to obtain a second gradient result, wherein the first gradient result and the second gradient result are used for model training of the target model.
  10. An electronic device, comprising: a memory for storing computer-executable instructions or a computer program; and a processor for implementing the data communication method under a distributed training task of any one of claims 1 to 8 when executing the computer-executable instructions or computer program stored in the memory.
  11. A computer-readable storage medium storing computer-executable instructions or a computer program which, when executed by a processor, implements the data communication method under a distributed training task of any one of claims 1 to 8.
  12. A computer program product comprising computer-executable instructions or a computer program which, when executed by a processor, implements the data communication method under a distributed training task of any one of claims 1 to 8.
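The slice-level pipelining of claim 2, sketched minimally in PyTorch with async collectives. The sketch launches the transfer for slice i+1 just before the matmul for slice i; the claim orders the next transmission after the computation subtask is started, which gives the same overlap since both operations are asynchronous. Helper names, the slice layout, and the preallocated output list are my assumptions.

```python
import torch
import torch.distributed as dist

def pipelined_all_gather_matmul(shards, outputs, weight, group):
    # shards:  list of local data slices, one per pipeline step
    # outputs: preallocated full slices (the preset data buffer), same length
    results = []
    # Transmission subtask for the first slice.
    handle = dist.all_gather_into_tensor(outputs[0], shards[0], group=group, async_op=True)
    for i in range(len(shards)):
        handle.wait()  # transmission subtask for slice i is complete
        if i + 1 < len(shards):
            # Start the transmission subtask for the next slice; it runs on the
            # network while the computation subtask for slice i occupies the GPU.
            handle = dist.all_gather_into_tensor(outputs[i + 1], shards[i + 1],
                                                 group=group, async_op=True)
        results.append(outputs[i] @ weight)  # computation subtask for slice i
    return torch.cat(results, dim=0)
```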
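And a sketch of the buffer lifecycle in claims 3 to 6, under the assumption of one buffer pair (output gradient, input activation) per transformer layer: the target capacity is the product batch_size * seq_len * hidden_dim from claim 4, allocation happens in the initialization stage, and release happens after training. Function names and the dtype are illustrative.

```python
import torch

def init_buffers(num_layers, batch_size, seq_len, hidden_dim, device, dtype=torch.float16):
    # Target capacity per claim 4: batch size x sequence length x hidden dimension.
    capacity = batch_size * seq_len * hidden_dim
    # Preset mapping relationship: one buffer pair per transformer layer (claim 5).
    return {layer: (torch.empty(capacity, dtype=dtype, device=device),   # output gradient
                    torch.empty(capacity, dtype=dtype, device=device))   # input activation
            for layer in range(num_layers)}

def release_buffers(buffers):
    # Claim 6: release each transformer layer's buffer once training completes.
    buffers.clear()
```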

Description

Data communication method, device, equipment, storage medium and program product under distributed training task

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method, an apparatus, a device, a storage medium, and a program product for data communication under a distributed training task.

Background

With the rapid development of deep learning technology, the parameter scale of neural network models keeps growing, and the computing capability and storage resources of a single device are often insufficient for training a large-scale model. Distributed training has therefore become an important means of accelerating model training and improving model performance. In a distributed training scenario, data or models must be split across multiple computing nodes for parallel processing, and the data or gradients must be synchronized through inter-node communication. Distributed training communication schemes in the related art mainly comprise data parallelism, tensor model parallelism, pipeline parallelism, and sequence parallelism. However, in the related art, the gradient computation of back propagation must frequently invoke collective communication operations to satisfy the dependence of different computation stages on the full data, causing repeated occupation of network bandwidth and idle waiting of computing resources.

Disclosure of Invention

The embodiments of the application provide a data communication method, apparatus, device, storage medium, and program product under a distributed training task, which reuse full-tensor data via a preset data buffer to eliminate redundant communication in the weight gradient computation stage and achieve deep overlapping of computation and communication tasks, thereby significantly reducing communication overhead and improving the overall efficiency of the distributed training task. The technical scheme of the embodiments of the application is realized as follows.

The embodiments of the application provide a data communication method under a distributed training task, comprising: starting a full-aggregation communication operation for data to be processed in an input gradient computation stage of the distributed training task; during execution of the full-aggregation communication operation, writing the full-tensor data generated by aggregating the data to be processed into a preset data buffer, and performing a first matrix operation based on the aggregated full-tensor data to obtain a first gradient result, wherein the preset data buffer is a data buffer having a preset mapping relationship with a target model of the distributed training task; in a weight gradient computation stage of the distributed training task, reading the full-tensor data aggregated by the full-aggregation communication operation from the preset data buffer, and masking the full-aggregation communication operation used to acquire the full-tensor data in the weight gradient computation stage; and, during execution of a reduce-scatter communication operation on the first gradient result, performing a second matrix operation based on the full-tensor data read from the preset data buffer to obtain a second gradient result, wherein the first gradient result and the second gradient result are used for model training of the target model.
The embodiments of the application provide a data communication apparatus under a distributed training task, comprising an input gradient processing module and a weight gradient processing module. The input gradient processing module is configured to start a full-aggregation communication operation for data to be processed in an input gradient computation stage of the distributed training task, write the full-tensor data generated by aggregating the data to be processed into a preset data buffer during execution of the full-aggregation communication operation, and perform a first matrix operation based on the aggregated full-tensor data to obtain a first gradient result, wherein the preset data buffer is a data buffer having a preset mapping relationship with a target model of the distributed training task. The weight gradient processing module is configured to read the full-tensor data aggregated by the full-aggregation communication operation from the preset data buffer in a weight gradient computation stage of the distributed training task, mask the full-aggregation communication operation used to acquire the full-tensor data in the weight gradient computation stage, and, during execution of a reduce-scatter communication operation on the first gradient result, perform a second matrix operation based on the full-tensor data read from the preset data buffer to obtain a second gradient result, wherein the first gradient result and the second gradient result are used for model training of the target model.
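Finally, a sketch of the aggregation-state detection described in claim 7, assuming PyTorch async collectives: each all-gather is launched independently and its result is committed to the preset buffer as soon as that tensor's aggregation completes, so the output gradient and the input activation become available to the weight gradient operator independently of one another. All names are hypothetical.

```python
import torch
import torch.distributed as dist

def gather_and_stash(layer_id, grad_shard, act_shard, buffer, group):
    world = dist.get_world_size(group)
    full_grad = torch.empty(world * grad_shard.shape[0], grad_shard.shape[1],
                            dtype=grad_shard.dtype, device=grad_shard.device)
    full_act = torch.empty(world * act_shard.shape[0], act_shard.shape[1],
                           dtype=act_shard.dtype, device=act_shard.device)
    # Launch both all-gathers; each tensor's aggregation state is tracked separately.
    h_grad = dist.all_gather_into_tensor(full_grad, grad_shard, group=group, async_op=True)
    h_act = dist.all_gather_into_tensor(full_act, act_shard, group=group, async_op=True)
    h_grad.wait()
    buffer[layer_id] = {"grad": full_grad}  # output gradient aggregated: store it
    h_act.wait()
    buffer[layer_id]["act"] = full_act      # input activation aggregated: store it
```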