CN-121979690-A - Aggregate communication offloading method, system, device and medium

CN121979690A

Abstract

The application discloses a collective communication offloading method, system, device, and medium in the field of computer technology. In a model deployment stage, a distributed inference controller generates a communication primitive blueprint from the computational-graph description file of a tensor-parallel inference model and transmits it to a DPU, which establishes a hardware-level communication context semantic environment from the blueprint. In the model inference stage, when the GPU/NPU computes to a communication boundary, it sends a trigger signal to the DPU; the DPU then executes a DMA data-pull pipeline and an RDMA data-send pipeline in parallel on top of that environment, performs aggregate computation over all tensor-shard data to be synchronized, and writes the result back to the GPU/NPU. By moving the establishment of the hardware-level communication context semantic environment forward to the deployment stage and running the two pipelines in parallel on the DPU, end-to-end communication latency is markedly reduced and the host CPU is kept entirely out of the path.
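As a rough, non-authoritative sketch of the deployment-stage step, the translation of static graph features into a communication primitive blueprint might look as follows; all names, the flat-tensor layout, and the binary-tree topology are illustrative assumptions, not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class StaticFeatures:
    shard_dim: int      # dimension along which the tensor is sharded
    shard_count: int    # number of shards (tensor-parallel degree)
    collective: str     # "AllReduce" | "AllGather" | "ReduceScatter"

@dataclass
class Blueprint:
    tree: dict          # communication tree topology: rank -> child ranks
    shard_plan: list    # per-rank (offset, length) into the flat tensor
    slot_table: dict    # lock-free receive-slot pre-allocation: rank -> slot

def build_blueprint(f: StaticFeatures, elems_per_shard: int) -> Blueprint:
    ranks = range(f.shard_count)
    # illustrative binary tree: rank r forwards to ranks 2r+1 and 2r+2
    tree = {r: [c for c in (2 * r + 1, 2 * r + 2) if c < f.shard_count]
            for r in ranks}
    shard_plan = [(r * elems_per_shard, elems_per_shard) for r in ranks]
    # one slot reserved per peer at deploy time, so the hot path never locks
    slot_table = {r: r for r in ranks}
    return Blueprint(tree, shard_plan, slot_table)
```

Computing the tree, shard offsets, and slot assignments once at deployment is what lets the inference-time path skip all per-call setup.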

Inventors

  • LI HUI
  • HAO ZHIKUN
  • LIU XIANZHENG
  • ZHOU CHANGQING
  • CHEN HAO

Assignees

  • 翼华科技(北京)股份有限公司
  • 北京翼华云网科技有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-04-08

Claims (10)

  1. A collective communication offloading method, comprising: in a model deployment stage, a distributed inference controller generates a communication primitive blueprint based on a computational-graph description file of a tensor-parallel inference model and transmits the blueprint to a data processing unit corresponding to the tensor-parallel inference model, and the data processing unit establishes a hardware-level communication context semantic environment based on the communication primitive blueprint; in a model inference stage, an inference computation unit performs forward-propagation computation based on the tensor-parallel inference model and sends a trigger signal to the data processing unit when computation reaches a communication boundary; and when the data processing unit receives the trigger signal, it executes in parallel, based on the hardware-level communication context semantic environment, a DMA pipeline that pulls tensor-shard data to be synchronized and an RDMA pipeline that sends tensor-shard data to be synchronized, performs aggregate computation on all tensor-shard data to be synchronized, writes the aggregate computation result back to the inference computation unit, and triggers the inference computation unit to continue forward-propagation computation based on the aggregate computation result.
  2. The collective communication offloading method of claim 1, wherein generating the communication primitive blueprint based on the computational-graph description file of the tensor-parallel inference model comprises: reading the computational-graph description file of the tensor-parallel inference model and parsing it to obtain static characteristic information, wherein the static characteristic information comprises tensor-shard information, collective communication mode information, and communication participant information; the tensor-shard information comprises the shard dimension, shard count, and shard index; the collective communication mode information comprises AllReduce, AllGather, and ReduceScatter; and the communication participant information comprises the identifiers of the inference computation units and data processing units participating in the communication; and generating, based on the static characteristic information, a communication primitive blueprint comprising a communication tree topology, a data sharding strategy, and a lock-free memory-slot pre-allocation table.
  3. The collective communication offloading method of claim 1, wherein establishing the hardware-level communication context semantic environment based on the communication primitive blueprint comprises performing the following communication-semantic-context solidifying operations based on the communication primitive blueprint to form the hardware-level communication context semantic environment: establishing a receive-slot pool in local memory; establishing an RDMA one-sided-write target address mapping between each receive slot's physical address and a pre-registered memory address of the peer data processing unit; solidifying the communication tree topology and generating a hardware forwarding flow table; and initializing a collective communication state machine.
  4. The collective communication offloading method of claim 1, wherein sending the trigger signal to the data processing unit when computation reaches the communication boundary comprises: when computation reaches the communication boundary, sending a single-write trigger signal to the data processing unit through a lightweight doorbell register.
  5. The collective communication offloading method of claim 1, wherein the DMA pipeline that pulls tensor-shard data to be synchronized comprises pulling the tensor-shard data to be synchronized from the inference computation unit through its PCIe DMA engine and storing it in a send buffer; and the RDMA pipeline that sends tensor-shard data to be synchronized comprises initiating RDMA one-sided-write operations based on the communication tree topology, the preconfigured hardware forwarding flow table, and the RDMA one-sided-write target address mapping in the hardware-level communication context semantic environment, and writing the tensor-shard data to be synchronized in the send buffer into preset receive slots of the peer data processing unit.
  6. The collective communication offloading method of claim 1, wherein performing aggregate computation on all tensor-shard data to be synchronized comprises: after detecting that every receive slot has received all of its tensor-shard data to be synchronized, triggering an aggregation computation unit in the data processing unit to perform the following computation according to the communication primitive type: if the type is AllReduce, performing element-wise summation or averaging over the tensor-shard data to be synchronized in the receive slots; if the type is AllGather, splicing the tensor-shard data to be synchronized in the receive slots into a complete tensor at preset offsets; and if the type is ReduceScatter, retaining only the locally required portion of the reduction result after the tensor-shard data to be synchronized in the receive slots has been reduced.
  7. The collective communication offloading method of claim 1, wherein writing the aggregate computation result back to the inference computation unit and triggering the inference computation unit to continue forward-propagation computation based on the aggregate computation result comprises: writing the aggregate computation result back to the inference computation unit through the PCIe DMA engine, and sending the inference computation unit a lightweight doorbell-register completion notification to inform it to proceed with the next stage of forward-propagation computation.
  8. A collective communication offloading system, comprising: a distributed inference controller, configured to generate, in a model deployment stage, a communication primitive blueprint based on a computational-graph description file of a tensor-parallel inference model and to transmit the blueprint to a data processing unit corresponding to the tensor-parallel inference model; the data processing unit, configured to establish a hardware-level communication context semantic environment based on the communication primitive blueprint received from the distributed inference controller in the model deployment stage, and, upon receiving a trigger signal in a model inference stage, to execute in parallel, based on that environment, a DMA pipeline that pulls tensor-shard data to be synchronized and an RDMA pipeline that sends it, perform aggregate computation on all tensor-shard data to be synchronized, and write the aggregate computation result back to the inference computation unit; and the inference computation unit, configured to perform forward-propagation computation based on the tensor-parallel inference model in the model inference stage, to send the trigger signal to the data processing unit when computation reaches a communication boundary, and to continue forward-propagation computation based on the aggregate computation result written back by the data processing unit.
  9. An electronic device comprising a memory, a processor, a distributed inference controller, at least one data processing unit, and at least one inference computation unit, wherein the memory stores computer instructions executable by the processor which, when executed by the processor, cause the distributed inference controller, the at least one data processing unit, and the at least one inference computation unit to cooperatively perform the collective communication offloading method of any of claims 1-7.
  10. A computer-readable storage medium storing computer instructions which, when executed by a processor, implement the collective communication offloading method of any of claims 1-7.
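The three collective primitives enumerated in claim 6 can be illustrated as a pure function over per-peer receive slots. This is a simulation sketch only: plain Python lists stand in for tensor shards, and the function and parameter names are hypothetical, not the patent's implementation.

```python
def aggregate(slots, primitive, my_rank=0, average=False):
    # slots: one equal-length list of numbers per peer's receive slot
    if primitive == "AllReduce":
        # element-wise sum (optionally mean) across all peers' shards
        out = [sum(col) for col in zip(*slots)]
        return [v / len(slots) for v in out] if average else out
    if primitive == "AllGather":
        # splice the shards at their preset offsets into the full tensor
        return [x for shard in slots for x in shard]
    if primitive == "ReduceScatter":
        # reduce everything, then keep only this rank's portion
        reduced = [sum(col) for col in zip(*slots)]
        n = len(reduced) // len(slots)
        return reduced[my_rank * n:(my_rank + 1) * n]
    raise ValueError(f"unknown primitive: {primitive}")
```

For two peers holding shards `[1, 2]` and `[3, 4]`, AllReduce yields `[4, 6]`, AllGather yields `[1, 2, 3, 4]`, and ReduceScatter gives each rank one element of the reduced result.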

Description

Aggregate communication offloading method, system, device and medium

Technical Field

The application relates to the field of computer technology, and in particular to a collective communication offloading method, system, device, and medium, particularly suited to tensor-parallel distributed inference of large language models.

Background

As the scale of AI (Artificial Intelligence) models such as large language models grows, tensor-parallel inference has become a key technique for distributed inference. In this scenario, frequent, fine-grained, tightly synchronized collective communication is required among multiple computing nodes. In the conventional mode, where a CPU (Central Processing Unit) initiates the communication and a network card executes it, control-plane latency is the main bottleneck on ultra-high-speed networks. Existing collective communication offloading schemes based on the DPU (Data Processing Unit) offload the data plane, but the control plane still depends on the host CPU to initiate communication-primitive configuration and state management, and their MPI (Message Passing Interface) message-passing semantics do not match the tensor-parallel communication pattern of AI inference; as a result, microsecond-level tight synchronization cannot be achieved, host CPU resources are consumed, and network bandwidth utilization is limited.
Disclosure of Invention

The application provides a collective communication offloading method, system, device, and medium to address the high end-to-end latency, mismatched communication patterns, and limited bandwidth utilization of existing offloading schemes. In one aspect, the application provides a collective communication offloading method comprising: in a model deployment stage, a distributed inference controller generates a communication primitive blueprint based on a computational-graph description file of a tensor-parallel inference model and transmits the blueprint to the corresponding data processing unit; in a model inference stage, an inference computation unit performs forward-propagation computation based on the tensor-parallel inference model and, when computation reaches a communication boundary, sends a trigger signal to the data processing unit; when the data processing unit receives the trigger signal, it executes in parallel, based on a hardware-level communication context semantic environment, a DMA (Direct Memory Access) pipeline that pulls tensor-shard data to be synchronized and an RDMA (Remote Direct Memory Access) pipeline that sends it, performs aggregate computation on all tensor-shard data to be synchronized, writes the result back to the inference computation unit, and triggers it to continue forward-propagation computation based on that result.
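The doorbell trigger and the two overlapped pipelines described above can be simulated in a few lines. In this sketch a `threading.Event` stands in for the doorbell register and an in-process queue and dict stand in for the real DMA and RDMA engines, so every name here is an illustrative assumption rather than the patent's design:

```python
import threading
import queue

def offload_round(shards, recv_slots, doorbell):
    """Simulated round: wait for the compute unit's doorbell write, then
    overlap a 'DMA pull' of shards with 'RDMA writes' into preset slots."""
    doorbell.wait()                      # single-write trigger at the boundary
    staged = queue.Queue()

    def dma_pull():                      # pipeline 1: GPU memory -> send buffer
        for chunk in shards:
            staged.put(chunk)
        staged.put(None)                 # end-of-stream marker

    def rdma_send():                     # pipeline 2: send buffer -> peer slots
        i = 0
        while (chunk := staged.get()) is not None:
            recv_slots[i] = chunk        # stands in for a one-sided RDMA write
            i += 1

    t1 = threading.Thread(target=dma_pull)
    t2 = threading.Thread(target=rdma_send)
    t1.start(); t2.start()
    t1.join(); t2.join()
```

Because the sender drains the queue while the puller is still filling it, chunk `k` can be "on the wire" while chunk `k+1` is still being pulled, which is the overlap the dual-pipeline design is after.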
Optionally, generating the communication primitive blueprint based on the computational-graph description file of the tensor-parallel inference model includes: reading the computational-graph description file of the tensor-parallel inference model and parsing it to obtain static characteristic information, wherein the static characteristic information comprises tensor-shard information, collective communication mode information, and communication participant information; the tensor-shard information comprises the shard dimension, shard count, and shard index; the collective communication mode information comprises AllReduce, AllGather, and ReduceScatter; and the communication participant information comprises the identifiers of the inference computation units and data processing units participating in the communication; and generating, based on the static characteristic information, a communication primitive blueprint comprising a communication tree topology, a data sharding strategy, and a lock-free memory-slot pre-allocation table. Optionally, establishing the hardware-level communication context semantic environment based on the communication primitive blueprint includes performing the following communication-semantic-context solidifying operations based on the blueprint: establishing a receive-slot pool in local memory; establishing an RDMA one-sided-write target address mapping between each receive slot's physical address and a pre-registered memory address of the peer data processing unit; solidifying the communication tre