CN-122019450-A - Heterogeneous GPU data transmission system and method
Abstract
The application relates to the technical field of robot middleware and distributed computing, and in particular to a heterogeneous GPU data transmission system and method. The system comprises a sending end node, a receiving end node and a topology-aware scheduling unit. The sending end node comprises a first GPU and can call a cross-platform tensor encapsulation tool library to encapsulate tensor data into a memory object conforming to the Apache Arrow format; the receiving end node comprises a second GPU heterogeneous with the first GPU and can call the same tool library to parse and load the data; and the topology-aware scheduling unit perceives the physical position relation between the nodes and adaptively selects a same-machine zero-copy transmission path or a cross-machine network transmission path for the memory object. The sending end node sends the memory object through the selected path, and the receiving end node parses the object and loads the data into the second GPU using the tool library. The application realizes efficient, transparent and position-independent data transmission among heterogeneous GPUs.
Inventors
- ZHANG XIAODONG
- MA WEI
- YANG ZIJIANG
Assignees
- University of Science and Technology of China (中国科学技术大学)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-13
Claims (10)
- 1. A heterogeneous GPU data transmission system, comprising a sending end node, a receiving end node and a topology-aware scheduling unit, characterized in that: the sending end node is provided with a first GPU and is configured to call a cross-platform tensor encapsulation tool library so as to encapsulate tensor data from the first GPU into a memory object conforming to the Apache Arrow format; the topology-aware scheduling unit is used for perceiving the physical position relation between the sending end node and the receiving end node, and selecting a same-machine zero-copy transmission path or a cross-machine network transmission path for the encapsulated memory object based on the position relation; the sending end node is further configured to send the memory object conforming to the Apache Arrow format to the receiving end node through the path selected by the topology-aware scheduling unit; and the receiving end node is provided with a second GPU and is configured to call the cross-platform tensor encapsulation tool library so as to load the data in the memory object into the second GPU; wherein the first GPU is heterogeneous with the second GPU.
- 2. The system of claim 1, wherein the cross-platform tensor encapsulation tool library encapsulating tensor data into a memory object conforming to the Apache Arrow format comprises: the cross-platform tensor encapsulation tool library extracting the memory pointer, strides, dimensions and data type of the tensor; and, based on the extracted information, the cross-platform tensor encapsulation tool library constructing a memory descriptor in the Apache Arrow format.
- 3. The system of claim 1, wherein the topology-aware scheduling unit selecting a transmission path based on the physical position relation comprises: if the process identifiers of the sending end node and the receiving end node are the same, the topology-aware scheduling unit determines that the transmission path is in-process pointer transfer; if the machine identifiers of the sending end node and the receiving end node are the same but the process identifiers are different, the topology-aware scheduling unit determines that the transmission path is the same-machine zero-copy transmission path; and if the machine identifiers of the sending end node and the receiving end node are different, the topology-aware scheduling unit determines that the transmission path is the cross-machine network transmission path.
- 4. The system according to claim 1, wherein: when the selected transmission path is the same-machine zero-copy transmission path, the sending end node mounts the memory object conforming to the Apache Arrow format to a shared memory segment, and transmits the identifier of the shared memory segment to the receiving end node through inter-process communication.
- 5. The system according to claim 1, wherein: when the selected transmission path is the cross-machine network transmission path, the sending end node transmits the memory object conforming to the Apache Arrow format based on the Zenoh communication protocol, and no serialization operation needs to be performed on the memory object during transmission.
- 6. A heterogeneous GPU data transmission method performed by the sending end node in the system of claim 1, comprising: encapsulating tensor data in a first GPU contained in the sending end node into a memory object conforming to the Apache Arrow format; acquiring position information of a receiving end node; acquiring a transmission path based on the position information, wherein the transmission path is determined by the topology-aware scheduling unit in the system based on the position information and comprises a same-machine zero-copy transmission path or a cross-machine network transmission path; and transmitting the memory object conforming to the Apache Arrow format to the receiving end node through the acquired transmission path, so that the data in the memory object can be loaded into a second GPU contained in the receiving end node; wherein the first GPU is heterogeneous with the second GPU.
- 7. The method of claim 6, wherein the acquiring position information of the receiving end node comprises: acquiring the position information by querying a global registry, wherein the global registry records the machine identifier, the process identifier and the GPU device information of the nodes in the system.
- 8. The method of claim 6, wherein encapsulating the tensor data in the first GPU contained in the sending end node into a memory object conforming to the Apache Arrow format comprises: calculating a logical offset of the tensor data when the tensor data is stored non-contiguously; and describing the tensor data and its logical offset in the form of an Apache Arrow RecordBatch.
- 9. A heterogeneous GPU data transmission method performed by the receiving end node in the system of claim 1, comprising: receiving a memory object from the sending end node that conforms to the Apache Arrow format, wherein the memory object is obtained by encapsulating tensor data in a first GPU contained in the sending end node; and loading the data in the memory object into a second GPU contained in the receiving end node by using the cross-platform tensor encapsulation tool library; wherein the first GPU is heterogeneous with the second GPU.
- 10. The method of claim 9, wherein loading the data in the memory object into the second GPU using the cross-platform tensor encapsulation tool library comprises: parsing the memory descriptor in the memory object conforming to the Apache Arrow format by using the cross-platform tensor encapsulation tool library, and directly mapping the described memory into the address space of the second GPU for access by way of memory mapping.
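The routing rule of claim 3 can be illustrated with a short sketch. The names `NodeInfo` and `select_path` are hypothetical and do not appear in the patent; the sketch only mirrors the claimed decision order (same process → in-process pointer transfer; same machine, different process → zero-copy shared memory; different machines → cross-machine network).

```python
from dataclasses import dataclass

@dataclass
class NodeInfo:
    """Hypothetical node record, holding the identifiers claim 3 compares."""
    machine_id: str
    process_id: int

def select_path(sender: NodeInfo, receiver: NodeInfo) -> str:
    """Pick a transmission path following the decision order of claim 3."""
    if sender.machine_id == receiver.machine_id:
        if sender.process_id == receiver.process_id:
            # Same process: hand the memory pointer over directly.
            return "in-process-pointer"
        # Same host, different process: shared-memory zero-copy path.
        return "same-machine-zero-copy"
    # Different hosts: network transport (per claim 5, e.g. Zenoh).
    return "cross-machine-network"
```

Note that the checks must be nested in this order: comparing process identifiers alone is not enough, since two distinct machines may reuse the same process id.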
Description
Heterogeneous GPU data transmission system and method
Technical Field
The application relates to the technical field of robot middleware and distributed computing, in particular to a heterogeneous GPU data transmission system and method.
Background
In the fields of robot development, autonomous driving and high-performance computing, integrating graphics processing units (GPUs, collectively referred to as "heterogeneous GPUs") that come from different vendors (e.g., NVIDIA, AMD, and others) or use different computing architectures, in order to cope with the demand for high-frequency, large-scale tensor data exchange in a distributed robotic system or AI computing cluster, has become a key means of improving computing performance. The core premise of efficient coordination is efficient and transparent data transmission among the heterogeneous GPUs: high efficiency requires low latency and high throughput, and transparency requires shielding the upper-layer computing tasks from lower-layer hardware differences and network-topology complexity. However, the prior art has the following typical drawbacks in achieving this goal. First, the ecosystems are closed, and heterogeneous hardware is difficult to interconnect directly. The native high-speed interconnect technologies provided by the mainstream GPU vendors (such as NVIDIA's NVLink) are proprietary protocols with dedicated physical interfaces; different vendors and different generations of architectures are mutually incompatible, so a direct data channel cannot be established between heterogeneous GPUs, forming hardware and protocol barriers that are difficult to cross. Second, the common transmission path is redundant and the performance loss is significant.
To bypass these ecosystem barriers, the industry generally adopts a generic path that relays data through CPU memory, in which the data undergoes multiple copies along the chain "sender GPU memory → host memory → network → host memory → receiver GPU memory". Meanwhile, to adapt to network transmission, high-overhead serialization and deserialization operations must be performed on the CPU side. This process incurs large data-copy overhead and CPU computation burden, significantly increases transmission latency, reduces effective bandwidth, and becomes a system performance bottleneck. Third, communication interfaces are fragmented, and development and deployment costs are high. Existing distributed frameworks lack a unified communication abstraction, so a developer must explicitly select and call different low-level communication interfaces (such as CUDA IPC, Unix domain sockets and network sockets) according to the physical deployment relationship of the sender and receiver (such as in-process, same-machine cross-process, or cross-machine network). Business logic is thereby deeply coupled with the hardware topology, the system lacks location transparency, and the costs of developing, debugging, maintaining and migrating the distributed application across environments rise greatly. Therefore, when addressing heterogeneous GPU data transmission requirements, prior-art schemes have long been limited by the three problems of non-interoperable hardware, high performance overhead and high development complexity, and lack a technical solution that can systematically resolve these defects at the same time.
Disclosure of Invention
The embodiments of the application provide a heterogeneous GPU data transmission scheme, aiming to solve the problems in the prior art that heterogeneous GPUs cannot communicate directly due to closed hardware ecosystems, that performance overhead is high due to the redundancy and serialization cost of the generic transmission path, and that development complexity is high due to fragmented communication interfaces and the lack of location transparency. A first aspect of the embodiments of the present application provides a heterogeneous GPU data transmission system, comprising a sending end node, a receiving end node, and a topology-aware scheduling unit, wherein: the sending end node is provided with a first GPU and is configured to call a cross-platform tensor encapsulation tool library so as to encapsulate tensor data from the first GPU into a memory object conforming to the Apache Arrow format; the topology-aware scheduling unit is used for perceiving the physical position relation between the sending end node and the receiving end node, and selecting a same-machine zero-copy transmission path or a cross-machine network transmission path for the encapsulated memory object based on the position relation; the sending end node is further configured to send the memory object conforming to the Apache Arrow format to the receiving end node through the path selected by the topology-aware scheduling unit; the receiving end node is provided with a second GPU and is configured to call the cross-platform tensor encapsulation tool library so as to load the data in the memory object into the second GPU; wherein the first GPU is heterogeneous with the second GPU.
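The metadata extraction that underlies the encapsulation step (memory pointer, strides, dimensions and data type, per claim 2) can be sketched as follows. This is a minimal illustration using a NumPy array as a stand-in for a GPU tensor; `TensorDescriptor` and `describe_tensor` are illustrative names, and a real implementation would emit an Apache Arrow buffer descriptor over device memory rather than a Python dataclass.

```python
from dataclasses import dataclass
from typing import Tuple
import numpy as np

@dataclass
class TensorDescriptor:
    """Illustrative pointer-plus-metadata record; no tensor bytes are copied."""
    data_ptr: int             # raw address of the underlying buffer
    shape: Tuple[int, ...]    # dimensions of the tensor
    strides: Tuple[int, ...]  # byte strides per dimension
    dtype: str                # e.g. "<f4" for little-endian float32
    device: str               # hypothetical device tag, e.g. "gpu:0"

def describe_tensor(arr: np.ndarray, device: str = "cpu") -> TensorDescriptor:
    """Extract the metadata that the tool library would place in a descriptor."""
    return TensorDescriptor(
        data_ptr=arr.ctypes.data,   # address only; zero-copy by construction
        shape=arr.shape,
        strides=arr.strides,
        dtype=arr.dtype.str,
        device=device,
    )
```

Because only the pointer and layout metadata are recorded, the descriptor itself is tiny and can be shipped over shared memory or the network while the tensor payload stays in place, which is the premise of the zero-copy paths described above.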