
CN-122001885-A - Network card-based RDMA (remote direct memory access) aggregate communication offloading device and equipment


Abstract

The invention discloses a network card-based computing network convergence RDMA (remote direct memory access) aggregate communication offloading device and equipment. The device comprises a PCIe (peripheral component interconnect express) interface, a descriptor processing component, an aggregate communication offloading engine, a DMA (direct memory access) component, an RDMA protocol processing component and a network interface. The descriptor processing component submits the metadata of aggregate communication tasks written by the host to the aggregate communication offloading engine; the aggregate communication offloading engine is connected with the host through the PCIe interface and with the network interface through the RDMA protocol processing component; and the DMA component is respectively connected with the PCIe interface, the aggregate communication offloading engine and the RDMA protocol processing component. The invention aims to realize flexible and efficient aggregate communication offloading and to improve performance in scenarios where multiple aggregate communications run concurrently.

Inventors

  • CHANG Junsheng
  • HE Chucai
  • XIE Zihao
  • GUO Yang
  • LIU Sheng
  • LEI Fei
  • PAN Guoteng
  • ZHOU Hongwei

Assignees

  • National University of Defense Technology of the Chinese People's Liberation Army

Dates

Publication Date
2026-05-08
Application Date
2026-04-10

Claims (10)

  1. A network card-based computing network convergence RDMA aggregate communication offloading device, characterized by comprising a PCIe interface, a descriptor processing component, an aggregate communication offloading engine, a DMA component, an RDMA protocol processing component and a network interface, wherein the descriptor processing component is used for submitting metadata of aggregate communication tasks written by a host to the aggregate communication offloading engine, the aggregate communication offloading engine is connected with the host through the PCIe interface and with the network interface through the RDMA protocol processing component, and the DMA component is respectively connected with the PCIe interface, the aggregate communication offloading engine and the RDMA protocol processing component.
  2. The network card-based computing network convergence RDMA aggregate communication offload device of claim 1, wherein the aggregate communication offload engine comprises an aggregate communication core component, an aggregate communication context component, an event processing component, an aggregate communication data management component, an aggregate communication execution processing component and a protocol computation component; the aggregate communication execution processing component is used for parsing control command words issued by the aggregate communication core component: if a control command word is a data forwarding operation, DMA and RDMA requests are initiated to establish a direct path from the DMA component or the buffer memory to the RDMA protocol processing component and complete the data receiving and forwarding operations of the aggregate communication task; if the control command word is a computation operation, a computation request is initiated to the protocol computation component, which performs the aggregation operation on the input data stream and stores the result in the buffer memory; finally, the result is written back to the host through DMA or forwarded directly to the network through an initiated RDMA request; the aggregate communication core component is respectively connected with the aggregate communication context component, the event processing component and the aggregate communication data management component; the aggregate communication context component is used for recording the communication context of the aggregate communication tasks; the event processing component is used for processing the event queues of the aggregate communication tasks; and the aggregate communication data management component is used for managing the data of the aggregate communication tasks.
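The two-way dispatch on control command words described in claim 2 can be modeled in software. In the sketch below, the command fields, the callables standing in for the DMA/RDMA components, and the buffer layout are all hypothetical illustrations, not the patented hardware interface.

```python
def execute_command(cmd, dma_read, rdma_send, reduce_fn, buffer):
    """Dispatch one control command word from the aggregate communication
    core component: forwarding commands model the direct DMA -> RDMA path,
    computation commands reduce the input stream into the buffer memory."""
    if cmd["op"] == "forward":
        # Data forwarding operation: issue a DMA request, then an RDMA
        # request, moving the data without host involvement.
        data = dma_read(cmd["src"])
        rdma_send(cmd["dst"], data)
        return "forwarded"
    if cmd["op"] == "compute":
        # Computation operation: aggregate the input data stream and store
        # the result in the buffer memory.
        buffer[cmd["result"]] = reduce_fn(cmd["inputs"])
        return "computed"
    raise ValueError(f"unknown control command word: {cmd['op']}")
```

A forwarding command thus never touches the buffer, while a computation command never touches the network path, mirroring the two branches of the claim.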
  3. The network card-based RDMA aggregate communication offload device of claim 2, wherein the protocol computation component comprises an interconnected computation control module and arithmetic logic unit array, the arithmetic logic unit array supporting some or all of the reduction operations summation, maximum, minimum, bitwise AND, bitwise OR and bitwise XOR.
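The reduction semantics claim 3 assigns to the arithmetic logic unit array can be stated precisely in software. The operation names and stream layout below are illustrative assumptions; the sketch only fixes the element-wise meaning of each operation.

```python
from functools import reduce

# Candidate reduction operations from claim 3, applied pairwise.
REDUCE_OPS = {
    "sum": lambda a, b: a + b,
    "max": max,
    "min": min,
    "band": lambda a, b: a & b,   # bitwise AND
    "bor": lambda a, b: a | b,    # bitwise OR
    "bxor": lambda a, b: a ^ b,   # bitwise XOR
}

def reduce_streams(op, streams):
    """Element-wise reduction across equal-length input data streams,
    as the ALU array would apply it to incoming messages."""
    fn = REDUCE_OPS[op]
    return [reduce(fn, column) for column in zip(*streams)]
```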
  4. The network card-based computing network convergence RDMA aggregate communication offload device of claim 2, wherein the aggregate communication data management component comprises a data buffer, a shared buffer queue management module and a shared buffer queue, the shared buffer queue comprising a plurality of cache entries; the aggregate communication data management component caches the data of aggregate communication tasks in the cache entries and fetches the cached data from the cache entries into the data buffer to realize the data processing of the aggregate communication tasks; the fields of each cache entry comprise a transaction number JobID, a message sequence number SeqID, a cache valid signal Valid, a cache block address Address and a cache block size Size.
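The cache entry layout of claim 4 maps naturally onto a record type. Field names follow the claim; the Python types and the `lookup` helper are assumptions added for illustration.

```python
from dataclasses import dataclass

@dataclass
class CacheEntry:
    job_id: int    # transaction number JobID owning this cached block
    seq_id: int    # message sequence number SeqID within the task
    valid: bool    # cache valid signal Valid
    address: int   # cache block address Address in the data buffer
    size: int      # cache block size Size

def lookup(entries, job_id, seq_id):
    """Find a valid cached block for (JobID, SeqID), as the data management
    component would when fetching data into the data buffer."""
    for e in entries:
        if e.valid and e.job_id == job_id and e.seq_id == seq_id:
            return e
    return None
```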
  5. The network card-based RDMA aggregate communication offload device of claim 2, wherein the aggregate communication offload engine performs RDMA aggregate communication offloading through the following steps: S101, initializing an aggregate communication descriptor queue CCD, an operation completion state queue OCS and a handshake message queue RM, wherein the aggregate communication descriptor queue CCD records the descriptors of aggregate communication tasks, the operation completion state queue OCS records the completion states of aggregate communication tasks, and the handshake message queue RM records the handshake information of other nodes in the aggregate communication tasks; S102, if the operation completion state queue OCS is not empty, extracting the transaction number JobID of an aggregate communication task from it and subtracting 1 from the operand count Op_num of the task corresponding to that JobID; if the decremented Op_num is zero, returning the completion state of the task corresponding to the JobID and jumping to step S105, otherwise jumping directly to step S105; if the queue is empty, executing the next step; S103, if the handshake message queue RM is not empty, extracting the transaction number JobID and message type Type of an aggregate communication task from it; if the message type is an initiating handshake RendzInit, constructing and issuing a control command word for the task corresponding to the JobID to start its processing and jumping to step S105; if the message type is a handshake completion RendzDone, subtracting 1 from the operand count Op_num of the task corresponding to the JobID, returning the completion state of the task if the decremented Op_num is zero and jumping to step S105, otherwise jumping directly to step S105; if the queue is empty, executing the next step; S104, if the aggregate communication descriptor queue CCD is not empty, extracting a descriptor from it, parsing the descriptor to obtain the transaction number JobID and the aggregate communication type Collecting_Type, creating a communication context for the task corresponding to the JobID, executing the corresponding aggregate communication task according to the aggregate communication type, saving the communication context of the task, and then jumping to step S105; S105, judging whether the reset signal RESET is valid: if so, ending and exiting; otherwise, returning to step S102 to continue iterating.
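The S101-S105 polling loop of claim 5 can be sketched as a software model. Queue contents are simplified tuples, the issued control command word is reduced to recording the started JobID, and the bounded iteration count stands in for the RESET check of S105; all of these are assumptions for illustration.

```python
from collections import deque

def run_engine(ccd, ocs, rm, op_num, max_iters=1000):
    """Model of the engine loop: `ccd`, `ocs`, `rm` are deques for the three
    queues of S101; `op_num` maps JobID -> outstanding operation count."""
    started, completed = [], []
    for _ in range(max_iters):
        if ocs:                               # S102: completion states first
            job = ocs.popleft()
            op_num[job] -= 1
            if op_num[job] == 0:
                completed.append(job)         # return the completion state
            continue                          # jump to S105 (loop check)
        if rm:                                # S103: handshake messages next
            job, msg = rm.popleft()
            if msg == "RendzInit":
                started.append(job)           # issue a control command word
            elif msg == "RendzDone":
                op_num[job] -= 1
                if op_num[job] == 0:
                    completed.append(job)
            continue
        if ccd:                               # S104: new descriptors last
            job, collecting_type = ccd.popleft()
            op_num.setdefault(job, 1)         # create a communication context
            started.append(job)
            continue
        break                                 # all queues drained: stop
    return started, completed
```

The ordering OCS before RM before CCD reproduces the claim's priority: in-flight completions drain ahead of handshakes, and new descriptors are only admitted when both other queues are empty.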
  6. The network card-based computing network convergence RDMA aggregate communication offload device of claim 5, wherein when the corresponding aggregate communication task is executed according to the aggregate communication type in step S104, the execution of the current aggregate communication task is suspended and switched to other aggregate communication tasks when an external event the current task must wait for is not ready, thereby realizing concurrent processing of multiple aggregate communication tasks, and the execution of the current task is resumed once the awaited external event is ready, until the current aggregate communication task is completed and its result is written back to the host through a DMA operation.
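The suspend/switch/resume behaviour of claim 6 resembles cooperative multitasking, which a generator-based sketch can illustrate: a task waiting on an external event yields control, the engine runs other tasks, and the task resumes once the event is ready. Task structure, event representation, and the round-robin engine below are all illustrative assumptions.

```python
def collective_task(job_id, events):
    """Runs until every awaited external event is ready."""
    for ev in events:
        while not ev["ready"]:
            yield               # suspend; engine switches to another task
    return f"job{job_id}:done"  # result would be written back via DMA

def engine(tasks, events):
    """Round-robin over suspended tasks, readying one event per pass to
    model external completions arriving over time."""
    pending, done = list(tasks), []
    while pending:
        for t in list(pending):
            try:
                next(t)                      # resume the task
            except StopIteration as fin:     # task finished
                done.append(fin.value)
                pending.remove(t)
        for ev in events:                    # mark one pending event ready
            if not ev["ready"]:
                ev["ready"] = True
                break
    return done
```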
  7. The network card-based computing network convergence RDMA aggregate communication offload device of claim 5, wherein when the descriptor is parsed and the transaction number JobID and aggregate communication type Collecting_Type are obtained in step S104, the parsed aggregate communication type Collecting_Type is one of Broadcast, Scatter, Gather, AllGather, Reduce, ReduceScatter, AllReduce and AlltoAll aggregate communication.
  8. The network card-based RDMA aggregate communication offload device of claim 5, characterized in that the fields of each record in the operation completion state queue OCS comprise the transaction number JobID and operand count Op_num of an aggregate communication task, the transaction number JobID being used for distinguishing different aggregate communication tasks and the operand count Op_num being used for recording the number of operations of the task that are not yet completed, initialized to the total number of operations of the task; the fields of each record in the handshake message queue RM comprise the transaction number JobID and message type Type of an aggregate communication task, the message type Type recording the type of the handshake message, namely an initiating handshake RendzInit or a handshake completion RendzDone; the fields of each record in the aggregate communication descriptor queue CCD comprise the transaction number JobID, the aggregate communication type Collecting_Type, the data volume Count, the operator Op_Rator, the destination addresses Destination_Address 0 and Destination_Address 1, which record the destination addresses of the aggregate communication task, and the result address Result_Address, which records the memory address of the operation result of the aggregate communication task.
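The record layouts of claim 8 that are clearly recoverable, the OCS and RM queues, can be written down as record types. Field names follow the claim; the Python types are assumptions.

```python
from dataclasses import dataclass
from enum import Enum

class MsgType(Enum):
    RENDZ_INIT = "RendzInit"   # initiating handshake
    RENDZ_DONE = "RendzDone"   # handshake completion

@dataclass
class OcsRecord:
    """One record of the operation completion state queue OCS."""
    job_id: int   # JobID: distinguishes aggregate communication tasks
    op_num: int   # outstanding operations, initialized to the task total

@dataclass
class RmRecord:
    """One record of the handshake message queue RM."""
    job_id: int
    msg_type: MsgType
```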
  9. The network card-based RDMA aggregate communication offload device of claim 2, wherein the communication context of an aggregate communication task comprises some or all of a transaction number JobID, communication group member information, a data buffer address, a source address, a destination address, a message size, a current progress and a valid bit.
  10. A computer device comprising a microprocessor and a network communication device communicatively connected to each other, wherein the network communication device is the network card-based computing network convergence RDMA aggregate communication offload device of any of claims 1-9.

Description

Network card-based RDMA (remote direct memory access) aggregate communication offloading device and equipment

Technical Field

The invention relates to digital information transmission technology in the field of computer communication, and in particular to a network card-based computing network convergence RDMA (remote direct memory access) aggregate communication offloading device and equipment.

Background

The aggregate communication offloading mechanism migrates tasks such as communication scheduling and data processing, originally undertaken by the host CPU, to network hardware, and can be divided into three offloading schemes based on a switch, a data processing unit (DPU), or a network card. Existing network card-based aggregate communication hardware offloading mechanisms fall into two implementation modes: solidified aggregate communication hardware circuits and integrated lightweight aggregate communication computing cores. (1) The solidified hardware circuit mode completes aggregate communication offloading by fixing an application-specific integrated circuit (ASIC) in the network card; all communication scheduling, data transmission and protocol computation are handed to the hardware circuit for execution, entirely without software intervention. EasyNet (HE Z et al., "EasyNet: 100 Gbps Network for HLS", 31st International Conference on Field-Programmable Logic and Applications, 2021, pages 197-203) tracks the available data of the FIFOs in each connected TCP/IP stack, and its Reducer reduces the data stored in the FIFOs in a round-robin fashion; however, the message size is limited to the FIFO capacity, extending the operation in hardware is costly, and only static aggregate communication offloading is implemented.
SMI (DE MATTEIS T et al., "Streaming Message Interface: High-Performance Distributed Memory Programming on Reconfigurable Hardware", Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pages 1-33) implements broadcast and reduction units on FPGAs, which launch collectives directly through a streaming collective interface, but still through static hardware logic. NetFPGA (ARAP O et al., "Offloading Collective Operations to Programmable Logic", IEEE Micro, volume 37, issue 5, pages 52-60) supports MPI offload requests through predefined UDP ports, with data reduced and routed through a static aggregation processing engine. The Tianhe interconnection network interface (XU Jinbo et al., "Width order trigger mechanism and data caching method for aggregate communication hardware offload", Journal of National University of Defense Technology, 2025, volume 47, issue 6, pages 13-23) designs a trigger-based communication offload mechanism in which aggregate communication operations are offloaded into the interconnection network for autonomously triggered execution; the computational logic in the interconnection network interface can support reduction operations, but the supported message granularity is small. (2) The integrated lightweight aggregate communication computing core mode adds an embedded processor (such as MicroBlaze) to the network card and controls the aggregate communication execution flow through software programming; the flexibility of software makes up for the functional limitations of pure hardware.
The Myrinet network interface (BUNTINAS D et al., "Performance benefits of NIC-based barrier on Myrinet/GM", Proceedings of the Workshop on Communication Architecture for Clusters (CAC), 2001, page 166) contains an embedded processor that supports offloading of specific aggregate communication operations, but updating the aggregate communication operations requires modifying the control programs running on the network interface. The collective communication operations and point-to-point operations of TMD-MPI (SALDAÑA M et al., "MPI as a Programming Model for High-Performance Reconfigurable Computers", ACM Transactions on Reconfigurable Technology and Systems, volume 3, issue 4, pages 1-29) rely on the software implementation of the embedded MicroBlaze processor, providing offloading flexibility, but its performance improvement is limited by the processor frequency. Christgau S et al. ("A First Step towards Support for MPI Partitioned Communication on SYCL-programmed FPGAs", IEEE/ACM International Workshop on Heterogeneous High-performance Reconfigurable Computing, 2022, pages 9-17) use a soft-core processor (Intel's Nios II) on an FPGA to coordinate and manage communication operations, use SYCL, a C++ ba