CN-121979832-A - NVMeoF remote storage system and method based on GPU direct command control

CN121979832A

Abstract

The invention discloses an NVMeoF remote storage system and method based on GPU direct command control. I/O processing in an NVMeoF environment is explicitly divided between two modules: NVMe command handling on the GPU and network protocol handling on the CPU, with the two coordinated through a doorbell notification mechanism. The CPU concentrates on the network protocol processing it is suited to, while the GPU concentrates on the NVMe command generation and data processing it is suited to, markedly reducing CPU load. An end-to-end zero-copy data path is built from remote NVMe storage to GPU video memory, eliminating unnecessary intermediate data copies and maximizing data throughput. The invention enables the GPU to access remote NVMe storage devices in an NVMeoF environment with near-local low latency and high bandwidth.

Inventors

  • ZHANG GEN
  • OU YANG
  • SUN YANQIANG
  • ZHANG WEI
  • TANG TAO
  • YUAN YUAN
  • XIE XUCHAO
  • SONG ZHENLONG
  • LI TIEJUN
  • ZHOU TONGQING
  • XING JIANYING
  • LI ZHIXING
  • TAN YUJUAN

Assignees

  • National University of Defense Technology (中国人民解放军国防科技大学)

Dates

Publication Date
2026-05-05
Application Date
2026-04-09

Claims (10)

  1. An NVMeoF remote storage method based on GPU direct command control, characterized in that the method is applied to an NVMeoF remote storage system comprising an initiating host system and a remote target, wherein the initiating host system comprises a GPU domain, a CPU domain and a first RDMA network card, the remote target comprises a local NVMe SSD, NVMe-oF target software and a second RDMA network card, and the method comprises the following steps: an application program of the GPU domain initiates a read request; a corresponding NVMe command is generated in the GPU video memory by a GPU computing core and added to a submission queue SQ, and the CPU domain is then notified through a doorbell; the CPU domain packages the NVMe command in the submission queue SQ into an NVMe over RDMA command frame and submits an RDMA Send request to the first RDMA network card; the first RDMA network card packages the NVMe over RDMA command frame into a network data packet according to the RDMA Send request and sends it to the second RDMA network card; the second RDMA network card receives the network data packet, and the NVMe-oF target software parses the NVMe over RDMA command frame from the packet to obtain the corresponding NVMe command, submits it to the local NVMe SSD and performs a read operation to obtain the data on the local NVMe SSD; the NVMe-oF target software, using an RDMA WRITE operation over the second RDMA network card and the first RDMA network card according to the information of the NVMe over RDMA command frame, sends the data from the local NVMe SSD to the GPU video memory and sends a completion message to the CPU domain after the transfer is complete; the CPU domain generates a completion event and adds it to a completion queue CQ in the GPU video memory; and the GPU domain polls the completion queue CQ through the GPU computing core, confirms that the read operation succeeded when the completion event is detected, and notifies the application program.
  2. The NVMeoF remote storage method based on GPU direct command control according to claim 1, wherein notifying the CPU domain through the doorbell specifically comprises: the GPU computing core performs a write operation on a doorbell register to trigger a CPU interrupt, so that the CPU domain, after responding to the interrupt, reads the NVMe command from the submission queue SQ in the GPU video memory.
  3. The NVMeoF remote storage method based on GPU direct command control according to claim 2, wherein the doorbell register is an MMIO register mapped into the CPU address space.
  4. The NVMeoF remote storage method based on GPU direct command control according to claim 1, wherein generating the corresponding NVMe command in the GPU video memory through the GPU computing core and adding it to the submission queue SQ comprises: constructing the NVMe command at the tail of the submission queue SQ in the GPU video memory; setting the operation code field OPC of the NVMe command to a read operation, setting the starting logical block address field SLBA to the starting logical block address corresponding to the read request, setting the block number field NLB to the data length requested in the read request, and setting the data pointer field DPTR to the remote access key RKey of the data buffer and the address of the data buffer in the GPU video memory; and updating the tail pointer of the submission queue SQ after the fields of the NVMe command are filled in (see the first sketch following the claims).
  5. The NVMeoF remote storage method based on GPU direct command control according to claim 4, wherein the NVMe-oF target software using the RDMA WRITE operation over the second RDMA network card and the first RDMA network card according to the information of the NVMe over RDMA command frame specifically comprises: the NVMe-oF target software sends the data buffer address in the data pointer field DPTR and the remote access key RKey to the second RDMA network card; the second RDMA network card directly writes the data from the local NVMe SSD to the first RDMA network card through RDMA WRITE according to the data buffer address and the remote access key RKey (see the third sketch following the claims); and the first RDMA network card DMA-transfers the data directly into the data buffer of the GPU video memory, so that the transfer of data from the local NVMe SSD completely bypasses the CPU and the system memory of the initiating end.
  6. The NVMeoF remote storage method based on GPU direct command control according to claim 5, further comprising, before the application program initiates the read request: the CPU domain registers a data buffer in the GPU video memory through the RDMA driver, obtains the remote access key RKey corresponding to the data buffer, and provides the remote access key RKey and the data buffer in the GPU video memory to the GPU domain for constructing the NVMe command (see the second sketch following the claims).
  7. The NVMeoF remote storage method based on GPU direct command control according to claim 6, wherein the GPU domain and the CPU domain exchange information via a shared command descriptor located in the GPU video memory, the shared command descriptor including, but not limited to, the address of the NVMe command in the GPU video memory, the remote access key RKey of the GPU data buffer, and the address of the data buffer in the GPU video memory.
  8. The method of claim 1, wherein, when the GPU domain polls the completion queue CQ through the GPU computing core, the GPU computing core periodically checks whether the head pointer and the tail pointer of the completion queue CQ are equal; when they are detected to be unequal, it reads a completion event from the completion queue CQ, updates the head pointer of the completion queue CQ, and notifies the application program that the read operation is complete (see the first sketch following the claims).
  9. The NVMeoF remote storage method based on GPU direct command control according to claim 1, further comprising the step of performing a write operation to the remote target, specifically comprising: an application program of the GPU domain initiates a write request; a corresponding NVMe write command is generated in the GPU video memory by the GPU computing core and added to the submission queue SQ, the data pointer field of the NVMe write command comprising the GPU video memory address of the data to be written and the remote access key RKey, and the CPU domain is then notified through the doorbell; the CPU domain packages the NVMe write command in the submission queue SQ into an NVMe over RDMA command frame and submits an RDMA Send request to the first RDMA network card; the first RDMA network card packages the NVMe over RDMA command frame into a network data packet according to the RDMA Send request and sends it to the second RDMA network card; the second RDMA network card receives the network data packet, and the NVMe-oF target software parses the NVMe over RDMA command frame from the packet to obtain the corresponding NVMe write command and submits it to the local NVMe SSD; the NVMe-oF target software reads the data to be written from the GPU video memory of the initiating end using an RDMA Read operation according to the GPU video memory address and the remote access key RKey, writes the data to the local NVMe SSD and, after writing, sends a completion message to the CPU domain through the second RDMA network card and the first RDMA network card in turn; the CPU domain generates a completion event and adds it to the completion queue CQ in the GPU video memory; and the GPU domain polls the completion queue CQ through the GPU computing core, confirms that the write operation succeeded when the completion event is detected, and notifies the application program.
  10. An NVMeoF remote storage system based on GPU direct command control, comprising an initiating host system and a remote target, the initiating host system comprising a GPU domain, a CPU domain and a first RDMA network card, and the remote target comprising a local NVMe SSD, NVMe-oF target software and a second RDMA network card, the NVMeoF remote storage system being programmed or configured to perform the steps of the NVMeoF remote storage method based on GPU direct command control of any one of claims 1 to 9.
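
The queue manipulation in claims 1, 4 and 8 can be made concrete with a short CUDA sketch (first sketch). This is a minimal illustration under stated assumptions, not the patented implementation: the NvmeCmd layout keeps only the fields named in claim 4 (OPC, SLBA, NLB, DPTR), the doorbell is approximated by a flag in host-mapped memory that the CPU domain polls (claims 2 and 3 instead use an MMIO register with interrupt delivery), each kernel is assumed to be launched with a single thread, and all identifiers are hypothetical.

```
#include <cstdint>
#include <cuda_runtime.h>

// Simplified NVMe submission-queue entry keeping only the fields of claim 4.
struct NvmeCmd {
    uint8_t  opc;        // OPC: operation code (0x02 = read)
    uint64_t slba;       // SLBA: starting logical block address
    uint16_t nlb;        // NLB: number of logical blocks requested
    uint64_t dptr_addr;  // DPTR: data buffer address in GPU video memory
    uint32_t dptr_rkey;  // DPTR: remote access key (RKey) of that buffer
};

struct NvmeCpl { uint16_t status; };  // simplified completion-queue entry

// Claim 4: build the read command at the SQ tail, advance the tail,
// then ring the doorbell so the CPU domain picks the command up (claim 1).
__global__ void gpu_build_read_cmd(NvmeCmd *sq, volatile uint32_t *sq_tail,
                                   volatile uint32_t *doorbell,
                                   uint32_t sq_depth, uint64_t slba,
                                   uint16_t nlb, uint64_t buf_addr,
                                   uint32_t rkey)
{
    uint32_t tail = *sq_tail;
    sq[tail].opc       = 0x02;
    sq[tail].slba      = slba;
    sq[tail].nlb       = nlb;
    sq[tail].dptr_addr = buf_addr;
    sq[tail].dptr_rkey = rkey;
    __threadfence_system();            // publish the entry before ringing
    uint32_t new_tail = (tail + 1) % sq_depth;
    *sq_tail  = new_tail;
    *doorbell = new_tail;              // doorbell write observed by the CPU
}

// Claim 8: poll the CQ until head != tail, consume the completion event,
// advance the head, and report the result to the application.
__global__ void gpu_poll_cq(volatile NvmeCpl *cq, volatile uint32_t *cq_head,
                            volatile uint32_t *cq_tail, uint32_t cq_depth,
                            int *read_ok)
{
    while (*cq_head == *cq_tail) { }   // spin: no completion event yet
    uint16_t status = cq[*cq_head].status;
    *cq_head = (*cq_head + 1) % cq_depth;
    *read_ok = (status == 0);          // success iff status is zero
}
```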
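
The registration step of claim 6 maps naturally onto the standard ibverbs call ibv_reg_mr (second sketch). Registering cudaMalloc'd device memory with a NIC requires GPUDirect RDMA support in the driver stack (nv_peer_mem or dmabuf), which is assumed here; the helper name and error handling are illustrative only.

```
#include <cuda_runtime.h>
#include <infiniband/verbs.h>

// Claim 6: the CPU domain registers a GPU-resident data buffer with the
// RDMA driver and obtains the RKey that the GPU later places into DPTR.
struct ibv_mr *register_gpu_buffer(struct ibv_pd *pd, size_t len,
                                   void **gpu_buf_out)
{
    void *gpu_buf = nullptr;
    if (cudaMalloc(&gpu_buf, len) != cudaSuccess)
        return nullptr;

    // REMOTE_WRITE lets the target's RDMA WRITE land here (read path,
    // claim 5); REMOTE_READ serves the write path of claim 9.
    struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) {
        cudaFree(gpu_buf);
        return nullptr;
    }
    *gpu_buf_out = gpu_buf;
    return mr;   // mr->rkey is the remote access key RKey of claim 6
}
```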
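
On the target side, the RDMA WRITE of claim 5 is a one-sided verb addressed by the DPTR fields (third sketch). This is a minimal sketch assuming a connected queue pair `qp` and a registered staging buffer holding data just read from the SSD; the names are hypothetical, and a zero-copy target could instead register the SSD's own buffers.

```
#include <cstdint>
#include <infiniband/verbs.h>

// Target side, claim 5: push data just read from the local NVMe SSD into
// the initiator's GPU buffer with a one-sided RDMA WRITE addressed by the
// DPTR fields. For the write path of claim 9, the analogous call with
// opcode IBV_WR_RDMA_READ pulls the data from GPU video memory instead.
int rdma_write_to_gpu(struct ibv_qp *qp, struct ibv_mr *staging_mr,
                      void *staging, uint32_t len,
                      uint64_t gpu_buf_addr, uint32_t gpu_rkey)
{
    struct ibv_sge sge = {};
    sge.addr   = (uintptr_t)staging;   // data read from the SSD
    sge.length = len;
    sge.lkey   = staging_mr->lkey;

    struct ibv_send_wr wr = {}, *bad = nullptr;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;
    wr.wr.rdma.remote_addr = gpu_buf_addr;  // DPTR address (claim 4)
    wr.wr.rdma.rkey        = gpu_rkey;      // DPTR RKey (claim 4)

    // The data lands directly in the initiator's GPU video memory; the
    // initiator's CPU and system memory are bypassed entirely (claim 5).
    return ibv_post_send(qp, &wr, &bad);
}
```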

Description

NVMeoF remote storage system and method based on GPU direct command control

Technical Field

The invention relates to distributed computing technology, and in particular to an NVMeoF remote storage system and method based on GPU direct command control.

Background

In distributed computing scenarios based on NVMe over Fabrics (NVMeoF), such as AI training and high-performance computing, prior-art schemes cannot achieve both low-latency storage access and low CPU overhead. The conventional CPU-centric I/O path becomes a system performance bottleneck when the GPU needs to access remote storage devices frequently, with the following drawbacks. Data is copied many times: unnecessary memory copies on the I/O path consume precious memory bandwidth and PCIe bandwidth. CPU intervention is frequent: each I/O request requires multiple CPU interventions (command encapsulation, data handling, completion processing), resulting in high CPU utilization and difficulty supporting highly concurrent I/O. GPU computing resources sit idle waiting for I/O: because I/O latency is high, the GPU must often wait for data, leaving its powerful computing capacity unused and reducing the computing efficiency of the whole system. Existing GPU direct-storage techniques cannot be applied directly to networked storage scenarios, which limits the scalability and flexibility of high-performance computing clusters.

To address these problems, the prior art proposed the BaM (Big Accelerator Memory) architecture. A BaM system mainly comprises a GPU, a local NVMe SSD, a host CPU and memory; its core idea is to allocate and manage the NVMe submission queue (SQ) and completion queue (CQ) in GPU video memory. The BaM workflow (method steps) is as follows. Initialization: the CPU assists in creating the NVMe queues in GPU video memory and configures the NVMe SSD controller to recognize them. Command submission: the GPU computing core constructs an NVMe command directly in the SQ in video memory, then writes to the SSD's doorbell register over PCIe to notify the SSD of the new command. Data transfer: the NVMe SSD reads the command from the SQ directly over PCIe, then writes the data directly to the designated location in GPU video memory via DMA. Completion notification: the SSD writes a completion status to the CQ in GPU video memory, and the GPU learns of I/O completion by polling the CQ. The BaM method thus gives the GPU direct control over a local NVMe device, with data travelling from the SSD straight to GPU video memory, completely bypassing the CPU and system memory. However, because the BaM architecture is designed for a local PCIe environment, its mechanism of GPU-controlled NVMe queues cannot directly process NVMeoF commands, which must be encapsulated and decapsulated by a network protocol. If the BaM concept were forced into an NVMeoF scenario, the GPU would have to process the network protocol stack itself, which is prohibitively expensive and unrealistic with current technology, so the advantages of BaM cannot be extended to networked storage.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides an NVMeoF remote storage system and method based on GPU direct command control, which enable the GPU to access remote NVMe storage devices in an NVMeoF environment with near-local low latency and high bandwidth.
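
The key to this division of labor is the doorbell hand-off of claims 1 to 3: the GPU signals the initiator's CPU domain rather than the SSD, as BaM does. The sketch below shows a CPU thread waiting on such a doorbell; it is a minimal illustration in which the doorbell is approximated by a flag in host-mapped memory (claims 2 and 3 instead use an MMIO register that raises a CPU interrupt), and the function name is hypothetical.

```
#include <cstdint>

// CPU domain: wait for the GPU's doorbell, then fetch the new command
// from the submission queue SQ resident in GPU video memory. `doorbell`
// points at host-mapped memory written by the GPU kernel, a stand-in
// for the MMIO register plus CPU interrupt of claims 2 and 3.
uint32_t wait_for_doorbell(volatile uint32_t *doorbell, uint32_t last_seen)
{
    while (*doorbell == last_seen) {
        // busy-wait; claims 2-3 replace this polling with an interrupt
    }
    return *doorbell;  // the new SQ tail value the GPU just published
}
```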
In order to solve the technical problems, the invention adopts the following technical scheme: an NVMeoF remote storage method based on GPU direct command control, the method being applied to an NVMeoF remote storage system comprising an initiating host system and a remote target, the initiating host system comprising a GPU domain, a CPU domain and a first RDMA network card, and the remote target comprising a local NVMe SSD, NVMe-oF target software and a second RDMA network card, the method comprising the following steps: an application program of the GPU domain initiates a read request; a corresponding NVMe command is generated in the GPU video memory by a GPU computing core and added to a submission queue SQ, and the CPU domain is then notified through a doorbell; the CPU domain packages the NVMe command in the submission queue SQ into an NVMe over RDMA command frame and submits an RDMA Send request to the first RDMA network card; the first RDMA network card packages the NVMe over RDMA command frame into a network data packet according to the RDMA Send request and sends it to the second RDMA network card; the second RDMA network card receives the network data packet, and the NVMe-oF target software parses the NVMe over RDMA command frame from the packet to obtain the corresponding NVMe command, submits it to the local NVMe SSD and performs a read operation to obtain the data on the local NVMe SSD.
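
Once notified, the CPU domain's step of packaging the NVMe command into an NVMe over RDMA command frame and submitting an RDMA Send request maps onto the standard ibverbs send path. The following is a minimal host-side sketch, assuming `qp` is a queue pair already connected to the target and `capsule_mr` registers the capsule buffer; the capsule layout is schematic rather than the exact NVMe-oF capsule format, and the names are hypothetical.

```
#include <cstdint>
#include <cstring>
#include <infiniband/verbs.h>

// Schematic command capsule: a real NVMe over RDMA frame carries the
// 64-byte NVMe SQE whose DPTR holds the GPU buffer address and RKey.
struct CmdCapsule { uint8_t sqe[64]; };

// CPU domain: copy the NVMe command out of the GPU-resident SQ (readable
// here through a mapped pointer), frame it as a capsule, and post an
// RDMA Send so the first RDMA network card packetizes and transmits it.
int send_command_capsule(struct ibv_qp *qp, struct ibv_mr *capsule_mr,
                         CmdCapsule *capsule, const void *sq_entry)
{
    std::memcpy(capsule->sqe, sq_entry, sizeof capsule->sqe);  // encapsulate

    struct ibv_sge sge = {};
    sge.addr   = (uintptr_t)capsule;
    sge.length = sizeof *capsule;
    sge.lkey   = capsule_mr->lkey;

    struct ibv_send_wr wr = {}, *bad = nullptr;
    wr.opcode     = IBV_WR_SEND;          // RDMA Send of the command frame
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.send_flags = IBV_SEND_SIGNALED;

    return ibv_post_send(qp, &wr, &bad);
}
```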