CN-121984941-A - RDMA network card cache management method and device
Abstract
The invention provides an RDMA network card cache management method and device. The method comprises: obtaining the work queue elements currently to be processed and dividing each work queue element into a plurality of sub-queue elements; calculating the estimated processing interval time of the network card's next sub-queue element based on the actual processing interval time and the estimated processing interval time of the previous sub-queue element; for each processing queue of the network card, calculating the queue's processing speed based on the number of sub-queue elements in the queue, the total number of sub-queue elements across all queues of the network card, and the estimated processing interval time of the next sub-queue element; calculating the current processing interval time of each processing queue based on its processing speed; calculating the scheduling interval time of the next sub-queue element based on the current processing interval time of the processing queue; and scheduling the sub-queue elements accordingly. The scheme continuously updates the scheduling interval time of each queue and keeps operation efficiency high.
Inventors
- ZHANG JIAO
- HU YUXUAN
- LIAO DEXUAN
- HUANG XIANYU
- HUANG TAO
Assignees
- Beijing University of Posts and Telecommunications (北京邮电大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-17
Claims (10)
- 1. An RDMA network card cache management method, characterized in that it is applied to an RDMA network card and comprises the following steps: acquiring the work queue elements to be processed in the current cache, and dividing each work queue element into a plurality of sub-queue elements based on a preset maximum transmission unit; calculating the estimated processing interval time of the network card's next sub-queue element based on the actual processing interval time and the estimated processing interval time of the previous sub-queue element in the network card's processing history; for each processing queue of the network card, calculating the processing speed of that queue based on the number of sub-queue elements in the queue, the total number of sub-queue elements across all queues of the network card, and the estimated processing interval time of the network card's next sub-queue element; calculating the current processing interval time of each processing queue based on its processing speed; and calculating the scheduling interval time of the next sub-queue element of the corresponding processing queue based on the current processing interval time of each processing queue, and scheduling sub-queue elements based on that scheduling interval time.
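The first step of claim 1 amounts to ceiling-dividing each WQE's payload by the maximum transmission unit. A minimal sketch of that splitting step (the function name and the 4096-byte MTU are illustrative assumptions, not taken from the patent):

```python
def split_wqe(payload_len: int, mtu: int = 4096) -> list[int]:
    """Divide one work queue element (WQE) into MTU-sized sub-queue elements.

    Returns the byte length of each resulting sub-queue element; all but
    possibly the last are exactly one MTU long.
    """
    if payload_len <= 0:
        return []
    full, rest = divmod(payload_len, mtu)
    return [mtu] * full + ([rest] if rest else [])

# A 10 KiB WQE with a 4 KiB MTU yields three sub-queue elements
print(split_wqe(10240))  # -> [4096, 4096, 2048]
```

Equal-sized sub-queue elements let the later per-queue speed calculation count elements rather than bytes, which is why the claim normalizes WQEs of different sizes before scheduling.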
- 2. The RDMA network card cache management method according to claim 1, wherein, in the step of calculating the scheduling interval time of the next sub-queue element of the corresponding processing queue based on the current processing interval time of each processing queue, the network card receives immediate data from the network card interface in real time, and if the immediate data is received successfully, the scheduling interval time of the next sub-queue element of the corresponding processing queue is calculated based on the immediate data and the current processing interval time of each processing queue.
- 3. The RDMA network card cache management method according to claim 2, wherein, in the step of calculating the scheduling interval time of the next sub-queue element of the corresponding processing queue based on the current processing interval time of each processing queue, if the immediate data is not received successfully, the scheduling interval time of the next sub-queue element of the processing queue is calculated by the following formula [formula not reproduced in this text], whose terms represent, respectively: the scheduling interval time of the next sub-queue element of processing queue i; the current processing interval time of the local processing queue; the number of sub-queue elements in the processing queue; the total number of sub-queue elements across all queues of the network card; the estimated processing interval time of the network card's previous sub-queue element; and a preset smoothing factor parameter.
- 4. The RDMA network card cache management method according to claim 2, wherein, if the peer end is a transmitting end, the immediate data includes the transmission time interval of sub-queue elements in the processing queue transmitted by the peer transmitting end and the number of sub-queue elements in that transmitted processing queue; and if the peer end is a receiving end, the immediate data includes the reception time interval of sub-queue elements in the processing queue received by the peer receiving end and the number of sub-queue elements in that received processing queue.
- 5. The RDMA network card cache management method according to claim 4, wherein, in the step of calculating the scheduling interval time of the next sub-queue element of the corresponding processing queue based on the immediate data and the current processing interval time of each processing queue, if the peer end is a transmitting end, the scheduling interval time of the next sub-queue element of the corresponding processing queue is calculated by a first formula [not reproduced in this text], and if the peer end is a receiving end, it is calculated by a second formula [not reproduced in this text]. The terms of these formulas represent, respectively: the scheduling interval time of the next sub-queue element in the processing queue sent by the network card; the scheduling interval time of the next sub-queue element in the processing queue received by the network card; the time interval at which the peer end transmits sub-queue elements of the received processing queue; the time interval at which the peer end receives sub-queue elements of the sent processing queue; the current processing interval time of the local processing queue used for transmit processing; the current processing interval time of the local processing queue used for receive processing; the number of sub-queue elements in the processing queue sent by the peer; the number of sub-queue elements in the processing queue received by the peer; the number of sub-queue elements in the local processing queue used for receive processing; the number of sub-queue elements in the local processing queue used for transmit processing; and four preset calculation parameters.
- 6. The RDMA network card cache management method according to any one of claims 1 to 5, wherein, in the step of calculating the estimated processing interval time of the network card's next sub-queue element based on the actual processing interval time and the estimated processing interval time of the previous sub-queue element in the network card's processing history, the estimated processing interval time of the next sub-queue element is calculated by the following formula [not reproduced in this text], whose terms represent, respectively: the estimated processing interval time of the network card's next sub-queue element; the estimated processing interval time of the network card's previous sub-queue element; a preset smoothing factor parameter; and the actual processing interval time of the previous sub-queue element in the network card's processing history.
- 7. The RDMA network card cache management method according to claim 6, wherein, in the step of calculating, for each processing queue of the network card, the processing speed of that queue based on the number of sub-queue elements in the queue, the total number of sub-queue elements across all queues of the network card, and the estimated processing interval time of the network card's next sub-queue element, the processing speed of each processing queue is calculated by the following formula [not reproduced in this text], whose terms represent, respectively: the processing speed of the processing queue; the number of sub-queue elements in the processing queue; the total number of sub-queue elements across all queues of the network card; and the estimated processing interval time of the network card's next sub-queue element.
- 8. The RDMA network card cache management method according to claim 1, wherein, in the step of calculating the current processing interval time of each processing queue based on its processing speed, the reciprocal of the processing speed of each processing queue is taken as the current processing interval time of that queue.
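Claims 7 and 8 together suggest a per-queue rate proportional to the queue's share of pending sub-queue elements, with the scheduling interval taken as its reciprocal. A sketch under that assumed form (the claim-7 formula image is not reproduced in the source, so `n_i / (n_total * t_est)` is an illustrative reading, not the patented formula):

```python
def queue_speed(n_i: int, n_total: int, t_est: float) -> float:
    """Assumed per-queue processing speed: the queue's share of all pending
    sub-queue elements, divided by the NIC's estimated per-element interval."""
    if n_i == 0 or n_total == 0 or t_est <= 0:
        return 0.0
    return n_i / (n_total * t_est)

def current_interval(speed: float) -> float:
    """Claim 8: the current processing interval is the reciprocal of the speed."""
    return 1.0 / speed if speed > 0 else float("inf")

# A queue holding 2 of 8 pending elements, with an estimated 1.5 us per element:
v = queue_speed(2, 8, 1.5)        # 2 / (8 * 1.5) = 1/6 element per microsecond
t = current_interval(v)           # approximately 6 us between scheduled elements
```

Under this reading, busier queues get proportionally shorter scheduling intervals, so the NIC's estimated total capacity is shared across queues in proportion to their backlog.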
- 9. The RDMA network card cache management method according to claim 1, wherein, in the step of scheduling sub-queue elements based on the scheduling interval time of the next sub-queue element of a processing queue, a WQE issuing and scheduling module of the network card schedules the sub-queue elements according to the scheduling interval time.
- 10. An RDMA network card cache management apparatus, comprising a computer device, the computer device comprising a processor and a memory, the memory storing computer instructions and the processor being configured to execute the computer instructions stored in the memory, wherein the apparatus implements the steps of the method according to any one of claims 1 to 9 when the computer instructions are executed by the processor.
Description
RDMA network card cache management method and device

Technical Field

The invention relates to the technical field of network cards, and in particular to a method and a device for managing RDMA network card caches.

Background

In recent years, storage, high-performance computing, model training, inference and other services have developed rapidly, and these services place higher demands on data transmission, so the traditional TCP/IP protocol struggles to adapt to a rapidly evolving data center network. RDMA is widely applied in data center networks because of mechanisms such as protocol stack offloading, kernel bypass and zero copy, and it can provide efficient data transmission with high throughput, low latency and low CPU utilization. RDMA communication requires QP (Queue Pair) connections, and its efficiency relies on the network card caching the metadata associated with the data transfer. In RDMA communications, RC (Reliable Connection) mode is widely used in data center networks because it supports all Verbs and can provide reliability guarantees, but the one-to-one QP connections of RC mode also raise connection scalability issues. With the development of data center services, network scale keeps growing, and a single cluster may comprise tens of thousands of servers and matching network cards. The growth in network size also multiplies the connections that a single network card must establish, which can reach the order of hundreds of thousands of RC connections in storage networks with full-mesh connectivity, such as the Alibaba PanGu storage system, and in high-performance computing networks, such as Microsoft Azure.
In addition, for MoE (Mixture of Experts) models, because model training and inference require cooperation among multiple experts, a large number of all-to-all communication patterns exist, and the number of connections can reach the order of tens of thousands. A large number of QP connections brings connection scalability problems to the network card, mainly manifested as a dip in network card throughput and thus reduced processing efficiency.

Disclosure of the Invention

In view of this, embodiments of the present invention provide an RDMA network card cache management method to obviate or mitigate one or more disadvantages in the prior art. One aspect of the present invention provides an RDMA network card cache management method, applied to an RDMA network card and comprising the steps of: acquiring the work queue elements to be processed in the current cache, and dividing each work queue element into a plurality of sub-queue elements based on a preset maximum transmission unit; calculating the estimated processing interval time of the network card's next sub-queue element based on the actual processing interval time and the estimated processing interval time of the previous sub-queue element in the network card's processing history; for each processing queue of the network card, calculating the processing speed of that queue based on the number of sub-queue elements in the queue, the total number of sub-queue elements across all queues of the network card, and the estimated processing interval time of the network card's next sub-queue element; calculating the current processing interval time of each processing queue based on its processing speed; and calculating the scheduling interval time of the next sub-queue element of the corresponding processing queue based on the current processing interval time of each processing queue,
and scheduling the sub-queue elements based on the scheduling interval time of the next sub-queue element of the processing queue. With this scheme, since work queue elements may differ in size in actual operation, each work queue element (Work Queue Element, WQE) is first divided into a plurality of equally sized sub-queue elements; the estimated processing interval time of the network card's next sub-queue element is then determined based on the actual processing interval time and the estimated processing interval time of the previous sub-queue element; and finally the scheduling interval time of the next sub-queue element of each queue in the network card is determined from the estimated processing interval time of the network card's next sub-queue element. In some embodiments of the present invention, in the step of calculating the scheduling interval time of the next sub-queue element of the corresponding processing queue based on the current processing interval time of each processing queue, the network card receives the immediate data of the network card in