CN-122019402-A - Unified virtual memory system in multi-GPU environment

CN122019402A

Abstract

The invention provides a unified virtual memory system in a multi-GPU environment. Each GPU is provided with an RDMA buffer that stores remote data acquired through the RDMA mechanism, and all compute units on the GPU share this RDMA buffer. Each GPU is further provided with a reverse access counter that records the pattern in which remote GPUs access its data pages, so that a data page is migrated when a remote GPU's access pattern is inconsistent with that of the GPU currently accessing the page and reaches a preset threshold. Based on the globally shared RDMA buffer and the reverse access counter, the unified virtual memory system effectively improves cross-GPU access performance in a multi-GPU environment.
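
As a minimal illustration of the mechanism above, the sketch below models the reverse access counter table (the fields named in claim 8) and the migration-threshold check as host-side C++. All identifiers and the reset-on-new-accessor policy are assumptions, not the patent's hardware; claim 5 refines the choice of candidate accessor with Boyer-Moore voting.

    #include <cstdint>
    #include <unordered_map>

    // One entry of the reverse access counter table (claim 8): page
    // physical address identification, accessing GPU identification,
    // and access frequency counter.
    struct ReverseAccessEntry {
        uint64_t pagePhysAddr = 0;
        int      accessorGpuId = -1;   // -1: no accessor observed yet
        uint32_t accessCount = 0;
    };

    // Hypothetical software model of the reverse access counter: it
    // records remote accesses per page and reports when a page should
    // migrate to the remote GPU whose accesses reached the threshold.
    class ReverseAccessCounter {
    public:
        explicit ReverseAccessCounter(uint32_t threshold)
            : threshold_(threshold) {}

        // Returns true when the page should migrate to remoteGpu: the
        // remote pattern differs from the current accessor's and has
        // reached the preset threshold (abstract, claim 1).
        bool recordRemoteAccess(uint64_t pagePhysAddr, int remoteGpu) {
            ReverseAccessEntry& e = table_[pagePhysAddr];
            if (e.accessorGpuId != remoteGpu)       // pattern changed
                e = {pagePhysAddr, remoteGpu, 0};
            return ++e.accessCount >= threshold_;
        }

    private:
        uint32_t threshold_;  // preset migration threshold (assumed tunable)
        std::unordered_map<uint64_t, ReverseAccessEntry> table_;
    };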

Inventors

  • HUANG QIYUE
  • LIANG XIAOYAO
  • JING NAIFENG
  • SONG ZHUORAN
  • LIU ZIZHAO

Assignees

  • Shanghai Jiao Tong University (上海交通大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-24

Claims (10)

  1. A unified virtual memory system in a multi-GPU environment, characterized in that each GPU is provided with an RDMA buffer, the RDMA buffer is used to store remote data acquired through an RDMA mechanism, and all compute units on the GPU share the RDMA buffer; and each GPU is provided with a reverse access counter, the reverse access counter is used to record the pattern in which remote GPUs access a data page, so that the data page is migrated when a remote GPU's access pattern is inconsistent with that of the currently accessing GPU and reaches a preset threshold.
  2. The unified virtual memory system in a multi-GPU environment of claim 1, wherein the RDMA buffer tracks the ratio of cache lines that had been accessed only once when evicted to all evicted cache lines, and the RDMA buffer is determined to be in a streaming access mode when this ratio exceeds a ratio threshold (a detection sketch follows the claims).
  3. The unified virtual memory system in a multi-GPU environment of claim 2, wherein, when the RDMA buffer is in the streaming access mode, the RDMA buffer carries a streaming access flag bit when sending a data request to a remote GPU, the streaming access flag bit being used to prevent the remote GPU from migrating the data page to the GPU requesting the data.
  4. The unified virtual memory system in a multi-GPU environment of claim 1, wherein a write to a cache line of the RDMA buffer is synchronously propagated to the L2 cache of the GPU that owns the data page on a write hit; on a write miss, the L2 cache of the owning GPU is written directly, without allocating the line in the current GPU's RDMA buffer; and when an explicit synchronization instruction is executed, the L1 cache of the current GPU and the RDMA buffers of the remote GPUs cooperate to perform a cache flush operation (a write-policy sketch follows the claims).
  5. The unified virtual memory system in a multi-GPU environment of claim 1, wherein the reverse access counter dynamically identifies the access pattern of the remote GPUs accessing a data page using the Boyer-Moore voting algorithm (a voting sketch follows the claims).
  6. The unified virtual memory system in a multi-GPU environment of claim 1, wherein the reverse access counter comprises a reverse access counter table and a reverse access counter controller, the reverse access counter controller is configured to monitor access requests to the current GPU's L2 cache and to determine whether each access request is an RDMA access request from a remote GPU or a local L1 cache access request, and the reverse access counter table is configured to record the pattern in which remote GPUs access a data page.
  7. The unified virtual memory system of claim 6, wherein, for RDMA access requests from remote GPUs, the physical address of the access request and the source GPU are collected, and for local L1 cache access requests, only the physical address of the access request is collected (a classification sketch follows the claims).
  8. The unified virtual memory system in a multi-GPU environment of claim 6, wherein the reverse access counter table comprises a page physical address identification, an accessing GPU identification, and an access frequency counter.
  9. The unified virtual memory system in a multi-GPU environment of claim 1, wherein, for a shared data page, a read-only sharing mode or a producer-consumer mode is employed between multiple GPUs, and accesses by the producer and the consumer do not increment the access frequency counter of the shared data page.
  10. The unified virtual memory system in a multi-GPU environment of claim 1, wherein the remote GPU does not generate RDMA requests to the GPU that owns the data page when the remote GPU has RDMA buffering enabled and the access hits in the buffer.
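
As referenced in claim 2, the streaming access mode detection can be sketched as a small eviction-time counter. This is a hypothetical host-side model, not the patent's hardware; the observation window, the threshold value, and all identifiers are assumptions.

    #include <cstdint>

    // Over a window of evictions, count how many evicted cache lines
    // were accessed exactly once, and compare the ratio against the
    // ratio threshold of claim 2.
    class StreamingDetector {
    public:
        StreamingDetector(double ratioThreshold, uint32_t window)
            : threshold_(ratioThreshold), window_(window) {}

        // Called whenever the RDMA buffer evicts a cache line.
        void onEviction(uint32_t timesAccessed) {
            ++evicted_;
            if (timesAccessed == 1) ++accessedOnce_;
            if (evicted_ == window_) {  // end of observation window
                streaming_ =
                    static_cast<double>(accessedOnce_) / evicted_ > threshold_;
                evicted_ = accessedOnce_ = 0;  // start a new window
            }
        }

        // When true, outgoing data requests carry the streaming access
        // flag bit so the remote GPU does not migrate the page (claim 3).
        bool streamingFlag() const { return streaming_; }

    private:
        double   threshold_;
        uint32_t window_;
        uint32_t evicted_ = 0;
        uint32_t accessedOnce_ = 0;
        bool     streaming_ = false;
    };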
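
The write policy of claim 4 distinguishes write hits from write misses in the RDMA buffer. Below is a minimal sketch; the RdmaBuffer and L2Cache types and their map-based storage are stand-ins invented for illustration.

    #include <cstdint>
    #include <unordered_map>

    // Simplified stand-ins for the page owner's L2 cache and the
    // local RDMA buffer; both reduce to address -> data maps.
    struct L2Cache {
        std::unordered_map<uint64_t, uint64_t> lines;
        void write(uint64_t addr, uint64_t data) { lines[addr] = data; }
    };

    struct RdmaBuffer {
        std::unordered_map<uint64_t, uint64_t> lines;
        bool contains(uint64_t addr) const { return lines.count(addr) != 0; }
        void update(uint64_t addr, uint64_t data) { lines[addr] = data; }
        void flushAll() { lines.clear(); }  // cache flush operation
    };

    // Claim 4: on a write hit, the local RDMA copy is kept fresh and the
    // write is synchronously propagated to the owner GPU's L2; on a
    // write miss, the owner's L2 is written directly and no line is
    // allocated in the current GPU's RDMA buffer.
    void writeRemote(RdmaBuffer& rdma, L2Cache& ownerL2,
                     uint64_t addr, uint64_t data) {
        if (rdma.contains(addr))
            rdma.update(addr, data);  // write hit only
        ownerL2.write(addr, data);    // owner's L2 updated in both cases
    }

    // On an explicit synchronization instruction, the current GPU's L1
    // and the remote GPUs' RDMA buffers cooperate to flush (claim 4).
    void onSynchronization(RdmaBuffer& remoteRdma) {
        remoteRdma.flushAll();
    }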
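
Claim 5 names the Boyer-Moore voting algorithm for identifying which remote GPU dominates accesses to a data page. The sketch below is the textbook majority-vote algorithm applied per page; the struct layout and field names are assumptions.

    #include <cstdint>

    // Per-page Boyer-Moore majority vote: one candidate GPU and one
    // vote counter per data page. A remote GPU that accounts for a
    // majority of the page's accesses survives as the candidate.
    struct PageVote {
        int      candidateGpu = -1;  // current majority candidate
        uint32_t votes = 0;          // Boyer-Moore vote counter

        void observeAccess(int gpuId) {
            if (votes == 0) {              // adopt a new candidate
                candidateGpu = gpuId;
                votes = 1;
            } else if (gpuId == candidateGpu) {
                ++votes;                   // same candidate: reinforce
            } else {
                --votes;                   // different GPU: cancel one vote
            }
        }
    };

The appeal of this algorithm for hardware is that a single (candidate, count) pair per page suffices, with no per-GPU counters, which keeps the reverse access counter table small.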
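
Claims 6 and 7 have the reverse access counter controller classify L2 access requests and collect different fields per source. A hypothetical classification step, with assumed request fields:

    #include <cstdint>
    #include <optional>

    // Hypothetical L2 access request; field names are assumptions.
    struct L2Request {
        uint64_t physAddr;             // physical address of the access
        std::optional<int> sourceGpu;  // set only for remote RDMA requests
    };

    struct Sample {
        uint64_t physAddr;
        int sourceGpu;  // -1 marks a local L1 cache access
    };

    // Claim 7: for an RDMA access request of a remote GPU, collect the
    // physical address and the source GPU; for a local L1 cache access
    // request, collect only the physical address.
    Sample classify(const L2Request& req) {
        if (req.sourceGpu.has_value())
            return {req.physAddr, *req.sourceGpu};  // remote RDMA access
        return {req.physAddr, -1};                  // local L1 access
    }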

Description

Unified virtual memory system in multi-GPU environment

Technical Field

The invention relates to the technical field of virtual memory, and in particular to a unified virtual memory system in a multi-GPU environment.

Background

The graphics processing unit (GPU) was first used for rendering computer graphics. Because of its massively parallel computing capability, it has in recent years been widely used in scientific computing, machine learning, signal processing, and other fields. With the recent rise of large language models (LLMs), industry demand for computing has grown significantly. The parameter count of a large language model can reach the order of billions, and LLM training and inference depend heavily on the compute of AI chips such as GPUs to capture input semantics and infer output results. During inference of a large language model, GPU video memory must store the model's weights and activation values; during training, the gradient values produced by back-propagation through the neural network, the optimizer's parameters, and so on must also be stored, so the memory demand can be several times that of inference. As one of the most dominant AI chips, the GPU concentrates many compute cores for massively parallel computing efficiency and strong scheduling capability, and employs high-capacity, high-bandwidth video memory, so that LLM training and inference can be performed efficiently.

With the continuous expansion of upper-layer applications, GPU architecture has also evolved. In the traditional single-GPU era, GPUs were mainly used for graphics rendering and media encoding/decoding in consumer products, not for general-purpose computing tasks. Engineers noticed the efficient parallel computing capability of graphics GPUs and proposed GPU-based hardware architectures for general-purpose computing, such as scientific computing, machine learning, and deep learning. To accommodate this demand, chip manufacturers also introduced their own general-purpose computing software stacks (e.g., NVIDIA's CUDA, AMD's HIP), and greatly enhanced general-purpose computing capabilities in new GPU hardware architecture designs, for example by introducing application-specific integrated circuits (ASICs) for matrix multiplication. GPUs used as general-purpose computing chips are sometimes also called GPGPUs (general-purpose GPUs) to distinguish them from earlier graphics-focused GPU products. These variations all concern a single GPU chip architecture.

In recent years, with the development of large language models, scaling laws have shown that increases in the number of parameters cause qualitative changes in a model's reasoning capability. The expansion of model parameters also presents new challenges to the memory capacity of modern GPUs. Flagship high-performance GPUs on the market provide only about 100 GB of video memory, and the weight data of some large language models exceeds this value, so single-GPU inference becomes a challenge, and the corresponding model training is not feasible at all.
In addition, transistor scaling is becoming increasingly difficult, making it hard to increase the number of cores on a single GPU chip. People have therefore begun to adopt multi-GPU architectures, in which multiple connected GPUs complete computing tasks cooperatively. Moreover, even for a small deep neural network model, efficient training and inference can benefit from a multi-GPU architecture that partitions the model along different dimensions, such as different batches of data, different layers of the neural network, or different dimensions of tensors, yielding performance improvements. A multi-GPU architecture requires a communication library framework and efficient inter-card communication hardware, as well as multi-card management capabilities in upper-layer applications, GPU drivers, and the runtime. A multi-GPU architecture based on PCIe connections is shown in FIG. 1. FIG. 2 illustrates a multi-GPU system interconnect architecture with NVLINK connections, in which PCIe connections are retained as an adjunct.

Virtual memory is a memory management mechanism implemented jointly by the operating system and hardware. When virtual memory is used to manage the CPU's main memory, it provides each process with a logical address space isolated from the others; the size of this logical address space is not limited by the physical memory actually installed in the computer system, and continuous data in the virtual addre