CN-122019311-A - Method and device for detecting GPU (graphics processing unit)-side video memory contention based on the CPU (central processing unit) side
Abstract
The invention discloses a method and a device for detecting GPU-side video memory contention from the CPU side. The method comprises: while a GPU application is running, implanting callback functions at the API (application programming interface) entries of the GPU runtime library it calls, so as to dynamically capture the context information of each GPU call the application initiates; constructing, from that context information, a CPU simulation execution environment equivalent to the GPU-side parallel semantics; starting the thread data race detector TSAN (ThreadSanitizer) in that environment and monitoring the concurrent accesses of multiple CPU threads to the same mapped memory address; and, when the monitoring result shows that TSAN has detected a data race among those accesses, mapping the race information associated with the race event back to the context information, thereby completing GPU-side video memory contention detection. The invention solves the technical problem in related-art GPU parallel-computing scenarios that, when multiple kernel functions (kernels) access the same video memory address simultaneously through different streams and cause a data race, the race cannot be effectively located and diagnosed, making the program unstable.
Inventors
- SHI YAO
- XU KAIQIANG
- ZHENG WENCHAO
- LI YANGGUANG
Assignees
- 深圳小马易行科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260415
Claims (10)
- 1. A method for detecting GPU-side video memory contention based on the CPU side, characterized by comprising the following steps: while a GPU application is running, implanting a callback function at each API (application programming interface) entry of the GPU runtime library called by the application, so as to dynamically capture the context information of each GPU call the application initiates, wherein the context information comprises the accessed GPU video memory address, the access type, the owning GPU stream identifier, the associated GPU event identifier, and the API name or custom operator name that triggered the call; constructing, on the CPU side and according to the context information, a CPU simulation execution environment fully equivalent to the GPU-side parallel semantics of the GPU application; starting the thread data race detection tool TSAN in the CPU simulation execution environment and monitoring the concurrent accesses of multiple CPU threads to the same mapped memory address to obtain a monitoring result; and, when the monitoring result shows that the thread data race detection tool TSAN has detected a data race among the concurrent accesses of the multiple CPU threads to the same mapped memory address, mapping the race information associated with the corresponding race event back to the context information, thereby completing the GPU-side video memory contention detection.
- 2. The method for detecting GPU-side video memory contention based on the CPU side according to claim 1, wherein constructing the CPU simulation execution environment fully equivalent to the GPU-side parallel semantics of the GPU application on the CPU side according to the context information comprises the following steps: allocating, for each unique GPU video memory address, a unique CPU memory address with a consistent lifetime, and establishing a one-to-one mapping relation; mapping the serial operation sequence contained in the GPU stream corresponding to each GPU stream identifier onto an independent CPU thread for sequential execution; for each mapped read operation, performing a read on the mapped memory address by the corresponding CPU thread; and for each mapped write operation, performing a write on the mapped memory address by the corresponding CPU thread.
- 3. The method for detecting GPU-side video memory contention based on the CPU side according to claim 1, wherein constructing the CPU simulation execution environment fully equivalent to the GPU-side parallel semantics of the GPU application on the CPU side according to the context information comprises the following step: when two GPU streams establish a synchronization relationship through an event, inserting a memory barrier or a thread synchronization primitive between the two corresponding CPU threads, so as to reproduce the GPU-side happens-before constraint relationship.
- 4. The method for detecting GPU-side video memory contention based on the CPU side according to claim 2, wherein the mapping relation is maintained in a global hash table, whose keys are GPU video memory pointers and whose values are memory pointers dynamically allocated by a CPU heap memory allocator, and wherein, when a release operation on the GPU-side video memory is detected, the corresponding CPU memory is automatically freed and the mapping entry is cleared from the global hash table.
- 5. The method for detecting GPU-side video memory contention based on the CPU side according to claim 1, further comprising: recording the sequence of video memory addresses accessed by each GPU stream in execution order, the sequence being used to analyze operation dependencies; recording the GPU streams currently accessing each video memory address concurrently, so as to identify cross-stream race sources; recording the set of GPU streams bound to each event, so as to track synchronization points; and recording a video memory access snapshot of each associated stream at each event trigger time, so as to restore the event-driven synchronization state.
- 6. The method for detecting GPU-side video memory contention based on the CPU side according to claim 1, further comprising: when the thread data race detection tool TSAN detects that the concurrent accesses of the multiple CPU threads to the same mapped memory address contain a read-write or write-write conflict, determining that those concurrent accesses constitute a data race.
- 7. The method for detecting GPU-side video memory contention based on the CPU side according to any one of claims 1 to 6, wherein mapping the race information associated with the race event corresponding to the data race back to the context information to complete the GPU-side video memory contention detection comprises: locating, by analyzing the call chain of the callback function, the GPU API or operator call point that triggered the run in the CPU simulation execution environment, and recovering the source-code position with the aid of a symbol table or debugging information, so as to map the race information back to the context information and thereby complete the GPU-side video memory contention detection.
- 8. A device for detecting GPU-side video memory contention based on the CPU side, characterized by comprising: a calling unit configured to dynamically capture, by implanting a callback function at each API (application programming interface) entry of the GPU runtime library called by a GPU application while the application is running, the context information of each GPU call the application initiates, wherein the context information comprises the accessed GPU video memory address, the access type, the owning GPU stream identifier, the associated GPU event identifier, and the API name or custom operator name that triggered the call; an environment construction unit configured to construct, on the CPU side and according to the context information, a CPU simulation execution environment fully equivalent to the GPU-side parallel semantics of the GPU application; a monitoring unit configured to start the thread data race detection tool TSAN in the CPU simulation execution environment and to monitor the concurrent accesses of multiple CPU threads to the same mapped memory address to obtain a monitoring result; and a mapping unit configured to map, when the monitoring result shows that the thread data race detection tool TSAN has detected a data race among the concurrent accesses of the multiple CPU threads to the same mapped memory address, the race information associated with the corresponding race event back to the context information, thereby completing the GPU-side video memory contention detection.
- 9. A GPU, wherein the GPU applies the method for detecting GPU-side video memory contention based on the CPU side according to any one of claims 1 to 7.
- 10. A computer program product comprising computer instructions which, when executed by a processor, perform the method for detecting GPU-side video memory contention based on the CPU side according to any one of claims 1 to 7.
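The pipeline of claims 1, 2 and 4 can be illustrated with a small CPU-side sketch: a global hash table maps each GPU video memory address to a heap-allocated CPU shadow address, and each GPU stream's serial operation sequence is replayed on its own CPU thread, where a tool such as TSAN can then observe conflicting accesses. This is an illustrative sketch only; all class, struct and function names are our own and do not appear in the claims.

```cpp
#include <cstdint>
#include <mutex>
#include <thread>
#include <unordered_map>
#include <vector>

// Illustrative shadow-memory table: GPU pointer -> CPU shadow byte (claim 4).
class ShadowTable {
    std::unordered_map<uint64_t, char*> map_;
    std::mutex mu_;
public:
    ~ShadowTable() { for (auto& kv : map_) delete kv.second; }
    // Allocate (or reuse) the unique CPU shadow address for a GPU address.
    char* shadow(uint64_t gpu_addr) {
        std::lock_guard<std::mutex> lk(mu_);
        auto it = map_.find(gpu_addr);
        if (it == map_.end())
            it = map_.emplace(gpu_addr, new char(0)).first;
        return it->second;
    }
    // On a GPU-side release, free the shadow memory and clear the entry.
    void release(uint64_t gpu_addr) {
        std::lock_guard<std::mutex> lk(mu_);
        auto it = map_.find(gpu_addr);
        if (it != map_.end()) { delete it->second; map_.erase(it); }
    }
};

// One captured GPU operation (a simplified context record, claim 1).
struct Op { uint64_t gpu_addr; bool is_write; };

// Replay each stream's serial op sequence on its own CPU thread (claim 2).
// Built with -fsanitize=thread, conflicting accesses to the same shadow
// byte would be reported by TSAN as data races.
void replay(ShadowTable& tbl, const std::vector<std::vector<Op>>& streams) {
    std::vector<std::thread> threads;
    for (size_t i = 0; i < streams.size(); ++i) {
        const std::vector<Op>* ops = &streams[i];
        threads.emplace_back([&tbl, ops] {
            for (const Op& op : *ops) {
                char* p = tbl.shadow(op.gpu_addr);
                if (op.is_write) *p = 1;       // mapped write operation
                else             (void)*p;     // mapped read operation
            }
        });
    }
    for (auto& t : threads) t.join();
}
```

Two streams touching disjoint addresses replay cleanly; having them touch the same address without a synchronization edge is exactly the case claim 6 classifies as a data race.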
Description
Method and device for detecting GPU (graphics processing unit)-side video memory contention based on the CPU (central processing unit) side
Technical Field
The invention relates to the technical field of computer parallel computing and program debugging, and in particular to a method and a device for detecting GPU-side video memory contention based on the CPU side.
Background
In modern high-performance computing and artificial-intelligence training and inference systems, GPUs are widely used in deep learning frameworks (e.g., TensorFlow, PyTorch) and scientific computing programs due to their powerful parallel computing capabilities. GPU programs typically execute multiple kernels concurrently through multiple streams and rely on events or synchronization functions (e.g., cudaStreamSynchronize) to coordinate video memory accesses. In complex application scenarios, however, improper use of the synchronization mechanism (e.g., misuse of an event, a missing synchronization point, or incorrect synchronization of memory shared across streams) easily causes multiple threads or kernels to access the same video memory address concurrently, producing a video memory data race (memory race condition). A GPU-side data race manifests as two or more concurrently executing kernels performing read-write or write-write operations on the same device video memory address at the same time, without a correct happens-before relation having been established through any synchronization primitive (such as cudaEventSynchronize or cudaStreamWaitEvent).
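The happens-before edge that an event establishes between two GPU streams is what claim 3 reproduces on the CPU side with a thread synchronization primitive. A minimal sketch, using a condition variable as a stand-in for the GPU event (the class and function names are illustrative, not from the disclosure):

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

// CPU-side stand-in for a GPU event: record() and wait() reproduce the
// happens-before edge that cudaEventRecord / cudaStreamWaitEvent create.
class SimEvent {
    std::mutex mu_;
    std::condition_variable cv_;
    bool recorded_ = false;
public:
    void record() {                      // maps cudaEventRecord
        { std::lock_guard<std::mutex> lk(mu_); recorded_ = true; }
        cv_.notify_all();
    }
    void wait() {                        // maps cudaStreamWaitEvent
        std::unique_lock<std::mutex> lk(mu_);
        cv_.wait(lk, [this] { return recorded_; });
    }
};

// Two "streams" touching the same shadow location: with the event the
// accesses are ordered (no race); remove the record()/wait() pair and a
// TSAN-instrumented build would report a write-write/read-write race.
int ordered_demo() {
    int shadow = 0;
    SimEvent ev;
    std::thread producer([&] { shadow = 42; ev.record(); });
    std::thread consumer([&] { ev.wait(); shadow += 1; });
    producer.join();
    consumer.join();
    return shadow;   // the wait() ordered the two accesses
}
```

The same ordering could also be expressed with a memory barrier, which is the alternative the claim names.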
Such problems have the following notable features: 1) data races usually manifest as intermittent errors that are difficult to reproduce, and in large-scale training tasks may be triggered only under specific inputs or loads; 2) debugging is difficult, because existing GPU debugging tools (such as NVIDIA Nsight Compute and Nsight Systems) focus mainly on performance analysis and hardware state monitoring and lack fine-grained modeling of video memory access ordering and concurrency semantics; 3) closed-source limitations apply: GPU drivers and runtime libraries (such as the CUDA Runtime) are mostly closed source, so monitoring logic cannot be inserted directly and fine-grained access tracking is difficult to realize at the kernel level; 4) dedicated tools are lacking: no mature, general, integrable tool can automatically detect video memory data races directly at GPU run time, and in particular none can relate a race to a source-code call stack and a specific API/operator call point.
In the prior art, some studies have attempted to track video memory access paths through static analysis (such as data-flow analysis) or run-time instrumentation (such as CUDA-Trace), but these approaches have the following drawbacks: 1) dynamic concurrency behavior cannot be captured, since static analysis struggles with dynamic branches, conditional control flow, and memory addresses determined at run time; 2) real concurrency semantics are not simulated, since most tools only record access addresses and do not establish a semantically equivalent mapping between the GPU stream/event model and a CPU thread concurrency model; 3) mainstream detection frameworks are not leveraged, since data race detection cannot be performed with a mature, high-precision, low-false-positive tool from the CPU field (such as ThreadSanitizer, TSAN), which limits detection capability. In view of the above problems, no effective solution has been proposed so far.
Disclosure of Invention
The embodiments of the invention provide a method and a device for detecting GPU-side video memory contention based on the CPU side, which at least solve the technical problem in related-art GPU parallel-computing scenarios that, when multiple kernel functions (kernels) access the same video memory address through different streams and cause a data race, the race cannot be effectively located and diagnosed, making the program unstable.
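The context capture that the disclosure places at the runtime-library API entries can be sketched as follows. In a real deployment the callback would typically be implanted by symbol interposition (e.g. an LD_PRELOAD interposer on Linux) and would forward to the genuine runtime function; the record fields below mirror those the disclosure lists, but the struct, function names and field names are our own assumptions.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Illustrative context record captured per intercepted GPU call.
struct CallContext {
    uint64_t    gpu_addr;   // accessed GPU video memory address
    bool        is_write;   // access type
    int         stream_id;  // owning GPU stream identifier
    int         event_id;   // associated GPU event identifier (-1 if none)
    std::string api_name;   // API name or custom operator name triggering the call
};

// Global trace filled by the implanted callback.
static std::vector<CallContext> g_trace;

// The implanted callback: in a real interposer this runs at the
// runtime-library API entry and then forwards to the real function.
// Here it only records the context information.
void on_api_entry(const CallContext& ctx) { g_trace.push_back(ctx); }

// Toy wrapper standing in for an intercepted async-copy entry point
// (hypothetical name; not a real CUDA symbol).
void intercepted_copy(uint64_t dst, int stream_id) {
    on_api_entry({dst, /*is_write=*/true, stream_id, -1, "memcpy_async"});
    // ... forward to the genuine runtime call here ...
}
```

Each record later drives the simulated replay, and claim 7's reverse mapping walks back from a TSAN report to the `api_name` and call site stored here.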
According to one aspect of the embodiments of the invention, a method for detecting GPU-side video memory contention based on the CPU side is provided, comprising: implanting callback functions at the API entries of the GPU runtime library called by a GPU application while the application is running, so as to dynamically capture the context information of each GPU call the application initiates, wherein the context information comprises the accessed GPU video memory addresses, access types, owning GPU stream identifiers, associated GPU event identifiers, and the API names or custom operator names triggering the calls, constr