CN-121996604-A - Cross-SM data sharing method for a GPGPU (general-purpose graphics processing unit) in a multi-Kernel scenario

CN 121996604 A

Abstract

The invention provides a cross-SM data sharing method for a GPGPU (general-purpose graphics processing unit) in a multi-Kernel scenario, belonging to the technical field of GPGPU parallel computing and cache optimization. The method comprises the following steps: an access counter and a resident identifier are added to each L1D Cache line of every streaming multiprocessor (SM) in the GPGPU; the access counter counts how often the L1D Cache line data is accessed within a set period, so that the frequently accessed data in each SM is precisely screened out as high-frequency shared data; the resident identifier pins the high-frequency shared data in place, preventing it from being evicted by the conventional cache replacement policy, while the unique identifier of the data and the SM to which it belongs are recorded. By precisely locating and pinning high-frequency shared data, the invention lays the foundation for subsequently building an efficient cross-SM data sharing link, effectively shortens the data access path, reduces wasted cache resources, and markedly improves overall data processing efficiency and throughput in multi-Kernel concurrent scenarios.
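
As a rough illustration of the per-line metadata the abstract describes, the following C++ sketch models an L1D Cache line extended with an access counter and a resident identifier. All names (`CacheLineMeta`, `onAccess`, `highFreqThreshold`) and the field widths are hypothetical assumptions; the patent specifies the hardware structures, not an implementation.

```cpp
#include <cstdint>

// Hypothetical metadata attached to each L1D Cache line, per the abstract:
// an access counter for the sampling period, a resident flag that shields
// the line from normal replacement, a unique data identifier, and the
// owning SM's number. Field widths are illustrative assumptions.
struct CacheLineMeta {
    uint32_t accessCount = 0;     // accesses within the current sampling period
    bool     resident    = false; // true: line is pinned, replacement skips it
    uint64_t dataId      = 0;     // unique identifier of the cached data
    uint16_t ownerSm     = 0;     // SM to which this line belongs
};

// Called on each hit; marks the line as high-frequency shared data once its
// access count in the period reaches a (hypothetical) threshold.
inline void onAccess(CacheLineMeta& m, uint32_t highFreqThreshold) {
    if (++m.accessCount >= highFreqThreshold) {
        m.resident = true;  // pin: conventional replacement must skip this line
    }
}
```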

Inventors

  • ZHANG JUN
  • LIU LISEN

Assignees

  • East China University of Technology (东华理工大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-29

Claims (10)

  1. A cross-SM data sharing method for a GPGPU in a multi-Kernel scenario, characterized by comprising the following steps: in a multi-Kernel scenario, an access counter and a resident identifier are added to the L1D Cache line data of the first-level data cache of each streaming multiprocessor (SM) in the GPGPU; the access counter counts the access frequency of the L1D Cache line data within a set period, and the L1D Cache line data accessed with high frequency in each streaming multiprocessor is screened out based on that frequency and recorded as high-frequency shared data; a cross-SM data sharing table is designed for the target GPGPU, the table comprising a unique identifier for each item of high-frequency shared data, the number of cross-unit accesses to that data, and a valid bit, the valid bit marking the availability of the corresponding high-frequency shared data; when operation data is missing in some SM, a global mapping table is queried to determine whether the data resides in another SM, and if the target operation data is found and its valid bit is marked as available, the target operation data is obtained directly from the corresponding SM, thereby realizing data sharing (a C++ sketch of these structures and the lookup flow follows the claims list).
  2. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 1, wherein, when operation data is missing in a certain SM, the method further comprises: if the target operation data is not found, acquiring it from a lower-level storage unit through the conventional data access path; when the cross-SM data sharing table reaches its storage upper limit, setting the resident identifier to the unlocked state and deleting the high-frequency shared data with the fewest cross-unit accesses; and when Kernel execution on the target general-purpose graphics processor finishes, clearing the cross-SM data sharing table.
  3. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 1, wherein the access counter is configured in the L1D Cache line as a double-period access counter divided into a first-half access counter FC and a second-half access counter BC; FC counts the access frequency of the L1D Cache line data in the first half of a set sampling period, BC counts the access frequency in the second half of the same sampling period, and the corresponding L1D Cache line data is screened as high-frequency shared data only when the counts of both FC and BC reach a preset high-frequency threshold.
  4. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 1, wherein querying, when operation data is missing in a certain SM, whether the data resides in another SM through the global mapping table, and obtaining the target operation data directly from the corresponding SM when it is found with its valid bit marked available, specifically comprises: when operation data is missing in a certain SM, marking that SM as the data request unit; the data request unit locates the SM holding the target missing data through the global mapping table and sends a data access request to that target SM; the target SM checks the resident state and valid bit of the high-frequency shared data in its own L1D Cache, and if the state is legal, reads the data directly from its own L1D Cache and transmits it to the request unit, while the cross-unit access count of that data in the cross-SM data sharing table is incremented.
  5. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 1, wherein the valid bit of the cross-SM data sharing table uses a binary encoding with the following rules: when the valid bit is 1, the high-frequency shared data is available and the corresponding L1D Cache line is in the resident state; when the valid bit is 0, the high-frequency shared data is unavailable and the corresponding L1D Cache line is in the unlocked state; the valid bit state and the resident identifier state are linked in real time, and state switches are updated synchronously.
  6. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 1, wherein the global mapping table specifically comprises: a Kernel identification field recording the numbers of the Kernels executed in parallel by the general-purpose graphics processor; a streaming multiprocessor (SM) identification field associating the SM numbers corresponding to those Kernels; and a storage address field recording the physical address, in on-chip storage, of each SM's cross-SM data sharing table; the Kernel identification field allows quick indexing to the corresponding SM and its cross-SM data sharing table, shortening query time.
  7. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 2, wherein setting the resident identifier to the unlocked state and deleting the high-frequency shared data with the fewest cross-unit accesses when the cross-SM data sharing table reaches its storage upper limit specifically comprises: traversing the current SM's data items in the cross-SM data sharing table and selecting the one or more items with the fewest cross-SM accesses; checking whether the L1D Cache line data corresponding to the selected items is in an unaccessed state; and, for the unaccessed data, first setting the resident identifier to the unlocked state and the valid bit to 0, then deleting the items from the cross-SM data sharing table while releasing the corresponding L1D Cache storage space.
  8. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 2, wherein clearing the cross-SM data sharing table when Kernel execution on the target general-purpose graphics processor finishes specifically comprises: emptying all entries of the current SM in the cross-SM data sharing table; resetting the storage pointer of the cross-SM data sharing table; setting the resident identifiers of all L1D Cache lines in which that SM's data resides to the unlocked state; and uniformly updating the valid bits of all associated high-frequency shared data to 0, ensuring that no invalid residency remains.
  9. The cross-SM data sharing method for a GPGPU in a multi-Kernel scenario according to claim 1, wherein the first-level data cache (L1D Cache) line data further comprises a line address field storing the L1D Cache physical address corresponding to the high-frequency shared data; when data needs to be unlocked or deleted, the target L1D Cache line can be located directly through this address field, and when SM data is accessed, the target SM can quickly read the data in its L1D Cache through the address without traversing the cache.
  10. A system implementing the cross-SM data sharing method for a GPGPU in a multi-Kernel scenario of claim 1, comprising: an L1D Cache module, which, in a multi-Kernel scenario, adds an access counter and a resident identifier to the L1D Cache line data of the first-level data cache of each streaming multiprocessor (SM) in the GPGPU (general-purpose graphics processing unit), counts the access frequency of the L1D Cache line data within a set period through the access counter, and screens out the L1D Cache line data accessed with high frequency in each streaming multiprocessor based on that frequency, recording it as high-frequency shared data; a shared tag table module, which designs a cross-SM data sharing table for the target GPGPU, the table comprising a unique identifier for each item of high-frequency shared data, the number of cross-unit accesses to that data, and a valid bit marking the availability of the corresponding high-frequency shared data; and a data sharing module, which, when operation data is missing in some SM, queries through the global mapping table whether the data resides in another SM and, if the target operation data is found with its valid bit marked available, obtains it directly from the corresponding SM to realize data sharing.
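
Claims 1, 4, and 7 together outline the cross-SM data sharing table and the lookup and eviction flow around it. The following C++ sketch is a minimal software model of that flow under stated assumptions: the sharing tables and the global mapping are modeled as plain containers, and all names (`SharingEntry`, `findOwnerSm`, `onMiss`, `fetchFromLowerLevel`, `evictLeastShared`) are hypothetical, since the patent defines hardware structures, not an API.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// One entry of the cross-SM data sharing table (claim 1): unique data
// identifier, cross-unit access count, and a valid bit.
struct SharingEntry {
    uint64_t dataId;
    uint32_t crossUnitAccesses = 0;
    bool     valid = true;  // 1: available/resident; 0: unlocked (claim 5)
};

// Hypothetical global mapping (claim 6, simplified): dataId -> owning SM.
std::unordered_map<uint64_t, uint16_t> globalMapping;

// Per-SM sharing tables, indexed by SM number.
std::unordered_map<uint16_t, std::vector<SharingEntry>> sharingTables;

// Placeholder for the conventional access path to lower-level storage (claim 2).
uint64_t fetchFromLowerLevel(uint64_t dataId) {
    return dataId;
}

// Locate the owning SM via the global mapping table (claim 4).
std::optional<uint16_t> findOwnerSm(uint64_t dataId) {
    auto it = globalMapping.find(dataId);
    if (it == globalMapping.end()) return std::nullopt;
    return it->second;
}

// On a miss in requesterSm: if a valid entry exists in the owner's table,
// "read" from that SM and bump the cross-unit access count; else fall back.
uint64_t onMiss(uint16_t requesterSm, uint64_t dataId) {
    (void)requesterSm;  // the data request unit of claim 4; unused in this model
    if (auto owner = findOwnerSm(dataId)) {
        auto& table = sharingTables[*owner];
        auto e = std::find_if(table.begin(), table.end(),
                              [&](const SharingEntry& s) { return s.dataId == dataId; });
        if (e != table.end() && e->valid) {
            ++e->crossUnitAccesses;      // claim 4: accumulate cross-unit accesses
            return dataId;               // stand-in for the transferred data
        }
    }
    return fetchFromLowerLevel(dataId);  // claim 2: conventional access path
}

// Eviction when a table reaches its storage upper limit (claim 7): drop the
// entry with the fewest cross-unit accesses, clearing its valid bit first.
void evictLeastShared(uint16_t sm) {
    auto& table = sharingTables[sm];
    if (table.empty()) return;
    auto victim = std::min_element(table.begin(), table.end(),
        [](const SharingEntry& a, const SharingEntry& b) {
            return a.crossUnitAccesses < b.crossUnitAccesses;
        });
    victim->valid = false;  // unlock before deletion (claim 7)
    table.erase(victim);
}
```

A real implementation would live in the cache controller and sharing-table hardware; the container-based model above only mirrors the decision logic of the claims.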

Description

Cross-SM data sharing method for a GPGPU in a multi-Kernel scenario

Technical Field

The invention belongs to the technical field of GPGPU parallel computing and cache optimization, and particularly relates to a cross-SM data sharing method for a GPGPU in a multi-Kernel scenario.

Background

A general-purpose graphics processor (GPGPU) is widely applied in scientific computing, cloud inference services, deep learning training, graphics rendering, and similar fields by virtue of its highly parallel computing capability, and its cores bear the execution of high-throughput parallel tasks. With the rapid growth in demand for massively parallel computing platforms, particularly in high-concurrency scenarios such as image processing, large-model inference, and real-time simulation, concurrent execution of multiple Kernels has become the mainstream way to improve task processing efficiency, placing ever more stringent demands on the GPGPU's memory access mechanism. A Kernel generally refers to a parallel computing task or program executed on a GPU; in a multi-Kernel scenario, the GPGPU executes multiple computing tasks (Kernels) in parallel across its multiple streaming multiprocessors (SMs). However, the hardware architecture of existing GPGPUs has inherent limitations that make memory access a pronounced bottleneck under multi-Kernel concurrency. When multiple Kernels run concurrently, large numbers of threads compete for each SM's limited cache resources, easily causing frequent cache replacement and a significant rise in the L1D Cache miss rate, which in turn greatly increases access latency and constrains overall GPGPU performance. To alleviate these problems, various cache optimization schemes have been proposed, such as locality-based last-level cache (LLC) designs and shared-data identification methods based on access-pattern prediction. However, these schemes struggle to meet the practical requirements of multi-Kernel concurrency: their core designs target single-Kernel scenarios and lack the ability to adapt to the differing access characteristics of multiple Kernels, so they cannot satisfy the complex SM resource competition and data interaction requirements among multiple Kernels on a GPGPU.

Disclosure of Invention

To solve the problems described in the background, the invention provides a cross-SM data sharing method for a GPGPU in a multi-Kernel scenario.
To achieve the above object, the invention provides a cross-SM data sharing method for a GPGPU in a multi-Kernel scenario, comprising the following steps. In a multi-Kernel scenario, an access counter and a resident identifier are added to the L1D Cache line data of the first-level data cache of each streaming multiprocessor (SM) in the GPGPU; the access counter counts the access frequency of the L1D Cache line data within a set period, the L1D Cache line data accessed with high frequency in each streaming multiprocessor is screened out based on that frequency and recorded as high-frequency shared data, the high-frequency shared data is made resident according to the resident identifier, and the unique identifier of the high-frequency shared data and the streaming multiprocessor to which it belongs are recorded. A cross-SM data sharing table is designed for the target GPGPU, comprising a unique identifier for each item of high-frequency shared data, the number of cross-unit accesses to that data, and a valid bit marking the availability of the corresponding high-frequency shared data. When operation data is missing in some SM, the global mapping table is queried to determine whether the data resides in another SM; if the target operation data is found and its valid bit is marked available, it is obtained directly from the corresponding SM, realizing data sharing. Preferably, when operation data is missing in a certain SM, the method further comprises: if the target operation data is not found, acquiring it from a lower-level storage unit through the conventional data access path; when the cross-SM data sharing table reaches its storage upper limit, setting the resident identifier to the unlocked state and deleting the high-frequency shared data with the fewest cross-unit accesses; and when Kernel execution on the target general-purpose graphics processor finishes, clearing the cross-SM data sharing table. The access counter is configured in the L1D Cache line and specifically comprises a double-period access counter divided into a first-half access counter FC and a second-half access counter BC.
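
Claim 3 refines this access counter into a double-period design: FC counts accesses in the first half of a sampling period, BC in the second half, and a line is promoted to high-frequency shared data only when both halves clear the threshold. Below is a minimal C++ sketch of that screening rule, under the assumption that the period midpoint and end are signaled by the cache controller; the type name and the phase-switch methods are illustrative, not from the patent.

```cpp
#include <cstdint>

// Double-period access counter (claim 3): FC counts hits in the first half
// of the sampling period, BC in the second half. The line qualifies as
// high-frequency shared data only if BOTH halves reach the threshold,
// filtering out bursts confined to one half of the period.
struct DualPeriodCounter {
    uint32_t fc = 0;             // first-half access counter (FC)
    uint32_t bc = 0;             // second-half access counter (BC)
    bool     secondHalf = false; // current phase within the sampling period

    void recordAccess() { (secondHalf ? bc : fc)++; }

    // Called by the (hypothetical) controller at the period midpoint.
    void switchToSecondHalf() { secondHalf = true; }

    // Evaluated at the end of the sampling period.
    bool isHighFrequency(uint32_t threshold) const {
        return fc >= threshold && bc >= threshold;
    }

    // Reset for the next sampling period.
    void reset() { fc = 0; bc = 0; secondHalf = false; }
};
```

Requiring both halves to clear the threshold means only data with sustained rather than transient reuse gets pinned, which is presumably why the claim splits the period in two.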