
CN-122019157-A - GPU parallel thread block adjustment and kernel function scheduling optimization method and system

CN 122019157 A

Abstract

The invention discloses a GPU parallel thread block adjustment and kernel function scheduling optimization method and system. GPU device attributes and kernel function execution attributes are obtained, and a feasible thread block size solution space is determined for each kernel function based on these attributes; the thread block size with the maximum load balance evaluation model value is selected as the optimal execution configuration of the kernel function. Based on the determined optimal thread block size of each kernel function, the execution grid of each kernel function is divided into a plurality of kernel function slices, and a concurrent execution relationship among the kernel functions is established. A CUDA graph is then constructed in a hybrid mode according to the established concurrent execution relationship and the data dependencies among the kernel functions. Taking an actual model of particle simulation software as a test object, the GPU occupancy rate, the kernel function execution time and the overall simulation efficiency are compared before and after thread block self-adaptive adjustment, multi-stream concurrent scheduling and hybrid CUDA graph scheduling optimization, and the optimized particle simulation result is output.

Inventors

  • YANG WENJIN
  • DANG YUAN
  • CHEN YUJUN
  • LI YONGDONG
  • WANG HONGGUANG

Assignees

  • Xi'an Jiaotong University (西安交通大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-23

Claims (10)

  1. A GPU parallel thread block adjustment and kernel function scheduling optimization method, characterized by comprising the following steps: acquiring GPU device attributes and kernel function execution attributes; based on the GPU device attributes and kernel function execution attributes, determining a feasible thread block size solution space for each kernel function according to the triple constraints of warp alignment, total register limit and total shared memory limit, and selecting the thread block size with the maximum load balance evaluation model value as the optimal execution configuration of the kernel function; dividing the execution grid of each kernel function into a plurality of kernel function slices based on the determined optimal thread block size of each kernel function, and establishing a concurrent execution relationship among the kernel functions; constructing a CUDA graph in a hybrid mode according to the established concurrent execution relationship and data dependencies among the kernel functions; and, taking an actual model of the particle simulation software as a test object, respectively comparing the GPU occupancy rate, the kernel function execution time and the overall simulation efficiency before and after thread block self-adaptive adjustment, multi-stream concurrent scheduling and hybrid CUDA graph scheduling optimization, and outputting the optimized particle simulation result.
  2. The GPU parallel thread block adjustment and kernel function scheduling optimization method according to claim 1, wherein acquiring the GPU device attributes and kernel function execution attributes comprises: acquiring device attributes including the number of SMs, the maximum number of thread blocks per SM and the warp size through cudaDeviceProps, and extracting kernel function attributes, namely the register usage per thread and the shared memory occupied by the kernel function, using cudaFuncGetAttributes (see the attribute-query sketch after the claims).
  3. The GPU parallel thread block adjustment and kernel function scheduling optimization method according to claim 1, wherein the triple constraints of warp alignment, total register limit and total shared memory limit, based on the GPU device attributes and kernel function execution attributes, comprise: warp alignment, requiring the thread block size to be an integral multiple of the warp size; the register limit, requiring the total registers used by a thread block not to exceed the device upper limit; and the shared memory limit, requiring the shared memory used by a thread block not to exceed the device upper limit; the thread block solution space of the kernel function is initialized according to these three conditions, and the optimal load balance degree and the optimal thread block size are initialized to zero.
  4. The GPU parallel thread block adjustment and kernel function scheduling optimization method according to claim 3, wherein determining the feasible thread block size solution space of each kernel function and selecting the thread block size with the largest load balance evaluation model value as the optimal execution configuration of the kernel function comprises: building a GPU load balance evaluation model as a weighted product of the warp occupancy rate, the thread block occupancy rate and the grid occupancy rate; traversing the solution space, and if the current load balance degree is larger than the optimal load balance degree, updating the optimal load balance degree and the optimal thread block size to the current values, otherwise proceeding to the next candidate until the traversal is complete; the finally obtained optimal thread block size is the thread block size that best balances the SM load of the GPU (see the block-size search sketch after the claims).
  5. The GPU parallel thread block adjustment and kernel function scheduling optimization method according to claim 4, wherein the warp occupancy rate represents the ratio of the number of warps actually executing concurrently on an SM to the maximum number of warps supported by the hardware, the thread block occupancy rate represents the ratio of the number of thread blocks executing concurrently on an SM to the maximum number of thread blocks supported by the hardware, and the grid occupancy rate represents the ratio of the total number of thread blocks in the grid to the maximum number of thread blocks that can run concurrently on all SMs of the GPU.
  6. The GPU parallel thread block adjustment and kernel function scheduling optimization method according to claim 1, wherein dividing the execution grid of each kernel function into a plurality of kernel function slices based on the determined optimal thread block size of each kernel function and establishing a concurrent execution relationship among the kernel functions comprises: slicing each kernel function, where scheduling is performed in units of kernel function slices and the thread block configuration of the kernel function is obtained from the preceding algorithm; dividing the execution grid of the kernel function, with the slice size set to the number of SMs in the GPU, into a plurality of kernel function slices; assigning a CUDA stream to each slice for concurrent execution; and finally synchronizing the execution results in all CUDA streams and aggregating them into the final execution result of the whole kernel function, thereby completing execution of the kernel function (see the multi-stream slicing sketch after the claims).
  7. The GPU parallel thread block adjustment and kernel function scheduling optimization method according to claim 1, wherein constructing the CUDA graph in a hybrid mode according to the established concurrent execution relationship and data dependencies among the kernel functions comprises: first, capturing the modules with static parameters, such as the electromagnetic field advance and the node field calculation, with the stream capture API to generate an initial static subgraph, and starting the particle simulation calculation flow by loading the static graph; if particle generation is detected, adding the kernel function of the particle push module to the static graph, calling the cudaGraphAddKernelNode function to manually construct the CUDA graph node of the particle push step, and establishing its dependency on the static subgraph through the cudaGraphAddDependencies function; if the change in particle number does not exceed the thread size currently configured for the particle push kernel function, no change is made and the CUDA graph continues to be loaded for particle simulation calculation, whereas if it does exceed that size, the grid parameters of the particle push node manually constructed in the second stage are updated through the cudaGraphKernelNodeSetParams function (see the hybrid CUDA graph sketch after the claims).
  8. A GPU parallel thread block adjustment and kernel function scheduling optimization system, comprising: a data acquisition module for acquiring GPU device attributes and kernel function execution attributes; a configuration module for determining a feasible thread block size solution space for each kernel function according to the triple constraints of warp alignment, total register limit and total shared memory limit based on the GPU device attributes and kernel function execution attributes, and selecting the thread block size with the maximum load balance evaluation model value as the optimal execution configuration of the kernel function; a relation establishing module for dividing the execution grid of each kernel function into a plurality of kernel function slices based on the determined optimal thread block size of each kernel function and establishing a concurrent execution relationship among the kernel functions; and a comparison output module for constructing a CUDA graph in a hybrid mode according to the established concurrent execution relationship and data dependencies among the kernel functions, taking an actual model of the particle simulation software as a test object, respectively comparing the GPU occupancy rate, the kernel function execution time and the overall simulation efficiency before and after thread block self-adaptive adjustment, multi-stream concurrent scheduling and hybrid CUDA graph scheduling optimization, and outputting the optimized particle simulation result.
  9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the GPU parallel thread block adjustment and kernel function scheduling optimization method as claimed in any one of claims 1 to 7.
  10. A computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the GPU parallel thread block adjustment and kernel function scheduling optimization method as claimed in any one of claims 1 to 7.
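
For illustration of claim 2, the following is a minimal sketch of how the claimed device and kernel attributes can be queried through the CUDA runtime API (cudaGetDeviceProperties backing the cudaDeviceProps query named in the claim, plus cudaFuncGetAttributes). The kernel pushParticles, device index 0 and the assumption of CUDA 11 or later (for maxBlocksPerMultiProcessor) are choices made for the example, not part of the claims.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for any XEMPIC kernel function.
__global__ void pushParticles(float* pos, const float* vel, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] += vel[i];
}

int main() {
    // Device attributes: SM count, warp size, per-SM thread/block/register/shared-memory limits.
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);
    printf("SMs=%d warpSize=%d maxThreads/SM=%d maxBlocks/SM=%d regs/SM=%d shmem/SM=%zu\n",
           prop.multiProcessorCount, prop.warpSize,
           prop.maxThreadsPerMultiProcessor, prop.maxBlocksPerMultiProcessor,
           prop.regsPerMultiprocessor, prop.sharedMemPerMultiprocessor);

    // Kernel execution attributes: registers per thread and static shared memory per block.
    cudaFuncAttributes attr{};
    cudaFuncGetAttributes(&attr, pushParticles);
    printf("regs/thread=%d staticShmem/block=%zu maxThreads/block=%d\n",
           attr.numRegs, attr.sharedSizeBytes, attr.maxThreadsPerBlock);
    return 0;
}
```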
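The block-size search of claims 3 to 5 can be sketched as below. The occupancy formulas and the unweighted product are illustrative assumptions standing in for the patented load balance evaluation model; only the three claimed constraints (warp alignment, per-block register limit, per-block shared memory limit) and the traverse-and-update logic follow the claims.

```cuda
#include <algorithm>
#include <cuda_runtime.h>

struct BlockConfig { int blockSize; double balance; };

// Traverse the feasible block-size solution space and keep the size with the
// largest load-balance value (illustrative model: product of warp, block and
// grid occupancy; the claimed model weights these terms).
BlockConfig searchBlockSize(const cudaDeviceProp& prop,
                            const cudaFuncAttributes& attr,
                            long long totalThreads) {
    BlockConfig best{0, 0.0};                              // optimum initialized to zero
    int maxWarpsPerSM  = prop.maxThreadsPerMultiProcessor / prop.warpSize;
    int maxBlocksPerSM = prop.maxBlocksPerMultiProcessor;

    // Constraint 1: warp alignment - block size is a multiple of the warp size.
    for (int bs = prop.warpSize; bs <= attr.maxThreadsPerBlock; bs += prop.warpSize) {
        // Constraint 2: registers used by one block must not exceed the device limit.
        if ((long long)attr.numRegs * bs > prop.regsPerBlock) continue;
        // Constraint 3: shared memory used by one block must not exceed the device limit.
        if (attr.sharedSizeBytes > prop.sharedMemPerBlock) continue;

        // Blocks that can reside concurrently on one SM under thread/register/block limits.
        int byThreads = prop.maxThreadsPerMultiProcessor / bs;
        int byRegs    = attr.numRegs > 0
                        ? prop.regsPerMultiprocessor / (attr.numRegs * bs)
                        : maxBlocksPerSM;
        int resident  = std::min({byThreads, byRegs, maxBlocksPerSM});
        if (resident == 0) continue;

        long long gridBlocks = (totalThreads + bs - 1) / bs;
        double warpOcc  = double(resident * (bs / prop.warpSize)) / maxWarpsPerSM;
        double blockOcc = double(resident) / maxBlocksPerSM;
        double gridOcc  = std::min(1.0, double(gridBlocks) /
                          double((long long)resident * prop.multiProcessorCount));
        double balance  = warpOcc * blockOcc * gridOcc;

        // Keep the best candidate, as in claim 4.
        if (balance > best.balance) best = {bs, balance};
    }
    return best;
}
```

Combined with the query sketch above, a caller would use searchBlockSize(prop, attr, nParticles).blockSize as the launch block size for that kernel.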
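For claim 6, a minimal sketch of slice-based multi-stream launching follows. The kernel signature with an explicit element offset is an assumption made so the example stays self-contained; the slice size in blocks equals the SM count, as claimed.

```cuda
#include <algorithm>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel; the offset lets each slice work on its own sub-range.
__global__ void pushParticles(float* pos, const float* vel, int offset, int n) {
    int i = offset + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) pos[i] += vel[i];
}

void launchSliced(float* d_pos, const float* d_vel, int n, int blockSize) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, 0);

    int totalBlocks = (n + blockSize - 1) / blockSize;
    int sliceBlocks = prop.multiProcessorCount;            // slice size = number of SMs
    int numSlices   = (totalBlocks + sliceBlocks - 1) / sliceBlocks;

    // One CUDA stream per kernel slice, so slices can execute concurrently.
    std::vector<cudaStream_t> streams(numSlices);
    for (auto& s : streams) cudaStreamCreate(&s);

    for (int k = 0; k < numSlices; ++k) {
        int firstBlock = k * sliceBlocks;
        int blocks     = std::min(sliceBlocks, totalBlocks - firstBlock);
        pushParticles<<<blocks, blockSize, 0, streams[k]>>>(
            d_pos, d_vel, firstBlock * blockSize, n);
    }

    // Synchronize all slice streams; together their results form the result of
    // the whole kernel function.
    for (auto& s : streams) { cudaStreamSynchronize(s); cudaStreamDestroy(s); }
}
```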
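Finally, a sketch of the hybrid CUDA graph construction of claim 7. The kernels fieldAdvance, nodeField and particlePush, the launch dimensions, and the use of cudaGraphInstantiateWithFlags (CUDA 11.4 or later) are assumptions for the example; the stream capture, cudaGraphAddKernelNode and cudaGraphAddDependencies steps, and the cudaGraphKernelNodeSetParams update path noted in the last comment, mirror the claim.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void fieldAdvance(float* E)        { /* static-parameter module */ }
__global__ void nodeField(float* E)           { /* static-parameter module */ }
__global__ void particlePush(float* p, int n) { /* dynamic particle module */ }

void buildHybridGraph(float* d_E, float* d_p, int nParticles, int blockSize) {
    cudaStream_t s;
    cudaStreamCreate(&s);

    // Stage 1: capture the static-parameter modules into an initial static subgraph.
    cudaGraph_t graph;
    cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
    fieldAdvance<<<128, 128, 0, s>>>(d_E);
    nodeField<<<128, 128, 0, s>>>(d_E);
    cudaStreamEndCapture(s, &graph);

    // Remember the captured nodes so the new node can depend on the static subgraph.
    size_t numNodes = 0;
    cudaGraphGetNodes(graph, nullptr, &numNodes);
    std::vector<cudaGraphNode_t> staticNodes(numNodes);
    cudaGraphGetNodes(graph, staticNodes.data(), &numNodes);

    // Stage 2: once particles are generated, manually add the particle-push node.
    void* args[] = { &d_p, &nParticles };
    cudaKernelNodeParams kp{};
    kp.func         = (void*)particlePush;
    kp.gridDim      = dim3((nParticles + blockSize - 1) / blockSize);
    kp.blockDim     = dim3(blockSize);
    kp.kernelParams = args;

    cudaGraphNode_t pushNode;
    cudaGraphAddKernelNode(&pushNode, graph, nullptr, 0, &kp);
    cudaGraphNode_t from = staticNodes.back();
    cudaGraphAddDependencies(graph, &from, &pushNode, 1);

    // Stage 3: instantiate and launch. If the particle count later outgrows the
    // configured grid, the claim updates the node's grid parameters with
    // cudaGraphKernelNodeSetParams (followed by re-instantiation) before relaunching.
    cudaGraphExec_t exec;
    cudaGraphInstantiateWithFlags(&exec, graph, 0);
    cudaGraphLaunch(exec, s);
    cudaStreamSynchronize(s);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
    cudaStreamDestroy(s);
}
```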

Description

GPU parallel thread block adjustment and kernel function scheduling optimization method and system

Technical Field

The invention belongs to the technical field of GPU optimization, and particularly relates to a method and a system for GPU parallel thread block adjustment and kernel function scheduling optimization.

Background

A high-power microwave (High Power Microwave, HPM) device can generate, carry and transmit microwave signals at high power levels, and plays an irreplaceable role in national defense, aerospace, high-energy physics and related fields. However, the internal physical processes of HPM devices are extremely complex and highly nonlinear. Conventional theoretical analysis can hardly explain the internal physical mechanism completely and precisely, and diagnosing the coupling mechanism of the electron beam and the electromagnetic field inside the device by experimental means faces serious difficulties. In this context, the particle simulation method based on first principles is the core technical means for solving this problem. By discretizing and coupling the solution of the Maxwell equations and the Newton-Lorentz equation, the particle simulation method can accurately reproduce the trajectories of charged particles under a self-consistent electromagnetic field and simulate physical processes whose structures are complex and whose analytical formulas are hard to derive, providing a powerful tool for revealing the complex physical processes inside HPM devices. After years of development, particle simulation has become one of the key tools for studying and understanding plasmas, and is widely applied in controlled nuclear fusion, vacuum electronics, low-temperature plasmas and other national defense fields.

With the development of HPM technology, researchers' demands for performance optimization of HPM devices and for the research and development of new devices are increasingly pressing, which makes the computational complexity of the particle simulation method grow rapidly. When simulation is performed on a traditional CPU (Central Processing Unit) architecture, a large HPM device often has to be simulated for several weeks, which severely restricts scientific research progress and can hardly meet researchers' demand for fast computation. The field of high-end electromagnetic simulation software in China has long faced strict technical blockades from Western countries, and the development situation is severe. In 2016, the Institute of Plasma and Microwave Electronics at Xi'an Jiaotong University, building on the UNIPIC software, successfully developed the full electromagnetic particle simulation software XEMPIC using conformal meshing technology. However, the current scheme still suffers from low parallelism, unbalanced resource utilization and other problems, and leaves considerable room for optimization.

Based on this situation, this work takes the XEMPIC software as the research object, analyzes the bottlenecks of the existing architecture, and constructs a GPU (Graphics Processing Unit)-based parallel thread block adjustment and kernel function scheduling mechanism, which markedly improves the GPU parallel computing efficiency of XEMPIC, provides faster and stronger support for scientific research, helps China break through technical blockades in the field of high-power microwave devices and related national defense high technology, and promotes independent innovation. The traditional particle simulation method has obvious shortcomings, specifically:

1. A serial CPU architecture is used: the number of cores is small, the parallel computing power is limited, the memory bandwidth is low and the data transfer latency is high, so the massive parallel computing tasks of large-scale simulation can hardly be carried; the simulation is slow and inefficient, easily limited by computational bottlenecks, and cannot meet high-efficiency simulation requirements.

2. The traditional method reflects an unreasonable GPU load state only through the SM (Streaming Multiprocessor) occupancy rate; SM occupancy is a sufficient but not necessary condition for GPU load balancing and cannot characterize global load balance. A fixed thread block size allocation strategy easily causes idle waste of thread block resources, and for memory-access-limited kernel functions in particular it greatly increases run time, ultimately limiting overall computing performance in heterogeneous computing environments and preventing each kernel function from reaching its optimal execution efficiency.

3. The