CN-122019946-A - Sparse matrix-vector multiplication acceleration method for a MetaX GPU
Abstract
The invention discloses a sparse matrix-vector multiplication acceleration method for a MetaX GPU, belonging to the field of high-performance computing. The method comprises: obtaining a sparse matrix A and a vector x in CSR format; dividing A into 16×16 sub-blocks (Tiles) according to the operand width of the MetaX GPU matrix-multiplication instruction; generating a CSR sub-matrix row_base_list for Tiles with non-zero ratio η < 10% and a Tile-Dense sub-matrix block_base_list for the rest; dividing the rows of the CSR sub-matrix into light, medium and heavy rows according to their non-zero counts; completing a three-level reduction within a single GPU kernel using Thread Block Tiles, on-chip shared memory and global-memory atomic operations to obtain y_csr; invoking the mfma instruction to accumulate y_tile; and finally accumulating y_csr and y_tile to output the result y. The method can replace the mcSPARSE interface at the binary level: without changing the upper-level framework, it alleviates load imbalance and improves the utilization of matrix-multiplication instructions.
Inventors
- CAI ZHENGYANG
Assignees
- 蔡正阳 (CAI Zhengyang)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-02-13
Claims (1)
- 1. A sparse matrix-vector multiplication acceleration method for a MetaX GPU, characterized by comprising the following steps:
S1, acquiring a sparse matrix A ∈ R^(M×K) and a vector x ∈ R^K in compressed sparse row (CSR) format, allocating device memory, storing the row offsets rowPtr, the column indices colIdx and the non-zero values val, and recording the row count M, the column count K and the non-zero count nnz;
S2, according to the operand width of the MetaX GPU matrix-multiplication acceleration instruction mfma.f32.16x16x16.f16, uniformly dividing the sparse matrix A into 16×16 sub-blocks (Tiles), defining NNZ_PER_tile as the number of non-zeros in a Tile and computing each Tile's non-zero ratio η = NNZ_PER_tile/256; if η < 10%, all non-zeros of the Tile are moved into row_base_list and marked as part of the CSR sub-matrix, otherwise they are moved into block_base_list and marked as part of the Tile-Dense sub-matrix;
S3, defining r as a row index of the sparse matrix, counting the number of non-zeros NNZ(r) in each row, setting th1 = 64 and th2 = 1024, and classifying the rows by the following rules: light rows L1: NNZ(r) ≤ th1; medium rows L2: th1 < NNZ(r) ≤ th2; heavy rows L3: NNZ(r) > th2; and generating the arrays L1_rows, L2_rows and L3_rows, which store only row indices;
S4, processing the CSR sub-matrix in parallel within a single GPU kernel: for L1 rows, taking a Thread Block Tile as the unit, reading the row's non-zeros into on-chip shared memory via coalesced memory access, completing a register-level reduction with the warp-level __shfl_xor_sync primitive, and writing the result to y_csr[r]; for L2 rows, each Thread Block Tile handles one row; with thread indices tid ∈ [0, 1023] within the Thread Block Tile, a binary-tree reduction of depth log2(1024) = 10 is executed in shared memory and thread tid = 0 writes y_csr[r]; for L3 rows, T = ⌈NNZ(r)/1024⌉ Thread Block Tiles cooperatively compute partial reductions; each Thread Block Tile executes atomicAdd(&C[r], 1) on a global-memory int32_t counter C[r], and when C[r] = T the last Thread Block Tile completes the final reduction and writes y_csr[r];
S5, launching a GPU kernel for the Tile-Dense sub-matrix, where each Thread Block Tile handles one 16×16 Tile, loading 256 float16 values into registers, invoking the mfma_f32_16x16x16_f16 instruction to complete the local matrix-vector product, and writing the result to a temporary array y_tile;
S6, launching a GPU kernel that accumulates y_tile and y_csr by row index to obtain the final SpMV result y.
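The preprocessing in steps S2 and S3 can be sketched on the host side as follows. This is a minimal Python illustration, not the patented implementation: the helper names `partition_tiles` and `classify_rows` are hypothetical, and since the patent does not specify the on-device layout of row_base_list/block_base_list, tiles are identified here simply by their (block-row, block-column) coordinates.

```python
from collections import defaultdict

TILE = 16           # tile edge, from the mfma 16x16x16 operand shape
ETA_THRESH = 0.10   # density threshold eta from step S2
TH1, TH2 = 64, 1024 # row-class thresholds th1, th2 from step S3

def partition_tiles(rowPtr, colIdx, M, K):
    """Count non-zeros per 16x16 tile and split tiles by density eta (step S2)."""
    nnz_per_tile = defaultdict(int)
    for r in range(M):
        for p in range(rowPtr[r], rowPtr[r + 1]):
            nnz_per_tile[(r // TILE, colIdx[p] // TILE)] += 1
    row_base_list, block_base_list = [], []
    for tile, n in nnz_per_tile.items():
        eta = n / (TILE * TILE)  # NNZ_PER_tile / 256
        (row_base_list if eta < ETA_THRESH else block_base_list).append(tile)
    return row_base_list, block_base_list

def classify_rows(rowPtr, M):
    """Split rows into light (L1), medium (L2) and heavy (L3) by NNZ(r) (step S3)."""
    L1_rows, L2_rows, L3_rows = [], [], []
    for r in range(M):
        nnz_r = rowPtr[r + 1] - rowPtr[r]
        if nnz_r <= TH1:
            L1_rows.append(r)
        elif nnz_r <= TH2:
            L2_rows.append(r)
        else:
            L3_rows.append(r)
    return L1_rows, L2_rows, L3_rows
```

For example, a 16×32 matrix whose first two rows are fully dense in columns 0-15 yields η = 32/256 = 12.5% for tile (0, 0), routing it to block_base_list, while a lone non-zero in column 16 leaves tile (0, 1) in row_base_list.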
Description
Sparse matrix-vector multiplication acceleration method for a MetaX GPU

Technical Field

The invention relates to the field of high-performance computing, and in particular to a sparse matrix-vector multiplication acceleration method for a MetaX (沐曦) GPU, which can be widely applied in scenarios such as scientific computing, deep learning inference, graph neural networks and large-scale graph analytics.

Background

Sparse matrix-vector multiplication (SpMV) is a core computational operator in applications such as scientific computing, deep learning inference, graph neural networks and large-scale graph analytics. With the exponential growth of problem scale, the share of end-to-end runtime spent in SpMV has increased significantly, making it one of the key bottlenecks limiting overall system performance. To fully exploit the parallel computing power of graphics processors (GPUs), academia and industry have proposed various SpMV optimization strategies built around general-purpose GPU architectures such as NVIDIA and AMD, for example the MERGE-CSR and SELL-C-σ algorithms. These methods achieve good results on their target platforms, but because they are tightly coupled to the underlying instruction set, memory hierarchy and thread scheduler, directly porting them to a MetaX GPU, which is designed for artificial-intelligence and scientific-computing workloads, often degrades performance or even causes functional failures. In the sparse linear algebra library mcSPARSE provided by MetaX, the mcsparseCsrmvEx interface performs well in most typical scenarios; however, when the per-row non-zero distribution of the sparse matrix varies strongly, its performance degrades markedly.
Analysis shows that the MetaX GPU faces two major challenges in SpMV implementations.

Load imbalance: sparse matrices common in scientific computing have extremely large variance in their per-row non-zero distribution, so task assignment across the compute units of the MetaX many-core processor is severely unbalanced; some threads are overloaded while a large number of threads sit idle or waiting, seriously harming parallel efficiency.

Low compute-unit utilization: the sparse pattern lacks spatial locality, so the matrix-multiplication acceleration instructions that account for the bulk of the MetaX GPU's compute throughput are rarely triggered, leaving core compute resources underused.

Therefore, an SpMV acceleration method designed specifically for the architectural characteristics of the MetaX GPU is needed, one that can sense the structural characteristics of a sparse matrix online, dynamically balance the computational load, and fully activate the matrix-multiplication acceleration instructions, thereby improving robustness across matrix distributions.

Disclosure of Invention

When an existing MetaX GPU executes sparse matrix-vector multiplication (SpMV), the extremely uneven per-row non-zero distribution causes large variation in the amount of work per Thread Block Tile, the utilization of the matrix-multiplication acceleration instruction mfma is low, and overall parallel efficiency fluctuates severely as the matrix distribution changes.
To solve these problems, the invention provides an SpMV acceleration method for a MetaX GPU, comprising the following steps:
S1, acquiring a sparse matrix A ∈ R^(M×K) and a vector x ∈ R^K in CSR format, allocating device memory, storing the row offsets rowPtr, the column indices colIdx and the non-zero values val, and recording the row count M, the column count K and the non-zero count nnz;
S2, according to the operand width of the MetaX GPU matrix-multiplication instruction mfma.f32.16x16x16.f16, dividing A into 16×16 sub-blocks (Tiles), defining NNZ_PER_tile as the number of non-zeros in a Tile and computing η = NNZ_PER_tile/256; if η < 10%, all non-zeros of the Tile are moved into row_base_list and marked as part of the CSR sub-matrix, otherwise into block_base_list and marked as part of the Tile-Dense sub-matrix;
S3, defining r as a row index of the sparse matrix, counting the number of non-zeros NNZ(r) in each row, setting thresholds th1 and th2, dividing the rows into light rows L1 (NNZ(r) ≤ th1), medium rows L2 (th1 < NNZ(r) ≤ th2) and heavy rows L3 (NNZ(r) > th2), and generating the arrays L1_rows, L2_rows and L3_rows, which store only row indices;
S4, processing the CSR sub-matrix in parallel within a single GPU kernel: for L1 rows, taking a Thread Block Tile as the unit, reading the row's non-zeros into on-chip shared memory via coalesced memory access
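The three reduction tiers of step S4 are device-side operations; the following is a minimal host-side sketch, with plain Python standing in for GPU code and hypothetical helper names, showing how the butterfly (L1), binary-tree (L2) and atomic-counter (L3) reductions each arrive at the same row sum.

```python
def shfl_xor_reduce(vals):
    """Simulate the warp-level __shfl_xor_sync butterfly reduction (L1 rows).
    vals: per-lane partial sums; length must be a power of two."""
    lanes = list(vals)
    mask = 1
    while mask < len(lanes):
        # each lane adds the value held by its XOR partner
        lanes = [lanes[i] + lanes[i ^ mask] for i in range(len(lanes))]
        mask <<= 1
    return lanes[0]  # every lane ends up holding the full sum

def tree_reduce(vals):
    """Simulate the shared-memory binary-tree reduction (L2 rows):
    halving strides give depth log2(len(vals))."""
    buf = list(vals)
    stride = len(buf) // 2
    while stride > 0:
        for tid in range(stride):
            buf[tid] += buf[tid + stride]
        stride //= 2
    return buf[0]  # thread tid = 0 writes this value to y_csr[r]

def atomic_counter_reduce(partials, T):
    """Simulate the L3 path: T thread blocks each contribute a partial sum
    and bump a counter via atomicAdd; the last block writes the total."""
    C = 0
    total = 0.0
    result = None
    for part in partials:   # each iteration models one Thread Block Tile
        total += part       # partial reduction accumulated in global memory
        C += 1              # atomicAdd(&C[r], 1)
        if C == T:          # last arriving block finalizes the row
            result = total
    return result
```

All three paths compute the same dot-product contribution; they differ only in how many cooperating threads or blocks the non-zero count of the row justifies.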