CN-121996410-A - Matrix multiplication optimization method and system for RISC-V architecture
Abstract
The invention discloses a matrix multiplication optimization method and system for the RISC-V architecture. The method detects the hardware characteristics of a target RISC-V platform and adaptively configures optimization parameters according to the detection results; performs memory layout and data rearrangement on the input matrix B to realize memory optimization; adopts a multi-level blocking strategy that adapts to the memory hierarchy to realize multi-level block computation; performs RVV vectorized core computation with the RISC-V vector instruction extension to realize data-level parallel optimization; hides instruction latency through loop unrolling and software pipelining, and adopts a data prefetching mechanism to realize instruction-level parallel optimization; performs multi-core parallel computation with OpenMP multithreading to realize thread-level parallel optimization; and computes the remaining elements with vector registers to realize tail-processing optimization. The method improves the cache hit rate and computational performance, fully exploits the hardware potential, and is suitable for demanding edge devices.
Inventors
- LI SHEN
- LIU SIMING
- BAO HONG
Assignees
- Xidian University (西安电子科技大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-12-29
Claims (10)
- 1. A matrix multiplication optimization method for the RISC-V architecture, comprising: detecting hardware characteristics of a target RISC-V platform and adaptively configuring optimization parameters according to the detection results, wherein the optimization parameters comprise blocking parameters and an OpenMP thread count; performing memory layout and data rearrangement on the input matrix B to realize memory optimization; based on the blocking parameters and the memory optimization result, adopting a multi-level blocking strategy that adapts to the memory hierarchy to realize multi-level block computation; performing RVV vectorized core computation using the RISC-V vector instruction extension to realize data-level parallel optimization; hiding instruction latency through loop unrolling and software pipelining, and adopting a data prefetching mechanism to realize instruction-level parallel optimization; based on the OpenMP thread count, performing multi-core parallel computation using OpenMP multithreading to realize thread-level parallel optimization; and computing the remaining elements using vector registers to realize tail-processing optimization.
- 2. The matrix multiplication optimization method for the RISC-V architecture according to claim 1, wherein detecting the hardware characteristics of the target RISC-V platform and adaptively configuring the optimization parameters according to the detection results comprises: dynamically querying the vector register length VLEN, and determining from VLEN the number of floating-point numbers a single vector can process; dynamically detecting the RISC-V core cache sizes, and determining the blocking parameters from the cache sizes; and dynamically detecting the number of CPU cores, and configuring the OpenMP thread count according to the number of CPU cores.
- 3. The matrix multiplication optimization method for the RISC-V architecture according to claim 1, wherein performing memory layout and data rearrangement on the input matrix B to realize memory optimization comprises: transposing the input matrix B to convert column-major access into row-major access; and aligning data addresses to the vector length using an aligned memory allocation mechanism to complete the data rearrangement.
- 4. The matrix multiplication optimization method for the RISC-V architecture according to claim 1, wherein adapting the memory hierarchy with a multi-level blocking strategy based on the blocking parameters and the memory optimization result to realize multi-level block computation comprises: adopting a three-level blocking strategy adapted to the memory hierarchy, the three levels being outer, middle, and inner blocking; for the outer blocking, dividing the output matrix C into MB × NB macro blocks adapted to the L3 cache, wherein MB is the row block size and NB is the column block size; for the middle blocking, performing IB × KB blocking within each macro block adapted to the L2 cache, wherein IB is the row block size and KB is the column block size; and for the inner blocking, performing register-level blocking within each cache block, using RVV vector registers to process 32 elements at a time.
- 5. The matrix multiplication optimization method for the RISC-V architecture according to claim 1, wherein performing RVV vectorized core computation using the RISC-V vector instruction extension to realize data-level parallel optimization comprises: dynamically determining the vector length with an assembly instruction to complete the vector configuration; loading matrix block data from memory into vector registers in batches with load instructions; performing vector fused multiply-add operations with the multiply-add instruction to obtain the computation results; and writing the computation results back to memory with store instructions to complete the output of the computation block.
- 6. The matrix multiplication optimization method for the RISC-V architecture according to claim 1, wherein hiding instruction latency through loop unrolling and software pipelining and adopting a data prefetching mechanism to realize instruction-level parallel optimization comprises: unrolling the loop body multiple times to reduce loop-control overhead; interleaving load, compute, and store instructions to form a pipeline that hides execution latency; and loading data needed by subsequent computation from main memory into the cache in advance with data prefetch instructions.
- 7. The matrix multiplication optimization method for the RISC-V architecture according to claim 1, wherein, based on the OpenMP thread count, performing multi-core parallel computation using OpenMP multithreading to realize thread-level parallel optimization comprises: adding a compiler preprocessing directive before the main loop according to the OpenMP thread count, so that the multi-level block computation of the matrix multiplication is automatically parallelized.
- 8. The matrix multiplication optimization method for the RISC-V architecture according to claim 2, wherein computing the remaining elements using vector registers to realize tail-processing optimization comprises: dynamically determining the data processed by the main vector loop according to the number of floating-point numbers a single vector can process; and vectorizing the computation of the remaining elements using RVV features, dynamically adjusting the number of elements actually processed.
- 9. The matrix multiplication optimization method for the RISC-V architecture according to claim 8, wherein, when vectorizing the remaining elements using RVV features, the number of elements actually processed is N % vector_capacity, where N is the matrix dimension and vector_capacity is the number of floating-point numbers a single vector can process.
- 10. A matrix multiplication optimization system for the RISC-V architecture, for implementing the method of any one of claims 1-9, the system comprising: a detection and configuration module for detecting the hardware characteristics of the target RISC-V platform and adaptively configuring optimization parameters according to the detection results, wherein the optimization parameters comprise blocking parameters and an OpenMP thread count; a memory optimization module for performing memory layout and data rearrangement on the input matrix B to realize memory optimization; a multi-level block computation module for adapting the memory hierarchy with a multi-level blocking strategy based on the blocking parameters and the memory optimization result to realize multi-level block computation; a data-level parallel optimization module for performing RVV vectorized core computation using the RISC-V vector instruction extension to realize data-level parallel optimization; an instruction-level parallel optimization module for hiding instruction latency through loop unrolling and software pipelining and adopting a data prefetching mechanism to realize instruction-level parallel optimization; a thread-level parallel optimization module for performing multi-core parallel computation using OpenMP multithreading based on the OpenMP thread count to realize thread-level parallel optimization; and a tail optimization module for computing the remaining elements using vector registers to realize tail-processing optimization.
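The detection-and-configuration step of claim 2 can be illustrated with a minimal C sketch. The VLEN query is guarded: on non-RISC-V builds it falls back to an assumed 128-bit vector register, and `_SC_LEVEL1_DCACHE_SIZE` is a glibc-specific `sysconf` key used here for illustration, so the constants are assumptions rather than the patent's exact mechanism.

```c
#include <unistd.h>
#if defined(__riscv_v_intrinsic)
#include <riscv_vector.h>
#endif

/* Query the vector register length in bits; assumed 128-bit fallback
 * when not building for RISC-V with the vector extension. */
static long query_vlen_bits(void) {
#if defined(__riscv_v_intrinsic)
    return (long)__riscv_vlenb() * 8;  /* vlenb = vector bytes per register */
#else
    return 128;                        /* assumption: typical minimum VLEN */
#endif
}

typedef struct {
    long vlen_bits;      /* vector register length in bits               */
    long floats_per_vec; /* fp32 lanes a single vector can hold          */
    long l1d_bytes;      /* L1 data cache size, drives blocking params   */
    long num_threads;    /* OpenMP thread count = number of CPU cores    */
} gemm_config;

gemm_config detect_and_configure(void) {
    gemm_config cfg;
    cfg.vlen_bits      = query_vlen_bits();
    cfg.floats_per_vec = cfg.vlen_bits / 32;        /* 32-bit floats      */
    long l1 = sysconf(_SC_LEVEL1_DCACHE_SIZE);      /* glibc extension    */
    cfg.l1d_bytes   = (l1 > 0) ? l1 : 32 * 1024;    /* 32 KiB fallback    */
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    cfg.num_threads = (n > 0) ? n : 1;
    return cfg;
}
```

The blocking parameters of claim 4 would then be derived from `l1d_bytes` (and the corresponding L2/L3 queries), and `num_threads` feeds the OpenMP configuration of claim 7.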
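The memory optimization of claim 3 (transposing B so the kernel reads it row-major, and aligning the buffer to the vector length) can be sketched as follows; `alloc_aligned` and `transpose` are illustrative helper names, not from the patent.

```c
#include <stdlib.h>

/* Allocate an n-float buffer aligned to `align` bytes (e.g. VLEN/8).
 * aligned_alloc (C11) requires the size to be a multiple of the
 * alignment, so the byte count is rounded up. */
float *alloc_aligned(size_t n, size_t align) {
    size_t bytes = ((n * sizeof(float) + align - 1) / align) * align;
    return (float *)aligned_alloc(align, bytes);
}

/* Bt[j*K + k] = B[k*N + j]: what was a strided column-major walk over
 * B becomes a contiguous row-major walk over Bt in the GEMM kernel. */
void transpose(const float *B, float *Bt, size_t K, size_t N) {
    for (size_t k = 0; k < K; ++k)
        for (size_t j = 0; j < N; ++j)
            Bt[j * K + k] = B[k * N + j];
}
```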
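The multi-level blocking of claim 4 reduces, in portable C, to a loop nest over cache-sized tiles. The block sizes below are placeholders for the cache-derived parameters of claim 2, and the innermost i/j/k loops stand in for the RVV register-level micro-kernel; this is a sketch of the tiling structure, not the patent's tuned kernel.

```c
#include <stddef.h>

enum { MB = 64, NB = 64, KB = 64 };   /* placeholder block parameters */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* C[M x N] += A[M x K] * B[K x N], all row-major. The three outer
 * loops walk macro blocks (outer/middle levels); the inner loops are
 * where register-level blocking with RVV vectors would go. */
void sgemm_blocked(size_t M, size_t N, size_t K,
                   const float *A, const float *B, float *C) {
    for (size_t i0 = 0; i0 < M; i0 += MB)
        for (size_t j0 = 0; j0 < N; j0 += NB)
            for (size_t k0 = 0; k0 < K; k0 += KB)
                for (size_t i = i0; i < min_sz(i0 + MB, M); ++i)
                    for (size_t j = j0; j < min_sz(j0 + NB, N); ++j) {
                        float acc = C[i * N + j];
                        for (size_t k = k0; k < min_sz(k0 + KB, K); ++k)
                            acc += A[i * K + k] * B[k * N + j];
                        C[i * N + j] = acc;
                    }
}
```

Because each block fits a cache level, the same A and B tiles are reused many times before eviction, which is the source of the cache hit rate improvement the patent claims.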
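The dynamic vector-length pattern behind claims 5, 8, and 9 can be emulated in portable C: each loop trip processes `vl = min(remaining, capacity)` elements, so the `N % vector_capacity` tail is absorbed by the same loop instead of a scalar cleanup path. A real kernel would use `vsetvl`/`vle32`/`vfmacc`/`vse32`; the fixed capacity here is an assumption standing in for the hardware query.

```c
#include <stddef.h>

enum { VECTOR_CAPACITY = 4 };   /* assumption: fp32 lanes per vector */

/* y[i] += a * x[i], strip-mined the way an RVV vsetvl loop would be. */
void axpy_stripmined(size_t n, float a, const float *x, float *y) {
    for (size_t i = 0; i < n; ) {
        /* Emulated vsetvl: on the last trip, vl is exactly
         * n % VECTOR_CAPACITY (when that remainder is nonzero). */
        size_t vl = (n - i < VECTOR_CAPACITY) ? (n - i) : VECTOR_CAPACITY;
        for (size_t l = 0; l < vl; ++l)      /* stands in for vfmacc */
            y[i + l] += a * x[i + l];
        i += vl;
    }
}
```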
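Claim 6's loop unrolling and prefetching can be sketched on a dot product. The 4x unroll factor and prefetch distance are illustrative choices, and `__builtin_prefetch` is a GCC/Clang builtin standing in for the RISC-V prefetch mechanism; the four independent accumulators are what lets the pipeline hide the multiply-add latency.

```c
#include <stddef.h>

float dot_unrolled(size_t n, const float *a, const float *b) {
    float s0 = 0, s1 = 0, s2 = 0, s3 = 0;   /* independent accumulators */
    size_t k = 0;
    for (; k + 4 <= n; k += 4) {
        __builtin_prefetch(&a[k + 16]);      /* pull future data toward  */
        __builtin_prefetch(&b[k + 16]);      /* the cache ahead of use   */
        s0 += a[k]     * b[k];               /* four independent FMAs:   */
        s1 += a[k + 1] * b[k + 1];           /* no serial dependency, so */
        s2 += a[k + 2] * b[k + 2];           /* they overlap in flight   */
        s3 += a[k + 3] * b[k + 3];
    }
    for (; k < n; ++k)                       /* scalar tail */
        s0 += a[k] * b[k];
    return (s0 + s1) + (s2 + s3);
}
```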
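Claim 7's thread-level parallelization amounts to a single OpenMP directive before the outer loop. A minimal sketch on a row-wise kernel follows; when compiled without OpenMP support the pragma is simply ignored and the code runs serially, and the function name is illustrative.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* C[i][j] *= alpha, with whole rows distributed across cores; the
 * thread count comes from the detected number of CPU cores (claim 2). */
void scale_rows(size_t M, size_t N, float *C, float alpha, int num_threads) {
#ifdef _OPENMP
    omp_set_num_threads(num_threads);
#else
    (void)num_threads;                 /* serial fallback */
#endif
    #pragma omp parallel for
    for (size_t i = 0; i < M; ++i)
        for (size_t j = 0; j < N; ++j)
            C[i * N + j] *= alpha;
}
```

In the patent's setting the parallelized loop would be the outer macro-block loop of the blocked GEMM, so each thread owns disjoint blocks of C and no synchronization is needed inside the loop body.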
Description
Matrix multiplication optimization method and system for RISC-V architecture
Technical Field
The invention belongs to the technical field of artificial intelligence and software-hardware co-optimization, and particularly relates to a matrix multiplication optimization method and system for the RISC-V architecture.
Background
With the rapid development of artificial intelligence, big data analysis, and scientific computing, matrix multiplication, as the most fundamental and computationally intensive core algorithm, plays a vital role in key applications such as deep learning inference, image processing, and signal analysis. In particular, driven by edge computing and the trend toward intelligent Internet-of-Things devices, more and more intelligent applications need to execute neural network inference and complex numerical computation in real time on resource-constrained embedded terminal devices. The RISC-V (Reduced Instruction Set Computing-V, fifth-generation reduced instruction set) architecture is becoming the mainstream choice for edge smart chips by virtue of its open-source, customizable, and low-power characteristics. For example, in edge computing scenarios such as intelligent security, autonomous driving, and the industrial Internet of Things, terminal devices with RISC-V processors must efficiently execute convolutional neural network computation based on matrix multiplication to realize intelligent functions such as face recognition, object detection, and voice wake-up. However, because edge terminal devices are severely limited in computational power, memory bandwidth, and power consumption, directly deploying high-complexity matrix computing libraries designed for the cloud faces serious challenges.
General matrix multiplication (GEneral Matrix to Matrix Multiplication, GEMM) is the core operator with the largest computational load in neural networks, and its execution efficiency directly determines the performance and energy efficiency of intelligent applications. Although the RISC-V instruction set provides the vector extension (RVV) to support data-parallel computing, the matrix multiplication implementations in existing general-purpose mathematical libraries fail to fully exploit its hardware potential, resulting in significant performance bottlenecks when performing large-scale matrix operations on RISC-V platforms. With the rapid adoption of the RISC-V ecosystem in the field of edge AI, in-depth research on matrix multiplication optimization for the RISC-V architecture is of vital importance for advancing the practical deployment of edge intelligence. The basic idea of existing matrix multiplication for RISC-V is as follows: for input matrices A and B, the high-dimensional matrix computation is first decomposed into blocks that can be processed in parallel; a multi-level parallelization strategy is then designed to compute the matrix blocks efficiently; and finally the results of all blocks are accumulated into the output matrix, with the optimization target of achieving the highest computational efficiency while meeting the numerical precision requirements. For example, Wang Yumu et al. (Wang Yumu, Pan Zhiming, Wu Pengfei, et al. YOLOv transplantation and optimization based on the RISC-V vector instruction set [J]. SCM & Embedded Systems Applications, 2021, 21(12): 20-25+30) propose methods for optimizing matrix multiplication using the V (vector) extension of the RISC-V instruction set, employing both embedded handwritten assembly and inline assembly intrinsics, together with data address alignment.
For another example, matrix multiplication is optimized through techniques such as program performance profiling, vectorization, memory access optimization, and loop unrolling in the literature (Xu Xuezheng, Huang Anwen, et al. Object detection algorithm optimization for the RISC-V architecture [J]. Intelligent Security, 2024, 3(03): 21-33). However, when performing matrix multiplication optimization on the RISC-V architecture, most existing methods are directly ported from ARM or x86 platforms, ignoring the intrinsic characteristics of the RISC-V instruction set architecture and the deep co-optimization potential in convolution computation. They have the following limitations: (1) the dynamic configuration capability of RISC-V vector registers is ignored, so tail processing is inefficient; (2) the memory hierarchy is not adapted, so the cache hit rate is insufficient; and (3) multi-core co-optimization is lacking, so the hardware potential cannot be fully exploited. Based on these problems, the existing matrix multiplication optimization method for