CN-115795251-B - Matrix vector multiplication implementation method after QR decomposition based on vector processor

CN115795251BCN 115795251 BCN115795251 BCN 115795251BCN-115795251-B

Abstract

The invention relates to a vector processor-based matrix vector multiplication implementation method after QR decomposition, wherein the QR decomposition algorithm is vsip _ cqrd _f function in VSIP library, VSIP library is vector, signal and image processing library pushed out by a GE intelligent platform, and is an open standard application programming interface provided for developers needing to use enhanced signal and image processing. The technical scheme of the invention mainly comprises the following steps of judging whether an input matrix C needs transposed conjugate processing and whether the matrix C is on the left or the right of matrix multiplication, so as to enter different branch processing, preprocessing the matrix A, and adopting different processing schemes according to the number and the size to realize matrix and vector multiplication operation. Aiming at the advantages of large storage space, high-speed cache in a chip and high-efficiency and rapid data transmission of a vector processor, the invention designs and realizes matrix vector multiplication aiming at QR decomposition, and can play a great advantage in complex matrix operation.

Inventors

Mo Shangfeng
YUAN YUDI
HU YONGHUA

Assignees

湖南科技大学

Dates

Publication Date: 20260512
Application Date: 20221218

Claims (1)

1. A matrix vector multiplication implementation method after QR decomposition based on a vector processor is characterized by comprising the following steps, Step 1, judging an input matrix Whether or not transposed conjugation and matrix are required Whether to the left or right of the matrix multiplication to enter different sub-function processes, the process flow of each sub-function comprises the following common steps; Step 2, preprocessing the A matrix, setting the element on the diagonal line of the A matrix to be 1, setting the element on the upper right part of the diagonal line to be 0, and setting the element on the lower left part of the diagonal line to be unchanged, partitioning the A matrix according to columns after transformation, and naming each column as a vector respectively according to the sequence 、 Vector quantity Contains the values of column 1 of matrix A, the values of the components are named 、 ; Step 3, matrix Vector of AND Matrix multiplication is performed, and the whole C matrix and vector are transmitted through DMA Transmitted to cache, vector The method comprises the steps of respectively taking P numbers from each row of a matrix C to be transmitted into a vector operation part, simultaneously executing a multiplication instruction, transmitting multiplication results into a temporary vector register TR, taking the P+12P data, carrying out the same calculation, accumulating corresponding values stored in the matrix C with the TR, transmitting the values of the TR into a vector element accumulator VEA until one row of the matrix C is calculated, carrying out accumulation operation on the VEA, multiplying the values by beta, storing the values into a DDR designated space, carrying out second row calculation until all the rows are calculated, and obtaining a vector with the length m stored in the DDR ; The specific mode of the step 3 comprises the following steps, Step 3.1, judging how to process according to the size of the original data, when the original data is extremely small, entering step 3.2, when the size of the original data does not exceed the size of the cache, entering step 3.3, and when the size of the original data exceeds the size of the cache, entering step 3.4; step 3.2, when the data volume is extremely small, scalar processing is directly performed in DDR; Step 3.3, when the data amount does not exceed the cache size, the whole C matrix and vector are transferred through DMA Transmitted to cache, vector The method comprises the steps of respectively taking P numbers from each row of a matrix C to be transmitted into a vector operation part, simultaneously executing a multiplication instruction, transmitting multiplication results into a temporary vector register TR, taking the P+12P data, carrying out the same calculation, accumulating corresponding values stored in the matrix C with the TR, transmitting the values of the TR into a vector element accumulator VEA until one row of the matrix C is calculated, carrying out accumulation operation on the VEA, multiplying the values by beta, storing the values into a DDR designated space, carrying out second row calculation until all the rows are calculated, and obtaining a vector with the length m stored in the DDR ; Step 3.4, dividing the cache into three parts when the data size exceeds the cache size, wherein the first part is used for storing vectors The rest space is divided into an upper half area and a lower half area, the content of the matrix C is transmitted into the upper half area according to the line through DMA transmission, after the data transmission is completed, the rest line of the matrix C is continuously transmitted to the lower half area, the upper half area starts to calculate, the calculation process is parallel to the calculation, after the calculation of the upper half area and the transmission of the lower half area are completed, the rest content of the matrix C is continuously transmitted to the upper half area, the calculation of the lower half area is started, and the steps are repeated until all the calculation is completed.

Description

Matrix vector multiplication implementation method after QR decomposition based on vector processor Technical Field The invention relates to high-performance digital signal Processor (DIGITAL SIGNAL Processor, DSP) chip code optimization, in particular to a vector Processor-based matrix multiplication implementation method after QR decomposition. Background The VSIP library is a vector, signal and image processing library which is pushed out by the GE intelligent platform, is an open standard application programming interface which is specially provided for developers needing to use enhanced signals and image processing, loses excellent hardware performance of a vector processor if original functions in the VSIP library are directly called, wants to exert the hardware advantages of a high-performance processor, improves the parallelism and efficiency of an algorithm, and adapts different high-performance processors after the algorithm in the VSIP library is optimized, so that the method is a good choice. In modern processors, the VLIW architecture with vector processing units is becoming a typical structure of high-performance digital signal processors, and domestic high-performance DSPs are rapidly developing, so that many high-performance processors produced by different companies are appeared, and the processors are usually characterized by rich register resources and a large number of execution units. Vector processors typically consist of multiple processing units that support vector-based data loading, computation, and storage. Each processing unit comprises a plurality of independent multifunctional units, generally comprising an addition unit, a multiplication unit, a shift unit, a comparison unit, a vector element accumulator, and a scalar vector conversion unit. Vector processors generally support single instruction multiple data stream SIMD operations, i.e., under the control of the same vector instruction, all processing units simultaneously perform the same operation on corresponding local registers, thereby realizing data level parallelism of development applications. In order to accelerate the data access speed, the inside or outside of the vector processor is usually provided with a Cache, so that the performance advantage of DSP vector processing hardware is effectively exerted, and an adaptive efficient matrix multiplication optimization algorithm is needed to be generated. Matrix multiplication is a key part frequently used in many scientific applications, and has the nature of huge and dense computation, so that the research of matrix and vector multiplication parallel algorithms in the field of high-performance computation is always a durable topic. Furthermore, matrix multiplication is often used to evaluate the performance and efficiency of new processor architectures, and to explore how to perform high performance optimizations on new architectures. The patent application of application publication No. CN 103294648A, application publication No. 2013, 9, 11 discloses a block matrix multiplication vectorization method supporting a vector processor with multiple MAC operation components. According to the method, each element of each row of the multiplied submatrix A is expanded into vector data and one row of data of the matrix B for accumulation multiplication, the scheme is different from the implementation method of the scheme, and the scheme does not realize matrix vector multiplication after QR decomposition. The invention patent application of application publication number CN 114090954A, application publication day 2022, 2 and 25 discloses an integer matrix multiplication kernel optimization method based on FT-2000+. According to the method, the design content of the subfunctions is distinguished according to the difference of the data length of the input matrix and the output matrix, a Feiteng2000+ system structure is combined, a register allocation strategy and a matrix partitioning strategy are determined, the optimization of integer matrix multiplication is realized, the scheme is different from the realization method of the scheme, and the matrix vector multiplication after QR decomposition is not realized in the scheme. The invention relates to a matrix multiplication implementation method after QR decomposition based on a vector processor, wherein the QR decomposition algorithm is vsip _ cqrd _f function in VSIP library functions, and after cqrd functions are called, an output QRD structure body of the function is taken out, wherein the structure body mainly comprises a complex matrix A, a complex vector cI and a real vector beta. The upper triangle part in the complex matrix A is an R matrix, the lower triangle part is a Q matrix, and the data are used as the input of the algorithm, so that matrix vector multiplication operation with the Q matrix in QR decomposition under four different conditions is realized. Disclosure of Invention The invention provides a matrix multip