CN-121997993-A - Inference acceleration method, device, medium and product for neural network model
Abstract
The application provides an inference acceleration method, device, medium and product for a neural network model. The method comprises: before the neural network model is deployed, identifying an original weight matrix of a target network layer to be accelerated, wherein the original weight matrix is n×m in size; decomposing the original weight matrix into a first low-rank matrix of size n×k and a second low-rank matrix of size k×m using a low-rank decomposition algorithm, wherein k is smaller than the minimum of n and m; and, in the inference stage of the neural network model, performing matrix multiplication of the input data with the first low-rank matrix to obtain an intermediate result, and performing matrix multiplication of the intermediate result with the second low-rank matrix to obtain the final output result.
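For a rough sense of the arithmetic saving, consider the following sketch; the dimensions are hypothetical and chosen only for illustration, not taken from the patent. Replacing one n×m multiplication by the two smaller multiplications reduces the per-row multiply-accumulate count from n·m to n·k + k·m, a net saving whenever k is small relative to n and m.

```python
# Hypothetical sizes chosen only for illustration; the patent does not fix them.
n, m, k = 4096, 4096, 256

original_macs = n * m          # one x @ W multiply per input row
low_rank_macs = n * k + k * m  # x @ A, then (x @ A) @ B

print(original_macs)   # 16777216
print(low_rank_macs)   # 2097152 (an 8x reduction for these sizes)
```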
Inventors
- GAO WEITE
Assignees
- 墨芯人工智能科技(深圳)有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20260330
Claims (10)
- 1. An inference acceleration method for a neural network model, comprising: before the neural network model is deployed, identifying an original weight matrix of a target network layer to be accelerated in the neural network model, wherein the original weight matrix has a size of n×m, n and m are positive integers, and n and m respectively represent the number of rows and the number of columns of the original weight matrix; decomposing the original weight matrix into a first low-rank matrix and a second low-rank matrix using a low-rank decomposition algorithm, wherein the first low-rank matrix has a size of n×k, the second low-rank matrix has a size of k×m, k is a positive integer, n and k respectively represent the number of rows and columns of the first low-rank matrix, k and m respectively represent the number of rows and columns of the second low-rank matrix, and k is smaller than the minimum of n and m; and, in the inference phase of the neural network model, performing the following operations on input data: performing a matrix multiplication of the input data with the first low-rank matrix to obtain an intermediate result, and performing a matrix multiplication of the intermediate result with the second low-rank matrix to obtain a final output result.
- 2. The method of claim 1, wherein the target network layer comprises a feed-forward network fully-connected layer or an attention output projection layer in a transformer model.
- 3. The method of claim 1, wherein the low-rank decomposition algorithm comprises a singular value decomposition truncation algorithm, a structured pruning and reconstruction algorithm, or a training-based adapter fine-tuning algorithm.
- 4. The method of claim 1, further comprising, prior to decomposing the original weight matrix into the first low-rank matrix and the second low-rank matrix using the low-rank decomposition algorithm: determining the value of k based on the computational and memory characteristics of the target hardware platform on which the neural network model is deployed.
- 5. An inference acceleration apparatus for a neural network model, comprising: a decomposition preprocessing module configured to: identify an original weight matrix of a target network layer to be accelerated in the neural network model before the neural network model is deployed, wherein the original weight matrix has a size of n×m, and n and m are positive integers respectively representing the number of rows and the number of columns of the original weight matrix; and decompose the original weight matrix into a first low-rank matrix and a second low-rank matrix using a low-rank decomposition algorithm, wherein the first low-rank matrix has a size of n×k, the second low-rank matrix has a size of k×m, k is a positive integer, n and k respectively represent the number of rows and columns of the first low-rank matrix, k and m respectively represent the number of rows and columns of the second low-rank matrix, and k is smaller than the minimum of n and m; and a matrix computation engine configured to perform, in an inference phase of the neural network model, the following operations on input data: performing a matrix multiplication of the input data with the first low-rank matrix to obtain an intermediate result, and performing a matrix multiplication of the intermediate result with the second low-rank matrix to obtain a final output result.
- 6. The apparatus of claim 5, wherein the target network layer comprises a feed-forward network fully-connected layer or an attention output projection layer in a transformer model.
- 7. The apparatus of claim 5, wherein the low-rank decomposition algorithm comprises a singular value decomposition truncation algorithm, a structured pruning and reconstruction algorithm, or a training-based adapter fine-tuning algorithm.
- 8. The apparatus of claim 5, wherein the decomposition preprocessing module is further configured to: determine the value of k based on the computational and memory characteristics of the target hardware platform on which the neural network model is deployed.
- 9. A non-transitory computer-readable medium storing instructions which, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-4.
- 10. A computer program product comprising instructions which, when executed by one or more processors, cause the one or more processors to perform the method of any of claims 1-4.
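The singular value decomposition truncation option named in claims 3 and 7 can be sketched as follows in NumPy; the function and variable names are illustrative, not from the patent. The original weight matrix W (n×m) is factored as U·S·Vᵀ, only the k largest singular values are kept, and the factors are regrouped into the first low-rank matrix A (n×k) and the second low-rank matrix B (k×m).

```python
import numpy as np

def svd_truncate(W: np.ndarray, k: int):
    """Split W (n x m) into A (n x k) and B (k x m) so that A @ B ~= W.

    Illustrative sketch of the SVD truncation option; folding the singular
    values into A is one common convention, not something the claims fix.
    """
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :k] * S[:k]   # n x k: left singular vectors scaled by singular values
    B = Vt[:k, :]          # k x m: top-k right singular vectors
    return A, B
```

Applied before deployment, A and B replace W in the target layer; the approximation error is governed by the discarded singular values.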
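Claims 4 and 8 leave the choice of k to the target hardware platform's compute and memory characteristics. One plausible heuristic, shown purely as an assumption (the tile size, the energy threshold, and the name pick_rank are not from the patent), is to keep enough singular values to retain a given fraction of the spectral energy and then round k up to the hardware's matrix-tile granularity:

```python
import numpy as np

def pick_rank(W: np.ndarray, energy: float = 0.95, tile: int = 64) -> int:
    """Smallest k retaining `energy` of the squared singular values,
    rounded up to the hardware tile size. Hypothetical policy; the claims
    only say k depends on the target platform's compute/memory traits."""
    s = np.linalg.svd(W, compute_uv=False)
    cumulative = np.cumsum(s ** 2) / np.sum(s ** 2)
    k = int(np.searchsorted(cumulative, energy)) + 1
    return min(((k + tile - 1) // tile) * tile, min(W.shape))
```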
Description
Inference acceleration method, device, medium and product for neural network model
Technical Field
The present disclosure relates to inference acceleration methods, apparatus, non-transitory computer-readable media, and computer program products for neural network models.
Background
With the rapid development of artificial intelligence technology, the scale of neural network models is continually expanding. The core computing operation of a neural network model, namely large-scale matrix multiplication (such as matrix multiplication in a fully-connected layer or in a projection layer of an attention mechanism), has become the computational bottleneck limiting model inference speed and deployment cost, and it consumes a large amount of compute and memory bandwidth. In particular, how to reduce computation latency and improve throughput in the model inference stage is a technical problem of continuing concern to the industry.
Disclosure of Invention
In one aspect, the application discloses an inference acceleration method for a neural network model, which comprises: before the neural network model is deployed, identifying an original weight matrix of a target network layer to be accelerated in the neural network model, wherein the original weight matrix has a size of n×m, and n and m are positive integers respectively representing the number of rows and the number of columns of the original weight matrix; decomposing the original weight matrix into a first low-rank matrix and a second low-rank matrix using a low-rank decomposition algorithm, wherein the first low-rank matrix has a size of n×k, the second low-rank matrix has a size of k×m, k is a positive integer, n and k respectively represent the number of rows and columns of the first low-rank matrix, k and m respectively represent the number of rows and columns of the second low-rank matrix, and k is smaller than the minimum of n and m; and, in the inference stage of the neural network model, performing matrix multiplication of the input data with the first low-rank matrix to obtain an intermediate result, and performing matrix multiplication of the intermediate result with the second low-rank matrix to obtain a final output result.
In one aspect, the application discloses an inference acceleration device for a neural network model, which comprises a decomposition preprocessing module, a matrix computation engine and an inference result processing module. The decomposition preprocessing module is used for identifying an original weight matrix of a target network layer to be accelerated in the neural network model before the neural network model is deployed, wherein the original weight matrix has a size of n×m, and n and m are positive integers respectively representing the number of rows and the number of columns of the original weight matrix, and for decomposing the original weight matrix into a first low-rank matrix and a second low-rank matrix using a low-rank decomposition algorithm, wherein the first low-rank matrix has a size of n×k, the second low-rank matrix has a size of k×m, k is a positive integer, n and k respectively represent the number of rows and columns of the first low-rank matrix, k and m respectively represent the number of rows and columns of the second low-rank matrix, and k is smaller than the minimum of n and m. The matrix computation engine is used for performing, in the inference stage of the neural network model, the following operations on input data: performing matrix multiplication of the input data with the first low-rank matrix to obtain an intermediate result, and performing matrix multiplication of the intermediate result with the second low-rank matrix to obtain a final output result.
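As a minimal sketch of the inference-stage computation described above (NumPy, with illustrative names; the patent does not prescribe a framework, data layout, or batch convention), the single multiplication by the original n×m weight matrix is replaced by two smaller multiplications:

```python
import numpy as np

def low_rank_forward(x: np.ndarray, A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Two-step replacement for x @ W, where W (n x m) ~= A (n x k) @ B (k x m).

    x is a batch of input rows with shape (batch, n).
    """
    intermediate = x @ A     # (batch, k): multiply input by the first low-rank matrix
    return intermediate @ B  # (batch, m): multiply the intermediate result by the second
```

Because matrix multiplication is associative, (x @ A) @ B equals x @ (A @ B), so the two-step form reproduces the output of the decomposed weights while reducing the per-row multiply-accumulate count from n·m to n·k + k·m.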
In other aspects, a non-transitory computer-readable medium storing instructions and a computer program product comprising instructions are disclosed, which, when executed by one or more processors, cause the processors to perform the methods described herein.
Drawings
Aspects of the disclosure are best understood from the following detailed description when read in conjunction with the accompanying drawing figures. It is noted that, in accordance with standard practice in the industry, the various features are not drawn to scale; in fact, the dimensions of the various features may be arbitrarily increased or reduced for clarity of discussion.
Fig. 1 shows a schematic diagram of an inference acceleration method for a neural network model according to an embodiment of the present application.
Fig. 2 shows a schematic diagram of an inference acceleration apparatus for a neural network model according to an embodiment of the application.
Fig. 3 illustrates a schematic diagram of a computing device in which embodiments according to the application may be implemented.
Detailed Description
The following disclosure provides many different embodiments, or examples, for implementing different features of the provided subject matter. Specific examples of components and arrangements are described below to simplify the present disclosure. Of course, these are merely examples and are not limiting. As described above, the core computing operation of the neural network model, i.e. large-scale matrix multiplication, has become a computing bottleneck that