CN-122019944-A - Method and device for calculating general matrix vector multiplication

CN 122019944 A

Abstract

A method and apparatus for computing a general matrix-vector multiplication are provided. The method includes distributing an aligned portion of a weight matrix to each in-memory computing execution unit of an in-memory computing apparatus according to a first granularity and controlling the in-memory computing apparatus to perform a multiplication of an input vector with the aligned portion, and distributing a non-aligned portion of the weight matrix to each in-memory computing execution unit of the in-memory computing apparatus according to a second granularity and controlling the in-memory computing apparatus to perform a multiplication of the input vector with the non-aligned portion, wherein a size of the first granularity is larger than a size of the second granularity.
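As a rough illustration of the flow the abstract describes, the following sketch splits the weight matrix into a device-aligned part and a non-aligned remainder and computes the GEMV in two passes. This is a functional model in plain Python, not the PIM hardware path; the row-alignment unit k = 256 is taken from the later claims, and the row-wise split of the matrix is an assumption for illustration.

```python
def split_weight(w, k_unit=256):
    """Split W (a list of K rows, each of length N) into a row-aligned part
    and the non-aligned remainder (the aligned/non-aligned parts of claim 1)."""
    aligned_rows = (len(w) // k_unit) * k_unit
    return w[:aligned_rows], w[aligned_rows:]

def gemv(x, w):
    """Plain y = x @ W for a [1, K] vector and a [K, N] matrix (as lists)."""
    n = len(w[0]) if w else 0
    return [sum(x[i] * w[i][j] for i in range(len(w))) for j in range(n)]

def gemv_two_pass(x, w, k_unit=256):
    """y = x @ W computed as an aligned-part pass plus a non-aligned pass."""
    w_aligned, w_rest = split_weight(w, k_unit)
    r = len(w_aligned)
    y = gemv(x[:r], w_aligned)          # pass 1: coarse (first) granularity
    if w_rest:                          # pass 2: fine (second) granularity
        y_rest = gemv(x[r:], w_rest)
        y = [a + b for a, b in zip(y, y_rest)] if y else y_rest
    return y
```

Because a dot product over K rows can be split into a sum of partial dot products, the two passes together reproduce the full GEMV result.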

Inventors

  • PEI YUAN
  • AN YUXIN
  • LU KEJIA
  • JIN FENGJUN
  • SUN GANG

Assignees

  • Samsung (China) Semiconductor Co., Ltd.
  • Samsung Electronics Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2026-01-27

Claims (10)

  1. A method of computing a general matrix-vector multiplication, comprising: distributing the aligned part of a weight matrix to each in-memory computing execution unit of an in-memory computing device according to a first granularity, and controlling the in-memory computing device to perform multiplication of an input vector with the aligned part; and distributing the non-aligned part of the weight matrix to each in-memory computing execution unit of the in-memory computing device according to a second granularity, and controlling the in-memory computing device to perform multiplication of the input vector with the non-aligned part, wherein the size of the input vector is [1, K] and the size of the weight matrix is [K, N], wherein K = k·m, where m is a positive integer, wherein the portion corresponding to the row interval […] is the non-aligned part of the weight matrix and the remainder of the weight matrix is the aligned part, wherein k and n are values corresponding to the type of the in-memory computing device, and wherein the size of the first granularity is greater than the size of the second granularity.
  2. The method of claim 1, wherein the in-memory computing device is an HBM2-PIM device, wherein k = 256 and n = 4096.
  3. The method of claim 2, wherein the size of the first granularity is [128, 8] and the size of the second granularity is [128, 1].
  4. The method according to claim 3, wherein distributing the non-aligned part of the weight matrix to each in-memory computing execution unit of the in-memory computing device according to the second granularity comprises: preferentially distributing the non-aligned part of the weight matrix over the banks of all in-memory computing execution units in blocks of size [128, 1], such that consecutive data blocks within each bank are guaranteed to be ordered and the bank of each in-memory computing execution unit is assigned weight data of size at most [128, ceil(j/512)], where ceil indicates a round-up function.
  5. The method of claim 1, further comprising: generating a first in-memory computing instruction for the aligned part and generating a second in-memory computing instruction for the non-aligned part, wherein performing the multiplication of the input vector with the aligned part comprises: triggering the first in-memory computing instruction to control the in-memory computing device to perform the multiplication of the input vector with the aligned part, and performing the multiplication of the input vector with the non-aligned part comprises: triggering the second in-memory computing instruction to control the in-memory computing device to perform the multiplication of the input vector with the non-aligned part.
  6. The method according to claim 3, wherein each in-memory computing execution unit comprises 8 first general vector registers for storing input vectors and 8 second general vector registers for storing multiply-accumulate results, wherein for the aligned part each in-memory computing execution unit uses 8 second general vector registers, and for the non-aligned part each in-memory computing execution unit uses at most ceil(j/512) second general vector registers, where ceil indicates a round-up function.
  7. The method of claim 5, wherein performing the multiplication of the input vector with the non-aligned part comprises: writing the multiply-accumulate results cached in the second general vector registers to the bank, wherein the method further comprises: performing a reduction summation over the column units in the bank according to the number of second general vector registers used by each in-memory computing execution unit, to obtain the result of the multiplication of the input vector with the non-aligned part.
  8. A computing device for general matrix-vector multiplication, comprising: a first distribution unit configured to distribute the aligned part of a weight matrix to each in-memory computing execution unit of an in-memory computing device according to a first granularity; a first control unit configured to control the in-memory computing device to perform multiplication of an input vector with the aligned part; a second distribution unit configured to distribute the non-aligned part of the weight matrix to each in-memory computing execution unit of the in-memory computing device according to a second granularity; and a second control unit configured to control the in-memory computing device to perform multiplication of the input vector with the non-aligned part, wherein the size of the input vector is [1, K] and the size of the weight matrix is [K, N], wherein K = k·m, where m is a positive integer, wherein the portion corresponding to the row interval […] is the non-aligned part of the weight matrix and the remainder of the weight matrix is the aligned part, wherein k and n are values corresponding to the type of the in-memory computing device, and wherein the size of the first granularity is greater than the size of the second granularity.
  9. The device of claim 8, wherein the in-memory computing device is an HBM2-PIM device, wherein k = 256 and n = 4096.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, causes the processor to implement the method of any one of claims 1-7.
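The distribution policy of claim 4 can be sketched as follows. The number of banks (512, chosen to match the divisor in ceil(j/512)) and the round-robin assignment order are assumptions: the claim only fixes the per-bank bound and the ordering property, not a specific schedule.

```python
import math

NUM_BANKS = 512  # assumed bank count; matches the divisor in ceil(j/512)

def distribute_fine_blocks(j, num_banks=NUM_BANKS):
    """Distribute j column blocks of size [128, 1] over the banks so that
    the blocks held by each bank stay in ascending order and no bank
    receives more than ceil(j / num_banks) blocks (claim 4's bound)."""
    banks = [[] for _ in range(num_banks)]
    for col in range(j):
        # round-robin: block indices within each bank remain ascending
        banks[col % num_banks].append(col)
    return banks
```

For example, with j = 1000 every bank receives at most ceil(1000/512) = 2 blocks, and each bank's list of block indices is in order.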

Description

Method and device for calculating general matrix vector multiplication

Technical Field

The present application relates to the technical field of general matrix-vector multiplication (GEMV) computation, and in particular to a GEMV computation method and a GEMV computation device.

Background

With the rapid development of deep learning techniques (e.g., generative artificial intelligence (GenAI)), applications based on modern deep neural networks (DNNs) place higher demands on off-chip memory bandwidth; such applications exhibit memory-intensive characteristics. Processing-in-memory (PIM) technology keeps computing operations on the memory module by embedding a programmable computing unit (PCU) in the memory module, thereby avoiding data transfer with the host (e.g., CPU/GPU), reducing the need for off-chip memory bandwidth, and thus reducing power consumption. The bandwidth on a dynamic random access memory (DRAM) chip can now be increased, for example, through the bank-level parallelism of a PIM device, and computations can be performed using the PCU built into the PIM device to reduce data movement, thereby increasing energy efficiency.

However, in the prior art, when performing GEMV computations with a PIM device, if the input vector and the weight matrix do not meet the alignment requirements of the PIM device, it is necessary first to align the input vector and the weight matrix by zero padding, respectively, and then perform GEMV on the aligned input vector and weight matrix. This causes some of the PIM execution units to perform matrix multiplications on zeros, resulting in low hardware resource utilization.
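The zero-padding overhead described in the background can be quantified with a small sketch. The alignment unit [k, n] = [256, 4096] is taken from the later claims, and counting wasted multiply-accumulates as the fraction of padded zero elements is an illustrative simplification.

```python
import math

def padding_waste(K, N, k=256, n=4096):
    """Fraction of multiply-accumulates spent on padded zeros when a
    [K, N] weight matrix is zero-padded up to the device's [k, n] unit,
    as in the prior-art flow the background criticizes."""
    K_pad = math.ceil(K / k) * k
    N_pad = math.ceil(N / n) * n
    return 1.0 - (K * N) / (K_pad * N_pad)

# e.g. a hypothetical [300, 100] weight matrix is padded up to [512, 4096],
# so nearly all of the work done by the PIM units multiplies zeros:
waste = padding_waste(300, 100)
```

Here `waste` is roughly 0.986, i.e. about 98.6% of the padded matrix consists of zeros, which illustrates the low hardware utilization the patent targets.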
Therefore, how to improve the hardware utilization of in-memory computing devices and reduce the computation delay for non-aligned input vectors and/or weight matrices is a problem addressed by the present invention.

Disclosure of Invention

The present invention is directed to a method for computing a general matrix-vector multiplication and an apparatus for performing the same, which address at least the above-mentioned problems of the related art, although they need not solve any of the problems described above. According to an aspect of an embodiment of the present invention, there is provided a computation method of a general matrix-vector multiplication (GEMV), including: distributing an aligned portion of a weight matrix to each in-memory computing execution unit of an in-memory computing device at a first granularity and controlling the in-memory computing device to perform multiplication of an input vector with the aligned portion; and distributing a non-aligned portion of the weight matrix to each in-memory computing execution unit of the in-memory computing device at a second granularity and controlling the in-memory computing device to perform multiplication of the input vector with the non-aligned portion, wherein the size of the input vector is [1, K] and the size of the weight matrix is [K, N], wherein K = k·m, where m is a positive integer, wherein the portion corresponding to the row interval […] is the non-aligned portion of the weight matrix and the remaining portion is the aligned portion, wherein k and n are values corresponding to the type of the in-memory computing device, and wherein the size of the first granularity is greater than the size of the second granularity.
According to embodiments of the present disclosure, the alignment granularity is reduced when processing a non-aligned GEMV computation, so that less zero-padded data is produced, and thus less invalid data is moved from the host to the in-memory computing device when performing the non-aligned GEMV. In addition, since the zero-padded data is reduced, the amount of invalid computation executed by the PIM execution units can be reduced, so that hardware utilization is improved and computation delay is further reduced.

Optionally, the in-memory computing device is an HBM2-PIM device, wherein k = 256 and n = 4096. Optionally, the size of the first granularity is [128, 8] and the size of the second granularity is [128, 1]. Optionally, distributing the non-aligned portion of the weight matrix to each in-memory computing execution unit of the in-memory computing device at the second granularity comprises preferentially distributing the non-aligned portion of the weight matrix over the banks of all in-memory computing execution units in blocks of size [128, 1], such that consecutive data blocks within each bank are guaranteed to be ordered and the bank of each in-memory computing execution unit is assigned weight data of size at most [128, ceil(j/512)], where ceil indicates a round-up function. According to embodiments of the present disclosure, the computing resources of each PIM execution unit may be fully utilized.
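The claim that a finer granularity pads fewer zeros can be checked with a toy comparison; the 130 x 5 remainder shape below is invented for illustration, while the two granularities [128, 8] and [128, 1] are the ones named in the disclosure.

```python
import math

def padded_elements(rows, cols, tile):
    """Total elements moved when a (rows x cols) block is zero-padded up to
    a whole number of tiles of the given (tile_rows, tile_cols) size."""
    tr, tc = tile
    return math.ceil(rows / tr) * tr * math.ceil(cols / tc) * tc

# A hypothetical non-aligned remainder of 130 x 5 weights:
coarse = padded_elements(130, 5, (128, 8))  # first granularity [128, 8]
fine = padded_elements(130, 5, (128, 1))    # second granularity [128, 1]
# coarse pads the block to 256 x 8 = 2048 elements, fine only to
# 256 x 5 = 1280, so the finer granularity moves and multiplies
# fewer padded zeros on the in-memory computing device.
```

This is exactly the effect the paragraph above describes: reducing the alignment granularity for the non-aligned part shrinks the zero-padded volume, which in turn cuts invalid data movement and invalid MAC operations.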