CN-121301720-B - Prefix-sum computing device, method, chip, board card and electronic device
Abstract
The invention discloses a prefix-sum computing device, a prefix-sum computing method, a chip, a board card and an electronic device. The device comprises a data segmentation unit, a matrix operation unit, a vector operation unit and a data splicing unit. The data segmentation unit groups the vector data to be processed according to a preset segmentation granularity and splices the grouped vectors in sequence to form a matrix to be processed; the matrix operation unit computes the prefix sum of every row of the matrix to be processed in one batch to obtain an output matrix; the vector operation unit adds each row of the output matrix to a corresponding scalar value to obtain a result vector for that row, the corresponding scalar value being the accumulated value of the last elements of all rows preceding that row; and the data splicing unit splices the result vectors in sequence to obtain the prefix-sum vector corresponding to the vector data to be processed. The invention makes full use of both the matrix operation resources and the vector operation resources on the chip, improving the overall efficiency of prefix-sum computation while preserving computational accuracy.
Inventors
- LIU CHAO
- OUYANG PENG
- YU YI
- YANG JIANXUN
- LI XIUDONG
- WANG BO
Assignees
- 北京清微智能科技有限公司 (Beijing Tsingmicro Intelligent Technology Co., Ltd.)
Dates
- Publication Date
- 20260512
- Application Date
- 20251211
Claims (12)
- 1. A prefix-sum computing device, comprising: a data segmentation unit configured to group vector data to be processed according to a preset segmentation granularity and to splice the grouped vectors in sequence to form a matrix to be processed; a matrix operation unit configured to perform a matrix multiplication of the matrix to be processed with an upper triangular matrix to obtain an output matrix, wherein the elements on and above the diagonal of the upper triangular matrix are all 1, and each row of the output matrix is the prefix sum of the corresponding row of the matrix to be processed; a vector operation unit configured to add each row of the output matrix to a corresponding scalar value to obtain a result vector for that row, wherein the corresponding scalar value equals the accumulated value of the last elements of all rows preceding that row; and a data splicing unit configured to splice the result vectors in sequence to obtain the prefix-sum vector corresponding to the vector data to be processed.
- 2. The prefix-sum computing device of claim 1, further comprising: a memory for storing the output matrix; a vector register for storing, during vector computation, the data of the row to be computed in the output matrix; a scalar register for storing, during vector computation, the scalar value corresponding to the row to be computed; and a controller for controlling the vector operation unit, during vector computation, to add the data in the vector register to the data in the scalar register to generate the corresponding result vector.
- 3. The prefix-sum computing device according to claim 2, wherein the controller is specifically configured to read the output matrix from the memory and, for each row of data other than the first row of the output matrix, sequentially load the current row into the vector register, control the vector operation unit to add each element in the vector register to the value stored in the scalar register, namely the last element of the result vector of the previous row, to obtain the result vector of the current row, and write the result vector of the current row into the memory; if the current row is the last row of the output matrix, all vector computation is complete, and otherwise the last element of the result vector of the current row is written to the scalar register for use in the operation on the next row.
- 4. The prefix-sum computing device according to claim 3, wherein the controller is further configured to, for the first row of data in the output matrix, take that row as its result vector and write its last element to the scalar register.
- 5. The prefix-sum computing device according to claim 3, wherein the controller is specifically configured to write the result vector of the current row into the memory in place of the original data of that row in the output matrix.
- 6. The prefix-sum computing device according to any one of claims 3 to 5, wherein the data splicing unit is specifically configured to obtain from the memory the result vector of each row of the output matrix and splice the result vectors in row order to obtain the prefix-sum vector corresponding to the vector data to be processed.
- 7. The prefix-sum computing device according to claim 1, wherein the matrix operation unit employs a systolic array, the weight matrix loaded into the systolic array is the upper triangular matrix, and the systolic array performs the matrix multiplication of the matrix to be processed with the weight matrix to generate the output matrix.
- 8. The prefix-sum computing device of claim 1, wherein the segmentation granularity is set according to the parallel processing scale of the matrix operation unit.
- 9. A prefix-sum computing method, comprising: grouping vector data to be processed according to a preset segmentation granularity through a data segmentation unit, and splicing the grouped vectors in sequence to form a matrix to be processed; performing a matrix multiplication of the matrix to be processed with an upper triangular matrix through a matrix operation unit to obtain an output matrix, wherein the elements on and above the diagonal of the upper triangular matrix are all 1, and each row of the output matrix is the prefix sum of the corresponding row of the matrix to be processed; adding each row of the output matrix to a corresponding scalar value through a vector operation unit to obtain a result vector for that row, wherein the corresponding scalar value equals the accumulated value of the last elements of all rows preceding that row; and splicing the result vectors in sequence through a data splicing unit to obtain the prefix-sum vector corresponding to the vector data to be processed.
- 10. A chip comprising the prefix-sum computing device of any one of claims 1 to 8.
- 11. A board card comprising the chip of claim 10.
- 12. An electronic device comprising the chip of claim 10.
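The matrix multiplication recited in claim 1, in which an upper triangular matrix with 1s on and above the diagonal turns a single matmul into a batch of row-wise prefix sums, can be sketched as follows. This is an illustrative NumPy model of the arithmetic only (the function name and the 2x4 example shape are assumptions for exposition, not part of the claims), not the hardware implementation:

```python
import numpy as np

def rowwise_prefix_sums(x: np.ndarray) -> np.ndarray:
    """Compute the prefix sum of every row with one matrix multiplication.

    With U upper triangular (1s on and above the diagonal),
    (x @ U)[i, j] = x[i, 0] + x[i, 1] + ... + x[i, j].
    """
    n = x.shape[1]
    upper = np.triu(np.ones((n, n), dtype=x.dtype))
    return x @ upper

to_process = np.array([[1, 2, 3, 4],
                       [5, 6, 7, 8]])
print(rowwise_prefix_sums(to_process))
# each output row is the running sum of the corresponding input row
```

Because every row is handled by the same multiplication, the prefix sums of all rows are produced in one batch, which is what lets the device offload this step to the matrix operation unit.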
Description
Prefix-sum computing device, method, chip, board card and electronic device

Technical Field

The invention relates to the field of chips, and in particular to a prefix-sum computing device, a prefix-sum computing method, a chip, a board card and an electronic device.

Background

The prefix sum is a technique in computer science for efficiently computing running totals over a sequence. Its core idea is that, by preprocessing an array once, the sum over any interval can be queried in O(1) time, which makes it well suited to queries over large-scale vocabularies. In large models (e.g., Transformers), the prefix sum is widely used in scenarios such as probability distribution computation, sampling strategies (e.g., top-p sampling) and efficiency optimization, and plays a key role in particular in the decoding stage of text generation. In tasks such as natural language generation, dialogue systems and code generation, the prefix sum accelerates probability screening, supports path evaluation and enables parallel computation, making it a core technique for improving the efficiency and controllability of model generation.

For an array A, the prefix-sum array S is defined as S[i] = A[1] + A[2] + ... + A[i]. For the example shown in fig. 2, with A = [1, 2, 3, ..., 8], the prefix sums are: S[1] = A[1] = 1; S[2] = A[1] + A[2] = 3; S[3] = A[1] + A[2] + A[3] = 6; ...; S[8] = A[1] + A[2] + A[3] + ... + A[8] = 36.

In the prior art, prefix-sum computation is generally implemented with vector compute units (such as SIMD units or CUDA cores), for example via the single-pass parallel prefix scan with decoupled look-back. In an artificial intelligence (AI) chip architecture, matrix operations and vector/scalar operations are generally implemented by different hardware units, and matrix operations are generally carried out by dedicated hardware (such as Tensor Cores) that completes matrix multiply-accumulate operations efficiently with a high degree of data parallelism.
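As a quick check of the definition and the worked values above (A = [1, 2, ..., 8]), a minimal sketch:

```python
import numpy as np

A = np.arange(1, 9)   # A = [1, 2, 3, 4, 5, 6, 7, 8]
S = np.cumsum(A)      # S[i] = A[1] + ... + A[i] (1-based as in the text)
print(S)              # the last entry is 36, matching S[8] above
```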
Vector operations, by contrast, are typically carried out with SIMD units or CUDA cores and offer much lower data parallelism. Furthermore, because matrix operations dominate the computation in neural network models while vector/scalar operations account for only a small fraction, AI chips provide vastly different amounts of matrix and vector/scalar compute capability, with the gap typically ranging from tens to thousands of times. Implementing prefix-sum computation with vector operations alone therefore yields low computational efficiency. How to make full use of the hardware resources on a chip (both matrix operation resources and vector operation resources) to effectively accelerate the prefix-sum computation process and improve its efficiency is thus a technical problem urgently needing a solution in the prior art.

Disclosure of Invention

The invention provides a prefix-sum computing device, a method, a chip, a board card and an electronic device to solve at least one of the technical problems in the background art.
In one aspect of the invention, there is provided a prefix-sum computing device, the device comprising: a data segmentation unit configured to group the vector data to be processed according to a preset segmentation granularity and to splice the grouped vectors in sequence to form a matrix to be processed; a matrix operation unit configured to compute the prefix sum of every row of the matrix to be processed in one batch to obtain an output matrix, wherein each row of the output matrix is the prefix sum of the corresponding row of the matrix to be processed; a vector operation unit configured to add each row of the output matrix to a corresponding scalar value to obtain a result vector for that row, wherein the corresponding scalar value equals the accumulated value of the last elements of all rows preceding that row; and a data splicing unit configured to splice the result vectors in sequence to obtain the prefix-sum vector corresponding to the vector data to be processed.

Optionally, the matrix operation unit is specifically configured to perform a matrix multiplication of the matrix to be processed with an upper triangular matrix to obtain the output matrix, where the elements on and above the diagonal of the upper triangular matrix are all 1.

Optionally, the prefix-sum computing device further comprises: a memory for storing the output matrix; a vector register for storing, during vector computation, the data of the row to be computed in the output matrix; a scalar register for storing, during vector computation, the scalar value corresponding to the row to be computed; and a controller for controlling the vector operation unit, during vector computation, to add the data in the vector register to the data in the scalar register to generate the corresponding result vector.
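The data path just described (segmentation, batched row-wise prefix sums via the upper triangular matrix, scalar propagation across rows, and splicing) can be modeled end to end in software. The sketch below is illustrative only: it assumes the input length is an exact multiple of the segmentation granularity (the patent does not specify padding behavior), and all names are chosen for exposition:

```python
import numpy as np

def prefix_sum_device(vector: np.ndarray, granularity: int) -> np.ndarray:
    # Data segmentation unit: group the input into rows of `granularity`
    # elements (assumes len(vector) is a multiple of the granularity).
    matrix = vector.reshape(-1, granularity)

    # Matrix operation unit: one matmul with an upper triangular matrix of
    # 1s yields the prefix sum of every row in a single batch.
    upper = np.triu(np.ones((granularity, granularity), dtype=matrix.dtype))
    output = matrix @ upper

    # Vector operation unit + scalar register: add to each row the running
    # total of the last elements of all preceding result rows.
    scalar = 0
    results = []
    for row in output:
        result = row + scalar      # vector + scalar addition
        scalar = result[-1]        # update the scalar register
        results.append(result)

    # Data splicing unit: concatenate the result vectors in order.
    return np.concatenate(results)

v = np.arange(1, 9)                # [1, 2, ..., 8]
print(prefix_sum_device(v, 4))     # matches np.cumsum(v)
```

Note how only the short per-row scalar propagation runs on the (slower) vector path, while the bulk of the additions are folded into the single matrix multiplication, which is the resource-utilization point the invention makes.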
Optionally, the controller is specifically configured to read the output matrix from the memory