CN-115828042-B - Data processing method and device for two-dimensional pulse array


Abstract

The invention provides a data processing method and device for a two-dimensional systolic array. The device comprises a data buffer subsystem and a systolic array. Depending on the row and column dimensions of the input matrices, the systolic array can be divided into an upper part and a lower part, or into a left part and a right part, for operation, and the loading priority order is controlled when data are loaded. In an operation unit designed around the two-dimensional systolic array, the invention achieves non-blocking loading of the data matrices (within the array) at a small hardware resource cost, allocates hardware resources for different scenarios through region division, significantly improves operation execution efficiency, and reduces the design complexity of the control logic.

Inventors

LI DONGSHENG

Assignees

北京数渡信息科技有限公司

Dates

Publication Date: 20260505
Application Date: 20221116

Claims (7)

1. A data processing method of a two-dimensional systolic array, implemented by a data processing device of the two-dimensional systolic array, wherein the data processing device comprises a data buffer subsystem and a systolic array, and the systolic array comprises cells arranged in n rows and m columns; the data buffer subsystem inputs the data of the Weight matrix to the systolic array through a Weight triangularization FIFO module, and inputs the data of the Activation matrix to the systolic array through an input triangularization FIFO module; the data processing method comprises the following steps:
S1, performing a matrix operation; when the number of rows of the Activation matrix is not more than 1/2 of the number n of rows of the systolic array, going to step S2; and when the number of rows of the Weight matrix is not more than 1/2 of the number m of columns of the systolic array, going to step S3;
wherein the matrix operation comprises the following steps:
S11, the first column of the Weight matrix is transposed into a first row by the Weight triangularization FIFO module and then enters the Weight Path Registers of the first row of cells of the systolic array; after the Weight Path Registers of the first row of cells have been filled with Weights, the first column of the Activation matrix enters the Activation Registers of the first column of cells of the systolic array through the input triangularization FIFO module;
S12, the first column of Weights of the Weight matrix moves from the Weight Path Registers into the Weight Registers, and new Weights enter the vacated Weight Path Registers;
S13, the Weight in the Weight Register and the Activation in the Activation Register undergo a multiply-accumulate operation, and while the first row of cells of the systolic array is executing, the execution pipeline of the data buffer subsystem keeps pace with the subsequent multiply-accumulate operations;
S14, when the multiply-accumulate operation of the previous group of Weight matrix and Activation matrix is about to end, updating the Weight Register of each cell that has completed its operation and pulsing in the new Activation matrix;
S2, when two groups of Activation matrices are to be operated with the same Weight matrix, judging whether the number of rows of each of the two Activation matrices is not more than 1/2 of the number n of rows of the systolic array; if so, dividing the systolic array into an upper part and a lower part that operate separately; otherwise, returning to step S1 to continue the matrix operation; and
S3, when two Weight matrices are to be operated with the same Activation matrix, judging whether the number of rows of each of the two Weight matrices is not more than 1/2 of the number m of columns of the systolic array; if so, dividing the systolic array into a left part and a right part that operate separately; otherwise, returning to step S1 to continue the matrix operation.
2. The method of claim 1, wherein in step S13, the paths for loading the Weight matrix and the Activation matrix from the data buffer subsystem are independent of each other, and any pipeline stall or backpressure needed to handle read conflicts or systolic waits is confined to the inside of the data buffer subsystem.
3. The method of claim 2, wherein the data buffer subsystem is itself designed as a pipeline structure in which each stage can be back-pressured, and once data leaves the data buffer subsystem and enters the corresponding triangularization FIFO module, the path downstream of the data buffer subsystem must not back-pressure or stall the data pipeline.
4. The method of claim 1, wherein in step S14, within the same group of matrix operations, when loading the Weight and loading the Activation conflict in reading the data buffer, loading the Weight Path Register is executed preferentially.
5. The method of claim 1, wherein in step S14, when loading the Activation Register for the previous group and loading the Weight Path Register for the next group conflict in reading the data buffer, loading the Activation Register for the previous group is executed preferentially.
6. The method of claim 1, wherein the accumulated multiply-add output of the cells in row n/2 of the systolic array is led out and connected to an accumulator.
7. The method of claim 1, wherein a data selector is arranged after the cells of column m/2 in the systolic array; the front end of the data selector is connected to the cells of the preceding column and its rear end is connected to the cells of the following column, for passing Activations from left to right; and the data selector is further connected to the Activation data entry of the input triangularization FIFO module, for receiving Activations from the data buffer subsystem.
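
As an informal illustration only (it is not part of the claims), the region-division decision of steps S1 to S3 and the load priorities of claims 4 and 5 might be sketched in Python as follows. The names `Partition`, `choose_partition` and `arbitrate` are hypothetical, m is taken to be the column count of the array as in claim 7, and the ordering used for conflicts that the claims do not mention (older group first) is an assumption of this sketch.

```python
from enum import Enum


class Partition(Enum):
    FULL = "full array"          # step S1: ordinary matrix operation
    UPPER_LOWER = "upper/lower"  # step S2: two Activation matrices, one Weight matrix
    LEFT_RIGHT = "left/right"    # step S3: two Weight matrices, one Activation matrix


def choose_partition(n, m, activation_rows, weight_rows):
    """Pick a region division for an n-row, m-column array (hypothetical helper).

    activation_rows: row counts of the Activation matrices sharing one Weight matrix.
    weight_rows:     row counts of the Weight matrices sharing one Activation matrix.
    """
    # S2: both Activation matrices fit in half of the rows.
    if len(activation_rows) == 2 and all(2 * r <= n for r in activation_rows):
        return Partition.UPPER_LOWER
    # S3: both Weight matrices fit in half of the columns.
    if len(weight_rows) == 2 and all(2 * r <= m for r in weight_rows):
        return Partition.LEFT_RIGHT
    # Otherwise fall back to step S1 on the whole array.
    return Partition.FULL


def arbitrate(requests):
    """Resolve a read conflict on the data buffer.

    requests: list of (group, kind), kind being "weight" (Weight Path Register
    load) or "activation" (Activation Register load). A lower group number
    means an earlier group of matrices.
    """
    # Claim 5: the previous group's Activation load beats the next group's Weight load.
    # Claim 4: within one group, the Weight Path Register load goes first.
    return min(requests, key=lambda r: (r[0], 0 if r[1] == "weight" else 1))
```

For example, with an 8 x 8 array, two Activation matrices of 4 and 3 rows sharing one Weight matrix would select the upper/lower division, while a 5-row Activation matrix would leave the array undivided.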

Description

Data processing method and device for two-dimensional systolic array

Technical Field

The present invention relates to the processing of two-dimensional systolic arrays, and in particular to a data processing method and device for a two-dimensional systolic array.

Background

A systolic array is an array structure. The term "systolic" reflects that its mode of operation resembles the human blood circulation system. In such an array, data "flows" rhythmically between the processing elements of the array in a predetermined, pipelined fashion. As the data flows, all processing elements process the data reaching them simultaneously and in parallel, so the array can achieve a high parallel processing speed.

The predetermined data flow also means that data completes all of its processing between entering and leaving the processing element array, without having to be input again. Only the "boundary" processing elements of the array (namely the first row and the first column) communicate with the outside, so the processing speed of the array machine is improved without increasing its input and output rate. Because the array and its processing elements are simple and regular in structure, a high degree of modularity can be achieved, which makes the approach well suited to the design and manufacture of very large scale integrated circuits.

The concept of the systolic array was proposed by H. T. Kung as early as 1982. At the 44th International Symposium on Computer Architecture (ISCA) on June 26, 2017, Google presented the Tensor Processing Unit (TPU) for accelerating neural network inference on data center servers, which is nearly 15-30 times faster than server-side CPUs and GPUs.

FIG. 1a is a model of a conventional computing system.
A processing element (PE in the figure) reads data from memory, processes it, and then writes it back to memory. The biggest problem with this system is that data access tends to be much slower than data processing. The processing power of the overall system (MOPS, millions of operations per second) is therefore largely limited by its memory access capability. This problem has been one of the important topics of computer architecture research for many years and has been a major driving force behind processor and memory design.

The systolic architecture takes a very simple approach: keep the data flowing through the processing elements for as many cycles as possible. As depicted in FIG. 1b, the first data item enters the first PE, is processed, and is then passed on to the next PE, while the second data item enters the first PE. By the time the first data item arrives at the last PE, it has been processed multiple times. The systolic architecture thus reuses the input data multiple times, and can achieve higher operation throughput with less memory bandwidth. The systolic architecture also has other benefits, such as a modular design that is easily scalable, simple and regular data and control flows, the use of simple and uniform cells, the avoidance of global broadcast and fan-in, and fast response times.

In summary, the systolic architecture has several features:
1) It is made up of a plurality of homogeneous PEs, which may be arranged as a one-dimensional or two-dimensional, serial, array, or tree-like structure (the array form is the most common today).
2) The function of each PE is relatively simple, and the system improves operation efficiency by running a large number of PEs in parallel.
3) A PE can only send data to neighboring PEs (in some two-dimensional structures there may also be diagonal data channels). The data flows "downstream" in a pipelined fashion until it leaves the last PE.
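
The data reuse described above can be illustrated with a minimal Python sketch of a one-dimensional chain of PEs. It is a toy model under stated assumptions, not the circuit of the invention: the function `run_linear_systolic` and the representation of each PE as a plain Python function are hypothetical, and one loop iteration stands for one clock cycle.

```python
def run_linear_systolic(inputs, pe_funcs):
    """Simulate a 1-D systolic chain; returns (outputs, memory_reads, pe_operations)."""
    stages = [None] * len(pe_funcs)  # value held at the output of each PE
    outputs, memory_reads, pe_ops = [], 0, 0
    stream = list(inputs)
    # Run until every input has drained through the last PE.
    for cycle in range(len(stream) + len(pe_funcs)):
        # Whatever the last PE produced in the previous cycle is a finished result.
        if stages[-1] is not None:
            outputs.append(stages[-1])
        # Shift right: each PE processes what its left neighbour produced last cycle.
        for i in range(len(pe_funcs) - 1, 0, -1):
            prev = stages[i - 1]
            if prev is not None:
                stages[i] = pe_funcs[i](prev)
                pe_ops += 1
            else:
                stages[i] = None
        # Only the first (boundary) PE reads from memory.
        if cycle < len(stream):
            stages[0] = pe_funcs[0](stream[cycle])
            memory_reads += 1
            pe_ops += 1
        else:
            stages[0] = None
    return outputs, memory_reads, pe_ops


if __name__ == "__main__":
    # Three PEs: add 1, double, subtract 3.
    pes = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3]
    outs, reads, ops = run_linear_systolic([1, 2, 3], pes)
    print(outs, reads, ops)  # [1, 3, 5] 3 9
```

In this toy run, three memory reads feed nine PE operations, which is the sense in which each input item is reused once per PE it passes through.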
In conclusion, the systolic structure is a very special design: it is simple in structure and cheap to implement, but it is less flexible and suited only to specific operations.

Further, the most frequent operation in neural network computation is convolution, which is essentially a matrix multiplication. FIG. 2 is a mathematical model of the convolution, and FIG. 3 shows a method given by H. T. Kung (weights stay). In FIG. 3, the X value is broadcast to the individual operation units, the W value is pre-stored in each PE and stays fixed, and the partial result Y (initial value zero) is passed systolically to the right between the PEs. It can be seen that, after three cycles, the output of the rightmost PE is the first result of the convolution of the two sequences X and W, after which Y values are output continuously.

A typical example of a two-dimensional systolic array application is Google's TPU matrix arithmetic unit, which is a typical systolic array from the perspective of the published materia