KR-20260062644-A - A Systolic Array-Based Processing Device and Method for Sharing Compute Units Within the Systolic Array According to the Purpose of Operations

KR 20260062644 A

Abstract

The present invention relates to a computation device based on a systolic array. A method, performed in a computing device, of computing by sharing the compute units within a systolic array according to the purpose of each operation comprises the steps of: identifying operations to be performed sequentially; determining the number of compute units required for each operation in view of the two-dimensional size of the systolic array; creating compute groups according to the determined numbers of compute units; and performing the operations according to the created compute groups. By optimizing the placement of and distance between compute units, and thus the data flow within the systolic array, the present invention significantly improves computation speed. The present invention further improves hardware utilization by minimizing the idle time of the MAC units within the systolic array.
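As a rough illustration of the utilization claim above, consider an array used only for matrix multiplication versus one shared among operation types. The numbers and function names below are purely hypothetical, chosen to make the arithmetic concrete; they do not come from the patent:

```python
# Hypothetical utilization comparison for a 9x9 systolic array.
# A GEMM-only array idles while non-GEMM work runs elsewhere;
# sharing its units among operation types fills those idle cycles.

TOTAL_UNITS = 81  # 9 x 9 MAC units

def utilization(schedule):
    """schedule: list of (busy_units, cycles); returns busy fraction."""
    busy = sum(u * c for u, c in schedule)
    total = TOTAL_UNITS * sum(c for _, c in schedule)
    return busy / total

gemm_only = [(81, 60), (0, 40)]   # GEMM runs, then the array sits idle
shared    = [(81, 60), (36, 40)]  # vector + activation groups (27 + 9
                                  # units) occupy the formerly idle cycles

print(round(utilization(gemm_only), 3))  # 0.6
print(round(utilization(shared), 3))     # 0.778
```

Under these illustrative numbers, sharing raises busy-fraction utilization from 60% to about 78%.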

Inventors

  • 양성원
  • 이상원
  • 최영근

Assignees

  • 수퍼게이트 주식회사

Dates

Publication Date
2026-05-07
Application Date
2024-10-29

Claims (14)

  1. A method of computing by sharing compute units within a systolic array according to the purpose of each operation, the method comprising: identifying operations to be performed sequentially; determining the number of compute units required for each operation in view of the two-dimensional size of the systolic array; creating a compute group according to the determined number of compute units; and performing operations according to the created compute group.
  2. The method of claim 1, wherein the compute group includes a first compute group of maximum size for matrix multiplication.
  3. The method of claim 2, wherein the compute group includes a second compute group of a first size for vector multiplication.
  4. The method of claim 3, wherein the compute group includes a third compute group of a second size for computing an activation function.
  5. The method of claim 4, wherein the physical locations of the second and third compute groups on the systolic array are determined according to the operation pipeline of a neural network model.
  6. The method of claim 1, wherein performing the operations comprises: receiving an execution signal for an operation from an external unit; performing the operation according to the compute group when the execution signal is applied; and receiving a termination signal for the operation from the external unit, whereupon the compute group is switched to an idle state.
  7. The method of claim 6, wherein creating the compute group comprises regenerating a compute group including compute units of a group switched to the idle state.
  8. A computing device comprising: a processor; and a memory in communication with the processor, wherein the memory stores instructions that cause the processor to perform operations comprising: identifying operations to be performed sequentially; determining the number of compute units required for each operation in view of the two-dimensional size of a systolic array; creating a compute group according to the determined number of compute units; and performing operations according to the created compute group.
  9. The computing device of claim 8, wherein the compute group includes a first compute group of maximum size for matrix multiplication.
  10. The computing device of claim 9, wherein the compute group includes a second compute group of a first size for vector multiplication.
  11. The computing device of claim 10, wherein the compute group includes a third compute group of a second size for computing an activation function.
  12. The computing device of claim 11, wherein the physical locations of the second and third compute groups on the systolic array are determined according to the operation pipeline of a neural network model.
  13. The computing device of claim 8, wherein performing the operations comprises: receiving an execution signal for an operation from an external unit; performing the operation according to the compute group when the execution signal is applied; and receiving a termination signal for the operation from the external unit, whereupon the compute group is switched to an idle state.
  14. The computing device of claim 13, wherein creating the compute group comprises regenerating a compute group including compute units of a group switched to the idle state.
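The claimed steps, identifying the operations to run, sizing each group against the array's two-dimensional dimensions, and allocating groups of compute units, can be sketched in software. This is an illustrative sketch only, not the patented implementation; the function name `make_groups`, the left-to-right packing policy, and the group sizes are all assumptions:

```python
# Hypothetical sketch of the claimed grouping steps: identify the
# operations, size each group against the 2D array, and allocate
# non-overlapping regions of MAC units. All names are illustrative.

def make_groups(array_rows, array_cols, ops):
    """ops: list of (name, rows, cols) requests; returns placed groups.

    Groups are packed left to right along the top of the array, a
    simplification of whatever placement the patent actually uses.
    """
    groups, col = [], 0
    for name, r, c in ops:
        if r > array_rows or col + c > array_cols:
            raise ValueError(f"{name}: not enough free compute units")
        groups.append({"op": name, "origin": (0, col), "shape": (r, c)})
        col += c
    return groups

groups = make_groups(9, 9, [
    ("matmul", 9, 5),      # first group: maximum-size matrix multiply
    ("vecmul", 9, 3),      # second group: vector multiplication
    ("activation", 9, 1),  # third group: activation function
])
for g in groups:
    print(g["op"], g["origin"], g["shape"])
```

Packing the groups side by side is one simple way to keep the physical distance between pipeline-adjacent groups short, which is the placement concern claims 5 and 12 raise.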

Description

A Systolic Array-Based Processing Device and Method for Sharing Compute Units Within the Systolic Array According to the Purpose of Operations

The present invention relates to a systolic array-based computing device. Systolic arrays are massively parallel computing hardware structures, used primarily in artificial intelligence and machine learning, that employ Multiply-Accumulate (MAC) units to rapidly process complex operations such as matrix multiplication. Systolic arrays are used in neural networks because they can process large-scale matrix operations quickly and efficiently in parallel. Neural networks repeatedly perform matrix multiplications, and these operations form the core workload, especially in deep learning models. Processing these large-scale operations on conventional processors takes a long time or consumes substantial resources. A systolic array performs operations by moving data along regularly arranged processing units, which can greatly increase computation speed. It can also optimize the General Matrix Multiplication (GEMM) operations used frequently in neural networks, making it well suited to improving the processing speed and energy efficiency of neural network models.

A prior-art patent (KR 2024-0112088 A, published July 18, 2024) proposes a method and apparatus for improving memory usability and efficiency when processing a Squeeze-and-Excitation (SE) block with a systolic array. That invention reduces the memory waste that occurs when processing a feature map whose width and/or height is 1 during SE block operations, and maximizes the efficient use of memory. Existing systolic arrays, however, are used mainly for General Matrix Multiplication (GEMM); while no matrix multiplication is being performed, the array's resources sit idle, which can result in low overall hardware resource utilization.

FIG. 1 is an exemplary diagram showing a computing device according to one embodiment of the present invention. FIG. 2 is a flowchart illustrating a computation method according to one embodiment of the present invention. FIG. 3 illustrates the operation of a CNN-structured model. FIGS. 4 and 5 illustrate the computations of a transformer model. FIGS. 6 and 7 show examples of creating compute groups according to the number of compute units. FIG. 8 is an exemplary diagram showing the adder-tree structure of a computation device according to one embodiment of the present invention. FIG. 9 is a diagram showing the detailed operation of a computation device according to one embodiment of the present invention. FIG. 10 is an exemplary diagram showing an implementation of a computing device including a computation device according to one embodiment of the present invention.

The following description merely illustrates the principles of the invention. Those skilled in the art may therefore devise various arrangements that, although not explicitly described or shown herein, embody the principles of the invention and fall within its concept and scope. Furthermore, all conditional terms and embodiments listed in this specification are, in principle, intended solely to aid understanding of the concept of the invention, and should not be understood as limited to the specifically listed embodiments and conditions. The aforementioned objects, features, and advantages will become clearer through the following detailed description taken in conjunction with the attached drawings, so that a person skilled in the art to which the invention pertains can readily implement the technical concept of the invention.
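The MAC-based matrix multiplication described in the background can be modeled in software. The sketch below illustrates only the arithmetic each MAC unit performs (one multiply-accumulate per cycle); it does not model the patented hardware, its data shifting, or its grouping:

```python
# Toy software model of MAC accumulation in a GEMM, the operation a
# systolic array parallelizes across its MAC units: each output cell
# accumulates activation * weight products, one MAC step per cycle.

def gemm_mac(A, B):
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0
            for t in range(k):  # one multiply-accumulate per cycle
                acc += A[i][t] * B[t][j]
            C[i][j] = acc
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(gemm_mac(A, B))  # [[19, 22], [43, 50]]
```

In the hardware, the three nested loops collapse: each MAC unit owns one accumulation, and the array streams activations and weights through adjacent units so that many such accumulations proceed in parallel each clock cycle.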
In addition, in describing the invention, where a detailed description of known technology related to the invention is judged likely to unnecessarily obscure its essence, that description is omitted. Preferred embodiments of the present invention are described in detail below with reference to the attached drawings.

FIG. 1 is an exemplary diagram showing a computing device (300) according to one embodiment of the present invention. FIG. 1 shows, as an example, the structure of a systolic-array-type computing device (300) composed of 9x9 MAC (Multiply-Accumulate) units (300u). A systolic array is a two-dimensional array designed to process matrix multiplication efficiently: each MAC unit (300u) receives an activation input and a weight input, performs a multiplication, and accumulates the results to produce the final sum. Each MAC unit (300u) shifts its input data, and the data is transferred to adjacent MAC units (300u) within the array on each consecutive clock cycle. Each MAC unit (300u) is a core computing unit (300) for efficiently processing matri