
CN-122018993-A - Computing unit, instruction fetch request processing method, electronic device and medium

CN 122018993 A

Abstract

The invention provides a computing unit, an instruction fetch request processing method, an electronic device, and a medium. The computing unit comprises an instruction cache unit, a first instruction processing unit, a second instruction processing unit, and an instruction fetch request processing unit. The first instruction processing unit generates a first instruction fetch request when a first instruction fetch condition is met; the first instruction fetch request points to a thread-bundle-level instruction to be fetched. The second instruction processing unit executes workgroup-level instructions, where one workgroup-level instruction comprises a plurality of thread-bundle-level instructions. The instruction fetch request processing unit generates a second instruction fetch request when a second instruction fetch condition is met; the second instruction fetch request points to a workgroup-level instruction to be fetched. The first and second instruction fetch requests are arbitrated according to a preset strategy, one of the two is selected and sent to the instruction cache unit, and the corresponding instruction data is then routed to the first or second instruction processing unit according to the source of the instruction fetch request. The present disclosure enables parallel processing of instruction fetch requests for workgroup-level and thread-bundle-level instructions.
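The fetch-arbitrate-dispatch flow summarized above can be sketched in pseudocode-style Python. This is an illustrative model only, not the patent's hardware design; all names (`FetchRequest`, `arbitrate`, `dispatch`, the fixed-priority choice) are assumptions introduced for exposition.

```python
# Hypothetical model of the dual-path fetch flow described in the abstract.
# The patent leaves the "preset strategy" open; fixed priority for
# workgroup-level fetches is used here purely as an example.
from dataclasses import dataclass

WARP, WORKGROUP = "warp", "workgroup"

@dataclass
class FetchRequest:
    source: str   # WARP (first unit) or WORKGROUP (fetch request unit)
    address: int  # instruction data address

def arbitrate(warp_req, wg_req):
    """Select one request per cycle; a fixed-priority policy favoring
    workgroup-level fetches stands in for the preset strategy."""
    if wg_req is not None:
        return wg_req
    return warp_req

def dispatch(instruction_data, source):
    """Route returned instruction data by the request's source tag."""
    return ("first_unit" if source == WARP else "second_unit", instruction_data)

# One arbitration round: under fixed priority the workgroup request wins,
# and its returned data is routed to the second instruction processing unit.
chosen = arbitrate(FetchRequest(WARP, 0x100), FetchRequest(WORKGROUP, 0x400))
unit, _ = dispatch(b"...", chosen.source)
assert unit == "second_unit"
```

The key point the sketch illustrates is that both request streams share one port into the instruction cache unit, while the source tag carried by each request lets the returned data be demultiplexed back to the correct processing unit.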

Inventors

  • Zhu Hengye
  • Jiang Yuxiang
  • Xu Pingan
  • Hao Shaopu

Assignees

  • 苏州亿铸智能科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-30

Claims (15)

  1. A computing unit, comprising: an instruction cache unit for caching instructions; a first instruction processing unit for generating a first instruction fetch request when a first instruction fetch condition is met, wherein the first instruction fetch request points to a thread-bundle-level instruction to be fetched; a second instruction processing unit for executing workgroup-level instructions, wherein one workgroup-level instruction comprises a plurality of thread-bundle-level instructions; and an instruction fetch request processing unit for generating a second instruction fetch request when a second instruction fetch condition is met, wherein the second instruction fetch request points to a workgroup-level instruction to be fetched, performing arbitration between the first and second instruction fetch requests according to a preset strategy, selecting one instruction fetch request to send to the instruction cache unit, and sending the corresponding instruction data to the first or second instruction processing unit according to the source of the instruction fetch request.
  2. The computing unit of claim 1, wherein the first instruction processing unit comprises a plurality of execution units, each execution unit comprising a first instruction buffer for buffering thread-bundle-level instructions fetched from the instruction cache unit, and wherein the first instruction fetch condition comprises the instruction data in the first instruction buffer falling below a preset capacity threshold, or the first instruction buffer being empty and needing to prefetch subsequent instructions to maintain continuous pipeline execution.
  3. The computing unit of claim 2, wherein the first instruction processing unit further comprises a first scheduler configured to monitor the fill state of the first instruction buffers of the plurality of execution units and to generate a first instruction fetch request when the first instruction fetch condition is satisfied, so as to trigger fetching of a new thread-bundle-level instruction from the instruction cache unit.
  4. The computing unit of claim 1, wherein the instruction fetch request processing unit comprises a second instruction buffer for buffering workgroup-level instructions fetched from the instruction cache unit, and wherein the second instruction fetch condition comprises the instruction data in the second instruction buffer falling below a preset capacity threshold, or the second instruction buffer being empty and needing to prefetch subsequent instructions to maintain continuous pipeline execution.
  5. The computing unit of claim 4, wherein the instruction fetch request processing unit further comprises a second scheduler configured to monitor the fill state of the second instruction buffer and to generate a second instruction fetch request when the second instruction fetch condition is satisfied, so as to trigger fetching of a new workgroup-level instruction from the instruction cache unit.
  6. The computing unit of claim 3, wherein the instruction fetch request processing unit comprises a request merging module and an arbitration module; the request merging module is configured to receive one or more first instruction fetch requests from the first scheduler, merge first instruction fetch requests having identical or consecutive instruction data addresses into a first aggregate instruction fetch request, and provide the first aggregate instruction fetch request to the arbitration module; and the arbitration module is configured to perform arbitration between the first aggregate instruction fetch request and the second instruction fetch request according to a preset strategy and to select one of them to send to the instruction cache unit.
  7. The computing unit of claim 6, wherein the preset strategy comprises at least one of a fixed priority policy, a round-robin (polling) policy, a dynamic priority policy based on request latency, or a dynamic priority policy based on the fill pressure of the first and second instruction buffers.
  8. The computing unit of claim 5 or 6, wherein the instruction fetch request processing unit further comprises a distribution module configured to receive instruction data packets returned from the instruction cache unit, wherein each instruction data packet comprises an instruction fetch request source identifier, and to distribute the instruction data to the first or second instruction processing unit according to the instruction fetch request source identifier.
  9. The computing unit of claim 8, wherein the instruction fetch request processing unit further comprises an instruction fetch request sending module configured to generate an instruction fetch request packet according to the arbitration result and send it to the instruction cache unit, wherein the instruction fetch request packet comprises the instruction data address corresponding to the selected instruction fetch request, an instruction fetch request source identifier, and an instruction fetch request sending-order identifier; and the distribution module is further configured to manage the ordering of returned instruction data according to the instruction fetch request sending-order identifier.
  10. The computing unit of claim 9, wherein the second instruction processing unit comprises a plurality of heterogeneous computing units including at least two of a general-purpose computing unit, a tensor processing unit, and a data copying unit, and the workgroup-level instructions comprise general-purpose computing instructions, tensor operation instructions, and data copying instructions; and the distribution module distributes each workgroup-level instruction to the corresponding heterogeneous computing unit for execution according to its type.
  11. The computing unit of claim 10, wherein the workgroup-level instructions further comprise a trigger instruction for triggering the first instruction processing unit to begin operation; and the distribution module, in response to the trigger instruction, sends a start signal to the first instruction processing unit.
  12. An instruction fetch request processing method applied to the computing unit according to any one of claims 1 to 11, the method comprising: generating, by the first instruction processing unit, a first instruction fetch request when a first instruction fetch condition is met, wherein the first instruction fetch request points to a thread-bundle-level instruction to be fetched; executing, by the second instruction processing unit, workgroup-level instructions, wherein one workgroup-level instruction comprises a plurality of thread-bundle-level instructions; and generating, by the instruction fetch request processing unit, a second instruction fetch request when a second instruction fetch condition is met, wherein the second instruction fetch request points to a workgroup-level instruction to be fetched, performing arbitration between the first and second instruction fetch requests according to a preset strategy, and selecting one instruction fetch request to send to the instruction cache unit, so that the corresponding instruction data is sent to the first or second instruction processing unit according to the source of the instruction fetch request.
  13. The instruction fetch request processing method of claim 12, further comprising: merging, by the instruction fetch request processing unit, first instruction fetch requests having identical or consecutive instruction data addresses into a first aggregate instruction fetch request; and performing arbitration between the first aggregate instruction fetch request and the second instruction fetch request according to a preset strategy, and selecting one of them to send to the instruction cache unit.
  14. An electronic device comprising a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory, wherein the program, when executed by the processor, implements the instruction fetch request processing method of claim 12 or 13.
  15. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the instruction fetch request processing method of claim 12 or 13.
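The request merging of claim 6 and the arbitration policies enumerated in claim 7 can be illustrated with a small sketch. This is not the patent's RTL; the function names, the cache-line size, and the concrete merging rule (coalescing identical or consecutive line-aligned addresses into one contiguous range) are assumptions made for the example.

```python
# Illustrative sketch of claim 6's request merging and one of claim 7's
# arbitration policies. All names and parameters are hypothetical.
def merge_requests(addresses, line_size=64):
    """Coalesce fetch requests with identical or consecutive instruction
    data addresses into aggregate requests covering contiguous ranges
    (claim 6). Returns a list of (start, end) half-open address ranges."""
    merged = []
    for addr in sorted(set(addresses)):  # dedupe identical addresses
        if merged and addr == merged[-1][1]:
            merged[-1] = (merged[-1][0], addr + line_size)  # extend range
        else:
            merged.append((addr, addr + line_size))
    return merged

def round_robin_arbiter():
    """The polling policy of claim 7: when both sources have a pending
    request, alternate grants between them; otherwise grant whichever
    request is present."""
    turn = 0
    def pick(warp_req, wg_req):
        nonlocal turn
        if warp_req is None:
            return wg_req
        if wg_req is None:
            return warp_req
        turn ^= 1
        return wg_req if turn else warp_req
    return pick

# Requests at 0x0, 0x40, 0x80 fall on consecutive 64-byte lines and
# collapse into one aggregate request; 0x200 stays separate.
assert merge_requests([0x0, 0x40, 0x200, 0x80]) == [(0x0, 0xC0), (0x200, 0x240)]
```

The other policies named in claim 7 (fixed priority, latency-based dynamic priority, buffer-pressure-based dynamic priority) fit the same `pick(warp_req, wg_req)` interface, which is why the claim can list them as interchangeable preset strategies.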

Description

Computing unit, instruction fetch request processing method, electronic device and medium

Technical Field

The disclosure relates to the field of computer technology, and in particular to a computing unit, an instruction fetch request processing method, an electronic device, and a medium.

Background

In recent years, with the continuous growth of deep learning models, and in particular the wide adoption of large language models and generative AI, general-purpose graphics processors (GPGPUs), as a core compute carrier, are evolving from "general-purpose parallelism" toward "heterogeneous fusion" and "mixed-granularity computing". The traditional GPGPU is based on the single-instruction multiple-thread (SIMT) architecture and takes the thread bundle (warp) as its basic scheduling unit. It performs well on fine-grained tasks, but suffers from low instruction density, high control overhead, and insufficient execution efficiency when handling macro tasks such as large-scale matrix multiply-accumulate (MMA), tensor transformation, and global data movement. Although modern AI accelerators have integrated specialized execution units such as Tensor Cores and Matrix Cores that execute at the granularity of a workgroup (WG), as well as hardware modules such as Tensor Memory Accelerators (TMA) that support bulk asynchronous data movement, existing instruction architectures are still limited to warp-level instructions: high-level computational intent must be broken down by compilers into a large number of low-level SIMT instructions, resulting in high instruction bandwidth pressure, redundant control logic, reduced pipeline utilization, and difficulty in fully exploiting the performance potential of the specialized hardware.
Against this background, it is desirable to introduce workgroup-level (WG-level) instructions on top of the existing SIMT execution capability, directly describing operations such as matrix computation, data copying, and synchronization configuration with high-level semantics, thereby significantly reducing the number of instruction issues and improving hardware utilization and program locality. However, WG-level instructions and conventional warp-level instructions differ substantially in execution cycle, resource occupation, access frequency, and latency characteristics; if a unified instruction fetch path and a shared buffer structure are adopted, instruction supply bottlenecks are easily created, degrading overall performance. Therefore, a new computing unit architecture is needed that supports both instruction granularities with efficient parallel instruction fetching and distribution.

Disclosure of Invention

The embodiments of the disclosure provide a computing unit, an instruction fetch request processing method, an electronic device, and a medium. By constructing independent instruction fetch paths and a differentiated scheduling mechanism, instruction fetch requests for workgroup-level and thread-bundle-level instructions can be processed efficiently in parallel, avoiding the resource contention and supply latency caused by the difference in instruction granularity and thereby improving overall instruction supply efficiency and hardware utilization.
According to an aspect of the present disclosure, there is provided a computing unit, comprising: an instruction cache unit for caching instructions; a first instruction processing unit for generating a first instruction fetch request when a first instruction fetch condition is met, wherein the first instruction fetch request points to a thread-bundle-level instruction to be fetched; a second instruction processing unit for executing workgroup-level instructions, wherein one workgroup-level instruction comprises a plurality of thread-bundle-level instructions; and an instruction fetch request processing unit for generating a second instruction fetch request when a second instruction fetch condition is met, wherein the second instruction fetch request points to a workgroup-level instruction to be fetched, performing arbitration between the first and second instruction fetch requests according to a preset strategy, selecting one instruction fetch request to send to the instruction cache unit, and sending the corresponding instruction data to the first or second instruction processing unit according to the source of the instruction fetch request. Optionally, the first instruction processing unit comprises a plurality of execution units, the execution units comprising a first instruction buffer for buffering thread-bundle-level instructions fetched from the instruction cache unit, the