EP-4320514-B1 - COOPERATIVE INSTRUCTION PREFETCH ON MULTICORE SYSTEM

Inventors

  • Nagarajan, Rahul
  • Leary, Christopher
  • Vijayaraj, Thejasvi Magudilu
  • Norrie, Thomas James

Dates

Publication Date
2026-05-06
Application Date
2022-10-31

Claims (13)

  1. A hardware circuit (101), comprising: a plurality of tiles (102), each tile configured to operate in parallel with other tiles in the plurality of tiles, each tile of the plurality of tiles comprising a processing core (301, 302), which comprises a tile access core (310, 330) and a tile execute core (320, 340), wherein each of the tile access core and the tile execute core of the processing core of the respective tile comprises: a prefetch unit (311, 321, 331, 341); and an instruction buffer (312, 322, 332, 342), wherein each prefetch unit of a respective core is arranged to make requests for instructions to a plurality of task instruction memories (351, 352); an instruction request bus (392) arranged to aggregate the requests for instructions from different prefetch units before requesting instructions responsive to the requests for instructions; the plurality of task instruction memories (351, 352), each task instruction memory of the plurality of task instruction memories being arranged in a sequence and coupled to one or more tiles from the plurality of tiles via an instruction router (360), wherein the instruction router is arranged to filter the requests for instructions from the prefetch units to de-duplicate requests for identical instructions and to provide the de-duplicated requests to the plurality of task instruction memories, which are arranged to store instructions responsive to the de-duplicated requests; and an instruction broadcast bus (391) arranged to broadcast the instructions from the plurality of task instruction memories to the plurality of tiles via the instruction router, which is arranged to deserialize the instructions and to provide the deserialized instructions to the instruction buffers of the cores of the tiles.
  2. The hardware circuit of claim 1, wherein the task instruction memories are arranged in a downstream sequence.
  3. The hardware circuit of claim 1, wherein the instruction broadcast bus contains independent data lanes, wherein a number of independent data lanes corresponds to a number of task instruction memories.
  4. The hardware circuit of claim 1, wherein the instruction request bus contains independent data lanes, wherein a number of independent data lanes corresponds to a number of task instruction memories.
  5. The hardware circuit of claim 1, wherein instructions received by a task instruction memory are broadcasted to all the tiles linked on the instruction broadcast bus.
  6. The hardware circuit of claim 1, wherein the prefetch unit is configured to provide a request to at least one task instruction memory during a prefetch window.
  7. The hardware circuit of claim 1, wherein the instruction router comprises a round robin arbiter configured to arbitrate requests including a prefetch read request.
  8. The hardware circuit of claim 1, wherein the instruction buffer is configured to store instructions for the tile access core or the tile execute core.
  9. The hardware circuit of claim 1, further comprising a task instruction memory access bus, the task instruction memory access bus comprising a read request bus, a read response bus, a write request bus, and a write response bus.
  10. A method of providing instructions by a hardware circuit (101) according to any one of claims 1 to 9, the method comprising: aggregating, by the instruction request bus (392), the requests for instructions from different prefetch units before requesting instructions responsive to the requests for instructions; filtering, by the instruction router (360), the requests for instructions to de-duplicate requests for identical instructions and providing the de-duplicated requests to the plurality of task instruction memories; storing, by the plurality of task instruction memories, instructions responsive to the de-duplicated requests; and broadcasting, by the instruction broadcast bus (391), the instructions from the plurality of task instruction memories to the plurality of tiles via the instruction router, wherein the instruction router deserializes the instructions to provide the deserialized instructions to the instruction buffers of the cores of the tiles.
  11. The method of claim 10, wherein the requests for instructions to the plurality of task instruction memories are made in a first processing clock cycle and the storing of the instructions responsive to the de-duplicated requests occurs in a second processing clock cycle.
  12. The method of claim 11, wherein the first processing clock cycle occurs prior to the second processing clock cycle.
  13. A non-transitory computer readable medium for storing instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of the preceding claims 10 to 12.
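The request aggregation, de-duplication, and broadcast recited in claim 1 can be illustrated with a small software model. The following is a minimal sketch, not the patented implementation: the class name, method names, and data structures are hypothetical, chosen only to show how requests for identical instructions from different prefetch units can collapse into a single memory read whose result is then delivered to every requester.

```python
from collections import defaultdict


class InstructionRouter:
    """Toy model of a de-duplicating instruction router.

    Hypothetical sketch only; names and interfaces are illustrative and
    are not taken from the patent's actual hardware design.
    """

    def __init__(self):
        # Map each requested instruction address to the set of requesting cores.
        self.pending = defaultdict(set)

    def collect(self, core_id, address):
        """Aggregate a prefetch request from a core (instruction request bus)."""
        self.pending[address].add(core_id)

    def deduplicated_requests(self):
        """One memory read per distinct address, regardless of how many cores asked."""
        return sorted(self.pending)

    def broadcast(self, address, instruction, buffers):
        """Deliver a fetched instruction to every core that requested it
        (instruction broadcast bus), then retire the pending entry."""
        for core_id in self.pending.pop(address, ()):
            buffers[core_id].append((address, instruction))
```

For example, if cores 0 and 1 both request address 0x40 and core 2 requests 0x80, only two reads reach the task instruction memories, yet all three instruction buffers are filled.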

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application is a continuation of U.S. Patent Application No. 17/972,681, filed October 25, 2022, which claims the benefit of the filing date of U.S. Provisional Patent Application No. 63/281,960, filed November 22, 2021.

TECHNICAL FIELD

The present disclosure relates to the technical field of instruction prefetching within multicore processing systems.

BACKGROUND

A single instruction, multiple data (SIMD) processing unit is a type of processing unit for parallel processing of multiple data inputs by performing the same operation on each of the inputs. Operations to be accelerated by SIMD processing units are predetermined at design time of the SIMD processing unit.

Adding an instruction memory to each tile of a cross-lane processing unit (XPU) is expensive, especially when the common use case is single program, multiple data (SPMD) execution with different programs running simultaneously. A full cache-coherent solution, such as that deployed in central processing units (CPUs), is too complex and not cost-effective for XPUs. It is also impractical to provide each core with a private instruction memory large enough to hold all possible programs, or even a single large program. The alternative is to share a common instruction memory (TiMem) across all tiles while giving each compute core a small instruction buffer (iBuf), prefetching instructions from TiMem into iBuf as needed. All compute cores share this TiMem and access it concurrently.

US2017351516A1 describes a mechanism called Total Store Elimination (TSE), used in processors and processing logic to optimize instruction execution by removing redundant store operations in the instruction stream while maintaining total store order (TSO) consistency across multiple threads or processors.
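The shared-TiMem-plus-private-iBuf arrangement described in the background can be sketched as a toy behavioral model. This is an illustrative assumption, not the disclosed hardware: the class, the buffer capacity, and the sequential program counter are all hypothetical simplifications of a per-core prefetch unit filling a small buffer from the shared memory.

```python
class PrefetchUnit:
    """Toy model of a per-core prefetch unit filling a small iBuf from a
    shared TiMem. Names and behavior are illustrative only."""

    def __init__(self, timem, ibuf_capacity=8):
        self.timem = timem          # shared instruction memory (list of instructions)
        self.ibuf = []              # small per-core instruction buffer
        self.capacity = ibuf_capacity
        self.next_pc = 0            # next TiMem address to prefetch

    def prefetch(self):
        """Fill the buffer up to capacity, running ahead of execution."""
        while len(self.ibuf) < self.capacity and self.next_pc < len(self.timem):
            self.ibuf.append(self.timem[self.next_pc])
            self.next_pc += 1

    def issue(self):
        """Return the next buffered instruction, or None when the program ends."""
        self.prefetch()
        return self.ibuf.pop(0) if self.ibuf else None
```

Each core holds only a capacity-sized window of its program at any time, which is what makes the per-tile storage small even when TiMem holds a large program.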
SUMMARY

Aspects of the present disclosure include methods, systems, and apparatuses using an instruction prefetch pipeline architecture that provides good performance without the complexity of a full cache-coherent solution as deployed in CPUs. The architecture can include components from which an instruction prefetch pipeline can be constructed, including instruction memory (TiMem), an instruction buffer (iBuf), a prefetch unit, and an instruction router.

An aspect of the disclosure provides for a hardware circuit. The hardware circuit includes a plurality of tiles, where each tile is configured to operate in parallel with the other tiles in the plurality of tiles. Each tile of the plurality of tiles includes: a processing core; a prefetch unit; and an instruction buffer. The hardware circuit further includes a plurality of data processing lanes configured to stream respective data from an upstream input to a downstream destination. The hardware circuit also includes a plurality of task instruction memories, where each task instruction memory of the plurality of task instruction memories is arranged in a sequence and coupled to one or more tiles from the plurality of tiles via an instruction router.

In an example, the task instruction memories are arranged in a downstream sequence. In another example, each tile includes a tile access core and the prefetch unit contained within each tile is contained within the tile access core. In yet another example, each tile includes a tile execute core and the prefetch unit contained within each tile is contained within the tile execute core. In yet another example, the hardware circuit further includes an instruction broadcast bus and an instruction request bus. In yet another example, the instruction broadcast bus contains independent data lanes, where the number of independent data lanes corresponds to the number of task instruction memories.
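The summary attributes a round robin arbiter to the instruction router for arbitrating requests such as prefetch reads. A minimal software sketch of round-robin arbitration follows; the requester count, interface, and grant policy details are assumptions for illustration, not the circuit's actual arbiter.

```python
class RoundRobinArbiter:
    """Toy round-robin arbiter: grants one asserted request per cycle,
    rotating priority so no requester is starved. Illustrative only."""

    def __init__(self, n_requesters):
        self.n = n_requesters
        self.last = self.n - 1   # so the first grant search starts at requester 0

    def grant(self, requests):
        """Grant the first asserted request after the last winner.

        `requests` is a sequence of booleans, one per requester; returns the
        granted requester's index, or None if nothing is asserted."""
        for offset in range(1, self.n + 1):
            candidate = (self.last + offset) % self.n
            if requests[candidate]:
                self.last = candidate
                return candidate
        return None
```

With three requesters all asserting, the grant rotates 0, 1, 2, 0, ..., which is the fairness property that makes round-robin attractive when many prefetch units contend for the same task instruction memory.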
In yet another example, the instruction request bus contains independent data lanes, where the number of independent data lanes corresponds to the number of task instruction memories. In yet another example, instructions received by a task instruction memory are broadcast to all the tiles linked on the instruction broadcast bus. In yet another example, the prefetch unit is configured to provide a request to at least one task instruction memory during a prefetch window. In yet another example, the instruction router includes a round robin arbiter configured to arbitrate requests including a prefetch read request. In yet another example, the instruction buffer is configured to store instructions for a tile access core or a tile execute core. In yet another example, the hardware circuit further includes a task instruction memory access bus, where the task instruction memory access bus includes a read request bus, a read response bus, a write request bus, and a write response bus.

Another aspect of the disclosure provides for a method of providing instructions by a single instruction multiple data (SIMD) processing unit. The method includes: receiving, by one or more processors from a plurality of tiles of the SIMD processing unit, requests for instructions; filtering, by the one