US-20260126907-A1 - Semiconductor Apparatus, Semiconductor Device, Non-Transitory Computer-Readable Medium, Method, Apparatus, and Device for a Computer System
Abstract
Various examples relate to a semiconductor apparatus, a semiconductor device, or to a non-transitory computer-readable medium, a method, an apparatus or a device for a computer system, and to a computer system comprising the semiconductor apparatus and the apparatus or device. A semiconductor apparatus comprises a spatial array of processing elements coupled by an interconnect network, a plurality of memory management circuits coupled to the spatial array of processing elements, wherein the respective memory management circuits comprise a first buffer circuit for storing a configuration to be used for configuring the processing elements associated with the respective memory management circuit and a second buffer circuit for storing results of computations performed by the processing elements associated with the respective memory management circuit, the first buffer circuit being separate from the second buffer circuit.
Inventors
- Tuan Quach
- Kermin CHOFLEMING, JR.
- William Lee
Assignees
- Tuan Quach
- Kermin CHOFLEMING, JR.
- William Lee
Dates
- Publication Date: 20260507
- Application Date: 20251222
Claims (16)
- 1. A semiconductor apparatus comprising: a spatial array of processing elements coupled by an interconnect network; a plurality of memory management circuits coupled to the spatial array of processing elements, wherein the respective memory management circuits comprise a first buffer circuit for storing a configuration to be used for configuring the processing elements associated with the respective memory management circuit and a second buffer circuit for storing results of computations performed by the processing elements associated with the respective memory management circuit, the first buffer circuit being separate from the second buffer circuit.
- 2. The semiconductor apparatus according to claim 1, wherein the respective memory management circuits are configured to receive a subsequent configuration to be applied to the processing elements associated with the respective memory management circuit in a subsequent time interval while the processing elements associated with the respective memory management circuit are performing computations according to a present configuration.
- 3. The semiconductor apparatus according to claim 1, wherein the respective memory management circuits are configured to use the first buffer circuit independently of the second buffer circuit.
- 4. The semiconductor apparatus according to claim 1, wherein the respective memory management circuits are configured to detect that the processing elements associated with the respective memory management circuit have finished performing computations according to a present configuration, and to reconfigure the processing elements associated with the respective memory management circuit with the configuration stored in the first buffer circuit upon detection of the computations being finished.
- 5. The semiconductor apparatus according to claim 4, wherein the detection is based on one or more of (a) a subsequent configuration being stored in the first buffer circuit, (b) the results of graph invocations having been returned to a managing entity, or (c) outstanding memory operations associated with a present graph having been completed.
- 6. The semiconductor apparatus according to claim 4, wherein the respective memory management apparatuses comprise a counter of outstanding results, with the detection of the computations being finished being based on the counter of outstanding results.
- 7. The semiconductor apparatus according to claim 1, wherein the first buffer circuit comprises a first and a second buffer bank, the configuration comprises a first configuration portion relevant for the respective memory management apparatus and a second configuration portion relevant for the processing elements associated with the memory management apparatus, and the respective memory management apparatuses are configured to receive the configuration to be stored in the first and second buffer bank, and store the first configuration portion in the first buffer bank and at least a portion of the second configuration portion in the second buffer bank.
- 8. The semiconductor apparatus according to claim 7, wherein the respective memory management apparatuses are configured to store the first configuration portion in the first buffer bank, subsequently store the second configuration portion in the second buffer bank until the second buffer bank is filled, and thereafter store a remaining part of the second configuration portion in the first buffer bank.
- 9. The semiconductor apparatus according to claim 7, wherein the first and second buffer banks have a width that matches the bandwidth of a communication interface.
- 10. The semiconductor apparatus according to claim 7, wherein the semiconductor apparatus comprises multiplexing circuitry configured to align the first and second configuration portions for storage in the first and second buffer banks.
- 11. The semiconductor apparatus according to claim 1, wherein the first buffer circuit has a smaller storage capacity than the second buffer circuit.
- 12. A computer system comprising the semiconductor apparatus according to claim 1.
- 13. A non-transitory computer-readable medium storing instructions that, when executed by one or more processing circuitries, cause the one or more processing circuitries to perform a method for a computer system, the method comprising: determining a configuration for a semiconductor apparatus, the semiconductor apparatus comprising a spatial array of processing elements coupled by an interconnect network and a plurality of memory management circuits coupled to the spatial array of processing elements, wherein the respective memory management circuits comprise a first buffer circuit for storing a configuration to be used for configuring the processing elements associated with the respective memory management circuit and a second buffer circuit for storing results of computations performed by the processing elements associated with the respective memory management circuit, the first buffer circuit being separate from the second buffer circuit; providing the configuration for the first buffer circuit; and obtaining the result of the computations of the second buffer circuit.
- 14. The non-transitory computer-readable medium according to claim 13, wherein the method comprises providing a subsequent configuration to the first buffer circuit of the respective memory management circuits while the processing elements associated with the respective memory management circuit are performing computations according to a present configuration.
- 15. An apparatus for a computer system, comprising interface circuitry, machine-readable instructions, and processor circuitry to execute the machine-readable instructions to: determine a configuration for a semiconductor apparatus, the semiconductor apparatus comprising a spatial array of processing elements coupled by an interconnect network and a plurality of memory management circuits coupled to the spatial array of processing elements, wherein the respective memory management circuits comprise a first buffer circuit for storing a configuration to be used for configuring the processing elements associated with the respective memory management circuit and a second buffer circuit for storing results of computations performed by the processing elements associated with the respective memory management circuit, the first buffer circuit being separate from the second buffer circuit; provide the configuration for the first buffer circuit; and obtain the result of the computations of the second buffer circuit.
- 16. The apparatus according to claim 15, wherein the processor circuitry is to execute the machine-readable instructions to provide a subsequent configuration to the first buffer circuit of the respective memory management circuits while the processing elements associated with the respective memory management circuit are performing computations according to a present configuration.
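The behaviour recited in claims 2, 4, and 6 (staging a subsequent configuration in a separate buffer while computations are in flight, then reconfiguring once a counter of outstanding results reaches zero) can be illustrated with a small software model. The following is only a sketch under stated assumptions: the class name, method names, and the use of strings as configurations are hypothetical and do not appear in the publication, and the claimed subject matter is hardware circuitry, not Python.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class MemoryManagementCircuit:
    """Toy model of one memory management circuit (all names are hypothetical)."""

    active_config: Optional[str] = None   # configuration the PEs currently run
    config_buffer: Optional[str] = None   # first buffer: staged next configuration
    completion_buffer: Deque[int] = field(default_factory=deque)  # second buffer: results
    outstanding: int = 0                  # counter of outstanding results

    def stage_config(self, config: str) -> None:
        """Accept the next configuration, even while computations are in flight."""
        self.config_buffer = config
        self._maybe_reconfigure()

    def issue(self, count: int) -> None:
        """Record that `count` results are expected under the active configuration."""
        self.outstanding += count

    def complete(self, result: int) -> None:
        """Store one result in the completion buffer and check for completion."""
        self.completion_buffer.append(result)
        self.outstanding -= 1
        self._maybe_reconfigure()

    def _maybe_reconfigure(self) -> None:
        # Reconfigure only when no results are outstanding and a
        # configuration is waiting in the first buffer.
        if self.outstanding == 0 and self.config_buffer is not None:
            self.active_config = self.config_buffer
            self.config_buffer = None


mmc = MemoryManagementCircuit()
mmc.stage_config("graph_A")   # idle, so graph_A becomes active immediately
mmc.issue(2)                  # two results expected under graph_A
mmc.stage_config("graph_B")   # prefetched while graph_A is still running
mmc.complete(10)
mmc.complete(20)              # counter reaches zero, graph_B is applied
print(mmc.active_config, list(mmc.completion_buffer))
```

Because the staged configuration and the results live in separate fields, staging `graph_B` never disturbs results still being collected for `graph_A`, which is the point of keeping the first buffer circuit separate from the second.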
Description
BACKGROUND
Spatial accelerators, such as the Intel® Configurable Spatial Accelerator (CSA), are specialized hardware architectures designed to improve performance and energy efficiency for specific computational workloads. Unlike traditional processors that execute instructions sequentially, spatial accelerators implement computation by mapping dataflow graphs directly onto reconfigurable hardware fabric. These accelerators typically comprise an array of processing elements (PEs) interconnected through a configurable network, allowing data to flow spatially across the architecture rather than being shuttled back and forth to memory.
Programs are executed on spatial accelerators by first compiling the high-level code into a dataflow graph representation that explicitly captures the parallelism and data dependencies in the computation. This dataflow graph is then mapped onto the accelerator's fabric, where nodes become processing elements and edges become data channels. The compiler configures the accelerator hardware to implement the specific operations and routing required for the program. During execution, data streams through the configured fabric in a pipelined fashion, with multiple operations proceeding concurrently as data becomes available, thus eliminating much of the overhead associated with instruction fetch and decode in traditional architectures.
Configurable architectures, such as spatial accelerators, differ from Von Neumann architectures in that they have a discrete configuration operation, as opposed to being reconfigured on each instruction decoding. Making configuration faster is a key figure of merit in these architectures, as it increases the number of scenarios in which they can be deployed.
BRIEF DESCRIPTION OF THE FIGURES
Some examples of apparatuses and/or methods will be described in the following by way of example only, and with reference to the accompanying figures, in which: FIG. 1a shows a schematic diagram of a tile of a spatial accelerator semiconductor apparatus or semiconductor device; FIG. 1b shows a schematic diagram of a computer system comprising a spatial accelerator semiconductor apparatus or semiconductor device with a plurality of tiles; FIG. 1c shows a flowchart of a method for a computer system; FIGS. 2a and 2b show illustrations of completion buffer microarchitecture alternatives; FIG. 3 shows a diagram of an example of the physical organization of a configuration completion buffer; FIG. 4 shows a timing diagram depicting result-based triggering in combination with the configuration prefetching enabled by a separate completion buffer; FIG. 5 shows an illustration of result broadcast regions for sub-tiles; FIG. 6 shows an illustration of tdelay relative to an existing execution; FIG. 7 shows tdelay vs. observed configuration latency assuming a perfect CSA cache; FIG. 8 shows tdelay vs. observed configuration latency assuming no cache locality and configuration is off-chip; FIG. 9 shows an observed configuration latency vs. completion buffer entries, assuming a perfect CSA cache; FIG. 10 shows observed configuration latency vs. completion buffer entries, assuming configuration is in on-chip memory; FIG. 11 shows a block diagram of a RAF with separate buffers; FIG. 12 shows a timing flow; and FIG. 13 shows a schematic diagram of a computer system.
DETAILED DESCRIPTION
Some examples are now described in more detail with reference to the enclosed figures. However, other possible examples are not limited to the features of these embodiments described in detail. Other examples may include modifications of the features as well as equivalents and alternatives to the features. Furthermore, the terminology used herein to describe certain examples should not be restrictive of further possible examples.
Throughout the description of the figures, same or similar reference numerals refer to same or similar elements and/or features, which may be identical or implemented in a modified form while providing the same or a similar function. The thickness of lines, layers, and/or areas in the figures may also be exaggerated for clarification. When two elements A and B are combined using an “or”, this is to be understood as disclosing all possible combinations, i.e. only A, only B as well as A and B, unless expressly defined otherwise in the individual case. As an alternative wording for the same combinations, “at least one of A and B” or “A and/or B” may be used. This applies equivalently to combinations of more than two elements. If a singular form, such as “a”, “an” and “the” is used and the use of only a single element is not defined as mandatory either explicitly or implicitly, further examples may also use several elements to implement the same function. If a function is described below as implemented using multiple elements, further examples may implement the same function using a single element or a single processing entity. It is further understood that the terms “include”, “in