US-12619434-B2 - Programmable fabric-based instruction set architecture for a processor
Abstract
A semiconductor device may include a programmable fabric and a processor. The processor may utilize one or more extension architectures. At least one of these extension architectures may be used to integrate and/or embed the programmable fabric into the processor as part of the processor. Specifically, a buffer of the extension architecture may be used to load data to and store data from the programmable fabric.
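The buffer-mediated data flow summarized above can be modeled in a few lines of software. This is a minimal illustrative sketch, not the patented hardware: the names `TileBuffer` and `fabric_process` are assumptions introduced here, and the processor-side load/store and fabric-side pull/process/write-back roles are taken from the abstract.

```python
# Illustrative software model of the extension-architecture buffer
# described in the abstract: the processor loads data into the buffer,
# the programmable fabric pulls it, processes it, and stores results
# back for the processor to read. All names are assumptions.

class TileBuffer:
    """Data buffer shared between the processor and the programmable fabric."""
    def __init__(self):
        self.tiles = {}

    def load(self, tile_id, data):
        # Processor side: load data into the buffer for the fabric.
        self.tiles[tile_id] = data

    def store(self, tile_id):
        # Processor side: read processed data back out of the buffer.
        return self.tiles[tile_id]

def fabric_process(buffer, tile_id, fn):
    """Fabric side: pull a tile from the buffer, process it, and
    store the processed result back to the same buffer slot."""
    buffer.tiles[tile_id] = fn(buffer.tiles[tile_id])
```

In this model the buffer is the only interface between the two sides, mirroring the abstract's statement that the extension architecture's buffer is used both to load data to and to store data from the programmable fabric.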
Inventors
- Dheeraj Subbareddy
- Anshuman Thakur
- Ankireddy Nalamalpu
- Md Altaf Hossain
Assignees
- Altera Corporation
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2021-06-25
Claims (20)
- 1. An integrated circuit device comprising:
  a processor comprising:
    a register to store information that is processed in the processor;
    a decode unit to decode instructions for the processor;
    extension architecture coupled to the decode unit that is configured to receive the instructions from the decode unit, wherein the extension architecture comprises:
      a first computation grid comprising a grid of fused-multiply-add circuits tuned to a first format of data;
      a second computation grid comprising a programmable fabric that receives data in a second format of data different than the first format and pre-loads multiple configurations for multiple extension architectures upon boot of the processor; and
      control circuitry to receive the decoded instructions as a command, select the first computation grid or the second computation grid, and selectively transmit the command to the first computation grid or the second computation grid, wherein the selection of the first computation grid or the second computation grid is based on the command; and
    an execution unit coupled with the decode unit, wherein the execution unit, in response to the instructions, performs operations comprising:
      receiving a microcode update with a configuration bitstream carrying at least one configuration of the multiple configurations that, when loaded into the programmable fabric, configures the programmable fabric to perform one or more functions using the at least one configuration;
      validating the microcode update by matching the configuration bitstream to a processor identification (CPUID) corresponding to the processor; and
      storing the configuration bitstream to the programmable fabric.
- 2. The integrated circuit device of claim 1, wherein the extension architecture comprises an advanced matrix extension (AMX) architecture to integrate the second computation grid into the processor.
- 3. The integrated circuit device of claim 2, wherein the register comprises a two-dimensional register associated with the AMX architecture.
- 4. The integrated circuit device of claim 1, wherein the second format of data comprises INT8.
- 5. The integrated circuit device of claim 1, wherein the first computation grid comprises an accelerator that the processor uses to offload some processing from a core of the processor.
- 6. The integrated circuit device of claim 1, wherein the programmable fabric of the second computation grid converts the data from the second format to the first format or a third format.
- 7. The integrated circuit device of claim 1, wherein the second computation grid comprises an accelerator that the processor uses to offload some processing from a core of the processor.
- 8. The integrated circuit device of claim 1, wherein the second computation grid is disposed monolithically on a piece of semiconductor that also has other portions of the processor disposed thereon.
- 9. The integrated circuit device of claim 1, wherein the second computation grid comprises a die that is separate from one or more dies hosting other portions of the processor.
- 10. A semiconductor device, comprising:
  a first computation grid comprising a programmable fabric; and
  a processor comprising:
    a plurality of cores; and
    extension architecture that is configured to receive a command from a core of the plurality of cores, wherein the extension architecture comprises:
      a data buffer;
      a second computation grid comprising a grid of fused-multiply-add circuits tuned to a first format of data, wherein the programmable fabric receives data in a second format of data different than the first format and pre-loads multiple configurations for multiple extension architectures upon boot of the processor; and
      control circuitry to receive the command, select the first computation grid or the second computation grid, and selectively transmit the command to the first computation grid or the second computation grid, wherein the selection of the first computation grid or the second computation grid is based on a type of the command; and
    a decode unit to decode instructions for the processor;
    an execution unit coupled with the decode unit, wherein the execution unit, in response to the instructions, performs operations comprising:
      receiving a microcode update with a configuration bitstream carrying at least one configuration of the multiple configurations that, when loaded into the programmable fabric, configures the programmable fabric to perform one or more functions using the at least one configuration;
      validating the microcode update by matching the configuration bitstream to a processor identification (CPUID) corresponding to the processor; and
      storing the configuration bitstream to the programmable fabric.
- 11. The semiconductor device of claim 10, wherein the processor comprises a data cache unit that loads tile data into the data buffer that is to be processed in the first computation grid or the second computation grid and stores data from the data buffer that has been processed in the first computation grid or the second computation grid.
- 12. The semiconductor device of claim 10, wherein the first computation grid: pulls data from the data buffer; processes the data; and stores the processed data to the data buffer.
- 13. The semiconductor device of claim 12, wherein pulling data from the data buffer comprises translating from a first clock domain of the processor to a second clock domain of the programmable fabric.
- 14. The semiconductor device of claim 10, wherein the second computation grid: pulls data from the data buffer; processes the data; and stores the processed data to the data buffer.
- 15. The semiconductor device of claim 10, wherein the data buffer comprises a two-dimensional buffer for storing matrices in two-dimensional registers of the data buffer.
- 16. The semiconductor device of claim 10, wherein the extension architecture comprises architecture for an advanced matrix extension (AMX) architecture of the processor that integrates the first computation grid into the processor.
- 17. The semiconductor device of claim 10, wherein the first computation grid is logically embedded in the extension architecture as logically part of the processor.
- 18. The semiconductor device of claim 17, wherein the extension architecture and the first computation grid are implemented on different semiconductor dies.
- 19. The semiconductor device of claim 17, wherein the extension architecture and the first computation grid are implemented monolithically on a same semiconductor chip.
- 20. A method comprising:
  loading multiple configurations for multiple extension architectures upon boot of a processor into a programmable fabric;
  receiving, at control circuitry of extension architecture of the processor, a command from a core of the processor;
  based on a command type of the command, identifying a first computation grid or a second computation grid as a target of the command;
  based on identification of the first computation grid as the target of the command, transmitting the command to the programmable fabric of the first computation grid, wherein the command pertains to data in a first format that is different than a second format to which the second computation grid of the processor is tuned;
  receiving results for the command from the programmable fabric after conversion to the second format;
  transmitting the results to the core;
  receiving, at an execution unit of the processor, a microcode update with a configuration bitstream carrying at least one configuration of the multiple configurations that, when loaded into the programmable fabric, configures the programmable fabric to perform one or more functions using the at least one configuration;
  validating, by the processor, the microcode update by matching the configuration bitstream to a processor identification (CPUID) corresponding to the processor; and
  storing, by the processor, the configuration bitstream to the programmable fabric.
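The dispatch and update flow recited in the claims can be illustrated with a small software model. This is a hedged sketch, not the patented hardware design: the class and function names (`ProgrammableFabric`, `FmaGrid`, `ControlCircuitry`, `apply_microcode_update`), the command-type strings, and the string-comparison CPUID check are all illustrative assumptions standing in for the control circuitry, the two computation grids, and the execution unit's validation step.

```python
# Minimal software model of the claimed flow: control circuitry selects
# a computation grid based on the command type, and a microcode update
# carrying a configuration bitstream is validated against the processor's
# CPUID before being stored to the programmable fabric. All names and
# mechanisms here are illustrative assumptions, not the actual hardware.

class ProgrammableFabric:
    """Computation grid backed by a programmable fabric; holds
    configurations pre-loaded at boot or delivered by microcode update."""
    def __init__(self):
        self.configurations = {}

    def load_bitstream(self, name, bitstream):
        self.configurations[name] = bitstream

    def execute(self, command):
        return f"fabric:{command}"

class FmaGrid:
    """Grid of fused-multiply-add circuits tuned to a fixed data format."""
    def execute(self, command):
        return f"fma:{command}"

class ControlCircuitry:
    """Receives a command and selectively transmits it to one grid."""
    def __init__(self, fma_grid, fabric):
        self.fma_grid = fma_grid
        self.fabric = fabric

    def dispatch(self, command, command_type):
        # Selection is based on the (type of the) command.
        target = self.fma_grid if command_type == "fma" else self.fabric
        return target.execute(command)

def apply_microcode_update(fabric, bitstream, bitstream_cpuid, processor_cpuid):
    """Validate the update by matching the bitstream to the processor's
    CPUID, then store the configuration bitstream to the fabric."""
    if bitstream_cpuid != processor_cpuid:
        raise ValueError("bitstream does not match processor CPUID")
    fabric.load_bitstream("update", bitstream)
```

In this sketch a mismatched CPUID rejects the update before anything reaches the fabric, mirroring the claimed ordering of validating the microcode update before storing the configuration bitstream.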
Description
BACKGROUND

This disclosure relates to providing a more flexible instruction set architecture for a processor by incorporating a programmable fabric into the architecture of the processor.

This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present techniques, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be noted that these statements are to be read in this light, and not as admissions of any kind.

Integrated circuits are found in numerous electronic devices, including handheld devices, computers, gaming systems, robotic devices, automobiles, and more. Some integrated circuits, such as central processing units (CPUs) and/or microprocessors (μP), may use offload computing and/or acceleration, relying on other devices (e.g., programmable logic devices) to assist the CPU/μP in performing certain operations. However, certain compute models for implementing offloading may be limited by latency, memory coherency, or flexibility issues in the implementations used to provide the acceleration. For instance, the implementations may include an Ethernet-based accelerator, a peripheral component interconnect express (PCIE)-based accelerator, an Ultra Path Interconnect (UPI)-based accelerator, an Intel Accelerator Link (IAL)-based accelerator, or a cache coherent interconnect for accelerators (CCIX)-based accelerator. At least some of these interconnects may have a high latency relative to the latency in the CPU/μP, inflexibility of usage, and/or a lack of memory coherency. For instance, PCIE/Ethernet-based implementations may have a relatively long latency (e.g., 100 μs) relative to the latency in the CPU/μP.
Furthermore, the PCIE/Ethernet-based implementations may lack memory coherency. UPI/IAL/CCIX-based accelerators may have a lower latency (e.g., 1 μs) than the PCIE/Ethernet implementations while maintaining coherency, but they may offer limited flexibility in fine-grained memory sharing. For instance, UPI/IAL/CCIX-based accelerators must first be integrated into core software before they can be used.

BRIEF DESCRIPTION OF THE DRAWINGS

Various aspects of this disclosure may be better understood upon reading the following detailed description and upon reference to the drawings, in which:
- FIG. 1 is a block diagram of a register architecture, in accordance with an embodiment;
- FIG. 2A is a block diagram illustrating an in-order pipeline and a register renaming, out-of-order issue/execution pipeline, in accordance with an embodiment;
- FIG. 2B is a block diagram illustrating an in-order architecture core and a register renaming, out-of-order issue/execution architecture core to be included in a processor, in accordance with an embodiment;
- FIGS. 3A and 3B illustrate a block diagram of a more specific example in-order core architecture, which core would be one of several logic blocks (including other cores of the same type and/or different types) in a chip, in accordance with an embodiment;
- FIG. 4 is a block diagram of a processor that may have more than one core, may have an integrated memory controller, and may have integrated graphics, in accordance with an embodiment;
- FIG. 5 is a block diagram of a system, in accordance with an embodiment;
- FIG. 6 is a block diagram of a first more specific example system, in accordance with an embodiment;
- FIG. 7 is a block diagram of a system on a chip (SoC), in accordance with an embodiment;
- FIG. 8 is a block diagram of a process for programming an integrated circuit including a programmable fabric, in accordance with an embodiment;
- FIG. 9 is a diagram of the programmable fabric of FIG. 8, in accordance with an embodiment;
- FIG. 10 is a diagram of a processor architecture including the programmable fabric of FIG. 9 in a monolithic arrangement, in accordance with an embodiment;
- FIG. 11 is a diagram of a processor architecture including the programmable fabric of FIG. 9 with the processor and the programmable fabric located on separate silicon substrates, in accordance with an embodiment;
- FIG. 12 is a flow diagram of a method for receiving, authenticating, and storing a configuration bitstream to configure the programmable fabric of FIGS. 9 and/or 10, in accordance with an embodiment;
- FIG. 13 is a flow diagram of a method for performing task switching for a multitasking operating system, in accordance with an embodiment;
- FIG. 14 is a block diagram of a data processing system including a processor with an integrated programmable fabric unit, in accordance with an embodiment;
- FIG. 15 is a block diagram of a process for logically embedding an FPGA computation grid into a processor, in accordance with an embodiment; and
- FIG. 16 is a block diagram for saving a state of the FPGA