US-12619574-B2 - Configuration data store in a reconfigurable data processor having two access modes

Abstract

A reconfigurable processor is disclosed, featuring an array of configurable units interconnected by a bus system. Each configurable unit includes a configuration data store structured as a shift register that includes individually addressable argument registers. Program load logic is responsible for receiving sub-files of configuration data through the bus system and sequentially shifting them into the configuration data store, including the argument registers. Argument load logic is designed to receive argument data via the bus system and directly load it into the argument registers without the need for shifting through the shift register.
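
The two access modes described above can be illustrated with a minimal Python sketch. This is a toy model under stated assumptions, not the disclosed hardware: the class name `ConfigDataStore`, the bit-list representation, the register layout as `(start_bit, width)` pairs, and the method names are all hypothetical and chosen only for illustration.

```python
class ConfigDataStore:
    """Toy model of a configuration data store with two access modes."""

    def __init__(self, total_bits, arg_regs):
        # arg_regs maps a register ID to (start_bit, width). The layout is a
        # hypothetical example; the regions are assumed not to overlap.
        self.bits = [0] * total_bits
        self.arg_regs = dict(arg_regs)

    def shift_in(self, chunk):
        """Program load mode: shift bits in serially, so every stored bit moves."""
        for b in chunk:
            self.bits = [b & 1] + self.bits[:-1]

    def load_argument(self, reg_id, value):
        """Argument load mode: overwrite one argument register in place."""
        start, width = self.arg_regs[reg_id]
        if value < 0 or value >> width:
            raise ValueError("value does not fit in the argument register")
        for i in range(width):
            self.bits[start + i] = (value >> i) & 1

    def read_argument(self, reg_id):
        """Return the current contents of one argument register."""
        start, width = self.arg_regs[reg_id]
        return sum(self.bits[start + i] << i for i in range(width))


# Usage: fill the whole store by shifting, then update a single argument.
store = ConfigDataStore(16, {0: (0, 4), 1: (8, 4)})
store.shift_in([1] * 16)        # program load touches every bit
store.load_argument(1, 0b0101)  # argument load touches only register 1
```

In this model a direct argument load leaves every bit outside the addressed register untouched, which is the property that makes it cheaper than reloading the store by shifting.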

Inventors

  • Manish K. Shah
  • Gregory Frederick Grohoski

Assignees

  • SambaNova Systems, Inc.

Dates

Publication Date: 2026-05-05
Application Date: 2024-07-17

Claims (20)

  1. A reconfigurable processor that includes an array of configurable units connected by a bus system, a configurable unit in the array of configurable units comprising: a configuration data store to store configuration data, organized as a shift register and including individually addressable argument registers respectively comprising non-overlapping portions of the shift register; program load logic to receive sub-files of the configuration data via the bus system and to load the received sub-files into the configuration data store, including the argument registers, by sequentially shifting the received sub-files into the shift register; and argument load logic to receive argument data via the bus system and load the received argument data into the argument registers without shifting the received argument data through the shift register.
  2. The reconfigurable processor of claim 1, the reconfigurable processor further comprising: a program load controller associated with the array to respond to a program load command by executing a program load process, including sending a first signal to the configurable unit and subsequently distributing a configuration file comprising the sub-files of configuration data to the configurable unit in the array as specified in the configuration file; and a fast argument load (FAL) controller associated with the array to respond to an FAL command by executing an FAL process, including sending a second signal to the configurable unit, and subsequently distributing (value, control) tuples to the configurable unit as specified in an argument load file.
  3. The reconfigurable processor of claim 2, wherein the argument load file includes a list of (value, control) tuples specifying values to be written to argument registers, the list containing a (value, control) tuple for argument registers to be written by the FAL controller during a single invocation of the FAL process.
  4. The reconfigurable processor of claim 3, wherein a (value, control) tuple includes a value to be written to an argument register and a control indicating an ID of the argument register to be written and a destination identification of a target configurable unit in the array of configurable units containing the argument register to be written.
  5. The reconfigurable processor of claim 4, wherein the destination identification identifies a row containing the target configurable unit, a column containing the target configurable unit, and a type of the target configurable unit, the type being one of memory unit, compute unit, switch, or interface unit.
  6. The reconfigurable processor of claim 1, wherein the configurable units in the array of configurable units are further connected in an interconnect topology, separate from, and in addition to, the bus system, the interconnect topology comprising a daisy chain used by the configurable unit to indicate completion of at least a portion of loading the received sub-files of the configuration data or loading the received argument data.
  7. The reconfigurable processor of claim 2, wherein the array of configurable units is associated with a multi-bit program control register, selected bits of which are writeable to trigger execution of a process selected from among multiple processes, the multiple processes including the FAL process and the program load process.
  8. The reconfigurable processor of claim 7, wherein the FAL controller is configured, upon completion of the FAL process, to clear an FAL process bit of the program control register that had been written to trigger execution of the FAL process, and the program load controller is configured, upon completion of the program load process, to clear a program load bit of the program control register that had been written to trigger execution of the program load process.
  9. The reconfigurable processor of claim 2, wherein the FAL controller is associated with an argument load address register, an argument load size register, and one or more argument load bits of a multi-bit program control register, wherein the FAL controller is configured to recognize a write to at least one of the argument load address register, the argument load size register, or an argument load bit of the one or more argument load bits of the multi-bit program control register.
  10. The reconfigurable processor of claim 9, wherein the FAL controller is configured to begin the FAL process by broadcasting the second signal to the configurable units in the array of configurable units in order to place the configurable units into an argument load state.
  11. The reconfigurable processor of claim 10, wherein the FAL controller, once the second signal has been received by all of the configurable units in the array of configurable units, begins to retrieve the argument load file by issuing a memory access request to a physical address of the argument load file as stored in the argument load address register, and receives data of the argument load file in response to the memory access request.
  12. The reconfigurable processor of claim 11, wherein the FAL controller is configured to distribute (value, control) tuples from the argument load file to the configurable units in the array of configurable units over a vector network of the bus system, one (value, control) tuple at a time, and to receive response packets with a control bit set in response to distributed (value, control) tuples over a scalar network of the bus system.
  13. The reconfigurable processor of claim 11, wherein the FAL controller is configured to continue reading the argument load file one block of data at a time until as many (value, control) tuples as are specified in the argument load size register have been read and distributed, wherein the block of data contains a plurality of (value, control) tuples.
  14. The reconfigurable processor of claim 13, wherein the FAL controller is configured to maintain a count of unprocessed (value, control) tuples sent to configurable units in the array of configurable units that have not yet been processed by the configurable units.
  15. The reconfigurable processor of claim 14, wherein a (value, control) tuple is routed over a first network of the bus system to a row and column destination of a configurable unit as specified in the (value, control) tuple using dimension-order routing, wherein a row dimension is traversed before a column dimension; and a configurable unit that receives a (value, control) tuple while in the argument load state is configured to load data contained in the (value, control) tuple into the argument register indexed by a register ID contained in the (value, control) tuple, the configurable unit configured to subsequently report completion of the argument load by sending a response packet with a control bit set to the FAL controller over a second network of the bus system.
  16. The reconfigurable processor of claim 15, wherein the FAL controller is configured to decrement the count of unprocessed (value, control) tuples sent with every scalar response packet with a control bit set received from configurable units over the second network.
  17. The reconfigurable processor of claim 16, wherein once all of the (value, control) tuples specified in the argument load size register have been read from the argument load file, and once the count of unprocessed (value, control) tuples sent reaches zero, the FAL process is complete, and the FAL controller deasserts the second signal, sets an argument load complete bit of a tile status register, and generates an interrupt.
  18. The reconfigurable processor of claim 17, wherein the FAL controller, once the FAL process is complete, returns either to an idle state or to an execute state, depending on which bit of the program control register had been written to initiate the FAL process.
  19. The reconfigurable processor of claim 2, wherein the argument load logic in the configurable unit is configured to cause a component state machine in the configurable unit to transition from a current state to an argument load state in response to receiving the second signal; the FAL controller performs the FAL process without sending an indication of how many argument registers, M, will be loaded to the configurable unit before distributing the (value, control) tuples to the configurable unit; and the configurable unit accepts and processes (value, control) tuples received as long as it is in the argument load state, wherein M is a non-negative integer value.
  20. The reconfigurable processor of claim 19, wherein the FAL controller is configurable to perform an alternate argument load process that uses an alternate argument load file containing: a first section specifying a respective number of argument registers, M, to be written in each configurable unit in the array of configurable units; and a second section listing (value, control) tuples for argument registers that need to be written during the alternate argument load process.
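
The FAL process recited in claims 2 through 17 can be sketched informally in Python. This is a simplified software model under stated assumptions, not the claimed hardware: the names `ArgTuple`, `Unit`, and `run_fal` are hypothetical, the control part of the tuple is shown as separate fields rather than a packed bit layout, the bus networks and dimension-order routing are replaced by a dictionary lookup, and sends and responses happen sequentially rather than overlapping.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class ArgTuple:
    """One (value, control) tuple; the control fields are shown unpacked."""
    value: int
    reg_id: int      # ID of the argument register to write
    row: int         # destination row
    col: int         # destination column
    unit_type: str   # "memory", "compute", "switch", or "interface"


class Unit:
    """Hypothetical configurable unit holding argument registers."""

    def __init__(self):
        self.arg_regs = {}
        self.in_arg_load_state = False

    def receive(self, t):
        # Tuples are accepted only while in the argument load state.
        if not self.in_arg_load_state:
            raise RuntimeError("unit is not in the argument load state")
        self.arg_regs[t.reg_id] = t.value
        return True  # stands in for a response packet with its control bit set


def run_fal(arg_file, size, units):
    """Distribute the first `size` tuples and report whether the load completed."""
    for u in units.values():
        u.in_arg_load_state = True   # models broadcasting the second signal
    outstanding = 0                  # count of unprocessed tuples sent
    responses = deque()
    for t in arg_file[:size]:        # `size` models the argument load size register
        target = units[(t.row, t.col, t.unit_type)]
        outstanding += 1
        responses.append(target.receive(t))
    while responses:
        if responses.popleft():      # each response decrements the count
            outstanding -= 1
    for u in units.values():
        u.in_arg_load_state = False  # models deasserting the second signal
    return outstanding == 0          # models the argument load complete bit


# Usage: two units, three tuples in the file, size register set to 2.
units = {(0, 0, "compute"): Unit(), (1, 2, "memory"): Unit()}
arg_file = [
    ArgTuple(7, 3, 0, 0, "compute"),
    ArgTuple(9, 1, 1, 2, "memory"),
    ArgTuple(5, 0, 0, 0, "compute"),
]
done = run_fal(arg_file, 2, units)
```

Note that the units are never told how many tuples to expect; as in claim 19, they simply process whatever arrives while in the argument load state, and the controller alone decides when the process is complete.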

Description

CROSS-REFERENCES AND INCORPORATIONS

This application is a continuation of U.S. Non-Provisional patent application Ser. No. 18/105,187, filed on Feb. 2, 2023, titled “Fast Argument Load in a Reconfigurable Data Processor,” which claims the benefit of U.S. Provisional Patent Application No. 63/308,246, filed on Feb. 9, 2022, entitled “Fast Argument Load.” Both aforementioned applications are hereby incorporated by reference for all purposes.

This application is further related to the following patent applications, which are hereby incorporated by reference for all purposes:

  • U.S. Nonprovisional patent application Ser. No. 16/197,826, now U.S. Pat. No. 10,831,507 B2, filed Nov. 21, 2018, entitled “Configuration Load of a Reconfigurable Data Processor;”
  • U.S. Nonprovisional patent application Ser. No. 16/536,192, now U.S. Pat. No. 11,080,227 B2, filed Aug. 8, 2019, entitled “Compiler Flow Logic for Reconfigurable Architectures;”
  • U.S. Nonprovisional patent application Ser. No. 16/996,666, filed Aug. 18, 2020, entitled “Runtime Patching of Configuration Files;”
  • U.S. Nonprovisional patent application Ser. No. 17/322,697, filed May 17, 2021, entitled “Quiesce Reconfigurable Data Processor;”
  • U.S. Nonprovisional patent application Ser. No. 18/105,189, filed Feb. 2, 2023, entitled “A Reconfigurable Data Processor with Fast Argument Load using a Runtime Program on a Host Processor.”

The following are incorporated by reference for all purposes:

  • Prabhakar et al., “Plasticine: A Reconfigurable Architecture for Parallel Patterns,” ISCA '17, Jun. 24-28, 2017, Toronto, ON, Canada; and
  • Koeplinger et al., “Spatial: A Language and Compiler for Application Accelerators,” Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), Proceedings of the 43rd International Symposium on Computer Architecture, 2018.

BACKGROUND

Technical Field

The technology disclosed relates to loading argument registers in a coarse-grained reconfigurable architecture processor from a host processor during runtime.

Context

The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves can also correspond to implementations of the claimed technology.

Reconfigurable processors, including field programmable gate arrays (FPGAs), can be configured to implement a variety of functions more efficiently or faster than might be achieved using a general-purpose processor executing a computer program. So-called coarse-grained reconfigurable architectures (CGRAs) are being developed in which the configurable units in the array are more complex than those used in typical, more fine-grained FPGAs, and may enable faster or more efficient execution of various classes of functions. For example, CGRAs have been proposed that can enable implementation of energy-efficient accelerators for machine learning and artificial intelligence workloads. See Prabhakar, et al. as referenced above.

Configuration of reconfigurable processors involves compilation of a configuration description to produce a configuration file, referred to sometimes as a bitstream or bit file, and distributing the configuration file to the configurable units on the processor. To start a process, the configuration file must be loaded for that process. To change a process, the configuration file must be replaced with the new configuration file.
The configuration file can include parameters or arguments for use by the graphs implemented by the configuration file once loaded into the coarse-grained reconfigurable (CGR) units. The locations holding these arguments may be updated more often than other parts of the configuration file, so it is inefficient to replace the entire configuration file just to update an argument.

BRIEF DESCRIPTION OF THE DRAWINGS

The technology will be described with reference to the drawings, in which:

FIG. 1 illustrates an example system including a coarse-grained reconfigurable (CGR) processor, a host, and a memory.
FIG. 2 illustrates an example of a computer, including an input device, a processor, a storage device, and an output device.
FIG. 3 illustrates example details of a CGR architecture including a top-level network (TLN) and two CGR arrays.
FIG. 4 illustrates an example CGR array, including an array of CGR units in an array-level network (ALN).
FIG. 5 illustrates an example of a pattern memory unit (PMU) and a pattern compute unit (PCU), which may be combined in a fused compute and memory unit (FCMU).
FIG. 6 shows an example (value, control) tuple to specify data to be loaded into a particular argument register in the CGR processor.
FIG. 7 illustrates an example of a configuration data store organized as a shift