DE-112024000977-B4 - Cache control for storing register data
Abstract
An apparatus, comprising: processor pipeline circuitry (220-260) configured to perform operations on register operand data, wherein the processor pipeline circuitry (220-260) includes decode circuitry (230) configured to decode instructions for execution; memory hierarchy circuitry (845) configured to provide memory backing for register operand data in one or more cache circuits (210, 670, 685); and lock circuitry (215) configured to: control a first set of lock indicators for a set of registers for a first thread, including asserting one or more lock indicators for registers indicated by the decode circuitry (230) as being used by decoded instructions of the first thread; retain register operand data in the one or more cache circuits (210, 670, 685), including preventing eviction of a given cache line from a cache circuit of the one or more cache circuits (210, 670, 685) based on an asserted lock indicator, wherein the asserted lock indicator is for a register for which operand data is stored in the given cache line; and clear the first set of lock indicators in response to a reset event.
Inventors
- Jonathan M. Redshaw
- Winnie W. Yeung
- Benjiman L. Goodman
- David K. Li
- Zelin Zhang
- Yoong Chert Foo
Assignees
- APPLE INC.
Dates
- Publication Date
- 20260513
- Application Date
- 20240209
- Priority Date
- 20230223
Claims (20)
- An apparatus, comprising: processor pipeline circuitry (220-260) configured to perform operations on register operand data, wherein the processor pipeline circuitry (220-260) includes decode circuitry (230) configured to decode instructions for execution; memory hierarchy circuitry (845) configured to provide memory backing for register operand data in one or more cache circuits (210, 670, 685); and lock circuitry (215) configured to: control a first set of lock indicators for a set of registers for a first thread, including asserting one or more lock indicators for registers indicated by the decode circuitry (230) as being used by decoded instructions of the first thread; retain register operand data in the one or more cache circuits (210, 670, 685), including preventing eviction of a given cache line from a cache circuit of the one or more cache circuits (210, 670, 685) based on an asserted lock indicator, wherein the asserted lock indicator is for a register for which operand data is stored in the given cache line; and clear the first set of lock indicators in response to a reset event.
- The apparatus of claim 1, wherein the lock circuitry (215) is further configured to: control a second set of lock indicators for the set of registers for the first thread; and switch from the first set of lock indicators to the second set of lock indicators in response to the reset event.
- The apparatus of claim 2, wherein the processor pipeline circuitry (220-260) is configured to forward, through at least one scheduling pipeline stage (640), information that identifies the set of lock indicators corresponding to a given operation.
- The apparatus of claim 3, wherein the processor pipeline circuitry (220-260) is configured to perform a barrier operation (655) in response to the reset event, such that all operations using the first set of lock indicators reach a pipeline stage before operations using the second set of lock indicators pass the pipeline stage.
- The apparatus of any of claims 1 to 4, wherein the reset event corresponds to a threshold number of registers in the set of locked registers.
- The apparatus of any of claims 1 to 4, wherein the reset event corresponds to a threshold number of stall cycles of the first thread in a pipeline stage.
- The apparatus of any of claims 1 to 4, wherein the reset event corresponds to a compiler hint.
- The apparatus of any of the preceding claims, further comprising: operand cache circuitry (648) configured to: store register operand data; maintain last-use indicators for one or more operand cache entries; and control circuitry configured to flush operand data from one or more operand cache entries marked as last used to the memory hierarchy circuitry (845) in response to the reset event.
- The apparatus of any of the preceding claims, further comprising: scoreboard circuitry configured to track which architectural registers are stored at a first cache level, wherein the lock circuitry (215) is configured to assert a lock indicator for a register in response to an allocation request for which the scoreboard circuitry indicates that operand data for the corresponding register is stored at the first cache level.
- The apparatus of any of the preceding claims, wherein the apparatus is a computing device (800) that further comprises: a central processing unit (820); a display (865); and network interface circuitry (850).
- The apparatus of any of the preceding claims, wherein the processor pipeline circuitry (220-260) includes a plurality of single-instruction multiple-data pipelines configured to execute instructions; and wherein the apparatus further includes fixed-function circuitry configured to control the single-instruction multiple-data pipelines to perform operations for at least one of the following program types: graphics shader programs; and machine learning programs.
- A method, comprising: decoding, by a computing system, instructions for execution, wherein the instructions specify operations on register operand data; providing, by the computing system, memory backing for register operand data in one or more caches; controlling, by the computing system, a first set of lock indicators for a set of registers for a first thread, including asserting one or more lock indicators for registers indicated as being used by decoded instructions of the first thread; retaining, by the computing system, register operand data in the one or more caches, including preventing eviction of a given cache line from a cache of the one or more caches based on an asserted lock indicator, wherein the asserted lock indicator is for a register for which operand data is stored in the given cache line; and clearing, by the computing system, the first set of lock indicators in response to a reset event.
- The method of claim 12, further comprising: controlling a second set of lock indicators for the set of registers for the first thread; and switching from the first set of lock indicators to the second set of lock indicators in response to the reset event.
- The method of any of claims 12 to 13, wherein the reset event corresponds to one or more of the following reset events: a threshold number of registers in the set of locked registers; and a threshold number of stall cycles of the first thread in a pipeline stage.
- A non-transitory computer-readable storage medium on which design information (1015) is stored, the design information specifying a design of at least a portion of a hardware integrated circuit in a format recognized by a semiconductor manufacturing system (1020) that is configured to use the design information (1015) to produce the circuit according to the design, wherein the design information (1015) specifies that the circuit includes: processor pipeline circuitry (220-260) configured to perform operations on register operand data, wherein the processor pipeline circuitry (220-260) includes decode circuitry (230) configured to decode instructions for execution; memory hierarchy circuitry (845) configured to provide memory backing for register operand data in one or more cache circuits (210, 670, 685); and lock circuitry (215) configured to: control a first set of lock indicators for a set of registers for a first thread, including asserting one or more lock indicators for registers indicated by the decode circuitry (230) as being used by decoded instructions of the first thread; retain register operand data in the one or more cache circuits (210, 670, 685), including preventing eviction of a given cache line from a cache circuit of the one or more cache circuits (210, 670, 685) based on an asserted lock indicator, wherein the asserted lock indicator is for a register for which operand data is stored in the given cache line; and clear the first set of lock indicators in response to a reset event.
- The non-transitory computer-readable storage medium of claim 15, wherein the lock circuitry (215) is configured to: control a second set of lock indicators for the set of registers for the first thread; and switch from the first set of lock indicators to the second set of lock indicators in response to the reset event.
- The non-transitory computer-readable storage medium of claim 16, wherein the processor pipeline circuitry (220-260) is configured to: forward, through at least one scheduling pipeline stage (640), information identifying the set of lock indicators corresponding to a given operation; and perform a barrier operation (655) in response to the reset event, such that all operations using the first set of lock indicators reach a pipeline stage before operations using the second set of lock indicators pass the pipeline stage.
- The non-transitory computer-readable storage medium of any of claims 15 to 17, wherein the reset event corresponds to a threshold number of registers in the set of locked registers.
- The non-transitory computer-readable storage medium of any of claims 15 to 18, wherein the circuit further includes: operand cache circuitry (648) configured to: store register operand data; maintain last-use indicators for one or more operand cache entries; and control circuitry configured to flush operand data from one or more operand cache entries marked as last used to the memory hierarchy circuitry (845) in response to the reset event.
- The non-transitory computer-readable storage medium (1010) of any of claims 15 to 19, wherein: the circuit further includes scoreboard circuitry configured to track which architectural registers are stored at a first cache level; and the lock circuitry (215) is configured to assert a lock indicator for a register in response to an allocation request for which the scoreboard circuitry indicates that operand data for the corresponding register is stored at the first cache level.
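The behavior recited in the claims above can be illustrated with a small software model. The following Python sketch is an illustration only, not the claimed circuitry: the class name LockedRegisterCache, the least-recently-used victim choice, the one-line-per-register layout, and the decision to raise an error when every line is locked are all assumptions that do not come from the source. It shows lock indicators being asserted at decode, a locked line being skipped when a victim is chosen for removal (eviction), and a reset event clearing the active set of indicators and switching to a second set.

```python
from collections import OrderedDict


class LockedRegisterCache:
    """Toy model (hypothetical): a small cache of register operand data
    whose lines cannot be evicted while their register's lock bit is set."""

    def __init__(self, capacity, num_registers):
        self.capacity = capacity
        # register index -> operand data, ordered oldest first (assumed LRU)
        self.lines = OrderedDict()
        # Two sets ("frames") of lock indicators, one bit per register.
        self.frames = [[False] * num_registers, [False] * num_registers]
        self.active = 0  # index of the frame currently used at decode

    def decode(self, registers_used):
        """Assert lock indicators for registers used by a decoded instruction."""
        for reg in registers_used:
            self.frames[self.active][reg] = True

    def is_locked(self, reg):
        """A register is protected while any frame still locks it."""
        return any(frame[reg] for frame in self.frames)

    def store(self, reg, data):
        """Store operand data; return the evicted register index or None.

        Raises RuntimeError if the cache is full and every line is locked.
        That policy is an assumption; the source does not describe this case.
        """
        if reg in self.lines:
            self.lines[reg] = data
            self.lines.move_to_end(reg)
            return None
        victim = None
        if len(self.lines) >= self.capacity:
            # Oldest line whose register is not locked becomes the victim.
            victim = next((r for r in self.lines if not self.is_locked(r)), None)
            if victim is None:
                raise RuntimeError("all cache lines are locked")
            del self.lines[victim]
        self.lines[reg] = data
        return victim

    def reset(self):
        """Reset event: clear the active frame, then switch to the other."""
        self.frames[self.active] = [False] * len(self.frames[self.active])
        self.active ^= 1


cache = LockedRegisterCache(capacity=2, num_registers=8)
cache.store(0, "r0 data")
cache.store(1, "r1 data")
cache.decode([0])                          # r0 is used by a decoded instruction
evicted = cache.store(2, "r2 data")        # r0 is oldest but locked, so r1 is evicted
cache.reset()                              # reset event clears the lock on r0
evicted_after = cache.store(3, "r3 data")  # r0 is now the oldest unlocked line
```

In this model, evicted is register 1 and evicted_after is register 0: the lock protects register 0 only until the reset event. The barrier operation and the forwarding of a frame identifier through the scheduling pipeline stage are not modeled here.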
Description
BACKGROUND
Technical Field
This disclosure relates generally to computer processors and more specifically to cache control.
Description of the Related Art
Data management techniques often have a significant impact on processor performance. Recently, unified memory architectures have allowed multiple components of a device (e.g., GPU and CPU) to access the same memory in the same locations, instead of reserving RAM segments for different components. This can advantageously reduce redundancy and data copies. In this context, data for different GPU registers can be memory-backed in a cache/memory hierarchy. Consequently, certain combinations of tasks can cause register data to be evicted from a given cache level. This can reduce performance when the register data is accessed again after being evicted.
US 2019 / 0 095 203 A1 discloses a processor with multiple execution clusters that uses a cache memory, separate from the system memory hierarchy and the register set, for inter-cluster communication of live-in register values, in order to store values generated by a first cluster and consumed by a second cluster.
US 9 785 567 B2 discloses an operand cache with pipeline-specific sub-areas, in which validity and modification states relative to the register file are stored for each entry/sub-area.
US 2021 / 0 311 997 A1 discloses a control table with entries for variable-size address ranges, in which lookup is performed by binary search; the window-narrowing steps compare the request address only with the first boundary address, and only the remaining entry is checked against the second boundary address or size.
US 2010 / 0 106 938 A1 discloses a computing device with multiple TLBs that temporarily store parts of an address translation table held in the main memory unit, wherein, when an entry fetched from main memory is inserted, a check determines whether an entry already exists at the destination, and if so the existing entry is displaced and moved to another TLB.
US 10 275 372 B1 discloses a cached memory device with a memory array and multiple data buffers connected via system data and address buses, each of which is assigned valid bits, buffer address registers, and associated comparison circuits for address comparison.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1A is a diagram illustrating an overview of exemplary graphics processing operations, according to some embodiments.
FIG. 1B is a block diagram illustrating an exemplary graphics unit, according to some embodiments.
FIG. 2 is a block diagram illustrating an exemplary data pipeline with register lock control for a data cache, according to some embodiments.
FIG. 3 is a diagram illustrating exemplary lock indicators, according to some embodiments.
FIG. 4 is a block diagram illustrating exemplary register lock control circuitry, according to some embodiments.
FIG. 5 is a flowchart illustrating an exemplary technique for switching to a new frame of lock indicators, according to some embodiments.
FIG. 6 is a block diagram showing an exemplary detailed shader processor that includes operand caches, data caches, and rename circuitry, according to some embodiments.
FIG. 7 is a flowchart illustrating an exemplary method, according to some embodiments.
FIG. 8 is a block diagram illustrating an exemplary computing device, according to some embodiments.
FIG. 9 is a diagram illustrating exemplary applications of disclosed systems and devices, according to some embodiments.
FIG. 10 is a block diagram illustrating an exemplary computer-readable medium that stores circuit design information, according to some embodiments.
DETAILED DESCRIPTION
The invention is set out in the attached claims.
In the disclosed embodiments, registers are memory-backed and can therefore be stored at various levels of a cache/memory hierarchy.
For example, general-purpose GPU register data can be stored in reservation stations, physical registers, operand caches near data path circuits, one or more data caches that also store other data types, system memory, and so on. Therefore, register data can be evicted from a given cache level while operations that still require the data are being performed. In general, such evictions should be avoided where possible to keep register data closer to the execution pipelines. However, tracking which registers are in use (and therefore should be kept at a given cache level) can be costly in terms of circuit area and power consumption. In the disclosed embodiments, one or more lock frames are defined, each containing a lock indicator (e.g., a bit) per architectural register per thread (or per SIMD (single-instruction multiple-data) group, where a SIMD group can include multiple threads). Note that threads/SIMD groups can be assigned to channels for execution, allowing the control logic to manage frames on a channel-by-channel basis (thus maintaining at least one lock indicator per architectural register per channel). In some embodiments, a register is