US-12625821-B2 - Cache control to preserve register data
Abstract
Techniques are disclosed relating to eviction control for cache lines that store register data. In some embodiments, memory hierarchy circuitry is configured to provide memory backing for register operand data in one or more cache circuits. Lock circuitry may control a first set of lock indicators for a set of registers for a first thread, including to assert one or more lock indicators for registers that are indicated, by decode circuitry, as being utilized by decoded instructions of the first thread. The lock circuitry may preserve register operand data in the one or more cache circuits, including to prevent eviction of a given cache line from a cache circuit based on an asserted lock indicator. The lock circuitry may clear the first set of lock indicators in response to a reset event. Disclosed techniques may advantageously retain relevant register information in the cache with limited control circuit area.
Inventors
- Jonathan M. Redshaw
- Winnie W. Yeung
- Benjiman L. Goodman
- David K. Li
- Zelin Zhang
- Yoong Chert Foo
Assignees
- APPLE INC.
Dates
- Publication Date
- 20260512
- Application Date
- 20241127
Claims (20)
- 1. An apparatus, comprising: processor pipeline circuitry configured to perform operations on register operand data; control circuitry configured to provide memory backing for register operand data in a memory hierarchy, wherein the memory hierarchy includes: a data cache; and an operand cache between the processor pipeline circuitry and the data cache in the memory hierarchy; and lock circuitry configured to: control a first set of lock indicators for a set of registers for a first thread, including to assert one or more lock indicators for registers that are indicated as being utilized by instructions of the first thread; preserve register operand data in the data cache, including to prevent eviction of a given cache line from the data cache based on an asserted lock indicator for register data stored in the given cache line; and in response to a reset event: flush operand data to the memory hierarchy from one or more operand cache entries indicated as being last-use; and clear the first set of lock indicators.
- 2. The apparatus of claim 1, wherein the lock circuitry is further configured to, in response to the reset event: assert one or more lock indicators for the registers whose operand data was flushed from the one or more operand cache entries.
- 3. The apparatus of claim 2, wherein, to assert the one or more lock indicators for the registers whose operand data was flushed from the one or more operand cache entries, the lock circuitry is configured to control a second set of lock indicators for the set of registers.
- 4. The apparatus of claim 1, wherein the lock circuitry is further configured to, in response to the reset event: clear entries in the operand cache that are not indicated as being last-use.
- 5. The apparatus of claim 1, wherein indications of last-use for one or more operand cache entries are based on last-use hint information from a compiler.
- 6. The apparatus of claim 1, wherein the lock circuitry is configured to maintain lock indicators at single-instruction multiple-data (SIMD) granularity for threads in a SIMD group.
- 7. The apparatus of claim 1, further comprising: decode circuitry configured to decode instructions for execution and indicate the registers utilized by instructions of the first thread.
- 8. The apparatus of claim 1, wherein the lock circuitry is configured to switch between sets of lock indicators for the set of registers in response to the reset event.
- 9. The apparatus of claim 1, wherein the processor pipeline circuitry is configured to pipeline, at least through a schedule pipeline stage, information that identifies a set of lock indicators corresponding to a given operation.
- 10. The apparatus of claim 9, wherein the processor pipeline circuitry is configured to perform a fence operation in response to the reset event such that all operations that use the first set of lock indicators reach a pipeline stage before any operations that use a second set of lock indicators proceed past the pipeline stage.
- 11. The apparatus of claim 1, wherein the reset event corresponds to a threshold number of registers in the set of registers being locked.
- 12. The apparatus of claim 1, wherein the reset event corresponds to a compiler hint.
- 13. The apparatus of claim 1, wherein the reset event corresponds to a threshold number of stall cycles of the first thread in a pipeline stage.
- 14. A method, comprising: storing, by a computing system in an operand cache, register operand data for operations performed by a processor pipeline of the computing system; controlling, by the computing system, a first set of lock indicators for a set of registers for a first thread, including to assert one or more lock indicators for registers that are indicated as being utilized by instructions of the first thread; preserving, by the computing system, register operand data in a data cache, including to prevent eviction of a given cache line from the data cache based on an asserted lock indicator for register data stored in the given cache line; and in response to a reset event, the computing system: flushing operand data to the data cache from one or more entries of the operand cache indicated as being last-use; and clearing the first set of lock indicators.
- 15. The method of claim 14, further comprising, in response to the reset event: asserting one or more lock indicators for the registers whose operand data was flushed from the one or more operand cache entries.
- 16. The method of claim 15, wherein the asserting includes controlling a second set of lock indicators for the set of registers.
- 17. The method of claim 14, further comprising, in response to the reset event: clearing entries in the operand cache that are not indicated as being last-use.
- 18. An apparatus, comprising: processor pipeline circuitry configured to perform operations on register operand data; control circuitry configured to provide memory backing for register operand data in a memory hierarchy, wherein the memory hierarchy includes a data cache at a first cache level; scoreboard circuitry configured to track which architectural registers are stored at the first cache level; and lock circuitry configured to: initiate a map request to confirm whether operand data for registers that are indicated as being utilized by instructions of a first thread are stored at the first cache level; in response to the scoreboard circuitry confirming that one or more registers utilized by instructions of the first thread are stored at the first cache level, control a first set of lock indicators for a set of registers for the first thread, including to assert one or more lock indicators corresponding to the confirmed one or more registers; and preserve register operand data in the data cache, including to prevent eviction of a given cache line from the data cache based on an asserted lock indicator for a register stored in the given cache line.
- 19. The apparatus of claim 18, wherein the lock circuitry is further configured to: clear the first set of lock indicators in response to a reset event.
- 20. The apparatus of claim 18, further comprising: control circuitry configured to, in response to the scoreboard circuitry confirming that a register utilized by an instruction of the first thread is not stored at the first cache level, initiate a fetch operation from the memory hierarchy to the first cache level.
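The reset-event handling recited in claims 1-4 (flush last-use operand-cache entries, lock the flushed registers in a second set of indicators, clear the first set) can be illustrated with a minimal Python sketch. This is a behavioral model with hypothetical names, not the claimed circuitry, which is hardware:

```python
class OperandCacheEntry:
    """One operand-cache entry holding data for an architectural register."""
    def __init__(self, reg, data, last_use=False):
        self.reg = reg
        self.data = data
        self.last_use = last_use  # e.g., from compiler last-use hints (claim 5)

class LockCircuitSketch:
    """Behavioral model of the lock circuitry's reset-event handling."""
    def __init__(self):
        self.frames = [set(), set()]  # two sets of lock indicators (claims 3, 8)
        self.active = 0               # index of the frame accumulating locks
        self.operand_cache = []       # entries between the pipeline and data cache
        self.data_cache = {}          # reg -> data at the backing cache level

    def lock(self, reg):
        # assert a lock indicator for a register utilized by the thread
        self.frames[self.active].add(reg)

    def on_reset_event(self):
        old = self.active
        self.active = 1 - self.active  # switch between sets of indicators (claim 8)
        for entry in self.operand_cache:
            if entry.last_use:
                # flush last-use operand data toward the memory hierarchy (claim 1)
                self.data_cache[entry.reg] = entry.data
                # lock the flushed register in the second set (claims 2-3)
                self.frames[self.active].add(entry.reg)
        # non-last-use entries are simply cleared, not flushed (claim 4)
        self.operand_cache.clear()
        # clear the first set of lock indicators (claim 1)
        self.frames[old].clear()
```

For example, if register 3 is in the operand cache with a last-use indication and register 7 is not, a reset event writes register 3's data to the data cache and re-locks it in the new frame, while register 7's entry and the old frame's locks are discarded.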
Description
The present application is a continuation of U.S. application Ser. No. 18/173,500, entitled “Cache Control to Preserve Register Data,” filed Feb. 23, 2023, the disclosure of which is incorporated by reference herein in its entirety.

BACKGROUND

Technical Field

This disclosure relates generally to computer processors and more particularly to cache control.

Description of the Related Art

Data management techniques often have substantial impacts on processor performance. Unified memory architectures allow multiple components of a device (e.g., GPU and CPU) to access the same memory at the same locations, in contrast to reserving portions of RAM for different components. This may advantageously reduce redundancy and data copying. In this context, data for various graphics processor registers may be memory-backed in a cache/memory hierarchy. Therefore, certain combinations of tasks may cause register data to be evicted from a given cache level, which may have performance consequences if the register data is accessed again after the eviction.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1A is a diagram illustrating an overview of example graphics processing operations, according to some embodiments.
FIG. 1B is a block diagram illustrating an example graphics unit, according to some embodiments.
FIG. 2 is a block diagram illustrating an example pipeline with register lock control for a data cache, according to some embodiments.
FIG. 3 is a diagram illustrating example lock indicators, according to some embodiments.
FIG. 4 is a block diagram illustrating example register lock control circuitry, according to some embodiments.
FIG. 5 is a flow diagram illustrating an example technique to switch to a new frame of lock indicators, according to some embodiments.
FIG. 6 is a block diagram illustrating an example detailed shader processor that includes operand caches, data caches, and rename circuitry, according to some embodiments.
FIG. 7 is a flow diagram illustrating an example method, according to some embodiments.
FIG. 8 is a block diagram illustrating an example computing device, according to some embodiments.
FIG. 9 is a diagram illustrating example applications of disclosed systems and devices, according to some embodiments.
FIG. 10 is a block diagram illustrating an example computer-readable medium that stores circuit design information, according to some embodiments.

DETAILED DESCRIPTION

In disclosed embodiments, registers are memory-backed and therefore may be stored in various levels of a cache/memory hierarchy. As one example, GPU general-purpose register data may be stored in reservation stations, physical registers, operand caches near datapath circuitry, one or more data caches that also store other types of data, system memory, etc. Register data may therefore be evicted from a given cache level while operations that still need the data are being executed. Generally, such evictions should be avoided, when possible, to preserve register data closer to the execution pipelines. Tracking which registers are in use (and thus should be preserved at a given cache level) may be costly, however, in terms of circuit area and power consumption.

In disclosed embodiments, one or more lock frames are defined that include a lock indicator (e.g., a bit) per architectural register per thread (or per single-instruction multiple-data (SIMD) group, where a SIMD group may include multiple threads). Note that threads/SIMD groups may be assigned to channels for execution, so the control circuitry may maintain frames on a per-channel basis (thus maintaining at least one lock indicator per architectural register per channel). In some embodiments, each time a register is accessed (e.g., at a map pipeline stage), its lock indicator is set. Control circuitry preserves register data in a data cache by preventing eviction of cache lines that store any register data with a lock bit set.
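The per-register, per-channel locking just described can be sketched in a few lines of Python. This is a minimal software model under assumed structure (the names `LockFrame`, `set_on_access`, and `can_evict` are hypothetical; the disclosed lock circuitry is hardware):

```python
class LockFrame:
    """One lock bit per architectural register per channel (thread/SIMD group)."""
    def __init__(self, num_regs, num_channels):
        self.bits = [[False] * num_regs for _ in range(num_channels)]

    def set_on_access(self, channel, reg):
        # set when the register is accessed, e.g., at a map pipeline stage
        self.bits[channel][reg] = True

    def bulk_unlock(self, channel):
        # clear the whole frame at once, e.g., when a lock-count or
        # stall-cycle threshold is met
        self.bits[channel] = [False] * len(self.bits[channel])

def can_evict(line_regs, frame, channel):
    """A cache line holding register data is evictable only if none of the
    registers it stores has its lock bit set."""
    return not any(frame.bits[channel][r] for r in line_regs)
```

For instance, once register 2 on channel 0 is accessed, any line storing register 2's data for that channel is pinned until the frame is bulk-unlocked, while lines holding only unlocked registers remain eligible for eviction.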
The control circuitry may bulk unlock the lock indicators, e.g., when the number of locks in a frame meets a threshold or when a channel is stalled for a threshold number of cycles. Further, multiple lock frames may be defined for a given channel so that processing can move to a new lock frame while the old frame drains. A given frame may thus be accumulating, draining, or invalid. Note that, in these embodiments, a given lock indicator may remain set after its corresponding register data is no longer in use. Therefore, at least in theory, some cache lines that could have been evicted may remain in the cache. Overall, however, these embodiments may advantageously provide performance similar to implementations with more precise tracking (e.g., a use counter per register) with substantial reductions in power consumption and area relative to those implementations.

Graphics Processing Overview

Referring to FIG. 1A, a flow diagram illustrating an example processing flow 100 for processing graphics data