
US-12625817-B1 - Cost-driven prefetching


Abstract

Systems and techniques for cost-based prefetching are described. In one example, a processor includes a cache system having a hierarchy of one or more cache levels and an instruction prefetcher associated with a cache level of the cache system. The instruction prefetcher encodes fetch requests to a prefetcher table and determines a cost associated with each entry based on a memory level from where the instructions associated with the fetch requests were fetched. The instruction prefetcher evicts entries in the prefetcher table based on the cost associated with one or more corresponding fetch requests in response to adding new entries. The described techniques hide instruction fetch latency that a processor frontend experiences with workloads that have code working sets that do not fit in higher levels of the cache system.
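
As an illustration of the mechanism summarized above, the following minimal C++ sketch models a prefetcher table whose entries record metadata of retired fetch requests together with a cost derived from the memory level that supplied the instructions. It is not taken from the patent: names such as PrefetchEntry, PrefetcherTable, and record_retirement are illustrative assumptions, and the actual subject matter is hardware circuitry rather than software.

```cpp
// Minimal sketch (not from the patent): a software model of a prefetcher
// table whose entries record metadata of retired fetch requests together
// with a cost derived from the memory level that supplied the instructions.
// PrefetchEntry, PrefetcherTable, and record_retirement are assumed names.
#include <cstddef>
#include <cstdint>
#include <vector>

struct PrefetchEntry {
    uint64_t fetch_block_addr = 0;  // address metadata of the retired fetch request
    uint32_t cost = 0;              // cost based on the providing memory level
    bool     valid = false;
};

class PrefetcherTable {
public:
    explicit PrefetcherTable(std::size_t entries) : table_(entries) {}

    // Record metadata of a retired fetch request; the cost reflects the
    // memory level (micro-op cache, L1I, L2, LLC, DRAM, ...) that actually
    // provided the instruction bytes.
    void record_retirement(uint64_t addr, uint32_t cost_from_mem_level) {
        // Reuse an existing entry for the same fetch block if present; here
        // the old cost is simply replaced (one option recited in claim 4).
        for (auto& e : table_) {
            if (e.valid && e.fetch_block_addr == addr) {
                e.cost = cost_from_mem_level;
                return;
            }
        }
        // Otherwise allocate a free slot.
        for (auto& e : table_) {
            if (!e.valid) {
                e = {addr, cost_from_mem_level, true};
                return;
            }
        }
        // Table full: a cost-based victim would be chosen here (see the
        // eviction sketch later in this document).
    }

private:
    std::vector<PrefetchEntry> table_;
};
```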

Inventors

  • Uday Kumar Reddy Vengalam
  • John Kalamatianos

Assignees

  • ADVANCED MICRO DEVICES, INC.

Dates

Publication Date
2026-05-12
Application Date
2024-12-20

Claims (20)

  1. A processor, comprising: instruction prefetching circuitry associated with a cache level of a hierarchy of one or more cache levels, the instruction prefetching circuitry configured to: record metadata of fetch requests of instructions retired by a processor in an entry of a prefetcher table; determine, based on a memory level providing the instructions of the fetch requests, a cost associated with each entry of the prefetcher table; and in response to adding a new entry in the prefetcher table, evict an entry in the prefetcher table based on the cost.
  2. The processor of claim 1, wherein the cost is based on the memory level from which one or more corresponding instructions were fetched.
  3. The processor of claim 2, wherein the cost is recorded upon retirement of the one or more corresponding instructions.
  4. The processor of claim 3, wherein the cost of an entry is replaced by a new cost in response to a subsequent retirement of the one or more corresponding instructions.
  5. The processor of claim 3, wherein a new cost is averaged with a previous cost in response to a subsequent retirement of the one or more corresponding instructions.
  6. The processor of claim 2, wherein the cost is selected from one or more cost classes, assigned to multiple memory levels communicatively coupled to the processor, based on an instruction fetch latency associated with the multiple memory levels.
  7. The processor of claim 6, wherein a micro-op cache or a level one instruction cache is assigned to a cost class with a minimum cost value.
  8. The processor of claim 6, wherein one or more slices within a particular memory level of the multiple memory levels are subdivided into a near-cost class and a far-cost class based on a non-uniform memory access domain associated with the one or more slices.
  9. The processor of claim 1, wherein the entry evicted from the prefetcher table has a lowest cost value.
  10. The processor of claim 9, wherein the instruction prefetching circuitry is further configured to, in response to multiple entries having the lowest cost value, use a random selection or a recency policy to determine the entry evicted from the prefetcher table.
  11. The processor of claim 1, wherein the instruction prefetching circuitry is further configured to not evict an entry in response to the cost associated with the instructions having a cost value lower than each entry in the prefetcher table.
  12. A system comprising: a processor including a cache system with a cache level that includes instruction prefetching circuitry, the processor configured to execute one or more instructions; and the instruction prefetching circuitry configured to: upon retirement of the one or more instructions by the processor, determine, based on a memory level associated with a fetch request for the one or more instructions, a cost associated with each entry in a prefetcher table tracking metadata for the one or more instructions; and evict an entry in the prefetcher table based on the cost associated with one or more corresponding fetch requests.
  13. The system of claim 12, wherein: the cost is based on the memory level from which one or more corresponding instructions were fetched; and the cost is selected from one or more cost classes, assigned to multiple memory levels of the system, based on an instruction fetch latency associated with each of the multiple memory levels.
  14. The system of claim 13, wherein a fetch response includes an identification of a cost class associated with the memory level from which the one or more corresponding instructions were fetched.
  15. The system of claim 13, wherein the cost is determined by a combined cost of fetching each cache-line worth of instructions tracked by the entry.
  16. The system of claim 13, wherein the cost is determined by summing costs assigned to fetching each cache-line worth of instructions.
  17. The system of claim 12, wherein a new cost replaces a previous cost in response to a subsequent retirement of the one or more instructions.
  18. The system of claim 12, wherein the entry evicted from the prefetcher table has a lowest cost value.
  19. The system of claim 18, wherein the instruction prefetching circuitry is further configured to determine, using a presence status in an instruction lookaside buffer, the entry evicted from the prefetcher table in response to multiple entries having the lowest cost value.
  20. A method comprising: recording metadata of fetch requests of data retired by a processor in an entry of a prefetcher table; determining, based on a memory level providing the data of the fetch requests, a cost associated with each entry of the prefetcher table; and in response to adding a new entry in the prefetcher table, evicting an entry in the prefetcher table based on the cost.
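
For readers tracing the eviction behavior recited in claims 1, 9-11, 18, and 19, the following is a minimal sketch of one possible reading: the lowest-cost entry is the victim, ties are broken by a secondary policy (recency or presence in the instruction TLB), and insertion is bypassed when the incoming request's cost is lower than every resident entry. All identifiers, and the direction of the iTLB-based tie-break, are assumptions made for illustration; the claims do not fix them.

```cpp
// Illustrative-only sketch of one reading of the claimed eviction policy:
// the victim is the lowest-cost entry, ties are broken by a secondary
// policy, and insertion is bypassed when the incoming request's cost is
// lower than every resident entry. All identifiers, and the direction of
// the iTLB-based tie-break, are assumptions; the claims do not fix them.
#include <algorithm>
#include <cstdint>
#include <vector>

struct Entry {
    uint64_t addr;
    uint32_t cost;      // cost recorded at retirement
    uint64_t last_use;  // recency stamp, used only for tie-breaking
    bool     in_itlb;   // presence status in the instruction TLB
};

// Returns the index of the victim, or -1 when the new cost is lower than
// every resident entry, in which case no entry is evicted.
int choose_victim(const std::vector<Entry>& table, uint32_t new_cost) {
    if (table.empty()) return -1;

    uint32_t min_cost = UINT32_MAX;
    for (const auto& e : table) min_cost = std::min(min_cost, e.cost);
    if (new_cost < min_cost) return -1;

    // Collect all entries tied at the minimum cost.
    std::vector<int> candidates;
    for (int i = 0; i < static_cast<int>(table.size()); ++i)
        if (table[i].cost == min_cost) candidates.push_back(i);
    if (candidates.size() == 1) return candidates[0];

    // Tie-break: here, prefer an entry whose block is already resident in
    // the iTLB (assumed cheaper to re-learn), falling back to
    // least-recently-used; random selection is another option.
    int victim = candidates[0];
    for (int i : candidates) {
        if (table[i].in_itlb != table[victim].in_itlb) {
            if (table[i].in_itlb) victim = i;
        } else if (table[i].last_use < table[victim].last_use) {
            victim = i;
        }
    }
    return victim;
}
```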

Description

BACKGROUND

Memory systems are slower than processors, creating a speed gap that processors address using multiple levels of caches. Caches store frequently accessed instructions and data for rapid retrieval. However, as the code footprint of certain workloads grows, the instructions or data needed by an application exceed the level one cache capacity, leading to more cache misses, higher access latency, and reduced performance. One way to address this issue is to employ prefetchers that anticipate the future control flow of a workload and prefetch it into the level one cache before it is requested, helping to mitigate the performance degradation from cache misses.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts a block diagram of a processing system configured to execute one or more applications in accordance with one or more implementations.
FIG. 2 is a block diagram of a non-limiting example system to implement cost-driven prefetching.
FIG. 3 depicts a non-limiting example of an instruction prefetcher that implements cost-driven instruction prefetching.
FIG. 4 depicts an example procedure for assigning cost values to prefetched instructions and determining evictions from a prefetcher table in accordance with one or more implementations.
FIG. 5 depicts a procedure in an example implementation of cost-driven instruction prefetching.

DETAILED DESCRIPTION

Overview

An example system includes a processor or system on a chip (SoC) with one or more processor cores communicatively coupled to a memory system with volatile and non-volatile memory. The processor includes a cache system with multiple cache levels. For example, the cache system includes level one caches and level two caches that are private to respective cores of the processor, and a last level cache that is shared among the multiple cores of the processor. The processor further includes an instruction prefetcher associated with one or each cache level. Broadly, the instruction prefetcher is configured to prefetch instructions that are predicted to be accessed by a workload from a slower memory source in terms of memory access speed (e.g., the level two instruction cache, the last level instruction cache, the volatile memory, or the non-volatile memory) into the level one instruction cache.

Some applications, such as applications on enterprise servers or in the cloud domain, have code footprints that are too large to fit in higher cache levels (e.g., level one or level two caches) and are prefetched or stored in lower levels (e.g., a shared level three cache). To tolerate the higher access latency, these code footprints are prefetched into caches from the level three cache or off-chip memory (e.g., dynamic random access memory (DRAM)). However, as the number of processor cores in systems increases and the number of channels to memory increases, prefetching instructions and data from the lower memory levels becomes more expensive and results in longer minimum and average roundtrip latency. For example, long-latency instruction fetches result in frequent front-end stalls, especially for out-of-order processor cores, limiting the instruction-level parallelism (ILP) and the overall instructions per cycle (IPC) per thread. Consistent instruction fetching is also important to maintain high utilization of arithmetic logic units (ALUs), including both vector and scalar ALUs.

A conventional technique to address these issues is fetch-directed instruction prefetching, which leverages branch prediction to improve instruction prefetching. Branch predictors attempt to predict whether a branch will be taken and, if so, fetch instructions from the target address associated with the branch or continue fetching sequentially. In this conventional technique, the branch predictor is decoupled from the instruction fetch pipeline using a fetch target queue (FTQ), which stores predicted instruction addresses that are consumed later by the instruction fetch pipeline when accessing the instruction cache. The branch predictor predicts control flow and sends requests to the FTQ. Fetch requests from the FTQ are inspected on each lookup in the level one instruction cache (L1 IC). In response to a hit, the instructions are provided to the decode pipeline from the L1 IC. In response to a miss, the fetch requests from the FTQ are sent to lower levels of caches.

In contrast, this document describes systems and techniques for cost-driven instruction prefetching. In particular, the systems and techniques utilize hardware-based mechanisms to enable higher-performance, low-power instruction prefetching using dedicated tables to track prior fetch activity. Higher performance is accomplished by tracking prior fetch activity that has a higher probability of tolerating misses across a multi-level memory hierarchy. In contrast to conventional techniques that use recency-based replacement policies to manage metadata in a dedicated prefetcher table, the described instruction prefetching uses a memory cost-based replacement policy.
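
To make the cost-class idea concrete, below is a hedged C++ sketch of how a fetch response's source memory level might map to a cost class, how an entry tracking several cache lines could combine per-line costs, and how a stored cost might be updated on a subsequent retirement, consistent with the overview above and with claims 5-8 and 15-16. The level names, numeric cost values, and helper functions are assumptions made for illustration; the patent does not prescribe concrete values.

```cpp
// Hedged sketch of the cost-class mapping described in the overview: memory
// levels are assigned cost classes by fetch latency, slices of a shared
// level are split into near/far classes along NUMA-domain lines, an entry
// tracking several cache lines can sum per-line costs, and a stored cost can
// be replaced or averaged on a later retirement. Level names and numeric
// values are assumptions made for illustration only.
#include <cstdint>
#include <vector>

enum class MemLevel { UopCache, L1I, L2, LlcNear, LlcFar, DramNear, DramFar };

// Cost classes ordered by instruction fetch latency; the micro-op cache and
// the level one instruction cache share the minimum-cost class.
uint32_t cost_class(MemLevel level) {
    switch (level) {
        case MemLevel::UopCache:
        case MemLevel::L1I:      return 0;
        case MemLevel::L2:       return 1;
        case MemLevel::LlcNear:  return 2;
        case MemLevel::LlcFar:   return 3;
        case MemLevel::DramNear: return 4;
        case MemLevel::DramFar:  return 5;
    }
    return 0;  // unreachable; silences compiler warnings
}

// An entry may track several cache lines' worth of instructions; one way to
// form its cost is to sum the per-line costs carried back in fetch responses.
uint32_t entry_cost(const std::vector<MemLevel>& line_sources) {
    uint32_t total = 0;
    for (MemLevel level : line_sources) total += cost_class(level);
    return total;
}

// On a subsequent retirement, the stored cost can either be replaced by the
// new cost or averaged with it.
uint32_t update_cost(uint32_t old_cost, uint32_t new_cost, bool average) {
    return average ? (old_cost + new_cost) / 2 : new_cost;
}
```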