US-12625816-B2 - Prefetch throttling based on cache thrashing
Abstract
In accordance with the described techniques, a processor includes a cache system having a level two cache, and a hardware prefetcher associated with the level two cache. The hardware prefetcher monitors a workload that includes accesses to the level two cache, and measures a degree of thrashing exhibited by the workload in the level two cache based on the accesses. Prefetch requests issued by the hardware prefetcher are throttled based on the degree of thrashing being greater than or equal to a thrashing threshold.
Inventors
- Aswinkumar Sridharan
- Anasua Bhowmik
Assignees
- ADVANCED MICRO DEVICES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-03-28
Claims (20)
- 1. A processor, comprising: a cache system including a level two cache; and prefetching circuitry associated with the level two cache, the prefetching circuitry configured to: monitor a workload that includes accesses to the level two cache; measure a degree of thrashing exhibited by the workload in the level two cache based on the accesses; and enable prefetch throttling by reducing or eliminating prefetch requests issued by the prefetching circuitry based on the degree of thrashing being greater than or equal to a thrashing threshold.
- 2. The processor of claim 1, wherein the level two cache includes multiple cache indices, and the prefetching circuitry is configured to measure the degree of thrashing based on the accesses to a subset of the multiple cache indices.
- 3. The processor of claim 2, wherein to measure the degree of thrashing, the prefetching circuitry is configured to: count a number of the accesses to cache indices of the subset having different memory addresses; and calculate a working set size as an average of the number across the cache indices in the subset.
- 4. The processor of claim 3, wherein the thrashing threshold is a function of cache associativity of the level two cache, and the prefetch throttling is enabled based on the working set size being greater than or equal to the thrashing threshold.
- 5. The processor of claim 1, wherein the prefetching circuitry is configured to: periodically measure, in successive time intervals, the degree of thrashing; and compare, after each respective time interval, the degree of thrashing exhibited by the workload during the respective time interval to the thrashing threshold.
- 6. The processor of claim 5, wherein the prefetching circuitry is configured to enable the prefetch throttling during a time interval based on the degree of thrashing being greater than or equal to the thrashing threshold during an immediately previous time interval.
- 7. The processor of claim 5, wherein the prefetching circuitry is configured to disable the prefetch throttling during a time interval based on the degree of thrashing being less than the thrashing threshold during an immediately previous time interval.
- 8. The processor of claim 5, wherein the prefetching circuitry is configured to: increment a counter responsive to thrashing time intervals during which the degree of thrashing was greater than or equal to the thrashing threshold; and decrement the counter responsive to non-thrashing time intervals during which the degree of thrashing exhibited by the workload was less than the thrashing threshold.
- 9. The processor of claim 8, wherein the prefetching circuitry is configured to enable the prefetch throttling based on the counter being greater than or equal to a counter threshold.
- 10. The processor of claim 8, wherein the prefetching circuitry is configured to disable the prefetch throttling based on the counter being less than a counter threshold.
- 11. A system, comprising: a memory system; and a processor communicatively coupled to the memory system, the processor including a cache system having a level two cache, the level two cache including prefetching circuitry configured to perform operations including: periodically measuring, in successive time intervals, a degree of thrashing exhibited by a workload in the level two cache based on accesses of the workload to the level two cache; maintaining a counter, the counter being incremented responsive to thrashing time intervals during which the degree of thrashing was greater than or equal to a thrashing threshold, the counter being decremented responsive to non-thrashing time intervals during which the degree of thrashing was less than the thrashing threshold; and throttling prefetch requests issued by the prefetching circuitry based on the counter being greater than or equal to a counter threshold.
- 12. The system of claim 11, wherein the level two cache includes multiple cache indices, and the degree of thrashing is measured, for a respective time interval, based on the accesses to a subset of the multiple cache indices during the respective time interval.
- 13. The system of claim 12, wherein the degree of thrashing is measured, for the respective time interval, by: counting a number of the accesses to cache indices of the subset having different memory addresses during the respective time interval; and calculating a working set size as an average of the number across the cache indices in the subset.
- 14. The system of claim 13, wherein counting the number for a particular cache index of the subset during the respective time interval includes: sampling the accesses to the particular cache index until a maximum number of the accesses have been sampled, the maximum number being a function of cache associativity of the level two cache; and counting, responsive to completion of the respective time interval, the number from the sampled accesses.
- 15. The system of claim 13, wherein the thrashing threshold is a function of cache associativity of the level two cache, and the counter is incremented or decremented based on a comparison of the working set size to the thrashing threshold.
- 16. The system of claim 11, the operations further comprising disabling the throttling of the prefetch requests issued by the prefetching circuitry based on the counter being less than the counter threshold.
- 17. The system of claim 11, wherein the counter is a saturating counter associated with a range of values, the saturating counter remaining at a value to which the saturating counter is set responsive to being prompted to increment or decrement the saturating counter beyond the range of values.
- 18. A method, comprising: monitoring, by a hardware prefetcher associated with a level two cache of a cache system, a workload that includes accesses to the level two cache; periodically measuring, by the hardware prefetcher and in successive time intervals, a degree of thrashing exhibited by the workload in the level two cache based on the accesses; enabling, by the hardware prefetcher, prefetch throttling by reducing or eliminating prefetch requests issued by the hardware prefetcher responsive to a first time interval during which the degree of thrashing was greater than or equal to a thrashing threshold; and disabling, by the hardware prefetcher, the prefetch throttling responsive to a second time interval during which the degree of thrashing was less than the thrashing threshold.
- 19. The method of claim 18, wherein the level two cache includes multiple cache indices, and periodically measuring the degree of thrashing includes: counting, during a respective time interval, a number of the accesses to a monitored subset of the multiple cache indices having different memory addresses; and calculating a working set size as an average of the number across cache indices in the monitored subset.
- 20. The method of claim 19, wherein the prefetch throttling is enabled based on the working set size calculated for the first time interval being greater than or equal to the thrashing threshold, and the prefetch throttling is disabled based on the working set size calculated for the second time interval being less than the thrashing threshold.
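The hysteresis scheme of claims 8 through 17 (increment a saturating counter on thrashing intervals, decrement it on non-thrashing intervals, and throttle while the counter meets a counter threshold) can be sketched in software. The class below is purely illustrative, not AMD's implementation; the class name, the parameter values, and the choice of using the associativity itself as the thrashing threshold are assumptions for the sketch.

```python
class ThrottleController:
    """Illustrative model of the counter-based prefetch-throttle decision."""

    def __init__(self, associativity=8, counter_threshold=2, counter_max=3):
        # Claims 4/15: the thrashing threshold is a function of cache
        # associativity; this sketch simply uses the associativity itself.
        self.thrashing_threshold = associativity
        self.counter_threshold = counter_threshold
        self.counter_max = counter_max  # claim 17: counter saturates
        self.counter = 0
        self.throttle = False

    def end_of_interval(self, working_set_size):
        """Called once per time interval with the measured working set size."""
        if working_set_size >= self.thrashing_threshold:
            # Thrashing interval: increment, saturating at counter_max.
            self.counter = min(self.counter + 1, self.counter_max)
        else:
            # Non-thrashing interval: decrement, saturating at zero.
            self.counter = max(self.counter - 1, 0)
        # Claims 9/10 and 11/16: throttle while counter >= counter threshold.
        self.throttle = self.counter >= self.counter_threshold
        return self.throttle
```

Requiring the counter to reach a threshold before throttling engages, and to fall below it before throttling disengages, filters out isolated thrashing intervals so the prefetcher does not oscillate on a noisy workload.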
Description
BACKGROUND

Prefetching is a technique implemented by processors to reduce memory access latency. To do so, prefetchers fetch data that is predicted to be used by a workload into a memory source that is accessible with increased speed, e.g., as compared to the memory source from which the data is fetched. In various scenarios, however, prefetching is detrimental to performance, for instance, when prefetch requests cause cache pollution, saturate memory access bandwidth, are predicted inaccurately, and/or are fetched in an untimely manner. Accordingly, processors utilize prefetch throttling to reduce and/or eliminate prefetching in scenarios in which prefetching is detrimental to performance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a non-limiting example system to implement prefetch throttling based on cache thrashing.
FIG. 2 depicts a non-limiting example in which throttling logic throttles prefetch requests issued by a hardware prefetcher of a level two cache.
FIG. 3 depicts a procedure in an example implementation of prefetch throttling based on cache thrashing.
FIG. 4 depicts a procedure in an example implementation of prefetch throttling based on cache thrashing.

DETAILED DESCRIPTION

Overview

A system includes a processor communicatively coupled to a memory system having a volatile memory and a non-volatile memory. The processor includes a cache system having multiple cache levels. For example, the cache system includes level one caches and level two caches that are private to respective cores of the processor, and a last level cache that is shared among the multiple cores of the processor. The processor further includes a hardware prefetcher associated with the level two cache.
Broadly, the hardware prefetcher of the level two cache is configured to prefetch data that is predicted to be accessed by a workload from a slower memory source in terms of memory access speed (e.g., the last level cache, the volatile memory, or the non-volatile memory) into the level two cache. Further, the hardware prefetcher of the level two cache includes throttling logic configured to reduce and/or eliminate prefetch requests issued by the hardware prefetcher.

Thrashing occurs in the level two cache when the amount of data entering the level two cache exceeds its capacity. Thrashing in the level two cache is a data access pattern in which a workload frequently evicts data from the level two cache to make room for incoming data, and re-fetches the evicted data soon thereafter. As a result, the workload experiences frequent cache misses despite having accessed the requested data recently, which degrades performance for the system. In these scenarios, there is significant traffic in the communication channels between the level two cache and the last level cache (channels which are non-shared, i.e., private to a respective core of the processor). Prefetching while a workload is thrashing the level two cache is detrimental to performance because prefetched data further occupies space in the level two cache (leading to cache pollution), and prefetch requests further consume the access bandwidth of the level two cache (e.g., of the communication channels between the level two cache and the last level cache), thereby delaying demand requests. Conventional techniques throttle prefetch requests based on system bandwidth usage (e.g., of the communication channels that are shared among the multiple cores of the processor), accuracy and timeliness of prefetch requests, and prefetch-caused cache pollution.
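The evict-then-refetch pattern described above can be made concrete with a toy simulation of a single LRU cache set. The function and parameter choices below are purely illustrative and not part of the patent; the simulation shows that once a cyclic working set exceeds a set's associativity by even one line, every access misses.

```python
from collections import OrderedDict

def misses_for_access_stream(addresses, associativity):
    """Count misses for an address stream hitting one LRU-managed cache set."""
    lru = OrderedDict()  # keys are cached addresses, ordered oldest -> newest
    misses = 0
    for addr in addresses:
        if addr in lru:
            lru.move_to_end(addr)        # hit: refresh recency
        else:
            misses += 1
            if len(lru) >= associativity:
                lru.popitem(last=False)  # evict the least recently used line
            lru[addr] = None
    return misses

# Working set fits: 8 distinct lines cycled through an 8-way set miss only
# once each (compulsory misses), then hit forever.
fits = misses_for_access_stream(list(range(8)) * 10, associativity=8)

# Working set one line too big: with cyclic accesses under LRU, each access
# evicts exactly the line that is needed next, so every access misses.
thrash = misses_for_access_stream(list(range(9)) * 10, associativity=8)
```

This cliff from near-zero misses to all misses is what makes thrashing so damaging, and it is why the patent compares a measured working set size against a threshold derived from the cache associativity.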
However, these metrics do not capture the degree of thrashing in the level two cache, or bandwidth usage in the non-shared communication channels between the level two cache and the last level cache. For these reasons, conventional prefetchers often elect not to throttle prefetch requests in scenarios in which a workload thrashes the level two cache but fits within the last level cache. By failing to throttle prefetch requests in these scenarios, conventional throttlers cause higher degrees of thrashing in the level two cache, delayed demand requests, more frequent cache misses, and overall performance degradation.

Accordingly, the throttling logic of the described techniques is configured to throttle prefetch requests issued by the hardware prefetcher based on a workload exhibiting thrashing behavior in the level two cache. To do so, the throttling logic monitors accesses to the level two cache. The accesses to the level two cache include, for example, demand requests (e.g., load and/or store requests issued by the processor) and prefetch requests, e.g., issued by hardware prefetchers of the level one cache and/or the level two cache. The throttling logic periodically measures, in successive time intervals, a degree of thrashing exhibited by the workload in the level two cache. The degree of thrashing is based on a number of the accesses having different memory addresses, and the number is co
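The per-interval measurement described above (and detailed in claims 13 and 14) can be sketched as follows: for each cache index in a monitored subset, sample accesses up to a maximum sample count, count the distinct memory addresses among the samples, and average the counts across the subset. The function, its signature, and the default sample limit are hypothetical constructions for illustration, not the patent's implementation.

```python
def working_set_size(interval_accesses, monitored_indices, max_samples=16):
    """Estimate the per-set working set size for one time interval.

    interval_accesses: iterable of (cache_index, memory_address) pairs
    observed during the interval; monitored_indices: the sampled subset.
    """
    samples = {idx: [] for idx in monitored_indices}
    for idx, addr in interval_accesses:
        # Claim 14: sample each monitored index only up to a maximum
        # number of accesses (a function of associativity in the patent).
        if idx in samples and len(samples[idx]) < max_samples:
            samples[idx].append(addr)
    # Claim 13: count distinct addresses per monitored index, then
    # average the counts across the monitored subset.
    distinct_counts = [len(set(addrs)) for addrs in samples.values()]
    return sum(distinct_counts) / len(distinct_counts)
```

At the end of each interval, the resulting working set size would be compared against the associativity-derived thrashing threshold to classify the interval as thrashing or non-thrashing.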