US-12625813-B2 - System level cache with configurable partitioning
Abstract
A data processing apparatus includes one or more cache configuration data stores, a coherence manager, and a shared cache. The coherence manager is configured to track and maintain coherency of cache lines accessed by local caching agents and one or more remote caching agents. The cache lines include local cache lines accessed from a local memory region and remote cache lines accessed from a remote memory region. The shared cache is configured to store local cache lines in a first partition and to store remote cache lines in a second partition. The sizes of the first and second partitions are determined based on values in the one or more cache configuration data stores, and the partitions may or may not overlap. The cache configuration data stores may be programmable by a user or dynamically programmed in response to local memory and remote memory access patterns.
Inventors
- Devi Sravanthi Yalamarthy
- Jamshed Jalal
- Mark David Werkheiser
- WenXuan ZHANG
- Ritukar KHANNA
- Rajani Pai
- Gurunath RAMAGIRI
- Mukesh Patel
- Tushar P RINGE
Assignees
- ARM LIMITED
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-02-14
Claims (18)
- 1 . A multi-chip data processing system comprising: a first chip directly coupled to a first memory, where the first memory is a local memory of the first chip; a second chip directly coupled to a second memory, where the second memory is a remote memory of the first chip; and a chip-chip link that couples between the first chip and the second chip; where the first chip includes: one or more cache configuration data stores; a plurality of local caching agents; a coherence manager configured to track and maintain coherency of cache lines including: local cache lines accessed by one or more local caching agents of the plurality of local caching agents or accessed via a chip-chip link by one or more remote caching agents, where the local cache lines are mapped to addresses in the local memory of the first chip; and remote cache lines accessed by one or more local caching agents of the plurality of local caching agents, where the remote cache lines are mapped to addresses in the remote memory of the first chip; and a shared cache, shared between the plurality of local caching agents, configured to store local cache lines in a first partition and to store remote cache lines in a second partition, where sizes of the first and second partitions are determined based on one or more values stored in the one or more cache configuration data stores.
- 2 . The multi-chip data processing system of claim 1 , where the one or more cache configuration data stores are programmable by one or more user programming instructions.
- 3 . The multi-chip data processing system of claim 1 , where the one or more cache configuration data stores are dynamically programmed in response to local memory and remote memory access patterns.
- 4 . The multi-chip data processing system of claim 1 , where the one or more cache configuration data stores indicate a size, maximum size or minimum size of the first partition.
- 5 . The multi-chip data processing system of claim 1 , where a first register of the one or more cache configuration data stores indicates a maximum size of the first partition and a second register of the one or more cache configuration data stores indicates a maximum size of the second partition.
- 6 . The multi-chip data processing system of claim 1 , where the shared cache is a system level cache.
- 7 . The multi-chip data processing system of claim 1 , where the local caching agents include level one (L1) and level two (L2) caching agents.
- 8 . The multi-chip data processing system of claim 1 , where the shared cache includes a level three (L3) cache, further comprising: an eviction control register; and a cache controller configured to evict cache lines from the L3 cache based on an eviction policy, where the eviction of remote cache lines from the shared cache to a level four (L4) cache memory is enabled or disabled based on the content of the eviction control register.
- 9 . The multi-chip data processing system of claim 8 , where the L4 cache memory includes a remote cache memory.
- 10 . A computer-implemented method comprising: reading values of one or more cache configuration data stores; configuring, based at least in part on the values of the one or more cache configuration data stores, a shared cache to have a first partition and a second partition, where the shared cache is shared between a plurality of local caching agents of a first chip of a multi-chip data processing system and is located on the first chip; accessing, by one or more local caching agents of the plurality of local caching agents or a remote caching agent, local cache lines mapped to addresses in a first memory directly coupled to the first chip, where the first memory is a local memory of the first chip; accessing, by the one or more local caching agents via a chip-chip link, remote cache lines mapped to addresses in a second memory directly coupled to a second chip of the multi-chip data processing system, where the second memory is a remote memory of the first chip; storing the local cache lines in the first partition of the shared cache; storing the remote cache lines in the second partition of the shared cache; and tracking and maintaining coherency of local and remote cache lines by a coherence manager located on the first chip.
- 11 . The computer-implemented method of claim 10 , further comprising: storing a local cache line in the first partition of the shared cache when the local cache line is evicted from a local cache at a higher level than the shared cache; and storing a remote cache line in the second partition of the shared cache when the remote cache line is evicted from a local cache at a higher level than the shared cache.
- 12 . The computer-implemented method of claim 10 , further comprising programming the one or more cache configuration data stores responsive to execution of one or more user programming instructions.
- 13 . The computer-implemented method of claim 10 , further comprising: monitoring access patterns of local memory and remote memory; and dynamically programming the one or more cache configuration data stores in response to the access patterns.
- 14 . The computer-implemented method of claim 10 , further comprising setting a size, maximum size or minimum size of the first partition based, at least in part, on values of the one or more cache configuration data stores.
- 15 . The computer-implemented method of claim 10 , further comprising: setting a maximum size of the first partition based, at least in part, on a value of a first cache configuration data store of the one or more cache configuration data stores; and setting a maximum size of the second partition based, at least in part, on a value of a second cache configuration data store of the one or more cache configuration data stores.
- 16 . The computer-implemented method of claim 10 , where the local caching agents include L1 and L2 caching agents.
- 17 . The computer-implemented method of claim 10 , further comprising: reading a value of an eviction control register; evicting a remote cache line from the shared cache; enabling storage of the evicted remote cache line in a level four (L4) cache when the eviction control register has a first value; and disabling storage of the evicted remote cache line in the L4 cache when the eviction control register has a second value.
- 18 . The computer-implemented method of claim 17 , where the L4 cache includes a remote cache memory.
Description
BACKGROUND

In a data processing system with multiple caching agents (such as processor cores and accelerators), caches may be arranged in a hierarchy in which each level of the hierarchy acts as an aggregation layer for the caches before it. A system level cache (SLC) stores local cache lines so that subsequent accesses to these lines can be retrieved from the SLC instead of being loaded from a slower memory. A snoop filter tracks cache lines accessed by each caching agent so that any subsequent accesses to these addresses by another caching agent can be easily looked up for coherency resolution. A hierarchy of snoop filters may be used such that, for example, multiple private (L1) data and instruction caches are tracked at a shared L2 snoop filter and any evictions from L1 caches can be allocated to L2 caches. At the lowest level cache (L3), the snoop filter tracks all the cache lines in L1/L2 caches, and evictions from L1/L2 caches can be cached in the L3 cache.

Large scale data processing systems may include multiple chips with many caching agents accessing shared memories. Accessing data via a remote chip introduces considerable latency into the system. In addition, maintaining coherency of cached data requires transmission of coherence messages between chips. As the scale of multi-chip systems increases, there is a growing need to improve the efficiency and performance of coherent data access.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings provide visual representations which will be used to describe various representative embodiments more fully and can be used by those skilled in the art to better understand the representative embodiments disclosed and their inherent advantages. In these drawings, like reference numerals identify corresponding or analogous elements.

FIG. 1 is a block diagram of a multi-chip data processing system.
FIG. 2 is a block diagram of a further multi-chip data processing system.
FIG. 3 is a block diagram of a data processing apparatus, in accordance with various representative embodiments.
FIGS. 4-6 show example partitions of a shared cache, in accordance with various representative embodiments.
FIG. 7 is a flow chart of a computer-implemented method, in accordance with various representative embodiments.
FIG. 8 is a flow chart of another computer-implemented method, in accordance with various representative embodiments.

DETAILED DESCRIPTION

The various apparatus and devices described herein provide mechanisms for cache sharing in a data processing apparatus. In particular, configurable partitioning of a shared cache is disclosed. While the present disclosure is susceptible of embodiment in many different forms, specific embodiments are shown in the drawings and described herein in detail, with the understanding that the embodiments shown and described should be considered as providing examples of the principles of the present disclosure and are not intended to limit the present disclosure to the specific embodiments shown and described. In the description below, like reference numerals are used to describe the same, similar or corresponding parts in the several views of the drawings. For simplicity and clarity of illustration, reference numerals may be repeated among the figures to indicate corresponding or analogous elements.

FIG. 1 is a simplified block diagram of a multi-chip data processing system that includes first chip 100 and second chip 120. First chip 100 includes one or more central processing unit (CPU) cores 102 that access data in a shared, directly coupled, memory 104. CPUs 102 have private (level one, or L1) caches 106, and level two (L2) caches 108 may be shared between multiple CPUs in a cluster. A system level cache (SLC) 110 may be a level three (L3) cache.
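The partitioning behavior described above can be illustrated with a simplified software model. This is a minimal sketch under stated assumptions, not the patented implementation: the class names (`CacheConfig`, `SharedCache`), the use of two registers holding maximum line counts for each partition, and the LRU replacement within a partition are all illustrative choices. The model classifies each line as local or remote by its address and stores it in the corresponding partition, whose size is bounded by the configuration value.

```python
from collections import OrderedDict


class CacheConfig:
    """Models the cache configuration data stores: here, two values giving
    the maximum sizes (in lines) of the local and remote partitions.
    This register layout is an assumption for illustration."""

    def __init__(self, local_max_lines, remote_max_lines):
        self.local_max_lines = local_max_lines    # max size of first partition
        self.remote_max_lines = remote_max_lines  # max size of second partition


class SharedCache:
    """A shared (system level) cache with two partitions: a first partition
    for lines mapped to local memory and a second for lines mapped to
    remote memory. Each partition is an LRU-ordered map of address -> data."""

    def __init__(self, config, local_base, local_limit):
        self.config = config
        self.local_range = range(local_base, local_limit)  # local memory addresses
        self.local_part = OrderedDict()   # first partition (local cache lines)
        self.remote_part = OrderedDict()  # second partition (remote cache lines)

    def is_local(self, addr):
        return addr in self.local_range

    def store(self, addr, data):
        # Select the partition by whether the address maps to local or
        # remote memory; the partition's capacity comes from the config store.
        if self.is_local(addr):
            part, limit = self.local_part, self.config.local_max_lines
        else:
            part, limit = self.remote_part, self.config.remote_max_lines
        if addr in part:
            part.move_to_end(addr)        # refresh LRU position
        elif len(part) >= limit:
            part.popitem(last=False)      # evict LRU line within this partition only
        part[addr] = data

    def lookup(self, addr):
        part = self.local_part if self.is_local(addr) else self.remote_part
        return part.get(addr)
```

Note that filling the remote partition never displaces local lines (and vice versa), which is the point of the partitioning: heavy remote-memory traffic cannot thrash locally cached data.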
The caches are used as temporary stores for data from recently accessed memory addresses, so that any subsequent accesses to these addresses can be serviced without accessing memory again. Under an exclusive cache policy, an SLC is used to store data evicted from the L1/L2 caches. Under an inclusive cache policy, L2 caches hold a subset of the data in the SLC, and L1 caches hold a subset of the data in the L2 caches. Herein, an L1 cache is referred to as the highest-level cache. Thus, L1 is higher than L2, and L2 is higher than L3. A cache on a remote device may be considered a level four (L4) cache.

A caching agent may be any device that stores and serves data from a cache. For example, CPUs 102 may be caching agents for L1 caches 106 in FIG. 1 and are local caching agents for first chip 100. Home agent 112 acts as a point of coherency (PoC) and point of serialization (PoS) for accesses to directly coupled memory 104 by local and remote caching agents. Accesses by local caching agents to remote memory over chip-chip link 114 are handled by link agent 116. In the example shown, second chip 120 includes corresponding link agent 122 and home agent 124 that accesses remote memory 126.
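The eviction control described in claims 8 and 17 can likewise be sketched: when a remote cache line is evicted from the shared (L3) cache, a control register determines whether the line is forwarded to an L4 cache (such as a cache on the remote device) or simply dropped. The register bit name, its encoding, and the function below are illustrative assumptions for this sketch, not the patented implementation.

```python
from collections import namedtuple

# A minimal cache-line record: address and data payload (illustrative).
CacheLine = namedtuple("CacheLine", ["addr", "data"])

# Assumed bit in the eviction control register that enables forwarding
# evicted remote lines to the L4 cache.
REMOTE_EVICT_TO_L4 = 0x1


def handle_eviction(line, is_remote, eviction_control_reg, l4_cache):
    """Apply the eviction policy to a line evicted from the shared cache.

    Returns True if the line was allocated into the L4 cache, False if it
    was dropped.
    """
    if is_remote and (eviction_control_reg & REMOTE_EVICT_TO_L4):
        # Forwarding enabled: keep the remote line warm in the L4 cache so
        # a later access avoids a chip-chip round trip to remote memory.
        l4_cache[line.addr] = line.data
        return True
    # Forwarding disabled, or the line is local: drop the line; a later
    # access is serviced from memory via the appropriate home agent.
    return False
```

Disabling forwarding may be preferable when chip-chip link bandwidth is scarce, since writing evicted lines back to a remote L4 cache itself consumes link traffic.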