EP-4345633-B1 - DISTRIBUTED SYSTEM LEVEL CACHE
Inventors
- LANDERS, Mark
- KING, Ian
- GOUDIE, Alistair
- LIVESLEY, Michael John
Dates
- Publication Date
- 20260506
- Application Date
- 20230928
Claims (15)
- A processor (100), comprising: a plurality of cores (110) comprising a first core (111) and a second core (112); a distributed cache (120) comprising a plurality of cache slices including a first cache slice (121) and a second cache slice (122); and a first interconnect (131) between the first cache slice (121) and the second cache slice (122), wherein the distributed cache (120) is configured to cache a copy of data stored at a plurality of memory addresses of a memory, wherein the first cache slice (121) is connected to the first core (111), and the second cache slice (122) is connected to the second core (112), wherein the first cache slice (121) is configured to cache a copy of data stored at a first set of memory addresses of the plurality of memory addresses, wherein the second cache slice (122) is configured to cache a copy of data stored at a second, different, set of memory addresses of the plurality of memory addresses, wherein the first cache slice (121) is configured to: receive, from the first core (111), a first memory access request specifying a target memory address of the memory, wherein the plurality of memory addresses includes the target memory address; identify, based on the target memory address, a target cache slice among the first and second cache slices (121), (122), wherein the target cache slice is the cache slice configured to cache a copy of the data stored at the target memory address; and responsive to the target cache slice being identified as the second cache slice (122), forward the first memory access request to the target cache slice, and wherein the first interconnect (131) is configured to convey the first memory access request to the second cache slice (122), characterised in that the first cache slice (121) comprises a first cache bank (211) configured to cache the copy of the data stored at the first set of memory addresses, and a first crossbar (221) connected to the first cache bank (211), wherein the second cache slice (122) comprises a second cache bank (212) configured to cache the copy of the data stored at the second set of memory addresses, and a second crossbar (222) connected to the second cache bank (212), wherein the first crossbar (221) is configured to: receive, from the first core (111), the first memory access request; identify, based on the target memory address, a target cache bank among the first and second cache banks (211), (212), wherein the target cache bank is the cache bank configured to cache the copy of the data stored at the target memory address; and forward the first memory access request to the target cache bank, and wherein the first interconnect (131) is configured to convey the first memory access request to the second crossbar (222) when the target cache bank is identified as the second cache bank (212).
- The processor (100) of claim 1, wherein the first crossbar (221) is configured to transmit the first memory access request to the second crossbar (222) via the first interconnect (131) when the target cache bank is identified as the second cache bank (212), and wherein the second crossbar (222) is configured to: receive, via the first interconnect (131), the first memory access request when the target cache bank is the second cache bank (212); and send, to the second cache bank (212), the first memory access request when the target cache bank is the second cache bank (212).
- The processor (100) of any one of claims 1 and 2, wherein the processor (100) further comprises: a third core (113); a third cache slice (123); and a second interconnect (132) between the second cache slice (122) and the third cache slice (123), wherein the third cache slice (123) is connected to the third core (113), wherein the third cache slice (123) comprises a third cache bank (213) configured to cache a copy of data stored at a third set of memory addresses of the plurality of memory addresses, and a third crossbar (223) connected to the third cache bank (213), wherein the first crossbar (221) is configured to transmit the first memory access request to the second crossbar (222) via the first interconnect (131) when the third cache bank (213) is identified as the target cache bank, wherein the second crossbar (222) is configured to transmit the first memory access request to the third crossbar (223) via the second interconnect (132) when the target cache bank is identified as the third cache bank (213), and wherein the third crossbar (223) is configured to send, to the third cache bank (213), the first memory access request when the target cache bank is identified as the third cache bank (213).
- The processor (100) of any one of the preceding claims, wherein the plurality of cache slices are connected in one of: a linear topology, wherein at least two cache slices are each directly connected to exactly one other cache slice, and optionally wherein at least one cache slice is directly connected to exactly two other cache slices; a ring topology, wherein each cache slice is directly connected to exactly two other cache slices to define the ring topology; a partially cross-linked ring topology, wherein each cache slice is directly connected to at least two other cache slices to define the ring topology, wherein at least two cache slices are each directly connected to exactly two other cache slices, and wherein at least two cache slices are each directly connected to at least three other cache slices; a densely cross-linked ring topology, wherein each cache slice is directly connected to at least three other cache slices, and wherein at least two cache slices are not directly connected to one another; a fully connected topology, in which each cache slice is directly connected to every other cache slice; and a hybrid topology, wherein at least one cache slice is directly connected to at least three other cache slices, and wherein at least one cache slice is directly connected to exactly one other cache slice.
- The processor (100) of any one of the preceding claims, wherein the first core (111) comprises a first cache (161), wherein the first cache (161) is configured to cache a copy of the data stored in the memory, and wherein the first core (111) is configured to: search the first cache (161) for the data stored at the target memory address; and responsive to the search failing to find the data in the first cache (161), transmit the first memory access request to the first cache slice (121), and optionally wherein the first cache (161) comprises a compressor (171) and a decompressor (172), wherein the compressor (171) is configured to: compress a first set of uncompressed data stored in the first cache; and provide the compressed first set of data to the first cache slice (121); wherein the decompressor (172) is configured to: receive a second set of data from the distributed cache (120), wherein the second set of data is compressed; and decompress the second set of data.
- A method (400) of obtaining data for a processor (100), wherein the processor (100) comprises: a plurality of cores comprising a first core (111) and a second core (112); a distributed cache (120) comprising a plurality of cache slices including a first cache slice (121) and a second cache slice (122); and a first interconnect (131) between the first cache slice (121) and the second cache slice (122), wherein the distributed cache (120) is configured to cache a copy of data stored at a plurality of memory addresses of a memory, wherein the first cache slice (121) is connected to the first core (111), and the second cache slice (122) is connected to the second core (112), wherein the first cache slice (121) is configured to cache a copy of data stored at a first set of memory addresses of the plurality of memory addresses, wherein the second cache slice (122) is configured to cache a copy of data stored at a second, different, set of memory addresses of the plurality of memory addresses, and wherein the method comprises: receiving (410), by the first cache slice (121), a first memory access request specifying a target memory address of the memory, wherein the plurality of memory addresses includes the target memory address; identifying (420), by the first cache slice (121), based on the target memory address, a target cache slice among the first and second cache slices (121), (122), wherein the target cache slice is the cache slice configured to cache a copy of the data stored at the target memory address; and responsive to the target cache slice being identified as the second cache slice, forwarding (430), by the first cache slice (121), the first memory access request to the target cache slice, wherein the first interconnect (131) is configured to convey the first memory access request to the second cache slice (122), characterised in that the first cache slice (121) comprises a first cache bank (211) configured to cache the copy of the data stored at the first set of memory addresses, and a first crossbar (221) connected to the first cache bank (211), wherein the second cache slice (122) comprises a second cache bank (212) configured to cache the copy of the data stored at the second set of memory addresses, and a second crossbar (222) connected to the second cache bank (212), and wherein the method further comprises: receiving (410), by the first crossbar (221), the first memory access request; identifying (420), by the first crossbar (221), based on the target memory address, a target cache bank among the first and second cache banks, wherein the target cache bank is the cache bank configured to cache a copy of the data stored at the target memory address; and forwarding (430), by the first crossbar (221), the first memory access request to the target cache bank, wherein the first interconnect (131) is configured to convey the first memory access request to the second crossbar (222) when the target cache bank is identified as the second cache bank (212).
- The method (400) of claim 6, further comprising, when the first memory access request is a read request: searching (450) the target cache bank for the cached copy of the data stored at the target memory address; responsive to the search finding the data, reading (460) the data stored at the target memory address from the target cache bank; and responsive to the search failing to find the data, reading (470) the data from the target memory address of the memory.
- The method (400) of any one of claims 6-7, further comprising: receiving (490), by the second crossbar (222), a second memory access request specifying the target memory address; identifying (491), by the second crossbar (222), the target cache bank; and forwarding (492), by the second crossbar (222), the second memory access request to the target cache bank, and optionally wherein the method further comprises: receiving (441), by the target cache bank, the first memory access request; receiving (493), by the target cache bank, the second memory access request, wherein the target cache bank receives the first memory access request before receiving the second memory access request; locking (451) access to the cached copy of the data; reading (460), by the first core (111), the cached copy of the data; overwriting (461), by the first core (111), at least a part of the cached copy of the data with updated data; unlocking (462) access to the cached copy of the data; after unlocking access to the cached copy of the data, locking access to the cached copy of the data; reading (495), by the second core (112), the cached copy of the data; overwriting (496), by the second core (112), at least a part of the cached copy of the data with updated data; and unlocking (497) access to the cached copy of the data.
- The method (400) of any one of claims 6-8, wherein each cache bank is associated with an identifier, and wherein the identifying (420) comprises mapping (421), by the first crossbar (221) using a hash function, the target memory address to the target cache bank.
- The method (400) of claim 9, wherein the processor further comprises: a third core (113); a third cache slice (123); and a second interconnect (132) between the second cache slice (122) and the third cache slice (123), wherein the third cache slice (123) is connected to the third core (113), wherein the third cache slice (123) comprises a third cache bank (213) configured to cache a copy of data stored at a third set of memory addresses of the plurality of memory addresses, and a third crossbar (223) connected to the third cache bank (213), wherein the method further comprises: partitioning the processor (100) into: a first domain (231) comprising the first core (111), the second core (112), the first cache slice (121) and the second cache slice (122); and a second domain (232) comprising the third core (113) and the third cache slice (123); configuring the first crossbar (221) and the second crossbar (222) to use a first hash function; and configuring the third crossbar (223) to use a second hash function, wherein the first hash function is configured such that: for any target memory address, the first crossbar (221) can identify the first cache bank (211) or the second cache bank (212) as the target cache bank, and cannot identify the third cache bank (213) as the target cache bank; and for any target memory address, the second crossbar (222) can identify the first cache bank (211) or the second cache bank (212) as the target cache bank, and cannot identify the third cache bank (213) as the target cache bank; and wherein the second hash function is configured such that, for any target memory address, the third crossbar (223) can identify the third cache bank (213) as the target cache bank, and cannot identify the first cache bank (211) or the second cache bank (212) as the target cache bank, and optionally wherein the method further comprises configuring the routing table of the second crossbar (222) such that the routing table does not identify an output channel leading to the third crossbar (223).
- A method of manufacturing, using an integrated circuit manufacturing system, a processor as claimed in any of claims 1-5, the method comprising: processing, using a layout processing system, a computer readable description of the processor so as to generate a circuit layout description of an integrated circuit embodying the processor; and manufacturing, using an integrated circuit generation system, the processor according to the circuit layout description.
- A computer program comprising instructions which, when the program is executed by a computer, cause the computer to carry out the method of any of claims 6-10.
- An integrated circuit definition dataset that, when processed in an integrated circuit manufacturing system, configures the integrated circuit manufacturing system to manufacture a processor as claimed in any of claims 1-5.
- A computer readable storage medium having stored thereon a computer readable description of a processor as claimed in any of claims 1-5 that, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the processor.
- An integrated circuit manufacturing system configured to manufacture a processor as claimed in any of claims 1-5, the integrated circuit manufacturing system comprising: a layout processing system configured to process a computer readable description of the processor so as to generate a circuit layout description of an integrated circuit embodying the processor; and an integrated circuit generation system configured to manufacture the processor according to the circuit layout description.
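Illustrative sketch (not part of the claimed subject-matter): the bank-selection step recited in claims 1 and 9 can be modelled in software as a hash from the target memory address to a bank identifier, with the request serviced locally or forwarded over the interconnect. The bank count, cache-line size, and the modulo hash below are assumptions for exposition only; the claims do not prescribe any particular hash function or topology.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed parameters: one cache bank per slice, 64-byte lines.
 * These values are illustrative, not mandated by the claims. */
#define NUM_BANKS  4
#define LINE_BYTES 64

/* Map a target memory address to the identifier of the target cache
 * bank, i.e. the bank configured to cache that address (claim 9). */
static unsigned target_bank(uint64_t addr)
{
    uint64_t line = addr / LINE_BYTES;   /* addresses in one line map together */
    return (unsigned)(line % NUM_BANKS); /* simple interleaving hash */
}

/* A crossbar local to slice `self` either services the request at its
 * own bank or forwards it over the interconnect (claim 1). */
static void route_request(unsigned self, uint64_t addr)
{
    unsigned target = target_bank(addr);
    if (target == self)
        printf("slice %u: access local bank for 0x%llx\n",
               self, (unsigned long long)addr);
    else
        printf("slice %u: forward to slice %u via interconnect\n",
               self, target);
}

int main(void)
{
    route_request(0, 0x1000); /* line 0x40 -> bank 0: serviced locally */
    route_request(0, 0x1040); /* line 0x41 -> bank 1: forwarded        */
    return 0;
}
```

Interleaving consecutive cache lines across banks, as sketched here, spreads requests evenly across the slices; the domain partitioning of claim 10 corresponds to restricting the hash's codomain to the banks of one domain.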
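The optional lock/unlock sequence of claim 8 serialises two cores' read-modify-write accesses to the same cached copy, so that neither update is lost. A minimal sketch, assuming the lock is modelled as a POSIX mutex guarding the cached copy (the claim does not mandate any particular locking primitive):

```c
#include <pthread.h>
#include <stdio.h>

/* Cached copy of the data at the target memory address, guarded by a
 * lock as in claim 8. */
static int cached_copy = 0;
static pthread_mutex_t bank_lock = PTHREAD_MUTEX_INITIALIZER;

/* Each core locks, reads the cached copy, overwrites part of it with
 * updated data, then unlocks; the lock serialises the two accesses. */
static void *core_access(void *arg)
{
    int core_id = *(int *)arg;
    pthread_mutex_lock(&bank_lock);   /* locking (451)            */
    int value = cached_copy;          /* reading (460)/(495)      */
    cached_copy = value + 1;          /* overwriting (461)/(496)  */
    printf("core %d: %d -> %d\n", core_id, value, cached_copy);
    pthread_mutex_unlock(&bank_lock); /* unlocking (462)/(497)    */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    int id1 = 1, id2 = 2;
    pthread_create(&t1, NULL, core_access, &id1);
    pthread_create(&t2, NULL, core_access, &id2);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0; /* cached_copy == 2: neither core's update is lost */
}
```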
Description
TECHNICAL FIELD
The present disclosure relates to cache systems for processors, in particular multicore processors. It may be particularly relevant for a multicore graphics processing unit (GPU).
BACKGROUND
In order to perform tasks, a processing unit (PU) requires data to process. This data is often stored in a memory device external to the PU, which the PU must access in order to obtain the required data. However, accessing external memory is slow and generally subject to limited bandwidth, and the same data may need to be accessed multiple times. Consequently, the need to access data from external memory tends to reduce PU performance.
To address this problem, a PU may be provided with a cache. A cache is a memory device located inside the PU, or at least closer to the PU than the external memory. Due to the relative proximity of the cache to the PU, the PU is able to access the cache more quickly than it can access the external memory. Furthermore, caches typically consist of static RAM (SRAM), while external memory typically consists of dynamic RAM (DRAM). SRAM can be read from and written to more quickly than DRAM, even where each memory has the same proximity to the PU. By storing the data to be processed in the cache, data can be obtained more quickly and PU performance can be improved.
However, including a cache within a PU occupies chip space that might otherwise have been used to provide additional processing hardware. Additionally, SRAM is more expensive than DRAM, and including SRAM in a PU can increase the manufacturing cost of the PU. In order to limit the costs incurred by the cache (both financially and in terms of silicon area), the cache is typically substantially smaller than the external memory (both physically and in terms of memory capacity). Consequently, the cache is only able to store a subset of the data stored in the external memory.
A PU provided with a cache achieves the greatest performance gains when the limited memory capacity of the cache is prioritised for storing the data most frequently required by the PU, because this prioritisation yields the greatest reduction in the number of times the PU must access the external memory. When the PU requires an element of data, it first checks the cache for that data. If the cache contains the data, the PU can read the data from the cache and does not need to access the external memory, saving a substantial amount of time as well as using memory-access bandwidth more efficiently. If the cache does not contain the data, the PU then accesses the external memory to obtain the data, and can cache a copy of the data for future use. In this way, use of a cache can reduce the number of times a PU accesses external memory, improving the performance of the PU.
To overcome the performance limitations caused by the limited memory capacity of the cache, a multi-level cache system can be implemented. In this system, the PU is provided with a hierarchy of caches that have increasing memory sizes but decreasing access speeds. When the PU requires an element of data, the caches can be searched for the data in an order corresponding to their position within the cache hierarchy: the smallest and fastest cache may be searched first and, if that cache does not contain the data, the next smallest (and next fastest) cache may then be searched. Ultimately, if none of the caches contain the data, the data will be obtained from the external memory, and may be cached in one of the caches.
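The hierarchy search described above can be expressed as a short loop. The following C sketch is illustrative only: the toy direct-mapped tables, the level sizes, and the fill policy are assumptions for exposition, not part of the disclosure.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Toy cache hierarchy: each level is a small direct-mapped table of
 * (tag, data) pairs, with sizes growing (and speeds notionally
 * falling) down the hierarchy. All sizes are assumed. */
#define NUM_LEVELS 3
static const int level_sets[NUM_LEVELS] = { 4, 16, 64 };
static uint64_t tags[NUM_LEVELS][64];
static uint32_t lines[NUM_LEVELS][64];
static bool     valid[NUM_LEVELS][64];

static uint32_t external_memory_read(uint64_t addr)
{
    printf("  external memory access for 0x%llx\n", (unsigned long long)addr);
    return (uint32_t)addr; /* stand-in for the DRAM contents */
}

/* Search the caches smallest/fastest first; on a miss in every level,
 * read the external memory and cache a copy for future use. */
static uint32_t read_data(uint64_t addr)
{
    for (int lvl = 0; lvl < NUM_LEVELS; lvl++) {
        int set = (int)(addr % level_sets[lvl]);
        if (valid[lvl][set] && tags[lvl][set] == addr)
            return lines[lvl][set];        /* hit: no external access */
    }
    uint32_t data = external_memory_read(addr);
    int set = (int)(addr % level_sets[0]); /* fill the first-level cache */
    tags[0][set] = addr;
    lines[0][set] = data;
    valid[0][set] = true;
    return data;
}

int main(void)
{
    read_data(0x100); /* miss everywhere: one external access       */
    read_data(0x100); /* hit in the first level: no external access */
    return 0;
}
```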
US 2017/0177492 A1 discloses a multi-core processor including hybrid L2/L3 caches for use by the cores. The sizes of the L2 caches and L3 caches can be dynamically adjusted based on cache hit rates or based on the application being executed by a processor.
SUMMARY
The invention is set out in the appended set of claims. The dependent claims set out particular embodiments.
In order to improve the speed at which a set of tasks can be performed, a multi-core PU can be utilised. The cores of the PU can operate in parallel to perform tasks. It would be desirable to provide each core with a cache system to further improve the performance of the PU. However, providing each of the cores of the PU with its own cache system can lead to an inefficient use of bandwidth. For example, consider the case where an element of data is required by two or more cores. When a core first requires the element of data, it will access the external memory and copy the element of data into its cache system. Later, when another core requires that same element of data, it will also access the external memory and copy the element of data into its cache system. In other words, each time a new core requires that same element of data, it must access the external memory and copy the data into its cache system. This duplication of the accessing of the external memory and copying of the element of data wastes bandwidth and processing time, reducing the performance of the multi-core PU.