
US-12625814-B2 - Graphics processor memory access architecture with address sorting

US 12625814 B2

Abstract

One embodiment provides a graphics processor including a processing resource with a register file, a memory device, a cache, and load/store/cache circuitry to process load, store, and prefetch messages from the processing resource. The circuitry sorts received memory access messages into address-sorted lists of reads and writes. It schedules a first set of address-sorted requests from a first request buffer for a first period of time, then schedules a second set of address-sorted requests from a second request buffer for a second period of time.
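
The sorting-and-batching behavior summarized in the abstract can be pictured with a short sketch. This is a minimal illustration under assumed names (Request, BATCH_SIZE, sort_into_buffers, and schedule_batches are not from the patent) and is not the patented hardware: incoming messages are split into separate read and write buffers, each kept sorted by address, and the scheduler alternates between draining a batch of reads and a batch of writes rather than issuing requests in receipt order.

```python
# Minimal illustrative sketch, not the patented implementation. All names here
# (Request, BATCH_SIZE, sort_into_buffers, schedule_batches) are assumptions.
from dataclasses import dataclass
from typing import Iterator, List, Tuple

BATCH_SIZE = 4  # assumed number of requests issued per scheduling period


@dataclass
class Request:
    address: int
    is_write: bool
    seq: int  # receipt order, retained for reference


def sort_into_buffers(messages: List[Request]) -> Tuple[List[Request], List[Request]]:
    """Split incoming messages into address-sorted read and write buffers."""
    reads = sorted((m for m in messages if not m.is_write), key=lambda m: m.address)
    writes = sorted((m for m in messages if m.is_write), key=lambda m: m.address)
    return reads, writes


def schedule_batches(reads: List[Request], writes: List[Request]) -> Iterator[List[Request]]:
    """Alternate read and write batches so the memory system switches
    read/write direction once per batch rather than once per request."""
    r = w = 0
    while r < len(reads) or w < len(writes):
        if r < len(reads):
            yield reads[r:r + BATCH_SIZE]
            r += BATCH_SIZE
        if w < len(writes):
            yield writes[w:w + BATCH_SIZE]
            w += BATCH_SIZE


if __name__ == "__main__":
    msgs = [Request(0x80, False, 0), Request(0x40, True, 1), Request(0x00, False, 2)]
    for batch in schedule_batches(*sort_into_buffers(msgs)):
        print([("W" if q.is_write else "R", hex(q.address)) for q in batch])
```

Grouping same-direction requests and ordering them by address is what lets the scheduler amortize the read/write turnaround cost, which is the memory switching penalty referred to in the claims.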

Inventors

  • Joydeep Ray
  • Sabareesh Ganapathy
  • Prathamesh Raghunath Shinde
  • Abhishek R. Appu
  • Altug Koker
  • Aditya Navale
  • Varghese George
  • Vasanth Ranganathan
  • Fangwen Fu
  • Ben J. Ashbaugh
  • Vidhya Krishnan

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-12
Application Date
2021-09-24

Claims (20)

  1. A graphics processor comprising: a processing resource including a register file; a memory device; a cache coupled with the processing resource and the memory device; and circuitry to process memory access messages received from the processing resource, wherein to process the memory access messages, the circuitry is configured to: schedule a batch of multiple address-sorted read requests from a read request buffer during a first period of clock cycles without regard to receipt order of the memory access messages associated with the multiple address-sorted read requests; and schedule a batch of multiple address-sorted write requests from a write request buffer during a second period of clock cycles without regard to receipt order of the memory access messages associated with the multiple address-sorted write requests, the read request buffer distinct from the write request buffer, the circuitry to alternately schedule batches of multiple requests from the read request buffer and the write request buffer to reduce a memory switching penalty occurrence.
  2. The graphics processor as in claim 1, wherein the circuitry is configured to: store received memory access messages to a memory access request buffer; and sort the memory access messages in the memory access request buffer to generate an address-sorted list of memory access messages.
  3. The graphics processor as in claim 2, wherein the memory access messages indicate to transfer data between the register file and the memory device or between the memory device and the cache.
  4. The graphics processor as in claim 3, wherein the memory access messages include a load message to transfer data from the memory device and a store message to transfer data from the register file.
  5. The graphics processor as in claim 4, wherein the circuitry is configured to: generate a set of address-sorted read requests in response to the load message; generate a set of address-sorted write requests in response to the store message; write the set of address-sorted read requests to the read request buffer; and write the set of address-sorted write requests to the write request buffer.
  6. The graphics processor as in claim 5, wherein the load message includes a response length to indicate an amount of data to write to the register file in response to the load message.
  7. The graphics processor as in claim 6, wherein the circuitry is configured to: determine that the load message has a response length of zero; and transfer data between the memory device and the cache in response to the load message without transferring data to the register file.
  8. The graphics processor as in claim 7, wherein the circuitry is configured to: determine whether the load message and the store message each include a reference to a same memory address; and schedule the load message and the store message in order of receipt in response to the determination.
  9. The graphics processor as in claim 8, wherein to schedule the load message and the store message in order of receipt, the circuitry is to: determine that the load message has an earlier time of receipt relative to the store message; write the set of address-sorted read requests to the read request buffer; and write the set of address-sorted write requests to the write request buffer after the set of address-sorted read requests are scheduled from the read request buffer.
  10. The graphics processor as in claim 1, wherein the circuitry is configured to submit one or more access requests to the cache for each of the multiple address-sorted read requests and the multiple address-sorted write requests.
  11. The graphics processor as in claim 10, further comprising cache control circuitry associated with the cache, wherein the cache control circuitry is to: read a tag and a cache control setting associated with an access request submitted to the cache; and service the access request from the cache or the memory device based on the tag and the cache control setting.
  12. The graphics processor as in claim 1, further comprising a surface state cache to store surface state parameters for a memory surface stored on the memory device.
  13. The graphics processor as in claim 12, wherein the circuitry is configured to read the surface state parameters from the surface state cache and submit one or more access requests to the cache for each of the multiple address-sorted read requests and the multiple address-sorted write requests that are to a location within bounds of the memory surface, the circuitry to determine the bounds of the memory surface based on the surface state parameters.
  14. The graphics processor as in claim 13, wherein the surface is a two-dimensional surface and the circuitry is to submit multiple access requests to the cache in response to a single memory access message to access a two-dimensional block of data within the two-dimensional surface.
  15. The graphics processor as in claim 14, wherein the surface state parameters include a tiling format for the surface and the circuitry is to determine the multiple access requests to submit to the cache based on the tiling format for the surface.
  16. The graphics processor as in claim 15, wherein the circuitry is configured to: determine the tiling format for the surface via the surface state cache; determine, based on the tiling format, a mapping between a cacheline of the cache and a row of the two-dimensional block of data; and submit one or more access requests to the cache for each cacheline associated with the row of the two-dimensional block of data, the one or more access requests determined based on the mapping between the cacheline of the cache and the row of the two-dimensional block of data.
  17. A method comprising: receiving a memory access request at circuitry of a graphics processor, wherein the circuitry is configured to perform memory load and store operations and the memory access request is received from a processing resource of the graphics processor; storing the memory access request to a memory access request buffer within the circuitry; retrieving a set of multiple memory access requests from the memory access request buffer; sorting the set of multiple memory access requests into an address-sorted list of memory access requests without regard to receipt order of the set of multiple memory access requests; storing read requests in the address-sorted list of memory access requests to a read request buffer within the circuitry; storing write requests in the address-sorted list of memory access requests to a write request buffer within the circuitry, the write request buffer distinct from the read request buffer; and alternately scheduling batches of multiple requests from the read request buffer and the write request buffer to reduce a memory switching penalty occurrence.
  18. The method as in claim 17, further comprising: determining whether a read request and a write request reference a same memory address; and scheduling the read request and the write request in order of receipt in response to the determination, wherein the read request is a request to prefetch data to a cache according to a cache control setting associated with the read request.
  19. A data processing system comprising: a memory device; and one or more processors coupled with the memory device, the one or more processors to execute instructions stored on the memory device, the instructions to cause the one or more processors to: store received memory access messages to a memory access request buffer; sort memory access messages in the memory access request buffer to generate an address-sorted list of memory access messages; write one or more memory read requests to a read request buffer in response to a first memory access message in the address-sorted list of memory access messages; write one or more memory write requests to a write request buffer in response to a second memory access message in the address-sorted list of memory access messages, the write request buffer distinct from the read request buffer; via circuitry configured to process memory access messages triggered in response to instructions executed by the one or more processors: schedule multiple address-sorted read requests from a read request buffer during a first period of clock cycles without regard to receipt order of the memory access messages associated with the multiple address-sorted read requests; and schedule multiple address-sorted write requests from the write request buffer during a second period of clock cycles without regard to receipt order of the memory access messages associated with the multiple address-sorted write requests, the circuitry to alternately schedule batches of multiple requests from the read request buffer and the write request buffer to reduce a memory switching penalty occurrence.
  20. The data processing system as in claim 19, wherein the memory access messages include a load message to transfer data from the memory device to a cache or a register file and a store message to transfer data from a register file to the memory device.
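
The hazard handling recited in claims 8, 9, and 18 can be illustrated with a small sketch. This is an assumed simplification, not the patent's circuitry, and it reuses the hypothetical Request objects from the earlier sketch: only requests whose address appears in both buffers are pulled out and issued in receipt order, while everything else keeps the address-sorted, batch-alternating schedule.

```python
# Illustrative sketch only (assumed helper, not the patented logic). Works on
# any objects with `address` and `seq` attributes, e.g. the Request dataclass
# from the earlier sketch.
def serialize_hazards(reads, writes):
    """Return (reads, writes, ordered): non-conflicting requests stay in their
    address-sorted buffers; requests to a shared address are ordered by seq,
    i.e. time of receipt, so a store cannot overtake an earlier load."""
    conflicts = {r.address for r in reads} & {w.address for w in writes}
    ordered = sorted((q for q in reads + writes if q.address in conflicts),
                     key=lambda q: q.seq)
    reads = [r for r in reads if r.address not in conflicts]
    writes = [w for w in writes if w.address not in conflicts]
    return reads, writes, ordered
```

Claims 14-16 describe expanding one 2D block message into per-cacheline cache accesses using the surface's tiling format. The sketch below is likewise a hedged illustration: it assumes a simple linear (row-major) tiling and hypothetical parameters (base, pitch, bpp, CACHELINE); other tiling formats would change only the address computation.

```python
CACHELINE = 64  # bytes, assumed cacheline size


def block_to_cache_accesses(base, pitch, x, y, width, height, bpp=4):
    """Yield (cacheline_address, byte_count) pairs covering each row of a
    width x height pixel block at (x, y) on a linearly tiled surface."""
    for row in range(height):
        start = base + (y + row) * pitch + x * bpp   # first byte of this row
        end = start + width * bpp                    # one past the last byte
        line = start - (start % CACHELINE)           # align down to a cacheline
        while line < end:
            covered = min(end, line + CACHELINE) - max(start, line)
            yield line, covered
            line += CACHELINE
```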

Description

FIELD

This disclosure relates generally to data processing and more particularly to data processing via a general-purpose graphics processing unit.

BACKGROUND OF THE DISCLOSURE

As graphics processors increase in size and complexity, the latency to the last level cache and graphics processor memory may see an associated increase, even if the overall throughput of the memory system also increases. The increased latency may result in thread stalls, which lower the utilization of the graphics processor, particularly for processing resources having a wider compute width, such as systolic arrays or wide vector engines. The traditional technique used to hide cache/memory latency in a graphics processor is to add more hardware threads. However, increasing the number of hardware threads has an associated cost in terms of silicon area and power consumption.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention is illustrated by way of example and not limitation in the figures of the accompanying drawings in which like references indicate similar elements, and in which:

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;
FIG. 2A-2D illustrate parallel processor components;
FIG. 3A-3C are block diagrams of graphics multiprocessors and multiprocessor-based GPUs;
FIG. 4A-4F illustrate an exemplary architecture in which a plurality of GPUs is communicatively coupled to a plurality of multi-core processors;
FIG. 5 illustrates a graphics processing pipeline;
FIG. 6 illustrates a machine learning software stack;
FIG. 7 illustrates a general-purpose graphics processing unit;
FIG. 8 illustrates a multi-GPU computing system;
FIG. 9A-9B illustrate layers of exemplary deep neural networks;
FIG. 10 illustrates an exemplary recurrent neural network;
FIG. 11 illustrates training and deployment of a deep neural network;
FIG. 12A is a block diagram illustrating distributed learning;
FIG. 12B is a block diagram illustrating a programmable network interface and data processing unit;
FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) suitable for performing inferencing using a trained model;
FIG. 14 is a block diagram of a processing system;
FIG. 15A-15C illustrate computing systems and graphics processors;
FIG. 16A-16C illustrate block diagrams of additional graphics processor and compute accelerator architectures;
FIG. 17 is a block diagram of a graphics processing engine of a graphics processor;
FIG. 18A-18B illustrate thread execution logic including an array of processing elements employed in a graphics processor core;
FIG. 19 illustrates an additional execution unit;
FIG. 20 is a block diagram illustrating graphics processor instruction formats;
FIG. 21 is a block diagram of an additional graphics processor architecture;
FIG. 22A-22B illustrate a graphics processor command format and command sequence;
FIG. 23 illustrates exemplary graphics software architecture for a data processing system;
FIG. 24A is a block diagram illustrating an IP core development system;
FIG. 24B illustrates a cross-section side view of an integrated circuit package assembly;
FIG. 24C illustrates a package assembly that includes multiple units of hardware logic chiplets connected to a substrate (e.g., base die);
FIG. 24D illustrates a package assembly including interchangeable chiplets;
FIG. 25 is a block diagram illustrating an exemplary system on a chip integrated circuit;
FIG. 26A-26B are block diagrams illustrating exemplary graphics processors for use within an SoC;
FIG. 27 illustrates a data processing system including a graphics core cluster having a memory load/store unit;
FIG. 28 illustrates a graphics core cluster including graphics cores having a memory load/store unit;
FIG. 29 illustrates a memory load/store unit including memory access request sorting logic, according to an embodiment;
FIG. 30 illustrates a method to sort memory access requests, according to an embodiment;
FIG. 31 illustrates a method to handle memory access request conflicts, according to an embodiment;
FIG. 32 illustrates serialization of potentially hazardous memory accesses in an address-sorted list of memory access requests;
FIG. 33 illustrates a system of graphics processor components that are configurable to process 2D block load and store messages associated with a graphics core;
FIG. 34 illustrates a system for caching typed 2D block data in an L1 cache, according to an embodiment;
FIG. 35 illustrates a method to process a typed 2D block message, according to an embodiment;
FIG. 36 illustrates a tile of a multi-tile graphics processor, according to an embodiment;
FIG. 37 illustrates a system to enable 2D and SIMT prefetch to cache memory of a graphics processor, according to an embodiment;
FIG. 38 illustrates a method to process a load message as a load or prefetch, according to an embodiment;
FIG. 39A illustrates a method to prefetch to an L1 cache, according to