US-20260126996-A1 - DATA PREFETCHING BASED ON BOTH INTRA-TILE AND INTER-TILE STRIDE INFORMATION

Abstract

Techniques and mechanisms for tile prefetching to be performed based on both intra-tile stride characteristics and inter-tile stride characteristics. In an embodiment, a prefetch circuit of a processor core detects that multiple demand fetch instructions target different respective tiles of a matrix. Based on the multiple demand fetch instructions, fetch pattern information is registered and made available for future reference to facilitate detection of a later instance of the fetch pattern. Fetch pattern information corresponding to a first demand fetch instruction comprises both an intra-tile stride and an inter-tile stride. In another embodiment, the prefetch circuit generates micro-operations, based on the fetch pattern information, to prefetch one or more tiles of the matrix.

Inventors

Stijn Eyerman
Wim Heirman

Assignees

INTEL CORPORATION

Dates

Publication Date: 2026-05-07
Application Date: 2024-11-04

Claims (20)

1. A processor comprising: first circuitry to: detect one or more events wherein multiple demand fetch instructions each fetch a different respective tile of a matrix; provide fetch pattern information at a registry based on the one or more events, the fetch pattern information to correspond to a first demand fetch instruction of the multiple demand fetch instructions, wherein the fetch pattern information is to identify both an intra-tile stride and an inter-tile stride; and detect, after the one or more events, that a second demand fetch instruction is to fetch a first tile of the matrix; and second circuitry coupled to the first circuitry, the second circuitry to: perform an access of the registry, based on the second demand fetch instruction, to identify the intra-tile stride and the inter-tile stride; and generate one or more microoperations based on the access, wherein the one or more microoperations are to prefetch one or more tiles of the matrix.
2. The processor of claim 1, wherein: the fetch pattern information is provided at an entry of the registry; a first field of the entry is to identify the intra-tile stride based on an operand of the first demand fetch instruction; and a second field of the entry is to identify the inter-tile stride based on respective base addresses of the first demand fetch instruction and another of the multiple demand fetch instructions.
3. The processor of claim 2, wherein: the inter-tile stride is a first inter-tile stride; and a third field of the entry is to identify a second inter-tile stride based on respective base addresses of the first demand fetch instruction and a third demand fetch instruction of the multiple demand fetch instructions.
4. The processor of claim 1, wherein: the registry is a first registry; and the second circuitry to generate the one or more microoperations based on the access comprises the second circuitry to: provide stream information at a second registry based on the access, wherein the stream information is to correspond to the one or more tiles of the matrix, and wherein the stream information is to comprise the intra-tile stride and the inter-tile stride; and generate the one or more microoperations based on the stream information.
5. The processor of claim 4, wherein: the second circuitry is to provide the stream information at an entry of the second registry; and the second circuitry further to: detect that a third demand fetch instruction, which is subsequent to the first demand fetch instruction, targets data of the one or more tiles; and based on the third demand fetch instruction, update the entry to replace a first base address with a second base address.
6. The processor of claim 5, wherein a field of the entry is to identify a maximum number of rows of a tile which is currently a target of a stream to which the entry corresponds.
7. The processor of claim 5, wherein a field of the entry is to indicate, for a tile which is currently a target of a stream to which the entry corresponds, a number of rows of the tile which remain to be prefetched.
8. The processor of claim 5, wherein a field of the entry is to indicate a distance, relative to a location indicated by a base address, from which tile data has already been prefetched.
9. A method comprising: detecting one or more events wherein multiple demand fetch instructions each target a different respective tile of a matrix; based on the one or more events, providing at a registry fetch pattern information which corresponds to a first demand fetch instruction of the multiple demand fetch instructions, wherein the fetch pattern information identifies both an intra-tile stride and an inter-tile stride; after detecting the one or more events, detecting that a second demand fetch instruction is to fetch a first tile of the matrix; performing an access of the registry, based on the second demand fetch instruction, to identify the intra-tile stride and the inter-tile stride; and based on the access, generating one or more microoperations to prefetch one or more tiles of the matrix.
10. The method of claim 9, wherein: the fetch pattern information is provided at an entry of the registry; a first field of the entry identifies the intra-tile stride based on an operand of the first demand fetch instruction; and a second field of the entry identifies the inter-tile stride based on respective base addresses of the first demand fetch instruction and another of the multiple demand fetch instructions.
11. The method of claim 10, wherein: the inter-tile stride is a first inter-tile stride; and a third field of the entry identifies a second inter-tile stride based on respective base addresses of the first demand fetch instruction and a third demand fetch instruction of the multiple demand fetch instructions.
12. The method of claim 9, wherein: the registry is a first registry; and generating the one or more microoperations based on the access comprises: based on the access, providing at a second registry stream information which corresponds to the one or more tiles of the matrix, the stream information comprising the intra-tile stride and the inter-tile stride; and generating the one or more microoperations based on the stream information.
13. The method of claim 12, wherein: the stream information is provided at an entry of the second registry; and the method further comprises: detecting that a third demand fetch instruction, which is subsequent to the first demand fetch instruction, targets data of the one or more tiles; and based on the third demand fetch instruction, updating the entry to replace a first base address with a second base address.
14. The method of claim 13, wherein a field of the entry indicates, for a tile which is currently a target of a stream to which the entry corresponds, a number of rows of the tile which remain to be prefetched.
15. A system comprising: a memory; and a processor coupled to the memory, the processor comprising: first circuitry to: detect one or more events wherein multiple demand fetch instructions each fetch a different respective tile of a matrix; provide fetch pattern information at a registry based on the one or more events, the fetch pattern information to correspond to a first demand fetch instruction of the multiple demand fetch instructions, wherein the fetch pattern information is to identify both an intra-tile stride and an inter-tile stride; and detect, after the one or more events, that a second demand fetch instruction is to fetch a first tile of the matrix; and second circuitry coupled to the first circuitry, the second circuitry to: perform an access of the registry, based on the second demand fetch instruction, to identify the intra-tile stride and the inter-tile stride; and generate one or more microoperations based on the access, wherein the one or more microoperations are to prefetch one or more tiles of the matrix.
16. The system of claim 15, wherein: the fetch pattern information is provided at an entry of the registry; a first field of the entry is to identify the intra-tile stride based on an operand of the first demand fetch instruction; and a second field of the entry is to identify the inter-tile stride based on respective base addresses of the first demand fetch instruction and another of the multiple demand fetch instructions.
17. The system of claim 16, wherein: the inter-tile stride is a first inter-tile stride; and a third field of the entry is to identify a second inter-tile stride based on respective base addresses of the first demand fetch instruction and a third demand fetch instruction of the multiple demand fetch instructions.
18. The system of claim 15, wherein: the registry is a first registry; and the second circuitry to generate the one or more microoperations based on the access comprises the second circuitry to: provide stream information at a second registry based on the access, wherein the stream information is to correspond to the one or more tiles of the matrix, and wherein the stream information is to comprise the intra-tile stride and the inter-tile stride; and generate the one or more microoperations based on the stream information.
19. The system of claim 18, wherein: the second circuitry is to provide the stream information at an entry of the second registry; and the second circuitry further to: detect that a third demand fetch instruction, which is subsequent to the first demand fetch instruction, targets data of the one or more tiles; and based on the third demand fetch instruction, update the entry to replace a first base address with a second base address.
20. The system of claim 19, wherein a field of the entry is to indicate, for a tile which is currently a target of a stream to which the entry corresponds, a number of rows of the tile which remain to be prefetched.
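
By way of illustration only, and not limitation, the registry behavior recited in claims 1 and 2 might be modeled in software along the following lines. This is a minimal behavioral sketch, not the claimed circuitry. The names `FetchPatternEntry`, `train`, and `prefetch_addresses`, the `rows` and `tiles_ahead` parameters, and the choice to key the registry by the address of the fetch instruction (`pc`) are all assumptions introduced here. Actual hardware would emit prefetch micro-operations rather than return a list of addresses.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FetchPatternEntry:
    """One registry entry (hypothetical layout, for illustration only)."""
    last_base: int                      # base address of the latest tile load seen
    intra_stride: int                   # bytes between tile rows, from the load's stride operand
    inter_stride: Optional[int] = None  # bytes between tiles, learned from two base addresses


def train(table: Dict[int, FetchPatternEntry], pc: int, base: int, intra_stride: int) -> None:
    """Record a demand tile load.

    Keying the table by instruction address (pc) is an assumption of this sketch.
    The inter-tile stride is the difference between the base addresses of two
    loads issued by the same instruction.
    """
    prev = table.get(pc)
    inter = base - prev.last_base if prev is not None else None
    table[pc] = FetchPatternEntry(base, intra_stride, inter)


def prefetch_addresses(entry: FetchPatternEntry, base: int, rows: int,
                       tiles_ahead: int = 1) -> List[int]:
    """Return the row addresses of the next tile(s) to prefetch.

    Each returned address stands in for one prefetch micro-operation.
    """
    if entry.inter_stride is None:
        return []  # no inter-tile pattern learned yet, so nothing to prefetch
    addrs: List[int] = []
    for t in range(1, tiles_ahead + 1):
        tile_base = base + t * entry.inter_stride      # step from tile to tile
        for r in range(rows):
            addrs.append(tile_base + r * entry.intra_stride)  # step from row to row
    return addrs


if __name__ == "__main__":
    table: Dict[int, FetchPatternEntry] = {}
    train(table, pc=0x400, base=0x1000, intra_stride=256)  # first tile
    train(table, pc=0x400, base=0x1040, intra_stride=256)  # second tile, 0x40 bytes on
    # A later load of the tile at 0x1080 triggers a prefetch of the following tile.
    print([hex(a) for a in prefetch_addresses(table[0x400], base=0x1080, rows=4)])
```

In this sketch the two strides play distinct roles: the inter-tile stride selects the base address of the next tile, and the intra-tile stride then walks the rows within that tile. A conventional single-stride prefetcher would capture only one of those two steps.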

Description

BACKGROUND

1. Technical Field

This disclosure generally relates to matrix multiplication and more particularly, but not exclusively, to a tile prefetcher which operates based on various types of tile strides.

2. Background Art

General matrix multiplication (GEMM) is an important functionality for various technologies, such as generative large language models (LLMs) and various other artificial intelligence (AI) models. Such models often comprise multiple fully connected layers with different dimensions, which are implemented using GEMMs. In many instances, GEMMs are interspersed with non-linear functions, but the overall execution time is largely dominated by the GEMMs. Some models, such as image-generating diffusion models, use convolution layers, which can also be implemented using GEMMs. As successive generations of artificial intelligence technologies continue to increase in number, variety, and capability, there is expected to be an increasing premium placed on improvements to the efficiency of GEMM performance.

BRIEF DESCRIPTION OF THE DRAWINGS

The various embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 shows a block diagram illustrating features of a system to facilitate tile prefetching according to one embodiment.
FIG. 2 shows a flow diagram illustrating features of a method to access stride information which facilitates tile prefetching according to an embodiment.
FIG. 3 shows a block diagram illustrating features of a core to perform tile prefetching based on different types of stride information according to an embodiment.
FIG. 4 shows a data diagram illustrating operations of a matrix multiplication using tiles which are prefetched according to an embodiment.
FIG. 5 shows a format diagram illustrating features of a fetch pattern table to provide various types of stride information according to an embodiment.
FIG. 6 shows a format diagram illustrating features of a prefetch table to identify prefetch operations performed according to an embodiment.
FIG. 7 illustrates an exemplary system.
FIG. 8 illustrates a block diagram of an example processor that may have more than one core and an integrated memory controller.
FIG. 9A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary register renaming, out-of-order issue/execution pipeline according to examples.
FIG. 9B is a block diagram illustrating both an exemplary in-order architecture core and an exemplary register renaming, out-of-order issue/execution architecture core to be included in a processor according to examples.
FIG. 10 illustrates examples of execution unit(s) circuitry.
FIG. 11 is a block diagram of a register architecture according to some examples.

DETAILED DESCRIPTION

Embodiments discussed herein variously provide techniques and mechanisms for tile prefetching to be performed based on both intra-tile stride characteristics and inter-tile stride characteristics. The technologies described herein may be implemented in one or more electronic devices. Non-limiting examples of electronic devices that may utilize the technologies described herein include any kind of mobile device and/or stationary device, such as cameras, cell phones, computer terminals, desktop computers, electronic readers, facsimile machines, kiosks, laptop computers, netbook computers, notebook computers, internet devices, payment terminals, personal digital assistants, media players and/or recorders, servers (e.g., blade server, rack mount server, combinations thereof, etc.), set-top boxes, smart phones, tablet personal computers, ultra-mobile personal computers, wired telephones, combinations thereof, and the like. More generally, the technologies described herein may be employed in any of a variety of electronic devices including a tile prefetcher.

The description herein includes numerous details to provide a more thorough explanation of the embodiments of the present disclosure. It will be apparent to one skilled in the art, however, that embodiments of the present disclosure may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring embodiments of the present disclosure. Note that in the corresponding drawings of the embodiments, signals are represented with lines. Some lines may be thicker, to indicate a greater number of constituent signal paths, and/or have arrows at one or more ends, to indicate a direction of information flow. Such indications are not intended to be limiting. Rather, the lines are used in connection with one or more exemplary embodiments to facilitate easier understanding of a circuit or a logical unit. Any represented signal, as dictated by design needs or preferences, may actually comprise one or more signals that may travel in either direction and may