US-12619407-B2 - Method and apparatus for computer operation improvement by flattening multi-level data structures to optimize pointer chase
Abstract
A method and apparatus for computer operation improvement by flattening multi-level data structures to optimize pointer chase is provided. Within computer program code, one or more chains of multiple pointers which ultimately reference stored information to be processed are identified. These chains are analyzed to determine the feasibility of replacing the chains with a flatter data structure. In the flatter data structure, the stored information is accessible more directly. For example, the data may be stored in a single array from which it can be accessed directly. When feasible, a chain is replaced with such a data structure and the computer program code is adjusted to implement the improved data structure in place of the associated chain. Candidate chains can be identified or prioritized, for example, on the basis of the expected impact on computer performance, e.g. in terms of operating speed or resource usage.
Inventors
- Hao Jin
- Mehrnoosh HEIDARPOUR
- Ehsan Amiri
- Bryan Pak Kwai CHAN
- Yibo Yu
Assignees
- HUAWEI TECHNOLOGIES CO., LTD.
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-12-19
Claims (15)
- 1 . A method for compiler optimization, the method comprising: determining, in an intermediate representation (IR) of a computer program, one or more optimization candidates, each optimization candidate comprising a chain of two or more pointers ultimately referencing information stored in memory; generating, for the one or more optimization candidates, a replacement data structure ultimately indicative of the information stored in memory; for the one or more optimization candidates, analyzing memory location modification and reference (mod/ref) effects to determine feasibility of replacing the chain with a flattened data structure for the one or more optimization candidates; for the one or more optimization candidates determined to be feasible, determining shapes for one or more caches to be associated with one or more optimization candidates determined to be feasible and allocating the one or more caches for storing the replacement data structure ultimately indicative of the information stored in memory; and generating computer code to write to, read from, or both write to and read from the one or more caches.
- 2 . The method of claim 1 , wherein determining one or more optimization candidates is performed using a cost model indicative of one or both of: amount or frequency of usage of code involving the one or more optimization candidates; and potential for performance improvement resulting from using the replacement data structure in place of the one or more optimization candidates.
- 3 . The method of claim 1 , wherein analyzing memory location modification and reference (mod/ref) effects includes a pointer analysis.
- 4 . The method of claim 1 , wherein analyzing memory location modification and reference (mod/ref) effects further includes evaluating load instructions.
- 5 . The method of claim 1 , wherein determining shapes for one or more caches further includes determining a size of the one or more caches.
- 6 . The method of claim 1 , wherein the one or more optimization candidates form a flattened tree structure.
- 7 . The method of claim 6 , further comprising determining dimensions and size associated with the flattened tree structure.
- 8 . An apparatus for compiler optimization, the apparatus comprising: a processor; and a memory having machine executable instructions stored thereon, the instructions when executed by the processor configure the apparatus to: determine, in an intermediate representation (IR) of a computer program, one or more optimization candidates, each optimization candidate comprising a chain of two or more pointers ultimately referencing information stored in memory; generate, for the one or more optimization candidates, a replacement data structure ultimately indicative of the information stored in memory; for the one or more optimization candidates, analyze memory location modification and reference (mod/ref) effects to determine feasibility of replacing the chain with a flattened data structure for the one or more optimization candidates; for the one or more optimization candidates determined to be feasible, determine shapes for one or more caches to be associated with one or more optimization candidates determined to be feasible and allocate the one or more caches for storing the replacement data structure ultimately indicative of the information stored in memory; and generate computer code to write to, read from, or both write to and read from the one or more caches.
- 9 . The apparatus of claim 8 , wherein determining one or more optimization candidates is performed using a cost model indicative of one or both of: amount or frequency of usage of code involving the one or more optimization candidates; and potential for performance improvement resulting from using the replacement data structure in place of the one or more optimization candidates.
- 10 . The apparatus of claim 8 , wherein analyzing memory location modification and reference (mod/ref) effects includes a pointer analysis.
- 11 . The apparatus of claim 8 , wherein analyzing memory location modification and reference (mod/ref) effects further includes evaluating load instructions.
- 12 . The apparatus of claim 8 , wherein determining shapes for one or more caches further includes determining a size of the one or more caches.
- 13 . The apparatus of claim 8 , wherein the one or more optimization candidates form a flattened tree structure.
- 14 . The apparatus of claim 13 , wherein the instructions when executed by the processor further configure the apparatus to determine dimensions and size associated with the flattened tree structure.
- 15 . A computer program product comprising a non-transitory computer readable medium storing computer executable instructions thereon that when executed by a computer perform the method for compiler optimization, the method comprising: determining, in an intermediate representation (IR) of a computer program, one or more optimization candidates, each optimization candidate comprising a chain of two or more pointers ultimately referencing information stored in memory; generating, for the one or more optimization candidates, a replacement data structure ultimately indicative of the information stored in memory; for the one or more optimization candidates, analyzing memory location modification and reference (mod/ref) effects to determine feasibility of replacing the chain with a flattened data structure for the one or more optimization candidates; for the one or more optimization candidates determined to be feasible, determining shapes for one or more caches to be associated with one or more optimization candidates determined to be feasible and allocating the one or more caches for storing the replacement data structure ultimately indicative of the information stored in memory; and generating computer code to write to, read from, or both write to and read from the one or more caches.
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application Ser. No. 63/435,855, titled “Method and Apparatus for Computer Operation Improvement by Flattening Multi-Level Data Structures to Optimize Pointer Chase,” filed on Dec. 29, 2022, which is incorporated by reference herein in its entirety.
FIELD
The present disclosure pertains to the field of computer operations, and to associated methods, apparatus and systems, and in particular to methods and apparatuses for computer operations involving memory access that includes pointer chasing.
BACKGROUND
Computers generally operate by having a computer processor (CPU) obtain instructions and data stored in memory. The data is operated on according to the instructions, and results can further be stored to memory. These memory accesses are a well-known bottleneck for computer programs, because memory access speed is typically lower than CPU speed. Caching is a well-known mechanism to address this issue: data or instructions can be held in a limited amount of computer memory located close to the CPU and accessible at high speed. Multiple levels of caches can be provided and maintained, depending on a computer's architecture. FIG. 1 illustrates an example of a computing architecture in which a processor 110 interacts with multiple levels of memory to retrieve and store instructions, data, or both. The levels of memory include the L1 cache 120, L2 cache 130, L3 cache 140 and main memory 150. The L1 cache 120 is the smallest but is most rapidly accessible by the processor 110. The L2 cache 130, L3 cache 140 and main memory 150 are progressively larger but have longer access times. To take advantage of caching, programs should ideally exhibit a property called locality: a program should access the same memory location within short time intervals (temporal locality) or access memory locations close to recently accessed memory (spatial locality).
Many programs lack data locality, which results in significant performance degradation. Compilers have developed various techniques that deal with this issue and aim to improve the data locality of a program. Many loop optimization techniques fall into this category. These optimizations attempt to restructure loops in the program to improve data locality and hence speed up the program. They generally do not change how the program's data is laid out in memory; instead, they change the order of execution of program statements so that locality is improved. Referring back to FIG. 1, non-locality may require data to be accessed in a variety of locations. This may lead to many L1 and L2 cache misses, requiring a program to read from the L3 cache 140 or main memory 150, thus slowing operation.
A second group of compiler optimizations that deal with this issue is called data-layout optimizations. Data-layout optimizations attempt to determine how a program lays out its data in memory and identify ways to change the data layout so as to increase program speed. A well-known family of optimizations in this class is known as structure peeling or structure splitting. These optimizations change aggregate data types in the program (such as the “struct” construct in the C programming language) in one of the following manners. In a first manner, an array of “struct” is broken into multiple arrays (e.g. a structure of arrays), each containing a smaller “struct” that includes one or more of the fields of the original “struct”. In another manner, the fields of the “struct” are re-ordered such that fields accessed in the same loop or region of the program are close to each other. Another type of data-layout optimization deals with arrays of scalars and aims to change the order in which elements are stored in the array.
The above-identified optimizations are known in the compiler-optimization community due to their impact on important industry-standard benchmarks. Other approaches may use prefetching to speed up access to memory when facing locality issues. This may be provided through enhancements to hardware prefetch technology, through software prefetch techniques, or through a combination of both. There are also hardware techniques that take advantage of the execution traces of a program as seen so far. These techniques attempt to extract relevant information from the traces and use it while executing the rest of the program. Such hardware techniques can potentially speed up access to a pointer-chasing chain or other patterns of a similar nature, e.g. ray tracing in GPUs. However, the above-identified approaches can suffer from a variety of drawbacks or limitations. For example, execution traces or other runtime information may be an absolute requirement, increasing complexity. As another example, prefetching does not eliminate data loads and does not improve data locality. Rather, prefetching only attempts to overlap the data load latency with the execution of other instructions.