EP-4740103-A1 - METHOD AND APPARATUS TO TRACE AND VISUALIZE DATA MOVEMENT
Abstract
Methods and apparatus are provided for identifying performance bottlenecks in computer programs by tracking data accesses. In embodiments of the present disclosure, trace points are inserted at accessor functions of a computer program to record accesses to data structures of the computer program. The accesses are analyzed to determine patterns of access and to prioritize data structures for refactoring. In embodiments, the analysis is visualized using a hailstone plot wherein the data structures are ranked by priority. In some embodiments, transformation strategies may be generated based on the analysis to improve the performance of the computer program.
Inventors
- LEUNG, Arthur Chun-Yin
- HASSAN, AHMED E.
- CHEN, BOYUAN
- Tan, Kenneth Chong Yin
Assignees
- Huawei Technologies Co., Ltd.
Dates
- Publication Date
- 20260513
- Application Date
- 20240924
Claims (20)
- A method comprising, by a computing device including a processor coupled to tangible processor-readable memory: receiving a computer program having a plurality of data structures each having associated thereto one or more accessor functions configured to interact with a respective data structure of the plurality of data structures, the computer program further having a plurality of components each calling one or more of the accessor functions when the computer program is executed; inserting, at each of a set of the accessor functions, a respective trace point, each trace point configured to, when the respective accessor function is called, report an access of the data structure associated with the respective accessor function; executing the computer program; recording, for each access reported by each trace point, a respective trace including a plurality of access metrics; analyzing, for each data structure of the plurality of data structures, the plurality of access metrics of the traces corresponding to the respective data structure to obtain a respective rank indicating a priority for refactoring the respective data structure; and displaying, for each data structure of the plurality of data structures, the respective rank.
- The method of claim 1 further comprising: selecting one or more data structures of the plurality of data structures for refactoring in accordance with the respective rank to obtain one or more candidate data structures; and refactoring the one or more candidate data structures.
- The method of claim 2 further comprising: compiling one or more of the plurality of components in accordance with the refactored one or more candidate data structures; and executing the computer program in accordance with the compiled one or more components.
- The method of any one of claims 1 to 3, wherein the plurality of access metrics of each trace identifies the respective accessor function called and includes at least one of a measured latency and a return memory address.
- The method of any one of claims 1 to 4, wherein analyzing, for each data structure of the plurality of data structures, the plurality of access metrics of the traces corresponding to the respective data structure to obtain the respective rank indicating the priority for refactoring the respective data structure includes: simulating, for each data structure of the plurality of data structures, a respective cache hit rate.
- The method of any one of claims 1 to 4, wherein analyzing, for each data structure of the plurality of data structures, the plurality of access metrics of the traces corresponding to the respective data structure to obtain the respective rank indicating the priority for refactoring the respective data structure includes: determining aggregate statistics depending from the plurality of access metrics of each trace.
- The method of any one of claims 1 to 6 further comprising: determining, in accordance with the plurality of access metrics of each trace, one or more patterns of data movement.
- The method of any one of claims 1 to 7, wherein displaying, for each data structure of the plurality of data structures, the respective rank includes: generating one or more hailstone plots each depicting a set of the data structures from the plurality of data structures as a corresponding set of markers indicating the respective rank.
- The method of claim 8 wherein each marker of the set of markers of each hailstone plot further indicates, for the corresponding data structure, a respective cache hit rate, a respective count of the accesses of the respective data structure, and a respective duration of the accesses of the respective data structure.
- The method of claim 9 wherein each marker of the set of markers of each hailstone plot: has associated thereto a respective size, a respective position, and a respective color; and, indicates the respective cache hit rate, the respective count of the accesses of the respective data structure, and the respective duration of the accesses of the respective data structure through the respective size, the respective position, and the respective color.
- The method of any one of claims 8 to 10, wherein each set of the data structures from the plurality of data structures corresponds to one of a plurality of types of data structures.
- The method of claim 11 wherein the plurality of types of data structures includes a map type, a vector type, and a queue type.
- The method of any one of claims 1 to 12, wherein: each access of each data structure of the plurality of data structures has associated thereto a respective duration of data access and a respective memory bandwidth consumption; all the durations of data access associated with the accesses of each data structure define a respective total duration of data access for the respective data structure; all the memory bandwidth consumptions associated with the accesses of each data structure define a respective total memory bandwidth consumption for the respective data structure; and the respective rank of each data structure of the plurality of data structures depends from a weighted sum of the respective total duration of data access and the respective total memory bandwidth consumption.
- The method of any one of claims 1 to 13, wherein at least one data structure is a container having a plurality of data elements.
- The method of any one of claims 1 to 14 further comprising: determining, in accordance with the respective rank of each data structure of the plurality of data structures, one or more transformation strategies each identifying one or more groups of data structures from the plurality of data structures, each transformation strategy further identifying, for each of the respective one or more groups of data structures, a respective one or more refactoring transformations; selecting a target transformation strategy from the one or more transformation strategies; refactoring each of the one or more groups of data structures of the target transformation strategy in accordance with the respective one or more refactoring transformations; compiling one or more of the plurality of components in accordance with the refactored one or more groups of data structures; and executing the computer program in accordance with the compiled one or more components.
- The method of claim 15 wherein each of the respective one or more refactoring transformations of each transformation strategy is a respective cache replacement algorithm.
- The method of claim 16 wherein each cache replacement algorithm is one of: a first-in-first-out algorithm; a last-in-first-out algorithm; a least-recently-used algorithm; a least-frequently-used algorithm; a least-frequently-recently-used algorithm; and a clock algorithm.
- The method of any one of claims 1 to 17, wherein each data structure of the plurality of data structures belongs to a data structure library.
- A computing device comprising a processor coupled to tangible processor-readable memory, the memory having stored thereon instructions to be executed by the processor to implement the method of any one of claims 1 to 18.
- A computer-readable storage medium, the computer-readable storage medium including instructions that when executed by a computing system, cause the computing system to implement the method of any one of claims 1 to 18.
Description
METHOD AND APPARATUS TO TRACE AND VISUALIZE DATA MOVEMENT This application claims priority to U.S. Patent Application No. 18/417,751 filed on January 19, 2024, the content of which is hereby incorporated by reference in its entirety herein. FIELD OF THE INVENTION The present invention pertains to computer performance and in particular to methods and apparatus for tracking computer performance. BACKGROUND The Von-Neumann architecture, comprising one or more processors and memory subsystems, is the model system for typical modern computing systems. In this architecture, the processors are responsible for performing arithmetic operations on given input data and outputting results of the operations, while the memory subsystems are responsible for storing data and transferring it to and from the processors. Algorithms processed by the architecture can either become compute bound or memory bound, both of which will limit the performance of the algorithm and form a processing bottleneck. When an algorithm is compute bound, its raw processing throughput is at the limit of the hardware processor pipeline. Further speedup of the algorithm can only be achieved by adding more processors or parallelizing computations. Alternatively, when an algorithm is memory bound, data cannot be delivered fast enough to the processors, which can leave them idle and underutilized. Here, further speedup, can only be achieved by compressing the data or using an additional or wider memory bus for feeding the data to the processors. Tools have been developed to identify performance bottlenecks in computer programs. They can either be implemented statically to analyze the program’s source code for antipatterns or dynamically to profile the program at runtime. Static analysis of a program does not account for the differences among the compilers or hardware configurations that can be used to execute the program. Some tools that dynamically analyze a program focus on assessing the arithmetic intensity and throughput of each program function; however, the analysis is tied to the specific computing hardware, and the recognizability of data structures from the data is challenging. Some dynamic tools report the relative execution duration of each function and functional call stack via the performance counter of a processor, which can be accessed by the operating system kernel; these tools lack representation of inefficient repeated accesses to certain classes, require special kernel permissions, and do not provide analysis pertinent to data movements. Lastly, some other dynamic tools simulate the hardware at various cache levels to report hit and miss rates of data at each source code line, but this does not represent the scope and size of data movements well. Altogether, tools available for identifying performance bottlenecks are unable to provide an understanding of performance at a data structure and software engineering level. Therefore, there is a need for a method and apparatus for identifying performance bottlenecks in computer programs that obviates or mitigates one or more limitations of the prior art. This background information is provided to reveal information believed by the applicant to be of possible relevance to the present invention. No admission is necessarily intended, nor should be construed, that any of the preceding information constitutes prior art against the present invention. SUMMARY An object of embodiments of the present invention is to provide methods and apparatus for identifying performance bottlenecks in computer programs. A first aspect of the present disclosure is to provide a method to be performed by a computing device including a processor coupled to tangible processor-readable memory. The method may comprise receiving a computer program having a plurality of data structures each having associated thereto one or more accessor functions configured to interact with a respective data structure of the plurality of data structures. The computer program may further have a plurality of components each calling one or more of the accessor functions when the computer program is executed. The method may further comprise inserting, at each of a set of the accessor functions, a respective trace point. Each trace point may be configured to, when the respective accessor function is called, report an access of the data structure associated with the respective accessor function. The method may further comprise executing the computer program, recording, for each access reported by each trace point, a respective trace including a plurality of access metrics, analysing, for each data structure of the plurality of data structures, the plurality of access metrics of the traces corresponding to the respective data structure to obtain a respective rank indicating a priority for refactoring the respective data structure, and displaying, for each data structure of the plurality of data structures, the respective rank. In some embodimen