EP-4453710-B1 - SYSTEMS AND METHODS FOR APPLICATION PERFORMANCE PROFILING AND IMPROVEMENT
Inventors
- LEIDEL, John D.
- DONOFRIO, David
- KABRICK, Ryan
Dates
- Publication Date
- 2026-05-13
- Application Date
- 2022-12-13
Claims (15)
- A method for analyzing and improving a target computer application, the method performed by at least one computer system (101) having a processor (111) and an accessible memory (107), the method comprising: receiving (1602) the target application by the computer system (101); generating (1604), by the computer system (101), a parallel control flow graph, ParCFG, corresponding to the target application; characterized by analyzing (1606) the ParCFG by the computer system (101) to identify nested sets of parallel dispatches where outer target execution devices are not permitted to initiate parallel workloads to target execution devices contained in one or more inner parallel constructs; generating (1610) a modified ParCFG to correct dispatch ordering by automatically inverting the ParCFG structure to produce a correctly nested set of device constructs; storing (1610) the modified ParCFG for the target application by the computer system (101); and generating and storing (1612), by the computer system (101), a modified application according to the modified ParCFG.
- The method of claim 1, wherein the ParCFG is a directed acyclic graph including vertices and directed edges that represent sequential and parallel processes performed by the target application.
- The method of claim 2, wherein at least one vertex includes metadata attributes describing at least one of execution type, parallel width, device type, memory type, or memory transforms.
- The method of claim 1, wherein analyzing (1606) the ParCFG includes at least one of analyzing serialization of the application, analyzing nested parallelism of the application, analyzing memory transforms of the application, analyzing parallel width of the application, or analyzing device dispatch of the application.
- The method of claim 1, wherein generating (1610) the modified ParCFG further includes at least one of reverting deeply nested parallel constructs to sequential operation, triggering adjacent parallel regions to execute concurrently, fusing multiple parallel region nests into a singular parallel dispatch, parallelizing sequential regions that have no external data dependencies, removing unnecessary barrier synchronization primitives, or adjusting data transformations to ensure correct memory alignment and layout.
- A computer system (101), comprising: at least one processor (111); and a memory (107) accessible by the at least one processor (111) and storing executable code, wherein the at least one processor (111) is configured to execute the executable code to: receive (1602) a target application; generate (1604) a parallel control flow graph, ParCFG, corresponding to the target application; characterized in that the at least one processor (111) is configured to execute the executable code to: analyze (1606) the ParCFG to identify nested sets of parallel dispatches where outer target execution devices are not permitted to initiate parallel workloads to target execution devices contained in one or more inner parallel constructs; generate (1608, 1610) a modified ParCFG to correct dispatch ordering by automatically inverting the ParCFG structure to produce a correctly nested set of device constructs; store (1610) the modified ParCFG for the target application by the computer system (101); and generate and store (1612) a modified application according to the modified ParCFG.
- The computer system (101) of claim 6, wherein the ParCFG is a directed acyclic graph including vertices and directed edges that represent sequential and parallel processes performed by the target application.
- The computer system (101) of claim 7, wherein at least one vertex includes metadata attributes describing at least one of execution type, parallel width, device type, memory type, or memory transforms.
- The computer system (101) of claim 6, wherein analyzing (1606) the ParCFG includes at least one of analyzing serialization of the application, analyzing nested parallelism of the application, analyzing memory transforms of the application, analyzing parallel width of the application, or analyzing device dispatch of the application.
- The computer system (101) of claim 6, wherein generating the modified ParCFG further includes at least one of reverting deeply nested parallel constructs to sequential operation, triggering adjacent parallel regions to execute concurrently, fusing multiple parallel region nests into a singular parallel dispatch, parallelizing sequential regions that have no external data dependencies, removing unnecessary barrier synchronization primitives, or adjusting data transformations to ensure correct memory alignment and layout.
- A non-transitory computer-readable medium encoded with executable instructions which, when executed by one or more processors (111) of a computer system (101), cause the one or more processors (111) to: receive (1602) a target application; generate (1604) a parallel control flow graph, ParCFG, corresponding to the target application; characterized in that the executable instructions further cause the one or more processors (111) to: analyze (1606) the ParCFG to identify nested sets of parallel dispatches where outer target execution devices are not permitted to initiate parallel workloads to target execution devices contained in one or more inner parallel constructs; generate (1608, 1610) a modified ParCFG to correct dispatch ordering by automatically inverting the ParCFG structure to produce a correctly nested set of device constructs; store (1610) the modified ParCFG for the target application by the computer system (101); and generate and store (1612) a modified application according to the modified ParCFG.
- The non-transitory computer-readable medium of claim 11, wherein the ParCFG is a directed acyclic graph including vertices and directed edges that represent sequential and parallel processes performed by the target application.
- The non-transitory computer-readable medium of claim 12, wherein at least one vertex includes metadata attributes describing at least one of execution type, parallel width, device type, memory type, or memory transforms.
- The non-transitory computer-readable medium of claim 11, wherein analyzing the ParCFG includes at least one of analyzing serialization of the application, analyzing nested parallelism of the application, analyzing memory transforms of the application, analyzing parallel width of the application, or analyzing device dispatch of the application.
- The non-transitory computer-readable medium of claim 11, wherein generating the modified ParCFG further includes at least one of reverting deeply nested parallel constructs to sequential operation, triggering adjacent parallel regions to execute concurrently, fusing multiple parallel region nests into a singular parallel dispatch, parallelizing sequential regions that have no external data dependencies, removing unnecessary barrier synchronization primitives, or adjusting data transformations to ensure correct memory alignment and layout.
Description
TECHNICAL FIELD
The present disclosure is directed, in general, to the analysis and improvement of application code, in particular code intended to run on high-performance computing (HPC) devices.
BACKGROUND OF THE DISCLOSURE
Recent efforts by the Department of Energy and the Department of Defense have increased the visibility of, and development effort in, template metaprogramming approaches for high-performance computing. The goal of these template metaprogramming techniques is to provide a common, C++-based language and application programming interface (API) with which users can create high-performance, portable parallel applications capable of executing (in a high-performance manner) on a multitude of different HPC platforms. These template metaprogramming language constructs mask the underlying hardware architecture using platform-agnostic parallel keywords. For example, a given code construct can be executed on a standard CPU, GPU or other computing device simply by defining and targeting different execution policies.
These template metaprogramming techniques have inherent weaknesses. First, this performance portability is achieved only if the user has a deep grasp of both the template metaprogramming environment and the underlying hardware architecture. This is often not the case, given that the individuals writing the applications are domain experts rather than computer architects or compiler experts. Second, there are currently no tools to analyze the potential performance of various template metaprogramming techniques at compile time. Because these techniques are designed to hide device-specific implementation details, users are required to write, compile and subsequently execute their applications in order to detect issues in parallelism, memory layout, performance optimization and race conditions. Improved systems are desirable.
Marcelo Pasin, in "D4.4: Report on Energy-Efficiency Evaluations and Optimizations for Energy-Efficient, Secure, Resilient Task-Based Programming Model and Compiler Extensions", LEGaTO project report, 30 November 2020, discloses the use of a Mercurium analysis infrastructure to compute expressions for the number of operations in each of a set of tasks suitable for collective acceleration, the infrastructure built on top of a Parallel Control Flow Graph that represents the flow of source code as well as the parallelism and other semantics expressed by OpenMP/OmpSs directives.
US 10,901,715 B1 discloses methods, systems, and apparatus, including computer programs encoded on computer storage media, for lazy compilation and kernel fusion in dynamic computation graphs. The operations include generating an input graph based on translation of user code into an expression graph. The expression graph represents control flow dependencies of operations of the generated input graph. Optimization of the input graph is then performed by iterative application of optimization rules to the input graph; an optimized version of the input graph results from the application of those rules. A transformation graph is then generated by comparing the original input graph to the final optimized version. The transformation graph provides a blueprint such that the system may recreate the optimization of a similarly structured, later-generated input graph without having to reapply the optimization rules. During the optimization of the input graph, the system may also generate fused blocks of just-in-time operations which the system may later optimize for parallel processing on one or more graphics processing units and/or hardware-accelerated computation units.
SUMMARY OF THE DISCLOSURE
Various disclosed embodiments include systems and methods to build and analyze, at compile time and in a language-agnostic manner, performance, memory, and race-condition issues in parallel applications written using one or more template metaprogramming techniques. Disclosed embodiments can be implemented using a Multi-Environment Template API (META) in order to provide compile-time performance analysis for applications using C++ template metaprogramming models. Various embodiments can employ a "parallel control flow graph" (ParCFG), which encapsulates a directed acyclic graph (DAG) representation of the execution of a parallel application, containing vertices, directed edges and vertex metadata. The discovery and construction of the ParCFG can be performed at compile time using the template metaprogramming language-specific keywords in a language-agnostic manner. Various embodiments can perform analysis of the functionality, performance, stability and memory layout of the target parallel application using language-agnostic graph analysis techniques, and use the results of the analysis to improve the application itself. According to the invention there is provided a method for analyzing and improving a target compu