
US-12619515-B2 - System and method for profiling on-chip performance of neural network execution

US 12619515 B2

Abstract

A method includes: accessing a static schedule of a target neural network for execution by a processing device, the target neural network including a set of layers; generating a set of expected performance metrics of the target neural network based on the static schedule, the set of expected performance metrics including a first expected performance metric for a first layer in the set of layers; accessing a set of runtime performance metrics captured during execution of the target neural network by the processing device, the set of runtime performance metrics including a first runtime performance metric for the first layer; and, in response to detecting a difference between the first runtime performance metric and the first expected performance metric exceeding a threshold, serving an alert at a user interface.
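The core comparison described in the abstract (per-layer expected metric versus runtime metric, with a threshold-gated alert) can be sketched as follows. This is a minimal illustration, not the patented implementation; all names, metric units, and the threshold value are hypothetical.

```python
def find_deviant_layers(expected, runtime, threshold):
    """Flag layers whose runtime metric differs from the expected
    metric by more than a threshold (hypothetical sketch of the
    abstract's comparison step)."""
    deviants = []
    for layer, expected_metric in expected.items():
        runtime_metric = runtime[layer]
        # The abstract compares per-layer metrics and alerts on a
        # threshold-exceeding difference.
        if abs(runtime_metric - expected_metric) > threshold:
            deviants.append(layer)
    return deviants

# Hypothetical per-layer cycle counts.
expected = {"conv1": 1200, "conv2": 2400, "fc1": 800}
runtime = {"conv1": 1250, "conv2": 3900, "fc1": 820}
print(find_deviant_layers(expected, runtime, threshold=500))  # ['conv2']
```

In a real profiler, the flagged layers would drive the alert served at the user interface rather than a print statement.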

Inventors

  • Satyanarayana Raju Uppalapati
  • Rehan Hameed
  • Rajasekhar Reddy Ereddy
  • Sameek Banerjee
  • Mohammed Shahim
  • Shilpa Kallem
  • Suresh Kumar Vennam
  • Abhilash Bharath Ghanore
  • Raju Datla
  • Wajahat Qadeer

Assignees

  • Deep Vision Inc.

Dates

Publication Date
2026-05-05
Application Date
2022-12-20

Claims (20)

  1. A method for profiling neural network performance, the method comprising: accessing a static schedule of a target neural network for execution by a processing device, the target neural network comprising a set of layers defining a set of operations; generating a set of expected performance metrics for the target neural network based on the static schedule, the set of expected performance metrics comprising a subset of expected performance metrics for each layer in the set of layers; activating a set of performance monitor registers in the processing device based on signals stored in a memory of the processing device; capturing, by the set of performance monitor registers, a set of runtime performance metrics during execution of the target neural network by the processing device; for each layer in the set of layers: associating a subset of runtime performance metrics, in the set of runtime performance metrics, with the layer based on detection of an identifier assigned to the layer during execution of the layer; calculating a first difference between a runtime performance metric, in the subset of runtime performance metrics for the layer, and a corresponding estimated performance metric in the subset of estimated performance metrics for the layer; and in response to detecting the first difference exceeding a first threshold, identifying the layer as a deviant layer in a subset of deviant layers in the set of layers; and serving, at a user interface, a notification indicating the subset of deviant layers.
  2. The method of claim 1: wherein generating the set of expected performance metrics comprises generating the set of expected performance metrics further comprising a second subset of expected performance metrics for each operation in the set of operations; wherein generating the set of runtime performance metrics comprises generating the set of runtime performance metrics further comprising a second subset of runtime performance metrics for each operation in the set of operations; and further comprising: for each operation in the set of operations: calculating a second difference between a runtime performance metric, in the second subset of runtime performance metrics for the operation, and a corresponding expected performance metric in the second subset of estimated performance metrics for the operation; and in response to detecting the second difference exceeding a second threshold, identifying the operation as a deviant operation in a subset of deviant operations in the set of operations; and serving, at the user interface, a notification indicating the subset of deviant operations.
  3. The method of claim 1: wherein capturing the set of runtime performance metrics comprises capturing the set of runtime performance metrics comprising a first runtime metric representing a first number of compute cycles per instruction in a first layer in the set of layers; and further comprising: in response to detecting the first number of compute cycles per instruction in the first layer exceeding a second threshold, identifying the first layer as a compute-bound layer in a subset of compute-bound layers in the set of layers; and serving, at the user interface, a notification indicating the subset of compute-bound layers.
  4. The method of claim 1: wherein capturing the set of runtime performance metrics comprises capturing the set of runtime performance metrics comprising a first runtime metric representing a first number of data movement cycles per instruction in a first layer in the set of layers; and further comprising: in response to detecting the first number of data movement cycles per instruction in the first layer exceeding a second threshold, identifying the first layer as a bandwidth-bound layer in a subset of bandwidth-bound layers in the set of layers; and serving, at the user interface, a notification indicating the subset of bandwidth-bound layers.
  5. The method of claim 1, further comprising: generating a first execution timeline representing expected execution of the target neural network based on the static schedule and the set of expected performance metrics; generating a second execution timeline representing execution of the target neural network by the processing device based on the set of runtime performance metrics; and serving, at the user interface, a visualization depicting the first execution timeline and the second execution timeline.
  6. The method of claim 1: further comprising accessing a cost model defining a set of costs for each operation in the set of operations; and wherein generating the set of expected performance metrics comprises generating the set of expected performance metrics based on the static schedule and the cost model.
  7. The method of claim 1: wherein accessing the static schedule comprises accessing the static schedule assigning a set of nodes, defining the set of operations in a directed acyclic graph, to a set of resources in the processing device; wherein capturing the set of runtime performance metrics comprises capturing the set of runtime performance metrics comprising a second runtime performance metric representing memory bandwidth utilization for a first node in the set of nodes, the first node comprising a subset of operations in the set of operations; and further comprising serving, at the user interface, a notification indicating the second runtime performance metric.
  8. The method of claim 1, wherein activating the set of performance monitor registers comprises: writing a set of signals to a set of memory regions, in the memory of the processing device, mapped to the set of performance monitor registers; and selectively activating the set of performance monitor registers based on the set of signals stored in the set of memory regions.
  9. The method of claim 1: wherein capturing the set of runtime performance metrics comprises capturing the set of runtime performance metrics further comprising: a second runtime performance metric representing a computational cost of a first layer in the set of layers; and a third runtime performance metric representing a memory bandwidth cost of the first layer; and further comprising serving, at the user interface, a notification indicating the second runtime performance metric and the third runtime performance metric.
  10. The method of claim 9, wherein capturing the set of runtime performance metrics comprises generating the second runtime performance metric based on a second subset of runtime performance metrics in the set of runtime performance metrics, the second subset of runtime performance metrics representing computation costs of a first subset of operations defined by the first layer.
  11. The method of claim 1, wherein associating the subset of runtime performance metrics for each layer in the set of layers comprises: in response to detecting a change in execution from the layer to a succeeding layer in the set of layers, extracting the subset of runtime performance metrics from the set of performance monitor registers; and transferring the subset of runtime performance metrics to a main memory of the processing unit via a direct memory access core.
  12. The method of claim 11, wherein extracting the subset of runtime performance metrics from the set of performance monitor registers comprises: detecting the change in execution from the layer to the succeeding layer by: reading a first set of bits corresponding to the identifier assigned to the layer in a first dequeued command in a control processor of the processing device; and reading a second set of bits corresponding to a second identifier assigned to the succeeding layer in a second dequeued command in the control processor; and in response to detecting the change in execution from the layer to the succeeding layer, extracting the subset of runtime performance metrics from the set of performance monitor registers.
  13. A method for profiling neural network performance, the method comprising: accessing a first static schedule of a target neural network for execution by a first processing device comprising a first set of resources, the target neural network comprising a set of layers defining a set of operations; generating a first set of expected performance metrics for the target neural network based on the first static schedule, the first set of expected performance metrics comprising a first expected performance metric for a first layer in the set of layers; and during execution of the target neural network by the first processing device: activating a set of performance monitor registers in the processing device based on signals stored in a memory of the processing device; capturing, by the set of performance monitor registers, a first set of runtime performance metrics for the target neural network; associating a first runtime performance metric, in the first set of runtime performance metrics, with the first layer based on detection of a first identifier assigned to the first layer during execution of the first layer; in response to detecting a difference between the first runtime performance metric and the first expected performance metric exceeding a first threshold, identifying the first layer as a deviant layer in a subset of deviant layers in the set of layers; and serving, at a user interface, a notification indicating the subset of deviant layers.
  14. The method of claim 13: wherein capturing the first set of runtime performance metrics comprises capturing the first set of runtime performance metrics comprising a second runtime performance metric representing a first number of inferences per second resulting from execution of the target neural network by the first processing device; and further comprising serving, at the user interface, a notification indicating the second runtime performance metric.
  15. The method of claim 14: wherein accessing the first static schedule comprises accessing the first static schedule of the target neural network for execution by the first processing device characterized by a first device type; and further comprising: accessing a second static schedule of the target neural network for execution by a second processing device characterized by a second device type different from the first device type; generating a second set of expected performance metrics for the target neural network based on the second static schedule; generating a second set of runtime performance metrics during execution of the target neural network by the second processing device, the second set of runtime performance metrics comprising a third runtime performance metric representing a second number of inferences per second resulting from execution of the target neural network by the second processing device; and in response to detecting the first number of inferences per second exceeding the second number of inferences per second, serving, at the user interface, a notification indicating the first processing device as a higher performing processing device for the target neural network than the second processing device.
  16. The method of claim 13: wherein generating the first set of expected performance metrics comprises generating the first set of expected performance metrics comprising a second estimated performance metric representing an expected accuracy of the first layer in the set of layers; wherein capturing the first set of performance metrics comprises capturing a set of saturation counts by a set of saturation counters in the set of performance monitor registers; wherein capturing the first set of runtime performance metrics comprises generating a second runtime performance metric, in the first set of runtime performance metrics, based on the set of saturation counts, the second runtime performance metric representing an accuracy of the first layer during execution of the target neural network by the first processing device; and further comprising, in response to detecting a difference between the second runtime performance metric and the second expected performance metric exceeding a first threshold, serving an alert at the user interface.
  17. The method of claim 13: wherein capturing the first set of runtime performance metrics comprises capturing the first set of runtime performance metrics comprising a second runtime performance metric representing utilization of a first resource in the first set of resources; and further comprising, in response to detecting the second runtime performance metric exceeding a second threshold, serving an alert at the user interface.
  18. The method of claim 13, further comprising writing, to a first memory region mapped to a first performance monitor register in the set of performance monitor registers, a signal representing a command to disable the first performance monitor register.
  19. The method of claim 13: wherein capturing the first set of runtime performance metrics comprises capturing the first set of runtime performance metrics comprising a second runtime performance metric representing utilization of the first processing device; and further comprising, in response to detecting the second runtime performance metric exceeding a second threshold, load balancing execution of the target neural network between the first processing device and a second processing device.
  20. A method for profiling neural network performance, the method comprising: accessing a static schedule of a target neural network for execution by a first processing device, the target neural network comprising a set of layers; generating a set of expected performance metrics of the target neural network based on the static schedule, the set of expected performance metrics comprising a first expected performance metric for a first layer in the set of layers; activating a set of performance monitor registers in the first processing device based on signals stored in a memory of the first processing device; capturing, by the set of performance monitor registers, a set of runtime performance metrics during execution of the target neural network by the first processing device, the set of runtime performance metrics comprising: a first runtime performance metric for the first layer; and a second runtime performance metric representing utilization of the first processing device; associating the first runtime performance metric with the first layer based on detection of a first identifier assigned to the first layer during execution of the first layer; in response to detecting a difference between the first runtime performance metric and the first expected performance metric exceeding a threshold, identifying the first layer as a deviant layer in a subset of deviant layers in the set of layers; and serving, at a user interface, a notification indicating the subset of deviant layers.
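Claims 3 and 4 classify a layer as compute-bound or bandwidth-bound when its compute cycles per instruction or data movement cycles per instruction exceed a threshold. A minimal sketch of that classification follows; the function name, threshold values, and return convention are all hypothetical.

```python
def classify_layer(compute_cycles_per_instr, dma_cycles_per_instr,
                   compute_threshold=4.0, bandwidth_threshold=4.0):
    """Label a layer per the scheme in claims 3-4: compute-bound if
    compute cycles per instruction exceed a threshold, bandwidth-bound
    if data movement cycles per instruction exceed a threshold.
    Thresholds here are hypothetical."""
    labels = []
    if compute_cycles_per_instr > compute_threshold:
        labels.append("compute-bound")
    if dma_cycles_per_instr > bandwidth_threshold:
        labels.append("bandwidth-bound")
    # A layer can be both, or neither.
    return labels or ["balanced"]

print(classify_layer(6.2, 1.1))  # ['compute-bound']
print(classify_layer(2.0, 9.5))  # ['bandwidth-bound']
```

In the claimed method, the resulting subsets of compute-bound and bandwidth-bound layers would each be surfaced as a notification at the user interface.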

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/291,893, filed on 20 Dec. 2021, which is incorporated in its entirety by this reference. This application is related to U.S. patent application Ser. No. 17/127,904, filed on 18 Dec. 2020, U.S. patent application Ser. No. 17/356,372, filed on 23 Jun. 2021, U.S. patent application Ser. No. 17/211,707, filed on 24 Mar. 2021, U.S. patent application Ser. No. 17/331,585, filed on 26 May 2021, and U.S. patent application Ser. No. 17/331,590, filed on 26 May 2021, each of which is incorporated in its entirety by this reference.

TECHNICAL FIELD

This invention relates generally to the field of artificial neural network computation and more specifically to a new and useful system and method for profiling on-chip performance of neural network execution within the field of artificial neural network computation.

BRIEF DESCRIPTION OF THE FIGURES

FIG. 1 is a flowchart representation of a method;
FIG. 2 is a flowchart representation of one variation of the method;
FIG. 3 is a flowchart representation of one variation of the method;
FIG. 4 is a graphical representation of one variation of the method;
FIG. 5 is a graphical representation of one variation of the method; and
FIG. 6 is a flowchart representation of one variation of the method.

DESCRIPTION OF THE EMBODIMENTS

The following description of embodiments of the invention is not intended to limit the invention to these embodiments but rather to enable a person skilled in the art to make and use this invention. Variations, configurations, implementations, example implementations, and examples described herein are optional and are not exclusive to the variations, configurations, implementations, example implementations, and examples they describe. The invention described herein can include any and all permutations of these variations, configurations, implementations, example implementations, and examples.

1. METHODS

As shown in FIGS. 1-3, a method S100 for profiling neural network performance includes: accessing a static schedule of a target neural network for execution by a processing device in Block S102, the target neural network including a set of layers defining a set of operations; generating a set of expected performance metrics for the target neural network based on the static schedule in Block S104, the set of expected performance metrics including a first subset of expected performance metrics for the set of operations; and generating a set of runtime performance metrics during execution of the target neural network by the processing device in Block S122, the set of runtime performance metrics including a first subset of runtime performance metrics for the set of operations. The method S100 further includes, for each operation in the set of operations: calculating a first difference between a runtime performance metric, in the first subset of runtime performance metrics, for an operation and a corresponding estimated performance metric, in the first subset of estimated performance metrics, for the operation in Block S124; and, in response to detecting the first difference exceeding a first threshold, adding the operation to a subset of deviant operations in the set of operations in Block S128. The method S100 further includes serving, at a user interface, a notification indicating the subset of deviant operations in Block S132.

1.1 Variation: Real-Time Profiling

As shown in FIGS. 1-3, one variation of the method S100 for profiling neural network performance includes: accessing a first static schedule of a target neural network for execution by a first processing device including a first set of resources in Block S102, the target neural network including a set of layers defining a set of operations; and generating a first set of expected performance metrics for the target neural network based on the first static schedule in Block S104, the first set of expected performance metrics including a first expected performance metric for a first operation in the set of operations. This variation of the method S100 further includes, during execution of the target neural network by the first processing device: accessing a first set of performance values captured by a set of performance counters in the first processing device in Block S120; generating a first set of runtime performance metrics for the target neural network based on the first set of performance values in Block S122, the first set of runtime performance metrics including a first runtime performance metric for the first operation; and, in response to detecting a difference between the first runtime performance metric and the first expected performance metric exceeding a first threshold, serving an alert at a user interface in Block S132.

1.2 Variation: Layer-Level Performance Deviation

As shown in FIGS. 1-3, one variation of the method S100 for profiling neural network performance includes: acces
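The method's counter-activation step (writing enable signals to memory regions mapped to performance monitor registers, as in claim 8 and Block S120's performance counters) might look like the following sketch. The base address, register offsets, and the in-memory stand-in for the device's register file are all hypothetical; a real device driver would write through memory-mapped I/O instead.

```python
# Hypothetical memory-mapped performance-monitor control.
PMU_BASE = 0x4000_0000  # hypothetical base address of the PMU register file
PMU_ENABLE_OFFSETS = {"cycles": 0x00, "dma": 0x04, "stalls": 0x08}

class MappedMemory:
    """Stand-in for a device's memory-mapped register file."""
    def __init__(self):
        self.words = {}

    def write(self, addr, value):
        self.words[addr] = value

    def read(self, addr):
        return self.words.get(addr, 0)

def activate_counters(mem, names):
    """Selectively enable performance monitor registers by writing an
    enable signal to each mapped memory region (claim 8 sketch)."""
    for name in names:
        mem.write(PMU_BASE + PMU_ENABLE_OFFSETS[name], 1)

mem = MappedMemory()
activate_counters(mem, ["cycles", "dma"])
print(mem.read(PMU_BASE + PMU_ENABLE_OFFSETS["stalls"]))  # 0 (left disabled)
```

Claim 18's disable command would be the symmetric operation: writing a disable signal to the same mapped region.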