US-12626135-B2 - Dynamically dividing activations and kernels for improving memory efficiency

US 12626135 B2

Abstract

Embodiments are generally directed to dynamically dividing activations and kernels for improving memory efficiency. An embodiment of a method in a compute engine performing machine learning comprises: receiving, by a convolutional layer of a convolutional neural network (CNN) implemented on the compute engine, a plurality of activation groups contained in an input data, wherein the convolutional layer includes one or more kernel groups and the one or more kernel groups each include a plurality of kernels; determining a plurality of memory efficiency metrics based on the number of activation groups of the plurality of activation groups and the number of kernels of the plurality of kernels; selecting a first optimal number of activation groups and a second optimal number of kernels that are associated with an optimal memory efficiency metric in the plurality of memory efficiency metrics; and performing a convolutional operation on the input data based on the first optimal number and the second optimal number.
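
The per-group convolution described in the abstract can be sketched in the im2col/matrix-multiply formulation of convolution, where dividing activations into k groups and kernels into n groups becomes blockwise tiling of a matrix product. This is an illustrative software simplification of the patent's hardware setting, not the patented implementation; all function and variable names here are hypothetical.

```python
import numpy as np

def blockwise_matmul(acts, kerns, k, n):
    # acts: (M, C) im2col'd activations; kerns: (C, N) flattened kernels.
    # Output rows are split into k activation groups and output columns
    # into n kernel groups, so each (row-block, column-block) pair needs
    # only one activation tile and one kernel tile resident at a time.
    M, _ = acts.shape
    _, N = kerns.shape
    out = np.zeros((M, N))
    row_blocks = np.array_split(np.arange(M), k)
    col_blocks = np.array_split(np.arange(N), n)
    for rb in row_blocks:
        a_tile = acts[rb]              # one activation group loaded
        for cb in col_blocks:
            w_tile = kerns[:, cb]      # one kernel group loaded
            out[np.ix_(rb, cb)] = a_tile @ w_tile
    return out
```

Each inner iteration touches only one activation tile and one kernel tile; the sizes of those tiles are what the memory efficiency metric below trades off when choosing the group counts.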

Inventors

  • Xiaoming Chen
  • Anbang Yao
  • Junjie Huang
  • Tao Lv
  • Yuanke Luo

Assignees

  • INTEL CORPORATION

Dates

Publication Date
2026-05-12
Application Date
2020-11-05
Priority Date
2019-11-07

Claims (12)

  1. A method comprising: receiving, by processing circuitry of a computing device, input data associated with a convolutional layer corresponding to a convolutional neural network (CNN), wherein the input data identifies activation groups and wherein the convolutional layer includes kernels; determining memory efficiency metrics based on one or more sets of activation groups of the activation groups and one or more sets of kernels of the kernels; dynamically selecting a set of activation groups of the one or more sets of activation groups and a set of kernels of the one or more sets of kernels that are associated with a memory efficiency metric of the memory efficiency metrics, wherein the set of activation groups and the set of kernels are dynamically selected to improve memory efficiency by balancing an activation group size and a kernel group size based on memory efficiency metrics, and wherein the set of activation groups and the set of kernels are associated with a register or a buffer associated with the processing circuitry; and performing a convolutional operation on the input data associated with the convolutional layer based on the set of activation groups and the set of kernels.
  2. The method of claim 1, further comprising: receiving sparsity data relating to one or more of activation sparsity or kernel sparsity, wherein the memory efficiency metrics are determined further based on the sparsity data, wherein the memory efficiency metrics indicate a memory load efficiency during training or inference of the CNN.
  3. The method of claim 2, wherein the memory efficiency metrics are derived by calculation based on a formula comprising: (n * k * Sa * Sk) / (k * Sa + n * Sk), where: n is the number of kernels in the set of kernels, k is the number of activation groups in the set of activation groups, Sa is the activation sparsity, and Sk is the kernel sparsity.
  4. The method of claim 1, wherein the processing circuitry is coupled to a memory, the processing circuitry comprising one or more of graphics processing circuitry or application processing circuitry.
  5. An apparatus comprising: processing circuitry coupled to a memory, the processing circuitry to: receive input data associated with a convolutional layer corresponding to a convolutional neural network (CNN), wherein the input data identifies activation groups and wherein the convolutional layer includes kernels; determine memory efficiency metrics based on one or more sets of activation groups of the activation groups and one or more sets of kernels of the kernels; dynamically select a set of activation groups of the one or more sets of activation groups and a set of kernels of the one or more sets of kernels that are associated with a memory efficiency metric of the memory efficiency metrics, wherein the set of activation groups and the set of kernels are dynamically selected to improve memory efficiency by balancing an activation group size and a kernel group size based on memory efficiency metrics, and wherein the set of activation groups and the set of kernels are associated with a register or a buffer associated with the processing circuitry; and perform a convolutional operation on the input data associated with the convolutional layer based on the set of activation groups and the set of kernels.
  6. The apparatus of claim 5, wherein the processing circuitry is further to: receive sparsity data relating to one or more of activation sparsity and kernel sparsity, wherein the memory efficiency metrics are determined further based on the sparsity data, wherein the memory efficiency metrics indicate a memory load efficiency during training or inference of the CNN.
  7. The apparatus of claim 6, wherein the memory efficiency metrics are derived by calculation based on a formula comprising: (n * k * Sa * Sk) / (k * Sa + n * Sk), where: n is the number of kernels in the set of kernels, k is the number of activation groups in the set of activation groups, Sa is the activation sparsity, and Sk is the kernel sparsity.
  8. The apparatus of claim 5, wherein the processing circuitry is coupled to a memory, the processing circuitry comprising one or more of graphics processing circuitry or application processing circuitry.
  9. At least one non-transitory computer-readable medium having stored thereon instructions which, when executed, cause a computing device to perform operations comprising: receiving input data associated with a convolutional layer corresponding to a convolutional neural network (CNN), wherein the input data identifies activation groups and wherein the convolutional layer includes kernels; determining memory efficiency metrics based on one or more sets of activation groups of the activation groups and one or more sets of kernels of the kernels; dynamically selecting a set of activation groups of the one or more sets of activation groups and a set of kernels of the one or more sets of kernels that are associated with a memory efficiency metric of the memory efficiency metrics, wherein the set of activation groups and the set of kernels are dynamically selected to improve memory efficiency by balancing an activation group size and a kernel group size based on memory efficiency metrics, and wherein the set of activation groups and the set of kernels are associated with a register or a buffer associated with processing circuitry of the computing device; and performing a convolutional operation on the input data associated with the convolutional layer based on the set of activation groups and the set of kernels.
  10. The non-transitory computer-readable medium of claim 9, wherein the operations further comprise: receiving sparsity data relating to one or more of activation sparsity and kernel sparsity, wherein the memory efficiency metrics are determined further based on the sparsity data, wherein the memory efficiency metrics indicate a memory load efficiency during training or inference of the CNN.
  11. The non-transitory computer-readable medium of claim 10, wherein the memory efficiency metrics are derived by calculation based on a formula comprising: (n * k * Sa * Sk) / (k * Sa + n * Sk), where: n is the number of kernels in the set of kernels, k is the number of activation groups in the set of activation groups, Sa is the activation sparsity, and Sk is the kernel sparsity.
  12. The non-transitory computer-readable medium of claim 9, wherein the computing device comprises processing circuitry coupled to a memory, the processing circuitry comprising one or more of graphics processing circuitry or application processing circuitry.
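
The selection step of claims 1-3 can be sketched as enumerating candidate divisions, scoring each with the metric of claim 3, and keeping the best pair. This is a minimal illustration, assuming the metric is to be maximized and the candidate counts are supplied by the caller; the function names are hypothetical, not from the patent.

```python
def memory_efficiency(n, k, s_a, s_k):
    # Metric from claim 3: (n * k * Sa * Sk) / (k * Sa + n * Sk),
    # where Sa is the activation sparsity and Sk is the kernel sparsity.
    # Roughly: useful work per element loaded; larger is better.
    return (n * k * s_a * s_k) / (k * s_a + n * s_k)

def select_division(kernel_counts, activation_counts, s_a, s_k):
    # Enumerate the candidate (n, k) divisions and keep the pair with
    # the best memory efficiency metric.
    return max(
        ((n, k) for n in kernel_counts for k in activation_counts),
        key=lambda nk: memory_efficiency(nk[0], nk[1], s_a, s_k),
    )
```

Note that the metric grows monotonically in both n and k, so the candidate sets must be bounded in practice, e.g. by the register or buffer capacity that claim 1 associates with the selected sets.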

Description

CLAIM TO PRIORITY

This patent application is related to and, under 35 U.S.C. § 119, claims the benefit of and priority to Chinese Patent Application No. 201911082113.X, entitled DYNAMICALLY DIVIDING ACTIVATIONS AND KERNELS FOR IMPROVING MEMORY EFFICIENCY, by Xiaoming Chen, et al., filed Nov. 7, 2019, the contents of which are incorporated herein by reference.

FIELD

Embodiments relate generally to data processing and more particularly to data processing via a general-purpose graphics processing unit.

BACKGROUND OF THE DESCRIPTION

Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data such as, for example, linear interpolation, tessellation, rasterization, texture mapping, depth testing, etc. Traditionally, graphics processors used fixed function computational units to process graphics data; however, more recently, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data. To further increase performance, graphics processors typically implement processing techniques such as pipelining that attempt to process, in parallel, as much graphics data as possible throughout the different parts of the graphics pipeline. Parallel graphics processors with single instruction, multiple thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In an SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming, Chapter 3, pages 37-51 (2013).
BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited features of the present embodiments can be understood in detail, a more particular description of the embodiments, briefly summarized above, may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments and are therefore not to be considered limiting of their scope.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;
FIG. 2A-2D illustrate parallel processor components, according to an embodiment;
FIG. 3A-3C are block diagrams of graphics multiprocessors and multiprocessor-based GPUs, according to embodiments;
FIG. 4A-4F illustrate an exemplary architecture in which a plurality of GPUs is communicatively coupled to a plurality of multi-core processors;
FIG. 5 illustrates a graphics processing pipeline, according to an embodiment;
FIG. 6 illustrates a machine learning software stack, according to an embodiment;
FIG. 7 illustrates a general-purpose graphics processing unit, according to an embodiment;
FIG. 8 illustrates a multi-GPU computing system, according to an embodiment;
FIG. 9A-9B illustrate layers of exemplary deep neural networks;
FIG. 10 illustrates an exemplary recurrent neural network;
FIG. 11 illustrates training and deployment of a deep neural network;
FIG. 12 is a block diagram illustrating distributed learning;
FIG. 13 illustrates an exemplary inferencing system on a chip (SOC) suitable for performing inferencing using a trained model;
FIG. 14 is a block diagram of a processing system, according to an embodiment;
FIG. 15A-15C illustrate computing systems and graphics processors provided by embodiments described herein;
FIG. 16A-16C illustrate block diagrams of additional graphics processor and compute accelerator architectures provided by embodiments described herein;
FIG. 17 is a block diagram of a graphics processing engine of a graphics processor in accordance with some embodiments;
FIG. 18A-18B illustrate thread execution logic including an array of processing elements employed in a graphics processor core according to embodiments described herein;
FIG. 19 illustrates an additional execution unit, according to an embodiment;
FIG. 20 is a block diagram illustrating graphics processor instruction formats according to some embodiments;
FIG. 21 is a block diagram of a graphics processor according to another embodiment;
FIG. 22A-22B illustrate a graphics processor command format and command sequence, according to some embodiments;
FIG. 23 illustrates exemplary graphics software architecture for a data processing system according to some embodiments;
FIG. 24A is a block diagram illustrating an IP core development system, according to an embodiment;
FIG. 24B illustrates a cross-section side view of an integrated circuit package assembly, according to some embodiments described herein;
FIG. 24C illustrates a package assembly that includes multiple units of hardware logic chiplets connected to a substrate (e.g., base die);
FIG. 24D illustrates a