
US-12619857-B2 - Methods and apparatuses for bottleneck stages in neural-network processing

US 12619857 B2

Abstract

Methods and apparatuses herein improve bottleneck-layer processing in neural networks. Example advantages include reducing the number of accesses needed to external memory, allowing processing in successive bottleneck layers to run in parallel based on the use of partial convolutional results, and balancing the amount of “local” memory used for storing convolutional results against the computational overhead of recomputing partial results. One aspect of the methods and apparatuses involves co-locating arithmetic and logical operators and temporary storage in the same data path, an approach that yields both higher performance and greater energy efficiency in the implementation of bottleneck layers for neural-network processing.
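
As a rough, illustrative sketch only, with all names, shapes, and the choice of Python/NumPy assumed rather than taken from the patent, the following fragment shows why partial results enable such parallelism: a depth-wise R×R stage can start consuming rows of expanded results as soon as R of them exist, so only an R-row local buffer is needed instead of the full expanded tensor.

    import numpy as np

    def pipelined_rows(x, w_expand, w_depthwise):
        """Row-wise pipelining sketch: the depth-wise stage consumes rows of
        1x1-expanded results as they are produced, holding only R rows in a
        local buffer instead of the full H x W x M expanded tensor.
        """
        H, W, N = x.shape
        R, _, M = w_depthwise.shape
        row_buffer = np.zeros((R, W, M))   # small "local" stage-input buffer
        for r in range(H):
            # First stage, one row at a time: expand N channels to M at each
            # grid position, overwriting the oldest buffered row.
            row_buffer[r % R] = x[r] @ w_expand
            if r >= R - 1:
                # Second stage can already run on the R most recent rows,
                # long before the first stage has finished the whole grid.
                rows = np.stack([row_buffer[(r - R + 1 + k) % R]
                                 for k in range(R)])          # (R, W, M)
                out = np.zeros((W - R + 1, M))
                for j in range(out.shape[0]):                 # 'valid' columns
                    out[j] = np.sum(rows[:, j:j + R, :] * w_depthwise,
                                    axis=(0, 1))
                yield out   # one row of depth-wise results per iteration

Each yielded row can in turn feed the subsequent compression stage immediately, which is the overlap in processing referred to above.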

Inventors

  • Anders Berkeman
  • Sven Karlsson

Assignees

  • TELEFONAKTIEBOLAGET LM ERICSSON (PUBL)

Dates

Publication Date: 2026-05-05
Application Date: 2020-10-14

Claims (19)

  1. A hardware accelerator circuit configured as a bottleneck stage for neural-network processing and comprising: convolution circuitry comprising a data path that includes a series of first, second, and third convolution stages; memory circuitry comprising first, second, and third stage-input buffers for the first, second, and third convolution stages, along with an output data buffer and a weight buffer; and control circuitry and associated bus interface circuitry, the control circuitry configured to read input data vectors in from an external memory into the first stage-input buffer and to write corresponding output data vectors from the output data buffer back to the external memory; wherein the first convolution stage comprises a 1×1×M convolution circuit that is configured to use respective weights from the weight buffer to produce first convolution results for respective input data vectors held in the first stage-input buffer, each input data vector being one among a set of input data vectors to be processed by the hardware accelerator circuit and comprising a stack of N channel values corresponding to a respective grid position of an input data grid associated with the set of input data vectors, the first convolution results for each input data vector being an expanded vector of M channel values, where M>N; wherein the second convolution stage comprises a depth-wise R×R×1 convolution circuit that is configured to use respective weights from the weight buffer to produce second convolution results for respective R×R subsets of the expanded vectors, and wherein the second stage-input buffer holds, at least incrementally, the first convolution results produced by the first convolution stage; wherein the third convolution stage comprises a further 1×1×S convolution circuit that is configured to use respective weights from the weight buffer to produce third convolution results from the second convolution results, wherein the third stage-input buffer holds, at least incrementally, the second convolution results produced by the second convolution stage, and wherein the third convolution results comprise compressed vectors corresponding to each input data vector being processed in a current processing cycle of the hardware accelerator circuit, each compressed vector having a length S, where S<M; and wherein the control circuitry is configured to write respective output data vectors held in the output data buffer back to the external memory, each output data vector corresponding to a respective one of the input data vectors in the set of input data vectors and comprising the corresponding compressed vector produced by the third convolution stage, or a combination of the corresponding compressed vector produced by the third convolution stage and the respective input data vector.
  2. The hardware accelerator circuit of claim 1, wherein the second convolution stage is configured to begin producing the second convolution results based on incremental first convolution results, as output from the first convolution stage for buffering in the second stage-input buffer, and wherein the third convolution stage is configured to begin producing the third convolution results based on incremental second convolution results, as output from the second convolution stage for buffering in the third stage-input buffer.
  3. The hardware accelerator circuit of claim 1, wherein at least one of the second and third stage-input buffers is sized to hold only incremental convolution results from the prior convolution stage, and the corresponding convolution stage is configured to perform partial re-computations, to account for its stage-input buffer holding only incremental convolution results.
  4. The hardware accelerator circuit of claim 1, wherein the control circuitry is configured to schedule an order of convolutional operations carried out by the hardware accelerator circuit to account for the second and third stage-input buffers being too small to hold complete sets of stage-specific convolution results for a respective subset of input data vectors processed in each processing cycle of the hardware accelerator circuit.
  5. The hardware accelerator circuit of claim 1, wherein the control circuitry is configured to read in respective input data vectors from the set of input data vectors on a memory-word basis, in dependence on each input data vector being stored in a corresponding memory word or in corresponding contiguous memory words, depending on the vector length N of the input data vectors versus a memory-word length used by the external memory.
  6. The hardware accelerator circuit of claim 1, wherein the control circuitry is configured to populate the weight buffer from stored data in an internal memory of the hardware accelerator circuit or based on receiving weights from an external controller, or wherein the weight buffer is configured to be writable directly by the external controller.
  7. The hardware accelerator circuit of claim 1, wherein the combination of the corresponding compressed vector produced by the third convolution stage and the respective input data vector comprises a concatenation or a summation of the corresponding compressed vector and the respective input data vector.
  8. The hardware accelerator circuit of claim 1, wherein R is an integer value greater than or equal to 3.
  9. The hardware accelerator circuit of claim 1, wherein S equals N.
  10. An integrated System-on-a-Chip (SoC) comprising the hardware accelerator circuit of claim 1, the SoC further comprising a Central Processing Unit (CPU) configured to control the hardware accelerator circuit for neural-network processing and a Direct Memory Access (DMA) controller configured to interface the hardware accelerator circuit with the external memory via the bus interface circuitry.
  11. A wireless communication device comprising: the hardware accelerator circuit of claim 1; communication circuitry configured for transmitting communication signals to and receiving communication signals from a wireless communication network; and processing circuitry operatively associated with the communication circuitry and configured to implement neural-network processing of data received via the communication circuitry or acquired via one or more sensors of the wireless communication device, including being configured to use the hardware accelerator circuit for implementation of one or more bottleneck stages used in the neural-network processing.
  12. The wireless communication device of claim 11, wherein the one or more sensors comprise one or more image sensors, and wherein the set of input data vectors corresponds to neural-network processing of input image data obtained from the one or more image sensors.
  13. The wireless communication device of claim 12, wherein the input data grid associated with the set of input data vectors represents spatial locations within an image embodied in the image data.
  14. The wireless communication device of claim 11, wherein R is an integer value greater than or equal to 3.
  15. The wireless communication device of claim 11, wherein S equals N.
  16. A method of neural-network processing, the method comprising: arranging channel data as input data vectors and storing the input data vectors in a memory circuit on a memory-word basis, the channel data resulting from processing original input data through one or more layers of a neural network and each input data vector corresponding to a respective grid position in a data grid associated with the original input data and comprising a stack of N channel values corresponding to N channels of data resulting from the processing; and processing the input data vectors in a hardware accelerator circuit that implements a bottleneck stage of the neural-network processing and reads in respective ones of the input data vectors from the memory circuit on the memory-word basis; wherein processing the input data vectors in the hardware accelerator circuit comprises: producing first convolution results via a first convolution stage that performs 1×1×M convolutions on the input data vectors read into the hardware accelerator circuit, where M>N; producing second convolution results via a second convolution stage that performs R×R×1 depth-wise convolutions of the first convolution results produced for respective subsets of the input data vectors read into the hardware accelerator circuit; producing third convolution results via a third convolution stage that performs 1×1×S convolutions on the second convolution results, the third convolution results being respective compressed data vectors for each subset of input data vectors read into the hardware accelerator circuit, where S<M; and writing output data vectors back to the memory circuit, or another memory circuit, wherein each output data vector corresponds to a respective one of the input data vectors in the set of input data vectors and comprises the corresponding compressed vector produced by the third convolution stage, or a combination of the corresponding compressed vector produced by the third convolution stage and the respective input data vector.
  17. The method of claim 16, wherein the channel data is arranged in the memory circuit as a cube of input data columns, each input data column being a respective one of the input data vectors, and wherein writing the output data vectors from the hardware accelerator circuit back to the memory circuit or the other memory circuit comprises forming a cube of output data columns corresponding to the cube of input data columns, each output data column being a respective one of the output data vectors.
  18. The method of claim 16, wherein processing the input data vectors in the hardware accelerator circuit further includes controlling an order of convolution operations performed by the convolution stages of the hardware accelerator circuit to allow a succeeding convolution stage to begin producing convolution results before the convolution results of the preceding convolution stage are complete.
  19. The method of claim 16, wherein the combination of the corresponding compressed vector produced by the third convolution stage and the respective input data vector comprises a concatenation or a summation of the corresponding compressed vector and the respective input data vector.
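
For orientation only, and not as part of the patent text, the following minimal Python/NumPy sketch shows a dense version of the three-stage computation recited in claims 1 and 16: a 1×1×M expansion, a depth-wise R×R×1 convolution, and a 1×1×S projection, with an optional residual summation when S equals N (claims 7, 9, 15, and 19). The function name, weight shapes, and zero-padding choice are illustrative assumptions, not taken from the claims.

    import numpy as np

    def bottleneck_stage(x, w_expand, w_depthwise, w_project, combine=True):
        """Dense sketch of the claimed bottleneck stage.

        x           : (H, W, N) input grid, one N-channel vector per position
        w_expand    : (N, M) weights for the 1x1xM expansion stage, M > N
        w_depthwise : (R, R, M) weights for the depth-wise RxRx1 stage
        w_project   : (M, S) weights for the 1x1xS projection stage, S < M
        """
        H, W, N = x.shape
        R = w_depthwise.shape[0]
        pad = R // 2

        # Stage 1: 1x1xM convolution -- expand each N-channel input vector
        # to M channels; a matrix product at every grid position.
        expanded = x @ w_expand                       # (H, W, M)

        # Stage 2: depth-wise RxRx1 convolution -- filter each of the M
        # channels independently over an RxR spatial window (zero padding
        # assumed here, so the grid size is preserved).
        padded = np.pad(expanded, ((pad, pad), (pad, pad), (0, 0)))
        dw = np.zeros_like(expanded)
        for i in range(H):
            for j in range(W):
                window = padded[i:i + R, j:j + R, :]  # (R, R, M)
                dw[i, j, :] = np.sum(window * w_depthwise, axis=(0, 1))

        # Stage 3: 1x1xS convolution -- compress each M-channel vector to S.
        out = dw @ w_project                          # (H, W, S)

        # Optional combination with the input vectors (summation form,
        # which requires S == N, per claims 9 and 15).
        if combine and out.shape[-1] == N:
            out = out + x
        return out

Unlike this dense sketch, the claimed hardware computes the three stages in pipelined fashion on partial results held in small stage-input buffers, so that complete intermediate tensors never need to be stored locally or written to the external memory.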

Description

TECHNICAL FIELD

Methods and apparatuses disclosed herein relate to neural-network processing, and specifically to bottleneck stages in neural-network processing.

BACKGROUND

Neural networks are currently the state of the art for various kinds of data processing, one example being image-processing applications such as object detection, classification, and segmentation. Other example uses include speech and text processing, and diagnostics in everything from industrial machinery to healthcare. A typical neural network used for image processing or other pattern-related analysis is a relatively regular feed-forward structure with few or no data dependencies, properties that lend themselves well to hardware implementation. However, the number of arithmetic operations required for a single inference is significant, on the order of hundreds or thousands of millions of multiplications and additions. Inference at a high frame rate for real-time video processing increases this number by a further one or two orders of magnitude. Here, “inference” refers to a trained neural network inferring a result from input data, such as classifying an object depicted in input image data.

Convolutional neural networks, “CNNs” or “ConvNets”, are a class of so-called “deep” neural networks that are often used for image processing, although CNNs find use in an increasing variety of applications. Extensive example information regarding CNN theory, structure, and operation appears in Michelucci, Umberto, Advanced Applied Deep Learning: Convolutional Neural Networks and Object Detection, Apress, 2019. As a simplified overview, CNNs use multiple layers to recognize patterns or features of interest present within an input set of data, which, as a non-limiting example, may be an image. While the initial layer or layers of the CNN may include “filters” that recognize simple patterns or features, each layer feeds the next, and inner layers of the CNN generally include filters that recognize patterns of patterns or other abstractions and remove the spatial dependencies.

Each convolutional filter or “kernel” used by a given layer in the CNN comprises a smaller “window” of weights that is convolved (scanned or traversed stepwise) through or over the input data to the layer, where the input data comprises a grid or set of data values for each of one or more “channels.” In turn, the convolution results produced by each filter form a feature map whose values represent the extent to which the corresponding feature was detected in the input data. These results may be understood as constituting one “channel” of output data from the layer, and because each filter is applied to all channels of input data, except in “depth-wise” processing, the number of channels input to successive layers of a CNN may become quite large.

Consider an example case where the input data to the first layer of a CNN is an image comprising a D×E grid of pixel values for each of three color channels C, such that the input data to the first layer has dimensions D×E×C. Assuming that the recognition task associated with the CNN is the detection of certain numerals or letters within the image, the first layer may have a number of relatively simple filters, e.g., one filter designed to detect vertical edges, one designed to detect horizontal edges, one designed to detect slanting edges, etc.
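
As a purely illustrative aside, not drawn from the patent text, the following Python/NumPy fragment sketches how one such filter, here a hypothetical 3×3 vertical-edge kernel, is convolved stepwise over a single channel of input data to produce a feature map.

    import numpy as np

    # Hypothetical 3x3 vertical-edge kernel (Sobel-style, for illustration).
    kernel = np.array([[-1.0, 0.0, 1.0],
                       [-2.0, 0.0, 2.0],
                       [-1.0, 0.0, 1.0]])

    def convolve2d(channel, kernel):
        """Slide the kernel window over one channel ('valid' positions only)."""
        kh, kw = kernel.shape
        h, w = channel.shape
        fmap = np.zeros((h - kh + 1, w - kw + 1))
        for i in range(fmap.shape[0]):
            for j in range(fmap.shape[1]):
                # Each output value measures how strongly the filter's
                # feature is present at this window position.
                fmap[i, j] = np.sum(channel[i:i + kh, j:j + kw] * kernel)
        return fmap
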
Applying each of these filters to the D×E data set associated with each color channel produces a corresponding set of convolution results or output data, where the output data resulting from the application of a specific filter to a specific channel of input data may be regarded as a resulting feature map that constitutes a new channel of data for the next layer in the CNN.

CNNs generally include various types of layers, including so-called “pooling” layers that decrease the spatial dependencies associated with pattern detection. Further, so-called “bottleneck” layers or structures represent a key innovation, one of particular importance in very deep CNNs. Bottleneck layers in CNNs encourage the network to compact or compress relevant information while discarding redundant information. CNNs of significant depth, i.e., with many layers, may use more than one bottleneck layer positioned within the cascade of successive layers. The compression operations performed by a bottleneck layer depend on 1×1 convolutions and may be understood as combining features across feature maps. Example details and further information on the use of bottleneck layers in CNNs appear in R. Su, X. Liu and L. Wang, “Convolutional neural network bottleneck features for bi-directional generalized variable parameter HMMs,” 2016 IEEE International Conference on Information and Automation (ICIA), Ningbo, 2016, pp. 1126-1131. Further, see C. Szegedy et al., “Going deeper with convolutions,” 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1-9.

As CNN depth