CN-114282660-B - Method for managing convolution calculations and corresponding device
Abstract
The present disclosure relates to a method and corresponding apparatus for managing convolution calculations. In one embodiment, a method for managing convolution calculations performed by a calculation unit adapted to calculate output data on output channels from convolution kernels applied to input data blocks on at least one input channel, wherein the calculation on each input data block respectively produces output data on the output channels, and wherein the calculation with each convolution kernel respectively produces the output data on one output channel, comprises: identifying the size of the memory locations available in a temporary working memory of the calculation unit; preloading into the temporary working memory the maximum number of convolution kernels that can be stored within that memory size; and controlling the calculation unit to calculate the set of output data that can be calculated from the preloaded convolution kernels.
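The preloading strategy summarized in the abstract can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation; the function and parameter names (`plan_kernel_batches`, `work_mem_bytes`, etc.) are hypothetical.

```python
def plan_kernel_batches(num_kernels, kernel_bytes, work_mem_bytes):
    """Split the kernels into batches: each batch holds the maximum
    number of kernels that fits in the temporary working memory.
    (Illustrative sketch; names and units are assumptions.)"""
    per_batch = max(1, work_mem_bytes // kernel_bytes)  # max kernels per preload
    batches = []
    start = 0
    while start < num_kernels:
        batches.append(range(start, min(start + per_batch, num_kernels)))
        start += per_batch
    return batches

# Example: 10 kernels of 72 bytes each, 256 bytes of working memory
# -> 3 kernels per preload, so 4 loads instead of 10.
print([list(b) for b in plan_kernel_batches(10, 72, 256)])
```

Grouping the kernels this way means each kernel is loaded into the working memory only once per set of output data, rather than once per output computation.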
Inventors
- L. Friedt
- M. Falketo
- P. Demaya
Assignees
- STMicroelectronics S.r.l.
- STMicroelectronics (Rousset) SAS
Dates
- Publication Date: 2026-05-05
- Application Date: 2021-09-29
- Priority Date: 2020-10-01
Claims (20)
- 1. A method for managing convolution calculations performed by a calculation unit in an image recognition system, the calculation unit being adapted to calculate output data on output channels from convolution kernels applied to input data blocks on at least one input channel, wherein the calculation on each input data block respectively produces output data on an output channel, and wherein the calculation with each convolution kernel respectively produces the output data on each output channel, the method comprising: identifying a size of memory locations available in a temporary working memory of the calculation unit, wherein the temporary working memory is a memory internal to the calculation unit for temporarily storing convolution kernels; preloading in said temporary working memory a maximum number of convolution kernels that can be stored within said size of said memory so as to optimize the number of loads of the convolution kernels; and controlling the calculation unit to calculate a set of output data that can be calculated from the preloaded convolution kernels, wherein, by preloading the maximum number of convolution kernels and calculating the set of output data, the number of loads of the convolution kernels is reduced, thereby reducing the inference time and memory footprint of the convolution calculation.
- 2. The method of claim 1, wherein controlling the calculation unit to calculate the set of output data comprises sequentially loading in the calculation unit the input data blocks corresponding to the set of output data.
- 3. The method of claim 1, further comprising repeatedly preloading the maximum number of convolution kernels in the temporary working memory and repeatedly controlling the calculation unit to calculate the set of output data until the sets of output data of all output channels are calculated.
- 4. The method of claim 1, wherein the set of output data corresponds to a maximum size of the output data that can be received at once in each output channel, and wherein the set includes the complete output data of each output channel or only a portion of the output data of each output channel.
- 5. The method of claim 4, wherein the portion of the output data for each output channel corresponds to a row or set of rows of output data.
- 6. The method of claim 4, wherein, when the set includes only the portion of the output data of each output channel, the repeated preloading and controlling comprise: repeatedly preloading, for the same said portion, convolution kernels different from the previously preloaded ones and repeatedly controlling until all said output data of said portion are calculated for all output channels; and restarting the preloading for the other portions until all the output data of all the other portions of all the output channels are calculated.
- 7. The method of claim 1, further comprising, when the size of the memory locations available in the temporary working memory is less than a minimum size threshold, allocating memory locations of a buffer memory of the calculation unit sized to the minimum size threshold, wherein preloading the maximum number of convolution kernels comprises preloading a maximum number of convolution kernels that can be stored within the minimum size threshold.
- 8. The method of claim 1, wherein the convolution kernel comprises weight data, and wherein preloading comprises reorganizing the weight data of the convolution kernel to optimize the computation of the output data.
- 9. The method of claim 1, wherein the convolution kernel comprises weight data, and wherein controlling the calculation unit to calculate the set of output data comprises multiplying and accumulating the loaded input data blocks with the preloaded weight data of the convolution kernels.
- 10. The method of claim 1, further comprising storing the convolution kernels in a non-volatile memory internal or external to the calculation unit prior to preloading the convolution kernels in the temporary working memory, and storing the input data and the output data in a volatile memory internal or external to the calculation unit during the controlling of the calculation unit.
- 11. A non-transitory computer readable storage medium comprising instructions that, when executed by a computer, perform the method of claim 1.
- 12. An apparatus in an image recognition system, comprising: a calculation unit configured to calculate output data on output channels from convolution kernels applied to input data blocks on at least one input channel, such that the calculation on each input data block respectively produces output data on the output channels and the calculation with each convolution kernel respectively produces the output data on each output channel; and a processor configured to manage the convolution calculations performed by the calculation unit, wherein the processor is further configured to: identify a size of memory locations available in a temporary working memory of the calculation unit, wherein the temporary working memory is a memory internal to the calculation unit for temporarily storing convolution kernels; preload in said temporary working memory a maximum number of said convolution kernels capable of being stored within said size of said memory so as to optimize the number of loads of the convolution kernels; and control the calculation unit to calculate a set of output data that can be calculated from the preloaded convolution kernels, wherein, by preloading the maximum number of convolution kernels and calculating the set of output data, the number of loads of the convolution kernels is reduced, thereby reducing the inference time and memory footprint of the convolution calculation.
- 13. The apparatus of claim 12, wherein the processor is configured to control the calculation unit to calculate the set of output data by sequentially loading in the calculation unit the blocks of input data corresponding to the set of output data, the set of output data being calculated according to the preloaded convolution kernels.
- 14. The apparatus of claim 12, wherein the processor is configured to repeatedly preload the maximum number of convolution kernels in the temporary working memory and repeatedly control the calculation unit to calculate the set of output data until the sets of output data of all output channels are calculated.
- 15. The apparatus of claim 12, wherein the set of output data corresponds to a maximum size of the output data that can be received at once in each output channel, and wherein the set of output data includes the complete output data of each output channel or only a portion of the output data of each output channel.
- 16. The apparatus of claim 15, wherein the portion of the output data for each output channel corresponds to a row or a set of rows for each output channel.
- 17. The apparatus of claim 15, wherein the processor is configured to: when the set of output data includes only the portion, repeat the preloading, for the same portion, of convolution kernels different from the previously preloaded ones and repeat the controlling until all of the output data of the portion is calculated for all output channels; and restart the preloading for the other portions until all the output data of all the other portions of all the output channels are calculated.
- 18. The apparatus of claim 12, wherein the processor is configured to, when the size of the memory locations available in the temporary working memory is less than a minimum size threshold, allocate memory locations of a buffer memory of the calculation unit sized to the minimum size threshold, and preload in the allocated memory locations a maximum number of convolution kernels that can be stored within the minimum size threshold.
- 19. The apparatus of claim 12, wherein the convolution kernel comprises weight data, and wherein the processor is configured to reorganize the weight data of the convolution kernel during preloading so as to optimize the calculation of the output data by the calculation unit.
- 20. The apparatus of claim 12, wherein the convolution kernel comprises weight data, and wherein the calculation unit is configured to calculate the output data by multiplying and accumulating the input data of the loaded block with the weight data of the preloaded convolution kernel.
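The iteration described in claims 3 to 6 (and mirrored in claims 14 to 17) — stepping over output portions and, within each portion, over successive batches of preloaded kernels — can be sketched as follows. The names are illustrative, not taken from the patent.

```python
def schedule(portions, kernel_batches):
    """Return the sequence of (portion, batch) computation steps: for each
    output portion (a row or set of rows, claim 5), each kernel batch is
    preloaded once and all output channels it enables are computed before
    the next preload (claim 6). Illustrative sketch only."""
    steps = []
    for portion in portions:          # restart preloading per portion
        for batch in kernel_batches:  # a different batch than the previous one
            steps.append((portion, batch))
    return steps

# 2 output portions x 3 kernel batches -> 6 preload/compute steps in total
steps = schedule(["rows 0-3", "rows 4-7"], ["k0-k2", "k3-k5", "k6-k7"])
print(len(steps))  # 6
```

Each kernel batch is thus loaded once per portion, rather than once per individual output value.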
Description
Method for managing convolution calculations and corresponding device

Cross Reference to Related Applications

The present application claims the benefit of French patent application number 2010063, filed on October 1, 2020, which is incorporated herein by reference.

Technical Field

Implementations and embodiments relate to artificial neural networks that operate convolution calculations, and in particular to the management of convolution calculations.

Background

Mention may be made, for example, of convolutional neural networks, or CNNs, which are commonly applied to identify objects or people in images or videos; these are referred to as 2D convolutions. Convolutional neural networks typically comprise four layer types that process information successively: a convolutional layer that processes, for example, image blocks one after the other; an activation layer, typically a nonlinear function, that can improve the relevance of the results of the convolutional layer; a pooling layer that can reduce the dimensions of the layers; and a fully connected (or dense) layer that connects all neurons of one layer to all neurons of the previous layer. For each layer, input data arrives from the previous layer on the input channels, and output data is transmitted on the output channels. The input and output channels correspond to memory locations, such as random access memory. The set of output channels is called a "feature map". The convolutional layer generally corresponds to the inner product of the input data with the weight data of the convolution kernels. The weight data refers to the parameters of the convolution operation associated with a given convolution kernel. Briefly, the principle of convolution (particularly 2D convolution) is to scan the input channels with windows projected onto input data blocks and to calculate the inner product of each input data block with a convolution kernel.
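The sliding-window inner product described above can be illustrated with a minimal 2D convolution over a single input channel. This is a generic textbook sketch, not the patent's implementation, and all names are illustrative.

```python
def conv2d_single(inp, kernel):
    """Valid 2D convolution of one input channel with one kernel:
    scan the input with a window and take the inner product of each
    input block with the kernel. (Illustrative sketch only.)"""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(inp) - kh + 1
    out_w = len(inp[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # inner product of the input block at (i, j) with the kernel
            out[i][j] = sum(
                inp[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw)
            )
    return out

inp = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]  # 2x2 kernel with weights on the diagonal
print(conv2d_single(inp, kernel))  # [[6, 8], [12, 14]]
```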
The inner product of each input data block respectively corresponds to the output data on the output channel corresponding to each convolution kernel. It should be noted that a convolution kernel corresponds to an output channel and may comprise a number of components equal to the number of input channels, so that each component may be dedicated to one input channel. Thus, the "absolute" size of such a convolution kernel (i.e., the number of kernel weight data) is the size of one component (equal to the size of the window, e.g., 3*3 pixels) times the number of input channels. The weights are typically stored in non-volatile memory.

Convolutional neural networks ("CNNs") are widely used for artificial intelligence. Convolutional neural networks are very demanding in terms of computational performance, the non-volatile memory capacity needed to store the weights, the volatile memory capacity needed for the input/output data, and the number of computation cycles, which results in high inference times ("inference time" is common terminology in the art of artificial intelligence, meaning the time required to perform the tasks for which the neural network has been configured, or trained within the framework of self-learning techniques). There is a need to reduce the inference time of the computations performed in convolutional neural networks.

A disadvantage of existing solutions is that they generally provide no advantageous tradeoff between reduced inference time and reduced memory footprint. In practice, techniques such as "TVM" generate specific code for each layer to avoid the execution of loops, which may reduce the number of loops but has a significant impact on the size of the non-volatile memory. Furthermore, techniques known as "weight smoothing" or "feature smoothing" involve allocating more volatile memory to speed up computation. However, the size of the non-volatile memory and of the volatile memory is directly related to the final cost of the solution.
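The kernel-size arithmetic above can be made concrete with a worked example (the numbers are illustrative, not from the patent):

```python
# "Absolute" size of one convolution kernel = component size x input channels.
window_h, window_w = 3, 3      # one component covers a 3*3 pixel window
num_input_channels = 8         # one component per input channel (assumed)
weights_per_kernel = window_h * window_w * num_input_channels
print(weights_per_kernel)      # 72 weight values for one output channel's kernel
```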
In the context of inexpensive devices with limited memory (random access and non-volatile) and limited computational speed resources, the cost problem and the tradeoff between inference time and memory space are even more significant.

Disclosure of Invention

Embodiments greatly reduce the inference time for 2D convolutions. Further embodiments provide an inference time close to the theoretical limit of the number of cycles per convolution calculation operation, without increasing volatile and non-volatile memory. According to an embodiment, a method for managing convolution calculations is provided, wherein the method is performed by a calculation unit adapted to calculate output data on output channels from convolution kernels applied to input data blocks on at least one input channel, the calculation on each input data block respectively corresponding to output data on an output channel, and the calculation with each convolution kernel respectively corresponding to the output data on each output channel. According to one embodiment, the method comprises: identifying a size of memory locations available in a temporary working memory of the calculation unit