CN-116468088-B - Neural network accelerator system for image super-resolution and implementation method thereof

CN 116468088 B

Abstract

The invention discloses a neural network accelerator system for super-resolution of images and an implementation method thereof, and the neural network accelerator system comprises a transmission engine, a logic control module, a data cache module, a calculation complex, an off-chip data bus, an off-chip control bus, an on-chip data bus and an on-chip control bus, wherein the transmission engine and the calculation complex are connected with the data cache module through the on-chip data bus, the transmission engine, the data cache module and the calculation complex are connected with the logic control module through the on-chip control bus, the transmission engine is also connected with the off-chip data bus, the logic control module is also connected with the off-chip control bus, and the calculation complex comprises a data routing unit, a data compression processor, a matrix unit, a convolution unit and an activation unit. The embodiment of the invention realizes the parallel acceleration calculation of convolution calculation, improves the calculation efficiency of the hardware accelerator, reduces the consumption of hardware resources, and can be widely applied to the technical field of neural networks.

Inventors

  • Chen Dihu
  • Su Tao
  • Mo Yannan

Assignees

  • Sun Yat-sen University (中山大学)

Dates

Publication Date
2026-05-05
Application Date
2023-02-10

Claims (8)

  1. A neural network accelerator system for image super-resolution, characterized by comprising a transmission engine, a logic control module, a data cache module, a calculation complex, an off-chip data bus, an off-chip control bus, an on-chip data bus and an on-chip control bus, wherein the transmission engine and the calculation complex are connected to the data cache module through the on-chip data bus; the transmission engine, the data cache module and the calculation complex are connected to the logic control module through the on-chip control bus; the transmission engine is further connected to the off-chip data bus and is used for transferring data between the on-chip cache and the off-chip cache; the logic control module is further connected to the off-chip control bus and is used for performing global control according to a received control instruction; the data cache module is used for caching feature map data and weight data; the calculation complex is used for executing convolution calculation tasks and comprises a data routing unit, a data compression processor, a matrix unit, a convolution unit and an activation unit, wherein: the data compression processor is used for compressing/decompressing the feature map data; the data routing unit is used for rearranging the feature map data stored in the data cache module and then transmitting it to the data compression processor for decompression; the matrix unit is used for generating a plurality of parallel convolution outputs from the decompressed feature map data; the convolution unit is used for performing multiply-accumulate operations on the parallel convolution outputs and the weight data to obtain a convolution calculation result; and the activation unit is used for fitting the convolution calculation result through a preset activation function to obtain the feature map data after convolution calculation, which is compressed by the data compression processor and then transmitted to the data cache module for storage; the matrix unit comprises a shift register group, a cross-group cache circuit and a timing control circuit, wherein the cross-group cache circuit comprises a plurality of FIFO caches, and the timing control circuit is used for controlling the shift register group and the cross-group cache circuit so that the shift register group either performs a shift operation or is loaded from the corresponding cross-group cache circuit.
  2. The neural network accelerator system for image super-resolution of claim 1, wherein the data compression processor comprises a compression unit and a decompression unit; the compression unit comprises a comparator tree, a first buffer unit, a parallel subtraction unit and a parallel division unit, the comparator tree being used for finding the maximum and minimum values within one compression block and transmitting them to the first buffer unit, and the parallel subtraction unit and the parallel division unit being used for performing compression according to the data buffered in the first buffer unit; the decompression unit comprises a second buffer unit, a subtraction unit, a parallel multiplication unit and a parallel addition unit, the second buffer unit being used for storing the maximum and minimum values of one compression block, and the subtraction unit, the parallel multiplication unit and the parallel addition unit being used for performing decompression according to the data buffered in the second buffer unit.
  3. The neural network accelerator system of claim 2, wherein the data compression processor performs compression/decompression using the DXT-5 algorithm.
  4. The neural network accelerator system for image super-resolution of claim 1, wherein the convolution unit comprises a plurality of multiply-accumulate units, each comprising a control circuit, a dynamic shifter, a register and a multiplier, the dynamic shifter being used for shift-aligning the input weight data and feature map data, and the register and the multiplier being used for multiply-accumulate calculation on the shift-aligned weight data and feature map data.
  5. The neural network accelerator system of claim 1, wherein the transmission engine comprises a read DMA engine and a write DMA engine.
  6. The neural network accelerator system of claim 1, wherein the logic control module comprises an instruction cache unit, an instruction decode unit, an input FSM, a convolution FSM and an output FSM.
  7. The neural network accelerator system for image super-resolution as claimed in any one of claims 1 to 6, wherein the data cache module comprises an input cache unit, an output cache unit and a weight cache unit.
  8. A method for implementing a neural network accelerator system for image super-resolution, implemented by the neural network accelerator system according to any one of claims 1 to 7, and comprising the steps of: transmitting the feature map data and the weight data to the data cache module through the transmission engine; acquiring and decoding the control instruction by the logic control module, and then performing global control according to the decoding result; rearranging the feature map data stored in the data cache module through the data routing unit and then transmitting it to the data compression processor; decompressing the rearranged feature map data by the data compression processor and transmitting the decompressed data to the matrix unit; generating a plurality of parallel convolution outputs from the decompressed feature map data through the matrix unit and transmitting them to the convolution unit; performing multiply-accumulate operations on the parallel convolution outputs and the weight data stored in the data cache module by the convolution unit to obtain a convolution calculation result and transmitting it to the activation unit; fitting the convolution calculation result through a preset activation function by the activation unit to obtain the feature map data after convolution calculation and transmitting it to the data compression processor; and compressing the feature map data after convolution calculation through the data compression processor and transmitting the compressed feature map data to the data cache module.
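Claims 2 and 3 describe a fixed-length block compression scheme: a comparator tree extracts the per-block minimum and maximum, and parallel subtraction/division units quantize the remaining values against that range, in the spirit of DXT-5 texture compression. Unlike Huffman-style variable-length coding, the fixed-length output keeps data aligned at the hardware end. The following NumPy sketch illustrates such min-max block quantization; the block shape, the 4-bit code width, and the function names are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def compress_block(block, bits=4):
    """Min-max quantize one compression block.

    Mirrors the comparator tree (min/max search) followed by the
    parallel subtraction and division units of the compression unit.
    """
    lo, hi = float(block.min()), float(block.max())  # comparator tree
    scale = (hi - lo) or 1.0                         # guard against a flat block
    levels = (1 << bits) - 1
    codes = np.round((block - lo) / scale * levels).astype(np.uint8)
    return lo, hi, codes                             # min/max kept per block

def decompress_block(lo, hi, codes, bits=4):
    """Inverse path: parallel multiplication and addition units."""
    levels = (1 << bits) - 1
    return codes.astype(np.float32) / levels * (hi - lo) + lo
```

Because only two extrema plus short codes are stored per block, the compression ratio is fixed and known in advance, which is what preserves the regular access pattern the description emphasizes.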
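The method of claim 8 pipelines routing, decompression, parallel window generation, multiply-accumulate, and activation. As a behavioural reference only, that per-layer dataflow can be modelled in a few lines of NumPy; the 3x3 kernel, single-channel feature map, and ReLU activation are illustrative assumptions (the patent specifies only a "preset activation function"), and `run_layer` is a hypothetical name.

```python
import numpy as np

def conv3x3(fmap, weight):
    """Convolution unit: multiply-accumulate over each 3x3 window."""
    h, w = fmap.shape
    out = np.zeros((h - 2, w - 2), dtype=np.float32)
    for i in range(h - 2):
        for j in range(w - 2):
            # the matrix unit supplies the window; MAC units reduce it
            out[i, j] = np.sum(fmap[i:i+3, j:j+3] * weight)
    return out

def relu(x):
    """Activation unit: apply the preset activation function."""
    return np.maximum(x, 0.0)

def run_layer(fmap, weight):
    """One layer pass: (decompressed) feature map -> conv -> activation."""
    return relu(conv3x3(fmap, weight))
```

In the accelerator itself these stages run concurrently on decompressed blocks streamed from the data cache module; the sketch only fixes the arithmetic each stage performs.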

Description

Neural network accelerator system for image super-resolution and implementation method thereof

Technical Field

The invention relates to the technical field of neural networks, and in particular to a neural network accelerator system for image super-resolution and an implementation method thereof.

Background

Super-resolution, a typical and challenging task in computer vision and image signal processing, aims to recover a high-resolution version of an image from its corresponding low-resolution version. Super-resolution is also widely used in the real world, for example in image segmentation, remote sensing and surveillance. With the popularity of high-definition and ultra-high-definition display devices, super-resolution has attracted increasing attention. Meanwhile, in most scenarios the running speed of an image super-resolution algorithm is important. With the great success of deep learning in various fields, super-resolution algorithms based on deep learning, such as convolutional neural networks, generative adversarial networks and recurrent neural networks, have also been widely studied. In particular, when performing super-resolution on video, such methods feed a large number of low-resolution and high-resolution video sequences into the neural network and perform inter-frame alignment, feature extraction, feature fusion and super-resolution reconstruction. Owing to the strong nonlinear learning capability of neural networks, deep-learning-based super-resolution methods can generally achieve good performance on many public benchmark image datasets, but their large computation load, large feature map caches and large bandwidth requirements make real-time computation difficult on many embedded devices.
Most existing hardware accelerators fail to fully and effectively exploit the parallelism in convolution operations, resulting in reduced computing efficiency and increased hardware resource consumption. Meanwhile, to improve the effective bandwidth of the accelerator, variable-length network compression schemes such as Huffman coding are often adopted, so that data at the hardware end cannot be aligned and the original regularity of the computation pattern is lost.

Disclosure of Invention

To solve the above technical problems, the invention aims to provide an efficient neural network accelerator system for image super-resolution and an implementation method thereof.

The first technical scheme adopted by the invention is as follows: a neural network accelerator system for image super-resolution comprises a transmission engine, a logic control module, a data cache module, a calculation complex, an off-chip data bus, an off-chip control bus, an on-chip data bus and an on-chip control bus. The transmission engine and the calculation complex are connected to the data cache module through the on-chip data bus; the transmission engine, the data cache module and the calculation complex are connected to the logic control module through the on-chip control bus; the transmission engine is further connected to the off-chip data bus and is used for transferring data between the on-chip cache and the off-chip cache; the logic control module is further connected to the off-chip control bus and is used for performing global control according to a received control instruction; the data cache module is used for caching feature map data and weight data; and the calculation complex is used for executing convolution calculation tasks and comprises a data routing unit, a data compression processor, a matrix unit, a convolution unit and an activation unit, the data compression processor being used for compressing/decompressing the feature map data.

Further, the data routing unit is configured to rearrange the feature map data stored in the data cache module and then transmit it to the data compression processor for decompression; the matrix unit is configured to generate a plurality of parallel convolution outputs from the decompressed feature map data; the convolution unit is configured to perform multiply-accumulate operations on the parallel convolution outputs and the weight data to obtain a convolution calculation result; and the activation unit is configured to fit the convolution calculation result through a preset activation function to obtain the feature map data after convolution calculation, which is compressed by the data compression processor and then transmitted to the data cache module for storage.

Further, the data compression processor comprises a compression unit and a decompression unit, wherein the compression unit comprises a comparator tree, a first buffer unit, a parallel subtraction unit and a parallel division unit, the comparator tree