CN-116596034-B - Three-dimensional convolutional neural network accelerator and method on complex domain

Abstract

The invention discloses a three-dimensional convolutional neural network accelerator on the complex domain and a corresponding method. The accelerator comprises a buffer memory unit, an AXI DMA unit, a computing unit, a post-processing unit and a control unit. The buffer memory unit stores input features, output features and weight data in the complex domain; the AXI DMA unit carries out data transmission between the accelerator and an off-chip memory; the computing unit accelerates the computation of the convolutional layer and the fully-connected layer; the post-processing unit calculates the fused quantization layer, pooling layer, batch normalization layer and activation layer; and the control unit controls and schedules the working states of the buffer memory unit, the AXI DMA unit, the computing unit and the post-processing unit. The method can remarkably improve the performance and energy efficiency of 3D CNN deployment.

Inventors

  • GONG LEI
  • WANG CHAO
  • ZHOU XUEHAI
  • LI XI
  • CHEN XIANGLAN
  • ZHU ZONGWEI

Assignees

  • Suzhou Institute for Advanced Research, University of Science and Technology of China (中国科学技术大学苏州高等研究院)

Dates

Publication Date
2026-05-05
Application Date
2023-04-23

Claims (6)

  1. A three-dimensional convolutional neural network accelerator on the complex domain, the three-dimensional convolutional neural network comprising a convolutional layer, a fully-connected layer, a pooling layer, an activation layer and a batch normalization layer, characterized in that, before deployment of the three-dimensional convolutional neural network, the following steps are performed: quantizing the three-dimensional convolutional neural network; acquiring the complex sequence obtained by FFT transformation of the weight values and activation values; when N is even, storing only the N/2+1 complex numbers X_0, X_1, ..., X_{N/2} of the complex sequence, and packing the real numbers X_0 and X_{N/2} into the single complex number X_0 + j·X_{N/2}; optimizing the complex multiplication: obtaining a first complex number z_1 and a second complex number z_2 to be multiplied, wherein z_1 = a + bj, z_2 = c + dj, a, b, c, d are real numbers, and j represents the square root of -1; converting the multiplication of z_1 and z_2 into the calculation z_1·z_2 = (k_1 - k_3) + j(k_1 + k_2), wherein k_1 = c(a + b), k_2 = a(d - c), k_3 = b(c + d), so that the product requires only three real multiplications together with the shared sum a + b; obtaining a third complex number x, a fourth complex number w_1 and a fifth complex number w_2 to be multiplied, wherein the third complex number x is to undergo a product operation with the fourth complex number w_1 and the fifth complex number w_2 respectively, x = a + bj, w_1 = x_1 + j·y_1, w_2 = x_2 + j·y_2, wherein a, b, x_1, y_1, x_2, y_2 are real numbers, and j represents the square root of -1; converting the product of x and w_1 into x·w_1 = (m_1 - m_3) + j(m_1 + m_2), wherein m_1 = x_1(a + b), m_2 = a(y_1 - x_1), m_3 = b(x_1 + y_1); converting the product of x and w_2 into x·w_2 = (n_1 - n_3) + j(n_1 + n_2), wherein n_1 = x_2(a + b), n_2 = a(y_2 - x_2), n_3 = b(x_2 + y_2), the common sum a + b being computed once and reused for both products; the accelerator comprising: a buffer memory unit for storing input features, output features and weight data in the complex domain; an AXI DMA unit for carrying out data transmission between the accelerator and the off-chip memory; a computing unit for accelerating the computation of the convolutional layer and the fully-connected layer, the computing unit comprising an arithmetic unit matrix, an address generator, a PE controller and a data processing unit, wherein the arithmetic unit matrix comprises a plurality of arithmetic units PE arranged in a two-dimensional matrix of size (T_m/B) × B, each arithmetic unit PE comprises T_n/B parallel complex multipliers and a complex addition tree for summing the outputs of the T_n/B parallel complex multipliers, T_m is the block size of the output channel, T_n is the block size of the input channel, B is the block size of the two-dimensional matrix, and the address generator is used for generating address data for the input features, output features and weight data; a post-processing unit for calculating the fused quantization layer, pooling layer, batch normalization layer and activation layer; and a control unit for controlling and scheduling the working states of the buffer memory unit, the AXI DMA unit, the computing unit and the post-processing unit.
  2. The three-dimensional convolutional neural network accelerator on the complex domain of claim 1, wherein the buffer memory unit comprises: an input feature caching unit for storing input features in the complex domain; an output feature caching unit for storing output features in the complex domain; and a weight caching unit for caching weights in the complex domain.
  3. The three-dimensional convolutional neural network accelerator on the complex domain of claim 1, wherein the AXI DMA unit comprises: a data packing unit for packing the output data of the buffer memory unit to increase the output data bandwidth; a data unpacking unit for unpacking the data from the off-chip memory to obtain the data required by the accelerator; and an AXI DMA controller for controlling the working states of the data packing unit and the data unpacking unit.
  4. A three-dimensional convolutional neural network acceleration method on the complex domain, characterized by comprising the following steps: quantizing the three-dimensional convolutional neural network; deploying the three-dimensional convolutional neural network; and accelerating the three-dimensional convolutional neural network using the three-dimensional convolutional neural network accelerator on the complex domain as claimed in any one of claims 1 to 3.
  5. The method for accelerating a three-dimensional convolutional neural network on the complex domain according to claim 4, wherein quantizing the three-dimensional convolutional neural network comprises: calculating the scaling factors s_w^l of the real and imaginary parts of the weight values and the scaling factors s_a^l of the real and imaginary parts of the activation values; calculating, from s_w^l and s_a^l, a pseudo quantization operator comprising the quantization operator CQuant and the inverse quantization operator CDequant; and inserting the pseudo quantization operator into the computation graph of the three-dimensional convolutional neural network; wherein s_w^l = max(|w_l|)/127, where w_l, l = 1, 2 represents the real and imaginary parts of the weight respectively, so that s_w^1 and s_w^2 are the scaling factors of the real part and the imaginary part of the weight; s_a^l <- β·s_a^l + (1 - β)·max(|a_l|)/127, where a_l, l = 1, 2 represents the real and imaginary parts of the activation value respectively, and β ∈ [0, 1]; CQuant(z) = Quant(z_r) + j·Quant(z_i) and CDequant(z) = Dequant(z_r) + j·Dequant(z_i), wherein z is the complex number to be quantized, z_r and z_i represent the real and imaginary parts of z respectively, and j is the square root of -1; Quant is a symmetric quantization operator on the real domain, Quant(x) = clamp([x/s], -127, 127), where clamp(x, a, b) constrains the value of x between a and b, returning a if x is less than a, b if x is greater than b, and x otherwise; Dequant is the inverse quantization operator on the real domain, Dequant(x) = x × s.
  6. The method for accelerating a three-dimensional convolutional neural network on the complex domain according to claim 4, further comprising: shifting the first multiplier r_1 left by 18 bits and sign-extending the second multiplier r_2 to 27 bits, then summing the two to obtain r_1 << 18 + r_2, wherein r_1 and r_2 are real numbers; calculating the product of r_1 << 18 + r_2 and the third multiplier r_3 to obtain o = r_3 × (r_1 << 18 + r_2); and separating the results of r_3 × r_1 and r_3 × r_2 from o.
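The storage scheme in claim 1 exploits the conjugate symmetry of the FFT of a real sequence: for even N only X_0, ..., X_{N/2} need be kept, and since X_0 and X_{N/2} are purely real they can be packed into a single complex value. A minimal NumPy sketch of this idea (function names are illustrative, not from the patent):

```python
import numpy as np

def pack_rfft(x):
    """FFT a real sequence of even length N and store only N/2 complex values.

    For real input, X[0] and X[N/2] are purely real, so they are packed
    into the single complex number X[0] + j*X[N/2], as in claim 1.
    """
    x = np.asarray(x, dtype=float)
    N = len(x)
    assert N % 2 == 0, "packing scheme assumes even N"
    X = np.fft.rfft(x)                           # N/2 + 1 complex values
    packed = X[:N // 2].copy()
    packed[0] = X[0].real + 1j * X[N // 2].real  # pack the two real bins
    return packed                                # only N/2 complex values

def unpack_rfft(packed):
    """Recover the N/2 + 1 rfft values from the packed representation."""
    return np.concatenate(([packed[0].real], packed[1:], [packed[0].imag]))
```

Round-tripping through `unpack_rfft` and `np.fft.irfft` recovers the original real sequence, confirming that no information is lost by the packing.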
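The pseudo-quantization of claim 5 applies a standard symmetric int8 quantizer independently to the real and imaginary parts of each complex value. A sketch under the assumption that each part carries its own scale; the moving-average scale update with momentum β matches the β ∈ [0, 1] in the claim but is a reconstruction:

```python
def quant(x, s):
    """Symmetric real quantizer: Quant(x) = clamp(round(x / s), -127, 127)."""
    return max(-127, min(127, round(x / s)))

def dequant(q, s):
    """Inverse quantizer on the real domain: Dequant(x) = x * s."""
    return q * s

def cquant(z, s_re, s_im):
    """CQuant: quantize real and imaginary parts with their own scales."""
    return complex(quant(z.real, s_re), quant(z.imag, s_im))

def cdequant(q, s_re, s_im):
    """CDequant: map a quantized complex value back to the real scale."""
    return complex(dequant(q.real, s_re), dequant(q.imag, s_im))

def ema_scale(s_old, abs_max, beta):
    """Moving-average update of an activation scale (reconstruction):
    s <- beta * s + (1 - beta) * max(|a|) / 127."""
    return beta * s_old + (1 - beta) * abs_max / 127
```

Inserting a `cquant`/`cdequant` pair after a layer simulates the int8 rounding error during training while keeping the rest of the graph in floating point.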
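Claim 6 packs two multiplications into one wide multiply, a common trick for FPGA DSP blocks with a 27×18 multiplier: o = r_3·(r_1·2^18 + r_2) = (r_3·r_1)·2^18 + r_3·r_2, and the two products occupy disjoint bit fields as long as the low product fits in 18 signed bits. A bit-exact Python sketch of the pack-and-separate step:

```python
def packed_multiply(r1, r2, r3):
    """Compute r3*r1 and r3*r2 with a single wide multiplication.

    Packs r1 into the high bits (left shift by 18) and r2 into the low
    18 bits, multiplies once, then separates the two products. Correct
    as long as the low product fits in 18 signed bits: |r3*r2| < 2**17.
    """
    o = r3 * ((r1 << 18) + r2)     # one wide multiply
    lo = o & ((1 << 18) - 1)       # low 18 bits hold r3*r2 (mod 2**18)
    if lo >= 1 << 17:              # sign-extend the low product
        lo -= 1 << 18
    hi = (o - lo) >> 18            # remaining high bits hold r3*r1
    return hi, lo
```

The sign extension on the low field is what makes the separation work for negative products: subtracting the (signed) low product from o leaves an exact multiple of 2^18.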

Description

Three-dimensional convolutional neural network accelerator and method on complex domain

Technical Field

The invention belongs to the technical field of convolutional neural networks, and particularly relates to a three-dimensional convolutional neural network accelerator on the complex domain and a corresponding acceleration method.

Background

In recent years, deep convolutional neural networks have achieved great success in image processing. However, when processing higher-dimensional data such as video, a conventional two-dimensional convolutional neural network cannot effectively capture the temporal information and therefore cannot achieve satisfactory results. The three-dimensional convolutional neural network solves this problem: through three-dimensional convolution it captures the spatio-temporal information in video, and it plays a major role in video classification and medical image analysis. However, compared with a two-dimensional convolutional neural network, a three-dimensional convolutional neural network has enormous storage and computation overhead, which poses serious challenges for deployment in edge scenarios such as embedded devices. To address this problem, the industry has begun to accelerate the 3D CNN algorithm with specialized hardware. In the cloud, the GPU has become the mainstream hardware acceleration platform owing to its high computational parallelism and high memory bandwidth. At the edge, due to constraints on resources and power consumption, hardware acceleration based on ASICs and FPGAs is generally adopted; the deployment efficiency of 3D CNNs is improved by providing higher parallelism at the computation level and increasing data reuse as much as possible at the memory-access level.
An ASIC is an integrated circuit chip designed and developed for a particular application; relative to other platforms it offers the highest performance, lowest power consumption, and smallest area. The FPGA is reconfigurable, has a shorter development cycle and lower development difficulty than the ASIC, and also offers high performance and low power consumption, so FPGA-based accelerators can achieve high performance and energy efficiency while adapting to the rapid iteration of deep learning algorithms. Currently, mainstream hardware accelerators mainly adopt structures such as vector inner-product units, systolic arrays, and line buffers. The vector inner-product unit unrolls mainly over the input and output channels of the convolution to exploit parallelism in those dimensions; a typical example is DianNao from the Institute of Computing Technology, Chinese Academy of Sciences. The systolic array supports three dataflows, namely input-stationary, output-stationary, and weight-stationary, and reuses data through transfers between neighboring PEs, thereby improving the performance and energy efficiency of the accelerator; a typical example is Google's Tensor Processing Unit. The line-buffer architecture achieves parallel computation in the convolution-kernel dimension by caching the input features of a K×K window; under this architecture, streaming processing of the feature map is easily realized, so a higher throughput can be achieved. Research shows that computing neural networks in the complex domain has many advantages, yet research on three-dimensional convolutional neural network accelerators in the complex domain is lacking.
Disclosure of Invention

In order to address the lack of research on three-dimensional convolutional neural network accelerators in the complex domain, the invention provides a three-dimensional convolutional neural network accelerator on the complex domain and a corresponding method, which can remarkably improve the performance and energy efficiency of 3D CNN deployment. The aim of the invention is achieved by the following technical scheme. A first aspect of the invention provides a three-dimensional convolutional neural network accelerator, the three-dimensional convolutional neural network comprising a convolutional layer, a fully-connected layer, a pooling layer, an activation layer and a batch normalization layer, the accelerator comprising: a buffer memory unit for storing input features, output features and weight data in the complex domain; an AXI DMA unit for carrying out data transmission between the accelerator and the off-chip memory; a computing unit for accelerating the computation of the convolutional layer and the fully-connected layer; a post-processing unit for calculating the fused quantization layer, the pooling layer, the batch normalization lay