CN-116663630-B - Universal standard convolution operator accelerator based on ARM and FPGA
Abstract
The invention relates to a general-purpose standard convolution operator accelerator based on ARM and FPGA that adopts a software and hardware cooperative processing mode. A parallel computing and multi-layer fusion mode of the convolution operator accelerator is designed on the FPGA side, and the ARM calls the FPGA-side convolution operator accelerator to build a convolutional neural network, transmitting configuration parameters to it to set the structure of each network layer. Network model parameters are stored on an SD card; during program operation the ARM reads them from the SD card and writes them into a DDR memory, the ARM and FPGA read the model parameters from the DDR memory for computation, and the ARM transmits the network inference result to a PC through UART.
Inventors
- ZHANG WEI
- YE JINLIN
Assignees
- Hebei University of Technology (河北工业大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-06-29
Claims (2)
- 1. A general standard convolution operator accelerator based on ARM and FPGA, adopting a software and hardware cooperative processing mode, characterized in that: a parallel computing and multi-layer fusion mode of the convolution operator accelerator is designed on the FPGA side; the ARM calls the FPGA-side convolution operator accelerator to build a convolutional neural network and transmits configuration parameters to it to set the structure of each network layer; network model parameters are stored on an SD card; during program operation the ARM reads the network model parameters from the SD card and writes them into a DDR memory; the ARM and the FPGA read the model parameters in the DDR memory for computation; and the ARM transmits the network inference result to a PC through UART. The processor system comprises an ARM processor core, a DDR controller, an SDIO controller and a UART controller, wherein the DDR controller controls memory access of data, the SDIO controller controls reading and writing of data on the SD card, and the UART controller transmits data to the PC. The convolution operator accelerator is realized on the FPGA and comprises a convolution module, an activation module, a pooling module, a ping-pong buffer module, an AXI bus module and an on-chip buffer module, wherein the convolution module extracts features from the input feature map, the activation module improves the nonlinear modeling capability of the network, the pooling module reduces the feature dimension, the AXI bus module transfers data between the DDR, the ARM and the FPGA, and the on-chip buffer module comprises an input feature map buffer, an output feature map buffer, a weight parameter buffer and a bias parameter buffer. The parallel computing processes the input-layer and output-layer dimensions in parallel: the matrices of the input feature map, output feature map, weight and bias parameters are decomposed according to the parallelism, the operation on one piece of decomposed data is treated as a processing element (PE), and multiple PEs are designed to execute the data operations divided by the parallelism concurrently; the data operations comprise convolution, activation and pooling, realized by the convolution module, the activation module and the pooling module respectively. The multi-layer fusion mode fuses the convolution layer, the activation layer and the pooling layer: when the convolution layer is adjacent to a pooling layer, the convolution module and the activation module pass their results directly to the pooling module after computation, and the pooling module then writes its result to the DDR memory; when the convolution layer is not adjacent to a pooling layer, the convolution module and the activation module write their results directly to the DDR memory after computation. The ARM calls the IP core to build the neural network: the ARM drives the SDIO controller to read the input feature map, weight and bias parameters from the SD card, drives the DDR controller to store them in the DDR memory, and drives the IP core to read the input feature data from the DDR memory through the AXI bus and to write the output feature data computed by the IP core back to the DDR memory. An improved i-LeNet network structure is adopted: two fully connected layers are removed to reduce the network size, two convolution layers are added to improve the feature extraction capability, the number of channels is set to 32, the convolution kernel size is set to match the parallelism of the convolution operator accelerator, the max-pooling kernel size is set accordingly, the convolution stride is set to 1, the max-pooling stride is set to 2, the padding is set to 2, and ReLU is used as the activation function.
- 2. The accelerator according to claim 1, wherein the ping-pong buffer module is configured to increase the data transmission bandwidth and access speed by dividing the on-chip buffer area into two groups, buffer 0 and buffer 1: while the accelerator performs a convolution operation on the data in buffer 0, buffer 1 loads the data required by the next convolution operation from the DDR memory and stores the calculation result; when the next clock cycle arrives, the accelerator directly processes the data in buffer 1 while buffer 0 loads the input data of the next convolution operation from the DDR memory, and so on alternately.
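The channel-parallel decomposition described in claim 1 (splitting the input- and output-channel dimensions by a parallelism factor and treating the work on each decomposed block as one PE) can be illustrated with a minimal software model. The parallelism factors `Pin`/`Pout`, the 1×1-kernel simplification, and the plain Python loops below are assumptions made for illustration; they are not the patented hardware design.

```python
# Software sketch of channel-parallel convolution built from PEs.
# Input channels are split into groups of Pin, output channels into
# groups of Pout; each (input-group, output-group) pair is one PE's
# workload. In hardware all PEs would run concurrently; here they are
# iterated sequentially for clarity.

Pin, Pout = 2, 2          # illustrative parallelism factors
C_in, C_out = 4, 4        # channel counts, multiples of Pin / Pout

def pe_partial(x, w, in_grp, out_grp):
    """One PE: partial sums for Pout output channels over Pin inputs."""
    return [
        sum(w[out_grp * Pout + oc][in_grp * Pin + ic] * x[in_grp * Pin + ic]
            for ic in range(Pin))
        for oc in range(Pout)
    ]

def conv_channels(x, w, bias):
    """1x1 convolution (pure channel mixing) assembled from PE partials."""
    y = list(bias)
    for out_grp in range(C_out // Pout):
        for in_grp in range(C_in // Pin):   # these PEs run in parallel on FPGA
            part = pe_partial(x, w, in_grp, out_grp)
            for oc in range(Pout):
                y[out_grp * Pout + oc] += part[oc]
    return y

x = [1.0, 2.0, 3.0, 4.0]
w = [[1.0 if i == j else 0.0 for j in range(C_in)] for i in range(C_out)]
bias = [0.5] * C_out
print(conv_channels(x, w, bias))   # identity weights: input plus bias
```

Accumulating PE partial sums into the output buffer mirrors how the decomposed weight and feature-map blocks recombine into the full result, independent of how many PEs run at once.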
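The multi-layer fusion rule of claim 1 (forward the conv+activation result on-chip to the pooling module when a pooling layer is adjacent, otherwise write it straight to DDR) amounts to the following dataflow sketch. The 1-D convolution, ReLU, stride-2 max-pool and the `ddr` list standing in for external memory are all illustrative assumptions.

```python
# Dataflow sketch of the multi-layer fusion mode: the intermediate
# conv+ReLU result stays on-chip and feeds the pooling module when a
# pooling layer follows; only the final result is written to "DDR"
# (modeled here as a plain list).

def relu(v):
    return [max(0.0, a) for a in v]

def max_pool_1d(v, size=2, stride=2):
    return [max(v[i:i + size]) for i in range(0, len(v) - size + 1, stride)]

def conv_1d(x, k):
    n = len(k)
    return [sum(x[i + j] * k[j] for j in range(n)) for i in range(len(x) - n + 1)]

def fused_layer(x, k, pool_adjacent, ddr):
    """Conv -> ReLU, then either on-chip pooling or a direct DDR write."""
    act = relu(conv_1d(x, k))
    out = max_pool_1d(act) if pool_adjacent else act
    ddr.append(out)           # exactly one DDR write per fused layer
    return out

ddr = []
y = fused_layer([1.0, -2.0, 3.0, -4.0, 5.0], [1.0, 1.0], pool_adjacent=True, ddr=ddr)
print(y, len(ddr))
```

The point of the fusion is visible in the single `ddr.append`: without fusion, the conv+ReLU intermediate would make an extra round trip through external memory before pooling.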
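The ping-pong scheme of claim 2 can be sketched as a two-buffer schedule in which computation on one buffer overlaps with the DDR transfer refilling the other. The cycle loop, the tile contents, and the toy doubling "compute" below are assumptions for illustration; in hardware the load and the compute in each cycle proceed concurrently.

```python
# Toy schedule for the ping-pong buffer: in every cycle the accelerator
# computes on the active buffer while the idle buffer is refilled with
# the next tile "from DDR", then the roles swap.

def ping_pong(tiles):
    buffers = [None, None]
    results = []
    buffers[0] = tiles[0]                 # prefetch the first tile
    for cycle in range(len(tiles)):
        active, idle = cycle % 2, (cycle + 1) % 2
        if cycle + 1 < len(tiles):
            buffers[idle] = tiles[cycle + 1]          # load next tile
        results.append([2 * v for v in buffers[active]])  # compute current tile
        # in hardware the two lines above overlap in the same cycle
    return results

print(ping_pong([[1, 2], [3, 4], [5, 6]]))
```

Because transfer and compute alternate between the two buffers, the accelerator never stalls waiting for data after the initial prefetch, which is the bandwidth benefit the claim describes.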
Description
Universal standard convolution operator accelerator based on ARM and FPGA

Technical Field

The invention belongs to the technical field of embedded artificial intelligence, and particularly relates to a universal standard convolution operator accelerator based on ARM and FPGA, used for edge inference acceleration of convolutional neural networks.

Background

As an AI algorithm, the convolutional neural network is widely applied in the image field. With the development of 5G, the Internet of Things and embedded technology, convolutional neural networks are increasingly deployed in mobile application scenarios, while the number of layers and parameters of deep neural network models keeps growing, placing ever higher demands on the computing capacity, memory bandwidth and data storage of the hardware. Traditional computers adopt serial computation, which cannot adapt to complex parallel network structures, and general-purpose processors cannot meet the low-power, high-performance requirements of mobile scenarios, so high-accuracy AI algorithms cannot be deployed at resource-limited edge devices. The underlying reason is that existing edge chips cannot support the huge computational load generated by AI workloads. Conventional edge AI chips mostly use GPU and ASIC architectures. A GPU has a large number of stream processors and can reduce computation time through parallel operation, but it gains speed by increasing the core count and clock frequency, which leads to excessive power consumption; it is also unsuitable for mobile application scenarios because a stable power supply cannot be guaranteed in industrial environments and complex combat environments.
An ASIC is a specialized chip tailored to specific needs; although its performance is superior, an ASIC is costly to design and manufacture, has a long development cycle, and is not reconfigurable. An FPGA is a semi-custom circuit with data-parallel and task-parallel computing capability: it processes data in a hardware pipeline, allows a custom-designed AI accelerator, and maps the AI algorithm into the FPGA to realize hardware acceleration and fully exploit the parallelism of the network model. It has the advantages of low cost, low power consumption and portability, and is significant for deploying large artificial intelligence models, improving the edge computing performance of AI algorithms, and accelerating the domestically autonomous and controllable development of key equipment.

Disclosure of Invention

The invention aims to provide a universal standard convolution operator accelerator based on ARM and FPGA on the basis of the prior art.
The technical proposal is as follows: a general standard convolution operator accelerator based on ARM and FPGA adopts a software and hardware cooperative processing mode; a parallel computing and multi-layer fusion mode of the convolution operator accelerator is designed on the FPGA side; the ARM calls the FPGA-side convolution operator accelerator to build a convolutional neural network and transmits configuration parameters to it to set the structure of each network layer; network model parameters are stored on an SD card; during program operation the ARM reads the network model parameters from the SD card and writes them into a DDR memory; the ARM and the FPGA read the model parameters in the DDR memory for computation; and the ARM transmits the network inference result to a PC through UART. The processor system comprises an ARM processor core, a DDR controller, an SDIO controller and a UART controller, wherein the DDR controller controls memory access of data, the SDIO controller controls reading and writing of data on the SD card, and the UART controller transmits data to the PC. The convolution operator accelerator is realized on the FPGA and comprises a convolution module, an activation module, a pooling module, a ping-pong buffer module, an AXI bus module and an on-chip buffer module, wherein the convolution module extracts features from the input feature map, the activation module improves the nonlinear modeling capability of the network, the pooling module reduces the feature dimension, the AXI bus module transfers data between the DDR, the ARM and the FPGA, and the on-chip buffer module comprises an input feature map buffer, an output feature map buffer, a weight parameter buffer and a bias parameter buffer. The parallel computing is
to process two dimensions of an input layer and an output layer in parallel, decompose a matrix of a