CN-122021737-A - Adaptive blocking deep learning convolution accelerator and implementation method thereof

CN 122021737 A

Abstract

The invention discloses an adaptive blocking deep learning convolution accelerator and an implementation method thereof, belonging to the technical field of deep learning hardware acceleration. The invention addresses the low MAC utilization and long data-transfer wait times of conventional CNN accelerators. A resource discovery unit analyzes convolutional layer parameters and platform resources to automatically compute an optimal execution configuration; a dynamic scheduler performs block division and round allocation to establish a round-level pipeline scheduling mechanism; a triple-buffer concurrency mechanism uses three independent buffers to simultaneously perform data acquisition, computation, and result write-back, overlapping multiple rounds of computation in time; and an adaptive MAC array performs multiply-accumulate operations according to the configuration and outputs the convolution results. The method effectively improves the MAC utilization and pipeline efficiency of the convolution accelerator and can be applied to edge-side deep learning inference acceleration scenarios.

Inventors

  • YU BIN
  • YANG BOYU
  • LI AO
  • LIU ZHIWEI
  • HUANG HAI

Assignees

  • 哈尔滨理工大学 (Harbin University of Science and Technology)

Dates

Publication Date
2026-05-12
Application Date
2026-01-27

Claims (7)

  1. An implementation method of an adaptive blocking deep learning convolution accelerator, characterized by comprising the following steps: S1, receiving configuration parameters of a convolution layer, analyzing the layer type and computational complexity through a resource discovery unit, automatically calculating an optimal resource allocation scheme according to the available DSP and BRAM resources of the target FPGA platform, and generating an execution configuration; S2, the dynamic scheduler performs block division and round allocation according to the execution configuration, decomposes the large-scale matrix operation into a plurality of computation rounds that can be processed in parallel, and establishes a round-level pipeline scheduling mechanism; S3, starting a triple-buffer concurrency mechanism, in which a data movement manager controls three independent buffers to simultaneously execute data acquisition, computation, and result write-back, respectively; S4, the adaptive MAC array reads the input feature map data and convolution weight data from the buffers according to the configuration information of the current round, performs multiply-accumulate operations, and writes the results into the output buffer; S5, cyclically executing steps S3-S4 until all computation rounds are completed, and outputting the final convolution result.
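Steps S3-S5 above describe a round-level pipeline over three buffers. As a minimal illustrative sketch (not from the patent; the callback names are hypothetical), the schedule can be expressed as a loop in which round r prefetches into buffer r mod 3 while round r-1 computes and round r-2 writes back:

```python
def run_rounds(total_rounds, fetch, execute, writeback):
    """Round-level pipeline sketch of steps S3-S5.

    In steady state, round r prefetches into buffer r % 3 while round
    r-1 executes on buffer (r-1) % 3 and round r-2 writes back from
    buffer (r-2) % 3. Two extra iterations drain the pipeline.
    """
    for r in range(total_rounds + 2):
        if r < total_rounds:
            fetch(r, r % 3)                # data acquisition into buffer r % 3
        if 1 <= r <= total_rounds:
            execute(r - 1, (r - 1) % 3)    # MAC computation on buffer (r-1) % 3
        if r >= 2:
            writeback(r - 2, (r - 2) % 3)  # result write-back from buffer (r-2) % 3
```

In hardware the three operations of one iteration run concurrently on different buffers; the sequential sketch only shows which round occupies which buffer at each step.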
  2. The implementation method of the adaptive blocking deep learning convolution accelerator according to claim 1, wherein in S1 the working process of the resource discovery unit is specifically as follows: receiving configuration parameters of the convolution layer, including the input channel number M, output channel number N, convolution kernel size K, feature map height H and width W, and the layer type; calculating the theoretical DSP requirement according to the layer type, namely the input channel number M for depthwise separable convolution, and M×N for standard and pointwise convolution; calculating the optimal DSP allocation according to the total number of DSPs available on the platform: optimal_dsp = min(theoretical_dsp, platform_dsp_total); calculating the required round number: optimal_rounds = ⌈(M×N)/optimal_dsp⌉; and outputting the execution configuration and the estimated cycle count.
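The resource-discovery arithmetic in claim 2 can be sketched directly from its two formulas; the following minimal Python version (function and parameter names are illustrative, not from the patent) computes the DSP allocation and round count:

```python
import math

def discover_config(layer_type, M, N, platform_dsp_total):
    """Execution-configuration calculation per claim 2.

    Theoretical DSP demand depends on the layer type: M units for
    depthwise separable convolution, M*N for standard and pointwise.
    """
    if layer_type == "depthwise":
        theoretical_dsp = M
    else:  # "standard" or "pointwise"
        theoretical_dsp = M * N
    optimal_dsp = min(theoretical_dsp, platform_dsp_total)
    optimal_rounds = math.ceil((M * N) / optimal_dsp)
    return optimal_dsp, optimal_rounds
```

For example, a standard 64x64-channel layer on a platform with 900 DSPs needs ⌈4096/900⌉ = 5 rounds.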
  3. The implementation method of the adaptive blocking deep learning convolution accelerator according to claim 2, wherein the working process of the dynamic scheduler in S2 and of the triple-buffer concurrency mechanism in S3 is specifically as follows: the dynamic scheduler divides blocks using a heuristic algorithm, computes the square root of the available DSP number D, searches for the optimal M_tile and N_tile combination in the range from 1 to √D, evaluates the efficiency of each combination as efficiency = (M×N)/(m_blocks×n_blocks×M_tile×N_tile), selects the blocking scheme with the highest efficiency, and assigns a unique round number to each computation round, the round states comprising ROUND_IDLE, ROUND_FETCH, ROUND_EXECUTE, and ROUND_WRITEBACK; the three buffers adopt a round rotation strategy: in steady-state operation, the first buffer performs result write-back, writing the result of round N-2 back to the DDR; the second buffer performs computation, the MAC array processing the data of round N-1; and the third buffer performs data acquisition, prefetching the input data and weights of round N from the DDR; the buffer state rotation rule is that after a round's write-back completes, the buffer is marked idle, and new rounds are preferentially assigned to idle buffers.
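The heuristic tile search in claim 3 can be sketched as an exhaustive scan over tile sides from 1 to √D. Note that the stated efficiency metric equals 1.0 whenever the tile sides divide M and N (including the trivial 1x1 tile), so this sketch additionally breaks ties toward the largest tile, i.e. the fewest rounds; that tie-break is an assumption not stated in the claim:

```python
import math

def search_tiles(M, N, D):
    """Heuristic block search per claim 3.

    Scans M_tile x N_tile combinations with both sides in 1..isqrt(D)
    (so the tile never needs more than D DSPs) and keeps the highest
    efficiency = (M*N) / (m_blocks * n_blocks * M_tile * N_tile).
    Ties are broken toward the larger tile (assumption).
    """
    limit = math.isqrt(D)
    best_eff, best_m, best_n = 0.0, 1, 1
    for m_tile in range(1, limit + 1):
        for n_tile in range(1, limit + 1):
            m_blocks = math.ceil(M / m_tile)
            n_blocks = math.ceil(N / n_tile)
            eff = (M * N) / (m_blocks * n_blocks * m_tile * n_tile)
            # prefer higher efficiency; on ties, prefer the bigger tile
            if (eff, m_tile * n_tile) > (best_eff, best_m * best_n):
                best_eff, best_m, best_n = eff, m_tile, n_tile
    return best_eff, best_m, best_n
```

With M=6, N=10, D=16 the scan finds a perfectly efficient 3x2 tile (both sides divide the channel counts), using 6 of the 16 DSPs per round.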
  4. The implementation method of the adaptive blocking deep learning convolution accelerator according to claim 3, wherein in S4 the computation process of the adaptive MAC array is specifically as follows: the MAC array comprises M×N MAC units, each supporting multi-bit multiplication and accumulation; the computation expression of a MAC unit is result = input_data×weight_data + partial_sum; the MAC array mapping strategy is direct mapping when M_tile and N_tile do not exceed the array dimensions, and time-division multiplexing mapping when M_tile or N_tile exceeds the array dimensions, decomposing the large block into a plurality of sub-blocks for sequential processing.
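The MAC expression and the two mapping modes of claim 4 can be sketched as follows; `map_tile` returns the list of (row, column) sub-block ranges the array processes sequentially (function names and the tuple layout are illustrative, not from the patent):

```python
def mac(input_data, weight_data, partial_sum):
    """Single MAC unit per claim 4: result = input*weight + partial_sum."""
    return input_data * weight_data + partial_sum

def map_tile(M_tile, N_tile, array_M, array_N):
    """Mapping strategy per claim 4.

    Direct mapping (a single pass) when the tile fits the physical
    array; otherwise time-division multiplexing that splits the tile
    into sub-blocks processed sequentially. Each pass is a tuple
    (m_start, m_end, n_start, n_end).
    """
    if M_tile <= array_M and N_tile <= array_N:
        return [(0, M_tile, 0, N_tile)]  # direct mapping: one pass
    passes = []
    for m0 in range(0, M_tile, array_M):
        for n0 in range(0, N_tile, array_N):
            passes.append((m0, min(M_tile, m0 + array_M),
                           n0, min(N_tile, n0 + array_N)))
    return passes
```

A 16x16 tile on an 8x8 array, for instance, decomposes into four 8x8 sub-blocks processed one after another.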
  5. An adaptive blocking deep learning convolution accelerator system, characterized in that it is used to implement the adaptive blocking deep learning convolution accelerator implementation method according to any one of claims 1-4, comprising an FPGA hardware platform and an AXI bus interface; the FPGA hardware platform comprises a programmable logic part and an off-chip DDR memory; the programmable logic part comprises a top-level control module, a resource discovery unit, a dynamic scheduler, an adaptive MAC array, and a data movement manager; the resource discovery unit is used to analyze convolution layer parameters and platform resources and compute the optimal execution configuration; the dynamic scheduler is used for block division, round allocation, and three-stage pipeline management; the adaptive MAC array is used to execute multiply-accumulate operations; the data movement manager is used for data transfer between the DDR memory and the BRAM buffers; the AXI bus interface comprises an AXI4 master interface and an AXI-Lite configuration interface, used for data transfer and parameter configuration, respectively.
  6. The adaptive blocking deep learning convolution accelerator system of claim 5, wherein the adaptive MAC array is parameterizable, and wherein each MAC unit implements multiplication using a DSP hard core of the FPGA.
  7. The adaptive blocking deep learning convolution accelerator system of claim 6, wherein the system supports three convolution layer types (standard convolution, depthwise convolution, and pointwise convolution), and wherein key parameters of the system are configurable via a configuration file, including the maximum feature map size, maximum channel number, MAC array size, maximum round number, buffer number, data bit width, weight bit width, result bit width, address bit width, AXI data bit width, and maximum burst length.

Description

Adaptive blocking deep learning convolution accelerator and implementation method thereof

Technical Field

The invention relates to a convolutional neural network computing device and an implementation method, in particular to a convolutional neural network computing device and implementation method based on adaptive time-division multiplexing, and belongs to the technical field of deep learning hardware acceleration.

Background

In recent years, deep learning technology has developed rapidly, and convolutional neural networks have been widely used in fields such as image recognition, object detection, and autonomous driving. To meet the low-power and real-time inference requirements of edge computing scenarios, the FPGA has become an important platform for deploying convolutional neural networks owing to its reconfigurability and parallel computing capability. However, the DSP count and BRAM capacity differ greatly across FPGA platforms, so accelerators designed for a specific platform are difficult to migrate directly to other platforms; when the scale of a convolution layer exceeds the hardware resources, a traditional fixed architecture cannot effectively utilize all DSP resources; the traditional double-buffer mechanism incurs wait time between data transfer and computation, which reduces pipeline efficiency; and the parameters of different layers of a convolutional neural network vary greatly, so a fixed configuration cannot be optimal for every layer. These factors make FPGA deployment difficult, leading to poor cross-platform portability, low DSP resource utilization, and insufficient overlap of computation and memory access when deploying convolutional neural networks on heterogeneous FPGA platforms.
Based on the above, the invention provides a convolutional neural network computing device based on adaptive time-division multiplexing and an implementation method thereof, which realize cross-platform adaptive deployment and can automatically adjust the time-division multiplexing configuration according to the hardware resources.

Disclosure of Invention

The following presents a simplified summary of the invention in order to provide a basic understanding of some aspects of the invention. It should be understood that this summary is not an exhaustive overview of the invention. It is not intended to identify key or critical elements of the invention or to delineate the scope of the invention. Its purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is discussed later. In view of the above, the invention provides a convolutional neural network computing device based on adaptive time-division multiplexing and an implementation method thereof, to solve the problems of poor cross-platform portability and low DSP resource utilization of FPGA convolution accelerators in the prior art.
Technical scheme I is a convolutional neural network computation implementation method based on adaptive time-division multiplexing, comprising the following steps: S1, acquiring the resource configuration parameters of the target hardware platform and the layer parameters of the convolution layer to be processed, the resource configuration parameters comprising the DSP number D and the BRAM capacity B, and the layer parameters comprising the input channel number M, output channel number N, convolution kernel size K, and feature map size H; S2, calculating the time-division multiplexing configuration parameters based on the resource configuration parameters and the layer parameters, the time-division multiplexing configuration parameters comprising the optimal DSP usage R and the total time-division multiplexing round number T, where R = min(D, M×N) and T = ⌈(M×N)/R⌉; S3, configuring a triple-buffer round-level pipeline that maps consecutive execution rounds to three physical buffers A, B, C by modulo operation, so that the three operations of data prefetch, MAC computation, and result write-back are executed concurrently on different buffers; S4, configuring the enable state and channel mapping of each MAC unit according to the MAC unit allocation table, executing multiply-accumulate operations, and outputting the final result when the accumulation count reaches the set value. Further, in S1, the target hardware platform includes the Intel Cyclone V series, Intel Arria series, Xilinx Zynq-7000 series, and Xilinx Zynq UltraScale+ series, and the corresponding DSP number D and BRAM capacity B are obtained automatically according to the platform type. Further, in S2, the calculation process of the time d