CN-119416842-B - Deep neural network accelerator system with automatic configuration and implementation method thereof
Abstract
The invention discloses an automatically configured deep neural network accelerator system and an implementation method thereof. The deep neural network accelerator system comprises a deep neural network accelerator and a code generation module. The deep neural network accelerator comprises an off-chip memory, a bus interface, a DMA module, a controller, a data segmentation module, a data integration module, an address generation engine, a calculation module, a nonlinear operation module, an input buffer, an offset buffer, a weight buffer and an output buffer. The code generation module generates corresponding RTL code according to user requirements and transmits the RTL code to the controller; the controller configures the calculation module and the DMA module according to the RTL code to obtain the configured deep neural network accelerator, which then executes the neural network calculation. The invention improves the adaptability of the neural network accelerator to various application scenarios, improves the calculation efficiency of the neural network, and can be widely applied in the technical field of neural network accelerators.
Inventors
- MAO WENDONG
- ZENG QIUHAO
- WANG ZHONGFENG
Assignees
- Sun Yat-sen University, Shenzhen (中山大学·深圳)
- Sun Yat-sen University (中山大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20241012
Claims (8)
- 1. An automatically configured deep neural network accelerator system, characterized by comprising a deep neural network accelerator and a code generation module, wherein the deep neural network accelerator comprises an off-chip memory, a bus interface, a DMA module, a controller, a data segmentation module, a data integration module, an address generation engine, a calculation module, a nonlinear operation module, an input buffer, an offset buffer, a weight buffer and an output buffer, wherein: the DMA module and the controller are connected with the off-chip memory through the bus interface; the data segmentation module and the data integration module are in signal connection with the DMA module; the input end of the weight buffer, the input end of the offset buffer and the first input end of the input buffer are connected with the output end of the data segmentation module; the output end of the offset buffer is connected with the second input end of the input buffer through the address generation engine; the output end of the input buffer and the output end of the weight buffer are connected with the input end of the calculation module; the output end of the calculation module is connected with the input end of the output buffer through the nonlinear operation module; the output end of the output buffer is connected with the input end of the data integration module; the calculation module is in signal connection with the controller; and the output end of the code generation module is connected with the input end of the controller.
The code generation module is used for generating corresponding RTL code according to user requirements and transmitting the RTL code to the controller; the controller is used for configuring the calculation module and the DMA module according to the RTL code to obtain a configured deep neural network accelerator; and the following neural network calculation steps are executed by the deep neural network accelerator: loading input data from the off-chip memory through the DMA module, and segmenting the input data through the data segmentation module to obtain activation data, offset data and weight data, then storing the activation data into the input buffer, the offset data into the offset buffer and the weight data into the weight buffer; acquiring the activation data, the offset data and the weight data through the calculation module, and performing a matrix multiplication, convolution, transposed convolution, deformable convolution, dilated convolution, attention or deformable attention operation on them to obtain a calculation result; performing an activation operation on the calculation result through the nonlinear operation module to obtain output data, and storing the output data into the output buffer; and integrating the output data through the data integration module to obtain integrated output data, and writing the integrated output data back to the off-chip memory through the DMA module.
- 2. The automatically configured deep neural network accelerator system of claim 1, wherein the calculation module comprises a matrix multiplication block, a matrix transposition unit, an intermediate buffer and a SoftMax unit, wherein: the output of the input buffer and the output of the weight buffer are both connected to the input of the matrix multiplication block; the first output of the matrix multiplication block is connected to the input of the matrix transposition unit; the output of the matrix transposition unit is connected to the input of the SoftMax unit through the intermediate buffer; the second output of the matrix multiplication block, the output of the intermediate buffer and the output of the SoftMax unit are all connected to the input of the output buffer through the nonlinear operation module; and the matrix transposition unit and the SoftMax unit are both in signal connection with the controller.
- 3. The automatically configured deep neural network accelerator system of claim 2, wherein the matrix multiplication block comprises a plurality of processing units, each processing unit comprising a plurality of multipliers and a first accumulator, the output of the input buffer and the output of the weight buffer each being coupled to the inputs of the processing units, and the output of the matrix multiplication block being obtained by combining the calculation results of the processing units.
- 4. The automatically configured deep neural network accelerator system of claim 2, wherein the matrix transposition unit comprises a first register array that performs read operations in a first stage and write operations in a second stage, and a second register array that performs write operations in the first stage and read operations in the second stage.
- 5. The automatically configured deep neural network accelerator system of claim 2, wherein the SoftMax unit comprises an exponent module, a second accumulator, a divider and a FIFO memory, wherein: the input of the exponent module is coupled to the output of the intermediate buffer; the first output of the exponent module is coupled to the first input of the divider through the FIFO memory; the second output of the exponent module is coupled to the second input of the divider through the second accumulator; and the output of the divider is coupled to the input of the output buffer through the nonlinear operation module.
- 6. The automatically configured deep neural network accelerator system of claim 1, wherein the address generation engine is configured to obtain the absolute coordinates of the two-dimensional offsets input from the offset buffer, superimpose the absolute coordinates on the initial input activation value to obtain an offset input address, and transmit the offset input address to the input buffer.
- 7. The automatically configured deep neural network accelerator system of claim 1, wherein the code generation module generates the RTL code by: acquiring a user requirement prompt file or a modification requirement file, and inputting it together with a preset template file into ChatGPT to obtain a target Python file; and running the target Python file to obtain the RTL code.
- 8. A method for implementing an automatically configured deep neural network accelerator system, applied to the automatically configured deep neural network accelerator system of any one of claims 1 to 7, comprising the steps of: generating corresponding RTL code according to user requirements through the code generation module, and transmitting the RTL code to the controller; configuring the calculation module and the DMA module according to the RTL code through the controller to obtain a configured deep neural network accelerator; and performing neural network calculation through the deep neural network accelerator.
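The calculation steps recited in claims 1 and 8 (DMA load, segmentation into on-chip buffers, matrix computation, nonlinear activation, integration and write-back) can be sketched behaviorally. The following Python model is a hypothetical illustration under simplifying assumptions of our own: the function and buffer names are not from the patent, ReLU stands in for the unspecified activation, and the offset/address-generation path used by the deformable operations is omitted.

```python
def run_accelerator(dram_blob, rows, inner, cols):
    """Behavioral sketch of one pass through the accelerator (claim 1).

    dram_blob models the off-chip memory: a flat list holding an
    activation matrix (rows x inner) followed by a weight matrix
    (inner x cols), both row-major. The offset path (address
    generation engine for deformable operations) is omitted.
    """
    # 1. DMA load + data segmentation into the on-chip buffers.
    a_end = rows * inner
    input_buffer = dram_blob[:a_end]                       # activations
    weight_buffer = dram_blob[a_end:a_end + inner * cols]  # weights

    # 2. Calculation module: plain matrix multiplication.
    out = [[sum(input_buffer[r * inner + k] * weight_buffer[k * cols + c]
                for k in range(inner))
            for c in range(cols)] for r in range(rows)]

    # 3. Nonlinear operation module: ReLU as an example activation.
    activated = [[max(0, v) for v in row] for row in out]

    # 4. Data integration + DMA write-back (flattened, row-major).
    return [v for row in activated for v in row]
```

With a 1x2 activation [1, -2] and a 2x2 identity weight matrix, the pass yields [1, 0] after ReLU.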
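The processing-unit structure of claim 3 (parallel multipliers feeding a first accumulator, with the block output combined from the per-unit results) can be modeled as follows. This is a minimal functional sketch, not the patented circuit; the function names are our own, and hardware-level aspects such as fixed-point width and pipelining are not modeled.

```python
def processing_unit(act_vec, wgt_vec):
    """One processing unit (claim 3): parallel multipliers feeding a
    first accumulator, yielding one dot-product result."""
    products = [a * w for a, w in zip(act_vec, wgt_vec)]  # multipliers
    acc = 0
    for p in products:  # first accumulator sums the products
        acc += p
    return acc

def matrix_multiplication_block(A, W):
    """The block instantiates many such units; here each output element
    is produced by one processing-unit invocation and the results are
    combined into the output matrix."""
    cols = len(W[0])
    w_cols = [[row[c] for row in W] for c in range(cols)]  # column view
    return [[processing_unit(a_row, w_cols[c]) for c in range(cols)]
            for a_row in A]
```

In the real block the processing units run in parallel and tile larger matrices; the sequential loop only captures the functional decomposition.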
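Claim 5's SoftMax unit streams exponentials into two paths: a FIFO holding each exp(x) for the divider's first input, and a second accumulator summing them for the divider's second input. A behavioral sketch, assuming floating-point arithmetic (the hardware would use a fixed-point exponent module) and adding max-subtraction for numerical safety, which the claim does not mention:

```python
from collections import deque
import math

def softmax_unit(scores):
    """Streaming sketch of the SoftMax unit in claim 5: an exponent
    module feeds a FIFO (first divider input) and a second accumulator
    (second divider input); the divider then emits exp(x)/sum(exp)."""
    fifo = deque()
    acc = 0.0
    m = max(scores)            # max-subtraction: our numerical-safety addition
    for x in scores:           # exponent module, one value per cycle
        e = math.exp(x - m)
        fifo.append(e)         # first output -> FIFO memory
        acc += e               # second output -> second accumulator
    # divider: drain the FIFO, normalizing by the accumulated sum
    return [fifo.popleft() / acc for _ in range(len(scores))]
```

The FIFO is what lets a single pass over the exponent module serve both divider inputs: the sum must be complete before the first division can be issued.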
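The address generation engine of claim 6 turns the two-dimensional offsets from the offset buffer into input-buffer addresses. The sketch below is an interpretation, not the claimed circuit: it reads the "initial input activation value" as the base address of the current sampling position, assumes a row-major layout of the stated width, and assumes integer offsets (fractional offsets would require the bilinear interpolation mentioned in the description).

```python
def address_generation_engine(base_addr, offset_pairs, width):
    """Sketch of claim 6: convert each two-dimensional offset read from
    the offset buffer into absolute coordinates, superimpose them on
    the base address of the input activation, and emit the resulting
    input-buffer addresses (row-major layout assumed)."""
    addrs = []
    for dy, dx in offset_pairs:
        row, col = divmod(base_addr, width)  # coordinates of the base
        row, col = row + dy, col + dx        # superimpose the 2D offset
        addrs.append(row * width + col)      # back to a flat address
    return addrs
```

For example, with a base address 10 in an 8-wide feature map (row 1, column 2), an offset of (1, 1) samples row 2, column 3, i.e. address 19.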
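In the claim-7 flow, ChatGPT receives the user requirement prompt file and a preset template file and emits a target Python file whose execution produces the RTL code. The final template-fill step might look like the hypothetical sketch below; the module, template, and parameter names are our own illustration, not the patent's templates.

```python
from string import Template

# Hypothetical preset RTL template fragment (a parameterized MAC cell);
# in the claimed flow, ChatGPT specializes such a template into a
# Python generator script whose fill step resembles generate_rtl().
RTL_TEMPLATE = Template("""\
module mac_pe #(parameter W = $width) (
    input  signed [W-1:0]   a, b,
    input  signed [2*W-1:0] acc_in,
    output signed [2*W-1:0] acc_out
);
    assign acc_out = acc_in + a * b;
endmodule
""")

def generate_rtl(width):
    """Final step of the claim-7 flow: running the generated Python
    file substitutes the user's configuration into the template and
    returns the RTL code as text."""
    return RTL_TEMPLATE.substitute(width=width)
```

Keeping the structural Verilog in templates and using the language model only to specialize them sidesteps the generation-length limit discussed in the Background section.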
Description
Deep neural network accelerator system with automatic configuration and implementation method thereof

Technical Field
The invention relates to the technical field of neural network accelerators, in particular to an automatically configured deep neural network accelerator system and an implementation method thereof.

Background
In the prior art, the problem of automatically generating flexible hardware accelerators for different deep neural networks is fourfold. In a first aspect, different operations imply different computing modes and circuit structures, and unifying the computing processes of the various operations under limited hardware resources is a great challenge for hardware flexibility. In a second aspect, deformable attention and deformable convolution involve sampling of irregular receptive fields and dynamic irregular memory access, which increases the overhead of bilinear interpolation and hinders data reuse. In a third aspect, directly using ChatGPT to generate code is not effective, and its generation capability in specific fields, especially for RTL code, needs to be improved. In a fourth aspect, ChatGPT has a limit on the length of generated text, which prevents it from meeting the RTL code generation requirements of very-large-scale integrated circuit designs. The above problems need to be solved.

Disclosure of Invention
In order to solve the above technical problems, the invention aims to provide an automatically configured deep neural network accelerator system and an implementation method thereof, which can automatically generate corresponding calculation kernels according to different configuration information, thereby improving the adaptability of the neural network accelerator to various application scenarios as well as the calculation efficiency of the neural network.
The first technical scheme adopted by the invention is as follows: an automatically configured deep neural network accelerator system comprises a deep neural network accelerator and a code generation module, wherein the deep neural network accelerator comprises an off-chip memory, a bus interface, a DMA module, a controller, a data segmentation module, a data integration module, an address generation engine, a calculation module, a nonlinear operation module, an input buffer, an offset buffer, a weight buffer and an output buffer, wherein: the DMA module and the controller are connected with the off-chip memory through the bus interface; the data segmentation module and the data integration module are in signal connection with the DMA module; the input end of the weight buffer, the input end of the offset buffer and the first input end of the input buffer are connected with the output end of the data segmentation module; the output end of the offset buffer is connected with the second input end of the input buffer through the address generation engine; the output end of the input buffer and the output end of the weight buffer are connected with the input end of the calculation module; the output end of the calculation module is connected with the input end of the output buffer through the nonlinear operation module; the output end of the output buffer is connected with the input end of the data integration module; the calculation module is in signal connection with the controller; and the output end of the code generation module is connected with the input end of the controller.
Further, the calculation module includes a matrix multiplication block, a matrix transposition unit, an intermediate buffer and a SoftMax unit; the output end of the input buffer and the output end of the weight buffer are connected with the input end of the matrix multiplication block; the first output end of the matrix multiplication block is connected with the input end of the matrix transposition unit; the output end of the matrix transposition unit is connected with the input end of the SoftMax unit through the intermediate buffer; the second output end of the matrix multiplication block, the output end of the intermediate buffer and the output end of the SoftMax unit are connected with the input end of the output buffer through the nonlinear operation module; and the matrix transposition unit and the SoftMax unit are in signal connection with the controller.

Further, the matrix multiplication block includes a plurality of processing units, each processing unit includes a plurality of multipliers and a first accumulator, the output end of the input buffer and the output end of the weight buffer are both connected with the inputs of the processing units, and the output of the matrix multiplication block is obtained by combining the calculation results of the processing units.

Further, the matrix transposition unit includes a first register array that performs a read operation in a first stage and a write operation in a second stage, and a second register array that performs a write operation in the first stage and a read operation in the second stage. Fu
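The two-register-array (ping-pong) transpose described above alternates roles each stage: while one array is being written with the incoming tile, the other is read out transposed. A behavioral sketch under our own assumptions (square n x n tiles, sequential modeling of what the hardware does concurrently; function names are illustrative):

```python
def pingpong_transpose(tiles, n):
    """Sketch of the matrix transposition unit: two register arrays
    alternate read/write roles, so a new n x n tile is written into one
    array while the previous tile streams out of the other, column-wise,
    and therefore transposed."""
    arrays = [[[0] * n for _ in range(n)], [[0] * n for _ in range(n)]]
    out, write_sel = [], 0
    for t, tile in enumerate(tiles):
        # write stage: current tile goes into the selected register array
        for r in range(n):
            for c in range(n):
                arrays[write_sel][r][c] = tile[r][c]
        # read stage: the other array streams out the previous tile
        if t > 0:
            prev = arrays[1 - write_sel]
            out.append([[prev[c][r] for c in range(n)] for r in range(n)])
        write_sel = 1 - write_sel  # swap roles for the next stage
    # drain the last tile from the array written most recently
    last = arrays[1 - write_sel]
    out.append([[last[c][r] for c in range(n)] for r in range(n)])
    return out
```

In hardware the write and read of a stage overlap in time, which is the point of keeping two arrays; the sequential loop only captures the alternation, not the concurrency.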