CN-122021752-A - Data storage and access method and system for a spiking convolutional neural network accelerator

CN122021752A

Abstract

The application belongs to the technical field of neural network hardware accelerators and integrated circuit design, and provides a data storage and access method and system for a spiking convolutional neural network (SCNN) accelerator. Targeting the binarization and multi-time-step inference characteristics of spike data in an SCNN, the disclosed embodiments design a data storage and access scheme matched to the systolic-array computation data flow. By adopting a storage order in the off-chip memory that takes the row direction of the input spike data as primary, groups data by input channel, and organizes the time-step dimension contiguously, the off-chip memory can be accessed in continuous bursts both during data loading and when writing back computation results. In addition, combined with an input buffer of multi-BANK structure, multi-channel input data of adjacent rows can be read in parallel during convolution-window expansion.
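As a rough illustration of the storage order described above, the layout can be sketched as an address computation. The nesting order below (row, then input-channel group, then column, with the time steps of each word contiguous) is one plausible reading of the claims, and all names and parameters are illustrative, not taken from the patent:

```python
def word_addr(row, group, col, t, num_groups, width, num_steps):
    """Word address of one channel-group word of input spike data.

    Assumed nesting (outermost to innermost): row -> input-channel
    group -> column -> time step, so all time steps of one
    (row, group, col) position occupy consecutive words.
    """
    return ((row * num_groups + group) * width + col) * num_steps + t


# Scanning time steps (and then columns) of one channel group within a
# row touches strictly consecutive addresses, which is what makes burst
# (continuous) off-chip access possible under this layout.
addrs = [word_addr(0, 0, c, t, num_groups=4, width=32, num_steps=4)
         for c in range(2) for t in range(4)]
assert addrs == list(range(8))
```

Because each word packs the spikes of one channel group at one column, the time-step loop lands innermost and contiguous, which is the property the abstract attributes to the proposed layout.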

Inventors

  • Zhao Chen
  • Zhao Wanwan
  • Yao Yichu
  • Niu Qiang
  • Li Qian

Assignees

  • Northwestern Polytechnical University (西北工业大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-06

Claims (9)

  1. A data storage and access system for a spiking convolutional neural network (SCNN) accelerator, comprising: a computing module, a memory access interface module, and a plurality of on-chip buffers, wherein the memory access interface module comprises an input read controller, a weight read controller, and an output write-back controller; an off-chip memory is used for storing input spike data, convolution kernels, and output feature maps; the input read controller is used for reading the input spike data from the off-chip memory; the weight read controller is used for reading the convolution kernels from the off-chip memory; the plurality of on-chip buffers are used for buffering the convolution kernels and the input spike data; the computing module is used for performing convolution computation on the input spike data in combination with the convolution kernels to generate the output feature maps; and the output write-back controller is used for writing the output feature maps back to the off-chip memory.
  2. The data storage and access system for a spiking convolutional neural network accelerator according to claim 1, wherein the input spike data are logically partitioned in the off-chip memory by input channel group, time step, and spatial row/column direction to form a plurality of input data blocks, wherein the specific storage order takes the row direction of the input spike data as the primary order; within the same row, the data of the different time steps are stored contiguously in the order of the input channel groups; and within each input channel group, the data are further ordered primarily along the column direction of the input spike data.
  3. The data storage and access system for a spiking convolutional neural network accelerator of claim 2, wherein the on-chip buffers comprise an input buffer, a weight buffer, and an output buffer; the input buffer adopts a multi-BANK structure, each BANK being used for storing one row of input spike data over a plurality of time steps; the storage order of the input spike data is consistent with the order of the corresponding data in the off-chip memory; the bit width of each storage word matches the input-channel parallelism, so that each word stores the input spike data of a plurality of channels of the same column within the same input channel group; and the weight buffer is used to store the convolution kernels.
  4. The data storage and access system for a spiking convolutional neural network accelerator of claim 3, wherein the computing module comprises: a convolution-window expansion sub-module, a data skewing sub-module, a systolic array, a pooling sub-module, and a neuron state update and spike generation sub-module, wherein the convolution-window expansion sub-module is used for performing a convolution-window expansion operation on the input spike data to form a matrix suitable for systolic-array computation; the data skewing sub-module is used for adjusting the timing of the convolution kernels and of the expanded matrix to meet the timing requirements of the systolic array; the systolic array is used for executing the convolution computation to obtain the convolution result of the current time step; the neuron state update and spike generation sub-module is used for updating the neuron membrane potential according to the convolution results over a plurality of time steps and generating output spikes to form a convolution result block; and the pooling sub-module is used for pooling the convolution results.
  5. The data storage and access system for a spiking convolutional neural network accelerator of claim 4, wherein, when the number of time steps is greater than 1, the input channel data of the same spatial location at different time steps are processed sequentially, so as to avoid frequently writing the neuron state shared between time steps back to the off-chip memory; when all time steps of the current spatial location have been computed, the next convolution result block is processed along the column direction of the input spike data; and after all row input data have been computed at all time steps, the results in the output buffer are written back to the off-chip memory in a predetermined order.
  6. The data storage and access system for a spiking convolutional neural network accelerator of claim 5, further comprising a control module, the control module comprising a convolution controller, a pooling controller, and a global controller, wherein the convolution controller and the pooling controller are respectively used for controlling the internal operation timing of the convolution-computation modules and of the pooling module; and the global controller is used for performing unified scheduling of each functional module and coordinating data transfer among the modules.
  7. The data storage and access system for a spiking convolutional neural network accelerator of claim 6, wherein the on-chip buffers further comprise a neuron state memory and an accumulation buffer, wherein the neuron state memory is used for storing the neuron states; and the accumulation buffer is used for accumulating the convolution output results to support subsequent pooling.
  8. The data storage and access system for a spiking convolutional neural network accelerator of claim 7, wherein the memory access interface module further comprises an AXI master interface and asynchronous FIFOs, wherein the AXI master interface is used for initiating data access requests to the off-chip memory; and the asynchronous FIFOs comprise a read FIFO and a write FIFO for cross-clock-domain data transfer on the read path and the write path, respectively.
  9. A data storage and access method for a spiking convolutional neural network accelerator, characterized by comprising the following steps: reading input spike data from an off-chip memory; reading convolution kernels from the off-chip memory; buffering the convolution kernels and the input spike data; performing convolution computation on the input spike data in combination with the convolution kernels to generate an output feature map; and writing the output feature map back to the off-chip memory.
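The neuron state update and spike generation described in claims 4 and 5 can be sketched as follows. This is a minimal sketch assuming a leaky integrate-and-fire model with reset-to-zero; the patent does not fix the neuron model, threshold, or leak factor, so all numeric parameters here are illustrative:

```python
def lif_step(v, conv_result, v_th=1.0, leak=0.5):
    """One time step of membrane-potential update and spike generation.

    v           -- membrane potential carried over from the previous step
    conv_result -- convolution result of the current time step
    Returns the updated potential and the output spike (0 or 1).
    (Illustrative LIF model; the claims leave the neuron model open.)
    """
    v = leak * v + conv_result
    if v >= v_th:
        return 0.0, 1      # fire, then reset to zero
    return v, 0


# Processing all time steps of one spatial location back to back, as in
# claim 5, lets the membrane potential stay in on-chip storage instead of
# being written back to off-chip memory between time steps.
v, spikes = 0.0, []
for conv_result in [0.4, 0.4, 0.9, 0.1]:
    v, s = lif_step(v, conv_result)
    spikes.append(s)
assert spikes == [0, 0, 1, 0]
```

The sequential dependence of `v` across time steps is what makes the claimed processing order matter: any layout that interleaves spatial locations between time steps would force the neuron state out to off-chip memory and back.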

Description

Data storage and access method and system for a spiking convolutional neural network accelerator

Technical Field

The embodiments of the disclosure relate to the technical field of neural network hardware accelerators and integrated circuit design, and in particular to a data storage and access method and system for a spiking convolutional neural network accelerator.

Background

Spiking neural networks (SNNs) have a potential low-power advantage due to the spike sparsity exhibited during computation, and have attracted wide attention from academia and industry in recent years. The spiking convolutional neural network (SCNN) introduces a convolutional structure on top of the SNN, giving it spatial feature-extraction capability while retaining the low-power characteristic, so that it can efficiently execute perception and inference tasks on resource-constrained platforms such as edge devices. To meet the performance and energy-efficiency requirements of SCNNs, researchers have begun to explore dedicated hardware accelerator architectures for them. Among these, the systolic array is widely used in conventional convolutional neural network (CNN) accelerators owing to its regular data flow and good scalability, and has gradually been introduced into hardware implementations supporting spiking convolution operations. In systolic-array-based SCNN accelerators, the computation process is usually clock-driven, and each processing element performs convolution operations and neuron state updates in a predetermined data-flow manner in discrete clock cycles. In the field of conventional CNN accelerators, there has been a great deal of research on the organization and access of data in the storage system.
Typical works include the DianNao series of accelerators (see "DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning" and related papers), which reduce the number of off-chip memory accesses, and hence power consumption, by efficiently mapping weights and activation data to on-chip memory; the Eyeriss architecture (see "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks" and related papers), which systematically analyzes the effect of different data flows, such as weight-stationary and output-stationary, on data reuse and memory access behavior, and optimizes the data layout of the on-chip buffers and the off-chip memory; and the Google TPU (see "In-Datacenter Performance Analysis of a Tensor Processing Unit" and related papers), which adopts a large-scale systolic array structure and improves computational throughput and memory-bandwidth utilization through reasonable data partitioning and memory scheduling. In contrast, hardware accelerator studies for SNNs, and especially for SCNNs, are relatively few. Existing work focuses more on neuron model implementation, neuron state storage and update mechanisms, and spike computing unit design in general spiking neural network or neuromorphic computing platforms (see "TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip", "Loihi: A Neuromorphic Manycore Processor with On-Chip Learning", and other papers).
In hardware implementations for SCNNs, many studies have accelerated spiking convolution operations from the perspectives of computing architecture and data-flow scheduling (see "SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks" and other papers), while at the level of storage organization and data access, some SCNN accelerators still retain, or only slightly modify, the data storage scheme of conventional CNN accelerators; that is, spike data are stored primarily along the spatial dimension or the channel dimension, and the corresponding data are read repeatedly at different time steps to complete inference (see "SIES: A Novel Implementation of Spiking Convolutional Neural Network Inference Engine on FPGA", "DeepFire: Acceleration of Convolutional Spiking Neural Network on FPGAs", and other papers). Fig. 1 and Fig. 2 show a prior-art storage order of spike data organized preferentially along the spatial dimension and a prior-art storage order organized preferentially along the channel dimension, respectively. In these storage schemes, the time-step dimension is not arranged as an explicit dimension of the data storage organization; the input spike data of each time step are stored independently with the same layout as that of a single time step, and the corresponding data must be accessed repeatedly for different time steps during inference. However, such storage methods often fail to fully account for the binary nature of spike data and the additional memory-access demands imposed by multi-time-step inference. In recent years