CN-116954472-B - Non-continuous input data processing method of fc layer address based on NNA
Abstract
The invention provides an NNA-based method for processing FC-layer input data with non-contiguous addresses, characterized in that redundant padding data are carried by DMA to guarantee 64-byte alignment of the carried data, and the carried data are stored in the ORAM space for NNA computation. The method solves the input-FeatureMap byte-alignment problem that arises when DMA is used to carry the data, and guarantees correct DMA transfers under the row-alignment constraint of the fully connected layer. The weights and the input data are computed in row blocks, reducing the waiting time between data transfer and computation.
Inventors
- NI ZHAOFENG
- WANG LIZHI
Assignees
- 合肥君正科技有限公司 (Hefei Junzheng Technology Co., Ltd.)
Dates
- Publication Date
- 20260508
- Application Date
- 20220329
Claims (2)
- 1. An NNA-based method for processing FC-layer input data with non-contiguous addresses, characterized in that the non-contiguous addresses require row alignment, the address jump between rows must be recalculated, and a fixed number of rows is processed each time so that the input address jumps remain correct; redundant padding data are carried by DMA to guarantee 64-byte alignment of the carried data, where the redundant padding data are the bytes that must still be carried when the tail of a row holds fewer than 64 bytes, because the hardware requires full 64-byte transfers; the carried data are stored in the ORAM space for NNA computation. The method specifically comprises the following steps: S1, initialization: set the FeatureMap height H, width W and input channel count IC to 3, 3 and 256, with feature_shape = [1, 8, 3, 3, 32] (feature_shape is the shape of the input data, understood abstractly as multi-dimensional data ordered NDHWC from high to low dimension, left to right; N, the number of FeatureMaps processed at a time, is set to 1; D is the number of groups obtained by splitting the input channels into groups of 32; H is the height of the input data; W is the width of the input data; C is 32, the input channels being split in groups of 32); S2, compute iline_size: iline_size = align64(W × 32 × bit_width / 8). Because the addresses are non-contiguous, the address jump from row to row must be considered; the offset of each jump is iline_size, the input row size in bytes. Due to hardware limitations, the data size of each transfer must be 64-byte aligned; S3, grouped convolution accumulation: S3.1, group the height according to the ORAM space size, using the formulas sub_h = oram_size / iline_size and group_num = (height + sub_h - 1) / sub_h, i.e. sub_h is the number of rows processed per loop under the space constraint; this ensures that all channels participate in the computation. The data size of each DMA transfer is sub_h × align64(iline_size), i.e. the row size times the number of rows to process equals the number of bytes computed each time; S3.2, based on the NNA, perform a MAC computation on each row of pixels and accumulate the MAC results each time, finally obtaining the convolution results of all pixels: pixel_num = sub_h × align64(iline_size); MAC accumulation: sum = temp0 + temp1 + … + tempn; Result = sum; S4, the last sub_h loop: judge whether the current sub_h loop is the last one; if so, the redundant elements need not be carried; if not, the value of sub_h is unchanged.
- 2. The method of claim 1, wherein in step S3.1, when the ORAM space is insufficient to hold all the data, the input data are carried in groups and the largest amount of data that fits in the space is carried each time; otherwise, if all the data fit, the data are DMA-carried to the ORAM space in a single transfer.
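The arithmetic in claim 1 (align64, iline_size, the sub_h grouping of step S3.1, and the shortened last loop of step S4) can be sketched in plain Python. This is a sketch only: the DMA and NNA MAC hardware operations are omitted, and the `oram_size=256` value in the example is a hypothetical small buffer, not the patent's ORAM size.

```python
def align64(nbytes: int) -> int:
    # Line alignment required by the hardware: round up to a 64-byte multiple.
    return (nbytes + 63) // 64 * 64

def plan_fc_transfer(height: int, width: int, bit_width: int, oram_size: int):
    # S2: per-row byte size, padded so each DMA block is 64-byte aligned.
    # C is fixed at 32 channels per group, as in the claim.
    iline_size = align64(width * 32 * bit_width // 8)

    # S3.1: rows that fit in ORAM per transfer, and number of loops needed.
    sub_h = oram_size // iline_size
    group_num = (height + sub_h - 1) // sub_h

    plan = []
    for g in range(group_num):
        # S4: the last loop may carry fewer rows (no redundant tail rows).
        rows = min(sub_h, height - g * sub_h)
        plan.append((g * sub_h, rows, rows * iline_size))
    return iline_size, sub_h, group_num, plan

# Example with H=3, W=3, IC=256 and 8-bit data, as in step S1:
# each 96-byte row pads to 128 bytes, so a 256-byte ORAM holds 2 rows,
# giving 2 loops of (2 rows, 1 row).
iline, sub_h, groups, plan = plan_fc_transfer(height=3, width=3,
                                              bit_width=8, oram_size=256)
```

The plan returned here, `[(0, 2, 256), (2, 1, 128)]`, shows the S4 behavior: the final loop transfers only the one remaining row rather than a full sub_h block.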
Description
Non-continuous input data processing method of fc layer address based on NNA
Technical Field
The invention relates to the technical field of neural networks, and in particular to an NNA-based method for processing FC-layer input data with non-contiguous addresses.
Background
Neural-network applications are becoming increasingly widespread, so chip manufacturers build dedicated chips for neural-network algorithms, especially inference-side chips, i.e. Neural Network Accelerators (NNAs). In the prior art, NNA1.0 supports fast multi-channel convolution operations. A convolutional neural network mainly comprises an input layer, convolutional layers, pooling layers and a fully connected layer; the fully connected layer (fc) multiplies the input data element-wise by the weights and accumulates the products into a single number, which is stored in the output space. The rules for DMA transfers require the source address and the destination address to be 64-byte aligned, and the amount of transferred data to be 64-byte aligned as well; DMA data correctness is guaranteed only when all three conditions are satisfied. The input-layer operator carries data by rows via DMA, i.e. each row of the input FeatureMap is carried as one DMA block to the target storage space. However, because the input layer carries align64(iline_size) bytes of input FeatureMap data contiguously by DMA to guarantee 64-byte alignment, invalid data at the end of a row may participate in the computation and cause incorrect results. The common terminology in the prior art is as follows:
1. NNA: a hardware accelerator on the SIMD pipeline of the CPU; its operation is controlled by dedicated CPU/SIMD instructions and runs on one thread to resolve most of the convolution multiply-adds.
2. fc layer (fully connected layer): maps the features learned by the preceding network to the sample label space, primarily for classification.
3. feature_shape: the shape of the input data, understood abstractly as multi-dimensional data ordered NDHWC from high to low dimension, left to right; N, the number of FeatureMaps processed at a time, is set to 1; D is the number of groups obtained by splitting the input channels into groups of 32; H is the height of the input data; W is the width of the input data; C is 32.
4. Line alignment: due to hardware limitations, the data size of each transfer must be 64-byte aligned; align64(iline_size) = (iline_size + 63) / 64 × 64.
5. DMA (Direct Memory Access): a DMA transfer copies data from one address space to another, providing high-speed data transfer between a peripheral and memory or between memory and memory.
6. ORAM: general-purpose on-chip RAM; one version is 872 kb in size.
7. MAC: the NNA-based convolution multiply-accumulate computation.
Disclosure of Invention
In order to solve the above problems, the application aims to solve the byte-alignment problem of the input FeatureMap caused by using DMA to carry data, and to guarantee correct DMA transfers under the row-alignment constraint of the fully connected layer. The weights and the input data are computed in row blocks, reducing the waiting time between data transfer and computation.
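The align64 rule defined in terminology item 4 can be written directly as integer arithmetic. This is a direct transcription of the formula (it assumes non-negative byte counts), showing how a row whose tail is shorter than 64 bytes forces extra "redundant waste data" into the transfer:

```python
def align64(iline_size: int) -> int:
    # Item 4: align64(iline_size) = (iline_size + 63) / 64 * 64,
    # i.e. round a byte count up to the next multiple of 64.
    return (iline_size + 63) // 64 * 64

# A 96-byte row is padded to 128 bytes; the extra 32 bytes are the
# redundant data that DMA must still carry for the hardware to run.
print(align64(64), align64(65), align64(96))  # → 64 128 128
```

Note that an already-aligned size (64, 128, ...) is unchanged, so the padding overhead is at most 63 bytes per row.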
Specifically, the invention provides an NNA-based method for processing FC-layer input data with non-contiguous addresses, in which the non-contiguous addresses require row alignment, the address jump between rows must be recalculated, and a fixed number of rows is processed each time so that the input address jumps remain correct; redundant padding data are carried by DMA to guarantee 64-byte alignment of the carried data, where the redundant padding data are the bytes that must still be carried when the tail of a row holds fewer than 64 bytes, because the hardware requires full 64-byte transfers; the carried data are stored in the ORAM space for NNA computation. The method further comprises: S1, initialization: set the FeatureMap height H, width W and input channel count IC to 3, 3 and 256, with feature_shape = [1, 8, 3, 3, 32]; D is the number of groups obtained by splitting the input channels into groups of 32; H is the height of the input data; W is the width of the input data; the input channels are split in groups of 32; S2, compute iline_size: iline_size = align64(W × 32 × bit_width / 8). With non-contiguous addresses, the address jump from row to row must be considered; the offset of each jump is iline_size, the input row size in bytes; due to hardware limitations, the data size of each transfer must be 64-byte aligned; S3, grouped convolution accumulation: S3.1, group the height according to the ORAM sp