CN-122019458-A - Method, device and circuit for filling feature mapping data
Abstract
A method and a device for filling feature mapping data, belonging to the field of artificial intelligence. In order to improve the execution efficiency of neural network algorithms and reduce the memory data transmission bandwidth, the invention provides a data filling method and device. The device comprises a data filling calculation control module that efficiently handles data filling between neural network convolution layers; at the same time, a data filling transmission control module cooperates with a corresponding data transmission module to rapidly carry input data from a memory into an on-chip cache, so that the neural network computation can start. First, the data filling circuit computes a transmission control signal from the network information and sends it to the data transmission module; the data transmission module transfers data from the memory to the on-chip cache according to this control signal; the data filling calculation control module then generates either an on-chip cache control signal or a zero-fill signal, and controls a multiplexer (MUX) to send the corresponding data to the calculation array.
Inventors
- HOU YUMIN
- ZHAO SHUO
- FENG BAO
- FANG JUAN
Assignees
- 北京工业大学 (Beijing University of Technology)
Dates
- Publication Date
- 20260512
- Application Date
- 20251202
Claims (1)
- 1. An apparatus for filling feature map data, characterized in that: a data filling transmission control module cooperates with a corresponding data transmission module to carry input data from a memory to an on-chip cache, so that the neural network computation can start. First, the data filling circuit computes a transmission control signal from the network information and sends it to the data transmission module; the data transmission module transfers data from the memory to the on-chip cache according to this control signal. The data filling calculation control module then computes the corresponding control signal from the network layer information and uses it to control whether the data fed to the calculation array is filling data or on-chip cache data.
  According to the network layer information, the module outputs different control signals; data filling falls into four cases, as shown in fig. 2. If the feature map data is not segmented, all four boundaries of the feature map (top, bottom, left, right) must be filled with zeros, and all other data is output from the on-chip cache. If the feature map data is segmented, the control signals depend on the position of the segment; the feature map is divided into three positions, namely a start row block, middle row blocks and an end row block, with the following filling cases:
  (1) the start row block fills three boundaries (top, left, right);
  (2) a middle row block fills two boundaries (left, right);
  (3) the end row block fills three boundaries (bottom, left, right).
  After the data has been transferred to the on-chip cache, the circuit first judges whether the feature map data of the layer is segmented. If not, all four boundaries are checked, and a zero-value control signal is issued when any of the four boundaries is detected; if the data is segmented, only the boundaries corresponding to the block position are checked, and the zero-value control signal is issued when one of those boundaries is detected.
  Four boundary conditions (top, bottom, left, right) determine where data filling is needed, and each condition is computed from the corresponding network layer information:
  (1) the top boundary is detected from the feature map row-block index row_tile, the convolution stride stride, the kernel row index kr and the padding size pad, by the formula (row_tile · stride + kr) < pad;
  (2) the bottom boundary uses the same information as the top boundary plus the total row count row, by the formula (row_tile · stride + kr) > row + pad − 1;
  (3) the left boundary is detected by the formula (col_tile · stride + kc) < pad; when the formula holds, the position is judged to lie on the left boundary and the zero-value control signal is output;
  (4) the right boundary is analogous to the bottom boundary with rows replaced by the corresponding columns: it uses the same information as the left boundary plus the total column count col, and the zero-value control signal is output when the formula (col_tile · stride + kc) > col + pad − 1 holds.
  In all of the above calculations, if the convolution stride is 1, no multiplication is needed and only basic addition and comparison operations are performed. In every case, the address and enable control signals of the on-chip cache must also be computed, so that the data at the corresponding on-chip cache position is output to the calculation array.
  Before the filling calculation control and boundary detection, the data filling transmission control module must transfer the data from the memory to the on-chip cache according to the network information, so the data filling circuit must also generate the control signals for the data transmission module. First, the network information is read and it is judged whether the feature map data is segmented; if not, all the data is transferred to the on-chip cache; if yes, the specific row block must be further identified, with the following transfer cases:
  (1) the start row block does not transfer the to-be-filled data of its top, left and right boundaries, but must transfer the feature map data at its bottom boundary;
  (2) the end row block does not transfer the to-be-filled data of its bottom, left and right boundaries, but must transfer the feature map data at its top boundary;
  (3) a middle row block does not transfer the to-be-filled data of its left and right boundaries, but must transfer the feature map data at both its top and bottom boundaries.
  After the control signals have been obtained from the segmentation judgment and the row-block judgment, they are sent to the data transmission module; data transmission then starts, and the data at the corresponding memory positions is transferred to the on-chip cache.
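The four boundary formulas in the claim can be collected into a single per-element test. The sketch below is an illustrative software model only, not the patent's hardware circuit; the variable names (row_tile, col_tile, kr, kc, stride, pad) follow the claim, and the stride factor is assumed to appear in all four formulas, consistent with the claim's remark that stride = 1 removes all multiplications:

```python
def needs_zero_fill(row_tile, col_tile, kr, kc, stride, pad, rows, cols):
    """Return True if the element addressed by the row/column block indices
    and kernel offsets falls in a padded border, i.e. the circuit should
    issue the zero-value control signal instead of an on-chip cache read.

    rows, cols are the total row and column counts of the feature map.
    With stride == 1 each test reduces to an addition and a comparison,
    matching the multiplication-free case described in the claim.
    """
    top    = row_tile * stride + kr < pad                # top boundary
    bottom = row_tile * stride + kr > rows + pad - 1     # bottom boundary
    left   = col_tile * stride + kc < pad                # left boundary
    right  = col_tile * stride + kc > cols + pad - 1     # right boundary
    return top or bottom or left or right
```

For example, with a 4×4 feature map, pad = 1 and stride = 1, position (row_tile=0, kr=0) satisfies the top-boundary test, while an interior position such as (row_tile=1, kr=1, col_tile=1, kc=1) satisfies none of the four tests and is served from the on-chip cache.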
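The claim's segmented and unsegmented cases can likewise be tabulated: the borders a row block zero-fills are exactly the borders whose to-be-filled data is never transferred from memory. This is a hypothetical sketch of that decision table, with the block positions named "start", "middle" and "end" as in the claim:

```python
def filled_borders(segmented, position=None):
    """Return the set of borders the circuit zero-fills for a row block.

    Borders in the returned set are generated on chip and not transferred
    from memory; all other boundary rows (the block's overlap with its
    neighbours) must still be fetched, as the claim's transfer cases state.
    """
    if not segmented:
        # Unsegmented feature map: all four borders are zero-filled on chip.
        return {"top", "bottom", "left", "right"}
    table = {
        "start":  {"top", "left", "right"},     # bottom rows fetched
        "middle": {"left", "right"},            # top and bottom rows fetched
        "end":    {"bottom", "left", "right"},  # top rows fetched
    }
    return table[position]
```

A transmission controller would invert this set to decide which boundary rows of each block actually need a memory-to-cache transfer.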
Description
Method, device and circuit for filling feature mapping data

Technical Field

The invention belongs to the field of artificial intelligence.

Background

As artificial intelligence technology penetrates ever deeper into computer vision, natural language processing, automatic driving and other fields, application scenarios demand ever higher model precision and complexity, pushing neural networks to evolve in deeper and wider directions. From the early LeNet-5 (7 layers), to ResNet-50 (50 layers), to today's ultra-deep architectures for large-model training, growing layer counts strengthen feature extraction capability, but they also place higher demands on the computation and data-handling capability of the underlying hardware: not only must core computations such as massive convolution and pooling be supported, but the various accompanying preprocessing operations must also be handled efficiently, and hardware performance bottlenecks are gradually exposed. However, the design focus of current mainstream neural network hardware has long been concentrated on accelerating convolution and pooling. Because these two types of operations are the core of neural network feature extraction, hardware vendors keep improving their throughput through parallel computing units, pipeline optimization and similar means. In practice, however, besides convolution and pooling a network also performs a large number of preprocessing operations such as data normalization, channel dimension adjustment and data filling; although each such operation is individually cheap, it must run at high frequency before every layer's computation.
As the number of network layers grows, the accumulated time and resource cost of preprocessing keeps rising, and preprocessing gradually turns from a secondary step into a key bottleneck that limits overall hardware efficiency. Among the many preprocessing steps, data filling (padding) is a core obstacle to hardware efficiency because of its high frequency and its strong coupling to data transmission. The purpose of data filling is to counteract the shrinkage of the feature map in convolution: as the convolution kernel slides over the input feature map, whenever the input size does not match the kernel size, the side length of the output feature map shrinks with each layer of computation. To keep the input and output feature maps the same size and to ensure that edge features are not lost (an edge pixel participates in only one convolution window while an interior pixel participates in several, so without filling the weight of edge information is too low), a certain number of values (usually zeros) are filled around the edges of the input feature map, enlarging the input to offset the size compression caused by the convolution operation. Current hardware handles data filling mainly in software on a processor, and this scheme has three defects. First, computational efficiency is low: software filling must loop over every pixel of the input feature map, decide position by position whether a zero must be written to memory, is inherently a serial operation, and severely slows down the network. Second, memory bandwidth is severely wasted.
After software filling, the data including the zero values must be carried in full from external memory to the on-chip cache, yet the filled zeros are invalid data that needlessly occupy memory bandwidth. Third, the on-chip cache area grows rapidly: to store the complete feature map including the filling data, the on-chip cache must be sized for the post-filling dimensions, reserving extra capacity that increases the chip area and manufacturing cost of the hardware, raises leakage power as the cache capacity expands, and works against the high energy-efficiency goal of the hardware design. In view of the above, it is desirable to design a dedicated hardware filling control circuit that removes the filling bottleneck through the coordination of data transmission and computation. The core design idea of the circuit is to generate the filling data in real time, so that no invalid data is ever transmitted. (1) Aiming at the problems that the data filling operation is performed by depending on software in the neural network calculation, so that the calculation efficie