CN-121982487-A - YOLO v3 post-processing hardware acceleration method based on table lookup method
Abstract
The invention relates to a YOLO v3 post-processing hardware acceleration method based on a table lookup method, comprising the steps of: 1) establishing two lookup tables for the sigmoid and exp functions involved in the decoding calculation process; 2) analysing the required storage space according to the quantization bit number and the network structure; 3) allocating corresponding BRAM storage blocks to the lookup tables according to the required storage space; 4) computing, for each value in the set range of the data to be decoded, the results of the two functions and putting them into the lookup tables; 5) inputting the data to be decoded and arranging it in a cache unit; 6) reading the data to be decoded and looking up the corresponding calculation data in the lookup tables; 7) inputting the calculation data into a parallel calculation unit to obtain the decoded data; and 8) writing the decoded data back to its original position in the cache unit. By establishing the lookup tables and using them to perform the decoding calculation of the YOLO v3 post-processing on the data to be decoded, the method completes the post-processing calculation with fewer storage resources, significantly reduces the hardware scale, and improves post-processing efficiency.
Inventors
- Gai Yifan
- Zhou Hui
- Zhao Xiongbo
- Li Xiaomin
- Xie Yujia
- Wang Xiaofeng
- Dong Wenjie
- Wu Songling
Assignees
- 北京航天自动控制研究所 (Beijing Aerospace Automatic Control Institute)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-12-31
Claims (5)
- 1. A YOLO v3 post-processing hardware acceleration method based on a table lookup method, characterized by comprising the following steps: S1, establishing the lookup tables; S1.1, establishing two lookup tables according to the two operators involved in the decoding calculation process, namely the sigmoid function and the exp exponential function; S1.2, determining the storage space required by the two lookup tables according to the set YOLO v3 network weight quantization bit number in combination with the YOLO v3 network structure; S1.3, allocating corresponding BRAM storage blocks to each lookup table according to the storage space required by the two lookup tables; S1.4, for each value a within the set value range of the data to be decoded, carrying out the calculation by substituting a into formula (1) and formula (2) and putting the calculated data into the two lookup tables respectively, wherein formula (1) and formula (2) are as follows: sigmoid(a/2^scale) (1); exp(a/2^scale) (2); where scale is the actual quantization scaling factor and takes a positive integer; S1.5, the lookup tables are built; S2, performing the decoding calculation on the data to be decoded; S2.1, inputting the data to be decoded and reading the data to be decoded into a cache unit in an FPGA chip; S2.2, reading the data to be decoded from the cache unit and looking up the corresponding calculation data in the two lookup tables according to the data to be decoded; S2.3, inputting the calculation data output by the lookup tables into a parallel calculation unit and obtaining the decoded data after calculation; S2.4, writing the decoded-data calculation result of each data to be decoded back into the cache unit. (Illustrative sketches of the storage sizing and table construction in S1.2-S1.4 are given after this claims list.)
- 2. The YOLO v3 post-processing hardware acceleration method based on the table lookup method according to claim 1, wherein in S1.4, a is floating point type data with a value range of −127 to 128.
- 3. The YOLO v3 post-processing hardware acceleration method based on the table lookup method according to claim 1, wherein in step S2.1 the data to be decoded are set to three layers of data, the cache unit comprises three data emission layers DIN0, DIN1 and DIN2, and each data emission layer comprises an input channel direction dim0, a width direction dim1 and a height direction dim2; the data to be decoded are read into the cache unit in an arrangement in which a data emission layer is selected in the priority order DIN0, DIN1, DIN2, and the three layers of the data to be decoded are arranged sequentially along the dim0, dim1 and dim2 directions of the selected data emission layer.
- 4. The YOLO v3 post-processing hardware acceleration method based on the table lookup method according to claim 1, wherein in S2.3 the parallel calculation unit performs parallel calculation according to formula (3), formula (4) and formula (5), which are respectively as follows: a_out-1 = (sigmoid(a) + b) × 2^n (3); a_out-2 = (exp(a) × b) × 2^n (4); a_out-3 = sigmoid(a) (5); where a is the data to be decoded, a_out-1, a_out-2 and a_out-3 are the decoded data, b is a floating point constant, and n is the number of down-sampling operations (see the decode sketch after this claims list).
- 5. The YOLO v3 post-processing hardware acceleration method based on the table lookup method according to claim 1, wherein the cache unit position in which the decoded data are stored in step S2.4 is the position in which the corresponding data to be decoded were stored in step S2.1.
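To make the resource analysis of S1.2 and S1.3 in claim 1 concrete, the following is a minimal sketch of estimating the lookup-table storage and the number of BRAM blocks. The 256-entry table (one entry per value in the −127 to 128 range of claim 2), the 16-bit entry width, and the 18 Kb BRAM block size are illustrative assumptions, not figures stated in the patent.

```python
import math

def lut_storage_estimate(num_entries=256, entry_bits=16, bram_kbits=18):
    """Estimate the storage for one lookup table and the BRAM blocks it needs.

    num_entries: assumed one entry per quantized input code (256 values).
    entry_bits:  assumed fixed-point width of each stored function value.
    bram_kbits:  assumed capacity of one BRAM block (18 Kb is common on many
                 FPGAs; the patent does not fix this number).
    """
    total_bits = num_entries * entry_bits
    bram_blocks = math.ceil(total_bits / (bram_kbits * 1024))
    return total_bits, bram_blocks

# Two tables are needed, one for sigmoid and one for exp (claim 1, S1.1).
bits, blocks = lut_storage_estimate()
print(f"one table: {bits} bits in {blocks} BRAM block(s); two tables need {2 * blocks}")
```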
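A minimal sketch of the table construction in S1.4, assuming one table entry per integer code in the −127 to 128 range of claim 2 and a hypothetical quantization scaling factor scale = 4; entries are kept as Python floats here, whereas the FPGA tables would store fixed-point words.

```python
import math

def build_lookup_tables(scale=4, lo=-127, hi=128):
    """Precompute sigmoid(a / 2**scale) and exp(a / 2**scale) for every
    quantized code a in [lo, hi], i.e. formulas (1) and (2) of claim 1."""
    sigmoid_lut, exp_lut = {}, {}
    for a in range(lo, hi + 1):
        x = a / (2 ** scale)                      # de-quantize the integer code
        sigmoid_lut[a] = 1.0 / (1.0 + math.exp(-x))
        exp_lut[a] = math.exp(x)
    return sigmoid_lut, exp_lut

sigmoid_lut, exp_lut = build_lookup_tables()
print(sigmoid_lut[0], exp_lut[0])                 # 0.5 and 1.0 at a = 0
```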
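The next sketch applies formulas (3), (4) and (5) of claim 4 to values read from a cache and, per claim 5, writes each result back to the position its input came from. The constant b, the down-sampling exponent n, the scale value and the three-entry cache are hypothetical placeholders for illustration.

```python
import math

# Stand-in tables built as in the previous sketch (scale = 4 assumed).
SCALE = 4
SIG = {a: 1.0 / (1.0 + math.exp(-a / 2 ** SCALE)) for a in range(-127, 129)}
EXP = {a: math.exp(a / 2 ** SCALE) for a in range(-127, 129)}

def decode_element(a, b, n):
    """Formulas (3)-(5) of claim 4 applied to one quantized value a."""
    a_out_1 = (SIG[a] + b) * (2 ** n)   # formula (3)
    a_out_2 = (EXP[a] * b) * (2 ** n)   # formula (4)
    a_out_3 = SIG[a]                    # formula (5)
    return a_out_1, a_out_2, a_out_3

# Claim 5: each decoded result is written back to the cache position the
# corresponding input value was read from (hypothetical three-entry cache).
cache = [-8, 0, 16]
b, n = 0.5, 5                           # hypothetical constant and down-sampling exponent
for pos, a in enumerate(cache):
    cache[pos] = decode_element(a, b, n)
print(cache)
```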
Description
YOLO v3 post-processing hardware acceleration method based on table lookup method

Technical Field

The invention relates to the field of special-purpose algorithm hardware circuit design, in particular to a method for accelerating YOLO v3 post-processing hardware based on a table lookup method.

Background

In recent years, target detection technology based on deep learning has developed rapidly, and its detection precision is remarkably superior to that of traditional algorithms, so it has been widely applied in many fields. Detection methods represented by YOLO v3 have received extensive attention from academia and industry. The YOLO series of algorithms treats the target recognition task as a regression problem, so the generalized features of targets can be learned more easily, and the speed problem of target detection is effectively solved. The YOLO v3 network model adopts full convolution layers, introduces a residual model to reduce the risk of gradient explosion, and realizes multi-scale detection by means of an image pyramid model, further improving detection performance for weak and small targets.

Because the convolutional neural network computing mode is parallel while a CPU is a serial scalar computing engine, the computing power of a CPU cannot meet the computing requirements of model training and inference, so a special-purpose hardware architecture is usually required to improve computing efficiency. When the YOLO v3 model is actually deployed, the computation-intensive backbone network part is usually implemented by a special-purpose AI acceleration chip or an FPGA-based acceleration circuit to improve end-side detection efficiency. However, the model post-processing part contains floating point operators such as the exp exponential function and the sigmoid function and is generally still processed on a general-purpose CPU, so its calculation efficiency is low and it seriously affects the real-time performance of the overall model processing. In the prior art, a special multiplier acceleration circuit is generally constructed for such operators; this increases the complexity of the hardware circuit design, reduces the system clock performance, and increases the overall system power consumption.

Disclosure of the Invention

The invention aims to overcome the above defects in the prior art and provide a YOLO v3 post-processing hardware acceleration method based on a table lookup method, in which lookup tables are established to perform the decoding calculation on the data to be decoded in the YOLO v3 post-processing. Compared with the traditional method using a multiplier acceleration circuit, the method can complete the post-processing calculation while using fewer storage resources, thereby remarkably reducing the hardware scale and improving post-processing efficiency.

The aim of the invention is realized by the following technical scheme.

A method for accelerating YOLO v3 post-processing hardware based on a table lookup method comprises the following steps:

S1, establishing the lookup tables.

S1.1, two lookup tables are established according to the two operators involved in the decoding calculation process, namely the sigmoid function and the exp exponential function.
S1.2, the storage space required by the two lookup tables is determined according to the set YOLO v3 network weight quantization bit number in combination with the YOLO v3 network structure, and the storage space can be expanded according to the size requirement of the lookup tables.

S1.3, corresponding BRAM storage blocks are allocated to each lookup table according to the storage space required by the two lookup tables.

S1.4, for each value a within the set value range of the data to be decoded, the calculation is carried out by substituting a into formula (1) and formula (2), and the calculated data are put into the two lookup tables respectively, where formula (1) and formula (2) are as follows:

sigmoid(a/2^scale) (1)

exp(a/2^scale) (2)

where scale is the actual quantization scaling factor and takes a positive integer. Further, in S1.4, a is floating point type data with a value range of −127 to 128.

S1.5, the lookup tables are built.

S2, the decoding calculation is performed on the data to be decoded.

S2.1, the data to be decoded are input and read into a cache unit in the FPGA chip. In S2.1, the data to be decoded are set to three layers of data, the cache unit comprises three data emission layers DIN0, DIN1 and DIN2, and each data emission layer comprises an input channel direction dim0, a width direction dim1 and a height direction dim2. The data to be decoded are read into the cache unit in an arrangement in which a data emission layer is selected in the priority order DIN0, DIN1, DIN2, and the three layers of the data to be decoded are arranged sequentially along the dim0, dim1 and dim2 directions of the selected data emission layer.
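The following is a minimal sketch of the cache arrangement described above, under illustrative assumptions: three hypothetical 4×4×3 layers are written into emission layers DIN0, DIN1 and DIN2 in that priority order, and each layer is flattened so that dim0 (input channel) varies fastest, then dim1 (width), then dim2 (height). The layer shape and the dim0-fastest reading of the ordering are assumptions, not dimensions fixed by the patent.

```python
import numpy as np

def arrange_into_cache(layers, c=3, w=4, h=4):
    """Place three feature-map layers into emission layers DIN0..DIN2.

    Each layer arrives as an (h, w, c) array and is flattened in C order,
    so the channel index (dim0) varies fastest, then width (dim1),
    then height (dim2)."""
    cache = {}
    for i, layer in enumerate(layers):            # DIN0, DIN1, DIN2 priority order
        cache[f"DIN{i}"] = layer.reshape(h, w, c).ravel()
    return cache

# Hypothetical three layers of quantized data to be decoded.
layers = [np.arange(3 * 4 * 4, dtype=np.int8).reshape(4, 4, 3) for _ in range(3)]
cache = arrange_into_cache(layers)
print(cache["DIN0"][:6])   # channels of the first two width positions at height 0
```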