
CN-116861143-B - Method for realizing convolution of small input image and small weight

CN116861143B

Abstract

The invention provides a method for realizing convolution of a small input image with a small weight, which comprises the following steps. S1, set the data storage: the feature map is stored in the order 32, W, H, N, where 32 is a slice of the depth, W is the width, H is the height, and N is the number of 32-deep slices, i.e. 32 x N is the depth of the feature map. The weights are stored 32 x 32 contiguous first, then contiguous over the width of the convolution kernel, then over its height, then over the number of input-depth slices (input depth/32), and finally over the number of output-depth slices (output depth/32); before processing, the weights are rearranged into this required order. S2, use a simd instruction to load all data from ddr to fram and wram, 32 data at a time. S3, carry out the convolution calculation. The method realizes the calculation of a small input feature map with a small weight, accelerating the computation and improving efficiency.
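As an illustrative sketch (not taken from the patent text itself), the 32, W, H, N feature-map storage order described in the abstract can be modeled as a linear offset computation; the function name `feature_offset` and its parameter names are assumptions made for this example:

```python
# Illustrative sketch (not from the patent): linear offset of a feature-map
# element under the 32, W, H, N storage order, where the depth is split into
# N groups of 32 and c (0..31) indexes the lane within one group.
def feature_offset(c, w, h, n, W, H):
    """Data is contiguous over the 32 depth lanes first, then over the
    width W, then over the height H, and finally over the N depth groups."""
    return c + 32 * (w + W * (h + H * n))

# Example: a 4x4 feature map with depth 64 (N = 2 groups of 32).
W, H, N = 4, 4, 2
# Element at width 1, height 2, depth group 1, lane 5:
off = feature_offset(c=5, w=1, h=2, n=1, W=W, H=H)  # off == 805
```

The offset grows fastest along the 32 depth lanes, which is what lets the simd instructions of step S2 move 32 data at a time from contiguous memory.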

Inventors

  • TIAN FENGBIN
  • YU XIAOJING

Assignees

  • Beijing Ingenic Semiconductor Co., Ltd. (北京君正集成电路股份有限公司)

Dates

Publication Date
20260508
Application Date
20220328

Claims (7)

  1. A method for implementing a convolution of a small input image with a small weight, the method comprising the steps of:
S1, set the data storage: the feature map is stored in the order 32, W, H, N, where 32 is a slice of the depth, W is the width, H is the height, and N is the number of 32-deep slices, i.e. 32 x N is the depth of the feature map; the data is contiguous over the 32 depth lanes, then over the width, then over the height, and finally over the depth/32 slices. The weights are stored 32 x 32 contiguous first, then contiguous over the width of the convolution kernel, then over the height of the convolution kernel, then over the number of input-depth slices (input depth/32), and finally over the number of output-depth slices (output depth/32); normally the weights are contiguous over the input depth, then over the width and height of the convolution kernel, and finally over the output depth, so before processing they are rearranged and stored in the required order.
S2, use a simd instruction to load all data from ddr to fram and wram, 32 data at a time:
S2.1, use the simd instruction to load all data from ddr to fram, 32 data at a time: load into VR0, VR1 using a simd load data instruction; load the data into fram using the fram load data instruction; because the feature map is stored in the required order and its data fit entirely into fram, it can be stored directly in the default order until all data are stored.
S2.2, use the simd instruction to load all data from ddr to wram, 32 data at a time: load into VR0, VR1 using a simd load data instruction; load the data into wram using the wram load data instruction; because the weights are stored in the required order and their data fit entirely into wram, they can be stored directly in the default order until all data are stored.
S3, carry out the convolution calculation: a start address of fram must be given, which is 0, and the start address of wram is also 0. Let the depth of the input feature map be 32 x in_ic32, where in_ic32 is the number of 32-deep slices of the input depth, the input width be in_width and the input height be in_height; let the depth of the output feature map be 32 x out_ic32, where out_ic32 is the number of 32-deep slices of the output depth, the output width be out_width and the output height be out_height. The convolution kernel has width kernel_w and height kernel_h; the step length of the convolution kernel in the width direction is stride_w and in the height direction stride_h. The relationship between the output and input feature-map widths is out_width = (in_width - kernel_w)/stride_w + 1, and the relationship between the output and input feature-map heights is out_height = (in_height - kernel_h)/stride_h + 1; if the output feature map is required to have the same width and height as the input, the input feature map is padded with 0 at the corresponding width and height positions according to the convolution requirements. The generated results are stored in vrd.
  2. The method for realizing the convolution of the small input image with the small weight according to claim 1, wherein the method is suitable for the case in which the amount of input feature-map data is no more than the fram capacity, the amount of weight data is no more than the wram capacity, the data width is 8 bits, the convolution kernel length and width are not more than 3, and the input depth and the output depth are multiples of 32; if the input depth of some layer in a model is not a multiple of 32, it is padded to a multiple of 32, and the corresponding weight is padded accordingly.
  3. A method of implementing a small input image, small weight convolution according to claim 1, said method comprising the following instructions:
a) Convolution calculation instruction: ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, feature_w, feature_h, vrd); the input variable fram_id is the start address in fram, wram_id is the start address in wram, ic32_num is the number of 32-deep input slices used in the calculation, kernel_w is the width of the convolution kernel, kernel_h is the height of the convolution kernel, stride_x is the step length of the convolution calculation in the x direction, stride_y is the step length of the convolution calculation in the y direction, feature_w is the width of the input feature map, feature_h is the height of the input feature map, and vrd is the result. Description of use: each call calculates 4 pixel results; the calculation unit is a depth of 32 and the generated result has a depth of 32 for 4 pixels. If ic32_num=1, an input depth of 32x1 is calculated and 4 pixels with an output depth of 32 are generated; if ic32_num=2, an input depth of 32x2 is calculated and 4 pixels with an output depth of 32 are generated; if ic32_num=3, an input depth of 32x3 is calculated and 4 pixels with an output depth of 32 are generated. The minimum input depth is 32, the minimum output depth is 32, and the minimum number of output results is 4. fram must be set up first, i.e. the pixels of the input feature map loaded; the parameters belong to the convolution calculation instruction, and the processing width currently set is feature_w.
b) simd load data instruction: set to ingenic_load(indata, VR0, m); the data to be loaded is pointed to by the current data pointer indata; 128 bits of data are loaded from the position m pointed to by indata in memory (16 elements for 8-bit data, 8 elements for 16-bit data, 4 elements for 32-bit data) into the VR register, where m is counted in bytes, i.e. 8 bits per unit; VR0 is a simd VR register storing at most 512 bits of data.
c) fram load data instruction: set to ingenic_vr2fram(VR0, fram_load_id, num); the input variable VR0 is the input data, fram_load_id is the start address loaded in fram, and num is 0 or 1; when num is 0, fram_load_id is unchanged after the instruction ends; when num is 1, fram_load_id = fram_load_id + 32 after the instruction ends.
d) wram load data instruction: set to ingenic_vr2wram(VR0, wram_load_id, num); the input variable VR0 is the input data, wram_load_id is the start address loaded into wram, and num is 0 or 1; when num is 0, wram_load_id is unchanged after the instruction ends; when num is 1, wram_load_id = wram_load_id + 32 after the instruction ends.
  4. The method for realizing the convolution of the small input image with the small weight according to claim 1, wherein in step S2.1 the simd load data instruction is used to load into VR0, VR1:
ingenic_load(indata, VR0, 1)
ingenic_load(indata, VR0, 1)
ingenic_load(indata, VR1, 1)
ingenic_load(indata, VR1, 1)
and the fram load data instruction is used to load the data into fram:
ingenic_vr2fram(VR0, fram_load_id, 1)
ingenic_vr2fram(VR1, fram_load_id, 1)
In step S2.2 the simd load data instruction is used to load into VR0, VR1:
ingenic_load(widthdata, VR0, 1)
ingenic_load(widthdata, VR0, 1)
ingenic_load(widthdata, VR1, 1)
ingenic_load(widthdata, VR1, 1)
and the wram load data instruction is used to load the data into wram:
ingenic_vr2wram(VR0, wram_load_id, 1)
ingenic_vr2wram(VR1, wram_load_id, 1).
  5. The method for realizing the convolution of the small input image with the small weight according to claim 3, wherein in step S3 the convolution results are generated in the following order: the first 32-deep output slice is generated first, then the second, and so on until the last; within each output slice, the first row is generated first, then the second row, up to the last row; within each row, the first 4 pixels are generated first, then the next 4, until the last 4, at which point the output slice is complete.
  6. The method for realizing the convolution of the small input image with the small weight according to claim 5, wherein step S3 is specifically implemented as follows:
S3.1, initialize wram_id = 0;
S3.2, initialize ocnum_i = 0; if ocnum_i < out_ic32 holds, continue execution with ocnum_i++; if not, jump out of this step;
S3.3, initialize ydir_i = 0; if ydir_i < out_height holds, continue execution with ydir_i++; if not, jump out of this step; execute the corresponding fram and wram address updates;
S3.4, initialize xdir_i = 0; if xdir_i < out_width holds, continue execution with xdir_i += 4; if not, jump out of this step; execute the corresponding fram address update; execute ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, in_width, in_height, vrd); execute the corresponding storing of the result.
  7. The method for realizing the convolution of the small input image with the small weight according to claim 1, wherein, when the input feature map is padded with 0 according to the convolution requirements, assuming the convolution kernel is 3, the step size is 1, and the width and height of the output feature map are to equal those of the input feature map, the input feature map needs to be padded with 0; the padding is applied equally around the input feature map, or only one side is padded according to the user's requirement.
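The loop nest of claim 6, together with the output-size relations of claim 1, can be sketched as follows. This is an illustrative model only, not the chip implementation: `ingenic_conv_bit8` is a hardware intrinsic of the T40, so a caller-supplied stand-in is used in its place, and the fram/wram address updates are not modeled.

```python
# Illustrative sketch of the S3 loop nest in claim 6: iterate over out_ic32
# output-depth slices, row by row, 4 pixels per intrinsic call, with the
# hardware intrinsic replaced by a caller-supplied stand-in.
def conv_driver(in_width, in_height, kernel_w, kernel_h,
                stride_w, stride_h, out_ic32, conv_call):
    # Standard no-padding size relations from claim 1:
    out_width = (in_width - kernel_w) // stride_w + 1
    out_height = (in_height - kernel_h) // stride_h + 1
    for ocnum_i in range(out_ic32):                # S3.2: output-depth slices
        for ydir_i in range(out_height):           # S3.3: output rows
            for xdir_i in range(0, out_width, 4):  # S3.4: 4 pixels per call
                conv_call(ocnum_i, ydir_i, xdir_i)
    return out_width, out_height

calls = []
# Example: 8x8 input, 3x3 kernel, stride 1, one 32-deep output slice.
ow, oh = conv_driver(8, 8, 3, 3, 1, 1, 1, lambda *a: calls.append(a))
# ow == oh == 6; the driver issues 2 calls per row (pixels 0-3 and 4-5),
# so 6 rows give 12 intrinsic calls in all.
```

Because each intrinsic call produces exactly 4 pixels of depth 32, the x loop advances in steps of 4, matching the minimum output count stated in claim 3.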

Description

Method for realizing convolution of small input image and small weight

Technical Field

The invention relates to the technical field of image processing, in particular to a method for realizing convolution of a small input image with a small weight.

Background

The T40 chip of Beijing Ingenic Semiconductor Co., Ltd. (the Ingenic T40 chip for short) is a low-power chip for AI deep learning. It has an independent convolution calculation unit and a unique simd instruction set, and it has an oram for general storage, a wram to store weights and a fram to store input data. In this implementation, the data must be stored in wram and fram before the convolution calculation can be performed. oram, wram and fram are of a given size for the chip; for example, the wram size is 288 x 1024 byte, the fram size is 128 x 1024 byte, and the oram size is 2048 x 1024 byte. These assumed sizes are used in the calculations below. All data is stored in ddr, requiring either a dma instruction to carry it to oram, or a simd instruction to load the data into a staging register followed by a special instruction to carry it to wram or fram. Since this is a new chip, conventional algorithms, while possible, are inefficient, and existing methods cannot use its unique computing units and instructions. Input feature maps and weights come in different sizes, the implementation methods differ, and efficiency drops drastically when an unsuitable algorithm is used.

In addition, the common terminology in the prior art is as follows:

1. Convolution kernel: a matrix used for image processing and a parameter operated with the original image. The convolution kernel is typically a matrix of rows and columns (e.g., a 3*3 matrix) with a weight for each element of the region. The matrix shape is typically 1×1, 3×3, 5×5, 7×7, 1×3, 3×1, 2×2, 1×5 or 5×1.

2. Convolution: the center of a convolution kernel is placed on a pixel to be calculated, the product of each element in the kernel and the image pixel value it covers is calculated, and the products are summed; the resulting value is the new pixel value for that location. This process is called convolution.

3. Feature map: the result obtained by convolution calculation of input data is called a feature map, and the result generated by full connection of the data is also called a feature map. The feature map size is generally expressed as length x width x depth, or 1 x depth.

4. FRAM (Feature RAM): a RAM for feature maps, i.e. a memory that stores all or part of the feature maps and supplies them directly to the hardware calculation unit. It belongs to the storage part of the computing unit; to use the computing unit, the feature map data must be placed in the FRAM.

5. WRAM (Weight RAM): a RAM for weights, i.e. a memory that stores all or part of the weights and supplies them directly to the hardware calculation unit. It belongs to the storage part of the computing unit; to use the computing unit, the weight data must be placed in the WRAM.

Disclosure of Invention

In order to solve the problems in the prior art, the application designs a special calculation method for these special conditions, in particular realizing the calculation of a small input feature map with a small weight on the Ingenic T40 chip.
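The convolution step defined in terminology item 2 above can be sketched in a few lines. This is an illustrative example only (plain Python, not the T40 computing unit), and the names `conv_at`, `image` and `kernel` are assumptions made for this sketch:

```python
# Illustrative sketch of the convolution step from the terminology above:
# place the kernel over a pixel position, multiply each kernel element by
# the image pixel it covers, and sum the products to get the new value.
def conv_at(image, kernel, y, x):
    kh, kw = len(kernel), len(kernel[0])
    total = 0
    for i in range(kh):
        for j in range(kw):
            total += kernel[i][j] * image[y + i][x + j]
    return total

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[0, 1],
          [1, 0]]  # a 2x2 kernel, one of the shapes listed above
# Top-left placement: 0*1 + 1*2 + 1*4 + 0*5 = 6
```

Sliding this placement across all valid positions yields the output feature map of terminology item 3.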
Specifically, the invention provides a method for realizing convolution of a small input image with a small weight, which comprises the following steps.

S1, set the data storage: the feature map is stored in the order 32, W, H, N, where 32 is a slice of the depth, W is the width, H is the height, and N is the number of 32-deep slices, i.e. 32 x N is the depth of the feature map; the data is contiguous over the 32 depth lanes, then over the width, then over the height, and finally over the depth/32 slices. The weights are stored 32 x 32 contiguous first, then contiguous over the width of the convolution kernel, then over the height of the convolution kernel, then over the number of input-depth slices (input depth/32), and finally over the number of output-depth slices (output depth/32).

S2, use a simd instruction to load all data from ddr to fram and wram, 32 data at a time.

S2.1, use the simd instruction to load all data from ddr to fram, 32 data at a time: load into VR0, VR1 using a simd load data instruction; load the data into fram using the fram load data instruction; because the feature map is stored in the required order and its data fit entirely into fram, it can be stored directly in the default order until all data are stored.

S2.2, use the simd instruction to load all data from ddr to wram, 32 data at a time: load into VR0, VR1 using a simd load data instruction