
CN-116861144-B - Method for implementing convolution when the WRAM (Weight RAM) can hold all the weights


Abstract

The application provides a method for implementing convolution when the WRAM can hold all the weights. The method comprises: S1, storing the data, by setting the storage layout of the feature map and the storage layout of the weights; S2, loading all the weight data from DDR into the WRAM with SIMD instructions, 32 values at a time, and moving the feature-map data from DDR into the ORAM with the ORAM data-moving instruction; and S3, performing the convolution computation. By designing the FRAM width setting, the ORAM-to-FRAM transfer scheme and a matching convolution computation method, the application accelerates the computation of small input feature maps with small weights and improves efficiency.

Inventors

  • TIAN FENGBIN
  • YU XIAOJING

Assignees

  • 北京君正集成电路股份有限公司 (Beijing Ingenic Semiconductor Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2022-03-28

Claims (7)

  1. A method for implementing convolution when the WRAM can hold all the weights, the method comprising the steps of:
     S1, storing the data. The feature map is stored in the order 32, W, H, N, where 32 is one slice of the depth, W is the width, H is the height, and N is the number of 32-deep slices (depth/32); that is, the data is contiguous first over the 32 depth channels, then over the width, then over the height, and finally over the depth/32 count. The weights are stored so that the input depth of the convolution kernel is contiguous first, then the kernel width, then the kernel height, then the input-depth/32 count, and finally the output-depth/32 count; before processing, the weights are rearranged from the normal layout (input depth contiguous, then kernel width and height, then output depth) into this required order.
     S2, loading all the data from DDR into the WRAM with SIMD instructions, 32 values at a time:
     S2.1, load all the weight data from DDR into the WRAM with SIMD instructions, 32 values at a time, the start address of the weight data being widthdata. Load into VR0 and VR1 with the SIMD load-data instruction, then load into the WRAM with the WRAM load-data instruction. Because the weights are stored in the required order and fit entirely into the WRAM, they can be stored in the default order until all the data has been stored.
     S2.2, move the feature-map data from DDR into the ORAM with the ORAM data-moving instruction, the start address of the feature map in DDR being ddr_id, the byte count of the feature map being count, and the ORAM start address being oram_id: ingenic_ddr2oram(ddr_id, oram_id, count, 1). Because the feature map is stored in the required order and fits entirely into the ORAM, it can likewise be stored in the default order until all the data has been stored. When the ORAM cannot hold the feature map, or fram_w cannot hold the pixels of the minimum computation, this method cannot be used.
     S3, performing the convolution computation:
     S3.1, data is moved from the ORAM into the FRAM and the convolution is computed from there; since the weights are already fully loaded into the WRAM, the number of weight transfers need not be considered.
     S3.2, first load data from the ORAM into the FRAM, then compute the convolution using the FRAM and the WRAM. Let the ORAM start address be 0 and the WRAM start address be 0. The input feature-map depth in_ic32 is a multiple of 32, the input width is in_width and the input height is in_height; the output feature-map depth out_ic32 is a multiple of 32, the output width is out_width and the output height is out_height; the convolution kernel width is kernel_w and its height kernel_h, with steps stride_w and stride_h in the width and height directions. The output width relates to the input width as out_width = (in_width - kernel_w)/stride_w + 1, and likewise out_height = (in_height - kernel_h)/stride_h + 1; if these do not divide evenly, zeros must be padded at the width and height positions as the specific convolution requires. To reduce the number of ORAM-to-FRAM loads, all results in the same depth direction are generated together, so the loop order is: outermost over the output feature-map height, then the output width, then the output depth/32, and innermost the convolution computation unit. Let the number of rows generated per pass be fram_h = fram_count/fram_w; the more rows loaded at once, the fewer repeated loads in the height direction.
  2. The method for implementing convolution when the WRAM can hold all the weights according to claim 1, characterized in that: the feature map fits in the ORAM; the number of weights fits in the WRAM; the data width is 8 bits; the feature-map data required to compute 8 pixels fits entirely in the FRAM; the length and width of the convolution kernel do not exceed 3; and the input depth and output depth must each be a multiple of 32. If the input depth of some layer of the model is not a multiple of 32, it must be padded up to a multiple of 32, and the corresponding weights padded likewise.
  3. The method for implementing convolution when the WRAM can hold all the weights according to claim 1, characterized in that the instructions used in the method are as follows:
     a) Convolution computation instruction: ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, feature_w, feature_h, vrd). The input variable fram_id is the start address used in the FRAM, wram_id the start address used in the WRAM, ic32_num the number of 32-deep input slices computed, kernel_w and kernel_h the width and height of the convolution kernel, stride_x and stride_y the convolution steps in the x and y directions, feature_w and feature_h the width and height of the input feature map, and vrd the result. Each call computes 4 pixel results; the computation unit is a depth of 32 and the generated results are 32 deep. If ic32_num = 1, an input depth of 32x1 is computed and 4 pixels of output depth 32 are generated; if ic32_num = 2, an input depth of 32x2; if ic32_num = 3, an input depth of 32x3. The minimum computed input depth is 32, the minimum output depth is 32, and the minimum number of output pixels is 4. Setting the FRAM width, i.e. the number of input feature-map pixels loaded, belongs to the parameter setting of the convolution computation instruction.
     b) SIMD load-data instruction: ingenic_load(indata, VR0, m). The input indata is a pointer to the data to be loaded; the instruction loads 128 bits of data starting at position m of the pointer in memory (m is counted in bytes, i.e. 8-bit units): 8 values if the data is 16-bit, 4 values if it is 32-bit. The data is loaded into the VR register given as the second argument; a SIMD VR register such as VR0 holds at most 512 bits.
     c) FRAM load-data instruction: ingenic_vr2fram(VR0, fram_load_id, num). The input VR0 is the data, fram_load_id is the start address loaded in the FRAM, and num is 0 or 1: with 0, fram_load_id is unchanged after the instruction ends; with 1, fram_load_id = fram_load_id + 32 after the instruction ends.
     d) WRAM load-data instruction: ingenic_vr2wram(VR0, wram_load_id, num). The input VR0 is the data, wram_load_id is the start address loaded into the WRAM, and num is 0 or 1: with 0, wram_load_id is unchanged after the instruction ends; with 1, wram_load_id = wram_load_id + 64 after the instruction ends.
     f) ORAM data-moving instruction: ingenic_ddr2oram(ddr_id, oram_id, count, num). The inputs ddr_id and oram_id are the start addresses of the data to be loaded in DDR and in the ORAM respectively, count is the byte count, and num is 0 or 1: with 0, ddr_id and oram_id are unchanged after the instruction ends; with 1, count is added to both ddr_id and oram_id after the instruction ends.
  4. The method for implementing convolution when the WRAM can hold all the weights according to claim 3, characterized by the setting of the FRAM width: let the total byte count of the FRAM be fram_count and the FRAM width be fram_w, with fram_h rows of the input feature map processed and loaded each pass. fram_w is set to at least the number of input feature-map pixels required to produce 8 output pixels: the smallest computed result is 4 pixels, but while the first 4 pixels are being generated, the data for the next 4 pixels must already be loading, so the minimum is 8. This gives formula (1) [formula image not reproduced in this text], the whole of which is a multiple of 4, guaranteeing that the FRAM can generate 8 pixels. Since 4 pixels are produced at a time, for convolutions with a kernel larger than 1 the data being loaded and the data in use overlap; to handle this, the extra data is treated in multiples of 4, i.e. fram_w plus (kernel_w - 1) is rounded up, giving formula (2). Formulas (1) and (2) yield formula (3); (3) cannot be merged into a single expression because of the rounding operation, which would make the two sides unequal in some cases. The number of input feature-map rows loaded per pass is fram_h = fram_count/fram_w.
  5. The method for implementing convolution when the WRAM can hold all the weights according to claim 4, characterized in that in S2.1 the data are loaded into VR0 and VR1 with SIMD load-data instructions, denoted: ingenic_load(widthdata, VR0, 0); ingenic_load(widthdata, VR0, 1); ingenic_load(widthdata, VR1, 0); ingenic_load(widthdata, VR1, 1); and the data are then loaded into the WRAM with the WRAM load-data instruction, denoted: ingenic_vr2wram(VR0, wram_load_id, 1); ingenic_vr2wram(VR1, wram_load_id, 1).
  6. The method for implementing convolution when the WRAM can hold all the weights according to claim 5, characterized in that the S3 convolution computation is implemented as follows:
     S3.1, initialization, denoted: wram_id = 0; oram_id = 0; fram_id = 0; wr_fram_id = 0; rd_fram_id = 0;
     S3.2, each pass generates fram_h rows of results. Let ydir_i be the position of the result in the height direction, with initial value ydir_i = 0, and let the initial row count be fram_h_ori, fram_h = fram_h_ori. While ydir_i < out_height holds, this step executes and then ydir_i += fram_h; when it no longer holds, check whether ydir_i < (out_height + fram_h - 1) holds: since out_height need not be an integer multiple of fram_h (when fram_h is greater than 1), a remainder may exist; if the check holds, set fram_h = ydir_i - out_height and execute one more pass, otherwise exit the loop. This is denoted: for(int ydir_i = 0; ydir_i < (out_height + fram_h - 1); ydir_i += fram_h) { if(ydir_i >= out_height) fram_h = ydir_i - out_height; ... }. Data is read continuously from the DDR into the ORAM, fram_h rows per read. The ORAM read address and the FRAM write address are initialized to 0, denoted: int rd_oram_idx = 0; int wr_fram_idx = 0;
     S3.2.1, each step generates 4 pixels in the width direction and fram_h rows in the height direction. With initial xdir_i = 0, the loop condition is xdir_i < out_width + 3: since out_width need not be a multiple of 4, the remainder must also be loaded in units of the minimum computation, which amounts to rounding out_width up by integer division by 4, equivalent to the bound out_width + 3; more memory is reserved for the input feature map than the actual feature-map size, to prevent errors caused by unreadable data. After each step, xdir_i += 4; the loop is denoted: for(int xdir_i = 0; xdir_i < out_width + 3; xdir_i += 4);
     S3.2.1.1, if xdir_i > 1 holds, denoted if(xdir_i > 1), this step executes; otherwise go to S3.2.1.2. Read data from the ORAM into the FRAM, a fixed amount per loop iteration [the amounts are formula images not reproduced in this text]; here the data required by the following convolution computation are loaded, and so that a steady loop can form, the minimum unit of each load is 4 pixels, the first load therefore being larger. Then advance rd_oram_idx and wr_fram_idx accordingly and execute loop body (5);
     S3.2.1.2, if xdir_i == 0 holds, denoted if(xdir_i == 0), this step executes; otherwise go to S3.2.1.3. Read data from the ORAM into the FRAM; the first read is longer than the steady-state read [the lengths are not reproduced in this text]. Then rd_oram_idx and wr_fram_idx keep their values, denoted rd_oram_idx = rd_oram_idx; wr_fram_idx = wr_fram_idx; and loop body (6) is executed. When xdir_i > 1, a fixed amount is added each iteration; when xdir_i = 0, the loaded data width differs, so after loading, the corresponding values must be added to rd_oram_idx and wr_fram_idx;
     S3.2.1.3, execute int wram_id = 0; each pass generates the results of one output-depth/32 slice, and loop body (7) is executed. Loop body (5) loads the steady-state data, loop body (6) loads the data of the first pass, and loop body (7) implements the convolution itself, computing in order all results that share a common weight, i.e. the results along the height, and then the results along the depth/32 direction.
  7. The method for implementing convolution when the WRAM can hold all the weights according to claim 6, characterized in that:
     Loop body (5) is as follows. Step (5)1: set the initial icnum_i = 0, judge whether icnum_i < in_ic32 holds, continue the loop step if so, icnum_i++ after each iteration, judge again, and loop in order; denoted for(int icnum_i = 0; icnum_i < in_ic32; icnum_i++). Step (5)2: set the initial fh_i = 0, judge whether fh_i < fram_h holds, continue the loop step if so, fh_i++ after each iteration, judge again and loop in order; if not, exit this loop and return to step (5)1; denoted for(int fh_i = 0; fh_i < fram_h; fh_i++). Step (5)3: set the initial fw_i = 0 and judge the loop condition [the bound is a formula image not reproduced in this text]; if it holds, continue the loop step, fw_i += 4 after each iteration, and judge again; if not, exit this step and return to step (5)2. In the loop body, load into VR0 and VR1 with SIMD load-data instructions, denoted: ingenic_load(rd_oram, VR0, 0); ingenic_load(rd_oram, VR0, 1); ingenic_load(rd_oram, VR1, 0); ingenic_load(rd_oram, VR1, 1); then load the data into the FRAM with the FRAM load-data instruction, denoted: ingenic_vr2fram(VR0, wr_fram, 1); ingenic_vr2fram(VR1, wr_fram, 1).
     Loop body (6) is as follows. Step (6)1: set the initial icnum_i = 0, judge whether icnum_i < in_ic32 holds; if not, exit loop body (6); if so, continue the loop step, icnum_i++ after each iteration, judge again and loop in order, otherwise exit and execute the following steps; denoted for(int icnum_i = 0; icnum_i < in_ic32; icnum_i++). Step (6)2: set the initial fh_i = 0, judge whether fh_i < fram_h holds; if not, exit step (6)2; if so, continue the loop step, fh_i++ after each iteration, judge again and loop in order, otherwise exit and execute the following steps; denoted for(int fh_i = 0; fh_i < fram_h; fh_i++). Step (6)3: set the initial fw_i = 0 and judge the loop condition [the bound is not reproduced in this text]; if not, exit step (6)3; if so, continue the loop step, fw_i += 4 after each iteration, judge again and loop in order, otherwise exit and execute the following: load into VR0 and VR1 with SIMD load-data instructions, denoted: ingenic_load(rd_oram, VR0, 0); ingenic_load(rd_oram, VR0, 1); ingenic_load(rd_oram, VR1, 0); ingenic_load(rd_oram, VR1, 1); then load the data into the FRAM with the FRAM load-data instruction, denoted: ingenic_vr2fram(VR0, wr_fram, 1); ingenic_vr2fram(VR1, wr_fram, 1).
     Loop body (7) is as follows. Step (7)1: set the initial ocnum_i = 0, judge whether ocnum_i < out_ic32 holds; if not, exit step (7)1; if so, continue the loop step, ocnum_i++ after each iteration, judge again and loop in order, otherwise exit and execute the following steps; denoted for(int ocnum_i = 0; ocnum_i < out_ic32; ocnum_i++). Each pass generates the results of one output-depth/32 slice, with fram_id = 0. Step (7)2: set the initial fh_i = 0, judge whether fh_i < fram_h holds; if not, exit step (7)2; if so, continue the loop step, fh_i++ after each iteration, judge again and loop in order, otherwise exit; denoted for(int fh_i = 0; fh_i < fram_h; fh_i++). In the loop body, execute ingenic_conv_bit8(fram_id, wram_id, ic32_num, kernel_w, kernel_h, stride_x, stride_y, vrd); the generated result is then taken out for subsequent processing and storage.

Description

Method for implementing convolution when the WRAM (Weight RAM) can hold all the weights

Technical Field
The invention relates to the technical field of image processing, in particular to a method for implementing convolution when the WRAM can hold all the weights.

Background
The T40 chip of Beijing Ingenic Semiconductor Co., Ltd. (hereinafter the Ingenic T40 chip) is a low-power chip for AI deep learning. It has an independent convolution computing unit and a unique SIMD instruction set. There is one ORAM memory, one WRAM for storing weights and one FRAM for storing input data; data must therefore be stored in the WRAM and FRAM before the convolution computation can be performed. ORAM, WRAM and FRAM each have a fixed size on the chip [the concrete example sizes in the source are formula images not reproduced in this text]; those assumed values are used in the calculations below. All data reside in DDR, from which a DMA instruction moves them into the ORAM, or a SIMD instruction loads them into a dedicated register, from which special instructions move them into the WRAM or FRAM. Since this is a new chip, conventional algorithms, while possible, are inefficient, and existing methods cannot use its unique computing units and instructions. Input feature maps and weights come in different sizes, the implementation methods differ, and using an unsuitable algorithm drastically reduces efficiency.

In addition, the common terminology of the prior art is as follows:
1. Convolution kernel: a matrix used in image processing, the parameter convolved with the original image. The convolution kernel is typically a small matrix, each cell of which carries a weight value. Typical shapes are 1x1, 3x3, 5x5, 7x7, 1x3, 3x1, 2x2, 1x5 and 5x1.
2. Convolution: the center of a convolution kernel is placed on a pixel to be computed, the products of each element of the kernel with the image pixel value it covers are computed and summed, and the resulting value is the new pixel value at that location; this process is called convolution.
3. Feature map: the result obtained by convolving input data is called a feature map; the result produced by a fully connected layer is also called a feature map. A feature map size is generally expressed as length x width x depth, or 1 x depth.
4. FRAM (Feature RAM): a RAM for feature maps, a memory storing all or part of a feature map and feeding it directly to the hardware computation unit; it belongs to the storage part of the computing unit. To use the computing unit, the feature-map data must be placed in the FRAM.
5. WRAM (Weight RAM): a memory storing all or part of the weights, fed directly to the hardware computation unit; it belongs to the storage part of the computing unit. To use the computing unit, the weight data must be placed in the WRAM.
6. ORAM (Oblivious RAM): a random access memory providing fast read and write access to Dynamic Random Access Memory (DRAM); for any inputs X and Y, the sequences of accesses they produce are identical in probability distribution.

Disclosure of the Invention
In order to solve the problems in the prior art, the application designs a special computation method for this special situation, in particular implementing the computation of small input feature maps with small weights on the Ingenic T40 chip.
Specifically, the invention provides a method for implementing convolution when the WRAM can hold all the weights, comprising the following steps:
S1, storing the data. The feature map is stored in the order 32, W, H, N, where 32 is one slice of the depth, W is the width, H is the height, and N is the number of 32-deep slices (depth/32); that is, the data is contiguous first over the 32 depth channels, then over the width, then over the height, and finally over the depth/32 count. The weights are stored so that the input depth of the convolution kernel is contiguous first, then the kernel width, then the kernel height, then the input-depth/32 count, and finally the output-depth/32 count; before processing, the weights are rearranged from the normal layout into this required order.
S2, loading all the data from DDR into the WRAM with SIMD instructions, 32 values at a time:
S2.1, load all the weight data from DDR into the WRAM with SIMD instructions, 32 values at a time, the start address of the weight data being widthdata. Loading into VR0,