CN-121597137-B - Reverse weight gradient calculation data multiplexing method and device based on many-core processor
Abstract
The invention discloses a reverse weight gradient calculation data multiplexing method and device based on a many-core processor. The method first obtains the size of the convolution layer from the deep learning model and the size information of the input data set, and obtains the hardware information of the many-core processor; it then designs three candidate schemes, determining for each scheme the block sizes of the convolution kernel height and width, of the number of input channels, and of the number of output channels on each compute core, as well as the loop order of the computation; it then calculates the data access volume between the on-chip local memory and the global memory for each candidate scheme; and it finally selects the candidate scheme with the smallest data access volume. The invention solves the problem of how to design an acceleration strategy for different convolution layer sizes when developing a high-performance acceleration library for reverse convolution kernel gradient computation, and improves computational performance by reducing the data access volume between on-chip local memory and global memory.
Inventors
- CHEN MINZHEN
- ZHANG WUYUE
- YOU CHUNBO
- JIANG YUWEI
- WANG FEN
- GU CHENG
- ZHOU CHAO
Assignees
- 之江实验室
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-30
Claims (9)
- 1. A reverse weight gradient calculation data multiplexing method based on a many-core processor, characterized by comprising the following steps: (1) obtaining the size of the convolution layer to be calculated according to the deep learning model and the size information of the input data set, and obtaining the hardware information of the many-core processor; (2) setting the block sizes bR and bS of the convolution kernel height R and the convolution kernel width S on a single compute core to obtain three candidate schemes, wherein candidate scheme 1 takes bR=bS=1, candidate scheme 2 takes bR=1 and bS=S, and candidate scheme 3 takes bR=R and bS=S; determining, according to the hardware information of the many-core processor, the block sizes bC and bM of the input channel number C and the output channel number M on a single compute core for each candidate scheme; determining the total number bC_core of input channels and the total number bM_core of output channels allocated to each compute core; and determining the loop order of the computation for each of the three candidate schemes; (3) based on the loop orders of the three candidate schemes and an analysis of the data multiplexing patterns, calculating the data access volume between the on-chip local memory and the global memory for each candidate scheme by counting the number of repeated accesses of each element of the convolution layer and combining the convolution layer size with bC, bR, bS and bM; the data access volume memory_access_dw related to the convolution kernel weight gradient dw is: memory_access_dw = C×R×S×M×sizeof(dtype(dw)), where sizeof() denotes the number of bytes occupied by the data format in memory; the data access volume memory_access_x_i of the input feature map x read from global memory and the data access volume memory_access_dy_i of the feature map gradient dy from the next layer read from global memory are, for each candidate scheme: candidate scheme 1: memory_access_x_1 = R×S×ceil(M/bM_1)×N×H×W×C×sizeof(dtype(x)), memory_access_dy_1 = R×S×ceil(C/bC_1)×N×E×F×M×sizeof(dtype(dy)); candidate scheme 2: memory_access_x_2 = R×ceil(M/bM_2)×N×H×W×C×sizeof(dtype(x)), memory_access_dy_2 = R×ceil(C/bC_2)×N×E×F×M×sizeof(dtype(dy)); candidate scheme 3: memory_access_x_3 = ceil(M/bM_3)×N×H×W×C×sizeof(dtype(x)), memory_access_dy_3 = ceil(C/bC_3)×N×E×F×M×sizeof(dtype(dy)); where bC_1, bC_2, bC_3 denote the number of input channels computed at a time on each compute core in candidate schemes 1, 2, 3, respectively, bM_1, bM_2, bM_3 denote the number of output channels computed at a time on each compute core in candidate schemes 1, 2, 3, respectively, N denotes the batch size, H denotes the height of the input feature map x, W denotes the width of the input feature map x, E denotes the height of the output feature map y, F denotes the width of the output feature map y, M denotes the number of channels of the output feature map y, C denotes the number of input channels, and ceil() denotes the round-up (ceiling) function; (4) selecting the candidate scheme with the smallest data access volume (an illustrative evaluation of these formulas is sketched after the claims).
- 2. The method of claim 1, wherein in step (1) the convolution layer size includes the size of the input feature map, the size of the weights, and the size of the feature map gradient from the next layer; the size of the input feature map includes the batch size, the number of input channels, and the height and width of the feature map; and the hardware information of the many-core processor includes the per-column accumulator buffer depth P of the systolic array, the number of columns sa_co of the systolic array, and the number of rows RO and the number of columns CO of the compute core array of the many-core processor.
- 3. The reverse weight gradient calculation data multiplexing method based on a many-core processor of claim 2, wherein in step (2) the block sizes bC and bM of the input channel number C and the output channel number M on a single compute core in each candidate scheme are determined according to the block sizes bS and bR of that scheme, the accumulator buffer depth P, and the number of columns sa_co of the systolic array, wherein bC is no more than the largest power of 2 that does not exceed P divided by the product of bS and bR; and the total number bC_core of input channels and the total number bM_core of output channels allocated to each compute core are determined, specifically, bC_core takes the larger of bC and the up-rounded quotient of C divided by RO, and bM_core takes the larger of bM and the up-rounded quotient of M divided by CO (an illustrative computation of these block sizes is sketched after the claims).
- 4. The reverse weight gradient calculation data multiplexing method based on a many-core processor of claim 1, wherein the loop order of the computation for candidate scheme 1 is determined as follows (an illustrative loop-nest sketch is given after the claims): (a.1) each compute core traverses the width F of the output gradient map, computing an intermediate width gradient block of size bC×bM at each of the F iterations, and accumulating all intermediate width gradient blocks point to point after the F iterations to obtain a first calculation result; (a.2) traversing the height E of the output gradient map, computing an intermediate height gradient block of size bC×bM at each of the E iterations, and accumulating all intermediate height gradient blocks point to point after the E iterations to obtain a second calculation result; (a.3) looping over the input channel dimension in blocks of size bC to obtain a gradient block of size bC_core×bM, which is a third calculation result; (a.4) looping over the output channel dimension in blocks of size bM to obtain a gradient block of size bC_core×bM_core, which is a fourth calculation result; (a.5) traversing the kernel width S over the fourth calculation result to obtain a gradient block of size bC_core×S×bM_core, which is a fifth calculation result; (a.6) traversing the kernel height R over the fifth calculation result to obtain a gradient block of size bC_core×R×S×bM_core, which is a sixth calculation result; (a.7) obtaining the gradient dw of the convolution kernel weights with size C×R×S×M from the sixth calculation results computed in parallel by all compute cores.
- 5. The reverse weight gradient calculation data multiplexing method based on a many-core processor of claim 1, wherein the loop order of the computation for candidate scheme 2 is determined as follows: (b.1) each compute core traverses the width F of the output gradient map, computing an intermediate width gradient block of size bC×S×bM at each of the F iterations, and accumulating all intermediate width gradient blocks point to point after the F iterations to obtain a first calculation result; (b.2) traversing the height E of the output gradient map, computing an intermediate height gradient block of size bC×S×bM at each of the E iterations, and accumulating all intermediate height gradient blocks point to point after the E iterations to obtain a second calculation result; (b.3) looping over the input channel dimension in blocks of size bC to obtain a third calculation result of size bC_core×S×bM; (b.4) looping over the output channel dimension in blocks of size bM to obtain a fourth calculation result of size bC_core×S×bM_core; (b.5) traversing the kernel height R over the fourth calculation result to obtain a fifth calculation result of size bC_core×R×S×bM_core; (b.6) obtaining the gradient dw of the convolution kernel weights with size C×R×S×M from the fifth calculation results computed in parallel by all compute cores.
- 6. The reverse weight gradient calculation data multiplexing method based on a many-core processor of claim 1, wherein the loop order of the computation for candidate scheme 3 is determined as follows: (c.1) each compute core traverses the width F of the output gradient map, computing an intermediate width gradient block of size bC×R×S×bM at each of the F iterations, and accumulating all intermediate width gradient blocks point to point after the F iterations to obtain a first calculation result; (c.2) traversing the height E of the output gradient map, computing an intermediate height gradient block of size bC×R×S×bM at each of the E iterations, and accumulating all intermediate height gradient blocks point to point after the E iterations to obtain a second calculation result; (c.3) looping over the input channel dimension in blocks of size bC to obtain a third calculation result of size bC_core×R×S×bM; (c.4) looping over the output channel dimension in blocks of size bM to obtain a fourth calculation result of size bC_core×R×S×bM_core; and (c.5) obtaining the gradient dw of the convolution kernel weights with size C×R×S×M from the fourth calculation results computed in parallel by all compute cores.
- 7. The reverse weight gradient calculation data multiplexing method based on a many-core processor of claim 1, wherein step (4) specifically selects the candidate scheme with the smallest total data access volume according to the total data access volume of each candidate scheme computed in step (3), the total data access volume comprising the data access volume of the input feature map x read from global memory, the data access volume of the feature map gradient dy from the next layer read from global memory, and the data access volume related to dw.
- 8. A reverse weight gradient calculation data multiplexing device based on a many-core processor, comprising a memory and one or more processors, wherein the memory stores executable code, and the one or more processors, when executing the executable code, implement the reverse weight gradient calculation data multiplexing method based on a many-core processor of any one of claims 1 to 7.
- 9. A computer-readable storage medium having a program stored thereon which, when executed by a processor, implements the reverse weight gradient calculation data multiplexing method based on a many-core processor of any one of claims 1 to 7.
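The sketches below are illustrative only and are not part of the claims; they restate the claimed computations in Python under explicitly labelled assumptions. This first sketch evaluates the traffic formulas of step (3) of claim 1 and applies the selection rule of step (4) and claim 7. The function names, the argument packing, and the dtype byte table are assumptions made here, not definitions from the patent.

```python
import math

BYTES = {"fp32": 4, "fp16": 2, "bf16": 2}  # assumed dtype sizes

def memory_access(scheme, N, C, H, W, M, E, F, R, S, bC, bM,
                  dtype_x="fp32", dtype_dy="fp32", dtype_dw="fp32"):
    """Estimate global-memory traffic (bytes) for candidate scheme 1, 2 or 3."""
    # Re-read factors for x and dy, per the formulas in claim 1.
    if scheme == 1:      # bR = bS = 1
        fx, fdy = R * S * math.ceil(M / bM), R * S * math.ceil(C / bC)
    elif scheme == 2:    # bR = 1, bS = S
        fx, fdy = R * math.ceil(M / bM), R * math.ceil(C / bC)
    else:                # scheme 3: bR = R, bS = S
        fx, fdy = math.ceil(M / bM), math.ceil(C / bC)
    mem_x = fx * N * H * W * C * BYTES[dtype_x]
    mem_dy = fdy * N * E * F * M * BYTES[dtype_dy]
    mem_dw = C * R * S * M * BYTES[dtype_dw]   # weight gradient, written once
    return mem_x + mem_dy + mem_dw

def pick_scheme(sizes, blocks):
    """sizes = (N, C, H, W, M, E, F, R, S); blocks maps scheme -> (bC_i, bM_i).
    Returns the scheme with the smallest estimated traffic and all totals."""
    totals = {s: memory_access(s, *sizes, *blocks[s]) for s in (1, 2, 3)}
    return min(totals, key=totals.get), totals
```

The factors R and R×S in schemes 2 and 1 reflect how many times x and dy must be re-read when the kernel window is not kept resident on the core; for a kernel with R, S greater than 1 and identical block sizes, scheme 3 therefore never reads more than the others, although its larger per-core tile may force smaller bC and bM in practice.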
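This second sketch follows claim 3 as read here: bC is capped by the largest power of two not exceeding P/(bS×bR), and the per-core channel totals are the larger of the array-wise split and the block size. The rule bM ≤ sa_co is an assumption made for illustration, since the claim ties bM to sa_co without stating the exact bound.

```python
import math

def largest_pow2_leq(n: int) -> int:
    """Largest power of two not exceeding n (n >= 1)."""
    return 1 << (n.bit_length() - 1)

def block_sizes(C, M, R, S, scheme, P, sa_co, RO, CO):
    """Per-core block sizes for one candidate scheme (claim 3, as interpreted here).
    P: accumulator buffer depth per systolic-array column; sa_co: SA columns;
    RO, CO: rows and columns of the compute core array."""
    bR, bS = {1: (1, 1), 2: (1, S), 3: (R, S)}[scheme]
    bC = min(C, largest_pow2_leq(max(1, P // (bR * bS))))  # bC <= pow2 <= P/(bR*bS)
    bM = min(M, sa_co)                     # assumption: one output channel per SA column
    bC_core = max(math.ceil(C / RO), bC)   # input channels handled by one core
    bM_core = max(math.ceil(M / CO), bM)   # output channels handled by one core
    return bR, bS, bC, bM, bC_core, bM_core
```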
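This third sketch restates the per-core loop order of claim 4 (candidate scheme 1) with NumPy, assuming unit stride, no padding, and the (N, C, H, W) / (N, E, F, M) layouts chosen here for readability. Candidate scheme 2 would hoist the s loop into the accumulated tile (bC×S×bM), and candidate scheme 3 would hoist both r and s (bC×R×S×bM), which is exactly what removes the S and R×S re-read factors in claim 1.

```python
import numpy as np

def dw_candidate1_single_core(x, dy, bC, bM):
    """Per-core loop nest in the order of claim 4 (candidate scheme 1).
    x:  (N, C_core, H, W)  slice of the input feature map assigned to this core
    dy: (N, E, F, M_core)  slice of the next-layer gradient assigned to this core
    Returns the dw block of shape (C_core, R, S, M_core)."""
    N, C_core, H, W = x.shape
    _, E, F, M_core = dy.shape
    R, S = H - E + 1, W - F + 1                 # unit stride, no padding (assumed)
    dw = np.zeros((C_core, R, S, M_core), dtype=np.float32)
    for r in range(R):                          # (a.6) kernel height
        for s in range(S):                      # (a.5) kernel width
            for m0 in range(0, M_core, bM):     # (a.4) output-channel blocks
                for c0 in range(0, C_core, bC): # (a.3) input-channel blocks
                    acc = np.zeros((min(bC, C_core - c0), min(bM, M_core - m0)),
                                   dtype=np.float32)
                    for e in range(E):          # (a.2) output-map height
                        for f in range(F):      # (a.1) output-map width, innermost
                            xt = x[:, c0:c0 + bC, e + r, f + s]   # (N, bC) tile
                            dt = dy[:, e, f, m0:m0 + bM]          # (N, bM) tile
                            acc += xt.T @ dt                      # bC x bM partial sum
                    dw[c0:c0 + bC, r, s, m0:m0 + bM] = acc
    return dw
```

Combining the per-core blocks from all RO×CO compute cores along the channel dimensions, as in step (a.7), yields the full C×R×S×M gradient dw.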
Description
Reverse weight gradient calculation data multiplexing method and device based on many-core processor

Technical Field

The invention relates to the field of special-purpose processors for artificial intelligence, and in particular to a method and a device for multiplexing reverse weight gradient calculation data based on a many-core processor.

Background

With the rapid development of artificial intelligence technology, the network scale of deep learning algorithms keeps increasing and places ever higher demands on processor computing capability, so the development of high-performance deep learning acceleration libraries on special-purpose artificial intelligence processors has become a research hot spot. Existing artificial intelligence processors usually adopt a many-core architecture, in which multiple compute cores process data in parallel to improve computational parallelism and performance, together with a multi-level memory architecture: global memory (Global Memory) is shared among the compute cores, each compute core contains an on-chip local memory (scratchpad memory, SPM), and the compute units directly access data in registers for computation. Convolution operators are the computational hot spots of deep learning algorithms; to achieve better performance on deep learning applications, each compute core is equipped with a hardware acceleration unit dedicated to accelerating convolution/matrix multiplication, namely a systolic array. The systolic array typically keeps the weight data stationary while the input feature map data and the intermediate partial sums flow in beats along the horizontal and vertical directions, respectively; an accumulator buffer receives the intermediate partial sums, accumulates them with the data already held in the accumulator buffer, and outputs the final accumulated result once all accumulation is complete.

In developing a high-performance acceleration library for deep learning training on a many-core processor, the acceleration and optimization of reverse convolution kernel gradient calculation faces several challenges. First, when a convolution acceleration library is designed on a many-core processor coupled with a systolic-array hardware acceleration unit and computation is accelerated by the systolic array, memory access often becomes the performance bottleneck of the convolution operator. Second, the convolution process of reverse convolution kernel gradient calculation differs considerably from forward convolution: the two inputs of the convolution operation are a feature map and a feature map gradient, both with large height and width, while the result is a convolution kernel gradient with small height and width; since the on-chip SPM and register storage of the multi-level memory architecture of the many-core processor are limited, the convolution operation has to be split into multiple loops. Finally, the convolution layers of different sizes in a network are diverse and call for different strategies in the acceleration library, and finding the better-performing strategy for a given convolution layer size is itself a challenge.
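For reference, and assuming unit stride and no padding (an assumption made here for clarity, not a restriction stated by the patent), the reverse convolution kernel weight gradient computed in this setting can be written, with 0-based indices, as

\[ dw_{c,r,s,m} = \sum_{n=0}^{N-1} \sum_{e=0}^{E-1} \sum_{f=0}^{F-1} x_{n,\,c,\,e+r,\,f+s} \cdot dy_{n,\,e,\,f,\,m}, \qquad 0 \le r < R,\ 0 \le s < S, \]

with H = E + R - 1 and W = F + S - 1. In this form the reuse opportunities discussed next are visible: x does not depend on m, dy does not depend on c, and a single element x_{n,c,h,w} contributes to every (r, s) pair with h = e + r and w = f + s.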
According to the computation principle of convolution, the reverse convolution kernel gradient calculation offers the following data multiplexing opportunities: 1) the input feature map is reused across different output channels; 2) the feature map gradient from the next layer is reused across different input channels; and 3) the input feature map is reused across the different height and width positions of the convolution kernel. When accelerating reverse convolution kernel gradient calculation on a many-core processor, the first two multiplexing characteristics are typically exploited, but when the convolution kernel height or width is greater than 1 the last multiplexing characteristic is often ignored and under-utilized. The invention aims to use the data multiplexing characteristics of convolution together with the hardware characteristics of the many-core processor architecture to design strategy schemes, reuse the data in the on-chip cache as fully as possible, and increase the degree of data reuse, thereby reducing the data access volume between the SPM and the global memory, lowering access latency, and improving performance. The method provides strategy design guidance and performance optimization for reverse convolution kernel gradient calculation in which the convolution kernel height or width is greater than 1.

Disclosure of Invention

The invention aims to overcome the defects in the prior art and provides a reverse weight gradient calculation data multiplexing method