
CN-115407966-B - Data representation method, tensor quantization method and multiply-add calculation device


Abstract

The application provides a data representation method, a tensor quantization method and a multiply-add computing device. The data representation method comprises: obtaining target data, wherein the target data comprises a flag bit and a signed number, the sum of the bit widths of the flag bit and the signed number is equal to a preset bit width, and the signed number comprises high bits and low bits; obtaining a segmentation bit of the signed number; determining the section to which the target data belongs according to the values of the flag bit and the high bits; and representing the target data according to its section. It can be inferred that the dynamic range representable by a k-bit PINT is [-2^(2(k-2)), 2^(2(k-2))], which corresponds to the range representable by a (2k-3)-bit INT format. The computational complexity is reduced compared to FP32, and model accuracy is not affected.
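The dynamic-range claim in the abstract can be checked with a few lines of arithmetic; this is an illustrative sketch (the function names are ours, not the patent's), treating the k-bit PINT maximum as 2^(2(k-2)) and an m-bit signed integer's maximum magnitude as 2^(m-1):

```python
def pint_max(k: int) -> int:
    """Largest magnitude of a k-bit PINT, per the abstract: 2^(2(k-2))."""
    return 2 ** (2 * (k - 2))

def int_max(bits: int) -> int:
    """Largest magnitude reachable by a signed integer of the given width."""
    return 2 ** (bits - 1)

# A k-bit PINT matches the magnitude of a (2k-3)-bit INT:
for k in (4, 8, 16):
    assert pint_max(k) == int_max(2 * k - 3)
```

For k = 8, both sides equal 2^12 = 4096, i.e. an 8-bit PINT spans the magnitude of a 13-bit INT.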

Inventors

  • WANG ZHONGFENG
  • LU JINMING

Assignees

  • Nanjing University (南京大学)

Dates

Publication Date
2026-05-08
Application Date
2021-05-28

Claims (8)

  1. A data representation method applied to a convolutional neural network, comprising: obtaining target data, wherein the target data comprises a flag bit and a signed number, the sum of the bit widths of the flag bit and the signed number is equal to a preset bit width, and the signed number comprises high bits and low bits; obtaining a segmentation bit of the signed number, wherein the segmentation bit divides the signed number into first data close to the flag bit and second data far from the flag bit, the combination of the first data and the segmentation bit forms the high bits, and the combination of the second data and the segmentation bit forms the low bits; determining the section to which the target data belongs according to the values of the flag bit and the high bits, wherein the section is one of a first section, a second section and a third section; and representing the target data according to its section, wherein the target data is represented by the low bits if it belongs to the first section, by the product of the value corresponding to the signed number and 2^d if it belongs to the second section, and by the product of the value corresponding to the signed number and 2^(k-2) if it belongs to the third section, where d is the segmentation bit and k is the preset bit width.
  2. The data representation method according to claim 1, wherein determining the section to which the target data belongs according to the values of the flag bit and the high bits comprises: determining a potential section of the target data according to whether the values of the high bits are all the same, wherein the target data potentially belongs to the first section if the values of the high bits are all the same, and otherwise potentially belongs to the third section; and determining the section according to the value of the flag bit, wherein the target data belongs to the second section if the value of the flag bit is 1, and belongs to its potential section if the value of the flag bit is 0.
  3. A tensor quantization method applied to a convolutional neural network, comprising: acquiring a first tensor; calculating the maximum absolute value of the first tensor; taking the quotient of the maximum absolute value and a predetermined maximum representation range as a scaling factor; scaling the first tensor by the scaling factor to obtain a second tensor; determining the segment to which the second tensor belongs according to the absolute value of the second tensor, wherein the second tensor belongs to a first segment if its absolute value is less than 2^d, to a second segment if its absolute value is between 2^d and 2^(k-2+d), and to a third segment if its absolute value is greater than 2^(k-2+d); quantizing the second tensor according to the segment to which it belongs; and taking the product of the scaling factor and the quantized second tensor as the original numerical range of the first tensor.
  4. The tensor quantization method according to claim 3, wherein quantizing the second tensor according to the segment to which it belongs comprises: if the second tensor belongs to the first segment, quantizing the second tensor x_s according to the formula x_pi = round(x_s), where round is a rounding function and x_pi is the quantized second tensor; or, if the second tensor belongs to the second segment, quantizing the second tensor x_s according to the formula x_pi = round(x_s / 2^d) × 2^d; or, if the second tensor belongs to the third segment, quantizing the second tensor x_s according to the formula x_pi = round(x_s / 2^(k-2)) × 2^(k-2); where d is the segmentation bit of the data representation method according to claim 1 or 2, and k is the preset bit width.
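Claims 3 and 4 together describe a scale-then-piecewise-round flow, which can be sketched in Python. The choice of 2^(k-2+d) for the "predetermined maximum representation range" (matching the second-segment upper bound in claim 3) and all names here are illustrative assumptions, not fixed by the claims:

```python
def quantize_tensor(x, d, k):
    """Sketch of claims 3-4: scale a tensor, round each element per segment."""
    max_repr = 2.0 ** (k - 2 + d)              # assumed maximum representation range
    scale = max(abs(v) for v in x) / max_repr  # scaling factor (claim 3)
    xs = [v / scale for v in x]                # second tensor
    xq = []
    for v in xs:
        a = abs(v)
        if a < 2 ** d:                         # first segment: round directly
            q = round(v)
        elif a <= max_repr:                    # second segment: round to multiple of 2^d
            q = round(v / 2 ** d) * 2 ** d
        else:                                  # third segment: round to multiple of 2^(k-2)
            q = round(v / 2 ** (k - 2)) * 2 ** (k - 2)
        xq.append(q)
    # Multiplying back by the scaling factor recovers the original range.
    return scale, xq
```

For x = [1.0, 2.0, 4.0] with d = 1 and k = 4, the scaling factor is 0.5 and scale × xq reproduces the inputs exactly, since every scaled value lands on a representable point.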
  5. A multiply-add computing device applied to a convolutional neural network, comprising: an acquisition module for acquiring first input data and second input data; a multiplier for multiplying a first signed number corresponding to the first input data by a second signed number corresponding to the second input data to obtain an initial product; a first judging module for judging whether the values of first high bits corresponding to the first input data are all the same; a second judging module for judging whether the values of second high bits corresponding to the second input data are all the same; a decoder for determining a first section of the first input data and a second section of the second input data according to a first flag bit of the first input data, a second flag bit of the second input data, a first judgment result of the first judging module and a second judgment result of the second judging module, and for determining a shift number of the initial product according to the first section and the second section; a shifter for shifting the initial product according to the shift number determined by the decoder to obtain a shifted product; and an adder for adding the shifted product to acquired third input data to obtain a multiply-add result.
  6. The multiply-add computing device according to claim 5, wherein the first judging module comprises a first AND gate, a first NOR gate and a first OR gate, wherein the output of the first AND gate and the output of the first NOR gate are both connected to inputs of the first OR gate; the first AND gate is configured to determine that a first output result is high when the first high bits are all high; the first NOR gate is configured to determine that a second output result is high when the first high bits are all low; and the first OR gate is configured to determine that the values of the first high bits corresponding to the first input data are all the same when the first output result and/or the second output result is high.
  7. The multiply-add computing device according to claim 5, wherein the second judging module comprises a second AND gate, a second NOR gate and a second OR gate, wherein the output of the second AND gate and the output of the second NOR gate are both connected to inputs of the second OR gate; the second AND gate is configured to determine that a third output result is high when the second high bits are all high; the second NOR gate is configured to determine that a fourth output result is high when the second high bits are all low; and the second OR gate is configured to determine that the values of the second high bits corresponding to the second input data are all the same when the third output result and/or the fourth output result is high.
  8. The multiply-add computing device according to claim 5, wherein the decoder comprises: a first determining unit for determining the first section of the first input data according to the first flag bit of the first input data and the first judgment result of the first judging module; a second determining unit for determining the second section of the second input data according to the second flag bit of the second input data and the second judgment result of the second judging module; an adding unit for adding the first section and the second section to obtain a section sum; and a third determining unit for determining the shift number of the initial product according to a correspondence between the section sum and the shift number of the initial product.
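The datapath of claims 5-8 can be sketched behaviourally. The gate-level judging modules are replaced here by a plain "all bits equal" test, and the mapping from sections to shift amounts (0, d, and k-2, mirroring the scale factors 2^0, 2^d, 2^(k-2) of claim 1) is our assumption about the decoder's correspondence table:

```python
def section_shift(flag: int, high_bits: list, d: int, k: int) -> int:
    """Shift amount contributed by one operand, per its section."""
    if flag == 1:                     # flag set: second section, shift by d
        return d
    if len(set(high_bits)) == 1:      # high bits all equal: first section
        return 0
    return k - 2                      # otherwise: third section

def multiply_add(s1, f1, h1, s2, f2, h2, acc, d, k):
    """Behavioural model of claim 5: multiply, decode, shift, accumulate."""
    prod = s1 * s2                                        # multiplier
    shift = section_shift(f1, h1, d, k) + section_shift(f2, h2, d, k)
    return (prod << shift) + acc                          # shifter + adder
```

For example, with d = 2 and k = 8, an operand in the second section (flag = 1) contributes a shift of 2, so 3 × 2 becomes 6 << 2 = 24, and adding a third input of 1 yields 25.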

Description

Data representation method, tensor quantization method and multiply-add computing device

Technical Field

The application relates to the technical field of convolutional neural networks, and in particular to a data representation method, a tensor quantization method and a multiply-add computing device.

Background

With the continuous development of Artificial Intelligence (AI), the technology has evolved from early manual feature engineering to learning from massive data, and has made significant breakthroughs in fields such as machine vision, speech recognition and natural language processing. DNNs (Deep Neural Networks) are increasingly favored in the field of artificial intelligence. However, as network architectures become larger and more complex, a large amount of computing resources is needed to train them on clusters of high-end GPU servers. In recent years, as DNNs are increasingly applied in production and daily life, and as technologies such as online learning, incremental learning and federated learning arise while the protection of data privacy receives growing attention, energy-efficient DNN training on end-side devices is becoming an urgent need. Low-bit training is a class of efficient solutions. The conventional training process uses standard 32-bit floating-point numbers (FP32) for computation, which contain a 1-bit sign, an 8-bit unsigned exponent (exp) and a 23-bit mantissa (mant). The interpretation of data represented in FP32 is divided into a normal case and a denormal case. In the normal case the value is (-1)^sign × 2^(exp-127) × 1.mant; in the denormal case it is (-1)^sign × 2^(-126) × 0.mant.
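The two FP32 formulas above can be verified against the machine's native float decoding; this sketch extracts the sign, exponent and mantissa fields from a raw 32-bit pattern and applies the normal/denormal formulas as stated:

```python
import struct

def fp32_value(bits: int) -> float:
    """Apply the FP32 formulas from the text to a raw 32-bit pattern."""
    sign = bits >> 31
    exp = (bits >> 23) & 0xFF
    mant = bits & 0x7FFFFF
    if exp == 0:  # denormal case: fixed exponent -126, hidden bit 0
        return (-1) ** sign * 2.0 ** -126 * (mant / 2 ** 23)
    # normal case: biased exponent, hidden bit 1
    return (-1) ** sign * 2.0 ** (exp - 127) * (1 + mant / 2 ** 23)

def native(bits: int) -> float:
    """Reference: reinterpret the same 32 bits as an IEEE 754 float."""
    return struct.unpack(">f", bits.to_bytes(4, "big"))[0]

# 1.0, a negative normal number, and the smallest positive denormal
for b in (0x3F800000, 0xC0490FDB, 0x00000001):
    assert fp32_value(b) == native(b)
```

The agreement on the denormal pattern 0x00000001 (value 2^-149) confirms that the denormal exponent is the fixed value -126 rather than exp - 127.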
During computation, different data representation methods, and accordingly different calculation methods, are used depending on whether the number is normal or denormal. However, the data bit width of FP32 is relatively high, and floating-point calculation involves operations such as exponent addition, operand alignment and normal/denormal discrimination, which correspondingly increases computational complexity. Therefore, there is a need for a data representation method with lower complexity, so as to achieve lower hardware resource overhead, computational complexity, computation latency and data storage overhead.

Disclosure of Invention

The application provides a data representation method, a tensor quantization method and a multiply-add computing device, which are used to solve the problem of high computational complexity caused by the FP32 representation.
In a first aspect of the present application, there is provided a data representation method applied to a convolutional neural network, comprising: obtaining target data, wherein the target data comprises a flag bit and a signed number, the sum of the bit widths of the flag bit and the signed number is equal to a preset bit width, and the signed number comprises high bits and low bits; obtaining a segmentation bit of the signed number, wherein the segmentation bit divides the signed number into first data close to the flag bit and second data far from the flag bit, the combination of the first data and the segmentation bit forms the high bits, and the combination of the second data and the segmentation bit forms the low bits; determining the section to which the target data belongs according to the values of the flag bit and the high bits, wherein the section is one of a first section, a second section and a third section; and representing the target data according to its section, wherein the target data is represented by the low bits if it belongs to the first section, by the product of the value corresponding to the signed number and 2^d if it belongs to the second section, and by the product of the value corresponding to the signed number and 2^(k-2) if it belongs to the third section, where d is the segmentation bit and k is the preset bit width.
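A minimal Python sketch of this three-section interpretation follows. It assumes the signed number is given as a two's-complement integer, and that "high bits all equal" is equivalent to the value fitting in the low part (|value| < 2^d, i.e. the high bits are a sign extension); the function name is ours:

```python
def decode_pint(flag: int, signed: int, d: int, k: int) -> int:
    """Value of a k-bit word under the three sections described above."""
    if flag == 1:                        # second section: scaled by 2^d
        return signed * 2 ** d
    if -(2 ** d) <= signed < 2 ** d:     # high bits all equal: first section
        return signed                    # value is just the low bits
    return signed * 2 ** (k - 2)         # third section: scaled by 2^(k-2)
```

For example, with d = 2 and k = 8, the words (flag=0, signed=3), (flag=1, signed=3) and (flag=0, signed=5) decode to 3, 12 and 320 respectively: the same signed number spans small, medium and large magnitudes depending on its section.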
Optionally, determining the section to which the target data belongs according to the values of the flag bit and the high bits includes: determining a potential section of the target data according to whether the values of the high bits are all the same, wherein the target data potentially belongs to the first section if the values of the high bits are all the same, and otherwise potentially belongs to the third section; and determining the section according to the value of the flag bit, wherein the target data belongs to the second section if the value of the flag bit is 1, and belongs to its potential section if the value of the flag bit is 0.
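The two-step test just described can be sketched as follows; the segment labels 1-3 and the bit-list input are illustrative choices, not fixed by the text:

```python
def section_of(flag: int, high_bits: list) -> int:
    """Claim-2-style section test: high bits vote, flag bit decides."""
    if flag == 1:
        return 2                                   # flag set: second section
    potential = 1 if len(set(high_bits)) == 1 else 3
    return potential                               # flag clear: keep potential

# The flag bit overrides the high-bit vote:
assert section_of(1, [0, 1]) == 2
# With the flag clear, all-equal high bits mean the first section:
assert section_of(0, [0, 0]) == 1
assert section_of(0, [0, 1]) == 3
```

Note that the flag bit acts as a one-bit selector between "section 2" and "trust the high-bit test", which is why the hardware of claims 6-8 only needs an AND/NOR/OR triple plus a small decoder.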