CN-115841138-B - Method for keeping consistency between quantized inference and training-end data
Abstract
The invention provides a method for keeping quantized inference data consistent with training-end data, comprising the following steps: S1, obtaining the weights as float data; S2, obtaining the maximum of the whole weight tensor, where, because the weight data contains both positive and negative numbers, the maximum of the absolute value abs() function is taken, and the weights are divided by this maximum so that the values are distributed between -1 and 1; S3, when the GPU training end quantizes the data, a truncation to 4 decimal places is performed on the data; S4, the data is multiplied by 128 and passed through the round() function, 128 being chosen because the range of int8 values is -128 to 127; S5, a final clip(-128, 127) function is applied, so that the model is finally quantized from float data to discrete data between -128 and 127. Based on an analysis of the inconsistent results that arise during weight quantization, the method processes the weight data to reduce the difference between model-inference and training data and to ensure the correctness of the board-end results.
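The five steps in the abstract map directly onto array operations. Below is a minimal sketch in Python/NumPy; the function name `quantize_weights` and the `np.trunc`-based 4-decimal step are illustrative assumptions, not code from the patent:

```python
import numpy as np

def quantize_weights(w: np.ndarray) -> np.ndarray:
    """Sketch of the S1-S5 pipeline described in the abstract and claims."""
    # S1/S2: normalize by the absolute maximum so values fall in [-1, 1]
    max_abs = np.abs(w).max()
    w_norm = w / max_abs
    # S3: truncate to 4 decimal places (per claim 1) before scaling,
    # so training-end and inference-end floats agree on the digits used
    w_trunc = np.trunc(w_norm * 1e4) / 1e4
    # S4: scale by 128 (int8 spans -128..127) and round
    w_scaled = np.round(w_trunc * 128)
    # S5: clip into [-128, 127] and store as int8
    return np.clip(w_scaled, -128, 127).astype(np.int8)
```

For a weight tensor `w`, `quantize_weights(w)` returns int8 values in [-128, 127]; in a typical deployment the scale factor `max_abs / 128` would be kept alongside the quantized weights so the inference end can dequantize, though the patent text itself only describes the forward quantization steps.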
Inventors
- ZHOU FEIFEI
Assignees
- Hefei Ingenic Technology Co., Ltd. (合肥君正科技有限公司)
Dates
- Publication Date
- 20260508
- Application Date
- 20210918
Claims (3)
- 1. A method for keeping quantized inference data consistent with training-end data, characterized in that, during model quantization, training and inference run on different devices: the training end runs on GPU equipment to accelerate training of the model, while inference runs on chip-related board-end equipment using the frozen model from the training end; the method comprises the following steps: S1, obtaining the weights as float data; S2, obtaining the maximum of the whole weight tensor, where, because the weight data contains both positive and negative numbers, the maximum of the absolute value is taken, and the weights are divided by this maximum so that the values are distributed between -1 and 1; S3, when the data is quantized at the GPU training end, a truncation to 4 decimal places is applied to the data between -1 and 1, before the multiplication by 128 in step S4; 4 is chosen as an intermediate value because the effective precision of float is about 7 decimal digits, and the value 4 was obtained from empirical data during model training, which preserves precision while avoiding inconsistency between the GPU and inference-end results during model quantization (a numeric illustration of this truncation follows the claims); S4, multiplying by 128 and applying the round() function, 128 being chosen because the range of int8 values is -128 to 127; S5, applying the final clip(-128, 127) function, so that the model is finally quantized from float data to discrete data between -128 and 127.
- 2. The method for keeping quantized inference data consistent with training-end data according to claim 1, wherein the method belongs to the quantization of weights in low-bit training based on deep neural networks.
- 3. The method of claim 1, wherein in step S1, abs() is used to obtain the absolute value.
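Claim 1's rationale for the 4-decimal truncation is that GPU and CPU float results generally diverge around the 7th digit after the decimal point, which can push round() onto opposite sides of a .5 boundary. A hedged numeric demonstration of that failure mode, with weight values invented for illustration:

```python
import numpy as np

# Hypothetical normalized weight: the same value as computed on the GPU
# training end and on the board-end CPU, differing only past the 7th
# decimal digit (these numbers are invented for illustration).
w_gpu = 0.99609374   # * 128 -> 127.4999987...
w_cpu = 0.99609376   # * 128 -> 127.5000013...

# Without truncation, round() lands on opposite sides of the boundary:
print(np.round(w_gpu * 128), np.round(w_cpu * 128))   # 127.0 vs 128.0

# With the 4-decimal truncation from claim 1, both devices agree:
def trunc4(x):
    return np.trunc(x * 1e4) / 1e4

print(np.round(trunc4(w_gpu) * 128), np.round(trunc4(w_cpu) * 128))  # 127.0 127.0
```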
Description
Method for keeping consistency between quantized inference and training-end data

Technical Field

The invention relates to the technical field of neural networks, in particular to a method for keeping quantized inference data consistent with training-end data.

Background

As model prediction becomes more accurate, networks become deeper and deeper, and the memory consumed by a neural network becomes a problem, especially for applications on mobile devices such as the board-end T31-series chips of Beijing Ingenic Semiconductor (abbreviated as Ingenic). In general, the flash capacity at the board end is very small, so the size of the model is not only a memory-capacity problem but also a memory-bandwidth problem. The model uses its weights for every prediction, and image-related applications typically need to process data in real time. Low-bit quantization of the weights is therefore important: quantization can greatly reduce the size of a model. Quantizing float data to 8 bits shrinks the model by a factor of 4 and greatly improves the running speed of the network. However, because network training and inference run on different devices (GPU and CPU), the quantized results can be inconsistent, so the trained model produces inconsistent results on the board-end inference side. In the prior art, when quantizing a 32-bit float weight, the precision mismatch between the GPU at the training end and the CPU at the inference end (GPU and CPU results generally disagree around the 7th digit after the decimal point) makes the quantized results inconsistent and increases the error at the inference end.

Terms commonly used in the art include the following.

Low-bit quantization: quantizing weights and features from a 32-bit (float) width down to 8, 4, or 2 bits.

Network training: defining the structure of the neural network and the output of forward propagation, defining a loss function, selecting a backward-propagation optimization algorithm, propagating gradients and optimizing the network via the BP algorithm, and repeatedly running the backward-propagation optimization on the training data so that the network fits the data set.

Quantized inference: the network weights from the training end are frozen, with no backward-propagation process, so the model can be fixed and, being fixed, optimized for inference. Inference normally uses floating-point numbers, but the model size and latency of floating-point inference are very high, so the model is quantized to 8-bit integers, which accelerates network inference and reduces the memory footprint of the model at the board end.
Training/inference quantized-data consistency: the training end optimizes the network on the GPU using tensorflow, pytorch, or mxnet training frameworks, while inference is implemented in C++ and runs on the board-end CPU. Training and quantized inference therefore differ in GPU/CPU precision, and the round operation added during quantization creates a further data mismatch between the training and inference ends, which corrupts the data at the network inference end and reduces the accuracy of the network model's inference.

clip(): a function that constrains the elements of an array to a given range; given the upper and lower bounds of the range, the clip function changes every value smaller than the lower bound to the lower bound and every value larger than the upper bound to the upper bound.

round(): a function that returns the result of rounding a value to the specified decimal place.

Disclosure of Invention

To solve the above problems, the invention aims, based on an analysis of the inconsistent results in weight quantization, to process the weight data so as to reduce the difference between model-inference and training data and ensure the correctness of the board-end results. Specifically, the invention provides a method for keeping quantized inference data consistent with training-end data, comprising the following steps: S1, obtaining the weights as float data; S2, obtaining the maximum of the whole weight tensor, where, because the weight data contains both positive and negative numbers, the maximum of the absolute value abs() function is taken, and the weights are divided by this maximum so that the values are distributed between -1 and 1; S3, when the data is quantized at the GPU training end, a truncation to 4 decimal places is applied to the data; S4, multiplying by 128 and applying the round() function, 128 being chosen because the range of int8 values is -128 to 127; S5, applying the final clip(-128, 127) function, so that the model is finally quantized from float data to discrete data between -128 and 127.
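The clip() and round() behavior described above matches the standard NumPy functions; a short illustration follows (the sample array is invented, and note that np.round resolves .5 ties toward the even integer, one common rounding convention):

```python
import numpy as np

x = np.array([-200.0, -128.4, 0.6, 127.5, 300.0])  # invented sample values

# clip: values below the lower bound become the bound, likewise above it
print(np.clip(x, -128, 127))   # [-128.  -128.     0.6  127.   127. ]

# round: nearest integer, with .5 ties going to the even neighbor
print(np.round(x))             # [-200.  -128.     1.   128.   300. ]
```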