CN-121996890-A - Rotation matrix-based mixed-precision quantization inference method and device for large models
Abstract
The invention relates to a rotation-matrix-based mixed-precision quantization inference method and device for large models. The method comprises: obtaining a target activation tensor; performing rotation transformation on the target activation tensor using a rotation matrix; performing token-dimension mixed-precision quantization, such that activation values of the same precision in the quantized activation tensor are laid out contiguously in memory and the first-precision activation tensor is stored in an outlier buffer allocated in memory; invoking tensor cores to perform matrix multiplication between the quantized activation tensor and the quantized weight tensor to obtain a quantization operation result; and dequantizing that result, the dequantized result serving as the matrix multiplication result of the target activation tensor and the target weight tensor. Because activation values of the same precision are contiguous in memory and the first-precision activation tensor resides in the outlier buffer, the tensor cores can compute efficiently, accelerating model inference while preserving model capability and high accuracy.
Inventors
- ZHAI JIDONG
- ZHANG QIHAO
Assignees
- Tsinghua University (清华大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-16
Claims (10)
- 1. A rotation-matrix-based mixed-precision quantization inference method for large models, characterized by comprising the following steps: acquiring a target activation tensor; performing rotation transformation on the target activation tensor using a rotation matrix to obtain a rotation-transformed activation tensor; performing token-dimension mixed-precision quantization on the rotation-transformed activation tensor to obtain a quantized activation tensor, wherein activation values of the same precision in the quantized activation tensor are distributed contiguously in memory, and the quantized activation tensor comprises an activation tensor of a first precision and an activation tensor of a second precision, the first precision being greater than the second precision; invoking a tensor core to perform matrix multiplication between the quantized activation tensor and a quantized weight tensor to obtain a quantization operation result, wherein the quantized weight tensor is obtained by performing rotation transformation on a target weight tensor using the rotation matrix and then performing token-dimension mixed-precision quantization; and performing dequantization on the quantization operation result, the dequantized result serving as the matrix multiplication result of the target activation tensor and the target weight tensor.
- 2. The method of claim 1, wherein performing token-dimension mixed-precision quantization on the rotation-transformed activation tensor to obtain the quantized activation tensor comprises: calculating the maximum absolute activation value corresponding to each token dimension in the rotation-transformed activation tensor; selecting a preset number of token dimensions from the token dimensions of the rotation-transformed activation tensor in descending order of the maximum absolute activation value; performing first-precision quantization on the activation values corresponding to the preset number of token dimensions to obtain the activation tensor of the first precision; and performing second-precision quantization either on the activation values corresponding to every token dimension or on the activation values corresponding to the remaining token dimensions in the rotation-transformed activation tensor to obtain the activation tensor of the second precision, wherein the remaining token dimensions are the token dimensions other than the preset number of token dimensions in the rotation-transformed activation tensor.
- 3. The method of claim 2, wherein the quantized weight tensor comprises a weight tensor of the first precision and a weight tensor of the second precision, the weight tensor of the first precision corresponding to the activation tensor of the first precision and the weight tensor of the second precision corresponding to the activation tensor of the second precision; and wherein performing matrix multiplication between the quantized activation tensor and the quantized weight tensor to obtain the quantization operation result comprises: performing matrix multiplication between the activation tensor of the second precision and the weight tensor of the second precision to obtain a first operation result; performing matrix multiplication between the activation tensor of the second precision and the weight tensor of the first precision to obtain a second operation result; performing matrix multiplication between the activation tensor of the first precision and the weight tensor of the first precision to obtain a third operation result; performing matrix multiplication between the activation tensor of the first precision and the weight tensor of the second precision to obtain a fourth operation result; and replacing data at the corresponding positions in the first operation result with the second, third and fourth operation results to obtain the quantization operation result (a sketch of this decomposition is given after the claims).
- 4. The method according to claim 3, wherein a pipelined processing mode is adopted for the matrix multiplication of the activation tensor of the second precision with the weight tensor of the second precision, the matrix multiplication of the activation tensor of the second precision with the weight tensor of the first precision, the matrix multiplication of the activation tensor of the first precision with the weight tensor of the first precision, and the matrix multiplication of the activation tensor of the first precision with the weight tensor of the second precision.
- 5. The method according to claim 4, wherein, when the pipelined processing mode is adopted for the matrix multiplication of an activation tensor and a weight tensor of the same precision, the slices of the same-precision activation tensor and weight tensor corresponding to a target thread block are loaded from global memory into shared memory through asynchronous memory copy operations for use in the matrix multiplication of the target thread block; and/or, when the pipelined processing mode is adopted for the matrix multiplication of an activation tensor and a weight tensor of different precisions, the slices of the different-precision activation tensor and weight tensor corresponding to the target thread block are loaded from global memory into shared memory through asynchronous memory copy operations and then converted to the same type in shared memory for use in the matrix multiplication of the target thread block.
- 6. The method of claim 3, wherein the matrix multiplication of the activation tensor of the second precision with the weight tensor of the second precision, the matrix multiplication of the activation tensor of the second precision with the weight tensor of the first precision, the matrix multiplication of the activation tensor of the first precision with the weight tensor of the first precision, and the matrix multiplication of the activation tensor of the first precision with the weight tensor of the second precision are performed in an asynchronous, parallel manner.
- 7. The method of claim 1, wherein performing rotation transformation on the target activation tensor using the rotation matrix to obtain the rotation-transformed activation tensor comprises: performing matrix multiplication between the target activation tensor and the rotation matrix to obtain the rotation-transformed activation tensor; and/or, performing rotation transformation on the target weight tensor using the rotation matrix comprises: calculating the transpose of the rotation matrix; and performing matrix multiplication between the transposed matrix and the target weight tensor to obtain the rotation-transformed weight tensor.
- 8. A rotation-matrix-based mixed-precision quantization inference device for large models, the device comprising: an acquisition module, configured to acquire a target activation tensor; a rotation transformation module, configured to perform rotation transformation on the target activation tensor using a rotation matrix to obtain a rotation-transformed activation tensor; a quantization module, configured to perform token-dimension mixed-precision quantization on the rotation-transformed activation tensor to obtain a quantized activation tensor, wherein activation values of the same precision in the quantized activation tensor are distributed contiguously in memory; a matrix multiplication module, configured to invoke a tensor core to perform matrix multiplication between the quantized activation tensor and a quantized weight tensor to obtain a quantization operation result, wherein the quantized weight tensor is obtained by performing rotation transformation on a target weight tensor using the rotation matrix and then performing token-dimension mixed-precision quantization; and a dequantization module, configured to dequantize the quantization operation result, the dequantized result serving as the matrix multiplication result of the target activation tensor and the target weight tensor.
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to carry out the steps of the method according to any one of claims 1 to 7.
- 10. A non-transitory computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
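To make the decomposition referenced in claims 3 and 6 concrete, here is a minimal PyTorch sketch. The tensor names, the choice of fp16 as the "first precision" and int8 as the "second precision", and the assumption that the second-precision tensors cover all token rows and all output columns (so the first result can be overwritten block-wise) are illustrative assumptions, not the patent's implementation:

```python
# Hypothetical sketch of the four-way mixed-precision matmul (claims 3 and 6).
import torch

def mixed_precision_matmul(a_lo, w_lo, a_hi, w_hi, hi_rows, hi_cols):
    # first result: full-size second-precision GEMM (int8 tensor cores in practice)
    out = a_lo.float() @ w_lo.float()
    # second result: second-precision activations x first-precision weight columns
    out[:, hi_cols] = a_lo.float() @ w_hi.float()
    # fourth result: first-precision activation rows x second-precision weights
    out[hi_rows] = a_hi.float() @ w_lo.float()
    # third result: the all-first-precision outlier block, written last so it
    # overrides the lower-precision values at those positions
    out[hi_rows[:, None], hi_cols] = a_hi.float() @ w_hi.float()
    return out  # dequantization (rescaling) would follow, per claim 1

# toy shapes: 8 tokens x 32 channels -> 16 outputs; 2 outlier rows, 4 outlier cols
hi_rows, hi_cols = torch.tensor([1, 5]), torch.tensor([0, 3, 7, 9])
a_lo = torch.randint(-128, 128, (8, 32), dtype=torch.int8)   # all tokens, int8
w_lo = torch.randint(-128, 128, (32, 16), dtype=torch.int8)  # all columns, int8
a_hi = torch.randn(2, 32, dtype=torch.float16)               # outlier tokens, fp16
w_hi = torch.randn(32, 4, dtype=torch.float16)               # outlier columns, fp16
y = mixed_precision_matmul(a_lo, w_lo, a_hi, w_hi, hi_rows, hi_cols)
```

On a GPU the four GEMMs are independent, so they could be issued on separate CUDA streams to realize the asynchronous parallelism of claim 6, with each GEMM internally overlapping its global-to-shared-memory copies with computation as in claims 4 and 5.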
Description
Rotation Matrix-Based Mixed-Precision Quantization Inference Method and Device for Large Models

Technical Field

The disclosure relates to the technical field of artificial intelligence, and in particular to a rotation-matrix-based mixed-precision quantization inference method and device for large models.

Background

With the rapid development of large language model technology in recent years, the parameter scale of the latest models has reached tens of billions or even hundreds of billions, placing enormous pressure on GPU memory capacity and model inference efficiency. Quantization is one of the key techniques for optimizing inference services for ultra-large-scale models: by compressing the representation of compute tensors from a high-precision format to a low-precision format, it greatly reduces the model's memory footprint, and by exploiting the GPU's dedicated low-precision Tensor Core instructions it achieves higher compute throughput, thereby improving inference efficiency.

The inference process of a quantized model introduces two additional online computations: quantization and dequantization. Quantization first determines a shared scaling factor, which is typically stored in a high-precision format, then divides the input tensor by the scaling factor and rounds it into the value space of the low-precision data format, completing the compression. Low-precision matrix multiplication is then performed between the quantized activations and the offline-quantized weights, and finally the result is rescaled back to the original precision by dequantization. In practice, quantization is often applied at a finer granularity, for example grouping activations by rows and weights by columns, to better preserve model accuracy. This changes the scaling factor from a scalar to a vector, so the dequantization stage requires a broadcast element-wise multiplication, which increases computational complexity and slows inference; moreover, existing quantization schemes do not account for the characteristics of the actual hardware, making the hardware inefficient when operating on quantized data. A more efficient approach to accelerating quantized inference for large models is therefore needed.
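As a concrete illustration of the quantize-multiply-dequantize flow just described, the following is a minimal sketch assuming symmetric int8 quantization with per-row activation scales and per-column weight scales; the function name and shapes are illustrative only:

```python
import torch

def quantize_int8(x, dim):
    # shared scaling factor per row/column, kept in high precision
    scale = x.abs().amax(dim=dim, keepdim=True) / 127.0
    # divide by the scale and round into the int8 value space
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

a = torch.randn(4, 64)                 # activations, quantized online per row
w = torch.randn(64, 8)                 # weights, quantized offline per column
a_q, a_s = quantize_int8(a, dim=1)     # a_s has shape [4, 1]
w_q, w_s = quantize_int8(w, dim=0)     # w_s has shape [1, 8]

# low-precision matmul (an int8 tensor-core GEMM on real hardware)
y_q = a_q.float() @ w_q.float()

# dequantization: the vector scales force a broadcast element-wise multiply
y = y_q * (a_s * w_s)
print((y - a @ w).abs().max())         # small quantization error
```

The broadcast multiply in the last step is exactly the per-element dequantization overhead, caused by vector rather than scalar scaling factors, that the paragraph above identifies.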
Disclosure of Invention

In view of this, the present disclosure proposes a rotation-matrix-based mixed-precision quantization inference method, device, electronic device, storage medium and computer program product for large models.

According to an aspect of the present disclosure, there is provided a rotation-matrix-based mixed-precision quantization inference method for large models, the method including: acquiring a target activation tensor; performing rotation transformation on the target activation tensor using a rotation matrix to obtain a rotation-transformed activation tensor; performing token-dimension mixed-precision quantization on the rotation-transformed activation tensor to obtain a quantized activation tensor, wherein activation values of the same precision in the quantized activation tensor are distributed contiguously in memory, and the quantized activation tensor comprises an activation tensor of a first precision and an activation tensor of a second precision, the first precision being greater than the second precision; invoking a tensor core to perform matrix multiplication between the quantized activation tensor and a quantized weight tensor to obtain a quantization operation result, wherein the quantized weight tensor is obtained by performing rotation transformation on a target weight tensor using the rotation matrix and then performing token-dimension mixed-precision quantization; and performing dequantization on the quantization operation result, the dequantized result serving as the matrix multiplication result of the target activation tensor and the target weight tensor.

In one possible implementation, performing token-dimension mixed-precision quantization on the rotation-transformed activation tensor to obtain the quantized activation tensor includes: calculating the maximum absolute activation value corresponding to each token dimension in the rotation-transformed activation tensor; selecting a preset number of token dimensions from the token dimensions of the rotation-transformed activation tensor in descending order of the maximum absolute activation value; performing first-precision quantization on the activation values corresponding to the preset number of token dimensions to obtain the activation tensor of the first precision; and performing second-precision quantization either on the activation values corresponding to every token dimension or on the activation values corresponding to the remaining token dimensions in the rotation-transformed activation tensor to obtain the activation tensor of the second precision, wherein the remaining token dimensions are the token dimensions other than the preset number of token dimensions.
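A minimal sketch of the rotation transform and the token-dimension selection described above follows. The use of a normalized Hadamard matrix as the rotation matrix (a common choice in rotation-based quantization, but not mandated by the text), fp16/int8 as the first/second precisions, and the outlier count k are all assumptions for illustration:

```python
import torch

def hadamard(n):
    # normalized Hadamard matrix: orthogonal, so H @ H.T == I (n a power of two)
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat([torch.cat([h, h], 1), torch.cat([h, -h], 1)], 0)
    return h / n ** 0.5

K = 64
R = hadamard(K)                          # assumed rotation matrix
a = torch.randn(16, K)                   # target activation: [tokens, channels]
w = torch.randn(K, 8)                    # target weight

a_rot = a @ R                            # rotate the activation (claim 7)
w_rot = R.t() @ w                        # rotate the weight via the transpose (claim 7)
# the rotation cancels in the product: (a @ R) @ (R.T @ w) == a @ w
assert torch.allclose(a_rot @ w_rot, a @ w, atol=1e-4)

# token-dimension mixed-precision selection, as in the implementation above
k = 2                                     # "preset number" of outlier tokens (assumed)
max_abs = a_rot.abs().amax(dim=1)         # max absolute value per token dimension
hi_rows = max_abs.topk(k).indices         # tokens kept at the first (higher) precision
lo_mask = torch.ones(a_rot.shape[0], dtype=torch.bool)
lo_mask[hi_rows] = False

a_hi = a_rot[hi_rows].half().contiguous()     # outlier buffer: first-precision tokens
scale = a_rot[lo_mask].abs().amax() / 127.0   # single scale here, for brevity
a_lo = torch.clamp(torch.round(a_rot[lo_mask] / scale), -128, 127).to(torch.int8)
# a_hi and a_lo are each contiguous in memory, matching the layout of claim 1
```

Storing the first-precision tokens in a separate contiguous outlier buffer and the second-precision tokens contiguously lets the tensor cores consume each precision without gather/scatter, which is the layout property the abstract emphasizes.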