CN-122021751-A - Accelerated inference system for a large language model, mixed-precision quantization method, and medium
Abstract
The application provides an accelerated inference system for a large language model, a mixed-precision quantization method, and a medium. The accelerated inference system is integrated in HBM stacks to form a heterogeneous system with an XPU, and comprises a CPU, the XPU, and a plurality of groups of HBM stacks, each group of HBM stacks comprising eight sequentially stacked DRAM chips and a buffer chip connected through through-silicon vias. The system uses in-memory computing to perform mixed-precision quantization on a target large language model, and uses the quantized model parameters to carry out inference computation at the attention layer, thereby accelerating inference of the target large language model. The application can quantize a large language model with mixed precision, balance accuracy against compression rate, maximize bandwidth and compute efficiency, and ensure that the quantization gains translate into actual inference acceleration.
Inventors
- JIANG LI
- HU YIWEI
- LIU FANGXIN
Assignees
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-04
Claims (10)
- 1. An accelerated inference system for a large language model, characterized in that the system is integrated in HBM stacks to form a heterogeneous system with an XPU, wherein the system comprises a CPU, the XPU, and a plurality of groups of HBM stacks, each group of HBM stacks comprising eight sequentially stacked DRAM chips and a buffer chip connected through through-silicon vias; the system uses in-memory computing to perform mixed-precision quantization on a target large language model, and uses the quantized model parameters to perform inference computation at an attention layer, thereby accelerating inference computation of the target large language model.
- 2. The system of claim 1, wherein the DRAM chip comprises a plurality of DRAM banks with a Bank-PIM unit arranged among them; the Bank-PIM unit reads operands from a bank row buffer to perform parallel quantized matrix multiplication for the attention layer, and sends the parallel computation results to the buffer chip for accumulation, thereby obtaining an accumulated result.
- 3. The accelerated inference system of claim 1, wherein a Stack-PIM unit is provided on the buffer chip, the Stack-PIM unit comprising a quantization unit, an inverse quantization unit, an on-chip buffer, and a Softmax unit; the quantization unit performs inter-layer mixed-precision quantization on the target large language model, distributes the quantized model parameters to the Bank-PIM unit, and stores the quantization parameters corresponding to each layer in the on-chip buffer; the inverse quantization unit collects accumulated results at fixed intervals, reads the quantization parameters corresponding to the current layer from the on-chip buffer, performs inverse quantization on the accumulated results, and passes the inverse-quantized results to the quantization unit or the Softmax unit; and the Softmax unit is configured to perform Softmax processing on the attention-layer scores.
- 4. The accelerated inference system for a large language model according to claim 2, wherein the quantization unit performing inter-layer mixed-precision quantization on the target large language model comprises: acquiring outliers in the activation values of the target large language model, and mapping the outliers onto the weights through channel-by-channel scaling to obtain a weight distribution and an activation-value distribution, respectively; quantizing the activation-value distribution and the weight distribution to obtain a corresponding activation-value quantization distribution and weight quantization distribution; determining the mixed quantization precision of each weight channel based on the original weight distribution and the quantized weight distribution; acquiring the data scaling factor between the activation-value distribution and the activation-value quantization distribution and taking that scaling factor as the activation-value entropy; and determining the quantization precision of the activation values by minimizing the difference between the activation-value entropy and a penalty term, wherein the penalty term is the byte-level data cost.
- 5. The accelerated inference system for a large language model of claim 1, wherein the system using the quantized model parameters to perform inference computation at the attention layer comprises: splitting the activation value bit-wise to obtain a first partial activation value and a second partial activation value; performing quantized matrix multiplication on the first partial activation value and the second partial activation value independently through a shared multiplier; and using a shifter to carry out the quantized matrix multiplication according to the mixed quantization precision of the different weight channels.
- 6. The accelerated inference system for a large language model of claim 1, wherein the system comprises a computation buffer layer on which a scheduler is stored, the scheduler being integrated with a DRAM controller, and the CPU using the scheduler to coordinate execution of the Bank-PIM units through the DRAM controller.
- 7. The accelerated inference system for large language models of claim 6, wherein the Bank-PIM units are grouped according to a predetermined number of tokens, and each group jointly computes one token through channel partitioning.
- 8. The accelerated inference system for a large language model of claim 2, wherein each of the DRAM banks operates exclusively in either DRAM mode or PIM mode at any one time.
- 9. A mixed-precision quantization method for a large language model, the method comprising: acquiring outliers in the activation values of the target large language model, and mapping the outliers onto the weights through channel-by-channel scaling to obtain a weight distribution and an activation-value distribution, respectively; quantizing the activation-value distribution and the weight distribution to obtain a corresponding activation-value quantization distribution and weight quantization distribution; determining the mixed quantization precision of each weight channel based on the original weight distribution and the quantized weight distribution; acquiring the data scaling factor between the activation-value distribution and the activation-value quantization distribution and taking that scaling factor as the activation-value entropy; and determining the quantization precision of the activation values by minimizing the difference between the activation-value entropy and a penalty term, wherein the penalty term is the byte-level data cost.
- 10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the functions of the accelerated inference system for a large language model as claimed in any one of claims 1 to 8, or the mixed-precision quantization method for a large language model as claimed in claim 9.
Description
Accelerated inference system for a large language model, mixed-precision quantization method, and medium

Technical Field
The application belongs to the technical field of large models, and relates to an accelerated inference system for a large language model, a mixed-precision quantization method, and a medium.

Background
Large Language Models (LLMs) exhibit excellent capabilities across many language tasks, but their deployment is limited by the large scale of the models and their high resource requirements. A key bottleneck in inference is the "memory wall", especially during the token-generation phase: the limited bandwidth between the compute units of a graphics card (GPU) and its memory (DRAM) results in low hardware utilization. Although the computational performance of accelerators has improved significantly, memory capacity and memory bandwidth have not grown at the same pace, which highlights the need for more memory-efficient solutions for large language model deployment. DRAM-based Processing In Memory (PIM) offers a promising way to address the memory bottlenecks of large language model inference. By placing compute units near the memory banks (DRAM banks), DRAM-based PIM significantly improves memory bandwidth, up to 8 times that of conventional architectures, making it well suited for memory-bandwidth-constrained workloads such as large language models. Therefore, how to extend the applicability of DRAM-based PIM in deep neural networks to accelerate large language model inference is a technical problem that needs to be solved.

Disclosure of the Invention
The application provides an accelerated inference system for a large language model, a mixed-precision quantization method, and a medium, which address the technical problem of extending the applicability of DRAM-based PIM in deep neural networks so as to accelerate inference of a large language model. In a first aspect, an embodiment of the application provides an accelerated inference system for a large language model, where the system is integrated in HBM stacks to form a heterogeneous system with an XPU; the system includes a CPU, the XPU, and multiple groups of HBM stacks, each group comprising eight sequentially stacked DRAM chips and a buffer chip connected by through-silicon vias; the system performs mixed-precision quantization on a target large language model using in-memory computing, and performs inference computation at the attention layer using the quantized model parameters to accelerate inference of the target large language model.

In one implementation of the first aspect, the DRAM chip includes a plurality of DRAM banks with Bank-PIM units disposed among them; the Bank-PIM units read operands from the bank row buffers to perform parallel quantized matrix multiplication for the attention layer, and send the parallel computation results to the buffer chip for accumulation, thereby obtaining the accumulated results.
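As an illustration of this bank-parallel dataflow, the following minimal Python sketch distributes the inner dimension of a quantized matrix multiplication across a set of simulated Bank-PIM units and then sums their partial products, mirroring the buffer-chip accumulation step. The bank count, the data layout, and the function names are assumptions for illustration only and are not fixed by the application.

```python
import numpy as np

NUM_BANKS = 16  # banks per DRAM chip (illustrative; the application does not fix this)

def bank_pim_matmul(activations_q, weights_q):
    """Simulate the Bank-PIM dataflow: each bank multiplies the operands held
    in its row buffer, and the per-bank partial products are summed on the
    buffer chip (the accumulation step) to form the final int32 result."""
    # Split the shared (inner) dimension across banks, as if each bank stored
    # a contiguous slice of the weight matrix.
    k = weights_q.shape[0]
    slices = np.array_split(np.arange(k), NUM_BANKS)

    partial_sums = []
    for idx in slices:  # conceptually parallel across Bank-PIM units
        partial = activations_q[:, idx].astype(np.int32) @ weights_q[idx, :].astype(np.int32)
        partial_sums.append(partial)

    # Buffer-chip accumulation of the per-bank partial results.
    return np.sum(partial_sums, axis=0)

# Example: int8 activations for one token times an int8 weight matrix.
rng = np.random.default_rng(0)
a_q = rng.integers(-128, 127, size=(1, 256), dtype=np.int8)
w_q = rng.integers(-128, 127, size=(256, 64), dtype=np.int8)
acc = bank_pim_matmul(a_q, w_q)  # int32 accumulated result, as in claim 2
```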
In one implementation of the first aspect, a Stack-PIM unit is provided on the buffer chip, the Stack-PIM unit comprising a quantization unit, an inverse quantization unit, an on-chip buffer, and a Softmax unit; the quantization unit performs inter-layer mixed-precision quantization on the target large language model, distributes the quantized model parameters to the Bank-PIM units, and stores the quantization parameters corresponding to each layer in the on-chip buffer; the inverse quantization unit collects accumulated results at fixed intervals, reads the quantization parameters corresponding to the current layer from the on-chip buffer, performs inverse quantization on the accumulated results, and passes the inverse-quantized results to the quantization unit or the Softmax unit; and the Softmax unit is configured to perform Softmax processing on the attention-layer scores.

In one implementation of the first aspect, the quantization unit performing inter-layer mixed-precision quantization on the target large language model includes: acquiring outliers in the activation values of the target large language model, and mapping the outliers onto the weights through channel-by-channel scaling to obtain a weight distribution and an activation-value distribution, respectively; quantizing the activation-value distribution and the weight distribution to obtain a corresponding activation-value quantization distribution and weight quantization distribution; determining the mixed quantization precision of each weight channel based on the original weight distribution and the quantized weight distribution; acquiring the data scaling factor between the activation-value distribution and the activation-value quantization distribution and taking that scaling factor as the activation-value entropy; and determining the quantization precision of the activation values by minimizing the difference between the activation-value entropy and a penalty term, wherein the penalty term is the byte-level data cost.
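The quantization procedure above can be read as an outlier-migration step followed by two precision searches. The Python sketch below is one possible reading: the smoothing exponent, the error metric used to compare the original and quantized weight distributions, the precision candidates, and the weighting of the byte-level penalty are all assumptions, since the application does not fix them.

```python
import numpy as np

def migrate_outliers(activations, weights, alpha=0.5):
    """Shift activation outliers into the weights via channel-by-channel
    scaling ('mapping the outliers onto the weights'); the smoothing
    exponent alpha is an assumption."""
    a_max = np.abs(activations).max(axis=0)        # per-channel activation range
    w_max = np.abs(weights).max(axis=1)            # per-channel weight range
    s = (a_max ** alpha) / (w_max ** (1.0 - alpha) + 1e-8)
    return activations / s, weights * s[:, None]

def quantize(x, bits):
    """Symmetric uniform quantization; returns the dequantized tensor and its
    data scaling factor (the quantity the claims treat as the entropy)."""
    scale = np.abs(x).max() / (2 ** (bits - 1) - 1)
    q = np.clip(np.round(x / scale), -2 ** (bits - 1), 2 ** (bits - 1) - 1)
    return q * scale, scale

def weight_channel_bits(weights, candidates=(4, 8)):
    """Choose each weight channel's precision by comparing the original and
    quantized distributions; mean squared error and the 4x threshold are
    assumed criteria."""
    bits = []
    for channel in weights:
        err = {b: np.mean((channel - quantize(channel, b)[0]) ** 2) for b in candidates}
        bits.append(4 if err[4] <= 4.0 * err[8] else 8)
    return np.array(bits)

def activation_bits(activations, candidates=(4, 8, 16), lam=1e-3):
    """Pick the activation precision minimizing the difference between the
    activation entropy (the data scaling factor) and the byte-level data-cost
    penalty; the weighting lam is assumed, as the claim does not fix units."""
    best_b, best_score = None, None
    for b in candidates:
        _, scale = quantize(activations, b)
        penalty = lam * activations.size * b / 8   # byte-level data cost
        score = abs(scale - penalty)
        if best_score is None or score < best_score:
            best_b, best_score = b, score
    return best_b

# Example: smooth outliers, then derive per-channel weight and activation precisions.
rng = np.random.default_rng(0)
acts = rng.normal(size=(128, 256)) * np.where(rng.random(256) < 0.02, 20.0, 1.0)
wts = rng.normal(size=(256, 512))
acts_s, wts_s = migrate_outliers(acts, wts)
print(weight_channel_bits(wts_s)[:8], activation_bits(acts_s))
```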
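For the Stack-PIM path described above (inverse quantization of the accumulated results followed by Softmax), a minimal sketch, assuming symmetric per-layer scales for activations and weights:

```python
import numpy as np

def dequantize_and_softmax(acc_int32, s_act, s_wt):
    """Inverse-quantize accumulated int32 attention scores with the per-layer
    scales held in the on-chip buffer, then apply Softmax; symmetric
    per-layer scales are an assumption."""
    scores = acc_int32.astype(np.float32) * s_act * s_wt  # back to real values
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)
```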
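Claim 5 splits each quantized activation bit-wise so that the two partial activation values can share one multiplier, with a shifter recombining them and applying the per-channel weight precision. The following sketch shows only that arithmetic for an int8 activation split into two 4-bit parts; the nibble width and recombination order are assumptions.

```python
import numpy as np

def bitsplit_matmul(a_q, w_q):
    """Split each int8 activation into a high and a low 4-bit part (the first
    and second partial activation values of claim 5), run both through the
    same integer multiply, then recombine the high part with a shift."""
    a = a_q.astype(np.int32)
    low = a & 0xF                 # low nibble (unsigned)
    high = a >> 4                 # high nibble (arithmetic shift keeps the sign)

    w = w_q.astype(np.int32)
    partial_low = low @ w         # shared multiplier, pass 1
    partial_high = high @ w       # shared multiplier, pass 2

    # The shifter re-weights the high nibble by 2**4 before accumulation.
    return (partial_high << 4) + partial_low

# Sanity check: the split-and-shift result equals the direct int matmul.
rng = np.random.default_rng(1)
a_q = rng.integers(-128, 127, size=(1, 64), dtype=np.int8)
w_q = rng.integers(-8, 7, size=(64, 16), dtype=np.int8)  # e.g. 4-bit weights
assert np.array_equal(bitsplit_matmul(a_q, w_q),
                      a_q.astype(np.int32) @ w_q.astype(np.int32))
```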