CN-121998084-A - Inference acceleration method and system for lightweight large model
Abstract
The application discloses an inference acceleration method and system for a lightweight large model, comprising the following steps: S1, performing complex-domain enhanced ultra-low-bit quantization, including complex-domain characterization conversion of real-valued parameters, task-aware dynamic bit allocation, complex-domain quantization, and precision calibration; S2, constructing a three-layer architecture comprising a resource layer, a scheduling layer, and an application layer, taking as inputs the quantized real-valued weight parameters from step S13, user inference requests, and hardware state information, completing task execution in the pre-filling and decoding stages through three-layer collaborative scheduling, performing heterogeneous collaborative inference based on PD separation, and outputting inference results and task execution states. The method belongs to the technical field of artificial intelligence. By combining complex-domain characterization conversion with dynamic bit allocation and precision calibration, it maintains high accuracy under ultra-low-bit quantization, thereby effectively reducing the GPU memory footprint of large models, addressing the technical pain point of excessive precision loss in low-bit quantization, and breaking through the limitations of traditional real-domain quantization.
Inventors
- CHEN XUE
- CHEN ZEYU
- CHEN SENYANG
Assignees
- 神笔马良人工智能(杭州)有限公司 (Shenbi Maliang Artificial Intelligence (Hangzhou) Co., Ltd.)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2026-01-07
Claims (10)
- 1. An inference acceleration method for a lightweight large model, characterized by comprising the following steps: S1, performing complex-domain enhanced ultra-low-bit quantization, including complex-domain characterization conversion of real-valued parameters, task-aware dynamic bit allocation, complex-domain quantization, and precision calibration, specifically including: S11, inputting the real-valued weight parameters and parameter-distribution statistics of a large-model Transformer layer, converting the real-valued weight parameters into complex-domain parameters containing amplitude and phase components through a complex-domain mapping function, introducing a regularization term to constrain the mapping process and preserve parameter information, and outputting the complex-domain parameters, amplitude components, and phase components; S12, constructing a bit-allocation model based on reinforcement learning, taking as inputs the network-layer type, the task type, the amplitude and phase distribution characteristics from S11, and the hardware resource occupancy rate, and outputting a quantization bit configuration scheme for each Transformer layer; S13, taking as inputs the amplitude and phase components from S11, the quantization bit configuration scheme from S12, and a validation data set, quantizing the amplitude and phase respectively, then inverse-quantizing to obtain real-valued weight parameters, evaluating the precision and parameter-distribution difference against the validation data set, dynamically adjusting the quantization granularity, and outputting the quantized real-valued weight parameters and the final quantization configuration scheme; and S2, constructing a three-layer architecture comprising a resource layer, a scheduling layer, and an application layer, taking as inputs the quantized real-valued weight parameters from S13, the user inference request, and hardware state information, completing task execution in the pre-filling and decoding stages through three-layer collaborative scheduling, performing heterogeneous collaborative inference based on PD separation, and outputting an inference result and a task execution state.
- 2. The inference acceleration method for a lightweight large model according to claim 1, wherein in step S11 the conversion logic of the complex-domain mapping function comprises: taking the ratio of the absolute value of each real-valued weight parameter to the maximum absolute value as the amplitude component; obtaining the phase component by applying an arctangent operation to the ratio of each row mean to each column mean of the real-valued weight parameters; and obtaining the regularization term by combining the L2-norm ratio of the complex-domain parameters to the original real-valued weight parameters with a preset regularization coefficient, so as to constrain the parameter energy difference before and after mapping.
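As a concrete illustration of the mapping in claim 2, the following sketch (not from the patent; the function names, the use of `arctan2` as a sign-robust arctangent, and the exact regularization form are assumptions) converts a real weight matrix into amplitude and phase components:

```python
import numpy as np

def complex_domain_map(W, reg_coeff=0.01):
    """Map a real weight matrix to complex-domain parameters.

    Illustrative sketch of the conversion logic in claim 2; the
    regularization coefficient value is an assumption.
    """
    max_abs = np.max(np.abs(W))
    amplitude = np.abs(W) / max_abs                 # ratio to max |W|, in [0, 1]
    # Phase: arctangent of the ratio of row means to column means;
    # arctan2 is used here as a sign-robust variant (an assumption),
    # and broadcasting expands it to the full matrix shape.
    row_mean = W.mean(axis=1, keepdims=True)        # shape (rows, 1)
    col_mean = W.mean(axis=0, keepdims=True)        # shape (1, cols)
    phase = np.arctan2(row_mean, col_mean)          # broadcasts to W.shape
    Z = amplitude * np.exp(1j * phase)              # complex-domain parameters
    # Regularization term: L2-norm ratio of the complex-domain parameters
    # to the original weights, scaled by a preset coefficient, constraining
    # the energy difference before and after mapping.
    reg = reg_coeff * (np.linalg.norm(Z) / np.linalg.norm(W))
    return Z, amplitude, phase, reg
```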
- 3. The inference acceleration method for a lightweight large model according to claim 1, wherein the training process of the bit-allocation model in S12 comprises: collecting a diversified task data set, and constructing a training sample set from the weight parameters, distribution characteristics, precision loss rates under different bit configurations, and resource occupancy rates of the corresponding large model; adopting an improved deep Q-network architecture, in which the encoded network-layer type and task type are concatenated with the normalized distribution characteristics and resource occupancy rate to form the input layer; optimizing the model parameters through a temporal-difference loss function, and completing training with a preset learning rate, batch size, and number of iterations; the reward function is a weighted sum in which a precision weight multiplies a precision-guarantee coefficient and an efficiency weight multiplies a resource-saving coefficient.
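The state encoding and reward described in claim 3 might look roughly as follows; the category lists, the normalization scheme, and the 0.6/0.4 weights are illustrative assumptions, not values disclosed in the patent:

```python
import numpy as np

# Assumed category sets for one-hot encoding (not from the patent).
LAYER_TYPES = ["attention", "ffn", "embedding"]
TASK_TYPES = ["generation", "classification"]

def encode_state(layer_type, task_type, dist_features, resource_occupancy):
    """Build the DQN input vector: one-hot layer/task codes concatenated
    with normalized distribution features and the resource occupancy rate."""
    layer_onehot = np.eye(len(LAYER_TYPES))[LAYER_TYPES.index(layer_type)]
    task_onehot = np.eye(len(TASK_TYPES))[TASK_TYPES.index(task_type)]
    feats = np.asarray(dist_features, dtype=float)
    feats = feats / (np.linalg.norm(feats) + 1e-8)  # simple normalization
    return np.concatenate([layer_onehot, task_onehot, feats,
                           [float(resource_occupancy)]])

def reward(precision_coeff, resource_coeff, w_prec=0.6, w_eff=0.4):
    """Reward = precision weight x precision-guarantee coefficient
    + efficiency weight x resource-saving coefficient (claim 3).
    The 0.6/0.4 weights are assumed for illustration."""
    return w_prec * precision_coeff + w_eff * resource_coeff
```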
- 4. The inference acceleration method for a lightweight large model according to claim 1, wherein in S13 the quantization processing comprises non-uniform quantization of the amplitude components and uniform quantization of the phase components: the amplitude components are statistically divided into high-value and low-value regions and quantized at fine and coarse granularity respectively; the phase components are uniformly quantized after the quantization step is determined from their value range and the number of quantization levels; and the quantization logic is to round the ratio of a component to the quantization step and multiply the rounded value by the quantization step.
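The round-to-step quantization logic of claim 4 can be sketched as follows; the region threshold and the fine/coarse level counts are assumed for illustration and are not disclosed in the patent:

```python
import numpy as np

def quantize_uniform(x, step):
    """Claim 4 quantization logic: round the ratio of the value to the
    quantization step, then multiply back by the step."""
    return np.round(x / step) * step

def quantize_amplitude(amp, threshold=0.5, fine_levels=16, coarse_levels=4):
    """Non-uniform amplitude quantization: the high-value region gets fine
    granularity, the low-value region coarse granularity. The threshold
    and level counts here are illustrative assumptions."""
    fine_step = 1.0 / (fine_levels - 1)
    coarse_step = 1.0 / (coarse_levels - 1)
    return np.where(amp >= threshold,
                    quantize_uniform(amp, fine_step),
                    quantize_uniform(amp, coarse_step))

def quantize_phase(phase, levels=8):
    """Uniform phase quantization: the step is the phase value range
    [-pi, pi] divided by the number of quantization intervals."""
    step = (2 * np.pi) / (levels - 1)
    return quantize_uniform(phase, step)
```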
- 5. The inference acceleration method for a lightweight large model according to claim 1, wherein in step S13 the precision-calibration logic comprises: recovering real-valued weight parameters by inverse quantization from the quantized amplitude components, the maximum absolute value of the original real-valued weight parameters, and the signs; calculating the distribution difference and inference accuracy between the original real-valued weight parameters and the inverse-quantized parameters; and, if the difference or the accuracy loss exceeds a preset threshold, adjusting the quantization granularity and re-quantizing until the requirements are met.
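A minimal sketch of the inverse-quantization and calibration loop of claim 5, assuming hypothetical `quantize_fn` and `acc_fn` callables, a relative-norm distribution difference, and illustrative thresholds (none of these specifics come from the patent):

```python
import numpy as np

def dequantize(amp_q, max_abs, sign):
    """Claim 5 inverse quantization: recover real-valued weights from the
    quantized amplitudes, the original max absolute value, and the signs."""
    return sign * amp_q * max_abs

def calibrate(W, quantize_fn, acc_fn, diff_thresh=0.05, acc_thresh=0.02,
              granularities=(4, 8, 16, 32)):
    """Re-quantize at progressively finer granularity until both the
    distribution difference and the accuracy loss fall under the preset
    thresholds; returns the best effort if none qualifies."""
    max_abs = np.max(np.abs(W))
    sign = np.sign(W)
    for levels in granularities:
        amp_q = quantize_fn(np.abs(W) / max_abs, levels)
        W_hat = dequantize(amp_q, max_abs, sign)
        # Distribution difference as a relative Frobenius norm (assumed metric).
        dist_diff = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
        acc_loss = acc_fn(W_hat)   # hypothetical validation-set accuracy loss
        if dist_diff <= diff_thresh and acc_loss <= acc_thresh:
            return W_hat, levels
    return W_hat, levels  # finest granularity tried
```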
- 6. The inference acceleration method for a lightweight large model according to claim 1, wherein in S2 the collaborative scheduling logic of the three-layer architecture comprises: the resource layer receives scheduling instructions, deploys pre-filling tasks to the GPU cluster and decoding tasks to the FPGA cluster, and transmits data and feeds back hardware state through a high-speed interconnect module; the scheduling layer builds a dynamic scheduling model based on reinforcement learning, takes as inputs the inference tasks from the application layer and the hardware state from the resource layer, and realizes load-aware scheduling through task classification and ordering, adaptive batch adjustment, and elastic resource allocation; and the application layer receives user requests, invokes the quantization tool to execute S1, forwards inference tasks, and feeds back inference results and abnormality alarms.
- 7. The inference acceleration method for a lightweight large model according to claim 6, wherein the dynamic scheduling model adopts a PPO reinforcement learning architecture; the input layer is the concatenated features of task type, priority, batch size, hardware state, and task execution state; parameters are optimized through a mixed loss function comprising policy loss, value loss, and entropy loss; adaptive batch adjustment dynamically adjusts the batch size in the pre-filling stage according to GPU utilization; and elastic resource allocation dynamically migrates tasks between the FPGA and CPU clusters according to the FPGA cache occupancy rate under high-concurrency scenarios.
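The mixed loss of claim 7 can be written, under the common clipped-surrogate PPO formulation with default coefficients (an assumption; the patent does not disclose its exact formulation or coefficient values), as:

```python
import numpy as np

def ppo_mixed_loss(ratio, advantage, value_pred, value_target, entropy,
                   clip_eps=0.2, c_value=0.5, c_entropy=0.01):
    """Mixed loss = clipped policy loss + value loss - entropy bonus.

    Sketch of the claim 7 loss; clip_eps / c_value / c_entropy are the
    widely used PPO defaults, assumed here for illustration.
    """
    # Clipped surrogate policy objective (negated, since we minimize).
    policy_loss = -np.minimum(
        ratio * advantage,
        np.clip(ratio, 1 - clip_eps, 1 + clip_eps) * advantage)
    # Squared-error value loss against the return target.
    value_loss = (value_pred - value_target) ** 2
    # Entropy enters with a negative sign to encourage exploration.
    return np.mean(policy_loss + c_value * value_loss - c_entropy * entropy)
```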
- 8. An inference acceleration system for a lightweight large model, characterized by comprising a complex-domain quantization module, a heterogeneous resource layer, a dynamic scheduling module, and an application layer, the modules cooperating to form a closed-loop system: the complex-domain quantization module is configured to perform the complex-domain enhanced ultra-low-bit quantization step of any one of claims 1 to 5, taking as inputs real-valued weight parameters, parameter-distribution statistics, and a validation data set, and outputting quantized real-valued weight parameters and a final quantization configuration scheme; the heterogeneous resource layer comprises a GPU cluster, an FPGA cluster, and a high-speed interconnect module, and is configured to receive scheduling instructions, complete task execution in the pre-filling and decoding stages, and output intermediate inference results, the final inference token sequence, and hardware state information; the dynamic scheduling module is configured to build a scheduling model based on reinforcement learning, take as inputs the inference tasks from the application layer and the hardware state information from the heterogeneous resource layer, and output scheduling instructions to realize load-aware dynamic scheduling; the application layer is configured to receive user inference requests, invoke the complex-domain quantization module to execute the quantization flow, forward inference tasks, and feed back inference results, task execution reports, and abnormality alarms; the output of the complex-domain quantization module is transmitted to the heterogeneous resource layer and the dynamic scheduling module, and the hardware state information of the heterogeneous resource layer is transmitted to the dynamic scheduling module, forming a logical closed loop.
- 9. The inference acceleration system for a lightweight large model according to claim 8, wherein the complex-domain quantization module comprises a complex-domain mapping unit, a bit-allocation unit, and a quantization-calibration unit: the complex-domain mapping unit performs complex-domain characterization conversion of the real-valued parameters and outputs the complex-domain parameters and the amplitude and phase components; the bit-allocation unit deploys the bit-allocation model and outputs the quantization bit configuration scheme; and the quantization-calibration unit performs non-uniform amplitude quantization and uniform phase quantization, adjusting the quantization granularity through inverse quantization and precision evaluation.
- 10. The inference acceleration system for a lightweight large model according to claim 8, wherein the FPGA cluster of the heterogeneous resource layer integrates a dedicated complex-domain quantization operation unit supporting multi-bit complex multiply-accumulate operations, and the high-speed interconnect module adopts PCIe and InfiniBand interconnect technologies together with a data compression algorithm to realize low-latency data transmission between the GPU and FPGA clusters.
Description
Inference acceleration method and system for lightweight large model
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to an inference acceleration method and system for a lightweight large model.
Background
With the development of large-model technology, large models such as GLM-4 and Llama 3 show excellent performance in natural language processing tasks, but their parameter counts are large, generally reaching billions to hundreds of billions, so the inference process suffers from high memory occupation, large inference latency, and high hardware cost, which limits the industrial deployment of large models. To solve these problems, the prior art has mainly pursued two directions: model compression (e.g., quantization and pruning) and inference architecture optimization (e.g., PD separation). Low-bit quantization: mainstream schemes such as GPTQ, AWQ, and TensorRT-LLM quantization are limited to the real number domain and employ either uniform or simple non-uniform quantization strategies. In ultra-low-bit scenarios of INT4 and below, model performance drops sharply due to discretization error, with precision loss generally exceeding 5%; existing schemes rely mainly on static bit allocation without dynamic adjustment for the task type, making precision and efficiency difficult to balance. Low-bit quantization refers to a model compression technique that converts high-precision weight parameters (such as 32-bit floating-point numbers) into low-precision representations (such as INT4 and INT2) to reduce memory occupation and computation cost. PD separation architecture: existing schemes (such as vLLM and Text Generation Inference) implement only the physical separation of Prefill-Decode Split (PD separation for short).
PD separation is an architecture-level optimization for large-model inference that divides the inference process into a pre-filling stage (P stage) and a decoding stage (D stage). The P stage mainly completes token embedding and initial attention computation over the input text and is compute-intensive; the D stage mainly generates subsequent tokens, relies on the KV cache to store intermediate results, and is memory-intensive. Existing schemes deploy on homogeneous hardware (such as pure GPU) without fine-grained adaptation to the differing resource requirements of the two stages, and adopt static batch scheduling, yielding latency fluctuation above 15% and poor stability under high-concurrency scenarios. Heterogeneous inference architectures: most existing schemes are simple CPU-GPU cooperation without integrated dedicated quantization operation units, and are not deeply fused with model compression methods, so the hardware performance ceiling is hard to reach. The prior art thus has three core defects: ultra-low-bit quantization incurs large precision loss, and the performance ceiling of real-domain quantization is hard to break through; the resource adaptation of the PD separation architecture is coarse, lacking directed matching to heterogeneous hardware, so resource utilization is low; and scheduling in high-concurrency scenarios lacks dynamic load awareness, exhibits large latency fluctuation, and struggles to meet industrial real-time requirements. Therefore, there is a need for a lightweight large-model inference acceleration method that solves these problems in the prior art.
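The two-stage split described above can be sketched with a toy interface (stand-in logic only, no real model involved): prefill builds the KV cache once over the prompt, and decode reuses and extends it per generated token, which is why the patent maps the two stages to compute-oriented and memory-oriented hardware respectively.

```python
def prefill(tokens):
    """P stage: embed the prompt and compute the initial attention state
    (compute-intensive; mapped to the GPU cluster in the patent).
    Each (key, value) pair is a stand-in for real attention tensors."""
    kv_cache = [(t, t) for t in tokens]
    return kv_cache

def decode(kv_cache, steps):
    """D stage: generate tokens one at a time, appending each new entry
    to the KV cache (memory-bound; mapped to the FPGA cluster in the
    patent). Next-token selection here is a trivial stand-in."""
    out = []
    for _ in range(steps):
        nxt = len(kv_cache)          # stand-in for next-token logic
        kv_cache.append((nxt, nxt))  # cache grows with every decode step
        out.append(nxt)
    return out
```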
Disclosure of Invention
The invention aims to provide an inference acceleration method and system for a lightweight large model that solve the technical problems of large precision loss under ultra-low-bit compression, coarse system resource adaptation, and performance fluctuation in high-concurrency scenarios, while being simple in structure and convenient to use, so as to solve the problems in the background art. To achieve the above purpose, the present invention provides the following technical solution: an inference acceleration method for a lightweight large model comprises the following steps: S1, performing complex-domain enhanced ultra-low-bit quantization, including complex-domain characterization conversion of real-valued parameters, task-aware dynamic bit allocation, complex-domain quantization, and precision calibration, specifically including: S11, inputting the real-valued weight parameters and parameter-distribution statistics of a large-model Transformer layer, converting the real-valued weight parameters into complex-domain parameters containing amplitude and phase components through a complex-domain mapping function, introducing a regularization term to constrain the mapping process and preserve parameter information, and outputting the complex-domain parameters, amplitude components, and phase components