
CN-121997991-A - Large language model quantization quality assessment system


Abstract

The invention discloses a large language model quantization quality assessment system comprising a resource analysis end, a quantization execution end and an evaluation analysis end. The resource analysis end collects a device data set, determines the hardware resource constraints of the edge device, and classifies the sensitivity of each functional layer of the large language model to quantization to obtain a sensitivity ordering. The quantization execution end establishes a mapping rule to perform precision-level mapping, determines a final allocation scheme and quantization parameters by iteratively balancing quantization precision loss against resource consumption, constructs a mixed-precision quantization operator library, determines the compilation toolchain of the edge device, and builds a quantization model of the large language model. The evaluation analysis end tests different tasks to evaluate the quality of the quantization model on the edge device and judge whether it meets the standard, carries out precision adjustment iteration and resource pruning optimization of the quantization model for the items that fail the evaluation, monitors key indicators, and stores the system data set.

Inventors

  • WANG JIAXIANG
  • XU WENTAO

Assignees

  • Northwestern Polytechnical University (西北工业大学)

Dates

Publication Date
2026-05-08
Application Date
2025-12-22

Claims (10)

  1. A large language model quantization quality assessment system, characterized by comprising a resource analysis end, a quantization execution end and an evaluation analysis end, wherein: the resource analysis end is used for collecting a device data set, defining the hardware resource constraints of the edge device, and classifying the sensitivity of each functional layer of the large language model to quantization to obtain a sensitivity ordering; the quantization execution end establishes a mapping rule based on the device data set and the sensitivity ordering, performs precision-level mapping, and obtains a preliminary allocation scheme; it analyzes the quantization type and quantization parameters of each functional layer, constructs a mixed-precision quantization operator library, determines the compilation toolchain of the edge device, configures the compilation options, and builds a quantization model of the large language model; the evaluation analysis end tests different tasks based on the quantization model and a standard test set to evaluate the quality of the quantization model on the edge device and judge whether it meets the standard, carries out precision adjustment iteration and resource pruning optimization of the quantization model for the items that fail the evaluation, and at the same time monitors key indicators in real time and stores the system data set.
  2. The large language model quantization quality assessment system according to claim 1, wherein the resource analysis end comprises an edge device requirement analysis module; the edge device requirement analysis module comprises a hardware resource parameter acquisition unit and a model performance requirement definition unit, wherein: the hardware resource parameter acquisition unit is used for acquiring the device data set of the designated edge device, the device data set comprising memory data, storage data and power consumption data, and for transmitting the acquired device data set to the model sensitivity analysis module and the resource optimization iteration module; the model performance requirement definition unit is used for specifying the model of the large language model, setting a quantization precision data loss threshold, an edge device inference latency threshold, an edge device memory occupation threshold and an edge device power consumption threshold, and sending these four thresholds to the quantization quality evaluation module of the evaluation analysis end.
  3. The large language model quantization quality assessment system according to claim 1, wherein the resource analysis end further comprises a model sensitivity analysis module; the model sensitivity analysis module comprises a feature extraction unit and a sensitivity calculation unit, wherein: the feature extraction unit is used for classifying the functional layers of the large language model by type, extracting the parameter scale and computation amount of each functional layer, and marking the functional layers whose share of the total computation exceeds a preset proportion as compute-intensive layers; it acquires the data distribution of the input and output tensors of each functional layer, records the data flow of each functional layer during inference, produces a large language model layer analysis report, and sends the report to the sensitivity calculation unit, wherein the functional layer types comprise an embedding layer, a multi-head attention layer, a feed-forward network layer, a layer normalization layer and an output layer; the sensitivity calculation unit quantitatively calculates the sensitivity of each functional layer to quantization based on the large language model layer analysis report, and classifies the sensitivity levels to obtain the sensitivity ordering.
  4. The large language model quantization quality assessment system according to claim 3, wherein the sensitivity calculation unit quantitatively calculating the sensitivity of each functional layer to quantization based on the large language model layer analysis report, and classifying the sensitivity levels to obtain the sensitivity ordering, comprises:
     A1. taking FP32 as the original precision reference and setting each functional layer as the target layer in turn, wherein the set of test precisions comprises INT8, INT4 and FP16;
     A2. quantizing the target layer to the specified precision while keeping all other functional layers at FP32, and executing an isolated quantization test on the target layer;
     A3. feeding the standard test set during the isolated quantization test and obtaining the output feature tensors of the target layer before and after quantization;
     A4. calculating the sensitivity of each layer from the output feature tensors of the target layer before and after quantization to obtain a sensitivity coefficient; the sensitivity coefficient is calculated by a formula (not reproduced in this text) whose inputs are the output feature tensor of the target layer at the original FP32 precision and the output feature tensor of the target layer when it is quantized to the specified precision;
     A5. repeating steps A2 to A4 to calculate the sensitivity coefficient of each functional layer at the specified precision;
     A6. judging a functional layer to be a high-sensitivity layer when its sensitivity coefficient is greater than or equal to 0.1, a medium-sensitivity layer when its sensitivity coefficient is greater than or equal to 0.05 and less than 0.1, and a low-sensitivity layer when its sensitivity coefficient is less than 0.05, and marking the recommended precision range of each functional layer to obtain the sensitivity ordering.
  5. The large language model quantization quality assessment system according to claim 1, wherein the quantization execution end comprises a mixed-precision allocation module; the mixed-precision allocation module comprises a precision-level mapping unit and an allocation strategy optimization unit, wherein: the precision-level mapping unit establishes a mapping rule based on the device data set and the sensitivity ordering, namely, FP16 is preferentially allocated to high-sensitivity layers, and if the memory occupation exceeds the edge device memory occupation threshold, the high-sensitivity layer is downgraded to INT8 and marked for quality monitoring; the allocation strategy optimization unit determines a final allocation scheme, based on the preliminary allocation scheme, by iteratively balancing quantization precision loss against resource consumption.
  6. The large language model quantization quality assessment system according to claim 5, wherein the allocation strategy optimization unit determining the final allocation scheme, based on the preliminary allocation scheme, by iteratively balancing quantization precision loss against resource consumption, comprises:
     B1. calculating an allocation priority from the sensitivity coefficient and the resource consumption rate of the functional layer at the specified precision (the formula is not reproduced in this text);
     B2. adjusting the resource overrun items in the preliminary allocation scheme according to the descending order of allocation priority, preferentially reducing the precision of the functional layer with the lowest allocation priority, wherein the resource overrun items comprise: precision data loss exceeding the quantization precision data loss threshold, edge device inference latency exceeding the edge device inference latency threshold, edge device memory occupation exceeding the edge device memory occupation threshold, and edge device power consumption exceeding the edge device power consumption threshold;
     B3. when the precision loss in the preliminary allocation scheme exceeds the threshold, raising precision according to the ascending order of allocation priority;
     B4. calculating a comprehensive score for the schemes optimized in B2 and B3 from the precision loss rate and the resource overrun rate (the formula is not reproduced in this text);
     B5. repeating steps B2 to B3 and, after several iterations, selecting the optimized scheme with the highest comprehensive score as the final allocation scheme.
  7. The large language model quantization quality assessment system according to claim 1, wherein the quantization execution end further comprises a quantization parameter calculation module; the quantization parameter calculation module comprises a symmetric quantization parameter calculation unit and an asymmetric quantization parameter calculation unit; the symmetric quantization parameter calculation unit, for functional layers of the large language model whose data distribution is symmetric, analyzes the numerical range of the input data, extracts the maximum absolute value of the feature tensor in the input data to determine the dynamic range boundary of the input data, determines the mapping ratio between floating-point data and integers based on the quantization precision, randomly selects samples from the input data, computes the error between the input data and the recovered data by quantizing the floating-point values to integers and dequantizing the integers back to floating point, and, if the error is greater than an error threshold, adjusts how the maximum absolute value is calculated so as to remove extreme outliers, recalculates the ratio, and outputs the quantization precision and the scaling factor to obtain the symmetric quantization parameters; the asymmetric quantization parameter calculation unit, for functional layers of the large language model whose data distribution is asymmetric, analyzes the numerical distribution of the input data, extracts the maximum absolute value of the feature tensor in the input data to determine the complete floating-point value interval, determines the corresponding integer value range based on the quantization precision, determines the total length of the integer range, determines the scaling factor and the zero point, performs quantization and dequantization on the input data, calculates the average error between the input data and the recovered data, and, when the error is greater than a preset threshold, readjusts the value interval of the input data and recalculates the scaling factor and the zero point until the output floating-point values, the zero point and the floating-point value interval meet preset requirements, thereby obtaining the asymmetric quantization parameters.
  8. The large language model quantization quality assessment system according to claim 1, wherein the quantization execution end further comprises a quantization execution and edge adaptation module; the quantization execution and edge adaptation module comprises a mixed-precision quantization operator generation unit and an edge device compilation and adaptation unit; the mixed-precision quantization operator generation unit is used for analyzing the quantization type and quantization parameters of each functional layer based on the final allocation scheme, the symmetric quantization parameters and the asymmetric quantization parameters; for the embedding layer it generates a parameter quantization operator that quantizes 32-bit floating-point weights offline into 4-bit integers and stores them with grouped compression; for the multi-head attention layer it generates an activation quantization operator that quantizes 32-bit floating-point inputs into 16-bit floating point in real time during inference; for the feed-forward network layer it generates a fused quantized-computation operator that fuses quantization, matrix multiplication and dequantization; for the layer normalization layer it generates a numerical range constraint operator; together these form the mixed-precision quantization operator library; the edge device compilation and adaptation unit is used for determining the compilation toolchain of the edge device, configuring the compilation options, mapping the mixed-precision quantization operator library to hardware instructions, optimizing the memory use of the large language model structure, linking the quantization operators and the model weights into an executable file, generating a model inference entry function, and reducing inference time through operator scheduling optimization, thereby obtaining the quantization model of the large language model.
  9. The large language model quantization quality assessment system according to claim 1, wherein the evaluation analysis end comprises a quantization quality evaluation module; the quantization quality evaluation module comprises a quantization precision loss evaluation unit and an edge resource consumption evaluation unit; the quantization precision loss evaluation unit is used for testing three tasks, namely text generation, semantic understanding and logical reasoning, based on the quantization model and the standard test set; it tests the baseline performance of the large language model on a PC and the performance of the quantization model on the edge device, records the BLEU value, the semantic understanding accuracy and the perplexity measured on the PC and on the edge device, computes, relative to the PC results, the BLEU loss rate, the semantic understanding accuracy loss rate and the perplexity increase rate on the edge device to obtain the quantization precision data, and transmits the non-compliant items of the quantization precision data to the resource optimization iteration module; the edge resource consumption evaluation unit is used for evaluating whether the resource usage of the quantization model on the edge device complies with the constraints.
  10. The large language model quantization quality assessment system according to claim 1, wherein the evaluation analysis end further comprises a resource optimization iteration module; the resource optimization iteration module comprises a precision adjustment iteration unit and a resource pruning optimization unit; the precision adjustment iteration unit, based on the quantization precision data, checks the original precision allocation of the functional layers whose quantization precision data loss is non-compliant, raises the precision of the high-sensitivity layers, recalculates the scaling factors and zero points of the high-sensitivity layers, generates an optimized quantization model, and retests it to recalculate the BLEU loss rate, the semantic understanding accuracy loss rate and the perplexity increase rate on the edge device; the resource pruning optimization unit performs pruning optimization based on the non-compliant items output by the edge resource consumption evaluation unit: it prunes parameters of low-sensitivity layers when the memory occupancy is greater than the edge device memory occupation threshold, reduces the number of data transfers of the compute-intensive layers through operator fusion when the latency compliance rate is below the preset edge device inference latency threshold, and adjusts the CPU or GPU frequency when the average power consumption is greater than the edge device power consumption threshold, thereby obtaining a resource optimization scheme that is sent to the quantization quality evaluation module for re-evaluation.
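
The sensitivity tiering of claim 4 (step A6) and the mapping rule of claim 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the 0.1 and 0.05 thresholds and the FP16-to-INT8 downgrade of high-sensitivity layers come from the claims, while the `Layer` class, the default precisions for medium- and low-sensitivity layers, the bytes-per-parameter table and the order in which layers are downgraded are hypothetical choices made for the example.

```python
from dataclasses import dataclass

HIGH_THRESHOLD = 0.1     # claim 4, step A6
MEDIUM_THRESHOLD = 0.05  # claim 4, step A6

# Assumed storage cost per parameter for each precision (not from the source).
BYTES_PER_PARAM = {"FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

# Assumed default precision per tier; the source only fixes FP16 for "high".
DEFAULT_PRECISION = {"high": "FP16", "medium": "INT8", "low": "INT4"}


@dataclass
class Layer:
    """Hypothetical record for one functional layer."""
    name: str
    num_params: int
    sensitivity: float  # sensitivity coefficient from step A4


def sensitivity_level(coeff: float) -> str:
    """Map a sensitivity coefficient to its tier."""
    if coeff >= HIGH_THRESHOLD:
        return "high"
    if coeff >= MEDIUM_THRESHOLD:
        return "medium"
    return "low"


def preliminary_allocation(layers, memory_limit_bytes):
    """Return (plan, monitored): a precision per layer and the layers
    that were downgraded and therefore need quality monitoring."""
    plan = {l.name: DEFAULT_PRECISION[sensitivity_level(l.sensitivity)]
            for l in layers}
    monitored = []

    def footprint():
        return sum(l.num_params * BYTES_PER_PARAM[plan[l.name]] for l in layers)

    # Assumption: downgrade the least sensitive high-tier layers first,
    # and stop as soon as the plan fits the memory budget.
    high_layers = [l for l in layers if sensitivity_level(l.sensitivity) == "high"]
    for layer in sorted(high_layers, key=lambda l: l.sensitivity):
        if footprint() <= memory_limit_bytes:
            break
        plan[layer.name] = "INT8"
        monitored.append(layer.name)
    return plan, monitored


layers = [
    Layer("attention", 1000, 0.20),
    Layer("feed_forward", 2000, 0.07),
    Layer("embedding", 4000, 0.01),
    Layer("output", 1000, 0.12),
]
plan, monitored = preliminary_allocation(layers, memory_limit_bytes=7500)
```

With the sample values above, the initial plan needs 8000 bytes, so one high-sensitivity layer is downgraded to INT8 and flagged before the plan fits within the 7500-byte budget.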

Description

Large language model quantization quality assessment system

Technical Field

The invention relates to the technical field of large language model evaluation, and in particular to a large language model quantization quality assessment system.

Background

As large language models (LLMs) are increasingly applied on edge devices, quantization has become a key means of balancing model performance against hardware resources: lowering parameter precision reduces memory occupation and improves inference efficiency. Quantization quality assessment is a core link in this process. It keeps the precision loss introduced by quantization under control and ensures that the model retains its effective function in a resource-constrained environment. It is therefore an important support for deploying large language models at the edge, and it directly affects application reliability and user experience.

In existing large language model quantization techniques, evaluation schemes focus on a single dimension and do not fully combine the hardware characteristics of the edge device with the sensitivity differences between model layers. Some schemes also lack a dynamic iterative optimization mechanism, so the quantization strategy is difficult to adjust flexibly according to the evaluation results. As a result, in edge scenarios with strict resource constraints, the balance between model precision and efficiency still needs improvement, and there is room to improve adaptability and practicality.

Disclosure of Invention

The invention aims to provide a large language model quantization quality assessment system that addresses the problems in the prior art that precision and efficiency are difficult to achieve together and that adaptability and practicality need improvement.
To achieve this aim, the invention adopts the following technical scheme: a large language model quantization quality assessment system comprising a resource analysis end, a quantization execution end and an evaluation analysis end, wherein:

The resource analysis end is used for collecting a device data set, defining the hardware resource constraints of the edge device, and classifying the sensitivity of each functional layer of the large language model to quantization to obtain a sensitivity ordering.

The quantization execution end establishes a mapping rule based on the device data set and the sensitivity ordering, performs precision-level mapping, and obtains a preliminary allocation scheme. It analyzes the quantization type and quantization parameters of each functional layer, constructs a mixed-precision quantization operator library, determines the compilation toolchain of the edge device, configures the compilation options, and builds a quantization model of the large language model.

The evaluation analysis end tests different tasks based on the quantization model and a standard test set to evaluate the quality of the quantization model on the edge device and judge whether it meets the standard, carries out precision adjustment iteration and resource pruning optimization of the quantization model for the items that fail the evaluation, and at the same time monitors key indicators in real time and stores the system data set.
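
The quantization parameters mentioned above (a scaling factor for symmetric quantization, a scaling factor plus a zero point for asymmetric quantization) can be illustrated with the standard affine quantization scheme. This is a minimal sketch rather than the system's actual implementation: all function names are hypothetical, the integer ranges (signed for symmetric, unsigned for asymmetric) and the choice to include 0.0 in the asymmetric interval are assumptions, and the outlier-removal retry described in the claims is omitted.

```python
def symmetric_scale(values, bits=8):
    """Scale that maps [-max_abs, max_abs] onto [-(2^(bits-1) - 1), 2^(bits-1) - 1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    return max_abs / qmax if max_abs > 0 else 1.0


def asymmetric_params(values, bits=8):
    """Scale and zero point that map [lo, hi] onto [0, 2^bits - 1]."""
    qmin, qmax = 0, 2 ** bits - 1
    # Assumption: widen the interval so that 0.0 is exactly representable.
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) if hi > lo else 1.0
    zero_point = int(round(qmin - lo / scale))
    zero_point = max(qmin, min(qmax, zero_point))
    return scale, zero_point


def quantize(values, scale, zero_point, qmin, qmax):
    """Float -> integer, clamped to the target integer range."""
    return [max(qmin, min(qmax, int(round(v / scale)) + zero_point))
            for v in values]


def dequantize(q_values, scale, zero_point):
    """Integer -> float (the recovered data)."""
    return [(q - zero_point) * scale for q in q_values]


def roundtrip_error(values, scale, zero_point, qmin, qmax):
    """Mean absolute error between the input data and the recovered data."""
    recovered = dequantize(quantize(values, scale, zero_point, qmin, qmax),
                           scale, zero_point)
    return sum(abs(a - b) for a, b in zip(values, recovered)) / len(values)


# Symmetric case: values spread evenly around zero, signed 8-bit range.
sym_values = [-1.0, -0.5, 0.0, 0.5, 1.0]
sym_scale = symmetric_scale(sym_values, bits=8)
sym_error = roundtrip_error(sym_values, sym_scale, 0, -127, 127)

# Asymmetric case: skewed value interval, unsigned 8-bit range.
asym_values = [-1.0, 3.0]
asym_scale, asym_zero_point = asymmetric_params(asym_values, bits=8)
asym_error = roundtrip_error(asym_values, asym_scale, asym_zero_point, 0, 255)
```

For in-range values the round-trip error is bounded by half the scaling factor, which is why comparing it against a threshold is a reasonable trigger for re-estimating the value range.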
Further, the resource analysis end comprises an edge device requirement analysis module; the edge device requirement analysis module comprises a hardware resource parameter acquisition unit and a model performance requirement definition unit, wherein:

The hardware resource parameter acquisition unit is used for acquiring the device data set of the designated edge device, the device data set comprising memory data, storage data and power consumption data, and for transmitting the acquired device data set to the model sensitivity analysis module and the resource optimization iteration module.

The model performance requirement definition unit is used for specifying the model of the large language model, setting a quantization precision data loss threshold, an edge device inference latency threshold, an edge device memory occupation threshold and an edge device power consumption threshold, and sending these four thresholds to the quantization quality evaluation module of the evaluation analysis end.

Further, the resource analysis end also comprises a model sensitivity analysis module; the model sensitivity analysis module comprises a feature extraction unit and a sensitivity calculation unit, wherein: the feature extraction unit is used for classifying the functional layers of the large language model by type, extracting the parameter