CN-121997985-A - System and method for fine-tuning rotated outlier-free large language models to enable efficient weight activation quantization
Abstract
A computing device comprising at least one processor, one or more non-transitory computer-readable storage media, and a system for fine-tuning a large language model under low-bit weight-activation quantization. The computing device also includes a Graphics Processing Unit (GPU), a Neural Processing Unit (NPU), or a Tensor Processing Unit (TPU). A hardware interface module of the system loads the low-bit model representation from a storage module and transmits it to the GPU, NPU, or TPU for inference and execution.
Inventors
- HUANG XIJI
- ZHENG GUANGTING
- LIU ZECHUN
- LIU SHIYANG
Assignees
- The Hong Kong University of Science and Technology
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-10-08
- Priority Date: 2024-11-07
Claims (20)
- 1. A system for fine-tuning a large language model under low-bit weight-activation quantization, the system comprising: a storage module for storing a structured representation associated with a model object, the structured representation comprising a weight matrix, an activation signal, a rotation parameter, and an intermediate or final representation; a tensor structuring module for receiving the structured representation of the model object from the storage module and processing the structured representation into a structured digital format suitable for computer processing, wherein the structured digital format is stored in the storage module; a model initialization module for retrieving the structured digital format from the storage module and modifying an internal normalization component of the model object to maintain computational invariance during rotation; a rotation configuration module for retrieving the modified model object from the storage module, applying an orthogonal rotation transformation to the weight matrix and the activation signal of the model object, and storing the rotated model object in the storage module; a rotation-aware fine-tuning module for retrieving the rotated model object and applying a low-rank adaptation to the model object according to a selected fine-tuning strategy, wherein the low-rank adaptation comprises inserting and training a low-rank matrix while keeping base weights unchanged; a quantization module for retrieving the fine-tuned model object and quantizing the weights and activations using at least one quantization strategy to generate a low-bit model representation stored in the storage module; and a hardware interface module for exporting the low-bit model representation stored in the storage module to an inference system for deployment.
- 2. The system of claim 1, wherein the orthogonal rotation transformation comprises at least one of a Hadamard matrix or a block-diagonal matrix applied to the attention and feed-forward weight matrices (an illustrative sketch of such a rotation follows the claims).
- 3. The system of claim 1, wherein the rotation configuration module is further configured to apply an inter-block rotation to the projection weight matrix and an intra-block rotation to the activation tensor after the normalization layer and before the nonlinear function.
- 4. The system of claim 1, wherein the rotation-aware fine-tuning module is further configured to operate in a post-rotation LoRA (LoRA After Rotation; LAR) mode by inserting a low-rank matrix after rotation and training the low-rank matrix in the rotated weight space (the LAR and LBR modes are sketched after the claims).
- 5. The system of claim 1, wherein the rotation-aware fine-tuning module is further operable in a pre-rotation LoRA (LoRA Before Rotation; LBR) mode by applying the low-rank adaptation before rotation and then transforming the adapted weights by rotation.
- 6. The system of claim 1, wherein the quantization module is further configured to perform per-channel symmetric quantization on the weights and per-tensor quantization on the activations (a quantization sketch follows the claims).
- 7. The system of claim 1, further comprising: an evaluation analysis module for retrieving the low-bit model representation and computing evaluation metrics including activation kurtosis and quantization error (a metric sketch follows the claims).
- 8. The system of claim 1, further comprising: a digital conversion interface module for converting floating-point tensors, rotation matrices, and low-rank vectors into a memory-aligned digital format compatible with hardware execution.
- 9. The system of claim 1, further comprising: a compatibility interface module for adapting the system to a LoRA variant by re-integrating decomposed direction and magnitude weight components into the fine-tuning process.
- 10. The system of claim 1, wherein the storage module comprises one or more non-transitory computer-readable storage media selected from the group consisting of Dynamic Random Access Memory (DRAM), Static Random Access Memory (SRAM), Flash Memory, and Solid State Drive (SSD).
- 11. A computing apparatus, comprising: at least one processor; one or more non-transitory computer-readable storage media; the system of claim 1 for fine-tuning a large language model under low-bit weight-activation quantization; and a Graphics Processing Unit (GPU), a Neural Processing Unit (NPU), or a Tensor Processing Unit (TPU); wherein the hardware interface module of the system is configured to load the low-bit model representation from the storage module and transmit the low-bit model representation to the GPU, NPU, or TPU for performing inference.
- 12. A method for fine-tuning a large language model under low-bit weight-activation quantization, comprising: receiving, by a tensor structuring module, a structured representation associated with a model object, the structured representation comprising a weight matrix, an activation signal, and a rotation parameter; processing, by the tensor structuring module, the structured representation into a structured digital format suitable for computer processing, and storing the structured digital format in a storage module; modifying, by a model initialization module, an internal normalization component of the model object to maintain computational invariance during rotation; applying, by a rotation configuration module, an orthogonal rotation transformation to the weight matrix and the activation signal of the model object to generate a rotated model object; applying, by a rotation-aware fine-tuning module, a low-rank adaptation to the rotated model object according to a selected fine-tuning strategy, wherein the low-rank adaptation comprises inserting and training a low-rank matrix while keeping base weights unchanged; quantizing, by a quantization module, weights and activations using at least one quantization strategy to generate a low-bit model representation; and exporting, by a hardware interface module, the low-bit model representation to an inference system for deployment.
- 13. The method of claim 12, wherein applying the orthogonal rotation transformation comprises applying at least one of a Hadamard matrix or a block-diagonal matrix to the attention and feed-forward weight matrices.
- 14. The method of claim 12, wherein applying the orthogonal rotation transformation comprises: applying an inter-block rotation to the projection weight matrix; and applying an intra-block rotation to the activation tensor after the normalization layer and before the nonlinear function.
- 15. The method of claim 12, wherein applying the low-rank adaptation comprises operating in a post-rotation LoRA (LoRA After Rotation; LAR) mode by inserting a low-rank matrix after rotation and training the low-rank matrix in the rotated weight space.
- 16. The method of claim 12, wherein applying the low-rank adaptation comprises operating in a pre-rotation LoRA (LoRA Before Rotation; LBR) mode by applying the low-rank adaptation before rotation and then transforming the adapted weights by rotation.
- 17. The method of claim 12, wherein performing the quantization comprises: performing per-channel symmetric quantization on the weights; and performing per-tensor quantization on the activations.
- 18. The method of claim 12, further comprising: evaluating the low-bit model representation using benchmark tasks, and computing evaluation metrics including activation kurtosis and quantization error.
- 19. The method of claim 12, further comprising: converting floating-point tensors, rotation matrices, and low-rank vectors into a memory-aligned digital format compatible with hardware execution.
- 20. The method of claim 12, further comprising: adapting the model object to a LoRA variant by re-integrating decomposed direction and magnitude weight components into the fine-tuning process.
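The orthogonal rotation recited in claims 2-3 and 13-14 can be pictured with a short sketch. This is a minimal illustration, not the patented implementation: it assumes a randomized Hadamard construction (one common choice of orthogonal matrix) and shows that rotating a weight matrix while counter-rotating its input activation leaves the layer output unchanged, i.e. computational invariance.

```python
# Minimal sketch (illustrative, not the patented implementation): an orthogonal
# rotation R applied to a weight matrix, with the inverse rotation applied to
# the activations, keeps the layer output unchanged because R @ R.T == I.
import numpy as np
from scipy.linalg import hadamard  # requires the dimension to be a power of two

def random_hadamard(dim: int, seed: int = 0) -> np.ndarray:
    """Orthogonal rotation: normalized Hadamard matrix with random sign flips."""
    rng = np.random.default_rng(seed)
    H = hadamard(dim).astype(np.float64) / np.sqrt(dim)     # H @ H.T == I
    signs = np.diag(rng.choice([-1.0, 1.0], size=dim))
    return signs @ H

dim = 64
R = random_hadamard(dim)
W = np.random.randn(128, dim)            # a projection weight (out_features x in_features)
x = np.random.randn(dim)                 # an activation vector

y_original = W @ x
y_rotated = (W @ R) @ (R.T @ x)          # rotate weights, counter-rotate activations
assert np.allclose(y_original, y_rotated)  # output is invariant under the rotation
```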
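Claims 4-5 and 15-16 distinguish LoRA After Rotation (LAR) from LoRA Before Rotation (LBR). The sketch below is an assumed, simplified rendering of that distinction: in LAR the low-rank factors live in the rotated weight space, while in LBR the adaptation is applied first and the adapted weight is then rotated. Function and variable names are illustrative.

```python
# Illustrative sketch of the two fine-tuning modes (not the patent's exact formulation).
import numpy as np

def lora_after_rotation(W, R, A, B):
    """LAR: rotate the frozen base weight first, then add the low-rank update."""
    return W @ R + B @ A            # B @ A is trained in the rotated space

def lora_before_rotation(W, R, A, B):
    """LBR: add the low-rank update in the original space, then rotate."""
    return (W + B @ A) @ R          # the adapted weight is transformed by R

d_out, d_in, rank = 8, 8, 2
W = np.random.randn(d_out, d_in)                    # frozen base weight
Q, _ = np.linalg.qr(np.random.randn(d_in, d_in))    # any orthogonal rotation
A = np.random.randn(rank, d_in) * 0.01              # LoRA down-projection (trainable)
B = np.random.randn(d_out, rank) * 0.01             # LoRA up-projection (normally zero-initialized;
                                                    # nonzero here only to show the difference)
same = np.allclose(lora_after_rotation(W, Q, A, B),
                   lora_before_rotation(W, Q, A, B))
print(same)  # False: the two modes place the low-rank update in different spaces
```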
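Claims 6 and 17 specify per-channel symmetric quantization for weights and per-tensor quantization for activations. The following sketch shows only the basic fake-quantization arithmetic under those strategies; it omits calibration, clipping search, and integer kernels, and the bit-widths shown are assumptions.

```python
# Simplified fake-quantization sketch for the strategies named in claims 6 and 17.
import numpy as np

def quantize_weights_per_channel(W, n_bits=4):
    """Symmetric quantization with one scale per output channel (row)."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax    # one scale per row
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale                                        # dequantized ("fake-quant") weights

def quantize_activations_per_tensor(x, n_bits=8):
    """Symmetric quantization with a single scale shared by the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

W = np.random.randn(16, 16)
x = np.random.randn(4, 16)
w_err = np.abs(W - quantize_weights_per_channel(W, 4)).mean()
a_err = np.abs(x - quantize_activations_per_tensor(x, 8)).mean()
print(f"mean weight quantization error: {w_err:.4f}, activation error: {a_err:.4f}")
```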
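Claims 7 and 18 mention evaluation metrics including activation kurtosis and quantization error. A minimal sketch of plausible metric definitions follows; the exact formulas are not specified by the patent, so excess kurtosis and mean-squared error are assumed here as representative choices.

```python
# Assumed metric definitions for the evaluation-analysis module of claims 7 and 18.
import numpy as np

def activation_kurtosis(x: np.ndarray) -> float:
    """Excess kurtosis; heavy-tailed (outlier-prone) distributions score above 0."""
    x = x.ravel()
    mu, sigma = x.mean(), x.std()
    return float(((x - mu) ** 4).mean() / sigma ** 4 - 3.0)

def quantization_mse(x_fp: np.ndarray, x_q: np.ndarray) -> float:
    """Mean-squared error between a full-precision tensor and its quantized version."""
    return float(((x_fp - x_q) ** 2).mean())

acts = np.random.standard_t(df=3, size=10_000)   # heavy-tailed stand-in for LLM activations
print(f"activation kurtosis: {activation_kurtosis(acts):.2f}")
```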
Description
System and method for fine-tuning rotated outlier-free large language models to enable efficient weight activation quantization
Technical Field
The present invention relates to model compression and quantization techniques for large-scale neural networks, and more particularly to a system and method for fine-tuning a rotated, outlier-free large language model (LLM) to achieve efficient weight-activation quantization under low-bit conditions, thereby improving accuracy and reducing quantization error.
Background
Large language models (LLMs) such as GPT-4 and LLaMA have achieved significant success across a variety of tasks. However, their ever-increasing size and training cost have driven the development of model compression and parameter-efficient fine-tuning (PEFT) methods. Low-rank adaptation (LoRA) has become a widely used PEFT technique that improves fine-tuning efficiency by updating only a limited set of parameters. In recent years, quantization techniques that convert high-precision parameters into low-bit formats such as INT4 have been combined with LoRA methods. Existing quantized LoRA schemes can reduce memory cost during fine-tuning, and some can even reduce inference cost by directly producing a quantized LLM. However, these methods perform weight-only quantization; weight-activation quantization in the LoRA setting has not been studied. Quantizing both weights and activations to low bit-widths further reduces run-time GPU memory and accelerates computationally intensive matrix multiplications. It has been found that applying 4-bit or 6-bit weight-activation quantization to LoRA fine-tuned LLMs still causes a significant drop in accuracy, because outliers in the weight and activation distributions enlarge the quantization range and increase the quantization error. Existing post-training quantization research has attempted to solve the outlier problem through mixed-precision decomposition or by migrating outliers from activations to weights. More recently, it has been shown that rotating the weight matrices of an LLM effectively eliminates activation outliers while maintaining computational invariance. However, these methods address the problem only after training and ignore the outliers and distribution shifts that arise during pre-training and fine-tuning. There is therefore a need for a system and method that enables efficient low-bit weight-activation quantization by eliminating outliers during the fine-tuning of large language models and that remains robust throughout the fine-tuning process.
Disclosure of Invention
It is an object of the present invention to provide a system and method that addresses the above-identified deficiencies and unmet needs in the art. The invention provides a method called rotated outlier-free low-rank adaptation (RoLoRA), a LoRA-based scheme for efficient weight-activation quantization. RoLoRA uses rotation to eliminate outliers and proposes rotation-aware fine-tuning to preserve the outlier-free property of the rotated LLM. Experimental results indicate that RoLoRA consistently improves the convergence of low-bit LoRA fine-tuning and the robustness of post-training quantization under the weight-activation setting.
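The flow just described (rotate the base weights to suppress outliers, attach low-rank adapters in the rotated space, fine-tune only the adapters, then quantize weights and activations) can be summarized in a high-level sketch. This is an assumed structure for illustration, not the patent's reference code; names such as rolora_pipeline are hypothetical.

```python
# High-level sketch (assumed structure) of a RoLoRA-style flow:
# rotate -> attach LoRA in the rotated space -> (fine-tune adapters) -> quantize.
import numpy as np

def rolora_pipeline(weights, rotation, lora_rank=8, n_bits=4):
    """weights: dict of name -> 2-D array (out x in); rotation: orthogonal (in x in)."""
    rotated = {name: W @ rotation for name, W in weights.items()}       # outlier-suppressing rotation
    adapters = {name: (np.random.randn(lora_rank, W.shape[1]) * 0.01,   # A (trainable)
                       np.zeros((W.shape[0], lora_rank)))               # B (trainable, zero-init)
                for name, W in rotated.items()}
    # ... fine-tuning would update only the (A, B) factors here, base weights stay frozen ...
    qmax = 2 ** (n_bits - 1) - 1
    quantized = {}
    for name, W in rotated.items():
        A, B = adapters[name]
        merged = W + B @ A                                   # merge LoRA into the rotated weight
        scale = np.abs(merged).max(axis=1, keepdims=True) / qmax
        quantized[name] = np.clip(np.round(merged / scale), -qmax - 1, qmax) * scale
    return quantized

dim = 64
R, _ = np.linalg.qr(np.random.randn(dim, dim))               # stand-in orthogonal rotation
weights = {"q_proj": np.random.randn(dim, dim), "k_proj": np.random.randn(dim, dim)}
low_bit = rolora_pipeline(weights, R, lora_rank=4, n_bits=4)
```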
RoLoRA has been evaluated on the LLaMA-7B/13B and LLaMA3-8B models; on commonsense reasoning tasks it achieves an absolute accuracy improvement of up to 29.5% over the LoRA baseline on LLaMA-13B under 4-bit weight-activation quantization. RoLoRA has also been demonstrated on large multimodal models, including LLaVA-1.5-7B. According to a first aspect of the present invention, a system is provided for fine-tuning a large language model under low-bit weight-activation quantization. The system comprises a storage module, a tensor structuring module, a model initialization module, a rotation configuration module, a rotation-aware fine-tuning module, a quantization module, and a hardware interface module. The storage module stores the structured representations associated with model objects, including weight matrices, activation signals, rotation parameters, and intermediate or final representations. The tensor structuring module receives the structured representation of the model object from the storage module and processes it into a structured digital format suitable for computer processing, which is stored in the storage module. The model initialization module retrieves the structured digital format from the storage module and modifies the internal normalization components of the model object to maintain computational invariance during rotation. The rotation configuration module retrieves the modified model object from the storage module, applies an orthogonal rotation transformation to the weight matrices and activation signals of the model object, and stores the rotated model object in the storage module.
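The model initialization module's modification of the internal normalization component is not spelled out in this excerpt, but in rotation-based quantization schemes it typically means folding the RMSNorm scale vector into the adjacent projection weights so that the normalization becomes scale-free and commutes with the rotation. The sketch below illustrates that assumption; the patent may implement the modification differently.

```python
# Minimal sketch, assuming the normalization modification fuses the RMSNorm scale
# into the next linear layer's weight so the layer output is preserved.
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    return gamma * x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def fuse_norm_scale(W_next, gamma):
    """Fold the RMSNorm scale gamma into the following linear weight."""
    return W_next * gamma[np.newaxis, :], np.ones_like(gamma)   # fused weight, unit-scale norm

dim = 16
x = np.random.randn(4, dim)
gamma = np.random.rand(dim) + 0.5
W_next = np.random.randn(32, dim)

y_before = rms_norm(x, gamma) @ W_next.T
W_fused, gamma_unit = fuse_norm_scale(W_next, gamma)
y_after = rms_norm(x, gamma_unit) @ W_fused.T
assert np.allclose(y_before, y_after)   # fusion preserves the layer output
```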