US-20260127438-A1 - SYSTEM AND METHOD FOR FINE-TUNING ROTATED OUTLIER-FREE LARGE LANGUAGE MODELS FOR EFFECTIVE WEIGHT-ACTIVATION QUANTIZATION
Abstract
A computing device includes at least one processor; one or more non-transitory computer-readable storage media; and a system for fine-tuning a large language model under low-bit weight-activation quantization. The computing device further comprises a graphics processing unit (GPU), a neural processing unit (NPU), or a tensor processing unit (TPU). The hardware interface module of the system is configured to load a low-bit model representation from the memory module and transmit the model representation to the GPU, NPU, or TPU for inference execution.
Inventors
- Xijie Huang
- Kwang Ting CHENG
- Zechun Liu
- Shih-Yang LIU
Assignees
- THE HONG KONG UNIVERSITY OF SCIENCE AND TECHNOLOGY
Dates
- Publication Date
- 20260507
- Application Date
- 20251008
Claims (20)
- 1 . A system for fine-tuning a large language model under low-bit weight-activation quantization, comprising: a memory module configured to store structured representations associated with a model object comprising weight matrices, activation signals, rotation parameters, and intermediate or final representations; a tensor structuring module configured to receive the structured representations of the model object from the memory module and process the structured representations into structured digital formats suitable for computer processing, wherein the structured digital formats are stored in the memory module; a model initialization module configured to retrieve the structured digital formats from the memory module and modify internal normalization components of the model object to maintain computational invariance during rotation; a rotation configuration module configured to retrieve the modified model object from the memory module, apply orthogonal rotation transformations to weight matrices and activation signals of the model object, and store the rotated model object in the memory module; a rotation-aware fine-tuning module configured to retrieve the rotated model object and apply low-rank adaptation according to a selected fine-tuning strategy to the model object, wherein the low-rank adaptation comprises inserting and training low-rank matrices while keeping base weights frozen; a quantization module configured to retrieve the fine-tuned model object and perform quantization of both weights and activations using at least one quantization strategy to generate a low-bit model representation stored in the memory module; and a hardware interface module configured to export the low-bit model representation stored in the memory module to an inference system for deployment.
- 2 . The system of claim 1 , wherein the orthogonal rotation transformations comprise at least one of a Hadamard matrix or a block-diagonal matrix applied to attention and feed-forward weight matrices.
- 3 . The system of claim 1 , wherein the rotation configuration module is further configured to apply between-block rotation to projection weight matrices and in-block rotation to activation tensors after normalization layers and before nonlinear functions.
- 4 . The system of claim 1 , wherein the rotation-aware fine-tuning module is further configured to operate in a LoRA After Rotation (LAR) mode by inserting low-rank matrices after rotation and training the low-rank matrices in the rotated weight space.
- 5 . The system of claim 1 , wherein the rotation-aware fine-tuning module is further configured to operate in a LoRA Before Rotation (LBR) mode by applying low-rank adaptation before rotation and subsequently transforming the adapted weights via rotation.
- 6 . The system of claim 1 , wherein the quantization module is further configured to perform per-channel symmetric quantization on weights and per-tensor quantization on activations.
- 7 . The system of claim 1 , further comprising: an evaluation and analysis module configured to retrieve the low-bit model representation and compute evaluation metrics comprising activation kurtosis and quantization error.
- 8 . The system of claim 1 , further comprising: a digital transformation interface module configured to convert floating-point tensors, rotation matrices, and low-rank vectors into memory-aligned digital formats compatible with hardware execution.
- 9 . The system of claim 1 , further comprising: a compatibility interface module configured to adapt the system for LoRA variants by re-integrating decomposed directional and magnitude weight components into the fine-tuning process.
- 10 . The system of claim 1 , wherein the memory module comprises one or more non-transitory computer-readable storage media selected from the group consisting of dynamic random-access memory (DRAM), static RAM (SRAM), flash memory, and solid-state drives (SSD).
- 11 . A computing device, comprising: at least one processor; one or more non-transitory computer-readable storage media; the system according to claim 1 for fine-tuning a large language model under low-bit weight-activation quantization; and a graphics processing unit (GPU), a neural processing unit (NPU), or a tensor processing unit (TPU); wherein the hardware interface module of the system is configured to load a low-bit model representation from the memory module and transmit the model representation to the GPU, NPU, or TPU for inference execution.
- 12 . A method for fine-tuning a large language model under low-bit weight-activation quantization, comprising: receiving, by a tensor structuring module, structured representations associated with a model object comprising weight matrices, activation signals, and rotation parameters; processing the structured representations into structured digital formats suitable for computer processing, by the tensor structuring module, and storing the structured digital formats in a memory module; modifying, by a model initialization module, internal normalization components of the model object to maintain computational invariance during rotation; applying, by a rotation configuration module, orthogonal rotation transformations to weight matrices and activation signals of the model object to produce a rotated model object; applying, by a rotation-aware fine-tuning module, low-rank adaptation to the rotated model object according to a selected fine-tuning strategy, wherein the low-rank adaptation comprises inserting and training low-rank matrices while keeping base weights frozen; performing, by a quantization module, quantization of both weights and activations using at least one quantization strategy to generate a low-bit model representation; and exporting, by a hardware interface module, the low-bit model representation to an inference system for deployment.
- 13 . The method of claim 12 , wherein the applying orthogonal rotation transformations comprises applying at least one of a Hadamard matrix or a block-diagonal matrix to attention and feed-forward weight matrices.
- 14 . The method of claim 12 , wherein the applying orthogonal rotation transformations comprises: applying between-block rotation to projection weight matrices; and applying in-block rotation to activation tensors after normalization layers and before nonlinear functions.
- 15 . The method of claim 12 , wherein the applying low-rank adaptation comprises operating in a LoRA After Rotation (LAR) mode by inserting low-rank matrices after rotation and training the low-rank matrices in the rotated weight space.
- 16 . The method of claim 12 , wherein the applying low-rank adaptation comprises operating in a LoRA Before Rotation (LBR) mode by applying low-rank adaptation before rotation and subsequently transforming the adapted weights via rotation.
- 17 . The method of claim 12 , wherein the performing quantization comprises: performing per-channel symmetric quantization on weights; and performing per-tensor quantization on activations.
- 18 . The method of claim 12 , further comprising: evaluating the low-bit model representation using benchmark tasks and computing evaluation metrics comprising activation kurtosis and quantization error.
- 19 . The method of claim 12 , further comprising: converting floating-point tensors, rotation matrices, and low-rank vectors into memory-aligned digital formats compatible with hardware execution.
- 20 . The method of claim 12 , further comprising: adapting the model object for a LoRA variant by re-integrating decomposed directional and magnitude weight components into the fine-tuning process.
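By way of illustration only, and not as a limiting embodiment, the following minimal Python/NumPy sketch depicts the "LoRA After Rotation" (LAR) strategy recited in claims 4 and 15: a frozen base weight is rotated by an orthogonal matrix, and low-rank factors are inserted and trained in the rotated weight space while the base weights stay frozen, as recited in claims 1 and 12. The layer sizes, the random orthogonal matrix standing in for a Hadamard or block-diagonal rotation, and the zero-initialized low-rank factors are assumptions chosen for brevity.

```python
# Illustrative-only sketch of LoRA After Rotation (LAR); not the claimed implementation.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 8

W = rng.standard_normal((d_out, d_in))        # frozen base weight of a linear layer
x = rng.standard_normal((4, d_in))            # a small batch of activations

# Random orthogonal rotation (a stand-in for the Hadamard or block-diagonal rotation).
R, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))

# Rotate the weight and the activations; the layer output is unchanged because
# (x R)(W R)^T == x W^T (computational invariance).
W_rot = W @ R
x_rot = x @ R
assert np.allclose(x_rot @ W_rot.T, x @ W.T, atol=1e-8)

# LAR mode: insert low-rank factors AFTER rotation; only A and B would receive
# gradient updates, while the rotated base weight W_rot stays frozen.
A = 0.01 * rng.standard_normal((rank, d_in))  # trainable low-rank factor
B = np.zeros((d_out, rank))                   # trainable low-rank factor, zero-initialized

def lar_forward(x_rot, W_rot, A, B):
    """Forward pass in the rotated weight space: frozen base plus low-rank update."""
    return x_rot @ (W_rot + B @ A).T
```

In such a sketch, a training loop would update only A and B, so the outlier-free structure imposed by the rotation on the frozen base weights is preserved throughout fine-tuning.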
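Likewise for illustration only, the following sketch corresponds to the quantization recited in claims 6 and 17 and the evaluation metrics of claims 7 and 18: symmetric per-channel quantization of a weight matrix, per-tensor quantization of an activation tensor, and excess kurtosis as an outlier indicator. The 4-bit default, the rounding and clipping choices, and the function names are assumptions rather than the claimed implementation.

```python
# Illustrative-only sketch of per-channel weight / per-tensor activation quantization
# and an activation-kurtosis metric; not the claimed implementation.
import numpy as np

def quantize_weights_per_channel(W, bits=4):
    """Symmetric per-(output-)channel quantization: one scale per row of W."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max(axis=1, keepdims=True) / qmax
    q = np.clip(np.round(W / scale), -qmax - 1, qmax)
    return q * scale, scale                    # dequantized weights and per-channel scales

def quantize_activations_per_tensor(X, bits=4):
    """Symmetric per-tensor quantization: a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(X).max() / qmax
    q = np.clip(np.round(X / scale), -qmax - 1, qmax)
    return q * scale, scale

def excess_kurtosis(X):
    """Fourth standardized moment minus 3; large values flag heavy-tailed outliers."""
    x = X.ravel()
    x = (x - x.mean()) / x.std()
    return (x ** 4).mean() - 3.0

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 64))
X = rng.standard_normal((8, 64))
W_dq, _ = quantize_weights_per_channel(W)
X_dq, _ = quantize_activations_per_tensor(X)
print("weight quantization error (MSE):", np.mean((W - W_dq) ** 2))
print("activation excess kurtosis:     ", excess_kurtosis(X))
```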
Description
CROSS-REFERENCE TO RELATED APPLICATIONS
The present application claims priority from U.S. provisional patent application Ser. No. 63/717,284, filed Nov. 7, 2024, the disclosure of which is incorporated by reference in its entirety.
TECHNICAL FIELD
The present invention relates to techniques for model compression and quantization of large-scale neural networks; in particular, to a system and method for fine-tuning rotated outlier-free large language models (LLMs) to facilitate effective weight-activation quantization with improved accuracy and reduced quantization error in low-bit settings.
BACKGROUND
Large language models (LLMs), such as GPT-4 and LLaMA, have achieved notable success across various tasks. However, the increasing model size and training cost have motivated the development of model compression and parameter-efficient fine-tuning (PEFT) methods. Low-rank adaptation (LoRA) has become a widely adopted PEFT technique for improving fine-tuning efficiency by updating only a limited set of parameters. Recently, quantization techniques, which convert high-precision parameters into lower-bit formats such as INT4, have been integrated with LoRA methods. Existing quantization-LoRA schemes can save memory costs during fine-tuning, and some schemes can also reduce inference costs by producing quantized LLMs directly. However, these methods perform weight-only quantization, while weight-activation quantization with LoRA remains under-explored. Quantizing both weights and activations to low bit-widths further saves run-time GPU memory and accelerates compute-intensive matrix-multiplication operations. It is observed that 4-bit or 6-bit weight-activation quantization with LoRA fine-tuning still incurs high accuracy degradation in LLMs, which is attributable to outliers in the weight and activation distributions that stretch the quantization range and increase the quantization error. Existing methods in the post-training quantization research community have endeavored to tackle the outlier challenge by mixed-precision subgrouping or by shifting outliers from activations to weights. More recently, applying rotation to the weight matrices of LLMs has demonstrated effectiveness in eliminating activation outliers while keeping computational invariance. However, all these methods address the problem from a post-training perspective, ignoring that outliers emerge and change distribution during pre-training and fine-tuning. Accordingly, there is a need for a system and method that enable effective low-bit weight-activation quantization during fine-tuning of large language models by eliminating outliers in a manner that remains robust throughout the fine-tuning process.
SUMMARY OF INVENTION
It is an objective of the present invention to provide a system and a method to address the aforementioned shortcomings and unmet needs in the state of the art. In the present invention, an approach called Rotated outlier-free Low-Rank Adaptation (RoLoRA) is presented, which serves as a LoRA-based scheme for effective weight-activation quantization. RoLoRA utilizes rotation for outlier elimination and proposes rotation-aware fine-tuning to preserve the outlier-free characteristics of rotated LLMs. Experimental results show that RoLoRA consistently improves low-bit LoRA convergence and post-training quantization robustness in weight-activation settings.
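A minimal numerical sketch of the outlier-elimination property relied upon above may be helpful. It assumes a toy activation matrix with two artificially inflated channels and a random orthogonal matrix standing in for the Hadamard rotation used in practice, and it shows that rotating the activations while folding the rotation into the weights leaves the layer output unchanged, yet spreads the outlier energy across channels so that the extreme magnitudes that stretch the quantization range are reduced.

```python
# Illustrative-only sketch of rotation-based outlier elimination with
# computational invariance; toy data and a random orthogonal rotation are assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((128, 64))
X[:, :2] *= 50.0                               # two outlier channels stretch the range

W = rng.standard_normal((32, 64))
R, _ = np.linalg.qr(rng.standard_normal((64, 64)))   # orthogonal rotation

X_rot, W_rot = X @ R, W @ R                    # rotate activations and fold R into W

# Computational invariance: (X R)(W R)^T == X W^T.
assert np.allclose(X_rot @ W_rot.T, X @ W.T, atol=1e-6)

# The rotation typically spreads outlier energy across channels, shrinking the
# worst-case activation magnitude a per-tensor quantizer must cover.
print("max |activation| before rotation:", np.abs(X).max())
print("max |activation| after rotation: ", np.abs(X_rot).max())
```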
RoLoRA has been evaluated on LLaMA2-7B/13B and LLaMA3-8B models, achieving up to a 29.5% absolute accuracy improvement for 4-bit weight-activation quantized LLaMA2-13B on commonsense reasoning tasks, as compared to the LoRA baseline. The effectiveness of RoLoRA has also been demonstrated on large multimodal models, including LLaVA-1.5-7B.
In accordance with a first aspect of the present invention, a system for fine-tuning a large language model under low-bit weight-activation quantization is provided. The system includes a memory module, a tensor structuring module, a model initialization module, a rotation configuration module, a rotation-aware fine-tuning module, a quantization module, and a hardware interface module. The memory module is configured to store structured representations associated with a model object comprising weight matrices, activation signals, rotation parameters, and intermediate or final representations. The tensor structuring module is configured to receive the structured representations of the model object from the memory module and process the structured representations into structured digital formats suitable for computer processing, in which the structured digital formats are stored in the memory module. The model initialization module is configured to retrieve the structured digital formats from the memory module and modify internal normalization components of the model object to maintain computational invariance during rotation. The rotation configuration module is configured to retrieve the modified model object from the memory module, apply orthogonal rotation transformations