CN-121998013-A - Method and apparatus for achieving efficient fine tuning of unstructured sparse low-precision large pre-trained base models

CN 121998013 A

Abstract

An example apparatus includes interface circuitry, machine-readable instructions, and at least one processor circuit programmed by the machine-readable instructions to sparsify weights of a base model to generate a sparse base model, apply a neural low-rank adapter search to the sparse base model, and output a fine-tuned base model based on the application of the neural low-rank adapter search to the sparse base model.
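As a rough illustration of the first step of this flow, unstructured sparsification of a weight matrix could look like the following sketch. This is not the claimed implementation: the function name and the magnitude criterion are assumptions (the claims leave the scoring function open).

```python
import torch

def magnitude_sparsify(weight: torch.Tensor, sparsity: float) -> torch.Tensor:
    """Zero the smallest-magnitude entries of `weight` (unstructured sparsity).

    Magnitude is only one possible scoring function; the claims do not
    commit to a specific one.
    """
    k = int(weight.numel() * sparsity)
    if k == 0:
        return weight.clone()
    # k-th smallest absolute value becomes the pruning threshold.
    threshold = weight.abs().flatten().kthvalue(k).values
    return torch.where(weight.abs() > threshold, weight, torch.zeros_like(weight))

w = torch.randn(8, 8)
sparse_w = magnitude_sparsify(w, sparsity=0.5)
```

The adapter search and output steps operate on `sparse_w`; they are sketched separately below the claims and description.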

Inventors

  • J. P. Munoz
  • Yuan Jinjie
  • N. K. Jain

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-08
Application Date
2025-09-30
Priority Date
2024-11-01

Claims (20)

  1. An apparatus, comprising: interface circuitry; machine-readable instructions; and at least one processor circuit programmed by the machine-readable instructions to: sparsify weights of a base model to generate a sparse base model; apply a neural low-rank adapter search to the sparse base model; and output a fine-tuned base model based on the application of the neural low-rank adapter search to the sparse base model.
  2. The apparatus of claim 1, wherein one or more of the at least one processor circuit is to sparsify the base model by identifying a sparsity pattern associated with sparsified weights of the base model.
  3. The apparatus of claim 2, wherein one or more of the at least one processor circuit is to identify the sparsified weights based on a scoring function applied to pre-trained weights of the base model.
  4. The apparatus of claim 3, wherein, when the fine-tuned base model is a sparsified and quantized base model, one or more of the at least one processor circuit is to identify the sparsified and quantized base model by quantizing the sparsified weights to a lower precision.
  5. The apparatus of any of claims 1, 2, 3, or 4, wherein one or more of the at least one processor circuit is to generate a binary mask based on the sparse base model, the binary mask resulting from an initial sparsification of a weight matrix of the base model.
  6. The apparatus of any of claims 1, 2, 3, 4, or 5, wherein one or more of the at least one processor circuit is to apply the neural low-rank adapter search to train an elastic adapter with a variable configuration to improve accuracy of the fine-tuned base model.
  7. The apparatus of claim 6, wherein the variable configuration represents a variable rank value as opposed to a fixed rank value.
  8. The apparatus of claim 7, wherein the neural low-rank adapter search is to apply the variable rank value to the elastic adapter to identify a single elastic adapter configuration from a space of elastic adapter configurations.
  9. The apparatus of claim 6, wherein one or more of the at least one processor circuit is to merge model weights of the elastic adapter and the base model after fine-tuning while maintaining sparsity of the model weights.
  10. At least one machine-readable medium comprising machine-readable instructions to cause at least one processor circuit to at least: sparsify weights of a base model to generate a sparse base model; apply a neural low-rank adapter search to the sparse base model; and output a fine-tuned base model based on the application of the neural low-rank adapter search to the sparse base model.
  11. The at least one machine-readable medium of claim 10, wherein the machine-readable instructions are to cause one or more of the at least one processor circuit to sparsify the base model by identifying a sparsity pattern associated with sparsified weights of the base model.
  12. The at least one machine-readable medium of claim 11, wherein the machine-readable instructions are to cause one or more of the at least one processor circuit to identify the sparsified weights based on a scoring function applied to pre-trained weights of the base model.
  13. The at least one machine-readable medium of claim 12, wherein the fine-tuned base model is a sparsified and quantized base model, and the machine-readable instructions are to cause one or more of the at least one processor circuit to identify the sparsified and quantized base model by quantizing the sparsified weights to a lower precision.
  14. The at least one machine-readable medium of any of claims 10, 11, 12, or 13, wherein the machine-readable instructions are to cause one or more of the at least one processor circuit to generate a binary mask based on the sparse base model, the binary mask being derived from an initial sparsification of a weight matrix of the base model.
  15. The at least one machine-readable medium of any of claims 10, 11, 12, 13, or 14, wherein the machine-readable instructions are to cause one or more of the at least one processor circuit to apply the neural low-rank adapter search to train an elastic adapter having a variable configuration to improve accuracy of the fine-tuned base model.
  16. The at least one machine-readable medium of claim 15, wherein the variable configuration represents a variable rank value as opposed to a fixed rank value.
  17. The at least one machine-readable medium of claim 16, wherein the neural low-rank adapter search is to apply the variable rank value to the elastic adapter to identify a single elastic adapter configuration from a space of elastic adapter configurations.
  18. The at least one machine-readable medium of claim 16, wherein the machine-readable instructions are to cause one or more of the at least one processor circuit to merge model weights of the elastic adapter and the base model after fine-tuning while maintaining sparsity of the model weights.
  19. An apparatus, comprising: means for sparsifying weights of a base model to generate a sparse base model; means for applying a neural low-rank adapter search to the sparse base model; and means for outputting a fine-tuned base model based on the application of the neural low-rank adapter search to the sparse base model.
  20. The apparatus of claim 19, wherein the means for sparsifying is to identify a sparsity pattern associated with sparsified weights of the base model.
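Claims 2-5 and 11-14 describe deriving a binary mask from an initial sparsification and quantizing the sparsified weights to a lower precision. A minimal sketch of those two operations follows; the magnitude scoring function and the symmetric per-tensor int8 scheme are illustrative assumptions, not details fixed by the claims.

```python
import torch

def sparsify_and_mask(weight: torch.Tensor, sparsity: float):
    """Score weights (here simply by magnitude, an assumed scoring function),
    zero the lowest-scored entries, and record surviving positions in a
    binary mask."""
    k = int(weight.numel() * sparsity)
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = (weight.abs() > threshold).to(weight.dtype)
    return weight * mask, mask

def quantize_int8(weight: torch.Tensor):
    """Symmetric per-tensor int8 quantization. Zeroed weights quantize to
    exactly zero, so the sparsity pattern survives quantization."""
    scale = (weight.abs().max() / 127.0).item()
    q = torch.clamp(torch.round(weight / scale), -127, 127).to(torch.int8)
    return q, scale

w = torch.randn(16, 16)
sparse_w, mask = sparsify_and_mask(w, sparsity=0.5)
q, scale = quantize_int8(sparse_w)
dequantized = q.float() * scale  # zeros stay zero, so the mask is preserved
```

Because a zero weight maps to the integer zero under symmetric quantization, the binary mask recovered from the sparsified weights remains valid for the quantized model.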

Description

Method and apparatus for achieving efficient fine tuning of unstructured sparse low-precision large pre-trained base models

Background

A base model, such as a pre-trained large language model (LLM), is a neural network that performs tasks based on artificial intelligence (AI). These models use millions or billions of parameters and may require fine-tuning on new data sets or downstream tasks, such as mathematical reasoning. LLMs include encoder-only models for classification tasks, decoder-only models for content-generation tasks, and encoder-decoder models for content-evaluation and content-generation tasks such as translation and summarization.

Drawings

FIG. 1 illustrates limitations of known methods for fine-tuning sparse quantized models and merging low-rank adaptation (LoRA) adapters.
FIG. 2 is a block diagram of an example implementation of a model tuner circuit constructed in accordance with the teachings of this disclosure to fine-tune a pre-trained LLM on a downstream task.
FIG. 3 is a flowchart representative of example machine-readable instructions and/or example operations that may be executed, instantiated, and/or performed by example programmable circuitry to implement the example model tuner circuit of FIG. 2.
FIG. 4 is a flowchart representative of example machine-readable instructions and/or example operations that may be executed, instantiated, and/or performed by example programmable circuitry to implement the example model tuner circuit of FIG. 2 to perform sparsification and quantization of a base model.
FIG. 5 is a flowchart representative of example machine-readable instructions and/or example operations that may be executed, instantiated, and/or performed by example programmable circuitry to implement the example model tuner circuit of FIG. 2 to recover model accuracy using neural low-rank adapter search (NLS).
FIG. 6 illustrates example pipeline configurations that may be instantiated to efficiently fine-tune a large model, including (1) a first pipeline for parameter-efficient fine-tuning (PEFT) of a sparse quantized model using an elastic adapter, yielding an unmerged model and adapter, (2) a second pipeline for sparsity-aware parameter-efficient fine-tuning (sparse PEFT) of a sparse model, allowing subsequent merging of the model and adapter, and (3) a third pipeline for parameter-efficient fine-tuning of a sparse quantized model using quantization- and sparsity-aware adapter merging.
FIG. 7 shows an example of a known low-rank adaptation (LoRA) adapter in contrast to the elastic adapter associated with the neural low-rank adapter search (NLS) disclosed herein.
FIG. 8 shows an example overview of fine-tuning a pre-trained LLM based on model sparsification, recovering base-model accuracy using the NLS adapter of FIG. 7, and identifying a sparse fine-tuned architecture based on a sub-adapter search.
FIG. 9 shows sparsity-aware parameter-efficient fine-tuning (sparse PEFT) using a binary mask obtained from the sparsified weights.
FIG. 10 illustrates an example reduction in the parameters required to fine-tune an LLM, while achieving higher accuracy, using the methods disclosed herein as compared with known sparse fine-tuning methods.
FIG. 11A illustrates evaluation results for an example first model fine-tuned using known methods compared with the methods disclosed herein, e.g., fine-tuning of a sparse quantized model (SQFT), SQFT combined with sparse PEFT, and SQFT combined with sparse PEFT including quantization- and sparsity-aware adapter merging (QA-sparse PEFT).
FIG. 11B shows results of an ablation experiment comparing a known low-rank adaptation (LoRA) adapter with the elastic adapter associated with the neural low-rank adapter search (NLS) of FIG. 7 when evaluating the fine-tuning disclosed herein, including SQFT with sparse PEFT and SQFT with quantization-aware sparse PEFT.
FIG. 12 illustrates an example cost analysis of different pipelines associated with model tuning, including assessments of model storage, tuning time, and accuracy.
FIG. 13 is a block diagram of an example processing platform including programmable circuitry structured to execute, instantiate, and/or perform the example machine-readable instructions and/or the example operations of FIGS. 3-5 to implement the model tuner circuit of FIG. 2.
FIG. 14 is a block diagram of an example implementation of the processor circuitry of FIG. 13.
FIG. 15 is a block diagram of another example implementation of the programmable circuitry of FIG. 13.
FIG. 16 is a block diagram of an example software/firmware/instructions distribution platform (e.g., one or more servers) to distribute software, instructions, and/or firmware (e.g., corresponding to the example machine-readable instructions of FIGS. 3-5) to client devices associated with end users and/or consumers (e.g., for licensi
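The elastic adapter of FIG. 7 and the sparsity-preserving merge of FIG. 9 might be sketched as follows: a LoRA-style adapter stores its factors at a maximum rank and activates a sub-rank at search time, giving the NLS a space of sub-adapter configurations; after fine-tuning, the low-rank update is merged into the base weights and the binary mask is reapplied so the sparsity pattern survives the merge. All class and function names here are illustrative, not taken from the disclosure.

```python
import torch
import torch.nn as nn

class ElasticLoRA(nn.Module):
    """LoRA-style adapter whose active rank is variable (elastic).

    Factors are stored at a maximum rank; set_rank() activates the first r
    rows/columns, so a search can pick one configuration from a space of
    sub-adapter configurations.
    """
    def __init__(self, in_features: int, out_features: int, max_rank: int = 16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(max_rank, in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(out_features, max_rank))  # LoRA-style zero init
        self.rank = max_rank

    def set_rank(self, r: int) -> None:
        self.rank = r  # an NLS-style search would try several candidate ranks

    def delta(self) -> torch.Tensor:
        # Low-rank update computed from the active sub-rank only.
        return self.B[:, : self.rank] @ self.A[: self.rank, :]

def merge_preserving_sparsity(base_weight, adapter, mask):
    """Merge the adapter into the base weights, then reapply the binary mask
    so the merged weights keep the base model's sparsity pattern."""
    return (base_weight + adapter.delta()) * mask

mask = (torch.rand(32, 32) > 0.5).float()
base = torch.randn(32, 32) * mask           # sparse base weights
adapter = ElasticLoRA(32, 32, max_rank=16)
adapter.set_rank(8)                         # one configuration from the search space
merged = merge_preserving_sparsity(base, adapter, mask)
```

Reapplying the mask after merging is what keeps the merged model deployable as a sparse model, in contrast to a plain LoRA merge, which would fill in the pruned positions.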