CN-121998028-A - Gradient sensitivity-based large language model fine tuning method, device and medium
Abstract
The invention discloses a gradient-sensitivity-based large language model fine-tuning method, device, and storage medium, belonging to the technical field of artificial intelligence and machine learning. The method comprises: first extracting a small number of samples from the full fine-tuning data set to serve as a probe data set for preliminary training of the model; then computing the gradients of the model parameters and evaluating each parameter's gradient sensitivity from those gradients; setting a gradient threshold according to a preset proportion of parameters to be trained and freezing parameters whose sensitivity falls below the threshold; and finally performing efficient fine-tuning on the full data set using only the small set of high-sensitivity parameters. The invention maintains the structural integrity of the original model, avoids introducing extra inference latency, accurately screens key parameters for training, markedly reduces computational resource consumption and training time, effectively prevents over-fitting and under-fitting, and improves the model's performance and training efficiency on specific downstream tasks.
Inventors
- XIA YUNLONG
- Request for anonymity
Assignees
- 夏云龙 (XIA Yunlong)
- 夏晓华 (XIA Xiaohua)
Dates
- Publication Date
- 20260508
- Application Date
- 20250922
Claims (9)
- 1. A gradient-sensitivity-based large language model fine-tuning method, characterized by comprising the following steps: 1) extracting part of the fine-tuning training data as a probe data set; 2) performing a small amount of training on the pre-trained large language model using the probe data set to acquire gradient information for the model parameters; 3) calculating the gradient sensitivity of each parameter from the gradient information and setting a gradient threshold; 4) setting parameters whose gradient sensitivity is below the threshold to a non-trainable state while keeping the other parameters trainable; 5) training the model on the full fine-tuning training data until the model converges; and 6) saving the trained model.
- 2. The method of claim 1, wherein the probe dataset is no more than 10% of the total fine tuning training data in size.
- 3. The method of claim 1, wherein the gradient sensitivity is obtained by calculating a norm of a parameter gradient.
- 4. The method according to claim 1, wherein the gradient threshold is dynamically determined according to a preset proportion of parameters to be trained.
- 5. The method of claim 4, wherein the gradient threshold is the percentile of the gradient norms corresponding to the preset proportion of parameters to be trained.
- 6. The method of claim 1, wherein the non-trainable state in step 4) is implemented by setting a parameter's requires_grad attribute to False.
- 7. A large language model efficient fine-tuning system for implementing the method of any one of claims 1 to 6, comprising: 1) a data sampling module for sampling from the full training data to generate a probe data set; 2) a probe training module for performing preliminary training on the pre-trained model using the probe data set and collecting gradients; 3) a sensitivity analysis module for calculating the gradient sensitivity of each parameter and determining the parameters to be frozen according to that sensitivity; 4) a parameter freezing module for setting the low-sensitivity parameters to a non-trainable state; and 5) an efficient fine-tuning module for fine-tuning only the unfrozen parameters on the full data.
- 8. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method of any one of claims 1 to 6.
- 9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method of any one of claims 1 to 6.
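The pipeline of claims 1 to 6 can be sketched as follows. This is a toy NumPy illustration, not the patented implementation: the parameter names, the simulated probe-gradient pass, and the 30% trainable ratio are all illustrative assumptions; a real system would run an actual backward pass in a deep learning framework and set each frozen tensor's requires_grad to False (claim 6).

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1 (claims 1-2): sample a probe set of at most 10% of the full data.
full_data = rng.normal(size=(1000, 8))
probe = full_data[rng.choice(1000, 100, replace=False)]

# Toy "model": ten named parameter groups (illustrative stand-in).
params = {f"layer{i}": rng.normal(size=(8,)) for i in range(10)}

def probe_gradients(batch, params):
    """Stand-in for step 2: one probe-training pass returning a gradient
    per parameter (simulated here as the mean feature response times w)."""
    return {name: batch.mean(axis=0) * w for name, w in params.items()}

grads = probe_gradients(probe, params)

# Step 3 (claims 3-5): sensitivity = gradient L2 norm; the threshold is the
# percentile matching a preset proportion of parameters to keep trainable.
norms = {name: np.linalg.norm(g) for name, g in grads.items()}
train_ratio = 0.3  # assumed ratio for illustration
threshold = np.percentile(list(norms.values()), 100 * (1 - train_ratio))

# Step 4: everything below the threshold is frozen; the rest stays trainable.
trainable = {name for name, v in norms.items() if v >= threshold}
print(len(trainable))  # 3 of the 10 parameter groups remain trainable
```

Step 5 would then run ordinary fine-tuning on the full data, updating only the parameters in `trainable`.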
Description
Gradient sensitivity-based large language model fine tuning method, device and medium

Technical Field

The invention relates to the technical field of artificial intelligence and machine learning, in particular to a model fine-tuning method and device based on dynamically screening and freezing parameters by gradient sensitivity, and a computer-readable storage medium; it particularly relates to large language model (Large Language Model, LLM) transfer learning and parameter-efficient fine-tuning (Fine-Tuning) technology.

Background

Large language models (LLMs) are widely used for various natural language processing tasks by virtue of their powerful general capabilities. However, when applied directly to a specific vertical domain (e.g., medical, financial, e-commerce), the results are often poor due to a lack of domain-specific knowledge. Fine-tuning pre-trained LLMs on domain data has therefore become standard practice. Since LLMs have an extremely large number of parameters (typically billions or even trillions), Full Fine-Tuning requires enormous computational resources and time, costs that are often prohibitive in practice. For this reason, Parameter-Efficient Fine-Tuning (PEFT) techniques have been developed, which aim to approximate the performance of full-parameter tuning while optimizing only a small number of parameters. The principles and existing shortcomings of the main PEFT techniques are as follows.

Adapter Tuning. Principle: insert small neural network modules (adapters) between model layers and train only these modules. Advantages: low memory footprint, fast training, preservation of pre-trained knowledge, and suitability for multi-task learning. Disadvantages: the model depth is increased and inference latency grows, and the adapter's structural design (e.g., insertion position and bottleneck dimension) affects performance.
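The bottleneck adapter described above can be sketched as follows. This is a minimal NumPy illustration, not any particular library's implementation; the hidden size, bottleneck dimension, and zero initialization of the up-projection (a common choice so the adapter starts as an identity map) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, r = 16, 4  # hidden size and adapter bottleneck (illustrative values)

# Bottleneck adapter: down-project, nonlinearity, up-project, residual add.
W_down = rng.normal(scale=0.02, size=(d_model, r))
W_up = np.zeros((r, d_model))  # zero-init so the adapter starts as identity

def adapter(h):
    # ReLU bottleneck plus residual connection around the inserted module.
    return h + np.maximum(h @ W_down, 0) @ W_up

h = rng.normal(size=(2, d_model))
out = adapter(h)
print(np.allclose(out, h))  # True: zero-init up-projection preserves the input
```

Only W_down and W_up would be trained; the surrounding model layers stay frozen, which is what yields the low memory footprint, and the extra matrix multiplications are what add inference latency.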
LoRA (Low-Rank Adaptation). Principle: approximate the weight update of full-parameter fine-tuning by a low-rank matrix decomposition (ΔW = B·A). Advantages: no inference latency, extremely low GPU memory usage, and easy combination with other methods such as quantization (e.g., QLoRA). Disadvantages: the low-rank hypothesis may limit the model's expressive power, and the rank r is a hyperparameter that must be tuned and affects the result.

Prefix-Tuning. Principle: prepend a learnable continuous vector (prefix) to the input as a task prompt. Advantages: only a small number of parameters are optimized; well suited to generation tasks; no modification of the model structure is required. Disadvantages: sensitive to the prefix length, with efficiency suffering when the prefix is too long, and training stability is poor (techniques such as gradient clipping are needed).

Prompt Tuning. Principle: extend the input prompt with trainable vectors that guide the model's output. Advantages: extremely small parameter count (only the prompt-related parameters are optimized); suitable for few-shot or zero-shot scenarios. Disadvantages: sensitive to prompt initialization and template design, and of limited effectiveness on complex tasks.

P-Tuning series (P-Tuning v2). Principle: optimize continuous prompt vectors and extend them to all model layers. Advantages: more stable than plain Prompt Tuning and suitable for NLU tasks; the parameter count remains small. Disadvantage: long training time (layer-by-layer optimization).

Freeze-based methods. Principle: freeze most parameters and fine-tune only some layers (such as the top layers or specific modules). Advantages: low memory footprint, fast training, and simple to implement.
Disadvantages: performance depends on the choice of which layers to unfreeze, and the model may underfit.

BitFit (Bias-Term Fine-Tuning). Principle: fine-tune only the bias terms of the model. Advantages: extremely few parameters (typically about 0.1% of the model's total) and extremely low resource consumption. Disadvantage: poor flexibility; it is only suitable for tasks where the bias terms have a significant influence.

Therefore, there is a strong need in the art for a method that maintains the structural integrity of the original model while adaptively selecting key parameters in a data-dependent manner for efficient fine-tuning, so as to significantly reduce computational overhead while ensuring performance.

Disclosure of Invention

Object of the Invention. The invention aims to overcome the defects of existing PEFT techniques and to provide a gradient-sensitivity-based large language model fine-tuning method. The method aims to dynamically and adaptively identify parameters which are critical to the current fine tuning task t