
CN-121981200-A - Federated fine-tuning and acceleration method for a vision-language foundation model

CN 121981200 A

Abstract

The invention discloses a federated fine-tuning and acceleration method for a vision-language foundation model, belonging to the field of computer vision, comprising the following steps: I, a server distills the mixed feature knowledge of a CLIP model into a lightweight model using a public dataset to generate a lightweight decoupled model; II, the server deploys the lightweight decoupled model to each client, residual blocks are initialized and trained at each client, and client local training is executed. The invention greatly reduces the computation and storage cost on the terminal, achieves personalization without damaging the global prior, meets privacy requirements, reduces the uplink data volume, can identify bottleneck devices and adjust the training schedule as needed, reduces the performance loss caused by stragglers and latency jitter, and balances global generalization capability with local personalization capability.

Inventors

  • Guo Songtao
  • Tan Menzhuo
  • Zhou Pengzhan
  • Jiao Xianlong
  • Liu Guiyan
  • Li Mingyan
  • Gu Fuqiang

Assignees

  • Chongqing University (重庆大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-27

Claims (8)

  1. A federated fine-tuning and acceleration method for a vision-language foundation model, characterized by comprising the following steps: I, the server performs mixed feature knowledge distillation of the CLIP model and a lightweight model using a public dataset to generate a lightweight decoupled model; II, the server deploys the lightweight decoupled model to each client, residual blocks are initialized and trained at each client, and client local training is executed; III, the training results of all clients are uploaded to the server, which collects client and model states and performs federated aggregation; and IV, the optimal number of local iterations per round is solved by PSO, and the next round of training is controlled until training is complete.
  2. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 1, wherein the mixed feature knowledge distillation of the CLIP model and the lightweight model performed by the server in step I using the public dataset comprises the following steps: S1.1, the server selects a multi-group image-text dataset consisting of image-text pairs as the public dataset for distillation; for each image I_i in the public dataset, an image feature vector v_i is computed by the image encoder IE of the CLIP model; for each text T_j, a text feature vector t_j is computed by the text encoder TE of the CLIP model; an image feature vector u_i is computed for each image by the lightweight model DM under its current parameters, where i, j = 1, 2, …, N; S1.2, the obtained image and text feature vectors are normalized into unit vectors, and based on these unit vectors the matching probability between the i-th image I_i, taken as the anchor, and each text T_j is calculated; the mean square error (MSE) between u_i and v_i is then calculated, and the matching probability and the MSE are linearly combined by weight to form the MFKD total loss function; S1.3, samples are drawn from the public dataset with a predetermined batch size; for each batch, the image feature vectors of IE, the text feature vectors of TE, and the image feature vectors of DM are computed in turn, each feature vector is normalized, and the MSE and matching probability are computed on the batch to obtain the MFKD total loss value of the batch; and S1.4, based on the MFKD total loss value, back-propagation updates are performed with the Adam optimizer and the total number of training rounds is controlled with RKD; after a preset training period, validation image-text pairs are used to evaluate the feature approximation degree and semantic retention capability of the lightweight model DM; training is stopped if both begin to decline over multiple iterations, and the DM parameters with the highest feature approximation degree and semantic retention capability are saved for distribution to each client after training finishes.
  3. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 2, wherein the matching probability of S1.2 is calculated as

     p(I_i, T_j) = exp(û_i · t̂_j / τ) / Σ_{k=1..N} exp(û_i · t̂_k / τ)

     where p(I_i, T_j) is the matching probability of image I_i and text T_j; τ is the temperature coefficient used in CLIP pre-training; û_i is the unit vector of the i-th image feature vector output by the lightweight model DM, normalized by its Euclidean norm; and t̂_j is the unit vector of the j-th text feature vector output by the text encoder TE, normalized by its Euclidean norm; the MFKD total loss function of S1.2 is calculated as

     L_MFKD = −(λ/N) Σ_{i=1..N} log p(I_i, T_i) + ((1 − λ)/N) Σ_{i=1..N} ||û_i − v̂_i||²

     where N is the number of samples in the public dataset; λ is a hyper-parameter with 0 < λ < 1; v̂_i is the unit vector of the i-th image feature vector output by the image encoder IE; and || · || denotes the Euclidean distance.
  4. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 2, wherein initializing and training the residual blocks at each client and performing client local training in step II comprises the following steps: S2.1, the server transmits the lightweight model DM parameter file, trained and validated on the public dataset, to each client currently participating in training; after receiving it, each client loads the parameter file into its local model and sets all parameters of the lightweight model DM to a frozen, non-trainable state, while a residual block RB consistent with the output dimension of the current local model is set up at the client; S2.2, based on the frozen DM parameters, each client performs forward inference on a local input image through the local model to obtain the corresponding image feature vector, feeds this image feature vector into the residual block RB to compute the RB's output residual vector, and then fuses the image feature vector output by the local model with the RB's output residual vector through a weighted residual connection to generate the image feature for the downstream task; S2.3, the image feature for the downstream task is fed into a classification head to obtain the corresponding logits, and the cross-entropy loss between the logits and the local labels is taken as the local training loss of the corresponding client; the number of local iterations for the round is set, and each client loops locally over batches up to the set iteration count, performing forward propagation and loss computation on each mini-batch; S2.4, after each batch completes forward propagation, the gradient of the cross-entropy loss is computed and the residual block RB parameters are updated according to the MGD rule; at each iteration the cosine similarity between the image feature vector output by the local model and the RB's output residual vector is also computed, and the cosine similarity obtained in the round is mapped through a linear transformation into the weight parameter of the residual stream; and S2.5, after each client completes the preset number of iterations, the finally trained residual block RB parameters and classification head parameters are stored locally, and the final local loss and local gradient of the round of training are computed and recorded.
  5. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 4, wherein the server collecting client and model states for federated aggregation in step III comprises the following steps: S3.1, at the end of each round of federated communication, each client gathers the residual block RB parameters, local loss value, local gradient, and gradient change rate obtained in the round of training, packages the publicly sharable system and model running-state information into its personal parameters, and uploads this information to the server; S3.2, the server computes each client's aggregation weight from its local data volume, performs a weighted average of the received RB parameters according to these weights to obtain the new global RB parameters, and computes from the clients' personal parameters the corresponding wireless transmission rate and the transmission and processing time cost of the round; S3.3, based on the received local gradients of the clients, the server computes the global gradient by weighted summation, computes each client's gradient deviation from the global gradient, sorts the clients by gradient deviation from high to low, and records each client's ranking; and S3.4, the collected data are organized into a set of dictionaries keyed by client; the server sends the new global RB parameters to each client and, at the same time, sends as control information to the corresponding clients the wireless transmission rate of each client computed by the server, the server's transmission and processing time cost, and the gradient deviation ranking; after completion, the server records feedback on whether the broadcast succeeded or failed and logs the broadcast time for time-budget accounting.
  6. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 5, wherein the global RB parameter of S3.2 is calculated as

     w_t = Σ_{k=1..K} α_k · w_t^k

     where α_k is the aggregation weight of the k-th client; w_t^k is the k-th client's updated local RB parameter in round t; and E is the total number of local iterations performed to obtain w_t^k; the wireless transmission rate of S3.2 is calculated as

     r_k = B_k · log2(1 + p_k g_k / (N0 B_k))

     where B_k is the channel bandwidth of the k-th client; p_k is its transmit power; g_k is its channel gain; and N0 is the noise power spectral density; the transmission and processing time cost is calculated as

     T_k = T_k^trans + T_k^proc = s / r_k + E · c · D_k / f_k

     where s is the size of the transmitted data; f_k is the CPU frequency of the k-th client; c is the number of CPU cycles required to process one data sample; D_k is the k-th client's local sample count; T_k^trans = s / r_k is the transmission time; and T_k^proc = E · c · D_k / f_k is the processing time.
  7. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 5, wherein solving the optimal number of local iterations per round by PSO in step IV comprises the following steps: S4.1, after the server issues the global RB parameters, the objective of minimizing the global loss function of each round is formulated based on the computed wireless transmission rate, the transmission and processing time cost of the round, and the gradient deviation; the corresponding objective function is established and its time constraint is set; S4.2, candidate values of the local iteration count are taken as multiple groups of positive integers, each candidate value serving as a particle position in PSO; multiple groups of particles are randomly generated within the candidate integer interval, and each particle's current position, historical best, and population best are stored; S4.3, the objective function value of each particle is computed based on the current wireless transmission rate, the transmission and processing time cost of the round, and the gradient deviation, and is taken as the particle's fitness; the position and velocity of each particle are updated according to PSO's discretized velocity update rule, and iteration stops when, after multiple rounds, each particle's fitness change converges within a preset range; and S4.4, after iteration stops, the particles are sorted by fitness, the best-ranked particle is selected as the global optimal particle, the integer corresponding to this particle is output as the local iteration count of the current round, and it is issued to each client as the iteration count for the next round of local training.
  8. The federated fine-tuning and acceleration method for a vision-language foundation model according to claim 7, wherein the objective function of S4.1 minimizes the global loss of each round, based on the wireless transmission rate, the transmission and processing time cost, and the gradient deviation, subject to the time constraint, where E_t denotes the number of local iterations selected by the server in round t.
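The PSO selection of each round's local iteration count (claims 7 and 8, steps S4.2–S4.4) can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the fitness function, the integer bounds, and the PSO coefficients `w`, `c1`, `c2` are assumptions.

```python
import random

def pso_local_iterations(fitness, lo=1, hi=20, n_particles=12, n_iters=60, seed=0):
    """Discrete PSO over candidate local-iteration counts (S4.2-S4.4).

    `fitness` maps an integer iteration count to the objective value of
    S4.1 (lower is better). Continuous PSO positions are rounded back to
    integers, one common discretisation of the velocity update rule.
    """
    rng = random.Random(seed)
    pos = [rng.randint(lo, hi) for _ in range(n_particles)]
    vel = [0.0] * n_particles
    pbest, pbest_f = list(pos), [fitness(p) for p in pos]
    g = pbest[min(range(n_particles), key=lambda i: pbest_f[i])]
    w, c1, c2 = 0.7, 1.5, 1.5                 # assumed PSO coefficients
    for _ in range(n_iters):
        for i in range(n_particles):
            vel[i] = (w * vel[i]
                      + c1 * rng.random() * (pbest[i] - pos[i])
                      + c2 * rng.random() * (g - pos[i]))
            pos[i] = min(hi, max(lo, round(pos[i] + vel[i])))
            f = fitness(pos[i])
            if f < pbest_f[i]:                # update personal best
                pbest[i], pbest_f[i] = pos[i], f
        g = pbest[min(range(n_particles), key=lambda i: pbest_f[i])]
    return g

# Toy objective: more local iterations cut loss but raise time cost.
best = pso_local_iterations(lambda e: (e - 7) ** 2 + 0.1 * e)
```

In the patent, the fitness would instead be the S4.1 objective built from the round's wireless transmission rate, time cost, and gradient deviation.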

Description

Federated fine-tuning and acceleration method for a vision-language foundation model

Technical Field

The invention relates to the field of computer vision, in particular to a federated fine-tuning and acceleration method for a vision-language foundation model.

Background

Federated Fine-Tuning (FFT) has emerged as a fine-tuning paradigm that protects privacy. FFT ensures the security of private data by aggregating only the trainable parameters through Federated Learning (FL). FFT relies on diverse sample data for domain adaptation, which motivates migrating the fine-tuning process to mobile devices that provide highly heterogeneous data. However, applying FFT in a Mobile Edge Computing (MEC) environment faces two key challenges: limited computing resources and system heterogeneity. Existing work has focused on combining Parameter-Efficient Fine-Tuning (PEFT) with FFT to reduce the number of trainable parameters. CLIP-Adapter deploys a frozen CLIP on mobile devices and fine-tunes only a small fraction of the model parameters. However, the mobile device still inevitably performs the full forward pass of CLIP, which incurs significant computational overhead. Currently, the model size on mobile devices is typically limited to below 100 MB, a limit most existing CLIP variants exceed. Thus, existing work still requires deploying a complete foundation model on devices with limited computing resources. On the other hand, the heterogeneous computing power of mobile devices and fluctuations in wireless network conditions lead to system heterogeneity in MEC environments. This heterogeneity introduces uncertainty into the FL synchronization time, greatly increasing the overall completion time of FFT. AAFL considers adapting the number of local iterations to account for the straggler effect.
However, existing solutions are designed under static resource assumptions and cannot guarantee the convergence of FL over wireless networks, which motivates a federated fine-tuning and acceleration method for a vision-language foundation model.

Disclosure of Invention

The invention aims to address the defects of the prior art and provides a federated fine-tuning and acceleration method for a vision-language foundation model. To achieve this purpose, the invention adopts the following technical scheme: a federated fine-tuning and acceleration method for a vision-language foundation model, comprising the following steps: I, the server performs mixed feature knowledge distillation of the CLIP model and a lightweight model using a public dataset to generate a lightweight decoupled model; II, the server deploys the lightweight decoupled model to each client, residual blocks are initialized and trained at each client, and client local training is executed; III, the training results of all clients are uploaded to the server, which collects client and model states and performs federated aggregation; and IV, the optimal number of local iterations per round is solved by PSO, and the next round of training is controlled until training is complete.
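The round structure of steps II–IV can be illustrated with a minimal sketch. The toy gradient step, the fixed learning rate, and the client dictionaries are illustrative assumptions; the weighting by local data volume follows S3.2, and step IV's PSO choice of `local_iters` is treated here as an external input.

```python
def run_round(server_rb, clients, local_iters):
    """One federated round over steps II-III: each client refines the
    residual-block (RB) parameters for `local_iters` toy gradient steps
    on top of the frozen DM, then the server aggregates the results
    weighted by local data volume (S3.2)."""
    updates = []
    for c in clients:
        rb = dict(server_rb)                  # start from the global RB
        for _ in range(local_iters):
            for k in rb:
                rb[k] -= 0.1 * c["grad"][k]   # toy fixed-rate gradient step
        updates.append(rb)
    total = sum(c["n_samples"] for c in clients)
    return {k: sum(u[k] * c["n_samples"] / total   # weighted average
                   for u, c in zip(updates, clients))
            for k in server_rb}

# Two toy clients with constant gradients and different data volumes.
clients = [{"grad": {"w": 1.0}, "n_samples": 1},
           {"grad": {"w": 0.0}, "n_samples": 3}]
new_rb = run_round({"w": 1.0}, clients, local_iters=2)
```

The client holding three quarters of the data pulls the aggregate toward its (unchanged) parameter, mirroring how the data-volume weights of S3.2 dominate the global RB update.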
As a further scheme of the invention, the mixed feature knowledge distillation of the CLIP model and the lightweight model performed by the server in step I using the public dataset comprises the following steps: S1.1, the server selects a multi-group image-text dataset consisting of image-text pairs as the public dataset for distillation; for each image I_i in the public dataset, an image feature vector v_i is computed by the image encoder IE of the CLIP model; for each text T_j, a text feature vector t_j is computed by the text encoder TE of the CLIP model; an image feature vector u_i is computed for each image by the lightweight model DM under its current parameters, where i, j = 1, 2, …, N; S1.2, the obtained image and text feature vectors are normalized into unit vectors, and based on these unit vectors the matching probability between the i-th image I_i, taken as the anchor, and each text T_j is calculated; the mean square error (MSE) between u_i and v_i is then calculated, and the matching probability and the MSE are linearly combined by weight to form the MFKD total loss function; S1.3, samples are drawn from the public dataset with a predetermined batch size; for each batch, the image feature vectors of IE, the text feature vectors of TE, and the image feature vectors of DM are computed in turn, each feature vector is normalized, and the MSE and matching probability are computed on the batch to obtain the MFKD total loss value of the batch; and S1.4, based on the MFKD total loss value, back-propagation updates are performed with the Adam optimizer and the total number of training rounds is controlled with RKD.
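The MFKD loss of S1.2 can be written as a small NumPy sketch. The temperature `tau`, the weight `lam`, and the use of a cross-entropy matching term are reconstructed assumptions consistent with, but not verbatim from, the claim text.

```python
import numpy as np

def mfkd_loss(u, v, t, tau=0.07, lam=0.5):
    """MFKD total loss (S1.2, reconstructed sketch).

    u: (N, d) image features from the lightweight model DM
    v: (N, d) image features from the CLIP image encoder IE
    t: (N, d) text features from the CLIP text encoder TE
    Combines an image-anchored matching term with the DM-vs-IE MSE,
    linearly weighted by the assumed hyper-parameter lam.
    """
    un = u / np.linalg.norm(u, axis=1, keepdims=True)   # unit vectors (S1.2)
    vn = v / np.linalg.norm(v, axis=1, keepdims=True)
    tn = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = un @ tn.T / tau                  # image-anchored similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)         # matching probability p(I_i, T_j)
    match = -np.log(np.diag(p)).mean()        # pull matched pairs together
    mse = ((un - vn) ** 2).sum(axis=1).mean() # feature-approximation term
    return lam * match + (1.0 - lam) * mse

# Toy usage with random features standing in for encoder outputs.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
texts = rng.normal(size=(4, 8))
loss = mfkd_loss(feats, feats, texts)
```

Setting `lam=0` isolates the MSE term (which vanishes when DM and IE features coincide), while `lam=1` isolates the contrastive matching term.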