CN-121981199-A - Extensible expert network fine tuning method based on metacore architecture and coefficient sharing
Abstract
The invention discloses an extensible expert network fine-tuning method based on a metakernel architecture and coefficient sharing, comprising the steps of: S1, decomposing an original convolution kernel into metakernels and metakernel coefficients; S2, learning the metakernel coefficients from the pre-training weights and fixing them; S3, adding the pre-training weights as a bypass to the decomposed convolution kernel; S4, constructing multiple expert branches that share coefficients; S5, designing and integrating a dynamic gating network; S6, setting the training strategy; S7, performing forward computation and obtaining the loss; S8, computing gradients and updating parameters; and S9, repeating steps S7 to S8 until the total training loss converges. The invention combines the low-parameter advantage of convolution kernel decomposition with the dynamic adaptability of the MoE architecture. By sharing coefficients across experts and using a dynamic routing mechanism, it balances parameter efficiency, model expressiveness, and generalization while updating only a small number of metakernel, gating-network, and classification-layer parameters, making it deployment-friendly and suitable for downstream task adaptation in resource-constrained environments.
Inventors
- YU HEWEI
- YU JINWEI
Assignees
- SOUTH CHINA UNIVERSITY OF TECHNOLOGY (华南理工大学)
Dates
- Publication Date
- 20260505
- Application Date
- 20260123
Claims (10)
- 1. An extensible expert network fine-tuning method based on a metakernel architecture and coefficient sharing, characterized by comprising the following steps: S1, decomposing the original convolution kernels of a given deep neural network into metakernels and metakernel coefficients; S2, learning the metakernel coefficients from the pre-training weights and fixing them; S3, adding the pre-training weights as a bypass to the decomposed convolution kernels; S4, constructing the expert part of the mixture-of-experts framework by creating a plurality of expert branches that share coefficients; S5, designing and integrating a dynamic gating network; S6, setting the training strategy, including the optimizer, learning rate, batch size, number of metakernels, and number of experts; S7, performing forward computation and obtaining the loss, wherein data are propagated forward through the model, the gating network fuses the expert outputs to obtain a prediction, and a loss value is computed as the basis for parameter updating; S8, computing gradients by back-propagation and updating the network parameters with the optimizer; and S9, repeating steps S7 to S8 until the total training loss converges.
- 2. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 1, wherein step S1 specifically comprises: representing the original convolution kernel of the $l$-th layer of the given deep neural network as $W^{(l)} \in \mathbb{R}^{c \times c' \times k \times k}$, wherein $c'$ denotes the number of input channels, $c$ the number of output channels, and $k$ the convolution kernel size; decomposing the original convolution kernel into metakernels $D^{(l)} \in \mathbb{R}^{m \times k \times k}$ and metakernel coefficients $\alpha^{(l)} \in \mathbb{R}^{c \times c' \times m}$, wherein $m$ denotes the number of metakernels and $k$ the convolution kernel size; the decomposition of the convolution kernel of the $l$-th layer of the deep neural network is computed as $W^{(l)} = \alpha^{(l)} \times D^{(l)}$, wherein $\times$ denotes tensor multiplication, $\alpha^{(l)}$ denotes the metakernel coefficients of the $l$-th layer, and $D^{(l)}$ denotes the metakernels of the $l$-th layer; under the convolution kernel decomposition, the computation of the $l$-th layer on the input $X^{(l)}$ with the metakernels $D^{(l)}$ and the metakernel coefficients $\alpha^{(l)}$ is divided into two steps, a spatial convolution process and a channel fusion process, with input $X^{(l)} \in \mathbb{R}^{c' \times h \times w}$, wherein $h$ and $w$ denote the height and width of the input, respectively; the input $X^{(l)}$ and the metakernels $D^{(l)}$ undergo spatial convolution only, giving intermediate features $Z^{(l)}$, and the intermediate features $Z^{(l)}$ and the metakernel coefficients $\alpha^{(l)}$ undergo channel fusion, giving the final output $Y^{(l)}$; the spatial-convolution-only process of the input $X^{(l)}$ and the metakernels $D^{(l)}$ is specifically: in the $l$-th layer, each channel $X_j^{(l)}$ ($j = 1, \dots, c'$) of the input is convolved with each metakernel $D_t^{(l)}$ ($t = 1, \dots, m$), obtaining intermediate features $Z^{(l)}$ with $c' \cdot m$ channels; the channel fusion process of the intermediate features $Z^{(l)}$ and the metakernel coefficients $\alpha^{(l)}$ is specifically: in the $l$-th layer, the intermediate features $Z^{(l)}$ are combined under the channel weights given by the metakernel coefficients $\alpha^{(l)}$, which consist of $c \times c'$ components $\alpha^{(l)}_{i,j} \in \mathbb{R}^{m}$, wherein $\alpha^{(l)}_{i,j}$ denotes the weights assigned to the $m$ intermediate channels generated from input channel $j$ for output channel $i$; through these $c \times c'$ weighted combinations the final output $Y^{(l)} \in \mathbb{R}^{c \times h' \times w'}$ is obtained, wherein $h'$ and $w'$ denote the height and width of the output, respectively (a minimal code sketch of this two-step computation is given after the claims).
- 3. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 2, wherein step S2 specifically comprises: using the ideas of dictionary learning and sparse coding so that the metakernel coefficients learned from the pre-training weights approach the optimal channel combination; dictionary learning and sparse coding aim, given the original weights $W$, to fit the original weights $W$ as closely as possible by the matrix product of a sparse coding matrix $\alpha$ and a dictionary $D$, while requiring the sparse coding matrix $\alpha$ to be as sparse as possible, with the optimization objective expressed as $\min_{\alpha, D} \lVert W - \alpha D \rVert_F^2 + \lambda \lVert \alpha \rVert_1$, wherein $\lambda$ denotes the Lagrangian multiplier, the first term drives the reconstruction to fit the original weights, and the second term is a sparsity-inducing penalty; $W \in \mathbb{R}^{(c \cdot c') \times (k \cdot k)}$ denotes the reshaped pre-training weights, $c'$ the number of input channels, $c$ the number of output channels, and $k$ the convolution kernel size; $D \in \mathbb{R}^{m \times (k \cdot k)}$, with $m$ denoting the number of metakernels, and $\alpha \in \mathbb{R}^{(c \cdot c') \times m}$ denotes the metakernel coefficients; in the learning process, the metakernel coefficients are fixed in the first stage while the metakernels are trained, the metakernels are fixed in the second stage while the metakernel coefficients are trained, and by alternately optimizing these two stages several times the metakernel coefficients $\alpha$ converge; after convergence the optimal metakernel coefficients are obtained and kept frozen during the subsequent fine-tuning, with only the metakernels remaining trainable (see the alternating-optimization sketch after the claims).
- 4. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 3, wherein step S3 specifically comprises: adding the decomposed convolution kernel weights of the $l$-th layer and the pre-training weights of the $l$-th layer to form the final convolution kernel of the $l$-th layer, specifically computed as $W^{(l)}_{\mathrm{final}} = W^{(l)}_{\mathrm{pre}} + W^{(l)}_{\mathrm{dec}}$, wherein $W^{(l)}_{\mathrm{final}}$ denotes the final fine-tuned weights of the $l$-th layer, $W^{(l)}_{\mathrm{pre}}$ denotes the pre-training weights of the $l$-th layer, and $W^{(l)}_{\mathrm{dec}}$ denotes the weights of the $l$-th layer obtained with the convolution kernel decomposition technique.
- 5. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 4, wherein step S4 specifically comprises: constructing the expert part of the mixture-of-experts framework on the basis of the convolution kernel structure built in steps S1 to S3; creating $N$ expert branches with identical structure; each expert $n$ ($n = 1, \dots, N$) comprises a separate set of trainable metakernels $D^{(l)}_n$, but all $N$ experts share the same set of fixed metakernel coefficients $\alpha^{(l)}$ obtained in step S2 (see the expert-construction sketch after the claims).
- 6. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 1, wherein step S5 specifically comprises: designing the gating network structure, wherein, given an input $X$, the input is fed into the gating network $G$, which assigns a weight to each of the $N$ experts, computed as $g = [g_1, \dots, g_N] = G(X)$; at the same time $X$ is fed into each of the $N$ experts respectively, and the features extracted by the $n$-th expert are computed as $F_n = f(X; W^{(l)}_{\mathrm{final},n})$, wherein $f$ denotes the neural network, i.e. the output obtained after the input $X$ is convolved with the layer fine-tuning weights $W^{(l)}_{\mathrm{final},n}$ of expert $n$; the weights $g_n$ generated by the gating network and the features $F_n$ extracted by each expert are combined by weighted summation to obtain the fused feature representation $F = \sum_{n=1}^{N} g_n \cdot F_n$, wherein $\cdot$ denotes scalar multiplication; the fused features $F$ are then input to the fully connected layer $\phi$ for linear transformation and classification, giving the final prediction output of the network $\hat{y} = \phi(F)$; during training, all metakernel coefficients $\alpha$ are fixed, and only the metakernels $D_n$, the gating network $G$, and the linear layer $\phi$ are fine-tuned (see the gating-and-fusion sketch after the claims).
- 7. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 6, wherein step S6 specifically comprises: setting the batch size to 256, adopting Adam as the optimizer, setting the initial learning rate to 0.001 and the weight decay to 0.0001; configuring the number of metakernels $m$; configuring the number of experts $N$; and initializing the convolution kernels of the gating network with He (Kaiming) initialization referenced to the number of output-channel connections (fan-out), adding a batch normalization layer after the output of each convolution layer.
- 8. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 7, wherein step S7 specifically comprises: feeding the input sample into the gating network designed in step S5 for processing, obtaining the weight $g_n$ assigned to each of the $N$ experts; feeding the input sample into all $N$ expert branches built in step S4 for parallel processing, each expert $n$ performing its computation with its own metakernels $D_n$ and the shared metakernel coefficients $\alpha$ and outputting the processed features $F_n$; performing weighted summation of the corresponding features $F_n$ with the obtained weights $g_n$ to output the weighted sum $F$ of the expert features; feeding the weighted features $F$ into the fully connected layer of the deep neural network to obtain the final prediction; and computing the loss value $L$ from the prediction result and the ground-truth label.
- 9. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 8, wherein step S8 specifically comprises: computing by back-propagation the gradients of the loss value $L$ with respect to the trainable parameters, and updating the trainable parameters with an optimization algorithm based on adaptive moment estimation; in each training batch, after the gradient $g_t = \nabla_{\theta} L$ of the loss value $L$ with respect to the current trainable network parameters $\theta$ has been computed according to the chain rule, the Adam optimizer updates the parameters as follows: the first- and second-moment estimates are computed as $m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, g_t$ and $v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, g_t^2$, wherein $t$ is the current iteration step, $\beta_1$ and $\beta_2$ are the exponential-decay-rate hyperparameters of the moment estimates, set to 0.9 and 0.999 respectively, and $m_t$ and $v_t$ are the first- and second-moment estimates of the gradient, respectively; since $m_t$ and $v_t$ are initialized to 0, the estimates are biased towards zero in the early training steps, so bias correction is performed to obtain unbiased estimates $\hat{m}_t = m_t / (1 - \beta_1^{\,t})$ and $\hat{v}_t = v_t / (1 - \beta_2^{\,t})$, wherein $\hat{m}_t$ is the bias-corrected first-moment estimate of the gradient and $\hat{v}_t$ the bias-corrected second-moment estimate; the parameter update is then computed with the corrected moment estimates and the network parameters are updated as $\theta_t = \theta_{t-1} - \eta\, \hat{m}_t / (\sqrt{\hat{v}_t} + \epsilon)$, wherein $\eta$ is the global learning rate and $\epsilon$ is a constant for maintaining numerical stability.
- 10. The extensible expert network fine-tuning method based on metakernel architecture and coefficient sharing according to claim 9, wherein step S9 specifically comprises: repeating steps S7 to S8 until the total training loss $L$ converges; and defining, from the final converged model parameters, the fine-tuned convolution-kernel-decomposition-based mixture-of-experts network DCK_MoE, and applying this network to the test set sample data to obtain the final task recognition result (the training-loop sketch after the claims covers steps S6 to S9).
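The following is a minimal, hypothetical PyTorch sketch of the two-step metakernel convolution described in claim 2: a spatial-only grouped convolution of every input channel with every metakernel, followed by a 1×1 channel fusion with the coefficients. The class and argument names (`MetaKernelConv`, `num_meta`), the grouped-convolution realization, and the shape conventions are assumptions for illustration, not the patent's reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetaKernelConv(nn.Module):
    """Convolution whose kernel is factored into m metakernels (k x k) and
    per-(output, input)-channel coefficients, as in claim 2 (shapes assumed)."""
    def __init__(self, in_ch, out_ch, k=3, num_meta=4):
        super().__init__()
        self.in_ch, self.out_ch, self.num_meta, self.k = in_ch, out_ch, num_meta, k
        # metakernels D: (m, k, k) -- trainable during fine-tuning
        self.meta_kernels = nn.Parameter(torch.randn(num_meta, k, k) * 0.02)
        # coefficients alpha: (out_ch, in_ch, m) -- frozen after step S2
        self.coeff = nn.Parameter(torch.randn(out_ch, in_ch, num_meta) * 0.02,
                                  requires_grad=False)

    def forward(self, x):                                     # x: (B, in_ch, H, W)
        b, c_in, h, w = x.shape
        # step 1: spatial-only convolution of each input channel with each
        # metakernel; grouped conv keeps channels apart -> (B, in_ch*m, H, W)
        d = self.meta_kernels.unsqueeze(1)                    # (m, 1, k, k)
        d = d.repeat(c_in, 1, 1, 1)                           # (in_ch*m, 1, k, k)
        z = F.conv2d(x, d, padding=self.k // 2, groups=c_in)  # intermediate Z
        # step 2: channel fusion with the coefficients, realized as a 1x1
        # convolution whose weight is alpha reshaped to (out_ch, in_ch*m, 1, 1)
        w_fuse = self.coeff.reshape(self.out_ch, c_in * self.num_meta, 1, 1)
        return F.conv2d(z, w_fuse)                            # (B, out_ch, H, W)

# usage (illustrative): y = MetaKernelConv(64, 128)(torch.randn(2, 64, 32, 32))
```

Note that ordering matters: the grouped convolution produces intermediate channel $j \cdot m + t$ from input channel $j$ and metakernel $t$, and reshaping `coeff` row-major lines the fusion weights up with exactly that ordering.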
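A minimal NumPy sketch of the two-stage alternating optimization in claim 3: fix the coefficients and refit the dictionary (metakernels) by least squares, then fix the dictionary and update the sparse coefficients with proximal (soft-threshold) steps. The choice of least-squares/ISTA updates, the reshaping of the weight to (c·c') × (k·k), and the function name are illustrative assumptions; the patent only specifies the objective and the alternating schedule.

```python
import numpy as np

def learn_meta_coefficients(W, m, lam=0.01, iters=50, inner=20, lr=1.0):
    """Alternating optimization of  min ||W - A @ D||_F^2 + lam * ||A||_1
    (claim 3): stage 1 fixes the coefficients A and refits the dictionary D,
    stage 2 fixes D and updates A with soft-thresholded gradient steps.
    W is a pre-trained layer weight reshaped to (c * c', k * k)."""
    n, d = W.shape
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, m)) * 0.01      # metakernel coefficients (sparse code)
    D = rng.standard_normal((m, d)) * 0.01      # dictionary = flattened metakernels
    for _ in range(iters):
        # stage 1: D <- argmin ||W - A D||_F^2  (ridge-regularized least squares)
        D = np.linalg.solve(A.T @ A + 1e-6 * np.eye(m), A.T @ W)
        # stage 2: ISTA steps on A with soft-thresholding for the L1 penalty
        step = lr / (np.linalg.norm(D, 2) ** 2 + 1e-8)
        for _ in range(inner):
            grad = (A @ D - W) @ D.T
            A = A - step * grad
            A = np.sign(A) * np.maximum(np.abs(A) - step * lam, 0.0)
    return A, D   # A is frozen afterwards; only the metakernels stay trainable

# usage (illustrative): A, D = learn_meta_coefficients(W.reshape(-1, k * k), m=4)
```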
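A hedged sketch of claims 4 and 5: each expert branch reconstructs its decomposed kernel from the shared, frozen coefficients and its own trainable metakernels, adds the frozen pre-trained weight as a bypass, and convolves the input with the result; the class names, zero-initialization of the metakernels, and the way the shared coefficients are passed in are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertBranch(nn.Module):
    """One expert branch (claims 4-5 sketch): effective kernel is
    W_final = W_pre (frozen pre-trained bypass) + alpha x D, where the
    coefficients alpha are shared by all experts and frozen, and only this
    expert's metakernels D are trainable."""
    def __init__(self, pretrained_weight, shared_coeff, num_meta):
        super().__init__()
        out_ch, in_ch, k, _ = pretrained_weight.shape
        self.register_buffer("w_pre", pretrained_weight.clone())   # frozen bypass
        self.register_buffer("coeff", shared_coeff)                # (out, in, m), frozen
        self.meta = nn.Parameter(torch.zeros(num_meta, k, k))      # trainable metakernels
        self.k = k

    def forward(self, x):
        # W_dec[o, i] = sum_t coeff[o, i, t] * D[t]   (claim 2 decomposition)
        w_dec = torch.einsum("oim,mkl->oikl", self.coeff, self.meta)
        w_final = self.w_pre + w_dec                                # claim 4 bypass addition
        return F.conv2d(x, w_final, padding=self.k // 2)

def build_experts(pretrained_weight, shared_coeff, num_experts, num_meta):
    # claim 5: separate trainable metakernels per expert, one shared frozen alpha
    return nn.ModuleList(ExpertBranch(pretrained_weight, shared_coeff, num_meta)
                         for _ in range(num_experts))
```

Initializing the metakernels to zero makes every expert start exactly at the pre-trained weights, which is one reasonable (assumed) choice for the bypass formulation.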
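A minimal sketch of the dynamic gating and fusion in claim 6: a gate produces softmax weights over the N experts, the expert features are summed with those weights, and a linear layer produces the prediction. The pooled-feature form of the gate, the global average pooling before the classifier, and all layer sizes are illustrative assumptions; it is assumed every expert branch preserves spatial size and outputs `feat_ch` channels.

```python
import torch
import torch.nn as nn

class GatedMoEHead(nn.Module):
    """Claim 6 sketch: g = softmax(G(x)), F = sum_n g_n * F_n, y_hat = phi(F)."""
    def __init__(self, experts, in_ch, feat_ch, num_classes):
        super().__init__()
        self.experts = experts                            # nn.ModuleList of expert branches
        self.gate = nn.Sequential(                        # gating network G (assumed form)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, len(experts)))
        self.classifier = nn.Linear(feat_ch, num_classes) # linear layer phi

    def forward(self, x):                                 # x: (B, in_ch, H, W)
        g = torch.softmax(self.gate(x), dim=1)            # (B, N) expert weights
        feats = torch.stack([e(x) for e in self.experts], dim=1)   # (B, N, C, H, W)
        fused = (g[:, :, None, None, None] * feats).sum(dim=1)     # weighted sum of experts
        pooled = fused.mean(dim=(2, 3))                   # (B, C) global average pool
        return self.classifier(pooled)                    # final prediction
```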
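Finally, a hedged end-to-end training sketch for steps S6 to S9 (claims 7 to 10): Adam with the hyperparameters stated in claim 7, cross-entropy loss, and a loop that runs forward, backward, and the optimizer step until the epoch loss stops improving. `model` and `train_loader` are placeholders, the convergence test is an assumption, and only the parameters left trainable (metakernels, gating network, classification layer) are passed to the optimizer, matching claim 6.

```python
import torch
import torch.nn as nn

def fine_tune(model, train_loader, epochs=50, tol=1e-4, device="cuda"):
    """Steps S6-S9 (sketch): coefficients and pre-trained bypass weights stay
    frozen; only parameters with requires_grad=True are updated."""
    model.to(device)
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=1e-3, weight_decay=1e-4)  # claim 7 settings
    criterion = nn.CrossEntropyLoss()
    prev_loss = float("inf")
    for epoch in range(epochs):
        total, count = 0.0, 0
        for x, y in train_loader:                  # batch size 256 per claim 7
            x, y = x.to(device), y.to(device)
            logits = model(x)                      # S7: forward pass, gated expert fusion
            loss = criterion(logits, y)            # S7: loss value
            optimizer.zero_grad()
            loss.backward()                        # S8: gradients by back-propagation
            optimizer.step()                       # S8: Adam parameter update
            total += loss.item() * x.size(0)
            count += x.size(0)
        epoch_loss = total / count
        if abs(prev_loss - epoch_loss) < tol:      # S9: stop when the loss converges
            break
        prev_loss = epoch_loss
    return model
```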
Description
Extensible expert network fine tuning method based on metacore architecture and coefficient sharing
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to an extensible expert network fine-tuning method based on a metakernel architecture and coefficient sharing.
Background
In recent years, large pre-trained models represented by deep convolutional neural networks and Transformers have advanced rapidly in fields such as computer vision and natural language processing. However, when these generic pre-trained models are applied to specific downstream tasks (e.g. image classification, object detection), full fine-tuning, i.e. updating all parameters of the model, is typically required. This process incurs significant computation, memory, and storage overhead. Parameter-Efficient Fine-Tuning (PEFT) addresses this problem by updating only a small portion of the parameters (typically on the order of millions), maintaining the generalization performance of the pre-trained model while adapting it to downstream applications. Among existing parameter-efficient fine-tuning methods, Low-Rank Adaptation (LoRA) and its variants are a widely used representative technique. LoRA freezes the pre-training weights and trains an additional low-rank matrix, achieving storage and computational efficiency, but it is prone to overfitting when data are limited. In addition, the number of LoRA parameters grows linearly with the rank, further increasing the number of trainable parameters, which remains unfriendly for memory-limited fine-tuning environments. Although LoRA can reduce the number of updated parameters to no more than 1% of full-parameter fine-tuning, in deeper and larger networks LoRA still needs to increase the rank within a certain range to guarantee fine-tuning performance, which inevitably increases the amount of trainable parameters. Another class of tuning methods based on convolution filter decomposition (i.e. tuning methods based on filter subspaces) follows LoRA's idea of building an intrinsic low-rank dimension: it decomposes the convolution kernel into atoms and atomic coefficients and fine-tunes only the atoms. The atoms carry only a relatively small number of parameters and can be regarded as basis elements; by linearly combining this basis with the atomic coefficients, convolution filter decomposition can reconstruct the parameter space in a lower dimension. This fine-tuning paradigm enables updates of high-dimensional weights with a smaller number of parameters, but it may struggle to adequately capture features on complex tasks. Filter-subspace-based fine-tuning further decomposes atoms into sub-atoms and sub-atom coefficients to increase the number of trainable parameters; while the performance improvement is obvious, the parameter count also grows rapidly.
Disclosure of Invention
The invention mainly aims to overcome the defects and shortcomings of the prior art and provides an extensible expert network fine-tuning method based on a metakernel architecture and coefficient sharing.
In order to achieve the above purpose, the present invention adopts the following technical scheme. An extensible expert network fine-tuning method based on a metakernel architecture and coefficient sharing comprises the following steps: S1, decomposing the original convolution kernels of a given deep neural network into metakernels and metakernel coefficients; S2, learning the metakernel coefficients from the pre-training weights and fixing them; S3, adding the pre-training weights as a bypass to the decomposed convolution kernels; S4, constructing the expert part of the mixture-of-experts framework and creating a plurality of expert branches sharing coefficients; S5, designing and integrating a dynamic gating network; S6, setting the training strategy, including the optimizer, learning rate, batch size, number of metakernels, and number of experts; S7, performing forward computation and obtaining the loss, wherein data are propagated forward through the model, the gating network fuses the expert outputs to obtain a prediction, and a loss value is computed as the basis for parameter updating; S8, computing gradients by back-propagation and updating the network parameters with the optimizer; and S9, repeating steps S7 to S8 until the total training loss converges. Compared with the prior art, the invention has the following advantages and beneficial effects: 1. The invention achieves a double breakthrough in parameter efficiency and expressive capability: a mixture-of-experts (MoE) basic framework based on dynamic routing is constructed, and a fine-tuning method based on convolution kernel decomposition is introduced into the expert network design, combining the low-parameter advantage of convolution kernel decomposition with the dynamic adaptability of the MoE architecture.