CN-122021792-A - Competitive knowledge distillation based on genetic constraint optimization
Abstract
The invention discloses a competitive knowledge distillation method based on genetic constraint optimization, aiming to solve the problems that, in existing knowledge distillation techniques, gradient-dependent optimization is prone to local optima and poor stability when the semantic gap between the teacher model and the student model is large. The method constructs a population of candidate student models and evolves it iteratively under genetic constraint optimization. It mainly comprises the following steps: first, a Multi-Student Selection Module (MSSM) evaluates and ranks the candidate student population based on distillation loss and task loss, retaining high-quality student models; second, a Student Knowledge Sharing Module (SKSM) performs crossover and recombination between the best and second-best students to generate a new generation of student models with greater potential; third, a Hybrid Mutation Module (HMM) perturbs the newly generated student models, combining local mutation guided by the teacher model's intermediate features with global mutation driven by Gaussian noise, and controlling the weights of the two mutation types through an adaptive adjustment factor to balance exploration and convergence; finally, a multi-objective optimization strategy searches for Pareto-optimal solutions trading off task performance against distillation efficiency, and outputs a student model that is both high-performing and lightweight. The invention enhances global search capability relative to purely gradient-based optimization and remarkably improves the robustness, generalization ability, and deployment feasibility of knowledge distillation.
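For orientation, the following is a minimal, framework-agnostic sketch of the evolutionary loop summarized above, with candidate students represented as flat parameter vectors and the two objectives replaced by stand-in functions; every name and hyperparameter here (population size, perturbation scales, adaptation schedule) is an illustrative assumption rather than the patented implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def task_loss(params):
    # stand-in for the supervised task loss (e.g., cross-entropy on labelled data)
    return float(np.mean((params - 1.0) ** 2))

def distill_loss(params):
    # stand-in for the distillation loss (divergence from the teacher's outputs)
    return float(np.mean(params ** 2))

def rank_students(pop):
    # MSSM: score each candidate on both losses and sort best-first
    scores = [task_loss(p) + distill_loss(p) for p in pop]
    return [pop[i] for i in np.argsort(scores)]

def crossover(best, second, alpha=0.5):
    # SKSM: arithmetic crossover of the top-1 and top-2 students' parameters
    return alpha * best + (1.0 - alpha) * second

def hybrid_mutate(child, teacher_anchor, step, total_steps):
    # HMM: blend a teacher-guided local perturbation with Gaussian global noise;
    # the adaptive factor shifts weight from exploration to convergence over time
    beta = step / total_steps
    local = 0.05 * (teacher_anchor - child)        # directed pull toward the teacher
    noise = 0.05 * rng.normal(size=child.shape)    # undirected global exploration
    return child + beta * local + (1.0 - beta) * noise

dim, pop_size, total_steps = 8, 6, 50
teacher_anchor = np.ones(dim)                      # placeholder for teacher guidance
pop = [rng.normal(size=dim) for _ in range(pop_size)]

for step in range(total_steps):
    ranked = rank_students(pop)
    elite = ranked[: pop_size // 2]                # retain high-quality students
    child = hybrid_mutate(crossover(elite[0], elite[1]), teacher_anchor, step, total_steps)
    fresh = [rng.normal(size=dim) for _ in range(pop_size - len(elite) - 1)]
    pop = elite + [child] + fresh                  # next-generation population

best = rank_students(pop)[0]
print("best combined loss:", task_loss(best) + distill_loss(best))
```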
Inventors
- WU YIRUI
- ZHEN CHENG
Assignees
- 河海大学 (Hohai University)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-13
Claims (1)
- 1. A competitive knowledge distillation method based on genetic constraint optimization, characterized by comprising the following steps:
  1. Multi-Student Selection Module (MSSM). This module performs the "selection" operation of the genetic algorithm. The student model population is ranked according to the performance scores obtained in the previous stage. A typical strategy is a selection method based on multi-objective optimization, such as the Non-dominated Sorting Genetic Algorithm II (NSGA-II), which treats the task loss and the distillation loss as two independent optimization objectives and searches for the Pareto-optimal solution set that performs well on both. In this way, a set of "elite" students that excel in both task performance and knowledge modelling can be selected as the basis for the next generation.
  2. Student Knowledge Sharing Module (SKSM). This module corresponds to the "crossover" operation of the genetic algorithm and aims to generate offspring with greater potential through knowledge recombination among high-quality individuals. Specifically, from the elite students selected by the MSSM, the best-performing (top-1) and second-best (top-2) student models are chosen, and a new child student model is generated by applying a crossover operation (such as single-point, multi-point, or arithmetic crossover) to their network parameters (weights and biases). This design simulates the recombination of "good genes", so that newly generated students inherit and fuse the knowledge of multiple excellent parents, thereby accelerating convergence and escaping local optima.
  3. Hybrid Mutation Module (HMM). This module performs the "mutation" operation, introducing new diversity into the student population and preventing premature convergence. The invention designs a hybrid mutation strategy that combines local guidance with global exploration: (1) local guided mutation, in which intermediate-layer feature maps of the teacher model are used as knowledge anchors to apply small, directed perturbations to the parameters of the corresponding layers of the student model, guiding the student to better align with the teacher's internal representations; (2) global exploratory mutation, in which random perturbations, such as noise vectors sampled from a Gaussian or Cauchy distribution, are applied to the student model's parameters, allowing a broader, undirected exploration of the parameter space; (3) adaptive adjustment, in which an adaptive adjustment factor balances exploration and convergence: in the early stage of training, global exploratory mutation is given a higher weight to encourage wide search, and as training progresses the weight of local guided mutation is gradually increased to help the model fine-tune within promising regions.
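A minimal sketch of the Pareto-based ranking attributed to the MSSM in the claim above, assuming each candidate student has already been scored on the two objectives (task loss and distillation loss). This is a plain non-dominance filter rather than a full NSGA-II with crowding distance, and the scores are illustrative.

```python
from typing import List, Tuple

def dominates(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    # a dominates b if it is no worse on both objectives and strictly better on one
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])

def pareto_front(scores: List[Tuple[float, float]]) -> List[int]:
    # indices of candidates not dominated by any other candidate
    return [i for i, s in enumerate(scores)
            if not any(dominates(t, s) for j, t in enumerate(scores) if j != i)]

# scores[i] = (task_loss, distillation_loss) for candidate student i
scores = [(0.42, 0.10), (0.35, 0.18), (0.50, 0.05), (0.45, 0.12)]
elite = pareto_front(scores)
print("elite candidate indices:", elite)  # (0.45, 0.12) is dominated by (0.42, 0.10)
```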
Description
Competitive knowledge distillation based on genetic constraint optimization

Technical Field
The invention relates to the technical field of artificial intelligence and model compression, in particular to a knowledge distillation method for deep neural network optimization, and more particularly to a multi-student competitive knowledge distillation method based on a genetic constraint optimization strategy.

Background
Deep learning has achieved revolutionary breakthroughs in many fields such as computer vision and natural language processing. However, in pursuit of higher performance, modern Deep Neural Networks (DNNs) have become increasingly deep and structurally complex, with parameter counts reaching the millions or even billions. This complexity means that such models consume substantial computing resources and memory during training and inference, greatly limiting their direct deployment on resource-constrained devices (e.g., smartphones, embedded systems, edge computing nodes). As artificial intelligence moves toward the edge, developing efficient, lightweight models has become a central concern of the industry.

Knowledge Distillation (KD) has emerged as an effective model compression technique. Its core idea is to migrate the knowledge contained in a large, complex "teacher model" into a small, lightweight "student model". Teacher models are usually pre-trained on large amounts of data and have strong generalization ability. Their output soft labels (i.e., the probability distributions of the softmax layer) contain not only the information of the correct answer but also the similarity relations among categories, providing the student model with richer supervision than traditional hard labels (one-hot encodings). By mimicking the teacher model's soft labels or intermediate-layer features (feature-based distillation), the student model can approach, or in some cases surpass, the performance of the teacher model while significantly reducing the number of parameters and the computational complexity.

However, conventional knowledge distillation methods mostly rely on gradient back-propagation for optimization. When there is a large difference between the network structure and parameter scale of the teacher model and the student model (i.e., a large "semantic gap"), this gradient-based optimization approach faces a series of challenges:
1. Local-optimum traps: gradient descent tends to fall into local optima under the multi-objective loss of knowledge distillation (typically a task loss plus a distillation loss). The optimization directions of the task loss and the distillation loss may conflict, causing the model to converge to a suboptimal point in parameter space and fail to reach globally optimal performance.
2. Unstable optimization: the large capacity gap between models makes the knowledge transfer process very sensitive. Gradient explosion or vanishing may occur during propagation, making training unstable, causing large fluctuations in student performance, and making results difficult to reproduce.
3. Dependence on model initialization: the final performance of a student model depends to a large extent on the initial state of its parameters. Different initializations may cause the model to converge to very different performance levels, increasing training uncertainty.
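As a reference point for the gradient-based baseline discussed above, the conventional soft-label distillation objective can be sketched as a temperature-softened KL term plus the ordinary task loss; the PyTorch form below uses an illustrative temperature and weighting, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, lam=0.5):
    # distillation term: KL divergence between temperature-softened teacher and student outputs
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    # task term: ordinary cross-entropy against the hard labels
    hard = F.cross_entropy(student_logits, labels)
    return lam * soft + (1.0 - lam) * hard

student_logits = torch.randn(8, 10)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(kd_loss(student_logits, teacher_logits, labels))
```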
To alleviate these problems, some studies have introduced "teaching assistant models", bridging knowledge transfer through one or more intermediate models whose parameter scale lies between the teacher and the student. However, this approach not only increases the overall complexity and computational cost of training, but the design and selection of the teaching assistant model is itself a difficult problem, and an unsuitable teaching assistant may even cause knowledge to be distorted or lost during transfer. Therefore, a new knowledge distillation paradigm is needed that can shed the strong dependence on gradient information, offers stronger global search capability, and can adaptively explore the optimal student model structure. Against this background, the invention creatively combines non-gradient optimization ideas (in particular, genetic algorithms) with a multi-student competition mechanism, aiming to fundamentally resolve the bottlenecks of traditional knowledge distillation.

Disclosure of Invention
The invention aims to provide a multi-student competitive knowledge distillation method and system based on genetic constraint optimization, addressing the problems that existing gradient-based knowledge distillation methods are prone to local optima, unstable training, and limited generalization ability when handling large semantic differences between a teacher model and a student model.