CN-121998029-A - Teacher model guided student model diffusion self-distillation method and device
Abstract
The invention provides a teacher-model-guided student model diffusion self-distillation method comprising: loading a pre-trained teacher model and a student model and starting to train the student model; respectively extracting teacher features and original student features from the training data; guiding, through the teacher model, a diffusion model to perform denoising sampling on the original student features and generate corresponding denoised student features; performing self-distillation training based on the denoised student features and the original student features; calculating the self-distillation training loss of the student model and updating the parameters of the student model; and executing these steps cyclically until training of the student model is finished. The invention also provides a teacher-model-guided student model diffusion self-distillation device, a storage medium, and an electronic device. The invention can thus train the student model more effectively, thereby improving its performance.
Inventors
- YANG CHUANGUANG
- WANG YU
- AN ZHULIN
- HUANG LIBO
- XU YONGJUN
Assignees
- Institute of Computing Technology, Chinese Academy of Sciences (中国科学院计算技术研究所)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-19
Claims (10)
- 1. A teacher-model-guided student model diffusion self-distillation method, comprising: an initialization step of loading a pre-trained teacher model and a pre-trained student model and starting to train the student model; a feature extraction step of respectively extracting teacher features and original student features from the training data; a denoising sampling step of guiding, through the teacher model, a diffusion model to perform denoising sampling on the original student features and generate corresponding denoised student features; a self-distillation step of performing self-distillation training based on the denoised student features and the original student features; a calculation-and-update step of calculating the self-distillation training loss of the student model and updating the parameters of the student model; and cyclically executing the feature extraction step, the denoising sampling step, the self-distillation step, and the calculation-and-update step until training of the student model is finished (an end-to-end illustrative sketch of this training loop follows the claims).
- 2. The teacher-model-guided student model diffusion self-distillation method of claim 1, wherein the initialization step further comprises: reading a data set; loading the pre-trained teacher model and freezing its parameters; initializing the student model; and starting to train the student model.
- 3. The teacher-model-guided student model diffusion self-distillation method of claim 1, wherein the denoising sampling step further comprises: guiding, through the teacher classifier of the teacher model, the diffusion model to perform reverse denoising sampling on the original student features, obtaining denoised student features carrying teacher semantic information.
- 4. The teacher-model-guided student model diffusion self-distillation method of claim 3, wherein the denoising sampling step further comprises: guiding the diffusion model through the teacher classifier to generate a denoised image:

  $p_{\theta,\phi}(x_t \mid x_{t+1}, y) = Z\, p_\theta(x_t \mid x_{t+1})\, p_\phi(y \mid x_t)$ (Equation 1)

  where Z is a normalization constant, $p_\theta(x_t \mid x_{t+1})$ is the unconditional reverse denoising process following a denoising diffusion probabilistic model, $\theta$ is the diffusion model parameter, $p_\phi(y \mid x_t)$ is a pre-trained classifier, $x_t$ denotes the noisy image at step t, and y is a class label;

  denoting the original student feature by $F^s \in \mathbb{R}^{H \times W \times C}$ and its step-t noisy version by $x_t$, where H, W, and C denote the feature height, width, and number of channels, respectively; using the teacher model as the pre-trained classifier for the original student feature $F^s$ and performing T-step teacher-model-guided diffusion sampling in the form:

  $x_t \sim p_{\theta,\phi}(x_t \mid x_{t+1}, y), \quad t = T-1, \dots, 0$ (Equation 2)

  where $p_{\theta,\phi}(x_t \mid x_{t+1}, y)$ is a conditional Markov process with noise predictor parameters $\theta$ that, conditioned on the teacher classifier, samples the original student feature from $x_{t+1}$ to $x_t$, and $p_\phi(y \mid x_t)$ denotes the above teacher classifier, which infers the conditional probability of class y from the original student feature $x_t$; the teacher classifier typically comprises a global average pooling layer and a linear weight matrix and outputs a class probability distribution;

  using the diffusion model, a Gaussian distribution is used to predict $x_t$ from $x_{t+1}$:

  $p_\theta(x_t \mid x_{t+1}) = \mathcal{N}(\mu, \Sigma)$ (Equation 3)

  where $\mu$ and $\Sigma$ are the predicted mean and covariance; the logarithmic form of Equation 3 is:

  $\log p_\theta(x_t \mid x_{t+1}) = -\tfrac{1}{2}(x_t - \mu)^{\top} \Sigma^{-1} (x_t - \mu) + C$ (Equation 4)

  when the number of diffusion steps tends to infinity, it can be deduced that $\|\Sigma\| \to 0$; in this limit the curvature of $\log p_\phi(y \mid x_t)$ is small compared with $\Sigma^{-1}$, so a first-order Taylor expansion of $\log p_\phi(y \mid x_t)$ around $x_t = \mu$ is performed:

  $\log p_\phi(y \mid x_t) \approx \log p_\phi(y \mid x_t)\big|_{x_t=\mu} + (x_t - \mu)^{\top} \nabla_{x_t} \log p_\phi(y \mid x_t)\big|_{x_t=\mu} = (x_t - \mu)^{\top} g + C_1$ (Equation 5)

  where $g = \nabla_{x_t} \log p_\phi(y \mid x_t)\big|_{x_t=\mu}$ and $C_1$ is regarded as a constant; the logarithmic form of Equation 2 then becomes:

  $\log\big(p_\theta(x_t \mid x_{t+1})\, p_\phi(y \mid x_t)\big) \approx -\tfrac{1}{2}(x_t - \mu - \Sigma g)^{\top} \Sigma^{-1} (x_t - \mu - \Sigma g) + \tfrac{1}{2} g^{\top} \Sigma g + C_2$ (Equation 6)

  where $C_2$ is a constant; therefore, the conditional sampling strategy is approximately unconditional Gaussian sampling, but with the mean translated by $\Sigma g$; furthermore, a gradient scaling factor k is introduced to control the guidance strength of the teacher classifier, so that the teacher-model-guided diffusion sampling process is expressed as:

  $x_t \sim \mathcal{N}\big(\mu + k\, \Sigma\, \nabla_{x_t} \log p_\phi(y \mid x_t),\ \Sigma\big)$ (Equation 7)

  the gradient scaling coefficient k rescales the class probability distribution of the teacher model, with an effect proportional to $p_\phi(y \mid x_t)^k$; when k > 1 the distribution becomes sharper, the teacher classifier exerts stronger guidance, and denoised student features of higher fidelity can be sampled; after the T-step teacher-model-guided diffusion sampling process, the original student feature $F^s$, i.e. $x_T$, is converted into the denoised student feature $\hat{F}^s$, i.e. $x_0$, thereby obtaining the denoised student feature with teacher semantic information (an illustrative sketch of this guided sampling procedure follows the claims).
- 5. The teacher-model-guided student model diffusion self-distillation method of claim 1, wherein the calculation-and-update step further comprises: calculating a first feature distillation loss of the denoised student features in each training iteration; calculating a second feature distillation loss of the denoised student features over the overall training; and updating the parameters of the student model according to the second feature distillation loss.
- 6. The teacher-model-guided student model diffusion self-distillation method of claim 5, wherein the step of calculating a first feature distillation loss of the denoised student features in each training iteration further comprises: calculating, from the denoised student feature $\hat{F}^s$ and the original student feature $F^s$, the first feature distillation loss of the denoised student feature in each training iteration:

  $\mathcal{L}_{fd} = \big\| \hat{F}^s - F^s \big\|_2^2$ (Equation 8)

  where $F^s$ is the original student feature and $\hat{F}^s$ is the denoised student feature; and wherein the step of calculating the second feature distillation loss of the denoised student features over the overall training further comprises: calculating the second feature distillation loss of the denoised student features over the overall training through an overall loss function:

  $\mathcal{L} = \alpha\, \mathcal{L}_{task} + \beta\, \mathcal{L}_{kd} + \gamma\, \mathcal{L}_{diff} + \mathcal{L}_{fd}$ (Equation 9)

  where $\alpha$, $\beta$, $\gamma$ are loss weights balancing the partial losses, $\mathcal{L}_{task}$ is the loss between the student model's predictions and the ground-truth labels of the training data, $\mathcal{L}_{kd}$ is the loss between the probability distributions output by the student model and the teacher model, and $\mathcal{L}_{diff}$ is the loss for training the diffusion model on the teacher features (an illustrative sketch of this loss computation follows the claims).
- 7. The teacher-model-guided student model diffusion self-distillation method of claim 1, wherein the method is applied to a teacher-model-guided student model diffusion self-distillation system comprising the teacher model, the student model, the diffusion model, a student feature extractor $f_{\theta_s}$, a teacher feature extractor $f_{\theta_t}$, a student classifier $g_{\theta_s}$, a teacher classifier $g_{\theta_t}$, and a noise adapter; wherein $\theta_s$ denotes the student model parameters, $\theta_t$ denotes the teacher model parameters, x denotes the input picture, and y denotes the class label corresponding to the input picture.
- 8. A teacher-model-guided student model diffusion self-distillation device constructed based on the method of any one of claims 1 to 7, the device comprising: an initialization module for loading a pre-trained teacher model and a pre-trained student model and starting to train the student model; a feature extraction module for respectively extracting teacher features and original student features from the training data; a denoising sampling module for guiding, through the teacher model, the diffusion model to perform denoising sampling on the original student features and generate corresponding denoised student features; a self-distillation module for performing self-distillation training based on the denoised student features and the original student features; and a calculation-and-update module for calculating the self-distillation training loss of the student model and updating the parameters of the student model; the feature extraction module, the denoising sampling module, the self-distillation module, and the calculation-and-update module being executed cyclically until training of the student model is finished.
- 9. A storage medium storing a computer program for performing the method of any one of claims 1 to 7.
- 10. An electronic device comprising a storage medium, a processor, and a computer program stored on the storage medium and executable on the processor, characterized in that the processor implements the method of any one of claims 1 to 7 when executing the computer program.
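Claim 4 describes classifier-guided reverse diffusion sampling (Equations 1-7). Below is a minimal PyTorch sketch of that procedure; it is illustrative only, and the interfaces `diffusion_model(x_t, t)` (assumed to return the posterior mean and diagonal variance of $p_\theta(x_{t-1} \mid x_t)$), `teacher_classifier` (features to class logits), and the guidance scale `k` are assumptions, not the patent's actual implementation.

```python
import torch

def guided_denoise_step(diffusion_model, teacher_classifier, x_t, t, y, k=1.0):
    # One reverse step of Equation 7: sample from
    # N(mu + k * Sigma * grad log p_phi(y | x_t), Sigma).
    with torch.no_grad():
        mean, var = diffusion_model(x_t, t)  # assumed: posterior mean / diag variance

    # Gradient of the teacher classifier's log-probability w.r.t. the noisy feature.
    x_in = x_t.detach().requires_grad_(True)
    log_probs = torch.log_softmax(teacher_classifier(x_in), dim=-1)
    selected = log_probs[torch.arange(y.shape[0]), y].sum()
    grad = torch.autograd.grad(selected, x_in)[0]

    guided_mean = mean + k * var * grad  # mean translated by k * Sigma * g (Eq. 7)
    noise = torch.randn_like(x_t) if t > 0 else torch.zeros_like(x_t)
    return guided_mean + var.sqrt() * noise

def guided_sampling(diffusion_model, teacher_classifier, x_T, y, T, k=1.0):
    # T-step teacher-guided reverse process of claim 4:
    # noisy student feature x_T -> denoised student feature x_0.
    x = x_T
    for t in reversed(range(T)):
        x = guided_denoise_step(diffusion_model, teacher_classifier, x, t, y, k)
    return x
```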
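Likewise, a minimal sketch of the losses in claims 5-6 (Equations 8-9). The placement of the weights α, β, γ on the task, output-distillation, and diffusion terms follows the order listed in claim 6 and is an assumption; `diff_loss` is taken here as a precomputed diffusion-training loss on the teacher features.

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           denoised_feat, original_feat, diff_loss,
                           alpha=1.0, beta=1.0, gamma=1.0, tau=4.0):
    # Equation 8: first feature distillation loss (per iteration).
    l_fd = F.mse_loss(denoised_feat, original_feat)
    # Task loss between student predictions and ground-truth labels.
    l_task = F.cross_entropy(student_logits, labels)
    # KL divergence between temperature-softened output distributions.
    l_kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau ** 2
    # Equation 9: overall objective (weight placement is an assumption).
    return alpha * l_task + beta * l_kd + gamma * diff_loss + l_fd
```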
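Finally, an end-to-end sketch of the training loop of claim 1, reusing the two helpers above. The interfaces `teacher(x)` / `student(x)` returning a (features, logits) pair, `teacher.classifier`, and `diffusion.training_loss` are hypothetical conveniences for illustration only.

```python
import torch

def train_student(loader, teacher, student, diffusion, optimizer, T=50, k=1.0):
    teacher.eval()                       # initialization step: freeze the teacher
    for p in teacher.parameters():
        p.requires_grad_(False)
    for x, y in loader:
        t_feat, t_logits = teacher(x)    # feature extraction step
        s_feat, s_logits = student(x)
        # Denoising sampling step: treat the original student feature as x_T.
        denoised = guided_sampling(diffusion, teacher.classifier, s_feat, y, T, k)
        diff_loss = diffusion.training_loss(t_feat)  # assumed interface
        # Self-distillation + calculation-and-update steps.
        loss = self_distillation_loss(s_logits, t_logits, y,
                                      denoised.detach(), s_feat, diff_loss)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```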
Description
Teacher model guided student model diffusion self-distillation method and device

Technical Field

The invention relates to the technical field of model knowledge distillation, in particular to a teacher-model-guided student model diffusion self-distillation method, a teacher-model-guided student model diffusion self-distillation device, a storage medium, and an electronic device.

Background

In recent years, with the continuous development of deep learning, model parameter counts have kept growing, continuously driving up the cost of model training and deployment and making direct deployment difficult in power- and memory-limited scenarios such as mobile terminals and edge devices. Compressing and accelerating models while preserving their performance as much as possible has therefore become one of the important research directions in both industry and academia. Knowledge distillation is a classical model-compression method; its core idea is to use a stronger teacher model to supervise a student model during training, so that the student model performs better than it would when trained independently. Knowledge distillation methods can be classified into output distillation, feature distillation, and relation distillation. In output distillation, the student model imitates the final output of the teacher model: the final outputs are softened by a temperature coefficient and aligned with a loss such as the KL divergence, so that the student learns the teacher's soft labels. In feature distillation, the teacher's intermediate-layer feature representations (e.g., feature maps, channel responses, attention) are aligned with the student's, and the teacher's representation capability is transferred to the student through feature-matching losses or mapping transformations. In relation distillation, instead of directly aligning the outputs or feature values of individual samples, the structural relationships among samples, channels, or spatial positions (e.g., similarities, distances, angles, correlation matrices) are distilled, so that the student learns the relative structural information in the teacher's representations.

In the course of research on knowledge distillation methods, it was found that current feature distillation methods have certain problems and defects. Because teacher features and student features differ in semantic information and in how they are mapped in feature space, directly aligning and optimizing features between the teacher model and the student model is difficult. In prior knowledge distillation techniques, teacher features are used directly as supervisory signals for distilling the student model, so the student model learns redundant information that its shallow feature extractor cannot understand and absorb. In summary, the prior art has inconveniences and defects in practical use and needs improvement.
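To make the distillation families described above concrete, here are minimal PyTorch sketches of an output-distillation loss (temperature-softened KL) and a relation-distillation loss (pairwise similarity matching); these are generic textbook forms for illustration, not the patent's claimed method.

```python
import torch.nn.functional as F

def output_distillation_loss(student_logits, teacher_logits, tau=4.0):
    # Align temperature-softened output distributions with KL divergence;
    # the tau**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                    F.softmax(teacher_logits / tau, dim=-1),
                    reduction="batchmean") * tau ** 2

def relation_distillation_loss(student_feat, teacher_feat):
    # Match the pairwise similarity structure among samples in a batch
    # instead of aligning raw feature values.
    s = F.normalize(student_feat.flatten(1), dim=1)  # (B, D_s)
    t = F.normalize(teacher_feat.flatten(1), dim=1)  # (B, D_t)
    return F.mse_loss(s @ s.T, t @ t.T)              # B x B similarity matrices
```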
Disclosure of Invention

In view of the above drawbacks, an object of the present invention is to provide a teacher-model-guided student model diffusion self-distillation method, device, storage medium, and electronic device that can train the student model better and thereby improve its performance.

In order to solve the above technical problems, the invention is realized as follows. In a first aspect, an embodiment of the present invention provides a teacher-model-guided student model diffusion self-distillation method, including: an initialization step of loading a pre-trained teacher model and a pre-trained student model and starting to train the student model; a feature extraction step of respectively extracting teacher features and original student features from the training data; a denoising sampling step of guiding, through the teacher model, the diffusion model to perform denoising sampling on the original student features and generate corresponding denoised student features; a self-distillation step of performing self-distillation training based on the denoised student features and the original student features; a calculation-and-update step of calculating the self-distillation training loss of the student model and updating the parameters of the student model; and cyclically executing the feature extraction step, the denoising sampling step, the self-distillation step, and the calculation-and-update step until training of the student model is finished. According to the teacher-model-guided student model diffusion self-distillation method, the initialization step further comprises: reading a data set; loading the pre-trained teacher model and freezing its parameters; initializing the student model; and starting to train the student model. According to the teacher model-guided student model diffusio