CN-121413707-B - General performance enhanced distillation method for convolutional neural network
Abstract
The invention discloses a general performance-enhancing distillation method for convolutional neural networks, in the technical field of artificial intelligence. The method first obtains the feature maps output by a teacher model and a student model and generates importance maps through channel projection and activation; it then computes two complementary spatial gains, consistency and coverage; a learnable pixel-level gating consensus module models the teacher-student feature association and adaptively outputs a mixing coefficient; a dynamic spatial consensus weight mask is generated by convex combination; a feature distillation loss term is constructed with this mask; a gating regularization term is introduced to build the total loss function; and the student model is iteratively optimized. The method addresses the rigid weighting rules, poor adaptation to training stages, and inaccurate gradient allocation of existing methods, achieves adaptive matching between the distillation strategy and the training stage, and markedly improves the accuracy and generalization of a lightweight student model on dense object detection tasks without adding inference overhead, making it suitable for resource-constrained platforms such as edge devices.
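The "channel projection and activation" step that produces the importance maps can be sketched in plain Python. This is only an illustrative assumption: the abstract does not fix the exact projection or activation, so a channel-mean projection followed by a sigmoid is used here as a stand-in.

```python
import math

def importance_map(feat):
    """Collapse a C x H x W feature map (nested lists) into an H x W
    importance map.

    The abstract only states "channel projection and activation"; the
    channel-mean projection and sigmoid activation used here are
    illustrative assumptions, not the patented form.
    """
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            mean = sum(feat[ch][i][j] for ch in range(c)) / c  # channel projection
            out[i][j] = 1.0 / (1.0 + math.exp(-mean))          # sigmoid activation
    return out
```

Whatever the concrete projection, the result is one scalar in (0, 1) per spatial position, so teacher and student maps can be compared position-by-position in the later steps.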
Inventors
- ZHOU HAOJIE
- HONG WANTING
- WANG NING
- ZHANG CHAO
- FAN LINGHONG
- PEI FENG
- GUO WEI
Assignees
- Jiangnan University (江南大学)
- Jiangsu Panzhi Shuyun Technology Co., Ltd. (江苏磐智数云科技有限公司)
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-29
Claims (9)
- 1. A general performance-enhancing distillation method for a convolutional neural network, characterized by comprising the following steps: S1, acquiring image data to be processed, inputting the image data into a pre-trained teacher model and a pre-trained student model respectively, and outputting a teacher feature map and a student feature map from the teacher model and the student model respectively; S2, performing projection and activation along the channel dimension on the teacher feature map and the student feature map respectively to generate a teacher importance map M_T and a student importance map M_S; S3, calculating, from the teacher and student importance maps, a consistency gain G_con(i,j) and a coverage gain G_cov(i,j) at each spatial position (i,j), wherein the consistency gain focuses on the consensus region in which both confidences exceed a threshold, and the coverage gain captures the teacher-dominant region in which the teacher confidence exceeds the threshold but the student model has learned insufficiently; S4, calculating, for each spatial position (i,j) of the teacher and student importance maps, a position-dependent mixing coefficient α(i,j); S5, calculating, by convex combination based on the mixing coefficient, the consistency gain, and the coverage gain, a spatial consensus weight at each position (i,j), thereby generating a spatial weight mask that balances the need to reinforce the teacher-student consensus region and the need to supplement the teacher-dominant region; S6, constructing a feature distillation loss term with the spatial weight mask, combining it with a gating regularization term to obtain the optimized total loss function, and iteratively training the student model with this total loss function to obtain a performance-enhanced student model; wherein, in step S4, the position-dependent mixing coefficient α(i,j) is calculated as follows: for each spatial position (i,j) of the teacher importance map and the student importance map, a feature descriptor in the form of a four-tuple is constructed, with the expression: d(i,j) = [M_T(i,j), M_S(i,j), G_con(i,j), G_cov(i,j)], where M_T(i,j) denotes the value of the teacher importance map at position (i,j), M_S(i,j) denotes the value of the student importance map at position (i,j), and G_con(i,j) denotes the consistency gain at position (i,j); the feature descriptor d(i,j) is input to a learnable pixel-level gating consensus module, which processes d(i,j) through a lightweight shared-parameter convolutional mapping network φ and outputs the position-dependent mixing coefficient as follows: α(i,j) = φ(d(i,j)).
- 2. The general performance-enhancing distillation method for a convolutional neural network as claimed in claim 1, wherein the lightweight convolutional mapping network φ is configured to output a constant value at the start of training, and the output value of φ, and hence the mixing coefficient α(i,j), is dynamically adjusted by back-propagated gradients during training.
- 3. The general performance-enhancing distillation method for a convolutional neural network as claimed in claim 1, wherein the consistency gain G_con(i,j) is calculated as: G_con(i,j) = M_T(i,j) ⊙ M_S(i,j), where ⊙ denotes the element-wise product and M_T(i,j), M_S(i,j) are the teacher and student importance values at spatial position (i,j).
- 4. The general performance-enhancing distillation method for a convolutional neural network as claimed in claim 1, wherein in S5 the spatial consensus weight W(i,j) at spatial position (i,j) is obtained by convex combination as: W(i,j) = α(i,j) · G_con(i,j) + (1 − α(i,j)) · G_cov(i,j), where α(i,j) denotes the mixing coefficient associated with spatial position (i,j), G_con(i,j) denotes the consistency gain at that position, and G_cov(i,j) denotes the coverage gain at that position.
- 5. The general performance-enhancing distillation method for a convolutional neural network of claim 4, wherein the coverage gain G_cov(i,j) is calculated as: G_cov(i,j) = max(0, M_T(i,j) − M_S(i,j)), where M_T(i,j) denotes the teacher importance value at spatial position (i,j) and M_S(i,j) denotes the student importance value at spatial position (i,j).
- 6. The general performance-enhancing distillation method for a convolutional neural network of claim 1, wherein the total loss function L_total is expressed as: L_total = L_det + Σ_l L_l, where L_det denotes the original detection loss term of the student model and L_l is the hierarchy loss term of the l-th feature layer.
- 7. The general performance-enhancing distillation method for a convolutional neural network as claimed in claim 6, wherein the hierarchy loss term L_l of the l-th feature layer is calculated as follows: for the l-th feature layer, using the teacher feature map F_T^l, the student feature map F_S^l, and the spatial consensus weight W_l(i,j) corresponding to each spatial position (i,j) of that layer, a feature distillation loss term L_feat^l is constructed by weighted accumulation over the set of spatial positions, with the formula: L_feat^l = (1/N_l) · Σ_(i,j) W_l(i,j) · ‖F_T^l(i,j) − F_S^l(i,j)‖², where F_T^l(i,j) and F_S^l(i,j) respectively denote the feature vectors of the l-th layer teacher and student feature maps at spatial position (i,j), and ‖·‖ denotes the norm used to quantify the degree of teacher-student feature difference at that position; a lightweight gating regularization term L_gate^l is introduced, with the formula: L_gate^l = (1/N_l) · Σ_(i,j) (α_l(i,j) − ᾱ)²; based on the feature distillation loss term L_feat^l and the gating regularization term L_gate^l, the hierarchy loss term of the l-th feature layer is obtained, with the formula: L_l = λ1 · L_feat^l + λ2 · L_gate^l; where N_l denotes the total number of spatial positions of feature layer l, α_l(i,j) denotes the mixing coefficient associated with spatial position (i,j), ᾱ denotes the target average gate value, and λ1 and λ2 are balance coefficients adjusting the contribution weights of the feature distillation loss and the regularization loss.
- 8. The general performance-enhancing distillation method for a convolutional neural network of claim 1, wherein the teacher model and the student model are convolutional neural networks for dense object detection tasks, and the total number of network parameters, the number of feature-extraction channels, and the network depth of the teacher model all exceed those of the student model.
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the general performance enhancing distillation method for convolutional neural networks of any one of claims 1 to 8 when the program is executed.
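As an illustrative sketch only, and not the claimed implementation, the per-position gains, the gated convex combination (claims 1 and 3-5), and the layer loss of claim 7 can be written in plain Python. The ReLU form of the coverage gain, the sigmoid squashing of the gate output, and the scalar per-position features are assumptions for the sake of a runnable example; `gate_fn` stands in for the lightweight shared-parameter convolutional gating network.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gate_and_weight(m_t, m_s, gate_fn):
    """Consensus weight and mixing coefficient at one spatial position.

    m_t, m_s : teacher / student importance values at that position
    gate_fn  : hypothetical stand-in for the gating network of claim 1; any
               callable mapping the 4-tuple descriptor to a raw logit
    """
    g_con = m_t * m_s                          # consistency gain (claim 3)
    g_cov = max(0.0, m_t - m_s)                # coverage gain (assumed ReLU form, claim 5)
    d = (m_t, m_s, g_con, g_cov)               # four-tuple descriptor (claim 1)
    alpha = sigmoid(gate_fn(d))                # mixing coefficient in (0, 1) (assumed squashing)
    w = alpha * g_con + (1.0 - alpha) * g_cov  # convex combination (claim 4)
    return w, alpha

def layer_loss(feat_t, feat_s, imp_t, imp_s, gate_fn,
               lam_feat=1.0, lam_gate=0.1, alpha_target=0.5):
    """Hierarchy loss of one H x W feature layer (claim 7): consensus-weighted
    squared feature difference plus a gating regularizer pulling each gate
    toward the target average gate value. Scalar per-position features and
    the default coefficients are illustrative assumptions."""
    h, w_dim = len(imp_t), len(imp_t[0])
    n = h * w_dim
    l_feat = l_gate = 0.0
    for i in range(h):
        for j in range(w_dim):
            w, alpha = gate_and_weight(imp_t[i][j], imp_s[i][j], gate_fn)
            diff = feat_t[i][j] - feat_s[i][j]
            l_feat += w * diff * diff
            l_gate += (alpha - alpha_target) ** 2
    return lam_feat * l_feat / n + lam_gate * l_gate / n
```

With a zero-initialized gate (`gate_fn` returning 0), alpha is 0.5 at every position, consistent with claim 2's constant output at the start of training; gradients flowing through the gate would then adjust alpha position by position.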
Description
General performance enhanced distillation method for convolutional neural network

Technical Field

The invention relates to the technical field of artificial intelligence, and in particular to a general performance-enhancing distillation method for convolutional neural networks.

Background

In the field of deep learning, dense object detectors are widely used in visual perception tasks and fall mainly into two classes: single-stage detectors (e.g., RetinaNet, FCOS) and two-stage detectors (e.g., Faster R-CNN). Although these models achieve excellent detection performance, they generally have high computational complexity and large parameter counts, and are difficult to deploy directly on platforms with limited computational resources such as edge devices. To address this deployment problem, knowledge distillation (Knowledge Distillation, KD) has become a mainstream means of model compression and performance enhancement: its core idea is to guide a lightweight student model to learn from the rich training signals produced by a high-capacity, high-performance teacher model, preserving as much of the original detection accuracy as possible while reducing model complexity. Unlike simple visual tasks such as image classification, dense object detection is distinctive in that thousands of spatial positions must be predicted on a single image and foreground-background classes are severely imbalanced, so the responses produced by the teacher model are highly non-uniform in their spatial distribution. In this context, how to efficiently migrate the teacher model's knowledge to the student model is the core challenge of knowledge distillation for dense object detection.
In the prior art, feature-based knowledge distillation methods dominate: they distinguish the importance of different positions of the feature map by designing spatial weighting rules to optimize the distillation effect. Typical methods such as decoupled feature distillation (DeFeat), gradient-guided instance-aware distillation (GID), feature-level knowledge distillation (FKD), and foreground-focused distillation (FGD) generate a spatial mask from annotated regions (ground truth), teacher confidence, IoU scores, or gradient information, and re-weight specific positions of the feature maps to achieve targeted knowledge migration. While these knowledge distillation methods provide a viable path to model compression for dense object detection, several significant drawbacks remain in practical applications. First, a uniform distillation strategy dilutes effective information and introduces noise: traditional dense-detection distillation applies equally weighted supervision signals to all positions of the feature map, and under severe foreground-background imbalance and uneven spatial distribution of teacher responses, key target cues are diluted by background features while the teacher model's background prediction noise interferes with the student model's normal convergence, reducing feature-learning efficiency. Second, spatial weighting rules lack flexibility: existing methods rely on manually designed spatial weighting rules whose mathematical form is fixed once set. Such static weighting is hard to adapt to complex and variable detection scenes, cannot be adjusted dynamically to actual task demands, and lacks adaptability across scenes. Third, the dynamic evolution of the teacher-student relationship during training is ignored: existing methods reuse a single static weighting template over the entire training period, without accounting for how the capability gap between teacher and student models changes as training progresses. Early in training the student model has not converged, so weights generated from its high-noise responses or related indicators are unreliable; late in training a static rule cannot shift its focus in time toward the higher-order information regions mastered by the teacher, so knowledge-migration efficiency cannot be optimal over the whole training process. Fourth, it is difficult to balance the distillation demands of different regions: the weighting strategies of existing methods focus on foreground or high-confidence regions and lack a dynamic adjustment mechanism, and the distillation requirements of three key