CN-116682161-B - Facial expression recognition method based on attention mechanism and self-distillation
Abstract
The invention relates to a facial expression recognition method based on an attention mechanism and self-distillation, which comprises the following steps: 1, constructing a facial expression recognition data set; 2, constructing a facial expression recognition model composed of a feature extraction module, an adaptive channel attention module, a remodelling module and a self-distillation network; 3, preprocessing the input facial expression images; 4, training the facial expression recognition model using the images in the training set of the data set; and 5, performing inference and classification on the test set and verification set of the data set using the trained model. The method provides a reasonable and efficient facial expression recognition scheme: the attention mechanism guides the model to focus on important feature information, while a new form of self-distillation refines and compresses network knowledge, improving the robustness of the features output by the shallow layers and reducing the complexity of the model at the inference stage.
Inventors
- ZHANG XIN
- ZHU JINLIN
- YIN YUYU
- ZHOU LI
- SUN QIANQIAN
Assignees
- 杭州电子科技大学 (Hangzhou Dianzi University)
Dates
- Publication Date
- 20260512
- Application Date
- 20230613
Claims (3)
- 1. A facial expression recognition method based on an attention mechanism and self-distillation, characterized by comprising the following steps:
S1, splitting a public data set into a training set, a test set and a verification set, each of which comprises a plurality of images of the 7 basic expressions;
S2, constructing a facial expression recognition model comprising a feature extraction module, an adaptive channel attention module, a remodelling module and a self-distillation network; self-distillation is performed in such a way that every basic block except the last extracts knowledge from the next basic block, and distillation proceeds gradually from deep to shallow;
S3, aligning the faces in the images of the training set, test set and verification set input to the model, and cropping the images to a fixed size;
S4, inputting the split training set images into the constructed facial expression recognition model, and training and optimizing the learnable parameters in the model until the accuracy of the model no longer improves, wherein the training process comprises the following substeps:
S41, inputting the processed image into the feature extraction module, and taking the feature map z_j output by each basic block in the feature extraction module as the feature output by that stage of the feature extraction module;
S42, adding an adaptive channel attention module after the feature map output by each basic block; the module first aggregates the channel information of the feature map z_j output by each basic block using maximum pooling and average pooling to generate 2 different channel vectors q_max and q_avg; secondly, the adaptive weight module generates the corresponding weights w_max and w_avg for q_max and q_avg; then the features obtained by multiplying q_max by w_max and the features obtained by multiplying q_avg by w_avg are added and fused to obtain the final channel weight feature f_cw; finally, f_cw is multiplied by the input feature map z_j to obtain the final output feature map z'_j; the calculation formulas are as follows:
q_max = MaxPool(z_j);
q_avg = AvgPool(z_j);
w_max = W_m1(σ(W_m0(q_max)));
w_avg = W_a1(σ(W_a0(q_avg)));
f_cw = σ(MLP(q_max)·w_max + MLP(q_avg)·w_avg);
z'_j = f_cw · z_j;
wherein σ represents the ReLU activation function, MaxPool and AvgPool represent the maximum pooling and average pooling operations respectively, MLP(x) represents the output of the given vector after passing through the MLP, and W(x) represents the parameters of a fully connected layer;
S43, further refining the features output by the shallow basic blocks using the remodelling module and finally mapping the features output by each basic block to the same dimension to obtain F_j; the calculation formula is as follows:
F_j = δ(BN(W_c * z'_j));
wherein * represents a convolution operation, W_c represents the convolution kernel parameters, BN and δ represent the normalization layer and the activation layer respectively, and c represents the number of basic blocks;
S44, calculating the prediction score p_j corresponding to each F_j using the fully connected layer and the softmax function; the calculation formula is as follows:
p_j = softmax(W·F_j);
wherein W represents the parameters of the fully connected layer;
S45, performing knowledge distillation on the model gradually from deep to shallow through the self-distillation network and the corresponding distillation losses;
S5, performing inference on the images in the test set and the verification set: the self-distillation and attention modules are removed at the inference stage, the feature extraction module trained at the training stage makes the final prediction on the input, and the processed images of the test set and verification set are input into the model to obtain the corresponding classification results, each result being one of the 7 basic expressions: disgust, happiness, anger, fear, surprise, sadness and neutrality.
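The attention computation in steps S41–S42 can be sketched as a plain NumPy forward pass. This is a minimal illustrative sketch, not the patent's implementation: the weight-matrix names (W_mlp0, W_m0, etc.) are stand-ins, pooling is done globally over the spatial dimensions, and σ is applied as ReLU, as the claim reads it.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def adaptive_channel_attention(z, W_mlp0, W_mlp1, W_m0, W_m1, W_a0, W_a1):
    """Sketch of the adaptive channel attention module of claim 1.

    z: feature map of one basic block, shape (C, H, W).
    W_*: illustrative weight matrices of the shared MLP and the two
    adaptive-weight branches (names are not from the patent).
    """
    # q_max = MaxPool(z_j), q_avg = AvgPool(z_j): aggregate channel
    # information into two different channel vectors.
    q_max = z.max(axis=(1, 2))
    q_avg = z.mean(axis=(1, 2))
    # Adaptive weights: w_max = W_m1(sigma(W_m0(q_max))), likewise for avg.
    w_max = W_m1 @ relu(W_m0 @ q_max)
    w_avg = W_a1 @ relu(W_a0 @ q_avg)
    # f_cw = sigma(MLP(q_max) * w_max + MLP(q_avg) * w_avg): fuse the two
    # weighted branches into the final channel weight feature.
    f_cw = relu((W_mlp1 @ relu(W_mlp0 @ q_max)) * w_max
                + (W_mlp1 @ relu(W_mlp0 @ q_avg)) * w_avg)
    # z'_j = f_cw * z_j: reweight the input feature map channel-wise.
    return z * f_cw[:, None, None]
```

The shared MLP plays the role of claim 3's channel attention module (scoring each channel), while the two extra branches play the role of the adaptive weight module (scoring the two pooled vectors against each other).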
- 2. The facial expression recognition method based on an attention mechanism and self-distillation as claimed in claim 1, characterized in that step S45 comprises the following substeps:
S451, measuring the difference between the prediction score output by the last basic block and the true label of the training set using the cross-entropy loss; the calculation formula is as follows:
L_ce = CE(p_C, y);
wherein CE represents the cross-entropy loss, p_C represents the prediction score obtained by the C-th (last) basic block, and y represents the corresponding label;
S452, measuring the difference between the prediction score obtained by the current basic block and the prediction score obtained by the next basic block using the KL divergence loss; the calculation formula is as follows:
L_kl = Σ_{j=1}^{C-1} KL(p_j, sg(p_{j+1}));
wherein KL represents the KL divergence loss and sg(·) indicates that the gradient update is cancelled when the gradient is propagated back;
S453, measuring the difference between the output features of the current basic block and the output features of the next basic block using the L2 loss; the calculation formula is as follows:
L_2 = Σ_{j=1}^{C-1} ||F_j − sg(F_{j+1})||²;
wherein ||·||² represents the L2 loss;
S454, the total self-distillation loss L_total combines the three losses; the calculation formula is as follows:
L_total = L_ce + λ(L_kl + L_2);
wherein λ is a hyperparameter used to adjust the contribution of the KL loss and the L2 loss to the total loss.
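The three losses of claim 2 can be sketched numerically as follows. This is an illustrative NumPy sketch under assumptions: the blocks' logits and reshaped features are given as plain vectors, the deeper block acts as the teacher, and the hyperparameter `lam` (the claim's λ) weights the KL and L2 terms.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(p, label):
    # L_ce: difference between the last block's prediction and the true label.
    return float(-np.log(p[label]))

def kl_divergence(p_student, p_teacher):
    # KL(teacher || student); the teacher (the deeper block) is treated as a
    # constant, i.e. no gradient would flow through it (the claim's sg(.)).
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

def l2_feature_loss(f_student, f_teacher):
    # L2 distance between the current block's features and the next block's.
    return float(np.sum((f_student - f_teacher) ** 2))

def self_distillation_loss(scores, feats, label, lam=0.5):
    """Total loss of claim 2, S454 (lam is an illustrative value of lambda).

    scores: list of per-block logit vectors, ordered shallow -> deep.
    feats: list of per-block reshaped feature vectors F_j, same order.
    """
    probs = [softmax(s) for s in scores]
    total = cross_entropy(probs[-1], label)      # supervise the deepest block
    for j in range(len(scores) - 1):             # each block learns from the next
        total += lam * kl_divergence(probs[j], probs[j + 1])
        total += lam * l2_feature_loss(feats[j], feats[j + 1])
    return total
```

With identical scores and features in every block, the KL and L2 terms vanish and the total reduces to the cross-entropy of the deepest block, which matches the intent of distilling gradually from deep to shallow.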
- 3. The facial expression recognition method based on an attention mechanism and self-distillation according to claim 1, wherein said adaptive channel attention module is composed of a channel attention module and an adaptive weight module; the channel attention module acts on the average-pooled and maximum-pooled features, judges the importance of each channel using an attention mechanism, and weights each channel according to its importance; the adaptive weight module acts on the average-pooled and maximum-pooled features, judges the relative importance of the two pooled features using an attention mechanism, and generates the corresponding weights.
Description
Facial expression recognition method based on attention mechanism and self-distillation
Technical Field
The invention relates to the field of facial expression recognition, in particular to a facial expression recognition method based on an attention mechanism and self-distillation.
Background
Facial expressions play a key role in interpersonal interaction: people clearly perceive the emotion the other party wants to express through changes in facial expression. Over the past decades, as facial expression recognition has been widely applied in digital entertainment, human-computer interaction, driver monitoring, healthcare, and other industries, more and more researchers have engaged in research in this area. Facial expression recognition in the wild remains challenging due to poor lighting conditions, occlusion of facial expressions, pose changes, and the like. Conventional methods typically use handcrafted or shallow features to recognize facial expressions: they take a global facial expression image as the model input and determine the facial expression category from the model's final output. However, because of the high similarity between facial expression classes, such models have difficulty focusing on the key regions that facilitate discrimination between expressions, and these regions are essential for correct classification. Some advanced methods improve the model's grasp of important regions by increasing model complexity and adding attention mechanisms, but this significantly increases the complexity of the model and severely hampers its deployment on low-end devices.
Disclosure of Invention
Aiming at the above problems, the invention provides a facial expression recognition method based on an attention mechanism and self-distillation, which adaptively guides the model to attend to important regions of a facial expression without increasing model complexity, so that the model correctly classifies the 7 basic expressions of disgust, happiness, anger, fear, surprise, sadness and neutrality, effectively improving the efficiency and accuracy of facial expression recognition. The technical scheme adopted to solve the technical problem is as follows:
Step 1, splitting the three public data sets AffectNet, FERPlus and RAF-DB into a training set, a test set and a verification set, each of which comprises a plurality of images of the 7 basic expressions.
Step 2, constructing a facial expression recognition model comprising a feature extraction module, an adaptive channel attention module, a remodelling module and a self-distillation network.
Step 3, aligning the faces in the images of the training set, test set and verification set input to the model, and cropping the images to a fixed size. Image augmentation such as random cropping, horizontal flipping and random erasing is applied to the training set input to the model to prevent the model from overfitting.
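The augmentations named in step 3 can be sketched as follows. This is an illustrative NumPy sketch under assumptions: the crop size, flip probability and erasing-rectangle sizes are arbitrary placeholders, not values from the patent.

```python
import numpy as np

def augment(img, crop_size=112, rng=None):
    """Training-time augmentation sketch: random crop, horizontal flip,
    and random erasing (all parameters are illustrative assumptions).

    img: H x W x 3 array with H, W >= crop_size; the input is not modified.
    """
    rng = rng if rng is not None else np.random.default_rng()
    h, w, _ = img.shape
    # Random crop to a fixed size.
    top = int(rng.integers(0, h - crop_size + 1))
    left = int(rng.integers(0, w - crop_size + 1))
    out = img[top:top + crop_size, left:left + crop_size].copy()
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1].copy()
    # Random erasing: blank out a random rectangle to combat overfitting.
    if rng.random() < 0.5:
        eh, ew = (int(v) for v in rng.integers(8, 32, size=2))
        et = int(rng.integers(0, crop_size - eh))
        el = int(rng.integers(0, crop_size - ew))
        out[et:et + eh, el:el + ew] = 0
    return out
```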
Step 4, inputting the split training set images into the constructed facial expression recognition model, and training and optimizing the learnable parameters in the model until the accuracy of the model no longer improves, wherein the training process comprises the following substeps:
S41, inputting the processed image into the feature extraction module and taking the output feature map z_j of each basic block in the feature extraction module as the feature output by that stage of the feature extraction module.
S42, adding an adaptive channel attention module after the feature map output by each basic block. The module first aggregates the channel information of the feature map z_j output by each basic block using maximum pooling and average pooling to generate 2 different channel vectors q_max and q_avg; secondly, the adaptive weight module generates the corresponding weights w_max and w_avg for q_max and q_avg; then the features obtained by multiplying q_max by w_max and the features obtained by multiplying q_avg by w_avg are added and fused to obtain the final channel weight feature f_cw; finally, f_cw is multiplied by the input feature map z_j to obtain the final output feature map z'_j. The calculation formulas are as follows:
q_max = MaxPool(z_j);
q_avg = AvgPool(z_j);
w_max = W_m1(σ(W_m0(q_max)));
w_avg = W_a1(σ(W_a0(q_avg)));
f_cw = σ(MLP(q_max)·w_max + MLP(q_avg)·w_avg);
z'_j = f_cw · z_j.
S43, further refining the features output by the shallow basic blocks using the remodelling module and finally mapping the features output by each basic block to the same dimension to obtain F_j, calculated as:
F_j = δ(BN(W_c * z'_j)).
S44, calculating the prediction score p_j corresponding to each F_j using the fully connected layer and the softmax function, calculated as:
p_j = softmax(W·F_j).
S45, carrying out knowledge distillation on the model gradually from deep to shallow through the self-distillation network and the corresponding distillation losses.
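The inference stage of step 5 (claim 1, S5) can be sketched as follows: the attention and self-distillation branches are discarded, and only the trained feature extraction module plus the final classifier is used. The feature vector and weight matrix `W_fc` here are illustrative stand-ins for the trained backbone output and the fully connected layer.

```python
import numpy as np

# The 7 basic expressions recognized by the model.
EXPRESSIONS = ["disgust", "happiness", "anger", "fear",
               "surprise", "sadness", "neutrality"]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def predict(backbone_features, W_fc):
    """Inference sketch: attention and distillation modules removed, so the
    cost is just the backbone plus one fully connected layer.

    backbone_features: feature vector from the last basic block (illustrative).
    W_fc: trained fully connected layer, shape (7, feature_dim) (illustrative).
    """
    p = softmax(W_fc @ backbone_features)
    return EXPRESSIONS[int(np.argmax(p))], p
```

Dropping these branches at inference is what keeps the deployed model no more complex than the plain backbone, which is the stated advantage over attention-based methods that pay for their accuracy with extra inference cost.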