CN-116740793-B - Facial expression recognition method based on cross-layer multi-scale channel mutual attention learning
Abstract
The invention discloses a facial expression recognition method based on cross-layer multi-scale channel mutual attention learning, belonging to the technical field of image processing and pattern recognition. The method uses the CMSCMAL-Net model to recognize facial expressions in natural scenes and remains robust when expressions are blurred or occluded. A multi-scale channel attention mechanism is introduced to improve the backbone network: fine details in low-level features are preserved and highlighted while local and global features are combined. The backbone network is divided into stages from shallow to deep, with the shallow stages learning low-level detail information and the deep stages learning high-level abstract semantic information; a progressive multi-step strategy is adopted for training, so that complementary information is obtained across the stages and the recognition performance of the model is improved. The invention provides a fine-grained model for facial expression recognition that better captures subtle changes of facial expressions and improves the recognition rate of facial expressions in natural scenes.
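The progressive multi-step training summarized above ultimately reduces, at each step, to a cross-entropy loss on the prediction of each stage plus one on the concatenated features. The following is a minimal NumPy sketch of that loss structure; the number of stages, the class count, and the random logits are illustrative assumptions, not values from the patent.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y_true, logits):
    """Cross entropy between a one-hot label and the predicted distribution."""
    p = softmax(logits)
    return -np.sum(y_true * np.log(p + 1e-12))

# Toy progressive setup: one logit vector per stage plus one for the
# concatenated features; the total loss sums the per-output cross entropies.
rng = np.random.default_rng(1)
num_classes = 7                          # e.g. 7 basic expression classes
y = np.zeros(num_classes); y[3] = 1.0    # one-hot ground-truth label
stage_logits = [rng.standard_normal(num_classes) for _ in range(4)]
concat_logits = rng.standard_normal(num_classes)
total_loss = sum(cross_entropy(y, z) for z in stage_logits) \
             + cross_entropy(y, concat_logits)
print(total_loss > 0)  # True
```

Each stage's classifier is supervised independently, which is what lets the early training steps focus on one stage at a time.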
Inventors
- WANG SHIGANG
- CHEN SIYU
- WEI JIAN
- ZHAO YAN
Assignees
- Jilin University (吉林大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-07-31
Claims (1)
- 1. A facial expression recognition method based on cross-layer multi-scale channel mutual attention learning, characterized by comprising the following steps:
1) Acquiring a facial expression image database and dividing it into a training set and a testing set;
2) Preprocessing: detecting, aligning and cropping the facial expression images from step 1) with the MTCNN method, and removing redundant background information to obtain cropped facial expression images;
3) Data enhancement: augmenting the preprocessed facial expression images from step 2), the augmentation comprising random flipping and random cropping;
4) Constructing the CMSCMAL-Net model and each of its settings, comprising the following steps:
4.1 Constructing the Backbone feature extraction network for extracting features of the input image; the overall structure of the Backbone is ResNet, and the construction of ResNet comprises:
4.1.1 ResNet50 is a deep convolutional neural network characterized by residual structures that form a residual network; during forward and backward propagation, the skip connections provide an information propagation path, which effectively avoids the gradient vanishing and gradient explosion common in deep convolutional neural networks and allows the model to be deepened for efficient feature extraction;
4.1.2 The most basic structure of the residual network is the residual block, which is divided into a main path and a residual path; the sequential structure of the main path consists, from top to bottom, of a convolution layer with kernel size 1×1, a batch normalization layer, a ReLU activation layer, a convolution layer with kernel size 3×3, a batch normalization layer, a ReLU activation layer, a convolution layer with kernel size 1×1, a batch normalization layer and a ReLU activation layer;
4.1.3 The input RGB image is processed by the stacked residual blocks to obtain deep features with 2048 channels and a downsampling rate of 32; the layers of ResNet excluding the fully connected classifier are divided into 5 stages, denoted stage1, stage2, stage3, stage4 and stage5, with the spatial size of the feature map decreasing from the shallow stages to the deep stages;
4.2 Constructing the cross-layer multi-scale channel attention structure, i.e. designing a cross-layer multi-scale channel mutual attention structure using the features of the different ResNet stages together with multi-scale channel information, comprising:
4.2.1 Design of the multi-scale channel attention module: channel attention is realized on multiple scales by varying the spatial pooling size; point-wise convolution (PWConv), which exploits the point-wise channel interactions at each spatial position, serves as the local channel information aggregator and is added to the global information in the attention module. To save parameters, the channel attention of the local features is computed through a bottleneck structure L(X), where X ∈ {x_1, x_2, …, x_n, …, x_N} is an intermediate feature map:
L(X) = B(PWConv_2(δ(B(PWConv_1(X)))))
where B denotes batch normalization, δ denotes the ReLU activation, and the kernel sizes of PWConv_1 and PWConv_2 are (C/r)×C×1×1 and C×(C/r)×1×1 respectively. The input X first undergoes a global average pooling operation g(X) and then the same point-wise convolution operations as the local channel attention, yielding the channel attention of the global features L(g(X)). Combining the local channel context and normalizing with a sigmoid function gives the refined feature X' of the multi-scale channel attention module M:
X' = M(X) ⊗ X = σ(L(X) ⊕ L(g(X))) ⊗ X
where M(X) represents the attention weights generated by M, ⊕ denotes broadcast addition, and ⊗ denotes element-wise multiplication;
4.2.2 Multi-granularity feature region generation: the refined feature x_n is input into a convolution layer with kernel size C'_n×C_n×1×1, where 1×1 is the spatial size, C_n is the number of input channels and C'_n is the number of output channels; after a batch normalization operation and an ELU operation, an intermediate feature map x'_n is obtained. The feature x'_n is then input into a convolution layer with kernel size C_n×C'_n×3×3, where 3×3 is the spatial size and C_n is the number of output channels; after a batch normalization operation and an ELU operation, an intermediate feature map x''_n is obtained, and finally a max pooling operation is performed;
4.2.3 Attention region generation: first, class activation maps (CAMs) for class k_n are generated based on x''_n; the class-specific CAM identifies the discriminative image region of that class and is defined as:
Φ_n(α, β) = Σ_c w_c^{k_n} · x''_n(c, α, β)
where the coordinates (α, β) denote the spatial location in x''_n and Φ_n, and w_c^{k_n} are the weights of the fully-connected-layer-based classifier whose classification prediction for class k_n is p_n; Φ_n(α, β) represents the importance of the activation at spatial location (α, β) for classifying the image into class k_n. The region of the image at a certain stage most relevant to class k_n can be found by upsampling the CAM to the size of the input image; therefore, after Φ_n is obtained, the attention map Φ̄_n ∈ R^{H_in×W_in} is generated by bilinear upsampling of Φ_n, where H_in and W_in are the height and width of the input image. Each spatial element of the attention map is then normalized with min–max normalization:
Ā_n(α, β) = (Φ̄_n(α, β) − min(Φ̄_n)) / (max(Φ̄_n) − min(Φ̄_n))
Using the normalized attention map Ā_n as a guide, the discriminative regions are located and cropped. Specifically, a mask M_n is first generated by setting the elements of Ā_n greater than a threshold t (t ∈ (0, 1)) to 1 and the other elements to 0, i.e. each spatial element of M_n is computed as:
M_n(α, β) = 1 if Ā_n(α, β) > t, and 0 otherwise
According to M_n, a region covering all positive-valued elements can be located; this region is cropped from the input image and upsampled to the size of the input image, and the upsampled attention region A_n is regarded as the predicted attention region of that stage. An overall attention region is then generated by aggregating the attention information learned at the different stages, with the calculation formula:
Φ_global = Σ_n Φ̄_n
Thereafter, similarly to the generation of A_n, min–max normalization is applied to Φ_global and the result is denoted Ā_global; then, based on the same threshold t, the mask M_global is obtained. Finally, the region covering all positive-valued elements is located, and the same region of the input image is cropped and upsampled to the size of the input image, yielding the overall attention region A_global;
4.3 Multi-step mutual learning: each stage is trained with a progressive multi-step strategy; in the early steps the stages are trained one by one, so that the attention information of the corresponding stage can be learned intensively, and in the final two steps all stages work together to learn effective information from the attention regions and the original image respectively, comprising:
4.3.1 Training the deepest feature stage: training the deepest feature stage involves the stages shallower than it, so the attention regions proposed by all stages and the overall attention region {A_1, A_2, …, A_n, …, A_global} can be generated in this step;
4.3.2 Mutual data enhancement: training gradually turns to the shallower feature stages; when training such a stage, one input is randomly selected, following the principle of mutual data enhancement, from an image library consisting of the original input and the attention regions of all stages except the shallow stage being trained;
4.3.3 Training all hierarchy stages and their connection with the overall attention region, which is composed jointly by all hierarchy stages and contains the important attention information of each stage; the commonly obtained attention information is amplified and studied to extract finer-grained features;
4.4 Model training: the model is trained with stochastic gradient descent (SGD), with 200 epochs, momentum 0.9, weight decay 0.0005, batch size 64, a learning rate of 0.002 scheduled by cosine annealing, an input image size of 224×224, and threshold t = 0.5;
5) Training with a progressive multi-step strategy using cross-entropy loss, the strategy comprising the output prediction of each stage and the output prediction of the concatenated features;
5.1 For the output of each stage, the loss is computed as the cross entropy between the true label y and the predicted probability distribution p_n, with the calculation formula:
L_n = − Σ_k y_k log p_{n,k}
5.2 For the output of the concatenated features, the loss is computed as the cross entropy between the true label y and the predicted probability distribution p_concat, with the calculation formula:
L_concat = − Σ_k y_k log p_{concat,k}
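The multi-scale channel attention of claim step 4.2.1 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions: the point-wise (1×1) convolutions are represented as per-pixel channel matrix multiplies with random weights, and a simple standardization stands in for batch normalization; the actual network learns these parameters.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pwconv(x, w):
    """Point-wise (1x1) convolution: a per-pixel linear map over channels.
    x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def ms_cam(x, w1, w2, eps=1e-5):
    """Sketch of the multi-scale channel attention module (claim 4.2.1).

    Local branch L(X): bottleneck of two point-wise convolutions over X.
    Global branch: the same bottleneck applied after global average pooling.
    Output: X' = sigmoid(L(X) (+) L(g(X))) (*) X, with broadcast addition
    and element-wise multiplication.
    """
    def bottleneck(t):
        h = pwconv(t, w1)                     # C -> C/r
        h = (h - h.mean()) / (h.std() + eps)  # stand-in for batch norm
        h = np.maximum(h, 0.0)                # ReLU
        return pwconv(h, w2)                  # C/r -> C

    local = bottleneck(x)                     # (C, H, W) local context
    gap = x.mean(axis=(1, 2), keepdims=True)  # (C, 1, 1) global pooling
    global_ = bottleneck(gap)                 # (C, 1, 1) global context
    attn = sigmoid(local + global_)           # broadcast addition, then sigmoid
    return attn * x                           # element-wise multiplication

rng = np.random.default_rng(0)
C, r, H, W = 16, 4, 8, 8
x = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.1   # PWConv_1: (C/r) x C x 1 x 1
w2 = rng.standard_normal((C, C // r)) * 0.1   # PWConv_2: C x (C/r) x 1 x 1
out = ms_cam(x, w1, w2)
print(out.shape)  # (16, 8, 8)
```

Because the attention weights lie in (0, 1), the module rescales each feature element without changing the tensor shape, which is what lets it be dropped into the backbone between stages.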
Description
Facial expression recognition method based on cross-layer multi-scale channel mutual attention learning
Technical Field
The invention belongs to the technical field of image processing and pattern recognition, and in particular relates to a facial expression recognition method based on cross-layer multi-scale channel mutual attention learning.
Background
Artificial intelligence is the science of simulating, realizing and extending human thinking with machines, which makes the task of understanding and simulating human emotion a research hotspot. In human-machine interaction, the most important link, a machine that is to understand and analyze human emotions and psychological activities must first recognize those emotions accurately. Facial expression is one of the most natural, powerful and common signals by which humans convey their emotional states and intentions, and it plays an important role in daily human communication; facial expression recognition is therefore a requisite step toward true artificial intelligence. Meanwhile, with the rapid improvement of hardware computing power and the emergence of excellent deep learning models, convolutional neural networks have achieved very good results on image classification tasks, and deep learning has become the main choice for facial expression recognition. Facial expression recognition is generally regarded as an image classification problem and mainly comprises three steps: data preprocessing, feature extraction and prediction classification. The main task of data preprocessing is to screen, augment and normalize the data in the dataset according to the requirements of the model, to crop the face region in each image through face detection, and to remove the background regions of the original image that are irrelevant to expression recognition.
Feature extraction generally uses a convolutional neural network to extract the spatial features of the image, and prediction classification takes the extracted expression features as the input of a classifier. The images in facial expression recognition datasets for natural scenes are mostly collected from the web, for example from social networking sites, so these datasets suffer from unbalanced sample distributions: collecting the same amount of data for every category is very difficult or even impossible. In addition, because of factors such as varied poses, illumination, gender and identity, large intra-class differences exist in the datasets; facial expression recognition in natural scenes therefore still has large room for development and faces serious challenges. Existing effective schemes mainly fall into three categories: first, transfer learning from face recognition models and semi-supervised learning with unlabeled data; second, suppressing factors such as ambiguous expressions, low-quality images and annotator subjectivity so that the deep network can learn genuine expression features; third, combining specially designed loss functions with cross-entropy loss to achieve intra-class compactness and inter-class separation.
Disclosure of Invention
Aiming at the requirements of facial expression recognition under natural conditions, the invention provides a facial expression recognition method based on cross-layer multi-scale channel mutual attention learning. The main structure of the method is based on cross-layer mutual learning and mainly comprises a Backbone network part, a multi-scale channel attention part, a multi-granularity feature region generation part, a CAM attention region generation part and a multi-step mutual learning part.
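The CAM attention region generation part mentioned above amounts to min–max normalizing a class activation map, thresholding it, and cropping the box covering all positive elements from the input image. The following NumPy routine is an illustrative sketch, not the patent's implementation: the function name, the toy CAM and the threshold value are assumptions, and the final upsampling of the crop back to the input size is omitted.

```python
import numpy as np

def attention_region(cam, image, t=0.5):
    """Sketch of CAM attention-region generation: normalize a class
    activation map to [0, 1], threshold at t, and crop the bounding box
    covering all positive elements from the input image. Assumes cam is
    already upsampled to the image's spatial size (H, W); image is (H, W, 3)."""
    norm = (cam - cam.min()) / (cam.max() - cam.min() + 1e-12)  # min-max norm
    mask = (norm > t).astype(np.uint8)                          # binary mask
    ys, xs = np.nonzero(mask)                                   # positive area
    if ys.size == 0:                                            # nothing above t
        return image
    y0, y1 = ys.min(), ys.max() + 1
    x0, x1 = xs.min(), xs.max() + 1
    return image[y0:y1, x0:x1]                                  # cropped region

# Toy example: a CAM peaking in a 3x3 patch of an 8x8 map.
cam = np.zeros((8, 8))
cam[4:7, 5:8] = 1.0
image = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
crop = attention_region(cam, image, t=0.5)
print(crop.shape)  # (3, 3, 3)
```

In the full method this crop is then resized to the input resolution and fed back to the network as an additional training input.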
Each stage is trained using a progressive multi-step strategy, and the corresponding network hyper-parameter configuration is set. During training, the corresponding loss function is calculated for the output of each stage, and the gradients obtained by back propagation are used to update the parameters by gradient descent, so that the model learns the corresponding function. The facial expression recognition method of the invention, based on cross-layer multi-scale channel mutual attention learning, comprises the following steps: 1) Acquiring a facial expression image database and dividing it into a training set and a testing set; 2) Preprocessing: detecting, aligning and cropping the facial expression images from step 1) with the MTCNN method, and removing redundant background information to obtain cropped facial expression images; 3) Data enhancement: augmenting the preprocessed facial expression images from step 2) by random flipping and random cropping; 4) Constructing the CMSCMAL-Net model and each of its settings; the CMSCMAL-Net model is