
CN-121999521-A - Dynamic facial expression recognition method and device based on feature decoupling and multi-order trajectory modulation

CN121999521A

Abstract

The invention relates to the technical field of computer vision and affective computing, and discloses a dynamic facial expression recognition method and device based on feature decoupling and multi-order trajectory modulation. The method constructs a dual-stream feature-decoupling encoding network: a static facial structure encoder with frozen parameters extracts facial anatomical features, while a dynamic emotion encoder with embedded adapters extracts structure-independent muscle movement features. Multi-order dynamics modeling is then applied to the dynamic emotion features, jointly computing first-order velocity and second-order acceleration, and a self-attention mechanism aggregates them into a temporal emotion evolution trajectory. The trajectory generates scaling and translation parameters, and a feature linear modulation step applies an affine transformation to the static feature sequence, injecting the dynamic emotion into the static representation without loss; a temporal classifier finally outputs the recognition result. The invention effectively suppresses static identity interference, finely characterizes the velocity and intensity of expression evolution, and markedly improves cross-identity generalization and dynamic recognition accuracy.
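
To make the data flow in the abstract concrete, here is a minimal PyTorch sketch of the pipeline. The two encoders are stubbed out as linear layers, and all dimensions, module names, and the simple additive fusion before the attention step are illustrative choices only; the patent's actual encoders are Vision Transformer backbones (see claim 5).

```python
# Schematic wiring of the pipeline in the abstract. Encoder internals are
# stubbed out as linear layers; dimensions and names are illustrative.
import torch
import torch.nn as nn

class DFERPipeline(nn.Module):
    def __init__(self, dim=256, heads=8, num_classes=7):
        super().__init__()
        self.static_encoder = nn.Linear(3 * 64 * 64, dim)   # stand-in for the frozen ViT
        self.dynamic_encoder = nn.Linear(3 * 64 * 64, dim)  # stand-in for ViT + adapters
        for p in self.static_encoder.parameters():
            p.requires_grad = False                         # static branch stays frozen
        self.traj_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_film = nn.Linear(dim, 2 * dim)              # scaling + translation params
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, frames):                          # frames: (B, T, 3, 64, 64)
        x = frames.flatten(2)                           # (B, T, 3*64*64)
        s = self.static_encoder(x)                      # static structure features
        d = self.dynamic_encoder(x)                     # dynamic emotion features
        v = torch.diff(d, dim=1, prepend=d[:, :1])      # first-order velocity
        a = torch.diff(v, dim=1, prepend=v[:, :1])      # second-order acceleration
        z = d + v + a                                   # crude stand-in for attention fusion
        traj, _ = self.traj_attn(z, z, z)               # emotion evolution trajectory
        gamma, beta = self.to_film(traj).chunk(2, dim=-1)
        fused = s * gamma + beta                        # FiLM-style affine modulation
        return self.classifier(fused.mean(dim=1))       # expression logits

logits = DFERPipeline()(torch.randn(2, 16, 3, 64, 64))  # -> shape (2, 7)
```

Detailed sketches of the individual modules (adapter, multi-order dynamics, modulation, classifier, training loop) follow the claims below.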

Inventors

  • MAO KEJI
  • CHEN PINYI
  • LING YAN
  • XU RUIJI
  • ZHOU ZHIHU

Assignees

  • 浙江工业大学 (Zhejiang University of Technology)

Dates

Publication Date
2026-05-08
Application Date
2026-04-10

Claims (11)

  1. A dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation, characterized by comprising the following steps: S1, acquiring dynamic face video sequences as sample data, preprocessing them, and dividing them into a training set, a validation set and a test set; S2, constructing a dynamic facial expression recognition network formed by connecting in series a dual-stream feature-decoupling encoding network, a multi-order temporal dynamics modeling module, a feature linear modulation module and a temporal classifier, wherein the dynamic face video sequence is input into the dual-stream feature-decoupling encoding network, a static facial structure encoder with frozen parameters extracts a static structure feature sequence, and a dynamic emotion encoder embedded with trainable adapters extracts a dynamic emotion feature sequence; S3, iteratively training the dynamic facial expression recognition network with a two-stage strategy, wherein in the first stage the backbone network parameters are kept frozen and the adapter modules in the dynamic emotion encoder are fine-tuned on a static expression data set; S4, inputting the test set into the trained dynamic facial expression recognition network to complete dynamic facial expression recognition.
  2. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 1, wherein step S1 comprises: S1.1, collecting dynamic face video sequences covering different motion phases as sample data, and obtaining the corresponding facial expression class labels; S1.2, preprocessing the acquired face video sequences, including face detection, face alignment and cropping, and extracting sequences of continuous frame images containing only the face region; S1.3, dividing the preprocessed continuous frame image sequences into a training set, a validation set and a test set according to a set proportion, uniformly resizing the images to a fixed resolution, and normalizing them.
  3. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 1, wherein the dual-stream feature-decoupling encoding network in step S2 comprises a static facial structure encoder and a dynamic emotion encoder arranged in parallel, the multi-order temporal dynamics modeling module comprises a difference computation unit, a feature projection unit and a multi-head self-attention aggregation unit, the feature linear modulation module comprises a parameter generation network and a feature modulation computation unit, and the temporal classifier comprises a temporal aggregation network and a fully connected prediction layer.
  4. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 3, wherein step S2 comprises: S2.1, constructing the dual-stream feature-decoupling encoding network, whose parallel branches respectively extract static representations related to the identity structure and dynamic emotion features related to facial muscle movement; S2.2, constructing the multi-order temporal dynamics modeling module, which computes first-order velocity and second-order acceleration along the time dimension and captures the correlations among features of different orders; S2.3, constructing the feature linear modulation module, comprising a parameter generation network and a feature modulation computation unit, which performs a trajectory-conditioned channel-level affine transformation of the static features; and S2.4, constructing the temporal classifier, comprising a temporal aggregation network and a fully connected prediction layer, which performs global sequence aggregation and probability prediction on the modulated spatio-temporal features.
  5. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 4, wherein step S2.1 specifically comprises: S2.1.1, constructing the static facial structure encoder on a Vision Transformer backbone, loading pre-trained weights and freezing all network parameters, processing the input video frames, and outputting a static structure feature sequence reflecting the stable anatomical structure; S2.1.2, constructing the dynamic emotion encoder on the same Vision Transformer backbone as the static branch, with a trainable adapter module embedded in each Transformer block; S2.1.3, processing the same video frames with the dynamic emotion encoder, wherein each adapter module maps the backbone features to a low-dimensional space, introduces a nonlinearity, restores the original dimension, and merges the extracted facial muscle movement features into the original network features through a gating mechanism, outputting a structure-independent dynamic emotion feature sequence.
  6. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 5, wherein the information processing sequence of the adapter module in step S2.1.3 is specifically: receiving the input features of the current network layer; sequentially extracting low-rank adaptation features through a dimension-decreasing linear projection, a layer normalization operation, a nonlinear activation function and a dimension-increasing projection matrix; multiplying the low-rank adaptation features by a learnable gating scalar initialized to zero; adding the result element-wise to the input features; and outputting the sum to the next network layer, the gating mechanism realizing a smooth transition from structural pre-trained knowledge to emotion dynamics features (a code sketch of this adapter follows the claims).
  7. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 4, wherein step S2.2 specifically comprises: S2.2.1, constructing a first-order difference computation unit that receives the dynamic emotion feature sequence and computes the difference between the current frame features and the previous frame features, obtaining first-order velocity features characterizing the evolution rate of the expression; S2.2.2, constructing a second-order difference computation unit that computes the difference between the first-order velocity features of the current frame and those of the previous frame, obtaining second-order acceleration features characterizing the intensity of expression change, wherein the velocity features of the first frame are initialized to zero and the acceleration features of the first two frames are initialized to zero; S2.2.3, constructing a multi-head self-attention aggregation unit that projects the dynamic emotion features, first-order velocity features and second-order acceleration features of each time step to the same dimension through learnable linear mappings, concatenates them along the channel dimension into a dynamic feature matrix, feeds this matrix into a multi-head self-attention layer, computes the temporal dependency weights among the multi-order dynamic features, and fuses them into a temporal emotion evolution trajectory (see the dynamics sketch after the claims).
  8. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 4, wherein step S2.3 specifically comprises: S2.3.1, using the parameter generation network to receive the temporal emotion evolution trajectory as a conditioning input and to generate channel-level scaling and translation parameters through a linear mapping layer; S2.3.2, performing frame-by-frame feature fusion with the feature modulation computation unit, which takes the element-wise Hadamard product of the static structure feature sequence and the scaling parameters and then adds the translation parameters element-wise, generating a composite feature sequence injected with the dynamic evolution pattern (see the modulation sketch after the claims).
  9. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 4, wherein step S2.4 specifically comprises: S2.4.1, receiving the composite feature sequence with the temporal aggregation network, which adopts a temporal Transformer structure, aggregates the context of the whole video sequence through a learnable class token vector, and outputs the aggregated class token vector; S2.4.2, receiving the aggregated class token vector with the fully connected prediction layer, computing the probability distribution over the expression classes, and outputting the class with the maximum probability as the final expression class (see the classifier sketch after the claims).
  10. The dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to claim 1, wherein the two-stage strategy in step S3 is specifically: S3.1, using a static facial expression data set as input, freezing the backbone network parameters of the dynamic emotion encoder, and, with minimizing the cross-entropy loss as the sole objective, computing gradients and updating the weights and gating scalars of the adapter modules of all layers to obtain a structure-independent expression feature extractor; and S3.2, using the dynamic video training set, freezing all parameters of the static facial structure encoder and the backbone parameters of the dynamic emotion encoder, and, after the network forward pass produces predictions and the overall recognition loss is computed, jointly updating the parameters of the adapter modules, the multi-order temporal dynamics modeling module, the feature linear modulation module and the temporal classifier through back-propagation to complete the training of the final network (a training skeleton follows the claims).
  11. A dynamic facial expression recognition device based on feature decoupling and multi-order trajectory modulation, characterized by comprising a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the dynamic facial expression recognition method based on feature decoupling and multi-order trajectory modulation according to any one of claims 1-10.
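
The adapter of claim 6 maps directly onto a small residual module. A minimal sketch, assuming GELU as the unnamed nonlinearity and a bottleneck rank of 64; both are illustrative choices, not values from the patent.

```python
# Sketch of the gated low-rank adapter from claim 6: down-projection,
# LayerNorm, nonlinearity, up-projection, then a zero-initialized gate
# on the residual path. Rank and activation choice are assumptions.
import torch
import torch.nn as nn

class GatedAdapter(nn.Module):
    def __init__(self, dim=768, rank=64):
        super().__init__()
        self.down = nn.Linear(dim, rank)           # dimension-decreasing projection
        self.norm = nn.LayerNorm(rank)
        self.act = nn.GELU()                       # nonlinear activation (assumed GELU)
        self.up = nn.Linear(rank, dim)             # dimension-increasing projection
        self.gate = nn.Parameter(torch.zeros(1))   # learnable gating scalar, zero-init

    def forward(self, x):
        # The zero-initialized gate makes the adapter an identity map at the
        # start of training, giving the smooth transition claim 6 describes.
        return x + self.gate * self.up(self.act(self.norm(self.down(x))))
```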
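
Claim 7's multi-order dynamics module can be sketched with torch.diff and nn.MultiheadAttention. Concatenating the three projected orders along the channel dimension and fusing them back to a single width is one plausible reading of S2.2.3; the extra fuse layer is an assumption.

```python
# Sketch of the multi-order dynamics module from claim 7: first- and
# second-order temporal differences, per-order linear projections, and
# multi-head self-attention over the fused dynamic feature matrix.
import torch
import torch.nn as nn

class MultiOrderDynamics(nn.Module):
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.proj_e = nn.Linear(dim, dim)   # dynamic emotion features
        self.proj_v = nn.Linear(dim, dim)   # first-order velocity features
        self.proj_a = nn.Linear(dim, dim)   # second-order acceleration features
        self.fuse = nn.Linear(3 * dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, e):                            # e: (B, T, D)
        v = torch.diff(e, dim=1, prepend=e[:, :1])   # frame 0 velocity = 0
        a = torch.diff(v, dim=1, prepend=v[:, :1])
        # Per S2.2.2, zero out the acceleration of the first two frames.
        mask = (torch.arange(e.size(1), device=e.device) >= 2).to(a.dtype)
        a = a * mask.view(1, -1, 1)
        z = torch.cat([self.proj_e(e), self.proj_v(v), self.proj_a(a)], dim=-1)
        z = self.fuse(z)                             # dynamic feature matrix per step
        traj, _ = self.attn(z, z, z)                 # temporal emotion evolution trajectory
        return traj
```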
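
Claim 8 describes what the deep-learning literature calls feature-wise linear modulation (FiLM). A minimal sketch, assuming a single linear layer as the parameter generation network:

```python
# Sketch of the feature linear modulation step from claim 8: the trajectory
# generates channel-wise scaling and translation parameters that are applied
# to the static structure features frame by frame.
import torch
import torch.nn as nn

class TrajectoryFiLM(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.to_params = nn.Linear(dim, 2 * dim)   # parameter generation network

    def forward(self, static_feats, traj):         # both (B, T, D)
        gamma, beta = self.to_params(traj).chunk(2, dim=-1)
        # Hadamard product with the scaling parameter, then element-wise
        # addition of the translation parameter (S2.3.2).
        return static_feats * gamma + beta
```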
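
Claim 9's temporal classifier is a standard class-token Transformer. A sketch assuming two encoder layers and eight heads; depth and width are not specified in the claim.

```python
# Sketch of the temporal classifier from claim 9: a temporal Transformer
# aggregates the composite sequence via a learnable class token, and a
# fully connected layer predicts the expression class.
import torch
import torch.nn as nn

class TemporalClassifier(nn.Module):
    def __init__(self, dim=768, heads=8, depth=2, num_classes=7):
        super().__init__()
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, seq):                        # seq: (B, T, D) composite features
        cls = self.cls_token.expand(seq.size(0), -1, -1)
        out = self.encoder(torch.cat([cls, seq], dim=1))
        return self.head(out[:, 0])                # logits from the aggregated class token
```

At inference, taking the argmax of the logits yields the class with maximum probability, as S2.4.2 specifies.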
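
The two-stage strategy of claim 10 reduces to two freeze/train passes. A skeleton, where adapter_parameters() and forward_static() are hypothetical helpers standing in for whatever interface the real model exposes; only the freezing and updating pattern follows the claim.

```python
# Skeleton of the two-stage training strategy in claim 10. Data loaders,
# model attributes and optimizer settings are all hypothetical.
import torch
import torch.nn as nn

def train_two_stage(model, static_loader, video_loader, epochs=(5, 20)):
    ce = nn.CrossEntropyLoss()

    # Stage 1 (S3.1): freeze everything, then unfreeze only the adapters
    # (weights and gating scalars) and tune them on a static expression set.
    for p in model.parameters():
        p.requires_grad = False
    for p in model.adapter_parameters():           # hypothetical helper
        p.requires_grad = True
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _ in range(epochs[0]):
        for images, labels in static_loader:
            opt.zero_grad()
            ce(model.forward_static(images), labels).backward()  # hypothetical path
            opt.step()

    # Stage 2 (S3.2): keep both backbones frozen; jointly update the adapters,
    # multi-order dynamics module, modulation module and temporal classifier.
    for name, p in model.named_parameters():
        p.requires_grad = not name.startswith("backbone")        # illustrative rule
    opt = torch.optim.AdamW([p for p in model.parameters() if p.requires_grad], lr=1e-4)
    for _ in range(epochs[1]):
        for clips, labels in video_loader:
            opt.zero_grad()
            ce(model(clips), labels).backward()
            opt.step()
    return model
```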

Description

Dynamic facial expression recognition method and device based on feature decoupling and multi-order trajectory modulation

Technical Field

The invention relates to the technical field of computer vision and affective computing, in particular to a dynamic facial expression recognition method and device based on feature decoupling and multi-order trajectory modulation, intended for accurately recognizing the dynamic evolution of facial emotion in continuous video sequences in complex open scenarios such as human-computer interaction, intelligent driving and mental health analysis.

Background

With the development of human-computer interaction, intelligent driving, mental health analysis and related fields, dynamic facial expression recognition, which can capture the complete emotion evolution process, has become an important research hotspot in computer vision and affective computing. However, when processing face videos in natural environments, the prior art still has the following significant drawbacks.

First, the conflict between the static annotation paradigm and dynamic emotion evolution leads to severe feature coupling. In existing mainstream data sets, an entire video sequence is typically given only one static classification label representing the "peak emotion". A real facial expression, however, is a continuous dynamic evolution process comprising onset, peak and offset phases. The mismatch between the static labeling paradigm and the dynamic content forces existing deep learning models to entangle transient facial muscle movement features with the static facial structure of a specific person (such as facial-feature topology and identity characteristics). As a result, models easily overfit to specific identities, struggle to generalize across identities, and cannot effectively extract the key emotion information in transition frames or non-peak frames.

Second, temporal dynamics information is under-exploited, and a complete description of the expression evolution trajectory is lacking. To address the annotation mismatch described above, many existing approaches discard non-peak frames (e.g., transition frames, neutral frames) as "noise" or simply recast the task as a multi-instance learning problem that searches for representative frames. In fact, the velocity of facial muscle movement (first-order dynamics) and the rate of change of that velocity (second-order acceleration) are the underlying physical quantities describing the emotion evolution trajectory and carry rich discriminative information. The few existing methods that do focus on motion features usually introduce only simple first-order inter-frame differences, completely ignoring the important role of second-order dynamics in describing the intensity of emotion onset or fading. This over-simplified treatment of transition frames, together with the neglect of multi-order dynamics, causes the prior art to lose a great deal of the emotional narrative and makes it unable to distinguish subtle and complex expressions that differ essentially in evolution rate and intensity.

Patent document CN120431614A discloses a multi-modal dynamic expression recognition method and system based on feature adapters. It builds frozen video and audio processing backbones from masked autoencoder models and integrates a learnable prompt embedding module and a modality fusion adapter into the backbones to improve detection by fusing audio and video features. The method has obvious limitations. First, it depends heavily on auxiliary multi-modal information such as audio, so a missing modality or environmental noise in practical applications (such as silent surveillance or noisy environments) can sharply degrade its performance. Second, although it introduces feature adapters, their main purpose is to align and fuse multi-modal features, and no explicit mechanism is established within the visual modality to deeply decouple the static identity facial structure from dynamic muscle movement features. Faced with purely visual input, the network remains vulnerable to entangling transient expressions with a specific person's identity structure, limiting its cross-face generalization.

Patent document CN120580728A discloses a cross-domain micro-expression recognition method based on dynamic semantic guidance and adaptive diffusion modeling. The method extracts features through motion descriptors and introduces a diffusion mechanism to perform noi