CN-121999520-A - Micro-expression recognition method and system based on graph convolution and Transformer cascade model
Abstract
The invention relates to a micro-expression recognition method and system based on a graph convolution and Transformer cascade model, belonging to the technical field of computer vision and image processing. The invention constructs a deep modeling framework for micro-expression spatio-temporal dynamics by combining facial structure priors, deformation-amplified features and optical flow motion features, so as to address the difficulty of detecting micro-expressions, whose amplitude is weak, duration is short and local variation is inconspicuous. The method designs a cascade model of alternately stacked graph convolutional networks and Transformers to jointly characterize the local structural relations and global dynamic dependencies of the face, and introduces an adaptive fusion mechanism of a learnable graph structure and multi-source features, improving feature discrimination while preserving the rationality of structural priors, thereby improving the accuracy and robustness of micro-expression recognition and providing technical support for automatic micro-expression recognition in application scenarios such as psychological analysis, security monitoring and human-computer interaction.
Inventors
- YAO XUNXIANG
- XU YINGCHENG
- ZHANG QIUYUE
- LIU XIAOYAN
- XU MINFENG
- LIU PEIDE
- ZHANG PENG
Assignees
- Shandong University of Finance and Economics (山东财经大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-30
Claims (10)
- 1. A micro-expression recognition method based on a graph convolution and Transformer cascade model, characterized by comprising the following steps: acquiring a start frame and a peak frame of a micro-expression sequence, performing face detection, key-point positioning, alignment and cropping on the start frame and the peak frame to obtain standardized face images, and selecting a start-frame/peak-frame pair based on a preset rule; processing the start-frame/peak-frame pair with a pre-trained deformation amplification network to extract amplified deformation features of local facial regions, so as to highlight weak muscle movement; computing an optical flow image from the start frame and the peak frame with the TV-L1 optical flow algorithm to obtain pixel-level motion information, comprising each pixel's horizontal motion component, vertical motion component, and the motion magnitude computed from them; partitioning the standardized face image into regions according to the facial key points and constructing a static adjacency matrix representing the facial topology; computing the similarity between regions from the amplified deformation features and weighting the static adjacency matrix with it to obtain an initial graph structure combining structural priors with data-driven characteristics; dividing the optical flow image into a plurality of image patches, mapping them into a feature vector sequence, and feeding the sequence into a vision Transformer for global dynamic modeling to obtain cross-region global spatio-temporal features; performing weighted fusion of the amplified deformation features and the global spatio-temporal features to form a multi-modal fused feature representation of the micro-expression; inputting the multi-modal fused features and the initial graph structure into a cascade model formed by alternately stacked graph convolutional networks and Transformers, aggregating local neighborhood structure information through the graph convolutional network, capturing long-range dependencies through the Transformer, and updating the graph structure by learning during training to obtain micro-expression discriminative features; and classifying the micro-expression discriminative features and outputting the corresponding micro-expression category result.
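The pixel-level motion information named in claim 1 can be illustrated with a short sketch. The TV-L1 solver itself (available, for example, in OpenCV's contrib modules) is outside the scope of this snippet; shown here is only the final step the claim describes, computing the motion magnitude from the horizontal and vertical flow components. All names are illustrative, not the patent's own code.

```python
import numpy as np

def flow_magnitude(u, v):
    """Per-pixel motion amplitude from the horizontal (u) and
    vertical (v) optical-flow components, as described in claim 1."""
    return np.sqrt(u ** 2 + v ** 2)

u = np.array([[3.0, 0.0]])
v = np.array([[4.0, 0.0]])
print(flow_magnitude(u, v))  # [[5. 0.]]
```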
- 2. The micro-expression recognition method based on the graph convolution and Transformer cascade model according to claim 1, wherein the pre-trained deformation amplification network performs deformation enhancement on the start frame and the peak frame; the deformation amplification network models the residual structural change between the input start-frame/peak-frame pair and outputs deformation feature vectors for a plurality of local regions, each deformation feature vector corresponding to the deformation degree of one local facial region; specifically, the deformation amplification network first encodes the start frame and the peak frame to extract intermediate feature representations of the facial geometric structure, then computes their difference or residual to obtain residual information reflecting structural change in local facial regions, amplifies this residual information, and outputs a feature map containing the amplified deformation information.
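The residual amplification of claim 2 can be sketched in feature space. The sketch assumes the encoder has already produced onset and apex feature vectors; the amplification factor k and the linear re-combination are illustrative assumptions (in the patent this mapping is learned by the network).

```python
import numpy as np

def amplify_deformation(feat_onset, feat_apex, k=3.0):
    """Difference (residual) between apex and onset features, scaled
    by a hypothetical amplification factor k and added back, so that
    weak muscle movement is exaggerated in feature space."""
    residual = feat_apex - feat_onset
    return feat_onset + k * residual

onset = np.array([1.0, 1.0])
apex = np.array([1.1, 0.9])   # tiny deformation between the two frames
print(amplify_deformation(onset, apex))   # [1.3 0.7]
```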
- 3. The micro-expression recognition method based on the graph convolution and Transformer cascade model according to claim 1, wherein partitioning the standardized face image into regions according to the facial key points comprises: dividing the face into a plurality of sub-regions according to a preset set of facial landmark points, each sub-region serving as a node in the graph structure, and constructing the static adjacency matrix from the spatial adjacency and anatomical correlation between nodes, so as to represent the topological connections between facial regions.
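The static adjacency matrix of claim 3 can be sketched as follows. The four-region graph and its neighbor pairs are hypothetical stand-ins for the patent's landmark-based partition, and the self-loops are a conventional addition for graph convolution rather than something the claim specifies.

```python
import numpy as np

def build_static_adjacency(num_regions, neighbor_pairs):
    """Symmetric 0/1 adjacency matrix over facial regions; self-loops
    are added on the diagonal (a common convention, assumed here)."""
    A = np.eye(num_regions)
    for i, j in neighbor_pairs:
        A[i, j] = A[j, i] = 1.0
    return A

# Hypothetical 4-region face graph, e.g. left brow, right brow,
# left eye, right eye, connected by spatial adjacency.
A = build_static_adjacency(4, [(0, 1), (0, 2), (1, 3)])
print(A)
```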
- 4. The micro-expression recognition method based on the graph convolution and Transformer cascade model according to claim 1, wherein computing the similarity between regions from the amplified deformation features, weighting the static adjacency matrix, and obtaining the initial graph structure with structural priors and data-driven characteristics comprises: for each pair of regions with a static connection relation, i.e., face region nodes defined as directly connected in the static adjacency matrix, extracting the corresponding amplified deformation feature vectors, computing the cosine similarity between the two, and multiplying the static adjacency matrix element-wise by the obtained cosine similarity as a weighting coefficient to obtain the initial weighted adjacency matrix, namely the initial graph structure; for regions with a static connection relation, the cosine similarity of their amplified deformation feature vectors is computed as:

  s_ij = (f_i · f_j) / (‖f_i‖ ‖f_j‖)

wherein s_ij represents the deformation feature similarity between the i-th face region and the j-th face region; f_i represents the amplified deformation feature vector of the i-th face region; and f_j represents the amplified deformation feature vector of the j-th face region.
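Claim 4's similarity weighting can be sketched with standard cosine similarity; the toy adjacency matrix and feature vectors below are illustrative, and the epsilon guard is an added numerical-safety assumption.

```python
import numpy as np

def cosine_sim(f_i, f_j, eps=1e-8):
    """Cosine similarity between two amplified deformation feature vectors."""
    return float(f_i @ f_j / (np.linalg.norm(f_i) * np.linalg.norm(f_j) + eps))

def weight_adjacency(A_static, feats):
    """Multiply each statically connected edge (i, j) by the cosine
    similarity of the two regions' deformation features, yielding the
    initial weighted adjacency matrix (the initial graph structure)."""
    W = np.zeros_like(A_static)
    for i in range(A_static.shape[0]):
        for j in range(A_static.shape[1]):
            if A_static[i, j] > 0:
                W[i, j] = A_static[i, j] * cosine_sim(feats[i], feats[j])
    return W

A = np.array([[0.0, 1.0], [1.0, 0.0]])       # two connected regions
feats = np.array([[1.0, 0.0], [1.0, 0.0]])   # identical deformation features
print(weight_adjacency(A, feats))            # edge weight close to 1.0
```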
- 5. The micro-expression recognition method based on the graph convolution and Transformer cascade model according to claim 1, wherein dividing the optical flow image into a plurality of image patches, mapping them into a feature vector sequence, and feeding them into a vision Transformer for global dynamic modeling to obtain cross-region global spatio-temporal features comprises: dividing the optical flow image into image patches of a preset size, flattening and linearly projecting each patch to obtain a feature vector sequence, adding a learnable positional encoding, and feeding the sequence into the vision Transformer, which establishes dependencies between different patches through a self-attention mechanism to capture cross-region motion associations and the overall dynamic pattern; specifically, the feature vector sequence is input into the vision Transformer, correlation weights between different patches are computed by multi-head self-attention, and through iterative processing by multiple layers of self-attention and feed-forward networks the output feature vectors contain not only local optical flow information but also fused dynamic association information from different facial regions, finally forming a global spatio-temporal feature representation that reflects the overall facial motion pattern and the coordinated changes between regions.
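The patch-sequence construction of claim 5 can be sketched as follows; the linear projection, positional encoding and attention layers are omitted, and the 8x8 two-channel flow image with 4x4 patches is an illustrative assumption.

```python
import numpy as np

def patchify(flow_img, patch):
    """Split an H x W x C optical-flow image into non-overlapping
    patch x patch blocks and flatten each block into a vector,
    producing the token sequence fed to the vision Transformer."""
    H, W, C = flow_img.shape
    tokens = []
    for y in range(0, H, patch):
        for x in range(0, W, patch):
            tokens.append(flow_img[y:y + patch, x:x + patch].reshape(-1))
    return np.stack(tokens)

flow = np.zeros((8, 8, 2))   # u and v flow channels
tokens = patchify(flow, 4)
print(tokens.shape)          # (4, 32): four patches of 4*4*2 values
```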
- 6. The micro-expression recognition method based on the graph convolution and Transformer cascade model according to claim 1, wherein performing weighted fusion of the amplified deformation features and the global spatio-temporal features to form the multi-modal fused feature representation of the micro-expression comprises: mapping the amplified deformation features and the global spatio-temporal features to the same dimensional space, and performing linear weighted fusion with a learnable fusion coefficient α:

  F = α · F_d + (1 − α) · F_g

wherein α is a weight learned automatically by gradient descent during training; F denotes the feature representation obtained by the linear weighted fusion of F_d and F_g, namely the multi-modal fused feature representation; F_d denotes the region-aggregated deformation features extracted by the deformation amplification network, representing the geometric deformation of each local facial region between the start frame and the peak frame; and F_g denotes the global spatio-temporal features obtained by the vision Transformer's global dynamic modeling of the optical flow image, representing cross-region motion associations and the overall dynamic pattern between different facial regions; the cascade model comprises a plurality of cascaded graph-Transformer modules, each comprising a graph convolution submodule (i.e., a graph convolutional network), a self-attention submodule, and matching residual connection and normalization units; the graph convolution submodule performs local neighborhood information aggregation on node features under the constraint of the current graph structure: its inputs are the node feature matrix and the corresponding adjacency matrix, and it performs weighted aggregation over the features of neighboring nodes connected to the current node to obtain node feature representations fused with neighborhood structure information; the self-attention submodule performs global dependency modeling on the node features output by the graph convolution, computing correlation weights between different nodes via self-attention so that each node's feature update takes the features of all other nodes into account, thereby establishing cross-region long-range dependencies; aggregating local neighborhood structure information through the graph convolutional network, capturing long-range dependencies through the Transformer, and updating the graph structure by learning during training to obtain micro-expression discriminative features comprises: inputting the multi-modal fused features and the initial graph structure into the first graph-Transformer module, where the graph convolution submodule first performs local neighborhood aggregation on the features of each face region node under the constraint of the adjacency matrix, so that node features fuse the structure and deformation information of adjacent regions, yielding an intermediate feature representation containing local topological relations; the self-attention submodule then performs global modeling on this intermediate representation, with node features perceiving dynamic associations between remote regions through correlation weights computed between any two region nodes; meanwhile, during training, a learnable adjacency matrix is introduced to adaptively update the initial graph structure, so that the graph structure dynamically adjusts its connection weights according to the facial motion patterns of different samples; after cascaded processing by a plurality of graph-Transformer modules, the resulting node features, or their aggregation, form micro-expression discriminative features containing both local facial structural relations and cross-region dynamic dependency information.
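The linear weighted fusion described in claim 6 can be sketched as below. The exact form α·F_d + (1 − α)·F_g is an assumption consistent with the claim's wording; in the patent α is learned by gradient descent, whereas here it is fixed for illustration.

```python
import numpy as np

def fuse(F_d, F_g, alpha=0.6):
    """Linear weighted fusion of region-aggregated deformation
    features F_d and global spatio-temporal features F_g (assumed
    already mapped to the same dimension); alpha would be a trainable
    scalar in the actual model."""
    return alpha * F_d + (1.0 - alpha) * F_g

F_d = np.array([1.0, 0.0])
F_g = np.array([0.0, 1.0])
print(fuse(F_d, F_g))   # [0.6 0.4]
```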
- 7. The micro-expression recognition method based on the graph convolution and Transformer cascade model according to any one of claims 1 to 6, wherein the graph convolution submodule performs local neighborhood aggregation on the features of each facial region node under the constraint of the adjacency matrix, so that node features fuse the structure and deformation information of adjacent regions, yielding an intermediate feature representation containing local topological relations, implemented as follows: the input of the graph convolutional network (GCN) is the node feature of each facial region together with the corresponding adjacency matrix, the adjacency matrix defining the connection relations between nodes in the graph; in each graph convolution layer, the graph convolution submodule aggregates each node's features with those of its neighboring nodes through the adjacency matrix by weighted averaging to obtain a new feature representation, expressed by the formula:

  H^(l+1) = σ(Â H^(l) W^(l))

wherein H^(l) is the node feature matrix of the l-th layer; Â is the normalized adjacency matrix containing structural information between nodes; W^(l) is the learnable weight matrix of the l-th layer; σ is an activation function; and H^(l+1) is the node feature matrix of the (l+1)-th layer; the aggregated features, i.e., the new features after graph convolution aggregation, fuse the nodes' local topological relations and deformation information to generate the intermediate feature representation; on the basis of the initial weighted adjacency matrix, learnable weight matrices equal in number to the nodes are introduced to adjust the connection strength of each edge, and the updated adjacency matrix is normalized by a Sigmoid function; the classification module comprises a fully connected layer and an activation function layer, performs nonlinear mapping on the micro-expression discriminative features, outputs a probability distribution over the target micro-expression categories, and determines the final micro-expression recognition result according to a preset decision rule; a cross-entropy loss function is used during cascade model training to optimize the network parameters in a supervised manner.
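The per-layer aggregation in claim 7 can be sketched with a minimal GCN layer. The symmetric normalization D^(-1/2) A D^(-1/2) and the ReLU activation are conventional choices assumed here; the claim itself only specifies a normalized adjacency matrix and an activation function.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization A_hat = D^(-1/2) A D^(-1/2), one
    standard reading of the claim's 'normalized adjacency matrix'."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(H, A_hat, W):
    """One graph convolution layer: H_next = sigma(A_hat @ H @ W),
    with ReLU standing in for the activation sigma."""
    return np.maximum(A_hat @ H @ W, 0.0)

# Two mutually connected regions with self-loops.
A = np.array([[1.0, 1.0], [1.0, 1.0]])
H = np.array([[1.0], [3.0]])   # one feature per node
W = np.eye(1)                  # identity weights for illustration
print(gcn_layer(H, normalize_adj(A), W))   # each node averages itself and its neighbor
```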
- 8. A micro-expression recognition system based on a graph convolution and Transformer cascade model, comprising: a data acquisition and preprocessing module configured to acquire a start frame and a peak frame of a micro-expression sequence, perform face detection, key-point positioning, alignment and cropping on them to obtain standardized face images, and select a start-frame/peak-frame pair based on a preset rule; a deformation feature extraction module configured to process the start-frame/peak-frame pair with a pre-trained deformation amplification network and extract amplified deformation features of local facial regions to highlight weak muscle movement; an optical flow feature extraction module configured to compute an optical flow image from the start frame and the peak frame with the TV-L1 optical flow algorithm to obtain pixel-level motion information, comprising each pixel's horizontal motion component, vertical motion component, and the computed motion magnitude; a graph structure construction module configured to partition the standardized face image into regions according to the facial key points, construct a static adjacency matrix representing the facial topology, compute inter-region similarity from the amplified deformation features, and weight the static adjacency matrix to obtain an initial graph structure with structural priors and data-driven characteristics; a global dynamic modeling module configured to divide the optical flow image into a plurality of image patches, map them into a feature vector sequence, and feed them into a vision Transformer for global dynamic modeling to obtain cross-region global spatio-temporal features; a feature fusion module configured to perform weighted fusion of the amplified deformation features and the global spatio-temporal features to form a multi-modal fused feature representation of the micro-expression; a graph-Transformer cascade modeling module configured to input the multi-modal fused features and the initial graph structure into a cascade model formed by alternately stacked graph convolutional networks and Transformers, aggregate local neighborhood structure information through the graph convolutional network, capture long-range dependencies through the Transformer, and update the graph structure by learning during training to obtain micro-expression discriminative features; and a classification module configured to output a micro-expression category recognition result from the micro-expression discriminative features.
- 9. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the micro-expression recognition method based on a graph convolution and Transformer cascade model according to any one of claims 1 to 7.
- 10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the micro-expression recognition method based on a graph convolution and Transformer cascade model according to any one of claims 1 to 7.
Description
Micro-expression recognition method and system based on graph convolution and Transformer cascade model

Technical Field

The invention belongs to the technical field of computer vision and image processing, and in particular relates to a micro-expression recognition method and system based on a graph convolution and Transformer cascade model, specifically the design and implementation of a deep learning model that extracts subtle facial action features from face image sequences and performs automatic micro-expression recognition.

Background

Micro-expressions are involuntary, subtle facial expression changes, typically lasting between 0.04 and 0.5 seconds, that occur when emotion is suppressed or changes rapidly. Although they are brief, small in amplitude and hard to detect, micro-expressions are widely studied in psychological analysis, security monitoring, medical diagnosis, human-computer interaction and other fields, so accurately recognizing them has important research and application value. However, automatic recognition of micro-expressions still faces many challenges. First, micro-expression motion amplitudes are extremely small, usually appearing only as slight changes in local muscles, and conventional methods based on texture or optical flow features struggle to capture these subtle differences. Second, micro-expression sample data are scarce: each public dataset is limited in scale and its categories are imbalanced, which severely constrains the training of deep learning models. Third, there are complex structural relations and correlations between different facial regions; relying only on local convolution or global modeling easily ignores the topological correlations between key action regions, resulting in insufficient recognition accuracy.
In recent years, micro-expression recognition methods based on optical flow estimation, spatio-temporal texture description and convolutional neural networks (CNN) have been widely studied, but due to their limited modeling capability they still fall short in handling weak motion, long-range dependencies and fine-grained feature fusion. With the development of graph neural networks (GNN) and vision Transformer technology, dividing facial regions into nodes and modeling their spatial topology with graph structures has become a new research direction. A graph convolutional network (GCN) can characterize the connection structure between local regions, facilitating associative modeling of subtle muscular actions; a vision Transformer relies on self-attention to capture global dynamic relations across regions, which is advantageous for modeling the spatio-temporal features of subtle expressions. However, existing methods often rely on a single model structure and lack joint modeling of local structure and global dynamics; some methods adopt a static graph structure or fixed adjacency relations and lack an adaptive representation of real facial action associations; in addition, expression deformation features and optical flow motion features are often modeled separately, without an effective multi-source dynamic fusion mechanism, so recognition performance remains limited. Therefore, a new model architecture combining the advantages of graph convolution and the vision Transformer is needed to jointly model local facial structural relations, global dynamic dependencies and fine motion features, thereby improving the accuracy and robustness of micro-expression recognition.
Disclosure of Invention

To overcome the defects of the prior art, the invention provides a micro-expression recognition method and system based on a graph convolution and Transformer cascade model. By combining facial structure priors, deformation-amplified features and optical flow motion features, it constructs a deep modeling framework for micro-expression spatio-temporal dynamics, addressing the difficulty of detecting micro-expressions whose amplitude is weak, duration is short and local variation is inconspicuous; it designs a cascade model of alternately stacked graph convolutional networks and Transformers to jointly characterize local facial structural relations and global dynamic dependencies, and introduces an adaptive fusion mechanism of a learnable graph structure and multi-source features, improving feature discrimination while preserving the rationality of structural priors, thereby improving the accuracy and robustness of micro-expression recognition and providing technical support for automatic micro-expression recognition in application scenarios such as psychological analysis, security monitoring and human-computer interaction.