CN-122024293-A - Facial micro-expression analysis method based on a spatio-temporal multimodal large language model
Abstract
The invention discloses a facial micro-expression analysis method based on a spatio-temporal multimodal large language model, and relates to the technical field of facial micro-expression analysis. The method comprises the steps of model construction, localization and recognition, motion-feature extraction, and operation output. By integrating the micro-expression localization and recognition tasks within a unified multimodal large-language-model framework, the method avoids the task-separation problem of traditional staged methods; by using spatio-temporal optical-flow features and multimodal input, it improves the ability to capture micro-expressions and reduces the dependence on accurate peak (apex) frame detection; and it supports multi-task extensions such as visual question answering, giving it good extensibility and application flexibility.
Inventors
- Pan Hang
- Li Dangen
- Sun Wenkai
Assignees
- Changzhi University (长治学院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2025-11-19
Claims (1)
- 1. A facial micro-expression analysis method based on a spatio-temporal multimodal large language model, characterized by comprising the steps of model construction, localization and recognition, motion-feature extraction, and operation output, wherein: S1, model construction: the model structure consists of a multimodal input layer, a spatio-temporal encoding layer, and a large-language-model interaction layer; the multimodal input layer fuses visual features with optical-flow information, the spatio-temporal encoding layer uses a spatio-temporal Transformer and a multimodal fusion module to extract cross-frame local and global emotion changes, and the large-language-model interaction layer converts the features into language semantic embeddings to achieve joint modeling of localization and recognition; S2, localization and recognition: a micro-expression video sequence or a key-frame pair is input to complete the micro-expression localization and recognition task; in the localization stage, the optical flow or emotion change between two frames of the video sequence is used as input, and the inter-frame difference is output as an audio-visual model; S3, motion-feature extraction: the micro-expression video sequence or key-frame pair is taken from the output audio-visual model, the motion features between the two frames are extracted by optical-flow computation, and face detection and alignment are performed at the same time; and S4, operation and output: based on the motion features extracted in S3, the model runs a multi-task output layer that simultaneously provides a localization-task output and a recognition-task output, where the localization-task output is the inter-frame difference representation and the recognition-task output is the micro-expression emotion category, and a multi-task joint strategy is adopted so that the underlying features are shared during training (see the sketch below).
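To make the claimed structure concrete, the following is a minimal PyTorch sketch of the three layers in S1 and the two task heads in S4. All dimensions, the 112×112 frame resolution, the generic Transformer encoder standing in for the spatio-temporal Transformer, and the linear projection standing in for an LLM embedding space are illustrative assumptions, not details disclosed in the patent.

```python
# Minimal sketch of S1's three-layer structure and S4's two task heads.
# Sizes, modules, and resolutions are assumptions, not the patented design.
import torch
import torch.nn as nn

class MicroExpressionModelSketch(nn.Module):
    def __init__(self, feat_dim=256, llm_dim=768, num_emotions=5):
        super().__init__()
        # Multimodal input layer: fuse visual features with optical flow.
        self.visual_proj = nn.Linear(3 * 112 * 112, feat_dim)  # flattened RGB frame
        self.flow_proj = nn.Linear(2 * 112 * 112, feat_dim)    # flattened (dx, dy) flow
        # Spatio-temporal encoding layer: Transformer over the frame axis
        # to capture cross-frame local and global emotion changes.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        # LLM interaction layer: project fused features into a language
        # semantic embedding space (llm_dim mimics an LLM hidden size).
        self.to_llm = nn.Linear(feat_dim, llm_dim)
        # Multi-task output layer (S4): two heads over shared features.
        self.loc_head = nn.Linear(llm_dim, 1)             # inter-frame difference score
        self.rec_head = nn.Linear(llm_dim, num_emotions)  # emotion-category logits

    def forward(self, frames, flows):
        # frames: (B, T, 3*112*112); flows: (B, T, 2*112*112)
        fused = self.visual_proj(frames) + self.flow_proj(flows)  # modality fusion
        encoded = self.encoder(fused)                             # cross-frame context
        semantic = self.to_llm(encoded.mean(dim=1))               # pooled semantic embedding
        return self.loc_head(semantic), self.rec_head(semantic)

model = MicroExpressionModelSketch()
frames = torch.randn(2, 8, 3 * 112 * 112)
flows = torch.randn(2, 8, 2 * 112 * 112)
loc_score, emo_logits = model(frames, flows)
print(loc_score.shape, emo_logits.shape)  # torch.Size([2, 1]) torch.Size([2, 5])
```

Note that both heads read the same pooled embedding, which is one simple way to realize the claim's requirement that localization and recognition share underlying features.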
Description
Facial micro-expression analysis method based on a spatio-temporal multimodal large language model
Technical Field
The invention relates to the technical field of facial micro-expression analysis, and in particular to a facial micro-expression analysis method based on a spatio-temporal multimodal large language model.
Background
With the rapid development of artificial intelligence and human-machine interaction, society is entering a highly intelligent era. Intelligent human-machine interaction requires not only that a robot complete assigned tasks, but also that it exhibit human-like emotion cognition, expression, and feedback during the interaction. Psychological studies have shown that more than 50% of the information in human emotional expression is conveyed by facial expressions. Facial expressions are classified into macro-expressions and micro-expressions. Compared with macro-expressions, micro-expressions are short in duration and small in muscle-movement amplitude, and they often reveal an individual's true emotion. However, existing micro-expression analysis methods generally decompose the task into two stages, micro-expression localization and micro-expression recognition: first locating the segments of a video that may contain micro-expressions, and then recognizing the key frames within those segments. This staged approach has clear limitations: because micro-expression muscle movement is highly coupled with individual identity and the movement signal is weak, separating the localization and recognition tasks makes it difficult to solve them efficiently within a unified framework. Existing micro-expression analysis methods mainly adopt staged strategies based on optical flow, feature-point matching, or convolutional neural networks, which separate the localization and recognition tasks; a brief sketch of such inter-frame optical-flow computation is given after this section. For example, in the baseline schemes of MEGC 2024 and 2025, localization and recognition are completed independently, a unified modeling framework is lacking, and feature extraction depends heavily on accurate temporal segment labeling and peak-frame detection, making it susceptible to environmental illumination changes, occlusion, and individual differences. These methods therefore have notable limitations: on the one hand, task separation leads to a low information-utilization rate and no sharing mechanism between localization and recognition; on the other hand, because micro-expression muscle movement is small in amplitude, the staged approach lacks robustness and is difficult to extend to more complex multimodal tasks such as visual question answering, limiting overall recognition accuracy and application flexibility.
Disclosure of Invention
The invention aims to provide a facial micro-expression analysis method based on a spatio-temporal multimodal large language model, which solves the problems of the lack of a unified modeling framework and the susceptibility to environmental illumination changes, occlusion, and individual differences.
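Both the staged baselines discussed in the Background and step S3 of the claimed method rely on dense optical flow between two frames (e.g., an onset/apex pair) as the inter-frame motion feature. The sketch below shows one common way to compute it with OpenCV's Farneback algorithm, using a Haar cascade as a crude stand-in for face detection and alignment; the patent does not name a specific optical-flow algorithm, detector, crop size, or file layout, so all of those are assumptions here.

```python
# Sketch of the S3-style motion-feature step: face detection/alignment,
# then dense optical flow between two frames. Algorithm choices, the
# 112x112 size, and the frame file names are placeholder assumptions.
import cv2
import numpy as np

def motion_features(onset_path, apex_path, size=112):
    onset = cv2.imread(onset_path, cv2.IMREAD_GRAYSCALE)
    apex = cv2.imread(apex_path, cv2.IMREAD_GRAYSCALE)
    # Detect the face on the onset frame and crop both frames to the same
    # box: a crude stand-in for proper face alignment.
    cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = cascade.detectMultiScale(onset, scaleFactor=1.1, minNeighbors=5)
    if len(faces) > 0:
        x, y, w, h = faces[0]
        onset, apex = onset[y:y + h, x:x + w], apex[y:y + h, x:x + w]
    onset = cv2.resize(onset, (size, size))
    apex = cv2.resize(apex, (size, size))
    # Dense optical flow between the two frames: a (size, size, 2) array of
    # per-pixel (dx, dy) displacements, i.e. the inter-frame motion feature.
    flow = cv2.calcOpticalFlowFarneback(
        onset, apex, None, pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow

# flow = motion_features("onset.jpg", "apex.jpg")  # hypothetical frame files
```

Because micro-expression motion amplitudes are small, the resulting flow field is dominated by near-zero values, which is why the Background argues that staged pipelines built on such features lack robustness.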
In order to achieve the above purpose, the present invention provides the following technical solutions. A facial micro-expression analysis method based on a spatio-temporal multimodal large language model comprises the steps of model construction, localization and recognition, motion-feature extraction, and operation output. S1, model construction: the model structure consists of a multimodal input layer, a spatio-temporal encoding layer, and a large-language-model interaction layer; the multimodal input layer fuses visual features with optical-flow information, the spatio-temporal encoding layer uses a spatio-temporal Transformer and a multimodal fusion module to extract cross-frame local and global emotion changes, and the large-language-model interaction layer converts the features into language semantic embeddings to achieve joint modeling of localization and recognition. S2, localization and recognition: a micro-expression video sequence or a key-frame pair is input to complete the micro-expression localization and recognition task; in the localization stage, the optical flow or emotion change between two frames of the video sequence is used as input, and the inter-frame difference is output as an audio-visual model. S3, motion-feature extraction: the micro-expression video sequence or key-frame pair is taken from the output audio-visual model, the motion features between the two frames are extracted by optical-flow computation, and face detection and alignment are performed at the same time. S4, operation and output: based on the motion features extracted in S3, the model runs a multi-task output layer that simultaneously provides a localization-task output and a recognition-task output, where the localization-task output is the inter-frame difference representation and the recognition-task output is the micro-expression emotion category, and a multi-task joint strategy is adopted so that the underlying features are shared during training.
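Step S4's multi-task joint strategy, in which localization and recognition share the underlying features during training, could plausibly be realized as a weighted sum of a regression loss on the inter-frame difference and a classification loss on the emotion category. The sketch below assumes the hypothetical model sketched after claim 1; the specific loss functions and the weight alpha are assumptions, as the patent does not disclose them.

```python
# Hedged sketch of an S4-style joint objective: both loss terms update the
# same shared features, so the two tasks regularize each other. Loss
# choices and the alpha weighting are assumptions, not patent details.
import torch
import torch.nn as nn

loc_loss_fn = nn.MSELoss()           # inter-frame difference regression
rec_loss_fn = nn.CrossEntropyLoss()  # emotion-category classification

def joint_loss(loc_pred, loc_target, emo_logits, emo_target, alpha=0.5):
    # Weighted multi-task objective over a shared backbone.
    return (alpha * loc_loss_fn(loc_pred, loc_target)
            + (1 - alpha) * rec_loss_fn(emo_logits, emo_target))

# Usage with the hypothetical model sketched after claim 1:
# loc_score, emo_logits = model(frames, flows)
# loss = joint_loss(loc_score, diff_targets, emo_logits, emo_labels)
```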