CN-117197719-B - Multi-modal emotion recognition method, device, equipment and computer storage medium
Abstract
The invention discloses a multimodal emotion recognition method, device, equipment, and computer storage medium, applied to the technical field of intelligent recognition. The method comprises: extracting labeled information of different modalities from labeled video data and training a feature fusion model on it to obtain a trained feature fusion model; extracting unlabeled information of different modalities from unlabeled video data and processing the unlabeled features through the trained feature fusion model; and processing the labeled data and the selected pseudo-label data through the trained feature fusion model to obtain and output a final recognition result. The information in both the labeled and the unlabeled data is fully utilized: multimodal features are extracted by pre-trained models and effectively combined by the feature fusion model, so the accuracy of emotion classification is improved.
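As a minimal sketch of the semi-supervised loop the abstract describes, the following assumes a PyTorch classifier-style fusion model and a softmax-confidence threshold; the function name, the threshold value, and the tensor shapes are illustrative assumptions, not taken from the patent:

```python
import torch

# Hypothetical helper; the patent only says "higher confidence" without
# specifying how confidence is measured or thresholded.
def pseudo_label_round(fusion_model, unlabeled_feats, threshold=0.9):
    """One self-training round: predict pseudo-labels for unlabeled
    features and keep only the high-confidence ones."""
    fusion_model.eval()
    with torch.no_grad():
        probs = torch.softmax(fusion_model(unlabeled_feats), dim=-1)
    conf, labels = probs.max(dim=-1)
    keep = conf >= threshold               # "higher confidence" selection
    return unlabeled_feats[keep], labels[keep]
```

The selected feature/pseudo-label pairs would then be appended to the labeled training set before the fusion model is trained again, as claims 1 and 5 describe.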
Inventors
- Peng Xiaojiang
- Cheng Zebang
- Lin Yuxiang
Assignees
- Shenzhen Technology University (深圳技术大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-09-26
Claims (7)
- 1. A multimodal emotion recognition method, comprising the steps of: extracting labeled information of different modalities from labeled video data, and processing the labeled information of the different modalities through corresponding pre-trained models to extract labeled features; taking the labeled features as the input of a feature fusion model, and training the feature fusion model to obtain a trained feature fusion model; extracting unlabeled information of different modalities from unlabeled video data, and processing the unlabeled information of the different modalities through the corresponding pre-trained models to extract unlabeled features; processing the unlabeled features through the trained feature fusion model to generate pseudo-labels, selecting the features whose pseudo-labels have higher confidence, and adding those features together with their pseudo-labels to the training set; and processing the labeled data and the selected pseudo-label data through the trained feature fusion model to obtain a final recognition result, and outputting the final recognition result. The information of the different modalities comprises image frames, and the step of extracting the labeled information of the different modalities from the labeled video data, or the unlabeled information of the different modalities from the unlabeled video data, comprises: randomly selecting one frame of the video as the input image of the current training round, and performing a random masking operation and a recovery operation on the single image, so as to capture the static features of the facial expression in each frame; and simultaneously taking a plurality of video frames as input and performing a random masking operation and a recovery operation to extract the dynamic features of the facial expression in the video. The information of the different modalities comprises text and image frames, and the step of extracting the labeled information of the different modalities from the labeled video data comprises: extracting cross features of the text and the image frames through the multimodal model CLIP. The labeled data and the selected pseudo-label data each comprise dynamic features, static features, text features, audio features, and text-image cross features, and the step of processing the labeled data and the selected pseudo-label data through the trained feature fusion model to obtain a final recognition result comprises: processing the dynamic features and the static features through a bilinear pooling model to obtain fused visual features; processing the text-image cross features and the text features through a bilinear pooling model to obtain fused visual-text features; processing the visual features, the visual-text features, and the audio features through an Attention model to obtain a fused final feature representation; and performing emotion classification on the final feature representation through a linear layer to obtain the final recognition result (an illustrative sketch of these fusion steps follows the claims). The bilinear pooling model performs bilinear projection and decomposition on the input features, lets the features of the different modalities interact, and generates a fused feature representation; it captures the nonlinear relations among the different modalities by taking the product of their features; and it maps the high-dimensional feature representation to a low-dimensional space by decomposing the input features.
- 2. The multimodal emotion recognition method as recited in claim 1, wherein, in the step of simultaneously taking a plurality of video frames as input and performing a random masking operation and a recovery operation to extract the dynamic features of facial expressions in the video, the random masking operation keeps the masked regions consistent between different frames (see the masking sketch after the claims).
- 3. The multimodal emotion recognition method as recited in claim 1, wherein the information of the different modalities includes audio; the step of extracting the labeled information of the different modalities from the labeled video data and processing it through the corresponding pre-trained model to extract labeled features comprises: extracting labeled audio from the labeled video data and converting it into a labeled Mel spectrogram; and processing the labeled Mel spectrogram through a HuBERT model to extract labeled audio features; and the step of extracting the unlabeled information of the different modalities from the unlabeled video data and processing it through the corresponding pre-trained model to extract unlabeled features comprises: extracting unlabeled audio from the unlabeled video data and converting it into an unlabeled Mel spectrogram; and processing the unlabeled Mel spectrogram through the HuBERT model to extract unlabeled audio features (see the audio sketch after the claims).
- 4. The multimodal emotion recognition method as recited in claim 1, wherein the information of the different modalities includes text; the step of extracting the labeled information of the different modalities from the labeled video data and processing it through the corresponding pre-trained model to extract labeled features comprises: extracting labeled text from the labeled video data; and processing the labeled text through a MacBERT model to extract labeled text features; and the step of extracting the unlabeled information of the different modalities from the unlabeled video data and processing it through the corresponding pre-trained model to extract unlabeled features comprises: extracting unlabeled text from the unlabeled video data; and processing the unlabeled text through the MacBERT model to extract unlabeled text features (see the text sketch after the claims).
- 5. A multimodal emotion recognition device, comprising: a labeled-feature extraction unit, configured to extract labeled information of different modalities from labeled video data and process the labeled information of the different modalities through corresponding pre-trained models to extract labeled features; a model training unit, configured to take the labeled features as the input of a feature fusion model and train the feature fusion model to obtain a trained feature fusion model; an unlabeled-feature extraction unit, configured to extract unlabeled information of different modalities from unlabeled video data and process the unlabeled information of the different modalities through the corresponding pre-trained models to extract unlabeled features; an unlabeled-feature selection unit, configured to process the unlabeled features through the trained feature fusion model to generate pseudo-labels, select the features whose pseudo-labels have higher confidence, and add those features together with their pseudo-labels to the training set; and a fusion recognition unit, configured to process the labeled data and the selected pseudo-label data through the trained feature fusion model to obtain a final recognition result and output the final recognition result. The information of the different modalities comprises image frames, and extracting the labeled information of the different modalities from the labeled video data, or the unlabeled information of the different modalities from the unlabeled video data, comprises: randomly selecting one frame of the video as the input image of the current training round, and performing a random masking operation and a recovery operation on the single image, so as to capture the static features of the facial expression in each frame; and simultaneously taking a plurality of video frames as input and performing a random masking operation and a recovery operation to extract the dynamic features of the facial expression in the video. The information of the different modalities comprises text and image frames, and extracting the labeled information of the different modalities from the labeled video data comprises: extracting cross features of the text and the image frames through the multimodal model CLIP. The labeled data and the selected pseudo-label data each comprise dynamic features, static features, text features, audio features, and text-image cross features, and processing the labeled data and the selected pseudo-label data through the trained feature fusion model to obtain a final recognition result comprises: processing the dynamic features and the static features through a bilinear pooling model to obtain fused visual features; processing the text-image cross features and the text features through a bilinear pooling model to obtain fused visual-text features; processing the visual features, the visual-text features, and the audio features through an Attention model to obtain a fused final feature representation; and performing emotion classification on the final feature representation through a linear layer to obtain the final recognition result. The bilinear pooling model performs bilinear projection and decomposition on the input features, lets the features of the different modalities interact, and generates a fused feature representation; it captures the nonlinear relations among the different modalities by taking the product of their features; and it maps the high-dimensional feature representation to a low-dimensional space by decomposing the input features.
- 6. A computer device comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the instructions, when executed, implement the multimodal emotion recognition method of any one of claims 1-4.
- 7. One or more readable storage media storing computer-readable instructions that, when executed by one or more processors, cause the one or more processors to perform the multimodal emotion recognition method of any one of claims 1-4.
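The following is a minimal PyTorch sketch of the bilinear pooling described in claims 1 and 5 (bilinear projection, a cross-modal product, and a decomposition into a low-dimensional space), followed by the attention fusion and linear classification head. The class names, feature dimensions, rank, head count, and number of emotion classes are assumptions for illustration, not values from the patent:

```python
import torch
import torch.nn as nn

class BilinearPooling(nn.Module):
    """Low-rank bilinear pooling: project both modality features (bilinear
    projection), multiply element-wise (the cross-modal product capturing a
    nonlinear relation), then sum over the rank dimension (decomposition
    into a low-dimensional fused representation)."""
    def __init__(self, dim_a, dim_b, fused_dim=256, rank=4):
        super().__init__()
        self.proj_a = nn.Linear(dim_a, fused_dim * rank)
        self.proj_b = nn.Linear(dim_b, fused_dim * rank)
        self.fused_dim, self.rank = fused_dim, rank

    def forward(self, a, b):
        joint = self.proj_a(a) * self.proj_b(b)                   # product interaction
        joint = joint.view(-1, self.fused_dim, self.rank).sum(2)  # map to low-dim space
        return joint

class FusionHead(nn.Module):
    """Attention over the three fused streams, then a linear layer for
    emotion classification, mirroring the last steps of claim 1."""
    def __init__(self, dim=256, num_classes=7, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, visual, visual_text, audio):
        tokens = torch.stack([visual, visual_text, audio], dim=1)  # (B, 3, dim)
        fused, _ = self.attn(tokens, tokens, tokens)               # cross-stream attention
        return self.classifier(fused.mean(dim=1))                  # emotion logits
```

Following the order of operations in claim 1, the dynamic/static pair and the CLIP-cross/text pair would each pass through a BilinearPooling instance before the three resulting streams enter the FusionHead.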
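Claim 2 requires the masked regions to stay consistent across frames. A sketch of that constraint on token-level masking, assuming patchified frames of shape (T, N, D); the mask ratio and function name are assumptions:

```python
import torch

def consistent_random_mask(frame_tokens: torch.Tensor, mask_ratio: float = 0.75):
    """frame_tokens: (T, N, D) patch tokens for T frames. One random mask is
    drawn once and reused for every frame, so the masked areas remain
    consistent between frames as claim 2 requires."""
    T, N, D = frame_tokens.shape
    num_keep = int(N * (1.0 - mask_ratio))
    keep_idx = torch.randperm(N)[:num_keep]   # same patch indices for all T frames
    visible = frame_tokens[:, keep_idx, :]    # visible tokens fed to the encoder
    return visible, keep_idx                  # keep_idx lets the recovery step restore positions
```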
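For the HuBERT audio features of claim 3, a sketch using a public Hugging Face checkpoint. Note that the public HuBERT checkpoints consume raw 16 kHz waveforms (the model computes its own convolutional front-end features); treating the claim's Mel-spectrogram step as part of a custom front end is our assumption, as is the checkpoint choice and the mean pooling:

```python
import torch
from transformers import AutoFeatureExtractor, HubertModel

extractor = AutoFeatureExtractor.from_pretrained("facebook/hubert-base-ls960")
hubert = HubertModel.from_pretrained("facebook/hubert-base-ls960")

def audio_features(waveform_16k: torch.Tensor) -> torch.Tensor:
    """waveform_16k: 1-D tensor holding a 16 kHz mono waveform."""
    inputs = extractor(waveform_16k.numpy(), sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        hidden = hubert(**inputs).last_hidden_state  # (1, frames, 768)
    return hidden.mean(dim=1)                        # pooled utterance-level feature
```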
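For the MacBERT text features of claim 4, a minimal sketch using the public hfl/chinese-macbert-base checkpoint; the checkpoint choice and the [CLS]-token pooling are assumptions:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
macbert = AutoModel.from_pretrained("hfl/chinese-macbert-base")

def text_features(sentence: str) -> torch.Tensor:
    inputs = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        out = macbert(**inputs)
    return out.last_hidden_state[:, 0, :]  # [CLS] token as sentence-level feature
```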
Description
Multi-modal emotion recognition method, device, equipment and computer storage medium

Technical Field

The invention relates to the technical field of intelligent recognition, in particular to a multimodal emotion recognition method, device, equipment, and computer storage medium.

Background

At present, emotion recognition technology has been widely studied and applied, mainly comprising text emotion analysis, speech emotion recognition, image emotion analysis, and the like.

Text emotion analysis refers to analyzing the emotional information contained in a text through computer technology so as to identify the emotional state of the text's author. The main techniques currently include emotion dictionary methods, machine learning, and deep learning. Deep learning can automatically extract features from text and classify it; because it takes contextual, grammatical, and semantic information into account, it performs better in emotion classification. However, deep learning models require large amounts of computing resources and data; the training process is complex; the quality and coverage of the dataset affect the accuracy of the analysis; and for text whose context is ambiguous or hard to interpret, accuracy degrades.

Speech emotion recognition refers to recognizing the speaker's emotional state by analyzing acoustic features in the speech signal, such as tone, pitch, speaking rate, and volume; these features, however, are influenced by factors such as individual differences between speakers and environmental noise. Researchers can automatically extract features from speech signals by modeling and classifying them with deep neural networks and, by training such models, recognize different emotional states. Since deep learning can take many aspects of the speech signal into account, it performs better in emotion recognition; likewise, limitations in dataset quality affect the performance and generalization ability of such models.

Image emotion analysis refers to identifying the emotional state expressed by an image by analyzing its visual features, such as color, texture, shape, and spatial structure. Conventional methods have low accuracy because these visual features are easily affected by illumination, viewing angle, and resolution. Modeling and classifying images with deep neural networks can automatically extract image features and, through training, recognize different emotional states; because deep learning considers many aspects of the image, it performs better in image analysis. In addition, deep-learning-based methods can apply techniques such as transfer learning and data augmentation to improve model generalization and data efficiency. However, this approach has limitations. First, emotional information tends to be implicit, so inferring emotion directly from an image is affected by subjective factors and uncertainty. Second, emotional expression is affected by context and individual differences, which increases the complexity of the analysis. Furthermore, deep learning models require large amounts of labeled emotion data during training, but acquiring accurate emotion labels is a difficult and time-consuming task.

In the academic literature there are papers discussing the challenges and limitations of image emotion analysis. For example, the work "Deep Visual-Semantic Alignments for Generating Image Descriptions" (Karpathy et al., 2015) states that accurately acquiring affective information from images requires in-depth analysis of the image's semantics and context. The authors note that affective information is often interrelated with the semantics and visual characteristics of images, so both local and global semantic information must be considered in the analysis. For example, a person's facial expression may give clues to their emotion, but the interpretation of that expression can change with the circumstances (e.g., whether they are attending a meeting or a funeral). This requires the model to understand the semantics and context of the image, which is a very challenging task.

In current emotion recognition technology, single-modality emotion recognition has some defects; for example, in speech emotion recognition, the characteristics of a voic