CN-119150216-B - Multimodal emotion recognition method based on Transformer intra-modal perception and inter-modal cross fusion
Abstract
The invention discloses a multi-modal emotion recognition method based on Transformer intra-modal perception and inter-modal cross fusion. First, speech and text are encoded to extract deep features. Then, a Transformer-based intra-modal perception module captures long-distance dependencies within each modality, realizes local perception learning of emotion features, and reduces redundant information in the deep features. Next, to fully exploit the complementarity of different modalities and to fuse unaligned multi-modal sequence information, a Transformer-based inter-modal interaction fusion module captures the dependencies among the modalities and yields fused multi-modal global information. Finally, ablation experiments verify the effectiveness of the method. The invention realizes effective parallel computation for multi-modal emotion recognition and further improves the recognition performance and generalization capability of the multi-modal emotion recognition system.
Inventors
- Sun Linhui
- Su Jiqi
- Wang Jing
- Li Pingan
- Ye Lei
Assignees
- Nanjing University of Posts and Telecommunications (南京邮电大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-07-16
Claims (7)
- 1. A multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion, characterized by comprising the following steps: Step 1, shallow feature extraction is performed separately on the originally input speech, text and video modalities; Step 1-1, each piece of text data is passed through a BERT sub-network to obtain a 1024-dimensional utterance-level text feature vector; Step 1-2, features are extracted from each piece of speech data by the openSMILE toolkit with the IS13 ComParE configuration to obtain a 130-dimensional utterance-level speech feature vector; Step 1-3, each piece of video data is passed through a DenseFace sub-network to obtain a 342-dimensional utterance-level video feature vector; Step 2, the extracted shallow features are input into a depth coding network to extract deep features; Step 3, the depth-coded speech, text and video features are input into a tri-modal Mamba interaction module, and the main-modality enhancement features after the first-stage complementary enhancement are obtained through interactive fusion among the different modalities; Step 4, the main-modality enhancement features from the first stage are input into a tri-modal Mamba interaction module, and the main-modality enhancement features after the second-stage complementary enhancement are obtained through interactive fusion among the different modalities; Step 5, the main-modality enhancement features from the second stage are input into a tri-modal Mamba interaction module, and the main-modality enhancement features after the third-stage complementary enhancement are obtained through interactive fusion among the different modalities; Step 6, the depth-coded speech, text and video features are combined with the main-modality enhancement features from the first, second and third stages, the final multi-modal emotion feature is obtained through a cascade hierarchical fusion mechanism, and this feature is then input into an emotion classifier for emotion prediction; Step 7, performance evaluation is carried out on the proposed multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion. (An illustrative code sketch of steps 1-6 follows the claims.)
- 2. The multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion according to claim 1, wherein step 2 specifically includes: Step 2-1, constructing a unidirectional LSTM network whose input-layer dimension is 1024 and hidden-layer dimension is 128, with 2 LSTM layers and a dropout rate of 0.3, and inputting the 1024-dimensional utterance-level text feature vectors into this unidirectional LSTM network to extract high-level text features, obtaining 128-dimensional deep text features; Step 2-2, constructing a bidirectional LSTM network whose input-layer dimension is 130 and hidden-layer dimension is 64, with 3 LSTM layers and a dropout rate of 0.3, and inputting the 130-dimensional utterance-level speech feature vectors into this bidirectional LSTM network to extract high-level speech features, obtaining 128-dimensional deep speech features; Step 2-3, constructing a unidirectional LSTM network whose input-layer dimension is 342 and hidden-layer dimension is 128, with 3 LSTM layers and a dropout rate of 0.3, and inputting the 342-dimensional utterance-level video feature vectors into this unidirectional LSTM network to extract high-level video features, obtaining 128-dimensional deep video features.
- 3. The multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion according to claim 1, wherein step 3 specifically includes: Step 3-1, taking speech as the main modality and text and video as auxiliary modalities, inputting the depth-coded speech, text and video features into a tri-modal Mamba interaction module to obtain the speech enhancement feature after the first-stage complementary enhancement, computed from a multi-layer perceptron, a linear layer, gating vectors, the long-range dependency features of the speech, video and text modalities, the video-audio and text-audio interaction features when the main modality is speech, and element-wise multiplication, with an index denoting the current complementary-enhancement stage; Step 3-2, taking text as the main modality and video and speech as auxiliary modalities, inputting the depth-coded speech, text and video features into a tri-modal Mamba interaction module to obtain the text enhancement feature after the first-stage complementary enhancement, where the computation involves the audio-text and video-text interaction features when the main modality is text; Step 3-3, taking video as the main modality and text and speech as auxiliary modalities, inputting the depth-coded speech, text and video features into a tri-modal Mamba interaction module to obtain the video enhancement feature after the first-stage complementary enhancement, where the computation involves the audio-video and text-video interaction features when the main modality is video.
- 4. The multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion according to claim 1, wherein step 4 specifically includes: Step 4-1, taking speech as the main modality and text and video as auxiliary modalities, inputting the speech, text and video features after the first-stage complementary enhancement into a tri-modal Mamba interaction module to obtain the speech enhancement feature after the second-stage complementary enhancement; Step 4-2, taking text as the main modality and video and speech as auxiliary modalities, inputting the speech, text and video features after the first-stage complementary enhancement into a tri-modal Mamba interaction module to obtain the text enhancement feature after the second-stage complementary enhancement; Step 4-3, taking video as the main modality and text and speech as auxiliary modalities, inputting the speech, text and video features after the first-stage complementary enhancement into a tri-modal Mamba interaction module to obtain the video enhancement feature after the second-stage complementary enhancement.
- 5. The multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion according to claim 1, wherein step 5 specifically includes: Step 5-1, taking speech as the main modality and text and video as auxiliary modalities, inputting the speech, text and video features after the second-stage complementary enhancement into a tri-modal Mamba interaction module to obtain the speech enhancement feature after the third-stage complementary enhancement; Step 5-2, taking text as the main modality and video and speech as auxiliary modalities, inputting the speech, text and video features after the second-stage complementary enhancement into a tri-modal Mamba interaction module to obtain the text enhancement feature after the third-stage complementary enhancement; Step 5-3, taking video as the main modality and text and speech as auxiliary modalities, inputting the speech, text and video features after the second-stage complementary enhancement into a tri-modal Mamba interaction module to obtain the video enhancement feature after the third-stage complementary enhancement.
- 6. The multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion according to claim 1, wherein step 6 specifically includes: Step 6-1, fusing the depth-coded speech features, text features and video features to obtain the multi-modal initial feature; Step 6-2, aggregating the text enhancement features from the first, second and third stages and obtaining the text aggregation feature through cascade fusion, where the stage index denotes the enhancement stage and the corresponding symbol denotes the text feature after that stage's enhancement; Step 6-3, aggregating the speech enhancement features from the first, second and third stages and obtaining the speech aggregation feature through cascade fusion, where the corresponding symbol denotes the speech feature after that stage's enhancement; Step 6-4, aggregating the video enhancement features from the first, second and third stages and obtaining the video aggregation feature through cascade fusion, where the corresponding symbol denotes the video feature after that stage's enhancement; Step 6-5, fusing the text aggregation feature, the speech aggregation feature and the video aggregation feature to obtain the multi-modal aggregation feature; Step 6-6, fusing the multi-modal initial feature and the multi-modal aggregation feature to obtain the final multi-modal joint feature, which is then input into an emotion classifier for emotion prediction.
- 7. The multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion according to claim 1, wherein, in step 7, the specific method for evaluating the performance of the proposed method comprises the following steps: Step 7-1, conducting comparison experiments against current mainstream multi-modal emotion recognition methods to verify the performance and efficiency of the proposed method; Step 7-2, comparing and analyzing the contribution of each module of the multi-modal emotion recognition method based on tri-modal Mamba interaction and cascade hierarchical fusion.
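The pipeline in claims 1-6 amounts to LSTM depth encoders, three stacked tri-modal interaction stages and a cascade hierarchical fusion head with a classifier. The PyTorch sketch below follows the dimensions stated in claim 2 but fills in every undisclosed detail with assumptions: the shallow features are treated as sequences, simple linear projections stand in for the Mamba blocks, the gating layout, aggregation and fusion operators are illustrative, and all class names (DepthEncoder, TriModalInteraction, CascadeFusionModel) are hypothetical. It is a minimal sketch, not the patented implementation.

```python
# Hedged sketch of claims 1-6: depth encoders (claim 2), three tri-modal
# interaction stages (claims 3-5), cascade fusion and classifier (claim 6).
import torch
import torch.nn as nn


class DepthEncoder(nn.Module):
    """LSTM depth-coding network mapping a shallow feature sequence to a 128-d vector."""

    def __init__(self, in_dim, hidden, layers, bidirectional=False):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, num_layers=layers, dropout=0.3,
                            batch_first=True, bidirectional=bidirectional)

    def forward(self, x):                          # x: (batch, seq_len, in_dim)
        out, _ = self.lstm(x)
        return out[:, -1]                          # last time step -> (batch, 128)


class TriModalInteraction(nn.Module):
    """One complementary-enhancement stage: the main modality is enhanced by
    gated interaction features from the two auxiliary modalities (assumed layout)."""

    def __init__(self, dim=128):
        super().__init__()
        # linear projections stand in for Mamba blocks capturing long-range
        # dependencies; a real system could drop in an actual Mamba layer here
        self.proj_main = nn.Linear(dim, dim)
        self.proj_aux = nn.Linear(dim, dim)
        self.gate = nn.Linear(3 * dim, 2 * dim)    # produces two gating vectors
        self.mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, main, aux1, aux2):
        h_main = self.proj_main(main)
        h1, h2 = self.proj_aux(aux1), self.proj_aux(aux2)
        g1, g2 = self.gate(torch.cat([h_main, h1, h2], -1)).sigmoid().chunk(2, -1)
        inter1, inter2 = g1 * h1, g2 * h2          # auxiliary-to-main interaction features
        return self.mlp(torch.cat([h_main, inter1, inter2], -1))


class CascadeFusionModel(nn.Module):
    """End-to-end model: encoders, three interaction stages, cascade fusion, classifier."""

    def __init__(self, num_classes=4, dim=128):
        super().__init__()
        self.enc_text = DepthEncoder(1024, 128, 2)                     # claim 2-1
        self.enc_audio = DepthEncoder(130, 64, 3, bidirectional=True)  # claim 2-2 (2x64=128)
        self.enc_video = DepthEncoder(342, 128, 3)                     # claim 2-3
        self.stages = nn.ModuleList(
            [nn.ModuleDict({m: TriModalInteraction(dim) for m in ("a", "t", "v")})
             for _ in range(3)])
        self.classifier = nn.Sequential(nn.Linear(7 * dim, dim), nn.ReLU(),
                                        nn.Linear(dim, num_classes))

    def forward(self, text, audio, video):
        a, t, v = self.enc_audio(audio), self.enc_text(text), self.enc_video(video)
        initial = [a, t, v]                        # depth-coded features (claim 6-1)
        history = {"a": [], "t": [], "v": []}
        for stage in self.stages:                  # three enhancement stages
            a, t, v = stage["a"](a, t, v), stage["t"](t, v, a), stage["v"](v, t, a)
            history["a"].append(a); history["t"].append(t); history["v"].append(v)
        # per-modality aggregation over the three stages (claims 6-2 to 6-4);
        # a simple mean is used here since the exact cascade operator is not disclosed
        agg = [torch.stack(history[m]).mean(0) for m in ("a", "t", "v")]
        joint = torch.cat(initial + agg + [sum(agg)], -1)   # 7 * dim joint feature
        return self.classifier(joint)


if __name__ == "__main__":
    model = CascadeFusionModel(num_classes=4)
    logits = model(torch.randn(8, 24, 1024),   # BERT text features
                   torch.randn(8, 24, 130),    # openSMILE IS13 ComParE features
                   torch.randn(8, 24, 342))    # DenseFace video features
    print(logits.shape)                        # torch.Size([8, 4])
```

Each stage re-uses the previous stage's outputs for all three main-modality roles, matching the staged complementary enhancement of claims 3-5; only the aggregation and concatenation choices at the end are stand-ins for the cascade hierarchical fusion mechanism of claim 6.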
Description
Multimodal emotion recognition method based on Transformer intra-modal perception and inter-modal cross fusion
Technical Field
The invention belongs to the technical field of multi-modal emotion recognition, and particularly relates to a multi-modal emotion recognition method based on Transformer intra-modal perception and inter-modal cross fusion.
Background
Emotion recognition is an important component of the field of artificial intelligence. Its main goal is to explore how to perform deep analysis on input data through a series of mathematical processing methods, so that a computer can accurately capture a human being's emotional state. By constructing such emotion recognition systems, people can hope to create a natural and unobstructed human-computer interaction environment. Emotion recognition can generally be classified into direct and indirect emotion recognition. Direct emotion recognition mainly uses multi-modal information such as text, speech, images and video. Indirect emotion recognition mainly relies on monitoring human physiological responses, including eye movement signals, limb movement signals, electroencephalogram signals, electrocardiogram signals, and the like. Changes in a person's emotional state are usually accompanied by changes in several of these information channels. Information from different modalities is often highly correlated and jointly contributes to human emotion recognition. Therefore, multi-modal emotion recognition research, which integrates emotion information from multiple modalities for affective computing, is of great significance.
In the early days of the emotion recognition field, researchers mostly adopted single-modality emotion recognition techniques to recognize human emotion. Single-modality emotion recognition refers to recognizing and understanding emotion using only a single type of data source, typically speech, text, or images. Since emotion is expressed in a variety of ways, humans perceive the emotion or intent of others by integrating facial expressions, speech, and other information. Therefore, emotion recognition based on multi-modal information is receiving more and more attention. In the early stages of multi-modal emotion recognition research, researchers mostly adopted traditional machine learning models for feature extraction, such as the Hidden Markov Model (HMM) and the Gaussian Mixture Model (GMM). However, these models can only exploit limited emotional context information and cannot fully utilize the facts that human emotion changes slowly and depends strongly on emotional context. In recent years, with the rapid development of deep learning algorithms, emotion recognition technology based on deep learning has shown new vitality. Deep learning enables researchers to extract complex modal information and subtle differences from multi-modal data, thereby facilitating a deeper understanding of complex emotional expressions. In 2021, Cao et al. constructed a multi-modal emotion recognition system using the stacked network HNSD to better capture highly emotion-discriminative characteristics. In 2023, Xie et al. proposed a multi-modal emotion recognition method based on multi-task learning and an attention mechanism, and obtained emotion recognition rates of 85.36% and 84.61% on the CMU-MOSI and CMU-MOSEI databases, respectively.
In 2024, Li et al. proposed a multi-modal shared network with cross-modal constraints to achieve continuous emotion recognition tasks. Speech and text are important expression forms of emotional information in daily human life and provide key discriminative information for a multi-modal emotion recognition system. Given the heterogeneity of speech and text information in practice, how effectively a multi-modal emotion recognition technique can acquire the key emotional and complementary information in speech and text data, and how effectively it can fuse the extracted speech and text information, determine the performance of the multi-modal emotion recognition system.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the invention provides a multi-modal emotion recognition method based on Transformer intra-modal perception and inter-modal cross fusion. It captures long-distance dependencies within each modality by introducing a Transformer-based intra-modal perception module, realizes local perception learning of emotion features, and reduces redundant information in the deep features; it captures information dependencies among different modalities by introducing a Transformer-based inter-modal cross fusion module, obtains fused multi-modal global information, and fully exploits the complementarity of the information from different modalities. The method realizes effective parallel computation for multi-modal emotion recognition and further improves the recognition performance and generalization capability of the multi-modal emotion recognition system.
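As a rough illustration of the two Transformer-based modules named above, the sketch below pairs a self-attention encoder (intra-modal perception) with a cross-attention layer (inter-modal cross fusion) over unaligned speech and text sequences. The layer counts, head counts, pooling step and the names IntraModalPerception, InterModalCrossFusion and fuse_speech_text are all illustrative assumptions; the patent text does not disclose the exact structure.

```python
# Hedged sketch: self-attention for intra-modal perception and cross-attention
# for inter-modal cross fusion of unaligned speech and text sequences.
import torch
import torch.nn as nn


class IntraModalPerception(nn.Module):
    """Self-attention encoder capturing long-range dependencies within one modality."""

    def __init__(self, dim=128, heads=4, layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, x):                            # x: (batch, seq_len, dim)
        return self.encoder(x)


class InterModalCrossFusion(nn.Module):
    """Cross-attention: the target modality queries the source modality, so the
    two unaligned sequences can exchange complementary information."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target, source):
        fused, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + fused)             # residual connection


def fuse_speech_text(speech, text):
    """Toy end-to-end pass: intra-modal perception, bidirectional cross fusion,
    then mean pooling and concatenation into a joint emotion representation."""
    dim = speech.size(-1)
    percept_s, percept_t = IntraModalPerception(dim), IntraModalPerception(dim)
    cross_st, cross_ts = InterModalCrossFusion(dim), InterModalCrossFusion(dim)
    s, t = percept_s(speech), percept_t(text)
    s2t, t2s = cross_st(t, s), cross_ts(s, t)        # each modality attends to the other
    return torch.cat([s2t.mean(dim=1), t2s.mean(dim=1)], dim=-1)


if __name__ == "__main__":
    joint = fuse_speech_text(torch.randn(2, 60, 128), torch.randn(2, 20, 128))
    print(joint.shape)                               # torch.Size([2, 256])
```

Because both the self-attention and cross-attention operations process whole sequences at once, this arrangement also reflects the parallel computation property claimed for the method.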