
CN-122024768-A - Bimodal emotion recognition method integrating middle layer characteristics and modal priority mechanism

CN 122024768 A

Abstract

The invention discloses a bimodal emotion recognition method that integrates middle-layer features and a modal priority mechanism. The method constructs a DMFF-Net network model comprising a single-modality feature extraction layer, a dynamic modal fusion layer and a dual-branch task layer. The method first extracts deep features of the voice and text modalities respectively; next, in the dynamic modal fusion layer, a dynamic modal priority mechanism adaptively estimates and weights the importance of each modality to guide the fusion direction; then, under this guidance, an intermediate-layer feature fusion mechanism performs cross-layer attention fusion over multi-level features to generate a unified fused emotion representation; finally, the model is optimized by jointly training an emotion classification branch and a self-contrastive learning branch. The method overcomes the defects of prior methods that ignore middle-layer information and rely on rigid, static modality fusion: it fully mines multi-level emotion cues, dynamically adjusts modality contributions, and markedly improves the accuracy, robustness and generalization of emotion recognition in complex scenes.
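
The abstract outlines a three-stage pipeline: single-modality feature extraction, dynamic modal fusion, and a dual-branch task layer. The minimal PyTorch skeleton below is only one illustrative reading of that top-level structure; the class names (DMFFNet and the encoder/fusion stubs), the pooled fusion output and the 256-dimensional embedding are assumptions, not details disclosed by the patent.

```python
# Illustrative top-level skeleton of the DMFF-Net pipeline from the abstract.
# Module names and dimensions are assumptions; the patent discloses no concrete
# class names or layer sizes.
import torch
import torch.nn as nn

class DMFFNet(nn.Module):
    def __init__(self, speech_enc: nn.Module, text_enc: nn.Module,
                 fusion: nn.Module, num_classes: int, dim: int = 256):
        super().__init__()
        self.speech_enc = speech_enc   # single-modality feature extraction (voice)
        self.text_enc = text_enc       # single-modality feature extraction (text)
        self.fusion = fusion           # dynamic modal fusion layer (assumed to return a pooled vector)
        self.classifier = nn.Linear(dim, num_classes)   # emotion classification branch
        self.projector = nn.Sequential(                 # self-contrastive learning branch
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, speech, text):
        s_feats = self.speech_enc(speech)       # deep voice-modality features
        t_feats = self.text_enc(text)           # deep text-modality features
        fused = self.fusion(s_feats, t_feats)   # priority-weighted fused emotion representation
        logits = self.classifier(fused)         # class scores (softmax is applied inside the loss)
        embedding = self.projector(fused)       # embedding used by the contrastive branch
        return logits, embedding
```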

Inventors

  • WEI WEI
  • WANG YIBING

Assignees

  • Dalian Minzu University (大连民族大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-22

Claims (7)

  1. A bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism, characterized by comprising the following steps: preprocessing a bimodal emotion data set, wherein the bimodal emotion data set comprises voice-modality data and text-modality data, and dividing the preprocessed data into a training set and a test set; constructing a bimodal emotion recognition DMFF-Net network model, wherein the DMFF-Net network model comprises, arranged in sequence, a single-modality feature extraction layer, a dynamic modal fusion layer and a dual-branch task layer, and wherein the single-modality feature extraction layer performs deep feature extraction on the voice-modality data and the text-modality data respectively; training the DMFF-Net network model with the training set; evaluating and optimizing the trained DMFF-Net network model with the test set; and inputting voice data to be recognized and the corresponding text data into the evaluated and optimized DMFF-Net network model, and outputting the corresponding emotion recognition result.
  2. The bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism according to claim 1, wherein the single-modality feature extraction layer comprises a voice feature encoding module and a text feature encoding module; the voice feature encoding module extracts features from the input voice-modality data using a pre-trained Wav2Vec2 model, and further models multi-scale temporal dynamics of the voice signal through a temporal convolutional encoder to generate a high-level semantic representation of the voice modality; the text feature encoding module semantically encodes the input text-modality data using a pre-trained RoBERTa model, and introduces a middle-layer feature fusion strategy that integrates, with learned weights, the outputs of several Transformer encoding layers of the RoBERTa model to generate the semantic representation of the text modality (an illustrative code sketch of this layer follows the claims).
  3. The bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism according to claim 2, wherein the dynamic modal fusion layer comprises a dynamic modal priority mechanism and a middle-layer feature fusion mechanism; the dynamic modal priority mechanism dynamically estimates the importance weights of the different modalities for the current sample, adaptively weights the voice-modality features and the text-modality features based on those weights, and guides the modal fusion process; and the middle-layer feature fusion mechanism, under this modal guidance, performs cross-layer attention fusion over the multi-level intermediate features to generate a unified fused emotion representation.
  4. The bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism according to claim 3, wherein the dynamic modal priority mechanism is realized by a dynamic modal perception module comprising a single-modality context modeling unit, a cross-modal attention interaction unit and a dynamic modal aggregation unit; each single-modality feature is fed into its own multi-head self-attention layer and feed-forward network layer, the multi-head self-attention mechanism modeling local context dependencies within the modality and the feed-forward network applying a non-linear feature transformation, to obtain context-enhanced single-modality features; the context-enhanced features are input to the cross-modal attention interaction unit, which applies a cross-modal attention mechanism to adaptively weight the modalities, wherein the query vectors are generated from a learnable modal prior vector and the keys and values come from the voice-modality features and the text-modality features respectively; and the dynamic modal aggregation unit performs a weighted fusion of the bimodal features processed by the cross-modal attention interaction unit, dynamically adjusting the contribution of each modality according to the characteristics of the sample to generate a semantically consistent fused representation (see the sketch of this module after the claims).
  5. The bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism according to claim 3, wherein the middle-layer feature fusion mechanism is realized by a middle-layer feature fusion module; the module aligns intermediate features from different levels so as to unify their feature dimensions and time steps, and integrates the multiple layers of intermediate features into a multi-scale intra-modality feature representation by feature splicing or linear mapping (see the corresponding sketch after the claims).
  6. The bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism according to claim 1, wherein the dual-branch task layer comprises an emotion classification branch and a self-contrastive learning branch; the emotion classification branch applies a fully connected classifier to the fused feature representation and outputs, through a Softmax activation function, a probability distribution over the emotion categories as the emotion recognition result; and the self-contrastive learning branch enhances the discriminability and modal consistency of the fused feature representation by constraining original samples and their augmented samples to be consistent in the feature space.
  7. The bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism according to claim 6, wherein the output of the emotion classification branch is optimized with a cross-entropy loss function, the self-contrastive learning branch is optimized with a contrastive learning loss function, and the parameters of the DMFF-Net network model are updated jointly by a gradient descent algorithm (a sketch of this joint objective follows the claims).
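
Claim 2 specifies a Wav2Vec2 backbone followed by a temporal convolutional encoder for the voice modality, and a RoBERTa backbone whose outputs from several Transformer layers are integrated with learned weights for the text modality. The sketch below assumes both backbones are loaded from Hugging Face transformers; the checkpoint names, the number of fused layers, the 1-D convolution settings and the 256-dimensional projection are illustrative choices, not values given in the patent.

```python
# Sketch of the single-modality feature extraction layer from claim 2.
# Checkpoint names and all dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, RobertaModel

class SpeechEncoder(nn.Module):
    """Wav2Vec2 backbone followed by a temporal convolutional encoder."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")
        hidden = self.backbone.config.hidden_size            # 768 for the base model
        self.tcn = nn.Sequential(                            # multi-scale temporal convolutions
            nn.Conv1d(hidden, dim, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=5, padding=2), nn.ReLU())

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        frames = self.backbone(input_values).last_hidden_state       # (B, T, hidden)
        return self.tcn(frames.transpose(1, 2)).transpose(1, 2)      # (B, T, dim)

class TextEncoder(nn.Module):
    """RoBERTa backbone with a learned weighted sum over its top encoder layers."""
    def __init__(self, dim: int = 256, fused_layers: int = 4):
        super().__init__()
        self.backbone = RobertaModel.from_pretrained("roberta-base")
        self.fused_layers = fused_layers
        self.layer_weights = nn.Parameter(torch.zeros(fused_layers))  # softmax-normalised below
        self.proj = nn.Linear(self.backbone.config.hidden_size, dim)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        out = self.backbone(input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        layers = torch.stack(out.hidden_states[-self.fused_layers:], dim=0)  # (L, B, T, hidden)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        mixed = (w * layers).sum(dim=0)                                      # (B, T, hidden)
        return self.proj(mixed)                                              # (B, T, dim)
```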
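
Claims 3 and 4 describe a dynamic modal perception module: per-modality self-attention plus a feed-forward network, a cross-modal attention whose query comes from a learnable modal prior vector while keys and values come from the voice and text features, and a sample-wise weighted aggregation of the two modalities. The sketch below is one plausible single-query realization of that description; the mean pooling, the scoring head and all dimensions are assumptions.

```python
# Sketch of the dynamic modal priority mechanism from claims 3 and 4.
# Pooling, the scoring head and dimensions are assumptions; the patent specifies
# only self-attention + FFN per modality, a learnable-prior cross-modal attention,
# and sample-wise weighted aggregation.
import torch
import torch.nn as nn

class DynamicModalPriority(nn.Module):
    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # single-modality context modeling: self-attention + feed-forward per modality
        self.speech_ctx = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        self.text_ctx = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
        # learnable modal prior vector acting as the cross-modal attention query
        self.prior = nn.Parameter(torch.randn(1, 1, dim))
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)   # per-modality importance score

    def forward(self, speech: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # speech: (B, T_s, dim), text: (B, T_t, dim)
        s = self.speech_ctx(speech).mean(dim=1)    # (B, dim) pooled voice context
        t = self.text_ctx(text).mean(dim=1)        # (B, dim) pooled text context
        modalities = torch.stack([s, t], dim=1)    # (B, 2, dim) keys/values
        query = self.prior.expand(speech.size(0), -1, -1)
        attended, _ = self.cross_attn(query, modalities, modalities)        # (B, 1, dim)
        # dynamic modal aggregation: sample-wise weights over the two modalities
        weights = torch.softmax(self.score(modalities).squeeze(-1), dim=1)  # (B, 2)
        fused = (weights.unsqueeze(-1) * modalities).sum(dim=1)             # (B, dim)
        return fused + attended.squeeze(1)
```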
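
Claim 5 states only that intermediate features from different levels are aligned to common feature dimensions and time steps and then combined by feature splicing or linear mapping. The sketch below assumes linear projections plus adaptive average pooling for the alignment step, and concatenation followed by a linear map for the integration; the number of time steps is an arbitrary illustrative value.

```python
# Sketch of the intermediate-layer feature fusion module from claim 5.
# Projection + adaptive pooling for alignment and concatenation + a linear map
# for integration are assumptions; the claim names only "feature splicing or
# linear mapping".
import torch
import torch.nn as nn
import torch.nn.functional as F

class MidLayerFusion(nn.Module):
    def __init__(self, layer_dims, dim: int = 256, steps: int = 64):
        super().__init__()
        self.steps = steps
        # one projection per intermediate layer to unify feature dimensions
        self.projs = nn.ModuleList([nn.Linear(d, dim) for d in layer_dims])
        self.merge = nn.Linear(dim * len(layer_dims), dim)

    def forward(self, layers) -> torch.Tensor:
        aligned = []
        for proj, feat in zip(self.projs, layers):           # feat: (B, T_i, d_i)
            feat = proj(feat).transpose(1, 2)                # (B, dim, T_i)
            feat = F.adaptive_avg_pool1d(feat, self.steps)   # unify the time steps
            aligned.append(feat.transpose(1, 2))             # (B, steps, dim)
        multi_scale = torch.cat(aligned, dim=-1)             # splice along the feature axis
        return self.merge(multi_scale)                       # (B, steps, dim)
```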
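
Claims 6 and 7 name a cross-entropy loss for the emotion classification branch and a contrastive loss that constrains an original sample and its augmented view to agree in feature space, optimized jointly by gradient descent. The InfoNCE-style cosine-similarity formulation, the temperature and the weighting factor lam below are assumptions; the patent does not give the exact contrastive objective.

```python
# Sketch of the dual-branch joint objective from claims 6 and 7.
# The contrastive formulation, temperature and lam are assumptions; the patent
# names only "cross entropy loss" and "contrastive learning loss" trained jointly.
import torch
import torch.nn.functional as F

def joint_loss(logits: torch.Tensor, labels: torch.Tensor,
               emb: torch.Tensor, emb_aug: torch.Tensor,
               temperature: float = 0.1, lam: float = 0.5) -> torch.Tensor:
    # emotion classification branch: softmax cross-entropy over class logits
    ce = F.cross_entropy(logits, labels)
    # self-contrastive branch: pull each sample towards its own augmented view,
    # push it away from the other samples in the batch
    z1 = F.normalize(emb, dim=-1)
    z2 = F.normalize(emb_aug, dim=-1)
    sim = z1 @ z2.t() / temperature                       # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)  # positives on the diagonal
    contrastive = F.cross_entropy(sim, targets)
    return ce + lam * contrastive
```

In joint training, the returned scalar would simply be backpropagated through both branches, e.g. `loss = joint_loss(...); loss.backward(); optimizer.step()`, which matches the claimed joint gradient-descent update of the DMFF-Net parameters.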

Description

Bimodal emotion recognition method integrating middle layer characteristics and modal priority mechanism

Technical Field

The invention belongs to the technical field of computer speech signal processing, relates to affective computing technology, and in particular relates to a bimodal emotion recognition method integrating middle-layer features and a modal priority mechanism.

Background

Bimodal emotion recognition (MER) is an important research direction in affective computing and emotional intelligence. Its core aim is to automatically analyze and recognize a speaker's emotional state by fusing voice, text and other multi-source information. Compared with single-modality methods, bimodal emotion recognition can jointly exploit the acoustic and prosodic information contained in speech and the semantic and contextual information contained in text, characterizing emotional expression more comprehensively, and it has broad application prospects in scenarios such as intelligent customer service, human-machine dialogue systems, sentiment analysis and behavior understanding. By improving a system's perception of emotional states, bimodal emotion recognition technology helps enhance the naturalness and intelligence of human-machine interaction.

In recent years, with the development of deep learning and pre-trained models, bimodal emotion recognition methods based on deep neural networks have made remarkable progress. The prior art generally adopts a pipeline of single-modality feature extraction followed by modality fusion: a pre-trained speech model and a pre-trained language model are used to extract voice features and text features respectively, and cross-modal fusion is then realized through feature concatenation, weighting or attention mechanisms to complete the emotion classification task. These methods improve emotion recognition performance to some extent, but still show clear shortcomings in complex dialogue scenarios.

On the one hand, most existing bimodal emotion recognition methods use only the output of the final layer of a pre-trained model as fusion input, ignoring the large amount of hierarchical emotion information contained in the model's intermediate layers. In fact, features at different levels emphasize different aspects of emotional expression: the middle layers of a speech model are often more sensitive to low-level acoustic cues such as pitch variation and speaking-rate fluctuation, while the higher layers lean towards semantic and emotional abstraction, and the middle layers of a text model likewise retain rich syntactic structure and sentiment-tendency information. Relying only on final-layer features for fusion easily loses key emotional cues and limits the expressive power of the bimodal emotion representation.

On the other hand, emotional expression shows clear context dependence and modal uncertainty, and the importance of different modalities varies dynamically across samples. For example, when the semantic expression is vague but the speech tone is strong, the voice modality tends to carry more emotional discriminative power, whereas when the speech is delivered flatly but the text semantics are clear, the text modality plays the dominant role.
However, most existing fusion methods adopt a static fusion strategy, which makes it difficult to dynamically adjust the weights of the different modalities according to the specific characteristics of an input sample, and easily harms the accuracy and robustness of emotion recognition under modality-information redundancy, noise interference or modality imbalance. In addition, as the structure of bimodal emotion recognition models grows more complex, models still tend to suffer from insufficient feature discriminability and reduced generalization when facing noise interference, missing modalities or shifts in data distribution in real dialogue environments. Although some studies have attempted to introduce contrastive learning or multi-task learning mechanisms to enhance model robustness, effectively improving feature discriminability during bimodal fusion while balancing the stability and adaptability of the model remains a major challenge for current technology.

In summary, existing bimodal emotion recognition technology is still limited by insufficient mining of middle-layer emotion information, difficulty in dynamically adjusting modality importance, and feature robustness that remains to be improved. Therefore, a bimodal emotion recognition technology capable of fully utilizing multi-level feature information and dynamically guiding modality fusion according to different environments is needed to improve ac