
CN-122024771-A - Voice emotion recognition method based on three-view feature decoupling and gating fusion

CN 122024771 A

Abstract

The invention discloses a speech emotion recognition method based on three-view feature decoupling and gated fusion. The method models the time-frequency structure of the spectrogram with a CNN, models the temporal dynamics of the MFCC sequence with a Bi-LSTM, and performs deep representation learning on the raw speech waveform with a HuBERT model adaptively trained by a partial-unfreezing fine-tuning strategy, obtaining an utterance-level feature representation for each view. Each utterance-level representation is decoupled into a view-independent and a view-specific feature representation by a parameter-shared view-independent encoder and a view-specific encoder, with a difference loss constraint reducing subspace redundancy. The view-independent representations are fused to generate a global emotion consensus, an adaptive gating mechanism dynamically adjusts the fusion proportion between the consensus information and the view-specific cues to obtain highly discriminative primary fusion features, and the concatenated primary fusion features are fed into a classifier for emotion classification. The invention significantly improves the accuracy and robustness of speech emotion recognition.
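
For orientation only, the following PyTorch-style sketch shows one plausible way the decoupling, consensus, gating and classification stages summarized above could be wired together, assuming each of the three view encoders already yields a fixed-dimension utterance-level vector. All module names, dimensions and the exact gating formulation are assumptions inferred from this abstract and the claims, not the patented implementation.

```python
import torch
import torch.nn as nn

class ThreeViewFusionHead(nn.Module):
    """Hypothetical sketch of the decoupling / consensus / gating stages.

    Assumes the CNN, Bi-LSTM and HuBERT branches each already produce an
    utterance-level feature vector of dimension `dim`.
    """
    def __init__(self, dim=256, num_classes=4, num_views=3):
        super().__init__()
        # Parameter-shared view-independent encoder (one for all three views).
        self.shared_enc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        # One view-specific (private) encoder per view.
        self.private_enc = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_views)]
        )
        # MLP fusing the concatenated view-independent features into a
        # global emotion consensus representation.
        self.consensus_mlp = nn.Sequential(
            nn.Linear(num_views * dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )
        # One sigmoid gate per view, conditioned on consensus + view-specific cue.
        self.gate = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(num_views)]
        )
        # Classifier over the concatenated primary fusion features.
        self.classifier = nn.Linear(num_views * dim, num_classes)

    def forward(self, view_feats):
        # view_feats: list of three tensors, each of shape (batch, dim)
        common = [self.shared_enc(h) for h in view_feats]            # view-independent
        private = [enc(h) for enc, h in zip(self.private_enc, view_feats)]
        consensus = self.consensus_mlp(torch.cat(common, dim=-1))
        fused = []
        for i, s in enumerate(private):
            g = torch.sigmoid(self.gate[i](torch.cat([consensus, s], dim=-1)))
            # Gate interpolates between global consensus and view-specific cues.
            fused.append(g * consensus + (1.0 - g) * s)
        logits = self.classifier(torch.cat(fused, dim=-1))
        return logits, common, private
```

The returned `common` and `private` lists would feed a difference loss of the kind described in claim 7; a hedged sketch of that loss follows the claims below.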

Inventors

  • Sang Jinqiu
  • Huang Bing
  • Liang Jian
  • Jiang Xianquan
  • Xie Xinni

Assignees

  • East China Normal University
  • Boyin Hearing Technology (Shanghai) Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2026-03-18

Claims (7)

  1. A speech emotion recognition method based on three-view feature decoupling and gated fusion, characterized by comprising the following steps: Step 1, feature extraction and encoding: model the time-frequency structure of the spectrogram with a CNN, model the temporal dynamics of the MFCC sequence with a Bi-LSTM, and perform deep representation learning on the raw speech waveform with a HuBERT model adaptively trained by a partial-unfreezing fine-tuning strategy, obtaining an utterance-level feature representation for each of the three views; Step 2, feature decoupling: for the utterance-level feature representation of each view, project it through a parameter-shared view-independent encoder into a view-independent subspace to obtain a view-independent feature representation, and through a view-specific encoder to obtain a view-specific feature representation; Step 3, cross-view consensus learning: concatenate the view-independent feature representations of the three views and fuse them nonlinearly through a multi-layer perceptron to generate a global emotion consensus representation; Step 4, adaptive gated fusion: for each view, adaptively compute a gating weight from the global emotion consensus representation and that view's view-specific feature representation, and perform adaptive weighted fusion of the two according to the gating weight to obtain the view's primary fusion feature representation; Step 5, emotion classification: concatenate the primary fusion feature representations of the three views, feed the result into a fully connected classifier, and output the emotion category prediction for the speech signal; Step 6, model training: in the training stage, jointly optimize the emotion classification loss and the difference loss, adjusting the model parameters to obtain accurate emotion classification predictions.
  2. The method of claim 1, wherein the utterance-level feature representations of the three views in step 1 are: h_spec = CNN(x_spec; θ_CNN); h_mfcc = BiLSTM(x_mfcc; θ_BiLSTM); h_wav = HuBERT(x_wav; θ_HuBERT); wherein x_spec, x_mfcc and x_wav are respectively the spectrogram, the MFCC sequence and the raw speech waveform; h_spec, h_mfcc and h_wav are the utterance-level feature representations of the spectrogram, MFCC and raw-waveform views; and θ_CNN, θ_BiLSTM and θ_HuBERT are the parameters of the CNN, Bi-LSTM and HuBERT models respectively.
  3. The speech emotion recognition method according to claim 1, wherein the adaptive training with the partial-unfreezing fine-tuning strategy in step 1 comprises the steps of: feeding the raw speech waveform into a pre-trained HuBERT model and extracting the hidden states of the Transformer encoder layers; aggregating the frame-level hidden states through a pooling operation to obtain the fixed-dimension utterance-level feature representation h_wav; and performing partial-unfreezing fine-tuning by keeping the parameters of the HuBERT convolutional feature extractor and the lower Transformer encoder layers frozen while unfreezing only the last K Transformer encoder layers for training, wherein K is an integer from 1 to 11 and is positively correlated with the sample complexity of the target dataset and the number of speakers.
  4. The speech emotion recognition method according to claim 1, wherein the view-independent feature representation in step 2 is specifically c_v = E_c(h_v; θ_c), and the view-specific feature representation is specifically s_v = E_s^(v)(h_v; θ_s^(v)); wherein c_v denotes the view-independent feature representation, s_v denotes the view-specific feature representation, E_c denotes the view-independent encoder whose parameters θ_c are shared across the three views, E_s^(v) denotes the view-specific encoder of view v with corresponding parameters θ_s^(v), h_v is the utterance-level feature representation of view v, and v ∈ {spec, mfcc, wav}.
  5. The speech emotion recognition method according to claim 1, wherein the global emotion consensus representation in step 3 is specifically z = MLP([c_spec; c_mfcc; c_wav]; θ_MLP); wherein c_spec, c_mfcc and c_wav denote the view-independent feature representations of the spectrogram view, the MFCC view and the raw-waveform view respectively, MLP denotes the multi-layer perceptron used to generate the global emotion consensus representation, and θ_MLP denotes its corresponding parameters.
  6. The method for recognizing speech emotion according to claim 1, wherein the adaptively computed gating weight in step 4 is specifically g_v = σ(G([z; s_v]; θ_G)), wherein G is the gating mapping function, σ is the Sigmoid activation function, and θ_G are the corresponding gating parameters; the primary fusion feature representation of the view is specifically f_v = g_v ⊙ z + (1 − g_v) ⊙ s_v, wherein ⊙ denotes element-wise multiplication, z is the global emotion consensus representation, s_v is the view-specific feature representation, and the gating weight g_v reflects the degree to which the view relies on the consensus information versus the view-specific supplementary information; each element of the gating weight g_v lies in the range (0, 1), and for all samples of the same view the mean gating weight reflects that view's dependence on the global consensus information.
  7. The method of claim 1, wherein the difference loss in step 6, denoted L_diff, is calculated as L_diff = Σ_v ||C̄_v^T S̄_v||_F² + Σ_(v≠v′) ||S̄_v^T S̄_v′||_F², wherein the first summation runs over the single views v and the second summation runs over the view pairs (v, v′); C̄_v and S̄_v respectively denote the view-independent and view-specific feature representation matrices of view v after zero-mean and L2 normalization, and ||·||_F denotes the Frobenius norm.
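
Claims 3 and 7 carry the two formula-heavy details: the partial-unfreezing fine-tuning of HuBERT and the Frobenius-norm difference loss. The sketch below is a minimal, hedged reading of both, assuming a HuggingFace-style HubertModel whose Transformer layers are exposed as encoder.layers, and assuming (as in common shared/private decoupling schemes) that the pairwise term of the difference loss is taken between the view-specific matrices of different views; neither assumption is spelled out verbatim in the claims as reproduced here.

```python
import torch
import torch.nn.functional as F


def freeze_hubert_except_last_k(hubert_model, k=2):
    """Partial-unfreezing strategy in the spirit of claim 3 (hypothetical).

    Keeps the convolutional feature extractor and lower Transformer encoder
    layers frozen and unfreezes only the last `k` encoder layers; claim 3
    states k is an integer in [1, 11], chosen larger for more complex
    datasets with more speakers.
    """
    for p in hubert_model.parameters():
        p.requires_grad = False
    for layer in hubert_model.encoder.layers[-k:]:   # assumed attribute path
        for p in layer.parameters():
            p.requires_grad = True


def difference_loss(common_list, private_list):
    """One plausible reading of the claim-7 difference loss.

    First summation: per view, squared Frobenius norm of C_v^T S_v.
    Second summation: per view pair, squared Frobenius norm of S_v^T S_v'.
    All matrices are zero-meaned and row-wise L2-normalised first.
    """
    def prep(m):
        m = m - m.mean(dim=0, keepdim=True)       # zero-mean each feature column
        return F.normalize(m, dim=1)              # L2-normalise each sample row

    cs = [prep(c) for c in common_list]
    ss = [prep(s) for s in private_list]
    loss = sum(torch.norm(c.t() @ s, p="fro") ** 2 for c, s in zip(cs, ss))
    for i in range(len(ss)):
        for j in range(i + 1, len(ss)):
            loss = loss + torch.norm(ss[i].t() @ ss[j], p="fro") ** 2
    return loss
```

During training (claim 1, step 6) this term would simply be added to the emotion classification loss with a trade-off weight, e.g. `total = ce_loss + lam * difference_loss(common, private)`, where `lam` is a hypothetical hyperparameter not specified in the claims.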

Description

Speech emotion recognition method based on three-view feature decoupling and gated fusion

Technical Field

The invention relates to the technical field of speech signal processing and affective computing, and in particular to a speech emotion recognition method based on three-view feature decoupling and gated fusion.

Background

Speech emotion recognition aims to automatically recognize the emotional state of a speaker by analyzing acoustic features in the speech signal, and has important application value in fields such as human-computer interaction and mental health monitoring. Emotional information in speech is conveyed through multi-dimensional acoustic cues such as prosody, spectrum and temporal dynamics, distributed across different representation levels, and effectively modeling these heterogeneous characteristics is a core challenge in speech emotion recognition.

Traditional speech emotion recognition methods rely mainly on manually designed acoustic features, such as MFCCs and spectrograms, combined with machine learning algorithms such as Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) for classification. With the development of deep learning, CNNs and RNNs have been used for end-to-end emotion representation learning, alleviating the dependence on hand-crafted features. Subsequent studies further enhanced context modeling through graph structures and temporal strategies: for example, TLGCNN and SKIPGCNGAT capture long-range dependencies with graph structures, and TIM-Net models multi-scale temporal dynamics to obtain richer contextual representations. In recent years, self-supervised pre-trained models such as wav2vec 2.0, HuBERT and WavLM, which learn contextual speech representations from large-scale unlabeled corpora, have provided stronger semantic abstraction for downstream emotion recognition. However, most of the above methods operate on a single acoustic representation and rely on implicit modeling within that view, leaving structured cross-view complementarity under-exploited.

To overcome the limitations of single-view modeling, some studies have explored multi-view fusion strategies: AMSNet integrates frame-level hand-crafted features and utterance-level deep features through multi-scale attention, CA-MSER introduces a co-attention mechanism for hierarchical fusion of hand-crafted and self-supervised features, and SMW_CAT adopts a cross-attention Transformer architecture for progressive multi-view fusion. Although these approaches achieve some performance gains on standard datasets, they rely primarily on attention mechanisms for implicit feature alignment, which easily leads to redundancy accumulation and unstable collaboration, and dominant representations may suppress complementary information. Explicitly decoupling shared and private information has been shown to improve generalization in multi-modal learning, but such structured decoupling has not been fully explored in speech emotion recognition. Given the diverse statistical properties and semantic granularities of different acoustic representations, direct fusion without explicit decoupling may fail to fully exploit cross-view complementarity.
In summary, the prior art has the following technical defects: different acoustic features attend to different aspects of emotion modeling, so direct fusion causes repeated computation of common information and mutual interference of view-specific information; the explicit distinction between common emotion semantics and view-specific expression cues in multi-view features is lacking; and the complementarity between views is difficult to exploit fully. A technical solution for speech emotion recognition is therefore needed that explicitly decouples the common and view-specific information in multi-view features and adaptively fuses the global consensus with view-specific cues.

Disclosure of Invention

The invention aims to provide a speech emotion recognition method based on three-view feature decoupling and gated fusion, in which three-view end-to-end feature extraction and adaptation learning are performed on the raw speech input through differentiated encoding networks, view-independent emotion semantics and view-specific acoustic characteristics are characterized through an explicit decoupling mechanism, and a cross-view consensus learning and adaptive gated fusion mechanism is introduced to obtain a more robust and more discriminative emotion representation. In order to achieve the above purpose, the present invention provides the following technical solution: a speech emotion recognition method based on three-view feature decoupling and gated fusion, comprising the following steps: Step 1, feature extraction and encoding: modeling a sp