CN-121415815-B - Cross-granularity weight multiplexing and heterogeneous multitasking voice emotion recognition method

CN121415815B

Abstract

The invention discloses a cross-granularity weight multiplexing and heterogeneous multitasking voice emotion recognition method in the technical field of voice signal processing and artificial intelligence. The method constructs a nested mutually exclusive data partitioning system that divides the training set into a teacher training set and a target inference set. Based on the assumption of additive linear semantic consistency of emotion features, the sentence-level classification weights of the teacher model are multiplexed for frame-by-frame inference on the target inference set to generate frame-level hard pseudo tags. A multi-task student network with a heterogeneous feature extractor is then constructed, comprising parallel sentence-level main task and frame-level auxiliary task branches, and the student network is trained cooperatively through a joint loss function using the sentence-level real tags and the frame-level pseudo tags. Without requiring any manual frame-level labeling, the invention thereby constructs and exploits fine-grained emotion supervision signals, effectively improving the model's ability to capture local emotion changes in speech and its overall recognition performance.

Inventors

  • Zhou Haojie
  • Liang Shunfei
  • Wang Ning
  • Yang Zheqi
  • Tian Pingan
  • Pei Feng
  • Guo Wei

Assignees

  • 江南大学 (Jiangnan University)
  • 江苏磐智数云科技有限公司 (Jiangsu Panzhi Shuyun Technology Co., Ltd.)

Dates

Publication Date
2026-05-08
Application Date
2025-12-30

Claims (10)

  1. A cross-granularity weight multiplexing and heterogeneous multitasking voice emotion recognition method, characterized by comprising the following steps: S1, constructing a nested mutually exclusive data partitioning system: partitioning an original voice training data set carrying sentence-level emotion labels into a plurality of outer folds, further partitioning the training set of each outer fold into a plurality of mutually exclusive inner subsets, and, in each round of inner-layer processing, selecting one inner subset as a target inference set and taking the union of the remaining inner subsets as a teacher training set; S2, generating frame-level pseudo tags based on cross-granularity weight multiplexing: in each round of inner-layer processing, training a teacher model on the teacher training set, the teacher model obtaining its classification weights by minimizing a sentence-level emotion classification loss, and multiplexing those sentence-level classification weights to perform frame-by-frame inference on the target inference set, thereby generating a frame-level hard pseudo tag sequence; S3, constructing and training a heterogeneous multitask student network: constructing a student model comprising a pre-trained feature extractor heterogeneous to that of the teacher model together with parallel sentence-level main task and frame-level auxiliary task branches, and training the student model end to end through a joint loss function using the sentence-level real tags of the original voice training data set and the frame-level hard pseudo tag sequence, the joint loss function being a weighted sum of the sentence-level main task loss and the frame-level auxiliary task loss; and S4, model inference: performing voice emotion recognition with the trained student model, carrying out sentence-level emotion classification on input voice using only the sentence-level main task branch.
  2. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 1, wherein in step S1 the nested mutually exclusive data partitioning system specifically comprises: letting the original speech data set be D = {(x_i, y_i)}_{i=1}^{N}, where x_i is the original audio of the i-th speech sample, y_i is the true sentence-level emotion label of the i-th speech sample, N is the total number of samples in the original speech data set, and i indexes the speech samples; dividing D into K outer folds; for the outer training set D_k^train of the k-th fold, further dividing it into M mutually exclusive inner subsets, denoted {S_1, S_2, …, S_M}; for the m-th inner rotation, the inner teacher training set D_m^teacher and the inner target inference set D_m^target satisfy D_m^target = S_m, D_m^teacher = D_k^train \ S_m, and D_m^teacher ∩ D_m^target = ∅, where S_m denotes the m-th inner subset, \ denotes the set-difference operation, and ∅ denotes the empty set.
  3. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 1, wherein in step S2 generating the frame-level pseudo tag based on cross-granularity weight multiplexing specifically comprises: based on the assumption that emotion features have additive linear semantic consistency in the vector space, namely that the sentence-level emotion feature vector is a linear superposition of the frame-level emotion feature vectors, the classification weight hyperplane that classifies sentence-level features is multiplexed to classify frame-level features; for a speech sample in the target inference set, the frame-level hard pseudo tag ŷ_t of the t-th frame is generated as: ŷ_t = argmax_c s_{t,c}; s_t = W* h_t + b*; h_t = σ(W_p x_t + b_p); W* = argmin_W L_sent(softmax(W Σ_{t=1}^{T} h_t), y); where argmax_c is the operation taking the index of the maximum value, c is the emotion class index, s_{t,c} is the unnormalized score of the t-th frame belonging to the c-th emotion class, t is the time-frame index, s_t is the vector of unnormalized scores of the t-th frame, σ(·) is the nonlinear activation function, W_p and b_p are the weight matrix and bias vector of the projection layer, x_t is the input feature vector of the t-th frame, argmin_W denotes minimization restricted to the admissible weight set, L_sent denotes the sentence-level loss function, y denotes the true sentence-level emotion label of the speech sample, softmax(·) is the function converting a vector of logits into a probability distribution, W is the classifier weight to be optimized, which equals W* once optimization is finished, T is the total number of frames, and h_t is the projected input feature vector of the t-th frame.
  4. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 1, characterized in that in step S3 the feature extractor F_S of the heterogeneous multitask student network and the feature extractor F_T used by the teacher model are pre-trained models of different architectures, with F_T being a generic acoustic model and F_S being an emotion-specialized model, i.e. F_S ≠ F_T.
  5. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 1, wherein in step S3 the sentence-level main task branch comprises a time-domain aggregation module and a classifier, time-domain aggregation being realized by g = Agg(H), where g is the sentence-level global feature vector, Agg(·) denotes the time-domain aggregation operation, H = {h_1, h_2, …, h_T} is the heterogeneous frame-level feature sequence extracted by the student model, h_t is the heterogeneous feature vector of the t-th frame, and T is the total number of frames; the frame-level auxiliary task branch is a point-wise classifier that contains no time-domain aggregation operation and performs emotion classification directly on single-frame features.
  6. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 1, wherein in step S3 the joint loss function is defined as: L = L_utt + λ·L_frame; where L is the joint loss function, L_utt is the sentence-level main task loss, L_frame is the frame-level auxiliary task loss, and λ is a hyperparameter balancing the contributions of the two tasks.
  7. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 6, characterized in that the sentence-level main task loss L_utt and the frame-level auxiliary task loss L_frame are specifically computed as: L_utt = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} 1[y_i = c] · log p_{i,c}; L_frame = -(1/N) Σ_{i=1}^{N} (1/T_i) Σ_{t=1}^{T_i} Σ_{c=1}^{C} 1[ŷ_{i,t} = c] · log q_{i,t,c}; where N is the total number of samples in the original speech data set, i indexes the speech samples, C is the total number of emotion categories to be identified, c is the emotion class index, 1[·] is the indicator function, y_i is the true sentence-level emotion label of the i-th speech sample, p_{i,c} is the probability that the i-th sample is predicted as the c-th emotion class by the sentence-level main task branch, T_i is the total number of frames of the i-th speech sample, t is the time-frame index, ŷ_{i,t} is the frame-level hard pseudo tag of the t-th frame of the i-th speech sample, and q_{i,t,c} is the probability that the t-th frame of the i-th speech sample is predicted as the c-th emotion class by the frame-level auxiliary task branch.
  8. The cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to claim 1, wherein after training in step S3 is completed, the frame-level auxiliary task branch and its related parameters are discarded in the model inference stage of step S4, and only the feature extractor and the sentence-level main task branch are retained for speech emotion recognition.
  9. A cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition system, characterized in that the system is configured to implement the cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to any one of claims 1 to 8, and specifically comprises: a data partitioning module, configured to construct the nested mutually exclusive data partitioning system, namely dividing an original voice training data set with sentence-level emotion labels into a plurality of outer folds, further dividing the training set of each outer fold into a plurality of mutually exclusive inner subsets, and, in each round of inner-layer processing, selecting one inner subset as a target inference set and taking the union of the remaining inner subsets as the teacher training set; a pseudo tag generation module, configured to generate frame-level pseudo tags based on cross-granularity weight multiplexing, wherein in each round of inner-layer processing a teacher model is trained on the teacher training set and obtains its classification weights by minimizing a sentence-level emotion classification loss; a model training module, configured to construct and train the heterogeneous multitask student network, namely constructing a student model comprising a pre-trained feature extractor heterogeneous to that of the teacher model together with parallel sentence-level main task and frame-level auxiliary task branches, and training the student model end to end through a joint loss function using the sentence-level real tags of the original voice training data set and the frame-level hard pseudo tag sequence, the joint loss function being a weighted sum of the sentence-level main task loss and the frame-level auxiliary task loss; and an inference module, configured for model inference, namely performing voice emotion recognition with the trained student model and carrying out sentence-level emotion classification on input voice using only the sentence-level main task branch.
  10. An electronic device comprising a processor, a memory, and a bus system, the processor and the memory being connected by the bus system, the memory being configured to store instructions, and the processor being configured to execute the instructions stored by the memory so as to implement the cross-granularity weight multiplexing and heterogeneous multitasking speech emotion recognition method according to any one of claims 1 to 8.
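The nested mutually exclusive partition of claims 1 and 2 can be sketched in a few lines. This is an illustrative reading, not the patent's implementation: `n_outer`, `n_inner`, and the equal-size splits via `numpy.array_split` are assumptions for the sketch.

```python
import numpy as np

def nested_partition(n_samples, n_outer=5, n_inner=4, seed=0):
    """Sketch of the nested, mutually exclusive partition of claim 2.

    Returns, for every outer fold, a list of (teacher_idx, target_idx)
    pairs: in each inner rotation one inner subset S_m is the target
    inference set and the union of the remaining subsets is the
    teacher training set, so the two are disjoint by construction.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    outer_folds = np.array_split(idx, n_outer)
    plan = []
    for k in range(n_outer):
        # Outer training set D_k^train = all samples outside outer fold k.
        outer_train = np.concatenate(
            [outer_folds[j] for j in range(n_outer) if j != k])
        inner = np.array_split(outer_train, n_inner)
        rotations = []
        for m in range(n_inner):
            target = inner[m]
            teacher = np.concatenate(
                [inner[j] for j in range(n_inner) if j != m])
            rotations.append((teacher, target))
        plan.append(rotations)
    return plan
```

Because every sample of an outer training set appears in exactly one inner subset, each sample receives pseudo labels from a teacher that never saw it, which is the leakage-avoidance point of the nesting.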
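The cross-granularity weight multiplexing of claim 3 reduces to a single matrix product at inference time: under the additive-linearity assumption, the teacher's sentence-level classifier (W*, b*) scores each frame feature directly, and the per-frame argmax is the hard pseudo label. A minimal numpy sketch, with toy dimensions assumed:

```python
import numpy as np

def frame_pseudo_labels(frame_feats, W_star, b_star):
    """Reuse the teacher's sentence-level classifier frame by frame.

    frame_feats: (T, d) projected frame features h_t of one utterance.
    W_star: (C, d) optimized sentence-level classifier weights.
    b_star: (C,)   optimized classifier bias.
    Returns a length-T array of hard pseudo labels
    ŷ_t = argmax_c s_{t,c}, where s_t = W* h_t + b*.
    """
    scores = frame_feats @ W_star.T + b_star  # (T, C) unnormalized frame scores
    return scores.argmax(axis=1)              # hard pseudo labels per frame
```

No softmax is needed here: argmax over unnormalized scores and over the corresponding probabilities coincide.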
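The two parallel student branches of claim 5 differ only in whether a time-domain aggregation precedes the classifier. In the sketch below, mean pooling stands in for the unspecified aggregation Agg(·) (attention pooling would be a drop-in replacement); the linear classifiers and dimensions are assumptions for illustration.

```python
import numpy as np

def utterance_branch(H, W_cls, b_cls):
    """Sentence-level main-task branch: aggregate over time, then classify.

    H: (T, d) heterogeneous frame-level feature sequence from the student.
    Returns (C,) sentence-level logits computed from the global vector g.
    """
    g = H.mean(axis=0)          # g = Agg(H), here simple mean pooling
    return W_cls @ g + b_cls    # sentence-level classifier

def frame_branch(H, W_aux, b_aux):
    """Frame-level auxiliary branch: point-wise classifier, no aggregation.

    Classifies every frame feature independently; returns (T, C) logits.
    """
    return H @ W_aux.T + b_aux
```

The contrast is the whole point of the claim: the main branch collapses the time axis before classifying, while the auxiliary branch keeps it, so the pseudo labels can supervise every frame.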
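The joint loss of claims 6 and 7 is a weighted sum of two cross-entropies, the frame-level term being averaged per utterance before averaging over samples. A direct transcription, assuming the branch outputs are already probabilities:

```python
import numpy as np

def joint_loss(utt_probs, utt_labels, frame_probs, frame_pseudo, lam=0.5):
    """L = L_utt + lam * L_frame, with the cross-entropy forms of claim 7.

    utt_probs:    (N, C) sentence-level probabilities p_{i,c}
    utt_labels:   (N,)   true sentence labels y_i
    frame_probs:  list of (T_i, C) frame probabilities q_{i,t,c}
    frame_pseudo: list of (T_i,)   hard pseudo labels ŷ_{i,t}
    lam: the balancing hyperparameter λ (value assumed for illustration).
    """
    N = len(utt_labels)
    # L_utt = -(1/N) Σ_i log p_{i, y_i}  (indicator picks the true class)
    L_utt = -np.mean(np.log(utt_probs[np.arange(N), utt_labels]))
    # L_frame averages per-frame cross-entropy within each utterance first.
    L_frame = 0.0
    for q, y in zip(frame_probs, frame_pseudo):
        T = len(y)
        L_frame += -np.mean(np.log(q[np.arange(T), y]))
    L_frame /= N
    return L_utt + lam * L_frame
```

The inner 1/T_i normalization keeps long utterances from dominating the auxiliary term, so λ trades off granularities rather than sequence lengths.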
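Claim 8's inference path keeps only the feature extractor and the sentence-level main branch. A minimal sketch, in which the extractor is an arbitrary callable and mean pooling again stands in for the aggregation:

```python
import numpy as np

def infer_emotion(inputs, extractor, W_cls, b_cls):
    """Inference per claim 8: the frame-level auxiliary head and its
    parameters are discarded after training; only the feature extractor
    and the sentence-level main branch remain."""
    H = extractor(inputs)        # (T, d) frame-level features
    g = H.mean(axis=0)           # time-domain aggregation
    logits = W_cls @ g + b_cls   # sentence-level classifier
    return int(np.argmax(logits))
```

The auxiliary branch thus adds training-time supervision at zero inference cost.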

Description

Cross-granularity weight multiplexing and heterogeneous multitasking voice emotion recognition method

Technical Field

The invention relates to the technical field of voice signal processing and artificial intelligence, in particular to a cross-granularity weight multiplexing and heterogeneous multi-task voice emotion recognition method.

Background

Speech emotion recognition (Speech Emotion Recognition, SER) is a core technology in fields such as human-machine interaction, intelligent customer service, in-vehicle speech systems, and mental health assistance, aiming to automatically recognize the emotional state of a speaker by analyzing speech signals. With the rapid development of deep learning, existing SER methods generally adopt deep neural networks to model the spectral characteristics, temporal structure, and vocalization patterns of speech, using sentence-level emotion labels as supervision signals to complete the emotion classification task. The current state of the art is generally based on pre-trained acoustic models (e.g., wav2vec 2.0, HuBERT), which, combined with attention mechanisms or end-to-end Transformer architectures, achieve strong baseline performance across multiple published SER datasets.

Although existing methods improve the accuracy of emotion recognition to a certain extent, the following significant limitations remain:

1. The supervision signal granularity is single. Most existing methods rely on sentence-level coarse-grained emotion labels for training, aggregating variable-length voice features into a fixed-dimension vector by global pooling or similar means. This approach ignores local emotion changes that evolve over time within the speech signal (e.g., short angry segments interspersed in an overall calm sentence), so the resulting models struggle to capture fine-grained emotion fluctuations and hierarchical emotion structure, and have limited recognition capability in the face of complex speech phenomena such as emotion reversals and intonation mutations.

2. The pre-trained model is under-utilized. Current transfer learning in SER mainly follows a "pre-training plus sentence-level fine-tuning" paradigm, i.e., adding a sentence-level classifier on top of the pre-trained acoustic model and fine-tuning. This fails to fully mine the latent hierarchical representations that the pre-trained model encodes at different time scales, in particular lacking effective multiplexing of its potential weights at the frame or segment level, which limits the modeling and transfer of deep emotion cues.

3. Existing multi-task learning methods in SER concentrate on tasks of the same nature or granularity (such as emotion classification plus speaker recognition, or sentence-level emotion plus sentence-level semantic tasks) and lack a heterogeneous multi-task mechanism that can effectively fuse supervision signals of different granularities (such as sentence-level and frame-level). Because of the lack of hierarchical alignment and semantic interaction between tasks, the cooperative optimization benefit of multi-task learning is not fully realized, and it is difficult to sharpen sentence-level emotion discrimination through fine-grained supervision.

4. Fine-grained labeled data is scarce. Because manual frame-level or segment-level emotion labeling is costly and highly subjective, existing SER datasets generally lack fine-grained emotion supervision. Although research has attempted to alleviate this scarcity through approaches such as pseudo-label generation, traditional methods struggle to construct a high-quality, generalizable fine-grained supervision system due to label noise, overfitting, or data leakage, which constrains the model's fine-grained modeling of local emotion characteristics.

In summary, existing speech emotion recognition technology has obvious shortcomings in emotion supervision granularity, pre-trained weight multiplexing strategy, multi-task collaboration mechanism, fine-grained label construction, and the like, restricting the performance and generalization of models in complex real scenes. Therefore, a new speech emotion recognition method is needed that can effectively exploit multi-granularity supervision, realize cross-level feature fusion, and learn robustly in the absence of fine-grained labeling.

Disclosure of Invention

To this end, an embodiment of the invention provides a cross-granularity weight multiplexing and heterogeneous multi-task voice emotion recognition method, which is used for solving the problems of weak modeling capacity of a model on local emotion change in speech and limited overall recognition performance caused by lack