CN-115188383-B - Voice emotion recognition method based on time-frequency attention mechanism

Abstract

The invention discloses a speech emotion recognition method based on a time-frequency attention mechanism and belongs to the technical field of speech emotion recognition. Log-Mel spectrogram features are extracted, padded to a longer length using a cyclic filling method, and input into a speech emotion recognition model. The model comprises a time-frequency convolution module, a time-frequency attention module, a multi-layer convolutional network and a fully connected layer, connected in sequence. The time-frequency convolution module captures variation information in the time domain and the frequency domain and extracts time-frequency features; the time-frequency attention module generates a time-frequency weighted feature map; the multi-layer convolutional network then learns deep emotion features; and a Softmax classifier classifies the different speech emotions. The method has simple steps and recognizes emotion in speech more accurately.
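
The cyclic filling step can be illustrated with a short Python sketch that repeats the frames of each spectrogram until it reaches the longest length in its batch. This is a minimal sketch under stated assumptions, not the patented implementation: the function names `cyclic_pad` and `pad_batch`, the representation of a spectrogram as a list of per-frame feature vectors, and the truncation of the final repetition to the exact target length are all hypothetical choices made for illustration.

```python
import math


def cyclic_pad(frames, target_len):
    """Repeat `frames` cyclically and cut the result to `target_len` frames.

    `frames` is a list of per-frame feature vectors (an assumed layout).
    """
    if not frames:
        raise ValueError("cannot pad an empty spectrogram")
    # Number of whole repetitions needed to reach at least target_len frames.
    repeats = math.ceil(target_len / len(frames))
    # List repetition reuses the same frame objects; the slice trims the excess.
    return (frames * repeats)[:target_len]


def pad_batch(batch):
    """Pad every spectrogram in `batch` to the longest length in the batch."""
    target_len = max(len(frames) for frames in batch)
    return [cyclic_pad(frames, target_len) for frames in batch]


# Example: a 3-frame input padded to the 7-frame batch maximum.
short = [[1.0], [2.0], [3.0]]
long = [[float(i)] for i in range(7)]
padded = pad_batch([short, long])
print([f[0] for f in padded[0]])  # [1.0, 2.0, 3.0, 1.0, 2.0, 3.0, 1.0]
```

Repeating real frames, rather than appending zeros, keeps the padded region filled with speech content; whether the original method trims the last repetition in exactly this way is not stated in the text.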

Inventors

  • JIN BIN
  • DAI YANYAN
  • GU YU
  • FANG CONGCONG
  • MA XINGYUAN

Assignees

  • Jiangsu Normal University

Dates

Publication Date
20260421
Application Date
20220713
Priority Date
20220713

Claims (4)

  1. A speech emotion recognition method based on a time-frequency attention mechanism, characterized by comprising the following steps:
     step a, extracting log-Mel spectrogram features, padding them to a longer length using a cyclic filling method, and inputting the padded log-Mel spectrogram features into a speech emotion recognition model, wherein the speech emotion recognition model comprises a time-frequency convolution module, a time-frequency attention module, a multi-layer convolutional network and a fully connected layer connected in sequence;
     step b, capturing, by the time-frequency convolution module, variation information in the time domain and the frequency domain of the input log-Mel spectrogram, and simultaneously extracting time-frequency features of the log-Mel spectrogram;
     step c, performing weight learning on the time-frequency features in the time-domain and frequency-domain directions respectively using the time-frequency attention module, calibrating the emotion features, and generating a time-frequency weighted feature map;
     step d, feeding the time-frequency weighted feature map into the multi-layer convolutional network to learn deep emotion features;
     step e, applying mean pooling along the time dimension and max pooling along the frequency dimension to the deep emotion features to obtain one-dimensional emotion features;
     step f, feeding the one-dimensional emotion features into the fully connected layer to project them to the required dimension, and classifying the different speech emotions using a Softmax classifier;
     wherein the time-frequency attention module is constructed by the following steps:
     step c-1, generating a time-domain attention weight and a frequency-domain attention weight: a time-domain feature map is output by TCNN and a frequency-domain feature map is output by FCNN, each feature map having three dimensions, namely channel C, frequency H and time W; the channel information of the two output feature maps is aggregated by average pooling and max pooling to generate two-dimensional maps, which represent the average-pooled and max-pooled features over all channels; these maps are learned by independent convolutions with two different kernel sizes; and the time-domain attention weight and the frequency-domain attention weight are finally generated through a Sigmoid activation function, according to formulas (1) and (2); in formula (1), the terms denote, respectively: the aggregated features of the time-domain feature map after average pooling; the aggregated features of the time-domain feature map after max pooling; a convolution whose kernel size is defined in terms of t, where t is the number of frames of the time region; the time-dimension-dependent weight; and the Sigmoid activation function; likewise, in formula (2), the terms denote, respectively: the aggregated features of the frequency-domain feature map after average pooling; the aggregated features of the frequency-domain feature map after max pooling; a convolution whose kernel size is defined in terms of f, where f is the number of frequency intervals; and the frequency-dimension-dependent weight;
     step c-2, constructing the time-frequency attention module from the time-domain feature map and the frequency-domain feature map using residual connections, wherein the time-frequency attention module is a feedforward neural attention module that infers attention maps in the time-domain and frequency-domain dimensions respectively and then applies the attention maps to the time-frequency features for adaptive feature refinement, this being the time-frequency attention mechanism TF_atten, which comprises a time-domain attention mechanism T_atten and a frequency-domain attention mechanism F_atten; the time-frequency weighted feature map is then generated by the time-frequency attention module: the time-domain and frequency-domain attention weights are multiplied element-wise with the corresponding feature maps, and, in order to protect the integrity of the emotion information in the speech during the element-wise multiplication, residual connections are used in computing the time-frequency attention weighted feature maps (T, S), which are calculated according to formulas (3) and (4), wherein the operator in those formulas denotes element-wise multiplication.
  2. The speech emotion recognition method based on a time-frequency attention mechanism of claim 1, wherein step a specifically comprises:
     step a-1, applying pre-emphasis, framing, windowing and short-time Fourier transform to the sound signal in sequence, finally generating a log-Mel spectrogram;
     step a-2, setting an input batch, wherein the time length of an input log-Mel spectrogram is wavtime, the minimum time length of the log-Mel spectrograms in the batch is denoted min, and the maximum time length of the log-Mel spectrograms in the batch, obtained through the max() function, is denoted max;
     step a-3, if the maximum time length max of the batch is greater than or equal to the minimum time length min of the log-Mel spectrograms, the maximum time length of the batch remains max; otherwise, max is set equal to min;
     step a-4, judging whether the time length wavtime of the input log-Mel spectrogram is equal to max, and if true, returning the log-Mel spectrogram features with time length max;
     step a-5, if false, dividing the maximum time length max of the batch by the time length wavtime of the input spectrogram to obtain the length to be filled, and performing cyclic filling using the repeat() function;
     step a-6, returning the padded features, thereby ensuring that each padded log-Mel spectrogram is at least as long as its original length wavtime, with the padded length set according to the maximum length.
  3. The speech emotion recognition method based on a time-frequency attention mechanism of claim 1, wherein the time-frequency convolution module is constructed from two sets of filters of different shapes that learn time-frequency information: the first set is a time-domain convolution filter TCNN, which obtains time-varying information along the time dimension of the log-Mel spectrogram; the second set is a frequency-domain convolution filter FCNN, which obtains frequency information along the frequency dimension of the log-Mel spectrogram, with frequency dimension F and time dimension 1; and TCNN and FCNN together form the time-frequency convolution module TFCNN.
  4. The speech emotion recognition method based on a time-frequency attention mechanism of claim 1, wherein the time-frequency weighted feature maps are concatenated by Concat to obtain a fused feature, and the deep emotion features are learned using multi-layer convolution, the multi-layer convolution consisting of convolution layers and pooling layers.
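
The attention weighting with residual connection described in steps c-1 and c-2 of claim 1 can be illustrated with a small pure-Python sketch of the time-domain branch. Several details are assumptions, because the formulas are not recoverable from the text: the kernels `k_avg` and `k_max` are fixed hypothetical values standing in for learned convolution weights, the convolution is taken to be one-dimensional along the time axis with zero padding, the average-pooled and max-pooled maps are combined by summing their convolution responses, and all function names are illustrative. It should be read as a sketch of the general idea rather than the claimed implementation.

```python
import math


def channel_pool(fmap):
    """Aggregate a C x H x W feature map over channels into (avg, max) H x W maps."""
    n_ch, n_freq, n_time = len(fmap), len(fmap[0]), len(fmap[0][0])
    avg = [[sum(fmap[c][h][w] for c in range(n_ch)) / n_ch
            for w in range(n_time)] for h in range(n_freq)]
    mx = [[max(fmap[c][h][w] for c in range(n_ch))
           for w in range(n_time)] for h in range(n_freq)]
    return avg, mx


def time_attention(fmap, k_avg, k_max):
    """Return H x W attention weights in (0, 1).

    A 1-D convolution along time (assumed kernel shape, zero padding) is applied
    to the pooled maps, followed by a Sigmoid.
    """
    avg, mx = channel_pool(fmap)
    n_time = len(avg[0])
    half = len(k_avg) // 2
    weights = []
    for avg_row, max_row in zip(avg, mx):
        row = []
        for w in range(n_time):
            s = 0.0
            for i, (ka, km) in enumerate(zip(k_avg, k_max)):
                j = w + i - half
                if 0 <= j < n_time:  # positions outside the map count as zero
                    s += ka * avg_row[j] + km * max_row[j]
            row.append(1.0 / (1.0 + math.exp(-s)))  # Sigmoid
        weights.append(row)
    return weights


def apply_with_residual(fmap, weights):
    """Element-wise weighting plus residual: out = fmap * weights + fmap."""
    return [[[x * a + x for x, a in zip(f_row, a_row)]
             for f_row, a_row in zip(channel, weights)]
            for channel in fmap]


# Toy feature map with C=2 channels, H=1 frequency bin, W=3 time frames.
fmap = [[[1.0, 2.0, 3.0]], [[3.0, 0.0, 1.0]]]
att = time_attention(fmap, k_avg=[0.1, 0.2, 0.1], k_max=[0.0, 0.1, 0.0])
weighted = apply_with_residual(fmap, att)
```

Because the Sigmoid output lies strictly between 0 and 1, the residual term guarantees that every weighted value keeps at least its original magnitude, which is one way to read the stated goal of protecting the integrity of the emotion information during element-wise multiplication.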

Description

Speech emotion recognition method based on a time-frequency attention mechanism

Technical Field

The invention relates to a speech emotion recognition method based on a time-frequency attention mechanism and belongs to the technical field of speech emotion recognition.

Background

Speech emotion recognition has important application value in human-computer interaction. Traditional speech emotion recognition methods recognize and classify expressed emotion using acoustic features and a machine learning classification model. The acoustic features are mostly extracted from the original audio recording and include low-level descriptors (LLDs) and high-level statistical features (HSFs). To identify emotion from the extracted acoustic features, a variety of machine learning classification models are then commonly used, including Markov models, Gaussian mixture models, decision trees, and the like. However, such acoustic features cannot characterize speech emotion information well: emotion information related to the time dimension is often ignored when frequency-domain features are selected, and frequency-related information is often ignored when time-domain features are selected.

Speech emotion information is distributed over both the time domain and the frequency domain, and a spectrogram is a time-frequency representation that reflects the frequency-domain characteristics of speech while preserving its temporal information. Researchers have therefore tried to use spectrograms in place of acoustic features for emotion classification. In the time domain, emotion is reflected in different time frames; in the frequency domain, different emotion information is distributed across the high-frequency and low-frequency regions of the speech.
For example, emotional speech such as anger shows rich acoustic features at high frequencies, whereas emotional speech such as sadness shows rich acoustic features at low frequencies. The time frames and frequency intervals of a log-Mel spectrogram therefore differ in their importance to emotion features, and how to extract salient time-frequency features requires further research.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a speech emotion recognition method based on a time-frequency attention mechanism, which has simple steps and recognizes emotion in speech more accurately by extracting emotion-related time-frequency features.

To achieve this technical purpose, the speech emotion recognition method based on a time-frequency attention mechanism comprises the following steps:

step a, extracting log-Mel spectrogram features, padding them to a longer length using a cyclic filling method, and inputting the padded log-Mel spectrogram features into a speech emotion recognition model, wherein the speech emotion recognition model comprises a time-frequency convolution module, a time-frequency attention module, a multi-layer convolutional network and a fully connected layer connected in sequence;

step b, capturing, by the time-frequency convolution module, variation information in the time domain and the frequency domain of the input log-Mel spectrogram, and simultaneously extracting time-frequency features of the log-Mel spectrogram;

step c, performing weight learning on the time-frequency features in the time-domain and frequency-domain directions respectively using the time-frequency attention module, calibrating the emotion features, and generating a time-frequency weighted feature map;

step d, feeding the time-frequency weighted feature map into the multi-layer convolutional network to learn deep emotion features;

step e, applying mean pooling along the time dimension and max pooling along the frequency dimension, respectively, to the deep emotion features to obtain one-dimensional emotion features;

step f, feeding the one-dimensional emotion features into the fully connected layer to project them to the required dimension, and classifying the different speech emotions using a Softmax classifier.

Further, step a specifically comprises:

step a-1, applying pre-emphasis, framing, windowing and short-time Fourier transform to the sound signal in sequence, finally generating a log-Mel spectrogram;

step a-2, setting an input batch, wherein the time length of an input log-Mel spectrogram is wavtime, the minimum time length of the log-Mel spectrograms in the batch is denoted min, and the maximum time length of the log-Mel spectrograms in the batch, obtained through the max() function, is denoted max;

step a-3, if the maximum time length max of the batch is greater than or equal to the minimum time length min of the log-Mel spectrograms, the maximum time length of the batch is max; otherwise, max is equal to min