CN-122024770-A - Voice emotion recognition method based on multi-feature fusion
Abstract
The invention discloses a voice emotion recognition method based on multi-feature fusion. The method comprises: obtaining an original voice signal and preprocessing it to obtain a voice frame sequence; extracting multi-granularity voice features, including low-level acoustic features, emotion-specific features, and pre-trained speech representation features, from the voice frame sequence in parallel; processing the features to obtain a multi-granularity feature matrix; inputting the feature matrix into a hierarchical attention fusion network for hierarchical fusion to generate a sentence-level emotion representation vector; and inputting the emotion representation vector into a classifier to output the voice emotion recognition result. The invention constructs a three-layer complementary feature system and introduces a hierarchical attention mechanism to achieve deep adaptive fusion of multi-granularity features, which effectively addresses the single feature representation and coarse fusion strategies of traditional methods and markedly improves the accuracy of speech emotion recognition.
Inventors
- ZHAO ZIYUE
- ZHAO GANG
- AN BIN
- LI MINGYAN
Assignees
- 北京赛智时代信息技术咨询有限公司
Dates
- Publication Date: 20260512
- Application Date: 20260302
Claims (9)
- 1. A speech emotion recognition method based on multi-feature fusion, characterized by comprising the following steps: Step 1, acquiring an original voice signal and preprocessing it to obtain a preprocessed voice frame sequence; Step 2, extracting multi-granularity voice features in parallel from the preprocessed voice frame sequence, the multi-granularity voice features comprising low-level acoustic features, emotion-specific features, and pre-trained speech representation features; Step 3, performing temporal alignment and normalization on the extracted low-level acoustic features, emotion-specific features, and pre-trained speech representation features to obtain a group of temporally aligned multi-granularity feature matrices of uniform scale; Step 4, inputting the multi-granularity feature matrices into a hierarchical attention fusion network and performing hierarchical feature fusion through a frame-level attention mechanism and a sentence-level attention mechanism to generate a sentence-level emotion representation vector; and Step 5, inputting the sentence-level emotion representation vector into an emotion classifier and outputting the voice emotion recognition result after classification.
- 2. The speech emotion recognition method based on multi-feature fusion of claim 1, wherein preprocessing the original speech signal in Step 1 specifically comprises: performing pre-emphasis on the original voice signal with a first-order high-pass filter to compensate for high-frequency attenuation; cutting the pre-emphasized continuous voice signal into a sequence of short-time frames, the frame length being set to 20 ms to 40 ms and the frame shift to 1/2 of the frame length; and applying a Hamming window to each frame to obtain the preprocessed voice frame sequence (an illustrative preprocessing sketch follows the claims).
- 3. The voice emotion recognition method based on multi-feature fusion of claim 1, wherein the low-level acoustic features comprise mel-frequency cepstral coefficients, mel spectrograms, and the zero-crossing rate; the mel-frequency cepstral coefficients are extracted through fast Fourier transform, a mel-scale triangular filter bank, logarithmic energy computation, and discrete cosine transform, and the first 12 to 13 coefficients together with their first-order and second-order differences jointly form the feature vector; the zero-crossing rate is computed from the number of times the signal waveform of each frame crosses zero, according to the formula $Z = \frac{1}{2}\sum_{n=1}^{N-1}\left|\operatorname{sgn}[x(n)] - \operatorname{sgn}[x(n-1)]\right|$, wherein $Z$ is the zero-crossing rate, $N$ is the number of sampling points contained in a frame, $x(n)$ is the $n$-th sample of the frame, and $\operatorname{sgn}[\cdot]$ is the sign function.
- 4. The speech emotion recognition method based on multi-feature fusion of claim 1, wherein the emotion-specific features at least include a fundamental frequency contour, an energy contour, a spectral tilt, and a harmonic-to-noise ratio; the fundamental frequency contour describes the change of pitch over time; the energy contour describes the change of sound intensity over time; the spectral tilt describes the falling slope of the spectral envelope; and the harmonic-to-noise ratio measures the ratio of periodic to aperiodic components in the sound.
- 5. The speech emotion recognition method based on multi-feature fusion of claim 1, wherein the pre-trained speech representation features are extracted as follows: the preprocessed original waveform or frame sequence is input into a pre-trained self-supervised learning model; local features are extracted by a multi-layer convolutional neural network encoder, and long-range context dependencies are modeled by a multi-layer Transformer network; the hidden states of an intermediate layer or the last layer of the Transformer are taken as the pre-trained speech representation features; and the parameters of the pre-trained self-supervised learning model are selectively frozen or fine-tuned according to the amount of available data (an illustrative extraction sketch follows the claims).
- 6. The speech emotion recognition method based on multi-feature fusion of claim 1, wherein the temporal alignment and normalization in Step 3 specifically comprises: performing adaptive temporal alignment on the feature sequences with different time resolutions, upsampling features with lower time resolution by linear interpolation or cubic spline interpolation, or downsampling features with higher time resolution by average pooling, so that all feature sequences are unified to the target length; and standardizing each dimension of each type of feature (an illustrative alignment sketch follows the claims).
- 7. The speech emotion recognition method based on multi-feature fusion of claim 1, wherein the hierarchical attention fusion network comprises a frame-level attention mechanism and a sentence-level attention mechanism; the frame-level attention mechanism dynamically weights the contributions of different feature types at each time step to generate a frame-level fused feature sequence; and the sentence-level attention mechanism performs weighted aggregation over the importance of each time step in the frame-level fused feature sequence to generate a sentence-level emotion representation vector of fixed dimension.
- 8. The speech emotion recognition method based on multi-feature fusion of claim 7, wherein the frame-level attention mechanism is implemented as follows (a combined sketch of the frame-level and sentence-level attention follows the claims): (1) for time step $t$, the feature vectors $x_t^{(i)}$ from the different feature extractors, where $i$ denotes the feature type, are concatenated or added to form a composite vector; (2) the composite vector is input into a small feed-forward neural network, typically one or two fully connected layers with a Tanh or ReLU activation, which outputs a scalar attention score $e_t^{(i)}$ for each feature type; (3) a Softmax function is applied to normalize the scores of all feature types at time step $t$, yielding the attention weight $\alpha_t^{(i)}$ of each feature at that time step; (4) the weighted fused feature of the time step is computed as $h_t = \sum_i \alpha_t^{(i)} W^{(i)} x_t^{(i)}$, wherein $W^{(i)}$ is an optional linear transformation matrix mapping the different features into the same space; (5) the frame-level fused feature sequence $H = [h_1, h_2, \dots, h_T]$ is finally obtained.
- 9. The speech emotion recognition method based on multi-feature fusion of claim 7, wherein the sentence-level attention mechanism aggregates a fixed-dimension representation of the entire sentence from the frame-level fused sequence $H$: (1) an importance score $u_t$ is computed for each frame vector $h_t$ in the fused sequence $H$, which may be implemented by another feed-forward neural network; (2) Softmax normalization is applied to the scores of all time steps to obtain the sentence-level attention weights $\alpha_t$; (3) the fused sequence is weighted and summed, $v = \sum_t \alpha_t h_t$, to obtain the final sentence-level emotion representation vector $v$.
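The preprocessing and zero-crossing-rate computation of claims 2 and 3 can be illustrated with a minimal sketch. The following Python/NumPy code is not part of the claims: the pre-emphasis coefficient of 0.97 and the 25 ms frame length are illustrative values consistent with the claimed ranges, and the function names are hypothetical.

```python
import numpy as np

def preprocess(signal, sr, frame_ms=25, alpha=0.97):
    """Pre-emphasis, framing and Hamming windowing (claim 2).
    frame_ms and alpha are illustrative values, not fixed by the claims."""
    # First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    frame_len = int(sr * frame_ms / 1000)        # frame length within the claimed 20-40 ms
    hop = frame_len // 2                         # frame shift = 1/2 of the frame length
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([emphasized[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames                                # shape: (n_frames, frame_len)

def zero_crossing_rate(frames):
    """Short-time zero-crossing rate of claim 3:
    Z = 1/2 * sum_{n=1}^{N-1} |sgn(x[n]) - sgn(x[n-1])| per frame."""
    signs = np.sign(frames)
    signs[signs == 0] = 1                        # treat exact zeros as positive
    return 0.5 * np.sum(np.abs(np.diff(signs, axis=1)), axis=1)
```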
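A minimal sketch of the pre-trained speech representation extraction of claim 5, assuming a wav2vec 2.0-style self-supervised model loaded via the Hugging Face transformers library; the patent does not name a specific model, so the checkpoint, the layer choice, and the decision to freeze the encoder below are purely illustrative.

```python
import torch
from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor

# Illustrative checkpoint only; any CNN-encoder + Transformer self-supervised model fits claim 5.
model_name = "facebook/wav2vec2-base"
extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
ssl_model = Wav2Vec2Model.from_pretrained(model_name)
ssl_model.eval()  # frozen here; claim 5 allows freezing or fine-tuning depending on data volume

def ssl_features(waveform, sr=16000, layer=-1):
    """Return frame-level hidden states of an intermediate or final Transformer layer."""
    inputs = extractor(waveform, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = ssl_model(**inputs, output_hidden_states=True)
    return out.hidden_states[layer].squeeze(0)   # shape: (T_ssl, hidden_dim)
```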
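A minimal sketch of the temporal alignment and normalization of claim 6, using linear interpolation to a common target length and per-dimension z-score standardization; the helper name and the 1e-8 numerical floor are assumptions for illustration, and average pooling could replace interpolation when downsampling, as the claim allows.

```python
import numpy as np

def align_and_normalize(feature_seqs, target_len):
    """Align feature sequences of different time resolutions to target_len (claim 6)
    and standardize each dimension. feature_seqs: list of (T_i, D_i) arrays."""
    aligned = []
    for feat in feature_seqs:
        t_src = np.linspace(0.0, 1.0, num=feat.shape[0])
        t_dst = np.linspace(0.0, 1.0, num=target_len)
        # Linear interpolation per feature dimension to the unified target length
        resampled = np.stack([np.interp(t_dst, t_src, feat[:, d])
                              for d in range(feat.shape[1])], axis=1)
        mean = resampled.mean(axis=0, keepdims=True)
        std = resampled.std(axis=0, keepdims=True) + 1e-8
        aligned.append((resampled - mean) / std)  # z-score per dimension
    return aligned                                # list of (target_len, D_i) arrays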
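A combined sketch of the frame-level and sentence-level attention of claims 7 to 9, written with PyTorch; the hidden sizes, module names, and the use of Tanh scoring networks are assumptions consistent with, but not mandated by, the claims.

```python
import torch
import torch.nn as nn

class HierarchicalAttentionFusion(nn.Module):
    """Frame-level attention over feature types, then sentence-level attention over time."""
    def __init__(self, feat_dims, d_model=128, attn_hidden=64):
        super().__init__()
        # Optional linear maps W^(i) projecting each feature type to a common space (claim 8, step 4)
        self.proj = nn.ModuleList([nn.Linear(d, d_model) for d in feat_dims])
        # Small feed-forward net scoring each feature type per frame (claim 8, step 2)
        self.frame_score = nn.Sequential(nn.Linear(d_model, attn_hidden), nn.Tanh(),
                                         nn.Linear(attn_hidden, 1))
        # Feed-forward net scoring the importance of each frame (claim 9, step 1)
        self.sent_score = nn.Sequential(nn.Linear(d_model, attn_hidden), nn.Tanh(),
                                        nn.Linear(attn_hidden, 1))

    def forward(self, feats):
        # feats: list of (B, T, D_i) tensors, one per feature type, already time-aligned
        projected = torch.stack([p(f) for p, f in zip(self.proj, feats)], dim=2)  # (B, T, I, d)
        e = self.frame_score(projected).squeeze(-1)       # (B, T, I) per-type scores e_t^(i)
        alpha = torch.softmax(e, dim=2).unsqueeze(-1)      # frame-level weights over feature types
        h = (alpha * projected).sum(dim=2)                 # (B, T, d) fused frame sequence H
        u = self.sent_score(h).squeeze(-1)                 # (B, T) frame importance scores u_t
        beta = torch.softmax(u, dim=1).unsqueeze(-1)       # sentence-level attention weights
        v = (beta * h).sum(dim=1)                          # (B, d) sentence-level vector v
        return v
```

The resulting vector v would then feed the emotion classifier of Step 5, for example a fully connected layer followed by Softmax over the emotion categories.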
Description
Voice emotion recognition method based on multi-feature fusion

Technical Field

The invention relates to the technical field of voice processing, and in particular to a voice emotion recognition method based on multi-feature fusion.

Background

Speech emotion recognition (SER) is one of the core technologies in the fields of affective computing and intelligent speech interaction. Its fundamental task is to analyze an input speech signal with a computational model and automatically determine the speaker's current emotional state, such as happiness, sadness, anger, surprise, fear, or neutrality. In the conventional technical path, the recognition process depends heavily on a series of hand-crafted acoustic features, including but not limited to mel-frequency cepstral coefficients (MFCC), linear prediction cepstral coefficients (LPCC), pitch frequency (F0), short-time energy, and the spectral centroid. These features are designed according to the statistical regularities of the speech signal in the time, frequency, and time-frequency domains and can, to a certain extent, reflect acoustic parameter changes caused by emotion: for example, an excited state is generally accompanied by a rising pitch and a faster speaking rate, whereas a sad state may manifest as a lower pitch and a slower speaking rate. However, the expression of emotion is multi-dimensional and multi-level, involving the complex coupling of various acoustic and linguistic cues such as prosody, timbre, speaking rate, and context. Although traditional acoustic features such as MFCC can effectively characterize the short-time spectral envelope of speech, and prosodic features such as fundamental frequency and energy can describe the macroscopic fluctuation of intonation, a single type of feature only reflects the projection of emotion onto one dimension. Most existing schemes adopt simple fusion strategies such as early feature concatenation or late decision voting and cannot effectively mine and exploit the latent, deep, hierarchical complementary relations among different types of features; for example, they cannot dynamically adjust the contribution weights of prosodic features and spectral-detail features in the decision according to the speech content. This under-utilization of features and the coarseness of the fusion strategies limit the fine discrimination capability of a model for complex emotional states.

Disclosure of Invention

Therefore, the invention provides a voice emotion recognition method based on multi-feature fusion, which aims to solve the above problems in the prior art.
In order to achieve the above object, the present invention provides the following technical solution: a speech emotion recognition method based on multi-feature fusion comprises the following steps: Step 1, acquiring an original voice signal and preprocessing it to obtain a preprocessed voice frame sequence; Step 2, extracting multi-granularity voice features in parallel from the preprocessed voice frame sequence, the multi-granularity voice features comprising low-level acoustic features, emotion-specific features, and pre-trained speech representation features; Step 3, performing temporal alignment and normalization on the extracted low-level acoustic features, emotion-specific features, and pre-trained speech representation features to obtain a group of temporally aligned multi-granularity feature matrices of uniform scale; Step 4, inputting the multi-granularity feature matrices into a hierarchical attention fusion network and performing hierarchical feature fusion through a frame-level attention mechanism and a sentence-level attention mechanism to generate a sentence-level emotion representation vector; and Step 5, inputting the sentence-level emotion representation vector into an emotion classifier and outputting the voice emotion recognition result after classification.

Further, preprocessing the original voice signal in Step 1 specifically includes: performing pre-emphasis on the original voice signal with a first-order high-pass filter to compensate for high-frequency attenuation; cutting the pre-emphasized continuous voice signal into a sequence of short-time frames, the frame length being set to 20 ms to 40 ms and the frame shift to 1/2 of the frame length; and applying a Hamming window to each frame to obtain the preprocessed voice frame sequence.

Further, the low-level acoustic features at least comprise mel-frequency cepstral coefficients, a mel spectrogram, and the zero-crossing rate; the mel-frequency cepstral coefficients are extracted through fast Fourier transform, a mel-scale triangular filter bank, logarithmic energy computation, and discrete cosine transform, and the first 12 to 13 coefficients together with their first-order and second-order differences jointly form the feature vector.