CN-119626262-B - Speech emotion recognition method, device, equipment and medium

Abstract

The application provides a speech emotion recognition method, device, equipment, and medium, belonging to the technical field of artificial intelligence. The method comprises: converting a voice signal into time-series data and extracting acoustic features of multiple dimensions from the voice signal, wherein the acoustic features are used for reflecting physical properties of the voice signal; performing feature extraction on the time-series data and the multi-dimensional acoustic features through a pre-trained convolutional neural network (CNN) model to obtain target features; and determining an emotion type recognition result according to the target features through a pre-trained recurrent neural network (RNN) model.
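
A minimal sketch of the two-stage pipeline described in the abstract, assuming PyTorch. The channel counts, hidden sizes, and layer shapes are illustrative assumptions; the patent does not specify them, only the CNN-then-RNN structure and the 15 preset emotion types named in the claims:

```python
import torch
import torch.nn as nn

EMOTIONS = ["happiness", "sadness", "anger", "surprise", "fear",
            "calm", "aversion", "love", "anxiety", "embarrassment",
            "confusion", "contempt", "craving", "comfort", "expectation"]

class CNNFeatureExtractor(nn.Module):
    """CNN stage: shallow-to-deep feature extraction over the feature map
    built from the time-series data plus multi-dimensional acoustic features."""
    def __init__(self, in_channels: int = 40, target_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(                 # convolution unit
            nn.Conv1d(in_channels, 64, kernel_size=5, padding=2),
            nn.ReLU(),                             # activation function layer
            nn.MaxPool1d(2),                       # pooling layer
            nn.Conv1d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.fc = nn.Linear(128, target_dim)       # fully connected layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, in_channels, frames) -> (batch, frames', target_dim)
        h = self.conv(x).transpose(1, 2)
        return self.fc(h)

class RNNEmotionClassifier(nn.Module):
    """RNN stage: maps the sequence of target features to an emotion type."""
    def __init__(self, target_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.rnn = nn.GRU(target_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, len(EMOTIONS))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        _, h_n = self.rnn(feats)                   # final hidden state
        return self.head(h_n[-1])                  # logits over 15 emotions

# Usage: one utterance as a 40-channel x 400-frame feature map (toy input).
cnn, rnn = CNNFeatureExtractor(), RNNEmotionClassifier()
logits = rnn(cnn(torch.randn(1, 40, 400)))
print(EMOTIONS[logits.argmax(dim=-1).item()])
```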

Inventors

  • Cai Baoyan
  • Su Zeyang
  • Liu Yang
  • Luo Yihang

Assignees

  • China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2024-11-14

Claims (9)

  1. A method for speech emotion recognition, the method comprising: acquiring a voice signal from a user; converting the voice signal into time-series data, and extracting acoustic features of multiple dimensions from the voice signal, wherein the acoustic features are used for reflecting physical properties of the voice signal; performing feature extraction on the time-series data and the multi-dimensional acoustic features through a pre-trained convolutional neural network (CNN) model to obtain target features; determining an emotion type recognition result according to the target features through a pre-trained recurrent neural network (RNN) model; acquiring a plurality of voice signal samples; for each voice signal sample, analyzing the linguistic expression of the sample and determining a first matching condition between the linguistic expression and each preset emotion type, wherein the first matching condition is the degree of matching between the linguistic expression and each preset emotion type, and the preset emotion types comprise happiness, sadness, anger, surprise, fear, calm, aversion, love, anxiety, embarrassment, confusion, contempt, craving, comfort, and expectation; for each voice signal sample, analyzing the sound attributes of the sample and determining a second matching condition between the sound attributes and each preset emotion type, wherein the second matching condition is the degree of matching between the sound attributes and each preset emotion type, and the sound attributes comprise intonation and speech rate; determining the emotion type label corresponding to each voice signal sample from the preset emotion types according to the first matching condition and the second matching condition corresponding to that sample; and training the CNN model and the RNN model by using the voice signal samples and their corresponding emotion type labels (the label-generation steps are sketched in the code after the claims).
  2. The method of claim 1, wherein analyzing the linguistic expression of the voice signal sample and determining the first matching condition between the linguistic expression and each preset emotion type comprises: determining the first matching condition between the linguistic expression and each preset emotion type according to the number of occurrences, in the linguistic expression, of vocabulary associated with each preset emotion type.
  3. The method of claim 1, wherein analyzing the sound attributes of the voice signal sample and determining the second matching condition between the sound attributes and each preset emotion type comprises: matching the attribute characteristics of the sound attributes associated with each preset emotion type against the attribute characteristics of the sound attributes of the voice signal sample to obtain the second matching condition between the sound attributes and each preset emotion type, wherein the sound attributes comprise intonation and speech rate.
  4. The method of claim 1, wherein the CNN model comprises: an input layer for receiving a feature map composed of the time-series data and the multi-dimensional acoustic features; a convolution unit for extracting features layer by layer, from shallow features to deep features of the feature map; one or more fully connected layers for processing the features extracted by the convolution unit to obtain the target features; and an output layer for outputting the target features.
  5. The method of claim 4, wherein the convolution unit comprises a plurality of convolution layers connected by an activation function layer and a pooling layer.
  6. The method of any of claims 1-5, wherein converting the voice signal into time-series data and extracting the multi-dimensional acoustic features from the voice signal comprises: preprocessing the voice signal and converting the preprocessed voice signal into time-series data, wherein the preprocessing comprises denoising, audio format standardization, and unification of the sampling rate; and extracting the multi-dimensional acoustic features from the preprocessed voice signal, wherein the multi-dimensional acoustic features comprise spectral features, Mel-frequency cepstral coefficients (MFCCs), and a spectrogram.
  7. A speech emotion recognition device, the device comprising: a signal acquisition module for acquiring a voice signal from a user; a signal processing module for converting the voice signal into time-series data and extracting multi-dimensional acoustic features from the voice signal, wherein the acoustic features are used for reflecting physical properties of the voice signal; a feature extraction module for performing feature extraction on the time-series data and the multi-dimensional acoustic features through a pre-trained convolutional neural network (CNN) model to obtain target features; an emotion recognition module for determining an emotion type recognition result according to the target features through a pre-trained recurrent neural network (RNN) model; and a model training module for performing the following steps: acquiring a plurality of voice signal samples; for each voice signal sample, analyzing the linguistic expression of the sample and determining a first matching condition between the linguistic expression and each preset emotion type, wherein the first matching condition is the degree of matching between the linguistic expression and each preset emotion type, and the preset emotion types comprise happiness, sadness, anger, surprise, fear, calm, aversion, love, anxiety, embarrassment, confusion, contempt, craving, comfort, and expectation; for each voice signal sample, analyzing the sound attributes of the sample and determining a second matching condition between the sound attributes and each preset emotion type, wherein the second matching condition is the degree of matching between the sound attributes and each preset emotion type, and the sound attributes comprise intonation and speech rate; determining the emotion type label corresponding to each voice signal sample from the preset emotion types according to the first matching condition and the second matching condition corresponding to that sample; and training the CNN model and the RNN model by using the voice signal samples and their corresponding emotion type labels.
  8. An electronic device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the speech emotion recognition method of any of claims 1-6.
  9. A computer-readable storage medium having stored thereon a computer program/instructions which, when executed by a processor, implement the speech emotion recognition method of any of claims 1-6.
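
The label-generation steps of claims 1-3 (keyword-count matching for the linguistic expression, attribute matching for intonation and speech rate) can be illustrated with a minimal sketch. The keyword lists, the per-emotion sound profiles, and the equal weighting of the two matching conditions are illustrative assumptions; the patent specifies none of them:

```python
# Illustrative sketch of claims 1-3: derive an emotion label for a sample
# from a first matching condition (emotion-keyword counts in the transcript)
# and a second matching condition (intonation/speech-rate attributes).
# Keyword lists, profiles, and the 50/50 weighting are assumptions.

EMOTION_KEYWORDS = {
    "happiness": ["glad", "great", "wonderful"],
    "sadness":   ["sad", "sorry", "miss"],
    "anger":     ["angry", "furious", "unacceptable"],
    # ... remaining preset emotion types omitted for brevity
}

# Assumed per-emotion sound-attribute profiles: (pitch in Hz, words/sec).
EMOTION_SOUND_PROFILE = {
    "happiness": (220.0, 3.5),
    "sadness":   (150.0, 2.0),
    "anger":     (250.0, 4.0),
}

def first_matching(transcript: str) -> dict:
    """First matching condition: keyword occurrence counts per emotion."""
    words = transcript.lower().split()
    return {emo: sum(words.count(k) for k in kws)
            for emo, kws in EMOTION_KEYWORDS.items()}

def second_matching(pitch_hz: float, rate_wps: float) -> dict:
    """Second matching condition: closeness to each emotion's sound profile."""
    scores = {}
    for emo, (p_ref, r_ref) in EMOTION_SOUND_PROFILE.items():
        # Higher score = smaller normalized distance to the profile.
        scores[emo] = 1.0 / (1.0 + abs(pitch_hz - p_ref) / p_ref
                                 + abs(rate_wps - r_ref) / r_ref)
    return scores

def emotion_label(transcript: str, pitch_hz: float, rate_wps: float) -> str:
    """Combine both matching conditions (equal weight, an assumption)."""
    m1, m2 = first_matching(transcript), second_matching(pitch_hz, rate_wps)
    total1 = sum(m1.values()) or 1.0
    return max(m2, key=lambda e: 0.5 * m1[e] / total1 + 0.5 * m2[e])

print(emotion_label("I am so angry, this is unacceptable", 245.0, 3.9))
```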

Description

Speech emotion recognition method, device, equipment and medium

Technical Field

The present application relates to the field of artificial intelligence technologies, and in particular to a method, an apparatus, a device, and a medium for speech emotion recognition.

Background

With the rapid development of artificial intelligence technology, speech emotion recognition has shown great application potential in many fields, such as man-machine interaction, intelligent customer service, and psychological consultation. Taking intelligent customer service as an example, identifying the emotional state of the user helps the system provide more attentive, personalized service, thereby improving the user experience. However, conventional speech emotion recognition methods rely mainly on manually designed feature extraction (i.e., hand-crafted features) and classification algorithms, and suffer from incomplete feature extraction and low recognition accuracy.

Disclosure of Invention

In view of the foregoing, embodiments of the present application provide a method, apparatus, device, and medium for speech emotion recognition to overcome, or at least partially solve, the foregoing problems.

In a first aspect of the embodiments of the present application, a speech emotion recognition method is provided, the method comprising: acquiring a voice signal from a user; converting the voice signal into time-series data, and extracting multi-dimensional acoustic features from the voice signal, wherein the acoustic features are used for reflecting physical properties of the voice signal; performing feature extraction on the time-series data and the multi-dimensional acoustic features through a pre-trained convolutional neural network (CNN) model to obtain target features; and determining an emotion type recognition result according to the target features through a pre-trained recurrent neural network (RNN) model.

As a possible implementation, the method further comprises: acquiring a plurality of voice signal samples; for each voice signal sample, analyzing the linguistic expression of the sample to determine a first matching condition between the linguistic expression and each preset emotion type, the preset emotion types including happiness, sadness, anger, surprise, fear, calm, aversion, love, anxiety, embarrassment, confusion, contempt, craving, comfort, and expectation; for each voice signal sample, analyzing the sound attributes of the sample to determine a second matching condition between the sound attributes and each preset emotion type; determining the emotion type label corresponding to each voice signal sample from the preset emotion types according to the first and second matching conditions corresponding to that sample; and training the CNN model and the RNN model by using the voice signal samples and their corresponding emotion type labels.

As a possible implementation, analyzing the linguistic expression of the voice signal sample to determine the first matching condition between the linguistic expression and each preset emotion type includes: determining the first matching condition according to the number of occurrences, in the linguistic expression, of vocabulary associated with each preset emotion type.
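
A minimal sketch of the joint training step described above (training the CNN and RNN on the labeled samples), assuming PyTorch and reusing the CNNFeatureExtractor, RNNEmotionClassifier, and EMOTIONS definitions from the pipeline sketch after the abstract. The optimizer, learning rate, epoch count, and full-batch update are illustrative assumptions:

```python
import torch
import torch.nn as nn

def train(cnn, rnn, samples, labels, epochs: int = 10, lr: float = 1e-3):
    """samples: (N, channels, frames) feature maps; labels: (N,) class ids
    produced by the matching-based label-generation procedure."""
    params = list(cnn.parameters()) + list(rnn.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)  # optimizer choice is assumed
    loss_fn = nn.CrossEntropyLoss()              # 15-way classification loss
    for epoch in range(epochs):
        optimizer.zero_grad()
        logits = rnn(cnn(samples))               # CNN features -> RNN logits
        loss = loss_fn(logits, labels)
        loss.backward()                          # update both models jointly
        optimizer.step()
        print(f"epoch {epoch}: loss={loss.item():.4f}")

# Usage with toy data: 8 samples, 40 feature channels, 400 frames each.
# cnn, rnn = CNNFeatureExtractor(), RNNEmotionClassifier()
# train(cnn, rnn, torch.randn(8, 40, 400), torch.randint(0, 15, (8,)))
```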
As a possible implementation, analyzing the sound attributes of the voice signal sample to determine the second matching condition between the sound attributes and each preset emotion type includes: matching the attribute characteristics of the sound attributes associated with each preset emotion type against the attribute characteristics of the sound attributes of the voice signal sample to obtain the second matching condition, wherein the sound attributes include intonation and speech rate.

As a possible implementation, the CNN model includes: an input layer for receiving a feature map composed of the time-series data and the multi-dimensional acoustic features; a convolution unit for extracting features layer by layer, from shallow features to deep features of the feature map; one or more fully connected layers for processing the features extracted by the convolution unit to obtain the target features; and an output layer for outputting the target features.

As a possible implementation, the convolution unit comprises a plurality of convolution layers connected by an activation function layer and a pooling layer.

As a possible implementation, converting the voice signal into time-series data and extracting multi-dimensional acoustic features from the voice signal includes: preprocessing the voice signal and converting the preprocessed voice signal into time-series data, wherein the preprocessing comprises denoising, audio format standardization, and unification of the sampling rate; and extracting the multi-dimensional acoustic features from the preprocessed voice signal, wherein the multi-dimensional acoustic features comprise spectral features, Mel-frequency cepstral coefficients, and a spectrogram.
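
A minimal sketch of the preprocessing and feature-extraction step, assuming the librosa library. The target sampling rate, the pre-emphasis used as a stand-in for denoising, and the number of MFCC coefficients are illustrative assumptions; the patent names denoising, format standardization, and sampling-rate unification without specifying algorithms:

```python
import numpy as np
import librosa

TARGET_SR = 16000  # assumed unified sampling rate

def preprocess_and_extract(path: str):
    """Load, preprocess, and extract the multi-dimensional acoustic features."""
    # Loading with a fixed sr resamples the audio: sampling-rate unification.
    y, sr = librosa.load(path, sr=TARGET_SR, mono=True)
    # Simple pre-emphasis as a stand-in for the unspecified denoising step.
    y = np.append(y[0], y[1:] - 0.97 * y[:-1])
    time_series = y                                           # time-series data
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # MFCCs
    mel = librosa.feature.melspectrogram(y=y, sr=sr)          # spectrogram
    spectral = librosa.feature.spectral_centroid(y=y, sr=sr)  # spectral feature
    return time_series, mfcc, mel, spectral

# Usage (hypothetical file path):
# ts, mfcc, mel, centroid = preprocess_and_extract("utterance.wav")
# print(mfcc.shape, mel.shape, centroid.shape)
```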