CN-122024773-A - Emotion recognition method and device based on audio data and electronic equipment
Abstract
The invention relates to the technical field of artificial intelligence and provides an emotion recognition method and device based on audio data, and an electronic device. The method comprises inputting audio data in a target service scene into an emotion recognition model to obtain an emotion type recognition result and an emotion intensity prediction result. The emotion recognition model is obtained by training the latest historical version model based on incremental training data and the emotion labels and intensity labels corresponding to the incremental training data. The incremental training data is constructed by using a target number of historical version models to respectively perform emotion intensity prediction on audio service data corresponding to a newly added target service scene, and screening the audio service data according to the predicted emotion intensities; among the target number of historical version models, each later version is obtained by incremental training based on the previous version. The invention can sharply capture subtle emotion changes and achieve accurate recognition of audio emotion.
Inventors
- YANG ZHENGBIAO
Assignees
- 元保科创(北京)科技有限公司
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2026-04-15
Claims (10)
- 1. An emotion recognition method based on audio data, comprising: acquiring audio data in a target service scene; and inputting the audio data into an emotion recognition model to obtain an emotion type recognition result and an emotion intensity prediction result output by the emotion recognition model; wherein the emotion recognition model is obtained by training the latest historical version model based on incremental training data and the emotion labels and intensity labels corresponding to the incremental training data; the incremental training data is constructed in advance by using a target number of historical version models to respectively perform emotion intensity prediction on audio service data corresponding to a newly added target service scene, and screening the audio service data according to the predicted emotion intensities; and among the target number of historical version models, the historical version model of each later version is obtained by incremental training based on the historical version model of the previous version.
- 2. The emotion recognition method based on audio data according to claim 1, further comprising, before inputting the audio data into the emotion recognition model: acquiring the target number of historical version models, and acquiring the audio service data corresponding to the newly added target service scene, the audio service data comprising a plurality of audio service samples; for each audio service sample, inputting the sample into each of the target number of historical version models to obtain the emotion intensity output by each historical version model for that sample; determining, from the emotion intensities output by the historical version models, the inter-model emotion intensity differences of each sample, and from these obtaining the sample's total inter-model emotion difference; screening the audio service samples according to their total inter-model emotion differences; and combining the screened audio service samples with the historical audio training data corresponding to the latest historical version model to obtain the incremental training data, and training the latest historical version model with the incremental training data (a sketch of this screening step follows the claims).
- 3. The emotion recognition method based on audio data according to claim 2, wherein determining the inter-model emotion intensity differences of each sample from the emotion intensities output by the historical version models, to obtain the sample's total inter-model emotion difference, comprises: for each sample, computing the difference between the emotion intensity output by the latest historical version model and the emotion intensity output by each of the other historical version models among the target number of historical version models, to obtain the pairwise inter-model emotion intensity differences; and combining all of the pairwise differences to obtain the sample's total inter-model emotion difference.
- 4. The emotion recognition method based on audio data according to claim 2, wherein screening the audio service samples according to their total inter-model emotion differences comprises: sorting the audio service samples by total inter-model emotion difference from largest to smallest and selecting a preset number of samples; or selecting the audio service samples whose total inter-model emotion difference exceeds a preset difference threshold.
- 5. The emotion recognition method based on audio data according to claim 2, wherein training the latest historical version model with the incremental training data comprises: dividing the incremental training data into audio training data and audio test data in a preset proportion, and training the latest historical version model with the audio training data; inputting the audio test data into the trained latest historical version model to obtain model output results, the model output results comprising the emotion category and prediction probability corresponding to each piece of audio test data; randomly selecting, from the model output results, audio test data covering all emotion categories as test samples to construct a sample set; pairing all test samples of the same emotion category in the sample set in unordered pairs, and for each pair comparing emotion intensities once according to the corresponding intensity labels and once according to the corresponding prediction probabilities; and when the comparison result based on the intensity labels is determined to be consistent with the comparison result based on the prediction probabilities, taking the prediction probability in the model output result as the emotion intensity value (a sketch of this consistency check follows the claims).
- 6. The emotion recognition method based on audio data according to claim 1, wherein acquiring audio data in a target service scene comprises: acquiring original audio data in the target service scene; extracting a Mel spectrogram from the original audio data; dividing the Mel spectrogram into blocks and flattening each block to obtain a vector feature sequence; and adding a target token for emotion classification at a target position of the vector feature sequence, and applying positional encoding to the vector feature sequence with the added token to obtain the audio data (a sketch of this pipeline follows the claims).
- 7. The emotion recognition method based on audio data according to claim 1, wherein the emotion recognition model comprises: an encoder layer for capturing, based on a multi-head attention mechanism, the context information of each audio feature in the input audio data and combining it with the corresponding audio features to obtain attention features; a classification layer for predicting emotion probabilities from the attention features to obtain an emotion category prediction probability distribution; and an output layer for determining the emotion type from the emotion category prediction probability distribution as the emotion type recognition result, and taking the prediction probability corresponding to that emotion type as the emotion intensity prediction result.
- 8. The emotion recognition method based on audio data according to claim 7, wherein the encoder layer comprises at least two stacked encoders, each encoder comprising a multi-head attention sub-layer, a feed-forward neural network, and a random-inactivation (dropout) sub-layer, and the encoder layer extracts features from the audio data through the stacked encoders by: in each encoder, determining the attention weights among the audio features using the multi-head attention mechanism, and applying a residual connection with the corresponding audio features to obtain context feature vectors; applying layer normalization to each context feature vector, and applying a nonlinear transformation to the normalized context feature vectors with the feed-forward neural network to extract high-dimensional features; and randomly discarding, through the dropout sub-layer according to a preset discard probability, some neurons in the high-dimensional features to obtain the attention features; and wherein the classification layer comprises a fully connected sub-layer and a Softmax activation function, and is configured to: map the attention features into a preset emotion-category dimension space with the fully connected sub-layer to obtain a prediction score (logits) for each emotion category; and normalize the prediction scores with the Softmax activation function, converting them into probability values within a preset range to obtain the emotion category prediction probability distribution (a sketch of this structure follows the claims).
- 9. An emotion recognition device based on audio data, comprising: a data acquisition module for acquiring audio data in a target service scene; and an emotion recognition module for inputting the audio data into an emotion recognition model to obtain an emotion type recognition result and an emotion intensity prediction result output by the emotion recognition model; wherein the emotion recognition model is obtained by training the latest historical version model based on incremental training data and the emotion labels and intensity labels corresponding to the incremental training data; the incremental training data is constructed by using a target number of historical version models to respectively perform emotion intensity prediction on audio service data corresponding to a newly added target service scene, and screening the audio service data according to the predicted emotion intensities; and among the target number of historical version models, the historical version model of each later version is obtained by incremental training based on the historical version model of the previous version.
- 10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the emotion recognition method based on audio data of any one of claims 1 to 8.
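The disagreement-based screening of claims 2-4 can be illustrated with a minimal sketch. This is a hedged reading, not the patent's implementation: each historical version model is assumed to be a callable returning a scalar emotion intensity (`screen_samples`, `top_k`, and `diff_threshold` are hypothetical names), with the last model in the list taken as the latest version.

```python
from typing import Callable, List, Optional, Sequence

def screen_samples(
    samples: Sequence,
    models: List[Callable],   # target number of historical version models;
                              # models[-1] is assumed to be the latest version
    top_k: Optional[int] = None,
    diff_threshold: Optional[float] = None,
) -> list:
    """Score each audio service sample by total inter-model emotion
    difference (claim 3), then screen by top-k or by threshold (claim 4)."""
    latest = models[-1]
    scored = []
    for sample in samples:
        latest_intensity = latest(sample)
        # claim 3: difference between the latest model's intensity and each
        # other model's intensity, combined into a per-sample total
        total_diff = sum(abs(latest_intensity - m(sample)) for m in models[:-1])
        scored.append((total_diff, sample))
    if top_k is not None:
        # claim 4, first option: sort from largest to smallest, keep top_k
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [s for _, s in scored[:top_k]]
    # claim 4, second option: keep samples above a preset difference threshold
    return [s for d, s in scored if d > diff_threshold]
```

Samples on which the model versions disagree most are exactly those the old versions handle least consistently, which is why they are the ones worth labeling and adding to the incremental training data.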
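The consistency check of claim 5 reduces to comparing two orderings over same-category test-sample pairs. Below is a minimal sketch assuming a simple record per test sample; the field names (`category`, `label_intensity`, `pred_prob`) are illustrative assumptions, not the patent's terminology.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations

@dataclass
class TestSample:
    category: str           # emotion category from the model output result
    label_intensity: float  # annotated intensity label
    pred_prob: float        # model's prediction probability for the category

def probability_is_valid_intensity(samples) -> bool:
    """Return True when, for every unordered pair of same-category test
    samples, the ordering by intensity label matches the ordering by
    prediction probability, so the probability can serve as the intensity."""
    by_category = defaultdict(list)
    for s in samples:
        by_category[s.category].append(s)
    for group in by_category.values():
        for a, b in combinations(group, 2):  # unordered pairwise pairing
            label_order = (a.label_intensity > b.label_intensity) - (
                a.label_intensity < b.label_intensity)
            prob_order = (a.pred_prob > b.pred_prob) - (a.pred_prob < b.pred_prob)
            if label_order != prob_order:    # orderings disagree
                return False
    return True
```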
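Claim 6's preprocessing pipeline (Mel spectrogram, block division and flattening, a prepended classification token, positional encoding) can be sketched in PyTorch as follows. The patch size, embedding dimension, sample rate, learnable [CLS]-style token, and sinusoidal encoding are all assumptions; the claim fixes only the order of operations.

```python
import math
import torch
import torchaudio

N_MELS, PATCH_FRAMES, EMBED_DIM = 64, 4, 256

mel = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=N_MELS)
project = torch.nn.Linear(N_MELS * PATCH_FRAMES, EMBED_DIM)   # flatten -> embed
cls_token = torch.nn.Parameter(torch.zeros(1, 1, EMBED_DIM))  # the "target token"

def preprocess(waveform: torch.Tensor) -> torch.Tensor:
    """Waveform of shape (1, samples) -> position-encoded tokens (1, n+1, dim)."""
    spec = mel(waveform)                                  # (1, n_mels, frames)
    usable = spec.size(-1) // PATCH_FRAMES * PATCH_FRAMES
    blocks = spec[..., :usable].unfold(-1, PATCH_FRAMES, PATCH_FRAMES)
    blocks = blocks.permute(0, 2, 1, 3).flatten(2)        # flatten each block
    tokens = project(blocks)                              # vector feature sequence
    tokens = torch.cat([cls_token, tokens], dim=1)        # token at target position
    # sinusoidal positional encoding over the full sequence
    pos = torch.arange(tokens.size(1)).unsqueeze(1)
    div = torch.exp(torch.arange(0, EMBED_DIM, 2) * (-math.log(10000.0) / EMBED_DIM))
    pe = torch.zeros(tokens.size(1), EMBED_DIM)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return tokens + pe
```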
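The model structure of claims 7-8 maps closely onto a standard Transformer encoder stack. The sketch below uses PyTorch's built-in encoder layers (which bundle multi-head attention, feed-forward network, dropout, residual connections, and layer normalization); layer counts, widths, and the six-class output are illustrative assumptions.

```python
import torch

class EmotionRecognizer(torch.nn.Module):
    """Encoder layer + classification layer + output layer per claims 7-8."""
    def __init__(self, embed_dim=256, n_heads=8, n_layers=2, n_classes=6,
                 dropout=0.1):
        super().__init__()
        encoder_layer = torch.nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=n_heads, dropout=dropout,
            batch_first=True)              # attention + FFN + dropout sub-layers
        self.encoder = torch.nn.TransformerEncoder(encoder_layer, n_layers)
        self.classifier = torch.nn.Linear(embed_dim, n_classes)  # FC sub-layer

    def forward(self, tokens: torch.Tensor):
        features = self.encoder(tokens)             # attention features
        logits = self.classifier(features[:, 0])    # classify from target token
        probs = torch.softmax(logits, dim=-1)       # prediction probability dist.
        intensity, category = probs.max(dim=-1)     # output layer: argmax class,
        return category, intensity                  # its probability as intensity
```

Returning the winning class's probability as the intensity is exactly the output-layer behavior of claim 7, made legitimate by the consistency check of claim 5.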
Description
Emotion recognition method and device based on audio data and electronic equipment

Technical Field

The present invention relates to the field of artificial intelligence technologies, and in particular to a method and an apparatus for emotion recognition based on audio data, and an electronic device.

Background

With the rapid development of artificial intelligence technology, emotion recognition is ever more widely applied in fields such as human-machine interaction, customer service, and intelligent medical treatment, and has become a key technology for improving user experience and the level of business intelligence. Emotion recognition based on audio content in particular has gradually become an industrial research hotspot, since speech directly reflects a speaker's psychological state and emotional changes. The industry is currently working to improve the accuracy and real-time performance of emotion analysis through deep learning and large-model technology, to meet the demands of complex and changeable business scenes.

Current emotion recognition technology mainly takes three forms. The first extracts emotion labels based on keyword rules: audio is converted to text by speech recognition, and the text content is matched against a predefined emotion keyword library to judge the emotion. The second is based on manually designed features: acoustic features such as fundamental frequency and spectral bandwidth are designed by hand from professional speech knowledge and input into a machine learning model for training and classification. The third directly processes audio data with a general multimodal large model supporting audio input, relying on its strong general understanding capability to output emotion category recognition results.

However, the keyword-rule method relies only on text matching and cannot capture implicit emotions, such as depression or impatience, carried in the speech itself, leading to missed emotions and low accuracy. Models based on manually designed features suffer from inconsistent feature definitions and high maintenance cost, and struggle to comprehensively cover the complicated and changeable ways emotions are expressed in business scenes. General multimodal large models have audio processing capability, but their recognition precision is insufficient in specific business scenes, and they typically occupy more than 24 GB of GPU memory at run time, so compute consumption and deployment cost are high and large-scale business deployment is difficult. In addition, most of these approaches output only an emotion type and lack a quantified emotion intensity, so they cannot meet the individualized needs of different business scenes for emotion threshold adjustment. When a business scene changes, the prediction error rate of the old model rises, and model iteration requires large amounts of data to be re-labeled at great cost, making it hard to adapt quickly to business changes. No existing approach simultaneously meets the core requirements of high precision, low deployment cost, strong scene adaptability, and low iteration cost.
Disclosure of Invention

The invention provides an emotion recognition method and device based on audio data, and an electronic device, to overcome the defect in the prior art that only a single emotion type can be recognized without quantifying the degree of emotion intensity, to capture a user's fine emotional changes more sharply, to achieve accurate recognition of audio emotion, and to support more targeted and humanized service. The invention provides an emotion recognition method based on audio data, comprising: obtaining audio data in a target service scene, and inputting the audio data into an emotion recognition model to obtain an emotion type recognition result and an emotion intensity prediction result output by the emotion recognition model, wherein the emotion recognition model is obtained by training the latest historical version model based on incremental training data and the emotion labels and intensity labels corresponding to the incremental training data; the incremental training data is constructed in advance by using a target number of historical version models to respectively perform emotion intensity prediction on audio service data corresponding to a newly added target service scene, and screening the audio service data according to the predicted emotion intensities; and the historical version model of a later version among the target number of historical version models is obtained by training the historical version model of the previous version in an incremental manner.
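Taken together, a hypothetical end-to-end inference call combining the preprocessing and model sketches above might look like the following; the random waveform is a stand-in for audio from the target service scene, not data from the patent.

```python
import torch

waveform = torch.randn(1, 16000)   # stand-in for one second of 16 kHz audio
model = EmotionRecognizer()
category, intensity = model(preprocess(waveform))
print(f"emotion category index: {category.item()}, "
      f"intensity: {intensity.item():.3f}")
```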