
CN-122024340-A - Emotion recognition method, emotion recognition device, emotion recognition apparatus, emotion recognition medium, and emotion recognition program product

CN 122024340 A

Abstract

The application provides an emotion recognition method, apparatus, device, medium, and program product, applicable to the technical field of artificial intelligence. The method comprises: in response to acquiring at least two types of modal data from among the voice, facial images, or physiological data of a target object, calculating a quality score for each type of modal data; determining a fusion weight for each type of modal data according to its quality score, the fusion weight being positively correlated with the quality score; extracting basic features of each type of modal data, inputting the basic features into a corresponding pre-trained feature extraction network, and outputting enhanced features for each modality, the number of convolution-layer channels of the corresponding feature extraction network being positively correlated with the fusion weight; and, after aligning the enhanced features of each modality, fusing them using the fusion weights, inputting the fused features into a classifier, and outputting an emotion recognition result.

Inventors

  • Pu Haoyang

Assignees

  • China Construction Bank Corporation
  • CCB Fintech Co., Ltd.

Dates

Publication Date
2026-05-12
Application Date
2025-12-26

Claims (11)

  1. An emotion recognition method, comprising: in response to acquiring at least two types of modal data from among the voice, facial images, or physiological data of a target object, calculating a quality score for each type of modal data, wherein the quality score reflects the quality of the modal data, and the physiological data comprises at least one of heart rate variability, galvanic skin response, respiratory rate, or body surface temperature; determining a fusion weight for each type of modal data according to its quality score, wherein the fusion weight is positively correlated with the quality score; extracting basic features of each type of modal data, inputting the basic features into a corresponding pre-trained feature extraction network, and outputting enhanced features for each modality, wherein the number of convolution-layer channels of the corresponding feature extraction network is positively correlated with the fusion weight; and after aligning the enhanced features of each modality, fusing them using the fusion weights, inputting the fused features into a classifier, and outputting an emotion recognition result (this final fusion-and-classification step is sketched in code after the claims).
  2. The method of claim 1, wherein determining the fusion weight for each type of modal data according to its quality score comprises: summing the quality score of the modal data with a preset equalization coefficient to obtain a modal quality equalization score for that modality; and obtaining the fusion weight for each type of modal data as the ratio of that modality's quality equalization score to the total of the quality equalization scores of all modalities.
  3. The method according to claim 1 or 2, wherein determining the fusion weight for each type of modal data according to its quality score comprises: in response to acquiring the quality score of the modal data at the current moment, acquiring the fusion weight of the corresponding modal data at the previous moment according to a preset time interval; and obtaining the fusion weight of the modal data at the current moment by exponential smoothing, according to the quality score of the modal data, the fusion weight of the corresponding modal data at the previous moment, and an adaptive coefficient.
  4. The method according to claim 1 or 2, wherein the feature extraction network for each modality is trained by: obtaining a total loss function according to the loss function of each modality and the corresponding fusion weight; and training the feature extraction network corresponding to each modality using the total loss function.
  5. The method according to claim 1 or 2, wherein inputting the basic features into the corresponding feature extraction network comprises: in response to acquiring the fusion weight of the modal data, determining the number of channels of the feature extraction network according to a predetermined mapping between the modality's fusion weight and the number of channels of the feature extraction network; and calling the corresponding feature extraction network according to the number of channels and inputting the basic features.
  6. The method according to claim 1 or 2, wherein aligning the enhanced features of each modality comprises: inputting the enhanced features of each modality into a pre-trained generative adversarial network and outputting semantic features for each modality; and converting the semantic features of each modality into vectors of a target dimension through a lightweight multi-layer perceptron.
  7. The method according to claim 1 or 2, further comprising: generating a corresponding response strategy according to the emotion recognition result; providing feedback to the target object according to the response strategy; acquiring at least two types of modal data from among the voice, facial images, or physiological data of the target object after it receives the response strategy, and determining the emotion change of the target object; and evaluating the suitability of the response strategy according to the emotion change.
  8. An emotion recognition device, comprising: a quality scoring module configured to calculate a quality score for each of at least two types of modal data in response to acquiring the voice, facial images, or physiological data of a target object, wherein the quality score reflects the quality of the modal data, and the physiological data includes at least one of heart rate variability, galvanic skin response, respiratory rate, or body surface temperature; a fusion weight determining module configured to determine the fusion weight for each type of modal data according to its quality score, wherein the fusion weight is positively correlated with the quality score; an enhanced feature extraction module configured to extract basic features of the modal data, input the basic features into a corresponding pre-trained feature extraction network, and output enhanced features for each modality, wherein the number of convolution-layer channels of the corresponding feature extraction network is positively correlated with the fusion weight; and an emotion recognition module configured to align the enhanced features of each modality, fuse them using the fusion weights, input the fused features into a classifier, and output an emotion recognition result.
  9. An electronic device, comprising: one or more processors; and storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any one of claims 1-7.
  10. A computer-readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method according to any one of claims 1-7.
  11. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-7.
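
Taken together, the method of claim 1 (and the device of claim 8) ends in a quality-weighted fusion of aligned modality features followed by a classifier. A minimal PyTorch sketch of that final step; the common feature dimension, the emotion label count, and all names are illustrative assumptions, not fixed by the claims:

```python
import torch
import torch.nn as nn

class WeightedFusionClassifier(nn.Module):
    """Fuses aligned per-modality features by a quality-driven weighted sum,
    then classifies. The dimension (128) and label count (7) are assumptions."""

    def __init__(self, dim: int = 128, num_emotions: int = 7):
        super().__init__()
        self.classifier = nn.Linear(dim, num_emotions)

    def forward(self, feats: dict[str, torch.Tensor],
                weights: dict[str, float]) -> torch.Tensor:
        # Weighted sum over modalities; the weights come from quality scores.
        fused = sum(w * feats[m] for m, w in weights.items())
        return self.classifier(fused)

model = WeightedFusionClassifier()
feats = {m: torch.randn(1, 128) for m in ("voice", "face", "physio")}
logits = model(feats, {"voice": 0.5, "face": 0.3, "physio": 0.2})
```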

Description

Emotion recognition method, emotion recognition device, emotion recognition apparatus, emotion recognition medium, and emotion recognition program product

Technical Field

The present application relates to the field of artificial intelligence, and in particular to a method, apparatus, device, medium, and program product for emotion recognition.

Background

With the development of artificial intelligence and human-machine interaction technologies, more and more intelligent devices are capable of understanding and responding to the behavioral demands of users. However, current smart interaction techniques are mostly limited to recognizing and responding to explicit inputs (e.g., voice commands, touch operations), and lack the ability to perceive and adapt to the emotional state of the user. This limits the user experience, especially in scenarios that require highly personalized and emotionally aware responses. A user's emotional state can be identified more accurately by integrating data from multiple modalities; in some scenarios, however, emotion recognition using multimodal data still suffers from problems such as low accuracy.

Disclosure of Invention

In view of the foregoing, the present application provides emotion recognition methods, apparatuses, devices, media, and program products.

According to a first aspect of the application, there is provided an emotion recognition method comprising: in response to acquiring at least two types of modal data from among the voice, facial images, or physiological data of a target object, calculating a quality score for each type of modal data, the quality score reflecting the quality of the modal data, wherein the physiological data comprises at least one of heart rate variability, galvanic skin response, respiratory rate, or body surface temperature; determining a fusion weight for each type of modal data according to its quality score, the fusion weight being positively correlated with the quality score; extracting basic features of each type of modal data, inputting the basic features into a corresponding pre-trained feature extraction network, and outputting enhanced features for each modality, wherein the number of convolution-layer channels of the corresponding feature extraction network is positively correlated with the fusion weight; and after aligning the enhanced features of each modality, fusing them using the fusion weights, inputting the fused features into a classifier, and outputting an emotion recognition result.

According to an embodiment of the application, determining the fusion weight for each type of modal data according to its quality score comprises: summing the quality score of the modal data with a preset equalization coefficient to obtain a modal quality equalization score for that modality; and obtaining the fusion weight for each type of modal data as the ratio of that modality's quality equalization score to the total of the quality equalization scores of all modalities.
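A minimal sketch of this equalization-coefficient weighting; the coefficient value, the example scores, and the function name are assumptions for illustration, not from the patent:

```python
def fusion_weights(quality_scores: dict[str, float], mu: float = 0.1) -> dict[str, float]:
    """Add a preset equalization coefficient mu to each modality's quality
    score, then normalize by the total so the weights sum to 1 and no
    modality's weight collapses to zero."""
    balanced = {m: s + mu for m, s in quality_scores.items()}
    total = sum(balanced.values())
    return {m: b / total for m, b in balanced.items()}

# Yields roughly {'voice': 0.465, 'face': 0.326, 'physio': 0.209}.
weights = fusion_weights({"voice": 0.9, "face": 0.6, "physio": 0.35})
```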
According to an embodiment of the application, determining the fusion weight for each type of modal data according to its quality score comprises: in response to acquiring the quality score of the modal data at the current moment, acquiring the fusion weight of the corresponding modal data at the previous moment according to a preset time interval; and obtaining the fusion weight of the modal data at the current moment by exponential smoothing, according to the quality score of the modal data, the fusion weight of the corresponding modal data at the previous moment, and an adaptive coefficient.

According to an embodiment of the application, the feature extraction network corresponding to each modality is trained by obtaining a total loss function according to the loss function of each modality and the corresponding fusion weight, and training the feature extraction network corresponding to each modality using the total loss function.

According to an embodiment of the application, inputting the basic features into the corresponding feature extraction network comprises: in response to acquiring the fusion weight of the modal data, determining the number of channels of the feature extraction network according to a predetermined mapping between the modality's fusion weight and the number of channels of the feature extraction network; and calling the corresponding feature extraction network according to the number of channels and inputting the basic features.

According to an embodiment of the application, aligning the enhanced features of each modality comprises inputting the enhanced features of each modality into a pre-trained generative adversarial network and outputting semantic features for each modality, and converting the semantic features of each modality into vectors of a target dimension through a lightweight multi-layer perceptron.
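
Three of the embodiment steps above reduce to short formulas: the time-smoothed weight update, the fusion-weighted total training loss, and the weight-to-channel mapping. A sketch of one plausible reading; the exact smoothing form, the loss combination, and the channel bounds are assumptions, since the text only fixes the inputs and the positive correlation:

```python
import torch

def smoothed_weight(score_now: float, weight_prev: float, alpha: float) -> float:
    """Exponential smoothing of a modality's fusion weight: alpha is the
    adaptive coefficient trading the current quality score against the
    previous moment's weight."""
    return alpha * score_now + (1.0 - alpha) * weight_prev

def total_loss(losses: dict[str, torch.Tensor],
               weights: dict[str, float]) -> torch.Tensor:
    """Total training loss: per-modality losses combined with the same
    fusion weights used at inference."""
    return sum(w * losses[m] for m, w in weights.items())

def channels_for_weight(weight: float, min_ch: int = 16, max_ch: int = 128) -> int:
    """A hypothetical mapping from fusion weight to convolution channel
    count; any mapping works as long as it is positively correlated."""
    return int(min_ch + weight * (max_ch - min_ch))
```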