CN-122024769-A - Emotion recognition and response method and system based on AI intelligent customer service
Abstract
The application relates to an emotion recognition and response method based on AI intelligent customer service, comprising: collecting a current session of a user and preprocessing it to obtain text data and voice data; encoding the text data and the voice data through a pre-trained hierarchical multi-modal Transformer model; dynamically assigning weights to the text encoding result and the voice feature encoding result according to the topic and historical emotion intensity of the current session by using a context-aware gated attention mechanism, and performing multi-modal weighted fusion to generate an emotion state result; and outputting target response content and response actions through a personalized response policy network based on the emotion state result, a user portrait vector, and the current session state. The method improves the flexibility and robustness of emotion perception in complex interaction scenes and ensures real-time response capability in high-concurrency scenes.
Inventors
- Jin Linglin
- Yu Feng
- Tao Dahao
Assignees
- Dangqu Network Technology (Hangzhou) Co., Ltd. (当趣网络科技(杭州)有限公司)
Dates
- Publication Date
- 20260512
- Application Date
- 20260127
Claims (10)
- 1. An emotion recognition and response method based on AI intelligent customer service, characterized by comprising the following steps: collecting a current session of a user, and preprocessing the current session to obtain text data and voice data; encoding the text data and the voice data through a pre-trained hierarchical multi-modal Transformer model, dynamically assigning weights to a text encoding result and a voice feature encoding result according to the topic and the historical emotion intensity of the current session by using a context-aware gated attention mechanism, and performing multi-modal weighted fusion to generate an emotion state result; and outputting target response content and response actions through a personalized response policy network based on the emotion state result, a user portrait vector, and the current session state.
- 2. The method of claim 1, wherein encoding the text data and the voice data by the pre-trained hierarchical multi-modal Transformer model comprises: at a single-modal feature encoding layer, extracting contextual semantic vectors of the text data in the current session by using a pre-trained language model, splicing the contextual semantic vectors with shallow text features, and feeding the spliced vectors into a bidirectional long short-term memory network for processing to obtain a text encoding result; and extracting acoustic vectors from the voice data within a sliding window in the current session through a convolutional network combined with a multi-head attention mechanism to obtain a voice encoding result, wherein the multi-head attention mechanism distributes attention according to the energy and pitch variation of each voice frame segment.
- 3. The method of claim 2, wherein using the context-aware gated attention mechanism to dynamically assign weights to the text encoding result and the voice feature encoding result according to the topic and historical emotion intensity of the current session, and performing multi-modal weighted fusion to generate the emotion state result comprises: obtaining a session context vector from the latest multi-round conversation summary, the business label, and the historical emotion intensity, and splicing the session context vector with the text encoding result and the voice encoding result respectively to obtain stacked feature vectors; processing the stacked feature vectors through a gating network to obtain a modal score vector, and normalizing the modal score vector to obtain a text modal weight and a voice modal weight; and performing multi-modal weighted fusion of the text encoding result and the voice encoding result based on the text modal weight and the voice modal weight to obtain a fused feature representation, and mapping the fused feature representation to an emotion state result.
- 4. The method of claim 2, wherein the hierarchical multi-modal Transformer model is optimized using knowledge distillation, wherein: a complete model trained on a general emotion dataset and a desensitized customer-service-scene dataset is defined as a teacher model, and a lightweight student model with fewer parameters than the teacher model is constructed; and knowledge of the teacher model is migrated to the lightweight student model through knowledge distillation to obtain the hierarchical multi-modal Transformer model for real-time emotion inference.
- 5. The method of claim 1, wherein outputting the target response content and the response action through the personalized response policy network comprises: constructing a triplet input comprising the current emotion state, the user portrait vector, and the current session state, wherein the user portrait vector comprises the user's purchase records, loyalty, and number of historical complaints, and the current session state comprises the currently solved problems and the currently unresolved problems; and inputting the triplet into the personalized response policy network for processing, outputting a probability distribution of the optimal response text, and determining the target response content according to the probability distribution.
- 6. The method of claim 5, wherein, in outputting the target response content and the response action through the personalized response policy network, the method further comprises: constructing a structured knowledge graph for the intelligent customer service domain, wherein the knowledge graph comprises problem-cause-solution triplets; during generation of the response content and the response action, locating a solution related to the current session topic by means of knowledge graph queries; and fusing the solution into the target response content, with a source link of the solution attached to the target response content.
- 7. The method according to claim 1, wherein, after outputting the target response content and the response action through the personalized response policy network, a response upgrade mechanism is triggered if the emotion state result does not fall below a first preset threshold, specifically comprising: if the user's emotion type is anger or anxiety, and the emotion intensity value is rising or exceeds a second preset threshold after the response content is received, upgrading the response strategy from standard soothing to deep de-escalation intervention or recommendation of a high-authority solution; and triggering a security audit flow, recording the session log, and sending an early-warning notice to a management terminal.
- 8. An emotion recognition and response system based on AI intelligent customer service, characterized by comprising an acquisition module, a recognition module, and a response module, wherein: the acquisition module is configured to acquire a current session of a user and preprocess the current session to extract voice features and text features; the recognition module is configured to encode the voice features and the text features through a pre-trained hierarchical multi-modal Transformer model, dynamically assign weights to the text feature encoding result and the voice feature encoding result according to the topic and historical emotion intensity of the current session by using a context-aware gated attention mechanism, and perform multi-modal weighted fusion to generate an emotion state result; and the response module is configured to output target response content and response actions through the personalized response policy network based on the emotion state result, a user portrait vector, and the current session state.
- 9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1 to 7 when executing the computer program.
- 10. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any one of claims 1 to 7.
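The gated fusion of claim 3 can be illustrated concretely: splice a session context vector with each modality's encoding, score both stacked vectors with a gating network, normalize the scores into modal weights, and blend the encodings. This is a minimal NumPy sketch under assumed shapes, not the patented implementation; all names (`gated_fusion`, `W_gate`, the dimensions) are illustrative.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_fusion(text_enc, voice_enc, context_vec, W_gate, b_gate=0.0):
    """Context-aware gated fusion sketch (claim 3).

    Splices the session context vector with each modality encoding to
    form the "stacked feature vectors", scores them with a shared
    gating layer, normalizes into modal weights, and blends.
    """
    stacked = np.stack([
        np.concatenate([context_vec, text_enc]),    # context + text
        np.concatenate([context_vec, voice_enc]),   # context + voice
    ])                                   # shape (2, ctx_dim + enc_dim)
    scores = stacked @ W_gate + b_gate   # modal score vector, shape (2,)
    w_text, w_voice = softmax(scores)    # normalized modal weights
    fused = w_text * text_enc + w_voice * voice_enc
    return fused, (w_text, w_voice)

# Illustrative dimensions: 8-d modality encodings, 4-d context vector.
rng = np.random.default_rng(0)
text_enc = rng.normal(size=8)
voice_enc = rng.normal(size=8)
context_vec = rng.normal(size=4)
W_gate = rng.normal(size=12)            # ctx_dim + enc_dim = 4 + 8

fused, (w_text, w_voice) = gated_fusion(text_enc, voice_enc, context_vec, W_gate)
```

The fused representation would then be mapped to an emotion state by a classifier head; because the weights come from a softmax over context-conditioned scores, a session whose context signals high historical emotion intensity can shift weight toward the voice modality.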
Description
Emotion recognition and response method and system based on AI intelligent customer service
Technical Field
The application relates to the field of intelligent customer service, and in particular to an emotion recognition and response method, system, computer device, and computer-readable storage medium based on AI intelligent customer service.
Background
Web applications and artificial intelligence techniques are increasingly used in customer service, especially in after-sales service, pre-sales consultation, and complaint handling. Traditional intelligent customer service systems are mostly based on rule matching or single-modal text analysis, replying to user inquiries through keyword matching or simple semantic understanding. In the related art, intelligent customer service systems commonly suffer from limited and lagging emotion recognition. Existing solutions rely mainly on text for emotion analysis and ignore the rich emotional information carried by acoustic characteristics such as intonation, speaking rate, and volume. For example, the same utterance "fine" ("好吧") may be classified as neutral or agreeable from the text alone, but combined with an exasperated or angry intonation its actual meaning is quite different. This lack of modalities makes it difficult for the system to capture the user's true emotional state, especially in complex scenarios involving sarcasm, agitation, or suppressed anger, resulting in a high misrecognition rate. Moreover, in the related art each round of dialogue is typically treated as an independent event, lacking awareness of long-term context and the user's historical interaction habits.
When generating a response, a standardized preset script is often adopted and cannot be adjusted dynamically according to the user's personalized portrait (such as loyalty and historical complaint records) and current emotion intensity. Such one-size-fits-all mechanical responses easily provoke user discontent, escalating complaints and increasing the intervention frequency and workload of human agents. For the problem of low emotion recognition accuracy in Web intelligent customer service in the related art, no effective solution has been proposed so far.
Disclosure of Invention
The embodiments of the application provide an emotion recognition and response method, system, computer device, and computer-readable storage medium based on AI intelligent customer service, which at least solve the problem of low emotion recognition accuracy of Web intelligent customer service in the related art. In a first aspect, an embodiment of the present application provides an emotion recognition and response method based on AI intelligent customer service, the method comprising: collecting a current session of a user, and preprocessing the current session to obtain text data and voice data; encoding the text data and the voice data through a pre-trained hierarchical multi-modal Transformer model, dynamically assigning weights to a text encoding result and a voice feature encoding result according to the topic and the historical emotion intensity of the current session by using a context-aware gated attention mechanism, and performing multi-modal weighted fusion to generate an emotion state result; and outputting target response content and response actions through a personalized response policy network based on the emotion state result, a user portrait vector, and the current session state.
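The three steps of the first aspect can be sketched as a pipeline in which each stage is a pluggable callable. This is an illustrative sketch only: every interface here (`preprocess`, `encode`, `fuse`, `policy` and their signatures) is an assumption, not taken from the patent.

```python
def run_pipeline(session, preprocess, encode, fuse, policy):
    """One pass of the claimed method, with each stage injected as a callable.

    preprocess: raw session -> (text data, voice data)         (step 1)
    encode:     (text, voice) -> (text encoding, voice encoding) (step 2a)
    fuse:       (text enc, voice enc) -> emotion state result    (step 2b)
    policy:     emotion state -> (response content, response action) (step 3)
    """
    text, voice = preprocess(session)           # step 1: split modalities
    text_enc, voice_enc = encode(text, voice)   # step 2a: hierarchical encoding
    emotion = fuse(text_enc, voice_enc)         # step 2b: gated fusion -> state
    content, action = policy(emotion)           # step 3: personalized response
    return content, action

# Stubbed usage: the policy escalates when emotion intensity stays high.
content, action = run_pipeline(
    "raw user session",
    lambda s: ("text", "voice"),
    lambda t, v: ([0.1], [0.2]),
    lambda te, ve: {"label": "anger", "intensity": 0.8},
    lambda e: ("soothing reply",
               "escalate" if e["intensity"] > 0.7 else "standard"),
)
```

Structuring the stages as injected callables mirrors the modular split of claim 8 (acquisition, recognition, and response modules) and lets the heavy models (e.g. the distilled student of claim 4) be swapped in without changing the control flow.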
In some of these embodiments, encoding the text data and the voice data by the pre-trained hierarchical multi-modal Transformer model includes: at a single-modal feature encoding layer, extracting contextual semantic vectors of the text data in the current session by using a pre-trained language model, splicing the contextual semantic vectors with shallow text features, and feeding the spliced vectors into a bidirectional long short-term memory network for processing to obtain a text encoding result; and extracting acoustic vectors from the voice data within a sliding window in the current session through a convolutional network combined with a multi-head attention mechanism to obtain a voice encoding result, wherein the multi-head attention mechanism distributes attention according to the energy and pitch variation of each voice frame segment. In some embodiments, using the context-aware gated attention mechanism to dynamically assign weights to the text encoding result and the voice feature encoding result and perform multi-modal weighted fusion according to the topic and historical emotion intensity of the current session, generating the emotion state result includes: according to the latest multi-round conversation summary, the business label and the historical emotio