CN-122020421-A - Multi-mode dialogue emotion recognition and intention analysis method and system for intelligent customer service

CN122020421A

Abstract

The application provides a multi-modal dialogue emotion recognition and intention analysis method and system for intelligent customer service, relating to the technical field of natural language processing. A face video stream of a target user is acquired; two-channel feature extraction is performed on the face video stream to obtain a micro-expression vector sequence and a color change signal; the color change signal is analyzed by photoplethysmography to obtain a physiological time-series vector sequence; a joint feature tensor sequence is assembled based on the time stamps; the joint feature tensor sequence is input into a trained capsule network and converted into primary capsule vectors to generate a target capsule set; and the target-layer capsule vector with the largest norm in the target capsule set is determined as the target capsule vector and used to determine the intention of the target user, thereby improving the emotion perception and intention understanding capability of intelligent customer service in multi-modal interaction scenarios.

Inventors

  • CAO JING
  • YUAN LEI
  • YANG YING

Assignees

  • 五维要数智能科技(上海)有限公司
  • 北京凌云光子技术有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-02-27

Claims (10)

  1. A multi-modal dialogue emotion recognition and intention analysis method for intelligent customer service, characterized by comprising the following steps: acquiring a face video stream of a target user, wherein the face video stream comprises a plurality of face image frames and a time stamp corresponding to each face image frame; performing two-channel feature extraction on the face video stream to obtain a micro-expression vector sequence and a color change signal; analyzing the color change signal using photoplethysmography to obtain a pulse wave signal, and calculating heart rate variability from the pulse wave signal to obtain a physiological time-series vector sequence; time-aligning the micro-expression vector sequence and the physiological time-series vector sequence based on the time stamps, and then obtaining a joint feature tensor sequence by calculating the outer product between the micro-expression vectors and the physiological time-series vectors; inputting the joint feature tensor sequence into a trained capsule network, converting it into primary capsule vectors, and adjusting the connection relations between the primary capsule vectors with a dynamic routing algorithm to generate a target capsule set; and determining the target-layer capsule vector with the largest norm in the target capsule set as the target capsule vector, determining the emotion category corresponding to the target capsule vector as the target emotion category, and querying a preset business logic library based on the target capsule vector to determine the intention of the target user. (Illustrative code sketches of the steps in claims 1 to 7 follow the claims section.)
  2. The method according to claim 1, wherein after time-aligning the micro-expression vector sequence and the physiological time-series vector sequence based on the time stamps, and before obtaining the joint feature tensor sequence by calculating the outer product between the micro-expression vectors and the physiological time-series vectors, the method further comprises: acquiring the speech recognition text generated during the interaction between the intelligent customer service and the target user; determining the current dialogue stage from preset business dialogue stages based on intention keywords and emotion words in the speech recognition text; and determining weight information in a preset weight distribution strategy according to the current dialogue stage, and dynamically weighting the micro-expression vectors and the physiological time-series vectors respectively according to the weight information, wherein the weight information comprises first weight information and second weight information.
  3. The method according to claim 1, further comprising: calculating, for each joint feature tensor in the joint feature tensor sequence, a first change-trend value of the corresponding micro-expression vector and a second change-trend value of the corresponding physiological time-series vector; and calculating the difference between the first change-trend value and the second change-trend value, and constructing a target feature tensor sequence from the joint feature tensors for which the difference remains continuously larger than a preset conflict threshold; wherein inputting the joint feature tensor sequence into the trained capsule network, converting it into primary capsule vectors, and adjusting the connection relations between the primary capsule vectors with the dynamic routing algorithm to generate the target capsule set comprises: inputting the target feature tensor sequence into the trained capsule network, converting it into primary capsule vectors, and adjusting the connection relations between the primary capsule vectors with the dynamic routing algorithm to generate the target capsule set.
  4. The method according to claim 1, wherein performing two-channel feature extraction on the face video stream to obtain the micro-expression vector sequence and the color change signal comprises: calculating the position offsets of face key points between adjacent face image frames in the face video stream; calculating, based on the position offsets, a micro-expression vector corresponding to each face image frame, and arranging all the micro-expression vectors to obtain the micro-expression vector sequence; and calculating the color change intensity of a target skin region in the face video stream on the RGB color channels to obtain the color change signal.
  5. The method according to claim 1, wherein obtaining the joint feature tensor sequence by calculating the outer product between the micro-expression vectors and the physiological time-series vectors after time-aligning the two sequences based on the time stamps comprises: pairing each micro-expression vector in the micro-expression vector sequence with the physiological time-series vector having the same time stamp in the physiological time-series vector sequence to obtain a plurality of vector pairs; calculating the outer product of the micro-expression vector and the physiological time-series vector in each vector pair to obtain a plurality of joint feature tensors; and ordering all the joint feature tensors chronologically to obtain the joint feature tensor sequence.
  6. The method according to claim 1, wherein inputting the joint feature tensor sequence into the trained capsule network, converting it into primary capsule vectors, and adjusting the connection relations between the primary capsule vectors with the dynamic routing algorithm to generate the target capsule set comprises: inputting the joint feature tensor sequence into the trained capsule network to obtain a plurality of primary capsule vectors, wherein the capsule network comprises a primary capsule layer and a high-level capsule layer, and the high-level capsule layer comprises a plurality of high-level capsule vectors each mapped to a different emotion category; iteratively adjusting the connection weight between each primary capsule vector in the primary capsule layer and the different high-level capsule vectors in the high-level capsule layer using the dynamic routing algorithm to obtain adjusted connection weights; performing, according to the adjusted connection weights, a weighted summation of all primary capsule vectors corresponding to each high-level capsule vector to obtain a plurality of target-layer capsule vectors; and constructing the target capsule set from all the target-layer capsule vectors.
  7. The method according to claim 1, wherein the preset business logic library comprises a plurality of intention mappings, each intention mapping associating one emotion category, one matching condition on an instance parameter vector, and one business intention.
  8. A multi-modal dialogue emotion recognition and intention analysis system for intelligent customer service, characterized by comprising: an acquisition module, configured to acquire a face video stream of a target user, wherein the face video stream comprises a plurality of face image frames and a time stamp corresponding to each face image frame; an extraction module, configured to perform two-channel feature extraction on the face video stream to obtain a micro-expression vector sequence and a color change signal; a calculation module, configured to analyze the color change signal using photoplethysmography to obtain a pulse wave signal, and to calculate heart rate variability from the pulse wave signal to obtain a physiological time-series vector sequence; the calculation module being further configured to obtain a joint feature tensor sequence by calculating the outer product between the micro-expression vectors and the physiological time-series vectors after time-aligning the two sequences based on the time stamps; an adjustment module, configured to input the joint feature tensor sequence into a trained capsule network, convert it into primary capsule vectors, and adjust the connection relations between the primary capsule vectors with a dynamic routing algorithm to generate a target capsule set; and a determination module, configured to determine the target-layer capsule vector with the largest norm in the target capsule set as the target capsule vector, determine the emotion category corresponding to the target capsule vector as the target emotion category, and determine the intention of the target user by querying a preset business logic library based on the target capsule vector.
  9. An electronic device, comprising: a memory for storing a computer program; and a processor configured, when executing the computer program, to implement the steps of the intelligent-customer-service-oriented multi-modal dialogue emotion recognition and intention analysis method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the intelligent-customer-service-oriented multi-modal dialogue emotion recognition and intention analysis method according to any one of claims 1 to 7.
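
A minimal sketch of the two-channel extraction in claim 4. The keypoint array layout, the fixed rectangular skin region, and the use of simple frame-to-frame offsets are illustrative assumptions; the patent does not fix a keypoint detector or a region-selection method.

```python
import numpy as np

def microexpression_vectors(keypoints):
    """keypoints: (T, K, 2) array of face keypoint (x, y) positions per frame.
    Returns (T-1, 2K) micro-expression vectors built from per-frame offsets."""
    offsets = np.diff(keypoints, axis=0)          # (T-1, K, 2) frame-to-frame motion
    return offsets.reshape(offsets.shape[0], -1)  # flatten to one vector per frame

def color_change_signal(frames, roi):
    """frames: (T, H, W, 3) RGB video; roi: (y0, y1, x0, x1) skin region.
    Returns (T, 3): mean RGB intensity of the skin patch in each frame."""
    y0, y1, x0, x1 = roi
    patch = frames[:, y0:y1, x0:x1, :].astype(np.float64)
    return patch.mean(axis=(1, 2))                # spatial average per color channel
```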
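For the photoplethysmography step of claim 1, a common remote-PPG recipe is to band-pass the skin-color trace around plausible pulse frequencies, detect beats, and derive HRV statistics over sliding windows. The band limits, window lengths, and the [heart rate, SDNN, RMSSD] feature choice below are assumptions; the patent only names "heart rate variability".

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def physiological_vectors(green, fs=30.0, win=10.0, hop=1.0):
    """green: 1-D mean green-channel trace; fs: video frame rate (Hz).
    Returns (N, 3) vectors [heart_rate_bpm, SDNN_ms, RMSSD_ms] per window."""
    b, a = butter(3, [0.7, 4.0], btype="band", fs=fs)   # pulse band, roughly 42-240 bpm
    pulse = filtfilt(b, a, green)                        # recovered pulse wave signal
    out, w, h = [], int(win * fs), int(hop * fs)
    for s in range(0, len(pulse) - w + 1, h):
        seg = pulse[s:s + w]
        peaks, _ = find_peaks(seg, distance=int(0.4 * fs))   # >= 0.4 s between beats
        if len(peaks) < 3:
            continue                                     # too few beats for HRV stats
        ibi = np.diff(peaks) / fs * 1000.0               # inter-beat intervals (ms)
        out.append([60000.0 / ibi.mean(),                # heart rate (bpm)
                    ibi.std(),                           # SDNN (ms)
                    np.sqrt(np.mean(np.diff(ibi) ** 2))])  # RMSSD (ms)
    return np.asarray(out)
```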
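Claim 2's stage-dependent weighting might look like the following, where the stage lexicon and the weight table are hypothetical placeholders; the patent leaves the keyword sets and the weight values unspecified.

```python
# Hypothetical stage lexicon and weight table (not taken from the patent).
STAGE_KEYWORDS = {"consult": ["how", "what", "rate"],
                  "complain": ["refund", "angry", "wrong"]}
STAGE_WEIGHTS = {"consult": (0.6, 0.4), "complain": (0.4, 0.6)}  # (micro-expr, physio)

def weighted_pair(asr_text, micro_vec, physio_vec, default=(0.5, 0.5)):
    """Pick a dialogue stage from the ASR text, then scale the two modal vectors
    with that stage's first and second weight information."""
    stage = next((s for s, kws in STAGE_KEYWORDS.items()
                  if any(k in asr_text.lower() for k in kws)), None)
    w1, w2 = STAGE_WEIGHTS.get(stage, default)
    return w1 * micro_vec, w2 * physio_vec
```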
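The outer-product fusion of claim 5 is direct: pair vectors that share a timestamp, take the outer product, and order the results by time. A sketch, assuming each sequence is stored as a timestamp-keyed dict:

```python
import numpy as np

def joint_feature_tensors(micro_seq, physio_seq):
    """micro_seq, physio_seq: dicts mapping timestamp -> vector.
    Returns a chronological list of outer-product tensors, each of
    shape (len(micro_vector), len(physio_vector))."""
    shared = sorted(set(micro_seq) & set(physio_seq))    # matching timestamps only
    return [np.outer(micro_seq[t], physio_seq[t]) for t in shared]
```

The outer product of a d1-dimensional micro-expression vector and a d2-dimensional physiological vector yields a d1 by d2 tensor, so the capsule network sees every cross-modal feature pair rather than a mere concatenation, which is the nonlinear coupling the background section says shallow linear fusion misses.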
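Claim 3 does not define the "change-trend value"; the sketch below substitutes the gradient of the per-step vector norms as a stand-in, and keeps only the joint tensors over spans where the two trends disagree by more than the threshold for a run of consecutive steps.

```python
import numpy as np

def conflict_tensors(micro_vecs, physio_vecs, tensors, thresh=0.5, run=3):
    """micro_vecs: (N, d1), physio_vecs: (N, d2), tensors: list of N joint tensors.
    Trend is approximated as the gradient of the vector norms (an assumption)."""
    t1 = np.gradient(np.linalg.norm(micro_vecs, axis=1))   # first change-trend value
    t2 = np.gradient(np.linalg.norm(physio_vecs, axis=1))  # second change-trend value
    conflict = np.abs(t1 - t2) > thresh
    keep, streak = set(), 0
    for i, c in enumerate(conflict):
        streak = streak + 1 if c else 0
        if streak >= run:                      # "continuously larger" than threshold
            keep.update(range(i - run + 1, i + 1))
    return [tensors[i] for i in sorted(keep)]
```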
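The routing step of claim 6 matches standard capsule-network routing-by-agreement (in the style of Sabour et al.). Below is a minimal NumPy sketch; the transform tensor W, the capsule dimensions, and the three routing iterations are assumed rather than taken from the patent.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Capsule squashing non-linearity: preserves direction, maps norm into [0, 1)."""
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(primary, W, iters=3):
    """primary: (P, d_in) primary capsule vectors; W: (E, P, d_out, d_in) learned
    transforms per (emotion capsule, primary capsule) pair. Returns (E, d_out)
    target-layer capsule vectors after routing-by-agreement."""
    u_hat = np.einsum("epij,pj->epi", W, primary)      # prediction vectors (E, P, d_out)
    b = np.zeros(W.shape[:2])                          # routing logits (E, P)
    for _ in range(iters):
        e = np.exp(b - b.max(axis=0, keepdims=True))
        c = e / e.sum(axis=0, keepdims=True)           # coupling weights per primary capsule
        v = squash(np.einsum("ep,epi->ei", c, u_hat))  # candidate target-layer capsules
        b = b + np.einsum("epi,ei->ep", u_hat, v)      # reward prediction/output agreement
    return v

# Claimed decision rule: the target-layer capsule with the largest norm wins.
# target_id = int(np.argmax(np.linalg.norm(dynamic_routing(u, W), axis=1)))
```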
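Finally, claim 7's business logic library reduces to a rule table keyed on emotion category plus a condition over the capsule's instance parameters. Every name, condition, and intent string below is invented for illustration only.

```python
import numpy as np

# Hypothetical intention mappings: (emotion category, condition on the
# instance parameter vector, business intention).
INTENT_RULES = [
    ("anxious", lambda v: np.linalg.norm(v) > 0.8, "escalate_to_human_agent"),
    ("anxious", lambda v: np.linalg.norm(v) <= 0.8, "offer_reassurance"),
    ("neutral", lambda v: True, "continue_standard_flow"),
]

def resolve_intent(emotion, capsule_vec, default="continue_standard_flow"):
    """Return the first business intention whose emotion and condition both match."""
    for emo, cond, intent in INTENT_RULES:
        if emo == emotion and cond(capsule_vec):
            return intent
    return default
```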

Description

Multi-mode dialogue emotion recognition and intention analysis method and system for intelligent customer service

Technical Field

The application relates to the technical field of natural language processing, and in particular to a multi-modal dialogue emotion recognition and intention analysis method and system for intelligent customer service.

Background

In non-contact interaction scenarios such as intelligent video customer service, virtual teller machines, and remote financial business handling, accurate perception of the user's state is key to improving service quality. The system must not only respond to the user's explicit business instructions but also gain deep insight into the user's emotional changes during the interaction, so as to accurately anticipate, and proactively intervene in, the user's latent business intentions.

Existing mainstream emotion recognition schemes generally combine facial expression analysis with speech emotion recognition, attempting to enhance the robustness of the judgment with multi-modal data. These methods typically extract visual texture features and acoustic prosodic features from the video and audio streams respectively, perform a simple modal fusion via feature concatenation or a shallow attention mechanism, and then map the result to preset emotion categories. However, methods relying only on apparent behavior struggle to capture complex psychological states that the user intentionally suppresses or disguises, and misjudgments readily occur when the user's outward expression differs from their inner state. Meanwhile, existing shallow linear fusion cannot effectively model the complex nonlinear coupling between external expression and internal physiological activity, so the analysis of deep emotion lacks depth and cannot meet the strict requirements for intention anticipation in high-risk business scenarios. The prior art therefore suffers from low intention analysis accuracy because masked, complex emotions are difficult to identify accurately.

Disclosure of Invention

The application aims to provide a multi-modal dialogue emotion recognition and intention analysis method and system for intelligent customer service, to solve the technical problem in the prior art that intention analysis accuracy is low because masked, complex emotions are difficult to identify accurately.
In a first aspect, the application provides a multi-modal dialogue emotion recognition and intention analysis method for intelligent customer service, including: acquiring a face video stream of a target user, wherein the face video stream comprises a plurality of face image frames and a time stamp corresponding to each face image frame; performing two-channel feature extraction on the face video stream to obtain a micro-expression vector sequence and a color change signal; analyzing the color change signal using photoplethysmography to obtain a pulse wave signal, and calculating heart rate variability from the pulse wave signal to obtain a physiological time-series vector sequence; time-aligning the micro-expression vector sequence and the physiological time-series vector sequence based on the time stamps, and obtaining a joint feature tensor sequence by calculating the outer product between the micro-expression vectors and the physiological time-series vectors; inputting the joint feature tensor sequence into a trained capsule network, converting it into primary capsule vectors, and adjusting the connection relations between the primary capsule vectors with a dynamic routing algorithm to generate a target capsule set; and determining the target-layer capsule vector with the largest norm in the target capsule set as the target capsule vector, determining the emotion category corresponding to the target capsule vector as the target emotion category, and querying a preset business logic library based on the target capsule vector to determine the intention of the target user.

Optionally, before the step of obtaining the joint feature tensor sequence by calculating the outer product between the micro-expression vectors and the physiological time-series vectors after time-aligning the two sequences based on the time stamps, the method further includes: acquiring the speech recognition text generated during the interaction between the intelligent customer service and the target user; determining the current dialogue stage from preset business dialogue stages based on intention keywords and emotion words in the speech recognition text; according to the current dialogue