
CN-122022974-A - Bank virtual digital person service system and method integrating multi-modal interaction

CN 122022974 A

Abstract

The invention provides a bank virtual digital person service system and method integrating multi-modal interaction. The system comprises a multi-modal data acquisition end, an edge computing node, a cloud cognitive engine, a human-machine cooperative control center and a dynamic risk gating module. The cloud cognitive engine is provided with a financial intention-emotion coupling perception network and is configured to extract semantic features, visual features and audio emotion features through a text encoder, a visual encoder and an audio encoder respectively; retrieve related financial entity concepts according to the semantic features to generate knowledge embedding vectors; perform multi-head attention calculation between the knowledge embedding vectors and the visual features and the audio emotion features respectively through a knowledge-graph-enhanced cross-attention module; and output user intention vectors and emotion load indexes. The invention significantly improves the service capability and the security of the bank virtual digital person in complex business scenarios.

Inventors

  • CUI YONGJIE
  • CAI MING
  • CHEN HAOWEI
  • YE JUNBIN
  • FANG WEICONG

Assignees

  • 杭州易雅通科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2025-12-05

Claims (8)

  1. A bank virtual digital person service system integrating multi-modal interaction, characterized by comprising: a multi-modal data acquisition end, used for acquiring, in real time, voice data of a user, video data containing facial expressions and body movements, and business operation instruction data; an edge computing node, in communication connection with the multi-modal data acquisition end, used for performing face desensitization processing on the video data, extracting visual feature vectors and uploading the extracted visual feature vectors; a cloud cognitive engine, in communication connection with the edge computing node and provided with a financial intention-emotion coupling perception network, wherein the financial intention-emotion coupling perception network is configured to extract semantic features, visual features and audio emotion features through a text encoder, a visual encoder and an audio encoder respectively, invoke a financial knowledge graph module to retrieve related financial entity concepts according to the semantic features and generate a knowledge embedding vector, perform multi-head attention calculation between the knowledge embedding vector and the visual features and the audio emotion features respectively by using a knowledge-graph-enhanced cross-attention module, and output a user intention vector and an emotion load index fused with financial domain knowledge; and a human-machine cooperative control center provided with a dynamic risk gating module, wherein the dynamic risk gating module is used for receiving the risk level of the current business, the emotion load index and the dialogue confidence, and generating a gating score through weighted calculation; when the gating score is lower than a preset threshold, a virtual digital person driving engine generates feedback according to the user intention vector and drives a digital person model to perform real-time rendering of voice and actions; when the gating score is higher than the preset threshold, a human-machine collaboration process is triggered, and the current dialogue context and emotion labels are streamed to the human agent end in real time.
  2. The system of claim 1, wherein, in the structure of the financial intention-emotion coupling perception network, the text encoder uses a BERT model pre-trained on a financial corpus to extract text semantic features, the visual encoder uses a ResNet combined with a bidirectional long short-term memory network to capture facial micro-expression sequences and body-movement features, the audio encoder uses a wav2vec2.0 model to extract speech rate, pitch jitter and pause features, and the knowledge-graph-enhanced cross-attention module is configured to use the knowledge embedding vector as the query vector and the visual features and the audio emotion features as the key and value vectors to calculate an attention weight matrix, so as to identify the user's non-linguistic emotional response to a specific financial business concept.
  3. The bank virtual digital person service system integrating multi-modal interaction according to claim 2, wherein the financial knowledge graph module is a heterogeneous information network comprising business nodes, risk nodes and emotion mapping nodes; the knowledge-graph-enhanced cross-attention module is further configured to calculate, in the graph, the shortest path distance between the business node corresponding to the semantic features and the risk nodes to generate a business sensitivity weight; and if the business sensitivity weight exceeds a preset value, the attention allocation proportion of the feature maps corresponding to the eyebrow region and the lip region in the visual features is increased in the multi-head attention calculation.
  4. The bank virtual digital person service system integrating multi-modal interaction according to claim 1, wherein the cloud cognitive engine is further provided with an intention consistency check module, the intention consistency check module is configured to calculate a contrastive loss between the user's spoken-expression text features and facial visual features; when the contrastive loss exceeds a safety threshold, the user is judged to be at risk of fraud or coercion, and the human-machine cooperative control center forcibly locks the current business process and sends an abnormal early-warning signal to the human agent end.
  5. The bank virtual digital person service system integrating multi-modal interaction according to claim 1, wherein the computing logic of the dynamic risk gating module is configured as follows: the gating score Gate_s is calculated by the formula Gate_s = σ(W_r·R + W_e·E + W_c·C), where σ is the Sigmoid activation function, R is the risk level value of the business currently being transacted, E is the emotion load index, C is the confidence of the model in understanding the current user's intention, and W_r, W_e and W_c are the corresponding trainable weight parameters respectively; different weight parameter combinations are preset according to the business type, W_e being given a higher weight in low-risk business scenarios and W_r being given a higher weight in high-risk business scenarios.
  6. The system of claim 5, wherein the human-machine collaboration process comprises pushing a data packet to the human agent console when switching to the human agent, the data packet comprising a basic user profile and historical transaction summary, a complete text record of the current session, the specific business breakpoint location that triggered the switch, and the reason tag that caused the gating score to exceed the threshold, the reason tag being selected from the group consisting of emotional agitation, semantic understanding failure, high-risk business validation, and consistency verification failure.
  7. The bank virtual digital person service system integrating multi-modal interaction according to claim 6, wherein the dynamic risk gating module is further configured to execute an online feedback learning mechanism for parameter updating by using the data packet, the online feedback learning mechanism specifically comprising: recording the business processing result after the human agent intervenes; if the human agent marks the reason tag that caused the gating score to exceed the threshold as a misjudgment, constructing a negative sample pair; and reversely updating the trainable weight parameters W_r, W_e and W_c with the negative sample pair through a gradient descent algorithm.
  8. A bank virtual digital person service method based on the bank virtual digital person service system integrating multi-modal interaction as claimed in any one of claims 1 to 7, characterized in that the method comprises the following steps: S1, acquiring, in real time by using the multi-modal data acquisition end, voice data of a user, video data containing facial expressions and body movements, and business operation instruction data; S2, receiving the video data through the edge computing node, performing face desensitization processing locally, extracting visual feature vectors, and uploading the visual feature vectors, the voice data and the business operation instruction data to the cloud cognitive engine; S3, processing the uploaded data by using the financial intention-emotion coupling perception network in the cloud cognitive engine, the processing comprising: extracting semantic features, visual features and audio emotion features through the text encoder, the visual encoder and the audio encoder respectively; invoking the financial knowledge graph module and retrieving related financial entity concepts according to the semantic features to generate a knowledge embedding vector; and operating the knowledge-graph-enhanced cross-attention module to perform multi-head attention calculation between the knowledge embedding vector and the visual features and the audio emotion features respectively, and outputting a user intention vector and an emotion load index fused with financial domain knowledge; S4, receiving, by the dynamic risk gating module of the human-machine cooperative control center, the risk level of the current business, the emotion load index and the dialogue confidence, and generating a gating score through weighted calculation; S5, comparing the gating score with the preset threshold; when the gating score is lower than the preset threshold, generating feedback by the virtual digital person driving engine according to the user intention vector and driving the digital person model to perform real-time rendering of voice and actions; and when the gating score is higher than the preset threshold, triggering the human-machine collaboration process and streaming the current dialogue context and emotion labels to the human agent end in real time.
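
A minimal sketch of the face desensitization step performed at the edge computing node of claim 1, assuming OpenCV is available. The patent does not name a specific desensitization algorithm, so Haar-cascade face detection followed by Gaussian blurring is purely an illustrative assumption; in the claimed system the visual feature vectors are extracted locally, so only feature vectors and desensitized data need to leave the edge node.

```python
# Illustrative sketch only: the patent does not specify the desensitization
# algorithm; Haar-cascade detection plus Gaussian blur is an assumption.
import cv2
import numpy as np

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def desensitize_frame(frame: np.ndarray) -> np.ndarray:
    """Blur detected face regions before the frame leaves the edge node."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in _face_detector.detectMultiScale(gray, 1.1, 5):
        roi = frame[y:y + h, x:x + w]
        frame[y:y + h, x:x + w] = cv2.GaussianBlur(roi, (51, 51), 0)
    return frame
```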
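The knowledge-graph-enhanced cross-attention of claim 2 can be sketched with standard multi-head attention: the knowledge embedding acts as the query while each modality supplies keys and values. The PyTorch module below is only an illustration; the shared 768-dimensional space, the 8 heads, the mean pooling and the two output heads (intention vector, emotion load index) are assumptions, not values given in the patent.

```python
# Minimal sketch of claim 2's knowledge-graph-enhanced cross-attention.
import torch
import torch.nn as nn

class KGCrossAttention(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        # Knowledge embedding is the query; visual / audio features are
        # keys and values, one attention block per modality.
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.intent_head = nn.Linear(2 * dim, dim)   # user intention vector
        self.emotion_head = nn.Linear(2 * dim, 1)    # emotion load index

    def forward(self, knowledge_emb, visual_feats, audio_feats):
        # knowledge_emb: (B, Lk, dim); visual_feats: (B, Lv, dim); audio_feats: (B, La, dim)
        vis_ctx, _ = self.visual_attn(knowledge_emb, visual_feats, visual_feats)
        aud_ctx, _ = self.audio_attn(knowledge_emb, audio_feats, audio_feats)
        fused = torch.cat([vis_ctx.mean(dim=1), aud_ctx.mean(dim=1)], dim=-1)
        intent_vector = self.intent_head(fused)
        emotion_load = torch.sigmoid(self.emotion_head(fused)).squeeze(-1)
        return intent_vector, emotion_load
```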
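Claim 3 derives a business sensitivity weight from the shortest path between the current business node and the risk nodes of the financial knowledge graph, and boosts attention on the eyebrow- and lip-region feature maps when that weight exceeds a preset value. The sketch below assumes the graph is held as a networkx graph and uses an inverse-distance weight plus a simple multiplicative boost; both formulas are assumptions rather than the patent's.

```python
import networkx as nx

def service_sensitivity(graph: nx.Graph, service_node, risk_nodes) -> float:
    """Closer proximity to any risk node in the graph yields a higher weight."""
    distances = []
    for r in risk_nodes:
        try:
            distances.append(nx.shortest_path_length(graph, service_node, r))
        except nx.NetworkXNoPath:
            continue
    if not distances:
        return 0.0
    return 1.0 / (1.0 + min(distances))

def region_attention_boost(base_weights: dict, sensitivity: float,
                           threshold: float = 0.5, boost: float = 1.5) -> dict:
    """If sensitivity exceeds the preset value, raise the attention share of
    the eyebrow and lip feature maps, then re-normalise the shares to sum to 1."""
    w = dict(base_weights)
    if sensitivity > threshold:
        for region in ("eyebrow", "lip"):
            w[region] = w.get(region, 0.0) * boost
        total = sum(w.values())
        w = {k: v / total for k, v in w.items()}
    return w
```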
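Claim 4's intention consistency check compares spoken-text features with facial visual features and locks the business process when their contrastive loss exceeds a safety threshold. The sketch below assumes both feature sets already live in a shared embedding space and uses cosine distance as a stand-in for the unspecified contrastive loss; the threshold value and return format are illustrative.

```python
import torch
import torch.nn.functional as F

def consistency_loss(text_feat: torch.Tensor, visual_feat: torch.Tensor) -> torch.Tensor:
    """Higher values mean the spoken words and the facial response diverge."""
    return 1.0 - F.cosine_similarity(text_feat, visual_feat, dim=-1).mean()

def check_session(text_feat, visual_feat, safety_threshold: float = 0.6):
    loss = consistency_loss(text_feat, visual_feat).item()
    if loss > safety_threshold:
        # In the claimed system this would lock the business process and push
        # an abnormal early-warning signal to the human agent end.
        return {"locked": True, "alert": "consistency_verification_failure", "loss": loss}
    return {"locked": False, "loss": loss}
```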
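Claim 5's gating score is a sigmoid over a weighted sum of the risk level R, emotion load index E and intention confidence C. The sketch below transcribes that formula; the concrete preset weight combinations per business type are invented for illustration, and W_c would plausibly be trained toward a negative value (or C inverted), since higher confidence argues against escalating to a human agent.

```python
import math

PRESET_WEIGHTS = {                       # (W_r, W_e, W_c) per business type - illustrative values
    "balance_inquiry": (0.2, 0.6, 0.2),  # low risk: emotion load weighted higher
    "large_transfer":  (0.7, 0.2, 0.1),  # high risk: risk level weighted higher
}

def gating_score(service_type: str, risk_level: float,
                 emotion_load: float, confidence: float) -> float:
    """Gate_s = sigmoid(W_r*R + W_e*E + W_c*C), as in claim 5."""
    w_r, w_e, w_c = PRESET_WEIGHTS[service_type]
    z = w_r * risk_level + w_e * emotion_load + w_c * confidence
    return 1.0 / (1.0 + math.exp(-z))
```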
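The hand-off data packet of claim 6 maps naturally onto a small data structure. The field names below are assumptions introduced for illustration; the four reason tags come directly from the claim.

```python
from dataclasses import dataclass
from enum import Enum

class ReasonTag(Enum):
    EMOTIONAL_AGITATION = "emotional_agitation"
    SEMANTIC_UNDERSTANDING_FAILURE = "semantic_understanding_failure"
    HIGH_RISK_BUSINESS_VALIDATION = "high_risk_business_validation"
    CONSISTENCY_VERIFICATION_FAILURE = "consistency_verification_failure"

@dataclass
class HandoffPacket:
    user_profile: dict        # basic user profile and historical transaction summary
    session_transcript: str   # complete text record of the current session
    business_breakpoint: str  # business step at which the switch was triggered
    reason_tag: ReasonTag     # why the gating score exceeded the threshold
```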
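Claim 7's online feedback learning can be read as a single gradient-descent step on the gating weights whenever the human agent marks a hand-off as a misjudgment (a negative sample). The sketch assumes a binary cross-entropy objective over the sigmoid gating score and a fixed learning rate; neither is specified in the patent.

```python
import math

def feedback_update(weights, features, misjudged: bool, lr: float = 0.01):
    """weights = [W_r, W_e, W_c]; features = [R, E, C]."""
    z = sum(w * x for w, x in zip(weights, features))
    score = 1.0 / (1.0 + math.exp(-z))       # current gating score
    # Target is 0 for a misjudged hand-off (should not have escalated),
    # 1 when the human agent confirms the escalation was warranted.
    target = 0.0 if misjudged else 1.0
    grad = score - target                     # d(BCE)/dz for a sigmoid output
    return [w - lr * grad * x for w, x in zip(weights, features)]
```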
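Putting steps S1 to S5 of claim 8 together, one possible orchestration skeleton looks as follows. Every component interface used here (collect, desensitize_and_encode, perceive, score, respond, take_over) is a hypothetical placeholder introduced for illustration, not an API defined by the patent.

```python
def serve_user_turn(collector, edge_node, cognitive_engine, gating,
                    digital_person, human_agent):
    # S1: acquire voice, video (expressions / body movements) and operation data
    voice, video, operations = collector.collect()

    # S2: desensitize faces and extract visual features locally, then upload
    visual_feats = edge_node.desensitize_and_encode(video)

    # S3: intent-emotion coupled perception in the cloud cognitive engine
    intent_vec, emotion_load, confidence = cognitive_engine.perceive(
        voice, visual_feats, operations)

    # S4: dynamic risk gating over risk level, emotion load and confidence
    score = gating.score(operations.risk_level, emotion_load, confidence)

    # S5: route the turn according to the preset threshold
    if score < gating.threshold:
        return digital_person.respond(intent_vec)      # render voice + motion
    return human_agent.take_over(intent_vec, emotion_load)  # stream context to agent
```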

Description

Bank virtual digital person service system and method integrating multi-modal interaction

Technical Field

The invention relates to a bank virtual digital person service system and method, in particular to a bank virtual digital person service system and method integrating multi-modal interaction, and belongs to the technical field of artificial intelligence and finance.

Background

With the acceleration of the digital transformation of banking, virtual digital humans (Virtual Digital Human) are gradually being applied in channels such as remote banking (VTM) and mobile banking apps to replace traditional human customer service for business consultation and handling. Existing digital human technology generally adopts a pipeline architecture combining speech recognition, natural language understanding and speech synthesis, and can complete standardized question-answering tasks, but when handling complex financial business scenarios it suffers from a missing emotion perception dimension, the lack of a dynamic risk control mechanism, and a disjointed hand-off between the digital person and human agents.

The prior art mostly adopts general-purpose emotion analysis models, which can only identify basic emotions and cannot interpret the user's micro-expressions in combination with the specific financial business context. For example, when a user frowns or shows other subtle negative expressions while reading the risk disclosure statement of a financial product, existing systems may misinterpret this as anger or ignore it, when it is in fact a signal that the user is confused by the professional terminology or worried about the risk. Emotion recognition that lacks domain knowledge prevents the digital person from providing accurate reassurance or explanation.

Existing digital person services generally adopt a one-size-fits-all strategy: the system's autonomous decision authority is fixed regardless of whether the user is querying a balance or transferring a large sum of money. This makes the interaction unnecessarily cumbersome in low-risk scenarios while providing insufficient security verification in high-risk scenarios, making it difficult to achieve a dynamic balance between user experience and fund security. When the digital person cannot solve a problem and needs to transfer the user to a human agent, the transfer is often triggered by keywords or a simple timeout. After the switch, the human agent usually has to question the user from the beginning, causing a service interruption, and there is no technical scheme for losslessly passing on the business context, the user's emotional state and the risk breakpoint. In addition, with the rise of deepfake technology, it has become difficult to fully prevent video fraud by means of passwords or face recognition alone, and existing digital person systems lack the ability to verify, in real time, the consistency between the user's spoken expression and body language.

Disclosure of Invention

Based on the above background, the invention aims to provide a bank virtual digital person service system and method integrating multi-modal interaction, which solve the problems described in the background above.
In order to achieve the above object, the present invention provides the following technical solution: a bank virtual digital person service system integrating multi-modal interaction, comprising: a multi-modal data acquisition end, used for acquiring, in real time, voice data of a user, video data containing facial expressions and body movements, and business operation instruction data; an edge computing node, in communication connection with the multi-modal data acquisition end, used for performing face desensitization processing on the video data, extracting visual feature vectors and uploading the extracted visual feature vectors; a cloud cognitive engine, in communication connection with the edge computing node and provided with a financial intention-emotion coupling perception network, wherein the financial intention-emotion coupling perception network is configured to extract semantic features, visual features and audio emotion features through a text encoder, a visual encoder and an audio encoder respectively, invoke a financial knowledge graph module to retrieve related financial entity concepts according to the semantic features and generate a knowledge embedding vector, perform multi-head attention calculation between the knowledge embedding vector and the visual features and the audio emotion features respectively by using a knowledge-graph-enhanced cross-attention module, and output a user intention vector and an emotion load index fused with financial domain knowledge; and a human-machine cooperative control center provided with a dynamic risk gating module, wherein the dynamic risk gating module is used for receiving the risk level of the current business, the emotion load index and the dialogue confidence, and generating a gating score through weighted calculation; when the gating score is lower than a preset threshold, a virtual digital person driving engine generates feedback according to the user intention vector and drives a digital person model