CN-121979378-A - Virtual human interaction method and device based on dynamic policy adjustment
Abstract
The application discloses a virtual human interaction method and device based on dynamic strategy adjustment. The method comprises: collecting multimodal user input data, extracting key emotion indicators, and identifying the interaction intent; querying a mapping table according to the interaction intent and optimizing the communication strategy in combination with a scene knowledge base to obtain an optimized strategy framework; generating candidate responses based on the optimized strategy framework and evaluating the matching degree between each candidate response and the user's state to determine the final output; and switching to a standby interaction mode to generate the response when the matching degree falls below a preset threshold. By accurately identifying user intent through multimodal fusion analysis, dynamically optimizing strategies with a scene knowledge base, and establishing a closed-loop evaluation and correction mechanism, the method markedly improves the adaptability, intelligence, and pertinence of virtual human interaction, and thereby the user experience.
Inventors
- MENG HAIBIN
- DUAN HAIQING
- DING SHIPENG
Assignees
- 云袭网络技术河北有限公司
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2025-12-02
Claims (10)
- 1. A virtual human interaction method based on dynamic strategy adjustment, characterized by comprising the following steps: step one, acquiring multimodal input data comprising at least user text, voice, and expression image data, and extracting a final emotion key indicator based on the multimodal input data; step two, identifying the user's interaction intent based on the final emotion key indicator and the multimodal input data to determine an intent category label; step three, querying an emotion-intent mapping table according to the intent category label and, if the intent category label indicates a preset state requiring strategy adjustment, acquiring a matched communication strategy set to generate a preliminary communication adjustment scheme; step four, analyzing the interaction history record and judging, in combination with a scene-specific knowledge base, whether the preliminary communication adjustment scheme needs to be optimized, so as to obtain an optimized strategy framework; step five, predicting, through a user state transition prediction model, the user's emotional state at the next moment as caused by each candidate response text, and calculating the similarity between that emotional state and the target emotional state as a matching degree to determine a final output response sequence; step six, if the matching degree is lower than a preset adoption threshold, extracting an auxiliary feature vector to judge whether the user is in a preset negative state and, if so, switching to a standby interaction mode and adopting a corresponding standby strategy to generate new response content; and step seven, combining the final output response sequence determined in step five, or the new response content generated in step six, with non-verbal behavior instructions for the virtual human to construct an interaction response chain, and driving the virtual human model to execute the output behavior.
- 2. The method according to claim 1, wherein extracting the final emotion key indicator specifically comprises: fusing visual features extracted from the expression image data through a convolutional neural network with emotion features extracted from the text data and the voice data to generate a comprehensive emotion vector; evaluating the confidence of the comprehensive emotion vector; and, when the confidence is higher than a preset confidence threshold, inputting the comprehensive emotion vector into a recurrent neural network for temporal modeling to obtain the final emotion key indicator.
- 3. The method according to claim 1, wherein, in step two, identifying the user's interaction intent based on the final emotion key indicator and the multimodal input data comprises: extracting acoustic features from the voice data and fusing the acoustic features with the final emotion key indicator to form a key feature vector; inputting the voice sequence formed by the voice data, together with the key feature vector, into a first recurrent neural network and outputting a sequence of hidden states; converting the text data into text embedding vectors through a word embedding model; applying an attention mechanism that takes the text embedding vector as the query and the sequence of hidden states as the keys and values, and performing a weighted summation over the hidden states to generate a fused semantic representation; and inputting the fused semantic representation into a classification layer to output the intent category label.
- 4. The method of claim 1, wherein, in step three, the emotion-intent mapping table is a key-value database in which each key is composed of an intent category label and an emotional state obtained from the final emotion key indicator, and the corresponding value is a communication strategy category.
- 5. The method according to claim 1, wherein, in step four, analyzing the interaction history to determine whether optimization is required in combination with the scene-specific knowledge base specifically comprises: analyzing, in the interaction history record, the occurrence frequency of keywords related to a specific scene, the complexity of the user's questions, and the continuity of interaction topics, and calculating a scene adaptability index; if the scene adaptability index exceeds a preset scene relevance threshold, retrieving related content from the scene-specific knowledge base corresponding to that scene according to the context of the current interaction; and fusing the retrieved content with the preliminary communication adjustment scheme to form the optimized strategy framework.
- 6. The method according to claim 1 or 5, wherein, in step four, the scene-specific knowledge base is a structured database comprising knowledge point definitions for specific disciplines, common questions and their solutions, step-by-step solution details, and encouragement cases.
- 7. The method of claim 1, wherein, in step six, the standby interaction mode is a mood-pacifying mode and the corresponding standby strategy is a pacifying strategy set for generating response content in empathetic, soothing language.
- 8. The method of claim 1, wherein, in step seven, the non-verbal behavior instructions include expression, gesture, and motion instructions for the virtual human model that match the response content and the emotional state determined by the final emotion key indicator.
- 9. The method of claim 3, wherein the word embedding model is a pre-trained language model based on the Transformer architecture.
- 10. A virtual human interaction device based on dynamic strategy adjustment, comprising: a multimodal data acquisition module for acquiring multimodal input data of a user during interaction; an emotion indicator extraction module for extracting a final emotion key indicator based on the multimodal input data; an interaction intent recognition module for identifying the user's interaction intent based on the final emotion key indicator and the multimodal input data to determine an intent category label; a strategy generation and optimization module for querying a preset emotion-intent mapping table according to the intent category label to obtain a communication strategy set, and optimizing the communication strategy in combination with the user's interaction history record and a scene-specific knowledge base to obtain an optimized strategy framework; a response generation and evaluation module for generating candidate response texts based on the optimized strategy framework and evaluating the matching degree between the candidate response texts and the user's current state to determine a final output response sequence; a mode switching module for re-analyzing the multimodal input data to extract auxiliary feature vectors when the matching degree is below a preset adoption threshold, judging from the auxiliary feature vectors whether to switch to a preset standby interaction mode, and adopting the corresponding standby strategy; and an interaction execution module for constructing a complete interaction response chain according to the strategy selection or switching result and driving the virtual human model to execute the corresponding output behavior.
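As a rough illustration of the claimed flow, the sketch below models the emotion-intent mapping table of claim 4 as a key-value dictionary, the step-five matching degree as cosine similarity between a predicted next-moment emotion vector and the target emotion vector, and the step-six fallback as a threshold check. All labels, strategy names, and the threshold value are hypothetical placeholders, not taken from the patent.

```python
import math

# Hypothetical emotion-intent mapping table (claim 4): keys combine an intent
# category label with an emotional state; values are communication strategy
# categories. Entries here are illustrative only.
STRATEGY_TABLE = {
    ("ask_question", "confused"): "encourage_and_guide",
    ("ask_question", "neutral"): "knowledge_explanation",
    ("complain", "frustrated"): "mood_pacifying",
}

def lookup_strategy(intent_label, emotion_state):
    """Query the mapping table; fall back to a default strategy on a miss."""
    return STRATEGY_TABLE.get((intent_label, emotion_state), "default_dialogue")

def matching_degree(predicted_state, target_state):
    """Cosine similarity between the predicted next-moment emotion vector
    and the target emotion vector, used as the step-five matching degree."""
    dot = sum(p * t for p, t in zip(predicted_state, target_state))
    norm = (math.sqrt(sum(p * p for p in predicted_state))
            * math.sqrt(sum(t * t for t in target_state)))
    return dot / norm if norm else 0.0

def select_response(candidates, target_state, predict, threshold=0.6):
    """Score each candidate response by matching degree (step five); if even
    the best score is below the adoption threshold, return None so the caller
    can switch to the standby interaction mode (step six)."""
    scored = [(matching_degree(predict(c), target_state), c) for c in candidates]
    best_score, best = max(scored)
    if best_score < threshold:
        return None, best_score  # caller switches to the standby strategy
    return best, best_score
```

A real system would obtain `predict` from the user state transition prediction model named in claim 1; here it is left as a caller-supplied function so the selection logic stays self-contained.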
Description
Virtual human interaction method and device based on dynamic policy adjustment
Technical Field
The application relates to the technical fields of artificial intelligence, computer graphics, and human-computer interaction systems, and in particular to a virtual human interaction method and device based on dynamic strategy adjustment.
Background
Virtual human interaction technology is an important development direction in the field of human-computer interaction, with increasingly wide application in areas such as intelligent customer service, online education, virtual assistants, and Virtual Reality (VR) systems. By constructing virtual avatars capable of simulating human communication, it provides users with more natural, immersive, and personalized interaction experiences. In these applications, the virtual human is not only required to act as a medium for information transfer, but must also be able to understand the user's deeper needs and guide the user's behavior in a timely manner. However, prior-art virtual human interaction systems still exhibit technical problems when handling complex interaction scenarios. On the one hand, many systems rely primarily on a preset rule base or fixed dialogue flow templates to generate responses. This approach works effectively for structured, explicit user queries, but the system's flexibility and adaptability are limited when the user's input forms are diverse, the intent is ambiguous, or the user's mood fluctuates. For example, the system may fail to accurately capture states such as hesitation, confusion, or anxiety that the user conveys through multimodal information such as speech prosody and facial expressions, so the responses lack pertinence, degrading the overall interaction effect and user satisfaction.
On the other hand, the prior art falls short in achieving efficient linkage between deep understanding of the user's multimodal input and dynamic selection of interaction strategies. A user's interactive behavior is usually the combined expression of multiple modalities, such as text, voice, and facial expression, which together form a complete portrait of the user's current state. Although prior-art systems can process information from a single modality, they still face significant shortcomings in effectively fusing heterogeneous data and extracting from it an accurate intent that can directly guide dynamic adjustment of the interaction strategy. For example, in an online education scenario, a student encountering learning difficulties may exhibit reduced text input, a hesitant speech intonation, and a confused facial expression. If the system cannot comprehensively analyze these signals and identify the student's hesitant state, it cannot adjust the communication strategy in time, for example by switching from a knowledge-explanation mode to an encouragement-and-guidance mode, and effective guidance of the user suffers. This gap between deep understanding of the user state and dynamic adjustment of the interaction strategy limits the ability of virtual humans to provide effective guidance in complex interaction scenarios. Therefore, how to design a technical scheme that analyzes the user's multimodal input in real time, judges the user's intent and state, and dynamically selects and optimizes communication strategies accordingly is a technical problem to be solved in the field of virtual human interaction.
Disclosure of Invention
The invention provides a virtual human interaction method based on dynamic strategy adjustment, which addresses the technical problems identified in the background: virtual human interaction systems have difficulty flexibly adjusting their strategy according to the user's multimodal real-time state, so interaction pertinence and effectiveness are poor. The method comprises the following steps.
Step one: acquire multimodal input data of the user during interaction, and extract a final emotion key indicator based on the multimodal input data. The multimodal input data includes at least the user's text data, voice data, and expression image data. First, the system collects the user's expression images, voice stream, and text input in real time through devices such as a camera and a microphone. The expression image data is processed with a pre-trained convolutional neural network (CNN). In particular, the convolutional neural network is configured to identify the pixel distribution and texture information of key facial regions, from which it extracts visual feature vectors that characterize the user's core emotion (e.g., happy, sad, surprised).
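The convolution-and-pooling feature extraction described above can be illustrated with a minimal pure-Python sketch. This is not the patent's implementation: the single conv/pool stage, the 4x4 grayscale "face region" patch, and the edge-detecting kernel are toy values standing in for a pre-trained deep CNN.

```python
def conv2d(image, kernel):
    """Valid 2D cross-correlation of a 2D list `image` with `kernel`."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [sum(image[i + u][j + v] * kernel[u][v]
             for u in range(kh) for v in range(kw))
         for j in range(out_w)]
        for i in range(out_h)
    ]

def max_pool2x2(fmap):
    """2x2 max pooling with stride 2, discarding any ragged border."""
    return [
        [max(fmap[i][j], fmap[i][j + 1], fmap[i + 1][j], fmap[i + 1][j + 1])
         for j in range(0, len(fmap[0]) - 1, 2)]
        for i in range(0, len(fmap) - 1, 2)
    ]

# Toy 4x4 patch with a sharp horizontal intensity edge, and a 2x2 kernel
# that responds to that edge (a crude stand-in for learned CNN filters).
patch = [
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [9, 9, 9, 9],
    [9, 9, 9, 9],
]
edge_kernel = [[1, 1], [-1, -1]]
feature_map = conv2d(patch, edge_kernel)  # strong response along the edge row
pooled = max_pool2x2(feature_map)         # coarse "visual feature" summary
```

In the claimed system these feature maps would be flattened into the visual feature vector that is later fused with text and voice emotion features into the comprehensive emotion vector of claim 2.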