CN-121979985-A - Multi-mode fusion dialogue management method and device, electronic equipment and medium

Abstract

The invention discloses a multi-modal fusion dialogue management method and device, an electronic device, and a medium. The method comprises: collecting multi-modal data in a preset target scene, wherein the multi-modal data comprises audio data, video data and touch data; identifying the multi-modal data to obtain target data, wherein the target data is data used in a dialogue management flow; and, if the target data meets a preset dialogue condition, generating an answer according to the target data. The technical scheme resolves ambiguity accurately, keeps the context coherent and smooth, and makes the interaction process adaptive, safe and reliable.

Inventors

WU SONG
WANG BINBIN

Assignees

杭州软通天擎机器人科技有限公司

Dates

Publication Date: 2026-05-05
Application Date: 2026-01-16

Claims (10)

1. A multi-modal fusion dialogue management method, comprising: collecting multi-modal data in a preset target scene, wherein the multi-modal data comprises audio data, video data and touch data; identifying the multi-modal data to obtain target data, wherein the target data is data used in a dialogue management flow; and if the target data meets a preset dialogue condition, generating an answer according to the target data.
2. The method of claim 1, wherein after identifying the multi-modal data to obtain the target data, the method further comprises: if the target data does not meet the preset dialogue condition, generating a refusal-to-answer prompt.
3. The method of claim 2, wherein generating the refusal-to-answer prompt if the target data does not meet the preset dialogue condition comprises: if a voice identifier in the target data is a preset voice identifier, generating the refusal-to-answer prompt.
4. The method of claim 2, wherein generating the refusal-to-answer prompt if the target data does not meet the preset dialogue condition further comprises: if a voiceprint similarity in the target data is smaller than a preset first threshold, generating the refusal-to-answer prompt.
5. The method of claim 2, wherein generating the refusal-to-answer prompt if the target data does not meet the preset dialogue condition further comprises: if language information in the target data is preset language information and a confidence in the target data is smaller than a preset second threshold, generating the refusal-to-answer prompt.
6. The method of claim 1, wherein generating the answer according to the target data comprises: encoding text information in the target data to obtain a first code, and encoding visual information in the target data to obtain a second code; fusing the first code and the second code to generate an intent probability distribution; inputting the intent probability distribution into a pre-configured dialogue state tracking model, and outputting a corresponding intent through the dialogue state tracking model; and generating an answer matching the intent.
7. The method of claim 1, wherein collecting the multi-modal data in the preset target scene comprises: collecting the multi-modal data in the preset target scene by means of a pre-installed intelligent terminal, wherein the intelligent terminal is composed of a microphone array and a camera.
8. A multi-modal fusion dialogue management device, comprising: a multi-modal data acquisition module, configured to collect multi-modal data in a preset target scene, wherein the multi-modal data consists of audio data, video data and touch data; a target data acquisition module, configured to identify the multi-modal data to obtain target data, wherein the target data is data used in a dialogue management flow; and an answer generation module, configured to generate an answer according to the target data if the target data meets a preset dialogue condition.
9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores a computer program executable by the at least one processor, so that the at least one processor can perform the multi-modal fusion dialogue management method of any of claims 1-7.
10. A computer-readable storage medium, wherein the computer-readable storage medium stores computer instructions which, when executed, cause a processor to perform the multi-modal fusion dialogue management method of any of claims 1-7.
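
The encode, fuse, track and answer steps of claim 6 can be sketched as follows. This is a minimal illustration only: the claims do not specify the encoders, the fusion operator or the dialogue state tracking model, so the keyword and one-hot encoders, the hand-set weight matrix `WEIGHTS`, the softmax fusion, the smoothing tracker and every intent and answer string below are hypothetical stand-ins rather than the claimed implementation.

```python
import math
from typing import List

# Hypothetical intent inventory; the patent does not list concrete intents.
INTENTS = ["turn_on_light", "turn_off_light", "play_music"]

def encode_text(text: str) -> List[float]:
    """Toy 'first code': keyword-presence features standing in for a text encoder."""
    words = text.lower().split()
    return [float("on" in words), float("off" in words), float("play" in words)]

def encode_visual(pointed_object: str) -> List[float]:
    """Toy 'second code': one-hot of the object the user points at."""
    return [float(pointed_object == "lamp"), float(pointed_object == "speaker")]

# Hypothetical fusion weights: one row per intent, one column per fused feature.
WEIGHTS = [
    [2.0, -2.0, 0.0, 1.0, -1.0],   # turn_on_light
    [-2.0, 2.0, 0.0, 1.0, -1.0],   # turn_off_light
    [0.0, 0.0, 2.0, -1.0, 1.0],    # play_music
]

def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(first_code: List[float], second_code: List[float]) -> List[float]:
    """Concatenate both codes and map them to an intent probability distribution."""
    fused = first_code + second_code
    scores = [sum(w * x for w, x in zip(row, fused)) for row in WEIGHTS]
    return softmax(scores)

class DialogueStateTracker:
    """Assumed tracker: blends the previous belief with the new distribution."""

    def __init__(self, decay: float = 0.5) -> None:
        self.decay = decay
        self.belief = [1.0 / len(INTENTS)] * len(INTENTS)

    def update(self, intent_probs: List[float]) -> str:
        mixed = [self.decay * b + (1.0 - self.decay) * p
                 for b, p in zip(self.belief, intent_probs)]
        total = sum(mixed)
        self.belief = [m / total for m in mixed]
        best = max(range(len(INTENTS)), key=self.belief.__getitem__)
        return INTENTS[best]

# Hypothetical answer templates keyed by intent.
ANSWERS = {
    "turn_on_light": "Turning the lamp on.",
    "turn_off_light": "Turning the lamp off.",
    "play_music": "Playing music on the speaker.",
}

def generate_answer(text: str, pointed_object: str,
                    tracker: DialogueStateTracker) -> str:
    """Encode both modalities, fuse them, track the intent, and answer."""
    probs = fuse(encode_text(text), encode_visual(pointed_object))
    return ANSWERS[tracker.update(probs)]
```

In this sketch the utterance "turn it on" does not name an object by itself; the visual code for the pointed-at lamp is what shifts the fused distribution toward a lamp-related intent, which is the kind of reference resolution the claims aim at.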

Description

Multi-modal fusion dialogue management method and device, electronic equipment and medium

Technical Field

The present invention relates to the field of dialogue management technologies, and in particular to a multi-modal fusion dialogue management method, device, electronic device, and medium.

Background

With the development of human-machine interaction technology, dialogue systems are widely used in real scenes such as smart homes and vehicle-mounted terminals, and the smoothness and accuracy of interaction directly affect the user experience. However, conventional dialogue systems still face several unsolved technical challenges in practical use.

First, intent recognition is ambiguous: users frequently use referring expressions during interaction, and the object of an operation cannot be determined accurately from voice information alone, so the system cannot respond to the instruction correctly.

Second, the context is fragmented: conventional dialogue state tracking methods process only the text sequence and ignore non-verbal cues such as vision and behavior, so they struggle to fully understand the user's interaction intent.

Third, the interaction strategy is rigid: the interactive actions of the system (such as instruction confirmation and task execution) are usually set by static rules, and the interaction logic cannot be adjusted dynamically according to the user's tolerance and the complexity of the task.

Fourth, invalid requests are rampant: the system lacks an effective filtering mechanism for non-human voice input, access by unregistered users, and input in non-target languages, which easily causes false triggering and wastes computing resources.
These problems severely limit the interactive performance and practical value of dialogue systems, so a multi-round dialogue interaction optimization scheme that solves them is needed.

Disclosure of Invention

The invention provides a multi-modal fusion dialogue management method and device, an electronic device, and a medium, which resolve ambiguity accurately, keep the context coherent and smooth, and make the interaction process adaptive, safe and reliable.

According to an aspect of the present invention, there is provided a multi-modal fusion dialogue management method, including: collecting multi-modal data in a preset target scene, wherein the multi-modal data comprises audio data, video data and touch data; identifying the multi-modal data to obtain target data, wherein the target data is data used in a dialogue management flow; and if the target data meets a preset dialogue condition, generating an answer according to the target data.

According to another aspect of the present invention, there is provided a multi-modal fusion dialogue management device, including: a multi-modal data acquisition module, configured to collect multi-modal data in a preset target scene, wherein the multi-modal data consists of audio data, video data and touch data; a target data acquisition module, configured to identify the multi-modal data to obtain target data, wherein the target data is data used in a dialogue management flow; and an answer generation module, configured to generate an answer according to the target data if the target data meets the preset dialogue condition.
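
The collect, identify and conditional-answer flow described above can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the field names of `TargetData`, the preset values `PRESET_VOICE_IDENTIFIER`, `FIRST_THRESHOLD`, `PRESET_LANGUAGE` and `SECOND_THRESHOLD`, and the `identify` and `generate_answer` callables are all hypothetical. The three refusal checks follow the wording of claims 3 to 5 literally.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class MultiModalData:
    """Raw data collected in the target scene."""
    audio: bytes
    video: bytes
    touch: bytes

@dataclass
class TargetData:
    """Recognition output fed to the dialogue management flow (assumed fields)."""
    text: str
    voice_identifier: str         # e.g. "human" or "non_human"
    voiceprint_similarity: float  # similarity to a registered user, 0..1
    language: str
    confidence: float             # recognition confidence, 0..1

# Hypothetical presets; the patent gives no concrete values.
PRESET_VOICE_IDENTIFIER = "non_human"
FIRST_THRESHOLD = 0.8
PRESET_LANGUAGE = "zh"
SECOND_THRESHOLD = 0.6

def refusal_reason(target: TargetData) -> Optional[str]:
    """Return why the preset dialogue condition fails, or None if it is met."""
    if target.voice_identifier == PRESET_VOICE_IDENTIFIER:
        return "voice identifier matches the preset identifier"
    if target.voiceprint_similarity < FIRST_THRESHOLD:
        return "voiceprint similarity is below the first threshold"
    if target.language == PRESET_LANGUAGE and target.confidence < SECOND_THRESHOLD:
        return "confidence is below the second threshold"
    return None

def manage_dialogue(
    data: MultiModalData,
    identify: Callable[[MultiModalData], TargetData],
    generate_answer: Callable[[TargetData], str],
) -> str:
    """Identify the collected data, then answer or emit a refusal prompt."""
    target = identify(data)
    reason = refusal_reason(target)
    if reason is not None:
        return f"Sorry, I cannot answer this request: {reason}."
    return generate_answer(target)
```

Passing `identify` and `generate_answer` in as callables mirrors the module split of the device (acquisition, target data acquisition, answer generation) and keeps the filtering step independent of any particular recognition model.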
According to another aspect of the present invention, there is provided an electronic device, including: at least one processor; and a memory communicatively connected with the at least one processor, wherein the memory stores a computer program executable by the at least one processor, so that the at least one processor can perform the multi-modal fusion dialogue management method according to any embodiment of the present invention.

According to another aspect of the present invention, there is provided a computer-readable storage medium storing computer instructions for causing a processor to execute the multi-modal fusion dialogue management method according to any embodiment of the present invention.

According to the technical scheme, multi-modal data is collected in a preset target scene, the multi-modal data is then identified to obtain target data, and if the target data meets the preset dialogue condition, an answer is generated according to the target data. The technical scheme resolves ambiguity accurately, keeps the context coherent and smooth, and makes the interaction process adaptive, safe and reliable.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to