KR-20260067051-A - Apparatus and method for multi-modal chatting


Abstract

A multimodal conversation device and method are provided that automatically generate and present, as needed, a picture related to the conversation when a user and a system converse by text or voice. A multimodal conversation device according to one embodiment of the present invention comprises: a text response generation unit that generates the text system response to be uttered at the current turn based on the context of the conversation between the system and the user; a picture representation generation unit that generates picture representation text expressing the content to be represented by the picture, based on the generated text system response; and a picture generation unit that generates a picture based on the generated picture representation text.
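
The three units named in the abstract can be read as a simple pipeline. The following is a minimal Python sketch under stated assumptions, not the applicant's implementation: the three stages are passed in as plain callables, and the names `Turn`, `respond`, `generate_text`, `describe_picture`, and `generate_picture` are hypothetical. Returning `None` from `describe_picture` stands in for the "as needed" decision, that is, the case where no picture is called for.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Turn:
    """One system turn: always text, optionally a picture."""
    text: str
    picture: Optional[bytes] = None


def respond(
    context: List[str],
    generate_text: Callable[[List[str]], str],
    describe_picture: Callable[[List[str], str], Optional[str]],
    generate_picture: Callable[[str], bytes],
) -> Turn:
    # Stage 1: text system response from the conversation context.
    text = generate_text(context)
    # Stage 2: picture representation text, or None when no picture is needed
    # (the None convention is an assumption of this sketch).
    description = describe_picture(context, text)
    if description is None:
        return Turn(text=text)
    # Stage 3: picture generated from the picture representation text.
    return Turn(text=text, picture=generate_picture(description))
```

Passing the stages as callables keeps the sketch independent of any particular language model or image generation model; real models would be wrapped behind the same three signatures.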

Inventors

이기영
권오욱
류지희
서영애
성진
신종훈
이요한
임수종
허정

Assignees

한국전자통신연구원

Dates

Publication Date: 20260512
Application Date: 20241105

Claims (20)

1. A multimodal conversation method comprising: a text system response step of generating a text system response to be uttered at the current turn based on the context of the conversation between a system and a user; a picture representation generation step of generating picture representation text expressing the content to be represented by a picture, based on the generated text system response; and a picture generation step of generating a picture based on the generated picture representation text.
2. The multimodal conversation method of claim 1, wherein the picture representation generation step comprises: generating a prompt for generating the picture representation text; and generating the picture representation text by inputting the generated prompt into a generative language model.
3. The multimodal conversation method of claim 2, wherein the prompt comprises: a command to determine whether to output the generated text system response as is or to render it as a picture and, if it is to be rendered as a picture, to generate the picture representation text; user information including user characteristics; the previous conversation context; and the text system response.
4. The multimodal conversation method of claim 3, wherein, in the step of inputting the generated prompt into the generative language model to generate the picture representation text, the generative language model outputs a signal indicating that no picture will be generated when it determines that the system utterance content is not to be displayed as a picture, and generates the picture representation text when it determines that the system utterance content is to be composed as a picture.
5. The multimodal conversation method of claim 4, wherein: when it is determined in the picture representation generation step not to display the system utterance content as a picture, the text system response generated in the text system response step is output; and when it is determined in the picture representation generation step to display the system utterance content as a picture, the text system response generated in the text system response step and the picture generated in the picture generation step are output.
6. The multimodal conversation method of claim 1, wherein the picture generation step comprises: a picture search step of searching, using the picture representation text, for the picture most similar to the picture representation text; a picture generation decision step of determining whether to use the retrieved picture or to generate a new picture; and, when it is decided to generate a new picture, generating and outputting a new picture using an artificial intelligence image generation model based at least on the picture representation text, and, when it is decided to use the retrieved picture, outputting the retrieved picture.
7. The multimodal conversation method of claim 6, wherein the picture generation decision step comprises: generating a decision prompt for determining whether to generate a new picture, using the picture retrieved in the picture search step, user information, the picture representation text, and a picture representation context including conversation history; and deciding, based on the generated decision prompt and the retrieved picture, whether to use the retrieved picture or to generate a new picture.
8. The multimodal conversation method of claim 7, wherein: in the picture search step, a plurality of pictures most similar to the picture representation text are output; in the step of generating the decision prompt, the plurality of pictures are combined with the picture representation context to generate a plurality of decision prompts; and in the picture generation decision step, whether to use the picture with the highest similarity among the retrieved pictures is decided based on the similarity between the plurality of pictures and the picture representation context.
9. The multimodal conversation method of claim 6, wherein, in the picture generation decision step, when the similarity of the retrieved picture is higher than a predetermined threshold, it is decided to generate a new picture based on the picture representation text and the retrieved picture, and when the similarity of the retrieved picture is lower than the predetermined threshold, it is decided to generate a new picture based at least on the picture representation text.
10. The multimodal conversation method of claim 1, further comprising a picture-reflecting text generation step of correcting the text system response to generate text that reflects the picture generated in the picture generation step.
11. A multimodal conversation device comprising: a text response generation unit that generates a text system response to be uttered at the current turn based on the context of the conversation between a system and a user; a picture representation generation unit that generates picture representation text expressing the content to be represented by a picture, based on the generated text system response; and a picture generation unit that generates a picture based on the generated picture representation text.
12. The multimodal conversation device of claim 11, wherein the picture representation generation unit comprises: a prompt generation unit that generates a prompt for generating the picture representation text; and a generative language model that receives the generated prompt as input and generates the picture representation text.
13. The multimodal conversation device of claim 12, wherein the prompt comprises: a command to determine whether to output the generated text system response as is or to render it as a picture and, if it is to be rendered as a picture, to generate the picture representation text; user information including user characteristics; the previous conversation context; and the text system response.
14. The multimodal conversation device of claim 13, wherein the command of the prompt instructs that a signal indicating that no picture will be generated be output when it is determined that the system utterance content is not to be displayed as a picture, and that the picture representation text be generated when it is determined that the system utterance content is to be composed as a picture.
15. The multimodal conversation device of claim 14, wherein the multimodal conversation device: outputs the text system response generated by the text response generation unit when the picture representation generation unit determines not to display the system utterance content as a picture; and outputs the text system response generated by the text response generation unit and the picture generated in the picture generation step when the picture representation generation unit determines to display the system utterance content as a picture.
16. The multimodal conversation device of claim 11, wherein the picture generation unit comprises: a picture search unit that searches, using the picture representation text, for the picture most similar to the picture representation text; a picture generation decision unit that determines whether to use the retrieved picture or to generate a new picture; and an image generation model that, when it is decided to generate a new picture, generates and outputs a new picture based at least on the picture representation text and, when it is decided to use the retrieved picture, outputs the retrieved picture.
17. The multimodal conversation device of claim 16, wherein the picture generation decision unit: generates a decision prompt for determining whether to generate a new picture, using the picture retrieved by the picture search unit, user information, the picture representation text, and a picture representation context including conversation history; and decides, based on the generated decision prompt and the retrieved picture, whether to use the retrieved picture or to generate a new picture.
18. The multimodal conversation device of claim 17, wherein: the picture search unit outputs a plurality of pictures most similar to the picture representation text; and the picture generation decision unit combines the plurality of pictures with the picture representation context to generate a plurality of decision prompts, and decides whether to use the picture with the highest similarity among the retrieved pictures based on the similarity between the plurality of pictures and the picture representation context.
19. The multimodal conversation device of claim 16, wherein the picture generation decision unit decides to generate a new picture based on the picture representation text and the retrieved picture when the similarity of the retrieved picture is higher than a predetermined threshold, and decides to generate a new picture based at least on the picture representation text when the similarity of the retrieved picture is lower than the predetermined threshold.
20. The multimodal conversation device of claim 11, further comprising a picture-reflecting text generation unit that corrects the text system response to generate text that reflects the picture generated by the picture generation unit.
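
As an illustration of the prompt composition (claims 3 and 13), the no-picture signal (claims 4 and 14), and the threshold rule (claims 9 and 19), here is a minimal Python sketch. It is a reading of the claim language under stated assumptions, not the claimed implementation. The sentinel string `NO_PICTURE`, the prompt wording, and the names `build_representation_prompt`, `parse_representation`, `Retrieved`, `GenerationPlan`, and `plan_generation` are all hypothetical. The claims do not say what happens when the similarity equals the threshold; the sketch treats that case as "lower".

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Hypothetical signal string; the claims only require "a signal".
NO_PICTURE = "NO_PICTURE"


def build_representation_prompt(user_info: str, history: Sequence[str], response: str) -> str:
    """Assemble the four prompt parts: command, user info, history, response."""
    command = (
        "Decide whether the system response should be output as is or shown "
        f"as a picture. If no picture is needed, reply exactly {NO_PICTURE}. "
        "Otherwise reply with text describing the picture."
    )
    return "\n".join([
        f"Command: {command}",
        f"User information: {user_info}",
        "Previous conversation:",
        *history,
        f"System response: {response}",
    ])


def parse_representation(model_output: str) -> Optional[str]:
    """Return picture representation text, or None for the no-picture signal.

    Treating empty output as "no picture" is an assumption of this sketch.
    """
    text = model_output.strip()
    return None if text in ("", NO_PICTURE) else text


@dataclass
class Retrieved:
    picture_id: str
    similarity: float  # similarity to the picture representation text


@dataclass
class GenerationPlan:
    representation: str
    reference_id: Optional[str]  # retrieved picture to condition on, if any


def plan_generation(
    representation: str, retrieved: Sequence[Retrieved], threshold: float
) -> GenerationPlan:
    # Pick the most similar retrieved picture, if there is one.
    best = max(retrieved, key=lambda r: r.similarity, default=None)
    # Above the threshold: generate from the text and the retrieved picture.
    if best is not None and best.similarity > threshold:
        return GenerationPlan(representation, best.picture_id)
    # Otherwise: generate from the picture representation text alone.
    return GenerationPlan(representation, None)
```

The sketch covers only the threshold variant. The alternative in claims 6 and 16, where the retrieved picture is output unchanged instead of generating a new one, would need a third outcome in `GenerationPlan`.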

Description

Apparatus and method for multi-modal chatting

The present invention relates to a multimodal conversation device and method.

Text- or voice-based conversation is one of the most fundamental modes of human communication. However, when the goal is to convey complex concepts or situations rather than simple meanings, text or voice alone does not allow the parties to understand each other quickly and efficiently. This occurs in many practical settings. In educational conversations between students and teachers, teachers often draw pictures on a blackboard or in a notebook to make explanations easier to follow. In such cases, a picture is a far more efficient means than text of helping students grasp a concept. The same is particularly evident in conversations with the elderly or the socially vulnerable. For instance, when a son or daughter tries to explain the functions of an air conditioner or TV remote control to elderly parents who live elsewhere, it is difficult to explain each button's function in detail, efficiently, and in an easy-to-understand manner using only text or voice. Likewise, even in the rapidly advancing field of dialogue processing, communication that relies solely on text or voice often struggles to help users understand specific concepts or facts when a system interacts with socially vulnerable individuals, including the elderly.

The development of deep learning-based artificial intelligence technology has significantly advanced many technologies in the field of natural language processing. Dialogue processing is no exception, with notable progress not only in casual chat with systems but also in goal-oriented conversation. For this reason, there have been many attempts to apply dialogue processing models to various fields.
Examples include care services for the socially vulnerable, including the elderly; tutoring services for language or mathematics problems; medical services; and product sales services. However, conversations that rely solely on text or voice have difficulty maintaining efficient communication between the system and the user. For example, when conversing with the elderly, using only text to explain specific concepts, facts, or how to use objects has distinct limitations. In some cases, the desired efficiency can be achieved by showing the spoken content as a picture and explaining it in text while both parties view the picture. The same applies to the tutoring domain. When attempting to solve a mathematics problem, many people come to understand what the problem represents by drawing a picture. For example, a teacher who is explaining a mathematics problem verbally and senses that the student's understanding is insufficient helps the student by presenting a picture, such as a drawing on the blackboard or in a notebook.

FIG. 1 is a block diagram showing the overall configuration of a multimodal conversation device according to one embodiment of the present invention.
FIG. 2 shows one example of generating a picture during a conversation between a system and a user in one embodiment of the present invention.
FIG. 3 shows another example of generating a picture during a conversation between a system and a user in one embodiment of the present invention.
FIG. 4 shows two examples of generating and displaying, in real time, an object described by a tutor during the tutor's explanation in one embodiment of the present invention.
FIG. 5 shows an example of the present invention applied to a conversation between an elderly user and an artificial intelligence assistant.
FIG. 6 is a block diagram showing the configuration of the picture representation generation unit.
FIG. 7 shows an example of a prompt and a picture representation generated in the picture representation generation unit.
FIG. 8 is a block diagram showing the configuration of the picture generation unit.
FIG. 9 is a flowchart showing the operation flow of a multimodal conversation method according to one embodiment of the present invention.

The aforementioned objectives of the present invention, as well as other objectives, advantages, and features, and the methods for achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and can be implemented in various different forms. The following embodiments are provided merely to inform those skilled in the art of the purpose, structure, and effects of the invention, and the scope of the rights of the present invention is defined by the claims. Meanwhile, the terms used in this specification are for describing the embodiments and are not intended to limit the invention. In this specification, the singular form includes the plural form unless specifically stated ot