US-20260129010-A1 - MULTI-MODAL CHATTING APPARATUS AND METHOD

Abstract

Provided are a multi-modal chatting apparatus and method which automatically generate and present a picture related to a conversation, if necessary, in a situation in which a user and a system have a conversation based on text or a voice. The multi-modal chatting apparatus includes a text response generation unit configured to generate a text system response that needs to be now spoken based on conversation context between a system and a user, a picture expression generation unit configured to generate picture expression text that expresses contents to be expressed by a picture based on the generated text system response, and a picture generation unit configured to generate a picture based on the generated picture expression text.

Inventors

Ki Young Lee
Oh Woog KWON
Jihee RYU
Young-Ae Seo
Jin Seong
JONG HUN SHIN
Yo Han Lee
Soojong Lim
Jeong Heo

Assignees

ELECTRONICS AND TELECOMMUNICATIONS RESEARCH INSTITUTE

Dates

Publication Date: 20260507
Application Date: 20250312
Priority Date: 20241105

Claims (20)

1. A multi-modal chatting method comprising: a text system response generation step of generating a text system response that needs to be now spoken based on conversation context between a system and a user; a picture expression generation step of generating picture expression text that expresses contents to be expressed by a picture based on the generated text system response; and a picture generation step of generating a picture based on the generated picture expression text.
2. The multi-modal chatting method of claim 1, wherein the picture expression generation step comprises steps of: generating a prompt for generating the picture expression text; and generating the picture expression text by inputting the generated prompt to a generative language model.
3. The multi-modal chatting method of claim 2, wherein the prompt comprises: a command that determines whether to output the generated text system response without any change or to generate the generated text system response in picture and that enables the picture expression text to be generated when the generated text system response needs to be generated in picture; user information comprising characteristics of the user; previous conversation context; and the text system response.
4. The multi-modal chatting method of claim 3, wherein the step of generating the picture expression text by inputting the generated prompt to the generative language model comprises: outputting, by the generative language model, a signal indicating that no picture is to be generated when it is determined that system speech contents are to be not displayed in picture, and generating, by the generative language model, the picture expression text when it is determined to construct the system speech contents in picture.
5. The multi-modal chatting method of claim 4, wherein: when it is determined that the system speech contents are to be not displayed in picture in the picture expression generation step, the text system response generated in the text system response step is output, and when it is determined that the system speech contents are to be displayed in picture in the picture expression generation step, the text system response generated in the text system response step and the picture generated in the picture generation step are output.
6. The multi-modal chatting method of claim 1, wherein the picture generation step comprises: a picture search step of searching for a picture most similar to the picture expression text based on the picture expression text; a picture generation determination step of determining whether to use the retrieved picture or to generate a new picture; and a step of generating and outputting a new picture at least based on the picture expression text by using an AI image generation model when it is determined that the new picture is to be generated and outputting the retrieved picture when it is determined that the retrieved picture is to be used.
7. The multi-modal chatting method of claim 6, wherein the picture generation determination step comprises steps of: generating a determination prompt for determining whether to generate a new picture, based on the picture retrieved in the picture search step and picture expression context comprising user information, the picture expression text, and a conversation history, and determining whether to use the retrieved picture or to generate the new picture based on the generated determination prompt and the retrieved picture.
8. The multi-modal chatting method of claim 7, wherein: in the picture search step, a plurality of pictures most similar to the picture expression text is output, in the step of generating the determination prompt, a plurality of determination prompts is generated by combining the plurality of pictures and the picture expression context, and in the picture generation determination step, whether to use a picture having a greatest similarity, among the retrieved pictures, without any change is determined based on similarity between the plurality of pictures and the picture expression context.
9. The multi-modal chatting method of claim 6, wherein the picture generation determination step comprises: determining to generate the new picture based on the picture expression text and the retrieved picture when similarity of the retrieved picture is higher than a predetermined threshold value, and determining to generate the new picture at least based on the picture expression text when the similarity of the retrieved picture is lower than the predetermined threshold value.
10. The multi-modal chatting method of claim 1, further comprising a picture reflected text generation step of generating text into which the picture generated in the picture generation step has been reflected by correcting the text system response.
11. A multi-modal chatting apparatus comprising: a text response generation unit configured to generate a text system response that needs to be now spoken based on conversation context between a system and a user; a picture expression generation unit configured to generate picture expression text that expresses contents to be expressed by a picture based on the generated text system response; and a picture generation unit configured to generate a picture based on the generated picture expression text.
12. The multi-modal chatting apparatus of claim 11, wherein the picture expression generation unit comprises: a prompt generation unit configured to generate a prompt for generating the picture expression text, and a generative language model configured to generate the picture expression text by receiving the generated prompt.
13. The multi-modal chatting apparatus of claim 12, wherein the prompt comprises: a command that determines whether to output the generated text system response without any change or to generate the generated text system response in picture and that enables the picture expression text to be generated when the generated text system response needs to be generated in picture; user information comprising characteristics of the user; previous conversation context; and the text system response.
14. The multi-modal chatting apparatus of claim 13, wherein the command of the prompt is an instruction that outputs a signal indicating that no picture is to be generated when it is determined that system speech contents are to be not displayed in picture and that enables the picture expression text to be generated when it is determined that the system speech contents are to be constructed in picture.
15. The multi-modal chatting apparatus of claim 14, wherein the multi-modal chatting apparatus outputs the text system response generated by the text response generation unit when the picture expression generation unit determines that the system speech contents are to be not displayed in picture, and outputs the text system response generated by the text response generation unit and the picture generated by the picture generation unit when the picture expression generation unit determines to display the system speech contents in picture.
16. The multi-modal chatting apparatus of claim 11, wherein the picture generation unit comprises: an image search unit configured to search for a picture most similar to the picture expression text based on the picture expression text; a picture generation determination unit configured to determine whether to use the retrieved picture or to generate a new picture; and an image generating model configured to generate and output a new picture at least based on the picture expression text when it is determined that the new picture is to be generated and outputting the retrieved picture when it is determined that the retrieved picture is to be used.
17. The multi-modal chatting apparatus of claim 16, wherein the picture generation determination unit generates a determination prompt for determining whether to generate a new picture, based on the picture retrieved by the image search unit and picture expression context comprising user information, the picture expression text, and a conversation history, and determines whether to use the retrieved picture or to generate the new picture based on the generated determination prompt and the retrieved picture.
18. The multi-modal chatting apparatus of claim 17, wherein: the picture search unit outputs a plurality of pictures most similar to the picture expression text, and the picture generation determination unit generates a plurality of determination prompts by combining the plurality of pictures and the picture expression context, and determines whether to use a picture having a greatest similarity, among the retrieved pictures, without any change based on similarity between the plurality of pictures and the picture expression context.
19. The multi-modal chatting apparatus of claim 16, wherein the picture generation determination unit determines to generate the new picture based on the picture expression text and the retrieved picture when similarity of the retrieved picture is higher than a predetermined threshold value, and determines to generate the new picture at least based on the picture expression text when the similarity of the retrieved picture is lower than the predetermined threshold value.
20. The multi-modal chatting apparatus of claim 11, further comprising a picture reflected text generation unit configured to generate text into which the picture generated in the picture generation unit has been reflected by correcting the text system response.
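
The picture generation determination recited in claims 6 and 9 (and mirrored in claims 16 and 19) can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The names PicturePlan, plan_picture, scored_candidates, and reuse_ok are hypothetical. The reuse_ok callback stands in for the prompt-based determination of claims 7 and 17, which the claims do not specify in code form. The claims also leave the case of similarity exactly equal to the threshold open, and the sketch treats it as "lower" by assumption.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass(frozen=True)
class PicturePlan:
    # One of "reuse", "generate_from_text_and_picture", "generate_from_text".
    action: str
    # Identifier of the retrieved picture when one is reused or used as a reference.
    reference_picture: Optional[str]


def plan_picture(
    scored_candidates: Sequence[Tuple[str, float]],
    threshold: float,
    reuse_ok: Callable[[str], bool],
) -> PicturePlan:
    """Choose how to obtain a picture for one picture expression text.

    scored_candidates: (picture_id, similarity) pairs from a hypothetical search step.
    threshold: stand-in for the "predetermined threshold value" of claim 9.
    reuse_ok: hypothetical stand-in for the determination step of claims 6 and 7.
    """
    if not scored_candidates:
        # Nothing was retrieved, so only the text can drive generation.
        return PicturePlan("generate_from_text", None)

    # Keep the candidate with the greatest similarity.
    best_id, best_similarity = max(scored_candidates, key=lambda c: c[1])

    if reuse_ok(best_id):
        # Claim 6: output the retrieved picture without generating a new one.
        return PicturePlan("reuse", best_id)
    if best_similarity > threshold:
        # Claim 9: generate from the text together with the retrieved picture.
        return PicturePlan("generate_from_text_and_picture", best_id)
    # Claim 9: generate from the text alone (equality handled here by assumption).
    return PicturePlan("generate_from_text", None)
```

For instance, plan_picture([("remote_01", 0.9)], 0.5, lambda _: False) selects generation conditioned on both the text and the retrieved picture, because the similarity exceeds the threshold and reuse was declined.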

Description

CROSS-REFERENCE TO RELATED APPLICATION(S)
This application claims priority from and the benefit of Korean Patent Application No. 10-2024-0155301, filed on Nov. 5, 2024, which is hereby incorporated by reference for all purposes as if set forth herein.
BACKGROUND
1. Technical Field
The present disclosure relates to a multi-modal chatting apparatus and method.
2. Description of Related Art
A text- or voice-oriented conversation is one of the most basic methods of human communication. However, when the purpose is to convey complicated concepts or situations rather than simple meanings, text or voice alone does not support efficient and fast understanding between the individuals involved. This occurs in various contexts. In an educational conversation between a student and a teacher, the teacher draws a picture on a chalkboard or a scratch pad and explains it in order to help the student understand more easily. In this case, the picture is a more efficient tool than text in helping the student grasp a problem. This is especially noticeable in conversations with the elderly or socially disadvantaged individuals. For example, when explaining the functions of an air conditioner or a TV remote controller to elderly parents who do not live with their children, it is difficult to explain the functions of the remote controller buttons in detail and efficiently using only text or voice in a way that the parents can easily understand. Moreover, even in the conversation processing field, which has developed rapidly in recent years, there are many cases in conversations between a system and socially disadvantaged users, including the elderly, in which it is difficult to make the user understand a specific concept or fact through text or voice alone.
The development of deep learning-based AI technology has brought significant advancements to various technologies in the natural language processing field. The conversation processing field is no exception, and has made clear progress not only in simple chatting with a system but also in goal-oriented conversation. For this reason, there have been many attempts to apply conversation processing models to various fields. Examples include a care service for the socially disadvantaged, including the elderly, tutoring services for language or mathematical problems, a medical service, and a commodity sales service. However, a conversation that uses only text or voice has difficulty maintaining efficient communication between a system and a user. For example, in a conversation with elderly people, using only text to explain a specific concept, a fact, or how to use an object has clear limits. In some cases, the desired efficiency may be obtained by rendering the spoken content as a picture and sharing that picture along with the text explanation. The same is true in the tutoring domain. In general, when trying to solve mathematical problems, many people come to understand what the problems represent by drawing pictures. For example, a teacher who teaches mathematical problems helps students understand by drawing pictures on a blackboard or presenting pictures in a practice book when the teacher feels, while explaining orally, that the students lack understanding.
SUMMARY
Various embodiments are directed to providing a multi-modal chatting apparatus and method which may help a user understand more efficiently by automatically generating and presenting a picture related to a conversation, if necessary, in a situation in which a user and a system have a conversation based on text or voice.
A multi-modal chatting method according to an embodiment of the present disclosure includes a text system response generation step of generating a text system response that needs to be now spoken based on conversation context between a system and a user, a picture expression generation step of generating picture expression text that expresses contents to be expressed by a picture based on the generated text system response, and a picture generation step of generating a picture based on the generated picture expression text.
In an embodiment, the picture expression generation step includes steps of generating a prompt for generating the picture expression text and generating the picture expression text by inputting the generated prompt to a generative language model.
In an embodiment, the prompt includes a command that determines whether to output the generated text system response without any change or to generate the generated text system response in picture and that enables the picture expression text to be generated when the generated text system response needs to be generated in picture, user information including characteristics of the user, previous conversation context, and the text system response.
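The four prompt elements named above (command, user information, previous conversation context, and text system response) might be assembled as in the following minimal Python sketch. It illustrates the idea only and is not the disclosed prompt. The function names, the section labels, the wording of the command, and the NO_PICTURE sentinel are all assumptions. The source states only that the generative language model outputs "a signal" when no picture is to be generated. The language model is passed in as a plain callable so that the sketch does not depend on any particular model API.

```python
from typing import Callable, Optional, Sequence

# Hypothetical sentinel; the source only says the model outputs "a signal".
NO_PICTURE = "NO_PICTURE"


def build_picture_expression_prompt(
    user_info: str,
    previous_context: Sequence[str],
    text_response: str,
) -> str:
    """Combine the command, user information, context, and system response."""
    # Assumed wording of the command; the disclosure does not give the text.
    command = (
        "Decide whether the system response below should also be shown as a picture. "
        f"If not, output exactly {NO_PICTURE}. "
        "Otherwise, output a short description of the picture to draw."
    )
    history = "\n".join(previous_context)
    return (
        f"{command}\n\n"
        f"[User information]\n{user_info}\n\n"
        f"[Previous conversation]\n{history}\n\n"
        f"[System response]\n{text_response}"
    )


def picture_expression(
    prompt: str,
    language_model: Callable[[str], str],
) -> Optional[str]:
    """Return picture expression text, or None when the model signals no picture."""
    output = language_model(prompt).strip()
    return None if output == NO_PICTURE else output
```

When picture_expression returns None, only the text system response would be output. Otherwise, the returned text would be handed to the picture generation step.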