
KR-20260067770-A - DEVICE AND METHOD FOR RESOLVING TEXTUAL AMBIGUITY THROUGH VISUAL LANGUAGE INFERENCE MODEL


Abstract

The present invention relates to a textual ambiguity resolution device based on a visual language inference model, comprising: a pun identification unit that receives an image and a source text, uses the image as a clue to identify pun phrases in the source text, and generates a plurality of candidate pun translations; a pun semantic interpretation unit that inputs the plurality of candidate pun translations into a visual language inference model and determines a pun translation based on consistency with the image; and a pun reconstruction unit that reconstructs the pun translation to reflect the intent of the source text.

Inventors

  • 유영재
  • 정지완

Assignees

  • 연세대학교 산학협력단 (Yonsei University Industry-Academic Cooperation Foundation)

Dates

Publication Date
2026-05-13
Application Date
2024-11-06

Claims (10)

  1. A textual ambiguity resolution device using a visual language inference model, comprising: a pun identification unit that receives an image and a source text as input, uses the image as a clue to identify a pun phrase in the source text, and generates a plurality of candidate pun translations; a pun semantic interpretation unit that inputs the plurality of candidate pun translations into a visual language inference model and determines a pun translation based on consistency with the image; and a pun reconstruction unit that reconstructs the pun translation to reflect the intent of the source text.
  2. The device of claim 1, wherein the pun identification unit identifies associations between visual information in the image and the source text to detect important phrases in the source text.
  3. The device of claim 2, wherein the pun identification unit determines a multimodal context for the important phrase to calculate a pun likelihood, and determines the important phrase to be the pun phrase when the pun likelihood is greater than or equal to a specified criterion.
  4. The device of claim 3, wherein the pun identification unit generates the plurality of candidate pun translations by interpreting the pun phrase according to the multimodal context.
  5. The device of claim 1, wherein the pun semantic interpretation unit inputs each of the plurality of candidate pun translations into the visual language inference model to detect visual cues in the image.
  6. The device of claim 5, wherein the pun semantic interpretation unit determines, through the detected visual cues, whether the image can be interpreted as a pun disambiguator image reflecting the corresponding candidate pun translation.
  7. The device of claim 6, wherein the pun semantic interpretation unit adopts the corresponding candidate pun translation as the pun translation when the image is interpreted as a pun disambiguator image reflecting that candidate pun translation.
  8. The device of claim 1, wherein the pun reconstruction unit infers the intent of the source text based on visual information in the image and rearranges the pun translation based on the delivery of key keywords in the translation.
  9. The device of claim 8, wherein the pun reconstruction unit inputs the rearranged pun translation into the visual language inference model to re-determine consistency with the image.
  10. A textual ambiguity resolution method performed in a textual ambiguity resolution device, comprising: a pun identification step of receiving an image and a source text as input, identifying a pun phrase in the source text using the image as a clue, and generating a plurality of candidate pun translations; a pun semantic interpretation step of inputting the plurality of candidate pun translations into a visual language inference model and determining a pun translation based on consistency with the image; and a pun reconstruction step of reconstructing the pun translation to reflect the intent of the source text.
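The three claimed stages (pun identification, VLM-based semantic interpretation, and reconstruction) can be sketched as a minimal pipeline. The patent does not disclose an implementation, so every function name, the toy lexicon, and the stand-in scoring function below are illustrative assumptions; a real system would replace `vlm_score` with an actual vision-language model call.

```python
# Hypothetical sketch of the claimed three-stage pipeline; all names are
# illustrative, not from the patent.
from typing import Callable, List


def identify_pun_candidates(source_text: str,
                            pun_lexicon: dict) -> List[str]:
    """Stage 1 (pun identification unit): detect an ambiguous phrase and
    enumerate its candidate readings, here via a toy lexicon."""
    for phrase, readings in pun_lexicon.items():
        if phrase in source_text:
            return readings
    return []


def select_pun_translation(candidates: List[str],
                           image: str,
                           vlm_score: Callable[[str, str], float],
                           threshold: float = 0.5) -> str:
    """Stage 2 (pun semantic interpretation unit): score each candidate's
    consistency with the image and keep the best one above a criterion."""
    best, best_score = "", float("-inf")
    for cand in candidates:
        score = vlm_score(image, cand)
        if score > best_score:
            best, best_score = cand, score
    return best if best_score >= threshold else ""


def reconstruct_translation(translation: str, intent_keyword: str) -> str:
    """Stage 3 (pun reconstruction unit): rearrange the translation so the
    keyword carrying the source intent is delivered first."""
    if intent_keyword in translation and not translation.startswith(intent_keyword):
        rest = translation.replace(intent_keyword, "").strip(" ,")
        return f"{intent_keyword}, {rest}"
    return translation


# Toy run: "bank" is ambiguous; an image of a river resolves it.
lexicon = {"bank": ["financial bank reading", "river bank reading"]}


def toy_vlm(image: str, caption: str) -> float:
    # Stand-in for a VLM consistency score.
    return 1.0 if ("river" in image and "river" in caption) else 0.1


cands = identify_pun_candidates("she sat by the bank", lexicon)
chosen = select_pun_translation(cands, "photo of a river", toy_vlm)
final = reconstruct_translation(chosen, "reading")
```

Injecting the scorer as a callable keeps the sketch testable without any model dependency, mirroring how claim 9 re-checks consistency by feeding the rearranged translation back into the same model.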

Description

Device and Method for Resolving Textual Ambiguity Through Visual Language Inference Model

The present invention relates to a technology for resolving textual ambiguity through a visual language inference model, and more specifically, to a textual ambiguity resolution device and method capable of inputting a plurality of candidate pun translations into a visual language inference model to determine a pun translation based on consistency with an image, and of reconstructing the pun translation to reflect the intent of the source text.

In order for a natural language processing system to fully mimic human language comprehension, it must be able to resolve various types of ambiguity, including the following:

  1. Lexical ambiguity: a word can have multiple meanings.
  2. Syntactic ambiguity: a word arrangement can be parsed as multiple grammatical structures.
  3. Scope ambiguity: a sentence contains multiple quantifiers or scope expressions whose relative order is ambiguous.
  4. Ellipsis ambiguity: the identity of an omitted word or phrase is ambiguous.
  5. Collective/distributive ambiguity: a plural expression can be interpreted collectively or distributively.
  6. Implicature ambiguity: the implied meaning of a sentence is ambiguous.
  7. Presupposition ambiguity: the presupposition implied by a sentence is ambiguous.
  8. Idiomatic ambiguity: a combination of words can be read both as an idiom and as a literal expression.
  9. Referential ambiguity: the referent of a pronoun is ambiguous.
  10. Generic/non-generic ambiguity: it is ambiguous whether a sentence describes a general characteristic or a specific event.
  11. Type/entity ambiguity: it is ambiguous whether a term refers to a type or to a specific entity.

Various technological innovations are required for natural language processing systems to resolve such ambiguity effectively.
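As a toy illustration of lexical ambiguity (the first type above), a word's sense can often be selected by overlap between a sense's context clues and the surrounding sentence. The sense inventory and scoring below are illustrative assumptions, not part of the invention.

```python
# Toy lexical-ambiguity resolver: pick the sense whose illustrative
# context-clue words overlap most with the sentence.
SENSES = {
    "bass": {
        "fish": {"caught", "river", "fishing", "lake"},
        "low-frequency sound": {"guitar", "music", "speaker", "volume"},
    }
}


def disambiguate(word: str, sentence: str) -> str:
    """Return the sense of `word` best supported by the sentence's tokens."""
    tokens = set(sentence.lower().split())
    best_sense, best_overlap = "", -1
    for sense, clues in SENSES[word].items():
        overlap = len(clues & tokens)
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense


s1 = disambiguate("bass", "he caught a bass in the river")
s2 = disambiguate("bass", "turn up the bass on the speaker")
```

The present invention extends this idea beyond textual context: the disambiguating clues come from an accompanying image rather than from neighboring words alone.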
First, models capable of better understanding and interpreting context are needed. To this end, methods to improve self-attention mechanisms, such as those in Transformers, or to enhance the performance of language models such as BERT, should be considered. Second, hybrid learning methods combining supervised and unsupervised techniques should be considered to strengthen the handling of polysemy. In addition, methods for integrating external knowledge graphs or knowledge bases that aid in resolving ambiguity should also be taken into account.

Korean Published Patent Application No. 10-2022-7005746 (September 3, 2020) relates to resolving natural language ambiguities with respect to a simulated reality setting. In an exemplary embodiment, a simulated reality setting having one or more virtual objects is displayed. A stream of gaze events is generated from the simulated reality setting and a stream of gaze data. A speech input is received within a time period, and a domain is determined based on the text representation of the speech input. One or more gaze events are identified from the stream of gaze events based on a plurality of event times for the time period. The identified gaze events are used to determine parameter values for unresolved parameters of the domain. A set of actions representing user intent regarding the speech input is determined based on the parameter values, and the set of actions is performed.

FIG. 1 is a diagram illustrating a textual ambiguity resolution device using a visual language inference model according to an embodiment of the present invention. FIG. 2 is a diagram illustrating the system configuration of the textual ambiguity resolution device of FIG. 1. FIG. 3 is a flowchart illustrating a method for resolving textual ambiguity through a visual language inference model according to the present invention.
FIG. 4 compares a homographic pun (left) and a heterographic pun (right) from the UNPIE dataset according to the present invention, together with the visual annotation used to resolve each. FIG. 5 is a diagram showing the diversity of themes appearing in the visual premises and conclusions of the VisArgs dataset according to the present invention. FIG. 6 is an example diagram of the process for generating a pun disambiguator image according to the present invention.

The description of the present invention is merely an example for structural or functional explanation, and therefore the scope of the present invention should not be interpreted as limited to the examples described herein. That is, since the embodiments are subject to various modifications and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical concept. Furthermore, the objectives or effects presented in the present invention do not imply that a specific embodiment must include all of them or only those effects.