US-20260127381-A1 - DEVICE AND METHOD FOR RESOLVING TEXTUAL AMBIGUITY THROUGH VISUAL LANGUAGE INFERENCE MODEL

US 20260127381 A1

Abstract

The present disclosure relates to a device for resolving textual ambiguity through a visual language inference model, wherein the device includes: a pun identification unit that receives an image and an original text as input, identifies a pun phrase in the original text using the image as a clue, and generates a plurality of candidate pun translations; a pun semantic interpretation unit that inputs the plurality of candidate pun translations into the visual language inference model and determines a pun translation based on consistency with the image; and a pun reconstruction unit that reconstructs the pun translation to reflect an intention of the original text.

Inventors

  • Youngjae YU
  • Jiwan CHUNG

Assignees

  • UIF (UNIVERSITY INDUSTRY FOUNDATION), YONSEI UNIVERSITY

Dates

Publication Date
May 7, 2026
Application Date
Mar. 31, 2025
Priority Date
Nov. 6, 2024

Claims (10)

  1. A device for resolving textual ambiguity through a visual language inference model, the device comprising: a pun identification unit that receives an image and an original text as input, identifies a pun phrase in the original text using the image as a clue, and generates a plurality of candidate pun translations; a pun semantic interpretation unit that inputs the plurality of candidate pun translations into the visual language inference model and determines a pun translation based on consistency with the image; and a pun reconstruction unit that reconstructs the pun translation to reflect an intention of the original text.
  2. The device of claim 1, wherein the pun identification unit detects an important phrase in the original text by understanding the correlation between visual information in the image and the original text.
  3. The device of claim 2, wherein the pun identification unit determines a multimodal context for the important phrase to compute a pun possibility, and determines the important phrase to be the pun phrase when the pun possibility is higher than a specific threshold.
  4. The device of claim 3, wherein the pun identification unit interprets the pun phrase according to the multimodal context to generate the plurality of candidate pun translations.
  5. The device of claim 1, wherein the pun semantic interpretation unit inputs each of the plurality of candidate pun translations into the visual language inference model to detect a visual clue in the image.
  6. The device of claim 5, wherein the pun semantic interpretation unit determines, through the detected visual clue, whether the image can be interpreted as a pun interpretation image reflecting the corresponding candidate pun translation.
  7. The device of claim 6, wherein the pun semantic interpretation unit adopts the corresponding candidate pun translation as the pun translation when the image is interpreted as the pun interpretation image reflecting the corresponding candidate pun translation.
  8. The device of claim 1, wherein the pun reconstruction unit infers the intention of the original text based on visual information in the image and rearranges the pun translation around delivery of a core keyword in the pun translation.
  9. The device of claim 8, wherein the pun reconstruction unit inputs the rearranged pun translation into the visual language inference model to re-determine the consistency with the image.
  10. A method for resolving textual ambiguity through a visual language inference model, performed by a device for resolving the textual ambiguity, the method comprising: a pun identification stage of receiving an image and an original text as input, identifying a pun phrase in the original text using the image as a clue, and generating a plurality of candidate pun translations; a pun semantic interpretation stage of inputting the plurality of candidate pun translations into the visual language inference model and determining a pun translation based on consistency with the image; and a pun reconstruction stage of reconstructing the pun translation to reflect an intention of the original text.
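The three-stage pipeline recited in claims 1 and 10 can be sketched in code. The following is a minimal illustrative sketch, not the patented implementation: the visual language inference model is replaced by a toy tag-overlap heuristic, and the function names, example sentence, and candidate translations are all hypothetical.

```python
# Hypothetical sketch of the three-stage pipeline in claims 1 and 10.
# The visual language inference model is stubbed; all names are illustrative.

def score_consistency(image, candidate):
    """Stub for the visual language inference model (claims 5-6): returns how
    well a candidate pun translation is supported by visual clues in the image.
    Here: the count of candidate words that match the image's tags."""
    return sum(word in image["tags"] for word in candidate.split())

def identify_pun(image, text):
    """Pun identification stage (claims 2-4): detect the pun phrase using the
    image as a clue and generate candidate translations. A real system would
    be model-driven; here one ambiguous phrase is hard-coded."""
    pun_phrase = "bat"
    candidates = ["flying mammal", "baseball club"]
    return pun_phrase, candidates

def interpret_pun(image, candidates):
    """Pun semantic interpretation stage (claims 5-7): adopt the candidate
    most consistent with the image."""
    return max(candidates, key=lambda c: score_consistency(image, c))

def reconstruct_pun(image, translation):
    """Pun reconstruction stage (claims 8-9): rearrange the translation
    around a core keyword, shown here as a trivial rewrite."""
    core = translation.split()[-1]
    return f"{core}: {translation}"

def resolve(image, text):
    """End-to-end pipeline: identify, interpret, reconstruct."""
    _, candidates = identify_pun(image, text)
    best = interpret_pun(image, candidates)
    return reconstruct_pun(image, best)

image = {"tags": {"baseball", "club", "player"}}
print(resolve(image, "He went to the game with his bat."))
# -> club: baseball club
```

The key design point mirrored from the claims is that disambiguation is deferred to an image-consistency score: the text stage only proposes candidates, and the visual stage selects among them.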

Description

CROSS-REFERENCE TO RELATED PATENT APPLICATION

This application claims the benefit of Korean Patent Application No. 10-2024-0156315, filed on Nov. 6, 2024, in the Korean Intellectual Property Office, the disclosure of which is incorporated herein in its entirety by reference.

TECHNICAL FIELD

The present disclosure relates to a textual ambiguity resolution technique using a visual language inference model, and more specifically, to a visual language inference device and method that input a plurality of candidate pun translations into a visual language inference model, determine a pun translation based on consistency with an image, and reconstruct the pun translation to reflect an intention of the original text.

BACKGROUND

In order for a natural language processing system to fully mimic human language comprehension, various types of ambiguity need to be resolved. The types of ambiguity are as follows:

  1. Lexical ambiguity: a word has multiple meanings.
  2. Syntactic ambiguity: a word arrangement can be parsed into multiple grammatical structures.
  3. Scope ambiguity: a sentence includes multiple quantifiers or scope expressions whose relative order is ambiguous.
  4. Omission ambiguity: the identity of an omitted word or phrase is ambiguous.
  5. Collective/distributive ambiguity: plural expressions may be interpreted collectively or distributively.
  6. Implication ambiguity: the meaning implied by a sentence is ambiguous.
  7. Presupposition ambiguity: the premise implied by a sentence is ambiguous.
  8. Idiomatic ambiguity: a combination of words may be interpreted either literally or as an idiom.
  9. Referential ambiguity: the referent of a pronoun is ambiguous.
  10. General/non-general ambiguity: it is ambiguous whether a sentence describes a general characteristic or a specific event.
  11. Type/entity ambiguity: it is ambiguous whether a term refers to a type or an entity.

In order for the natural language processing system to effectively resolve ambiguity, various technological innovations are needed. First, models that better understand and interpret context are needed; to this end, methods such as improving self-attention mechanisms such as the Transformer, or improving the performance of language models such as BERT, need to be considered. Second, mixed learning methods that combine supervised and unsupervised learning techniques need to be considered to enhance the ability to handle equivocality. In addition, methods for integrating external knowledge graphs or knowledge bases that help resolve ambiguity also need to be considered.

Korean Patent Application Publication No. 10-2022-7005746 (Sep. 3, 2020) relates to resolving natural language ambiguities with respect to a simulated reality setting. In an exemplary embodiment, a simulated reality setting having one or more virtual objects is displayed. A stream of gaze events is generated from the simulated reality setting and a stream of gaze data. A speech input is received within a time period, and a domain is determined based on a text representation of the speech input. Based on the time period and a plurality of event times for the stream of gaze events, one or more gaze events are identified from the stream of gaze events. The identified one or more gaze events are used to determine a parameter value for an unresolved parameter of the domain. A set of tasks representing a user intent for the speech input is determined based on the parameter value, and the set of tasks is performed.

RELATED ART DOCUMENT

Patent Document: Korean Patent Application Publication No. 10-2022-7005746, Sep. 3, 2020

SUMMARY

An embodiment of the present disclosure provides a device and method for resolving textual ambiguity through a visual language inference model capable of receiving an image and an original text, identifying a pun phrase in the original text using the image as a clue, and generating a plurality of candidate pun translations. An embodiment of the present disclosure provides a device and method for resolving textual ambiguity through a visual language inference model capable of inputting a plurality of candidate pun translations into the visual language inference model and determining a pun translation based on consistency with the image. An embodiment of the present disclosure provides a device and method for resolving textual ambiguity through a visual language inference model capable of reconstructing a pun translation to reflect an intention of the original text.

According to embodiments, the device for resolving textual ambiguity through a visual language inference model includes: a pun identification unit that receives an image and an original text as input, identifies a pun phrase in the original text using the image as a clue, and generates a plurality of candidate pun translations; a pun semantic interpretation unit that inputs the plurality of candidate pun translations into the visual language inference model and determines a