
KR-20260067858-A - APPARATUS AND METHOD FOR GENERATING TEXT FOR HIGH-DIMENSIONAL IMAGE BY INTEGRATION OF SCENE GRAPHS AND VISION-LANGUAGE MODELS

KR 20260067858 A

Abstract

The present invention relates to an apparatus for generating text for high-dimensional images through the integration of a scene graph and a vision-language model. More specifically, the apparatus comprises: a scene graph information extraction unit that extracts relationship information in a triplet structure from the scene graph of an input image; an object detail information extraction unit that generates a sentence about the input image using a vision-language model; and an information combining unit that combines, through a large language model, the scene graph information extracted in the triplet structure with the sentence generated by the vision-language model to produce a natural sentence describing the input image. The present invention further relates to a corresponding method for generating text for a high-dimensional image, in which each step is performed in the text generation device, comprising: (1) a scene graph information extraction step of extracting relationship information in a triplet structure from the scene graph of an input image; (2) an object detail information extraction step of generating a sentence about the input image using a vision-language model; and (3) an information combination step of combining, through a large language model, the scene graph information extracted in the triplet structure with the sentence generated by the vision-language model to produce a natural sentence describing the input image.
According to the proposed apparatus and method, the image is analyzed by integrating two sources of information: detailed object features obtained from a vision-language model, which learns visual information and the language describing it simultaneously and therefore extracts rich image features, and relationship information extracted from a scene graph. This allows complex scenes to be understood precisely and improves analysis accuracy. In addition, because the relational information from the scene graph and the detailed object features from the vision-language model are integrated by a large language model, the two kinds of information complement each other, yielding natural, contextually appropriate text that describes the input image.
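The three-stage pipeline described in the abstract can be sketched as plain Python. All function names are illustrative assumptions, and the vision-language model and large language model are replaced by simple string-based stand-ins; only the data flow (triplets + caption combined into an LLM prompt) follows the patent's description:

```python
# Illustrative sketch of the three-stage pipeline from the abstract.
# The model calls below are string-based stand-ins, not real models;
# all names are hypothetical, not taken from the patent.

def extract_triplets(scene_graph):
    """Stage 1: linearize scene-graph edges into (subject, relation, object) triplets."""
    return [(e["subject"], e["relation"], e["object"]) for e in scene_graph["edges"]]

def vlm_caption(image):
    """Stage 2: stand-in for a vision-language model that captions the input image."""
    return f"A photo showing {image['content']}."

def combine_with_llm(triplets, caption):
    """Stage 3: build the combined prompt that a large language model would
    rewrite into a natural sentence; here we simply return the prompt."""
    relations = "; ".join(f"{s} {r} {o}" for s, r, o in triplets)
    return f"Relations: {relations}. Details: {caption}"

scene_graph = {"edges": [{"subject": "man", "relation": "riding", "object": "horse"}]}
image = {"content": "a man on a horse in a field"}

prompt = combine_with_llm(extract_triplets(scene_graph), vlm_caption(image))
print(prompt)
```

In the actual system, the prompt built in stage 3 would be passed to a GPT-style model, which rewrites the relation labels and the caption into a single fluent description.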

Inventors

  • 고병철
  • 김동영
  • 김성진

Assignees

  • 계명대학교 산학협력단

Dates

Publication Date
2026-05-13
Application Date
2024-11-06

Claims (16)

  1. A text generation device (100) for high-dimensional images through the integration of a scene graph and a vision-language model, comprising: a scene graph information extraction unit (110) that extracts relationship information of the scene graph in a triplet structure from the scene graph of an input image; an object detail information extraction unit (120) that generates a sentence about the input image using a vision-language model; and an information combining unit (130) that combines, through a large language model, the scene graph information extracted in the triplet structure with the sentence generated using the vision-language model to generate a natural sentence describing the input image.
  2. The text generation device (100) of claim 1, wherein the scene graph infers and represents the relationships between objects in the image.
  3. The text generation device (100) of claim 2, wherein the scene graph information is a triplet structure composed of a label of a first object, a relationship label between the first object and a second object, and a label of the second object.
  4. The text generation device (100) of claim 1, wherein the object detail information extraction unit (120) generates a sentence containing detailed information about objects in the input image.
  5. The text generation device (100) of claim 4, wherein the vision-language model is a multimodal model that maps images and text into the same embedding space and learns their associations through contrastive learning.
  6. The text generation device (100) of claim 5, wherein the vision-language model is CLIP (Contrastive Language-Image Pre-training).
  7. The text generation device (100) of claim 1, wherein the information combining unit (130) uses, as an input prompt for the large language model, a combined sentence generated by combining the set of triplet relationship labels from the scene graph with the sentence generated by the vision-language model.
  8. The text generation device (100) of claim 7, wherein the large language model is a GPT (Generative Pre-trained Transformer) model.
  9. A method for generating text for a high-dimensional image through the integration of a scene graph and a vision-language model, wherein each step is performed in a text generation device (100), the method comprising: (1) a scene graph information extraction step (S110) of extracting relationship information of the scene graph in a triplet structure from the scene graph of an input image; (2) an object detail information extraction step (S120) of generating a sentence about the input image using a vision-language model; and (3) an information combining step (S130) of combining, through a large language model, the scene graph information extracted in the triplet structure with the sentence generated using the vision-language model to generate a natural sentence describing the input image.
  10. The method of claim 9, wherein the scene graph infers and represents the relationships between objects in the image.
  11. The method of claim 10, wherein the scene graph information is a triplet structure composed of a label of a first object, a relationship label between the first object and a second object, and a label of the second object.
  12. The method of claim 9, wherein the object detail information extraction step (S120) generates a sentence containing detailed information about objects in the input image.
  13. The method of claim 12, wherein the vision-language model is a multimodal model that maps images and text into the same embedding space and learns their associations through contrastive learning.
  14. The method of claim 13, wherein the vision-language model is CLIP (Contrastive Language-Image Pre-training).
  15. The method of claim 9, wherein the information combining step (S130) uses, as an input prompt for the large language model, a combined sentence generated by combining the set of triplet relationship labels from the scene graph with the sentence generated by the vision-language model.
  16. The method of claim 15, wherein the large language model is a GPT (Generative Pre-trained Transformer) model.
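Claims 5 and 13 describe the vision-language model as mapping images and text into a shared embedding space and scoring them by similarity, as CLIP does. A minimal sketch of that matching step, assuming toy hand-written embeddings in place of real encoder outputs (actual CLIP embeddings are high-dimensional vectors produced by trained image and text encoders):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors, the score CLIP-style models use."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy image and text embeddings in a shared 3-D space (illustrative only;
# real CLIP vectors are 512-D or larger and come from trained encoders).
image_emb = [0.9, 0.1, 0.3]
text_embs = {
    "a man riding a horse": [0.88, 0.12, 0.31],
    "a bowl of fruit":      [0.05, 0.95, 0.10],
}

# The caption whose embedding is closest to the image embedding wins.
best = max(text_embs, key=lambda t: cosine(image_emb, text_embs[t]))
print(best)
```

Contrastive pre-training pushes matching image/text pairs together in this space and non-matching pairs apart, which is why the nearest caption is a usable description of the image.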

Description

Apparatus and Method for Generating Text for High-Dimensional Images by Integration of Scene Graphs and Vision-Language Models

The present invention relates to a text generation apparatus and method, and more specifically to a text generation apparatus and method for high-dimensional images through the integration of a scene graph and a vision-language model. The content of this section merely provides background information regarding embodiments of the present invention and does not constitute prior art. Image analysis is a central field of computer vision research and plays a key role in many real-world applications, including image captioning (Non-patent Literature 0001) and Visual Question Answering (VQA) (Non-patent Literature 0002), as well as autonomous driving, medical image analysis, and surveillance systems. The most basic step of image analysis is recognizing the objects contained in an image, but object recognition alone cannot adequately describe complex scenes. To address this, it is necessary to understand the interrelationships between objects and to analyze the scene more deeply based on those relationships. Against this backdrop, recent studies have introduced the concept of a scene graph (Non-patent Literature 0003, 0004). A scene graph aids the structural understanding of an image by explicitly expressing not only the objects in the image but also the relationships between them. However, because scene graphs focus only on relationships, they cannot sufficiently reflect detailed feature information about the objects themselves.
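The structural representation described above can be illustrated with a tiny in-memory scene graph: labeled nodes for objects and labeled directed edges for their relationships. The dictionary layout below is an illustrative assumption, not a format defined by the patent:

```python
# A scene graph as labeled nodes plus labeled directed edges.
# This layout is illustrative only, not a format from the patent.
scene_graph = {
    "nodes": {0: "man", 1: "horse", 2: "field"},
    "edges": [(0, "riding", 1), (1, "standing in", 2)],
}

# Each edge reads as a (first object, relation, second object) triplet,
# matching the triplet structure described in the claims.
for s, rel, o in scene_graph["edges"]:
    print(scene_graph["nodes"][s], rel, scene_graph["nodes"][o])
```

Note that this representation carries only object labels and relation labels; attributes such as color, texture, or pose are absent, which is exactly the gap the vision-language model is meant to fill.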
The aforementioned background technology is technical information that the inventors possessed for, or acquired during, the derivation of the present invention, and it cannot be regarded as technology publicly known before the filing of the present invention.

FIG. 1 is a diagram illustrating the configuration of a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 2 is a diagram illustrating the structure of a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example of an input image used in a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 4 is a diagram illustrating an example of a scene graph in a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating an example of scene graph information extracted by the scene graph information extraction unit of a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating the object detail information extraction unit of a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating the configuration of the vision-language model in a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the information combining unit of a text generation device for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 9 is a diagram illustrating the flow of a method for generating text for a high-dimensional image through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.
FIG. 10 is a diagram illustrating experimental results obtained using the text generation device and method for high-dimensional images through the integration of a scene graph and a vision-language model according to an embodiment of the present invention.

Embodiments of the present invention are described below with reference to the attached drawings so that those skilled in the art can easily practice the invention. However, the present invention may be embodied in various different forms and is not limited to the embodiments described herein. Furthermore, parts unrelated to the description have been omitted from the drawings in order to explain the present invention clearly, and similar parts are denoted by similar reference numerals throughout the specification. Througho