KR-20260064082-A - SYNTHETIC IMAGE GENERATION METHOD USING SCENE GRAPH FROM TEXT PROMPT
Abstract
A method for generating a synthetic image using a scene graph extracted from a text prompt includes the steps of: a data processing device receiving a text prompt; the data processing device generating a scene graph from the text prompt; the data processing device generating embedding vectors by embedding the scene graph; the data processing device generating a scene layout embedding by inputting the embedding vectors of objects among the embedding vectors into a layout network; and the data processing device generating a synthetic image by inputting the scene layout embedding into a diffusion model.
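The five claimed steps (prompt → scene graph → embeddings → layout network → diffusion model) can be sketched as a minimal pipeline. All function and class names below are hypothetical stand-ins with stubbed internals, not the patent's implementation; a real system would use learned networks at each stage.

```python
# Illustrative sketch of the claimed pipeline; embedding, layout, and
# generation stages are stubs standing in for trained neural networks.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]                      # nodes: objects in the prompt
    relations: List[Tuple[str, str, str]]   # edges: (subject, relation, object)

def extract_scene_graph(prompt: str) -> SceneGraph:
    # Toy extraction: a real system would use a parser or a language model.
    if " on " in prompt:
        subj, obj = prompt.split(" on ", 1)
        return SceneGraph([subj.strip(), obj.strip()],
                          [(subj.strip(), "on", obj.strip())])
    return SceneGraph([prompt.strip()], [])

def embed(graph: SceneGraph) -> dict:
    # Stub embedding: map each object node to a fixed-size vector.
    return {name: [float(len(name))] * 4 for name in graph.objects}

def layout_network(object_embeddings: dict) -> list:
    # Stub layout network: fuse the object vectors into one layout embedding.
    return [v for emb in object_embeddings.values() for v in emb]

def diffusion_model(layout_embedding: list) -> str:
    # Stub generator: a real diffusion model would return an image tensor.
    return f"image<{len(layout_embedding)}-dim layout>"

prompt = "a cat on a sofa"
graph = extract_scene_graph(prompt)
image = diffusion_model(layout_network(embed(graph)))
```

Note that, as in the claim, only the embedding vectors of the object nodes are fed to the layout network; the relation edges inform graph construction upstream.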
Inventors
- 문성원
- 남도원
- 유원영
- 이정수
- 이지원
Assignees
- 한국전자통신연구원 (Electronics and Telecommunications Research Institute, ETRI)
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-31
Claims (1)
- A method for generating a synthetic image using a scene graph extracted from a text prompt, comprising the steps of: receiving, by a data processing device, a text prompt; generating, by the data processing device, a scene graph from the text prompt; generating, by the data processing device, embedding vectors by embedding the scene graph; generating, by the data processing device, a scene layout embedding by inputting the embedding vectors of objects among the embedding vectors into a layout network; and generating, by the data processing device, a synthetic image by inputting the scene layout embedding into a diffusion model, wherein the scene graph includes nodes, which are objects included in the text prompt, and edges, which are relationship information between the objects included in the text prompt.
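The scene-graph structure recited in the claim (nodes are objects from the prompt; edges are relationship information between those objects) can be illustrated with a small container class. The class and method names are hypothetical, chosen only for this sketch.

```python
# Minimal illustration of the claimed scene-graph structure: nodes are the
# objects mentioned in the prompt, edges are relations between them.
class PromptSceneGraph:
    def __init__(self):
        self.nodes = []   # object names
        self.edges = []   # (subject, relation, object) triples

    def add_object(self, name):
        # Each object appears once as a node.
        if name not in self.nodes:
            self.nodes.append(name)

    def add_relation(self, subject, relation, obj):
        # An edge implies both endpoint objects exist as nodes.
        self.add_object(subject)
        self.add_object(obj)
        self.edges.append((subject, relation, obj))

# Graph for a prompt like "a dog next to a tree under the sun"
g = PromptSceneGraph()
g.add_relation("dog", "next to", "tree")
g.add_relation("tree", "under", "sun")
```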
Description
Synthetic Image Generation Method Using Scene Graph Extracted from Text Prompt

The technology described below relates to a technique for generating a training dataset for a separate artificial intelligence model using a generative model. Stable Diffusion is a representative text-to-image generation model. Text-to-image generation models generate a target image from a specific text prompt provided by the user. However, to achieve high performance, text-to-image generation models must be trained in advance on a sufficiently large training dataset.

Figure 1 is an example of a system for building a deep learning model. Figure 2 is an example of a synthetic image generation process. Figure 3 is an example illustrating the neural network models involved in the process of generating a synthetic image. Figure 4 shows the results of generating synthetic images using the PASCAL VOC classes. Figure 5 is an example of a data processing device that generates a synthetic image.

The technology described below is subject to various modifications and may have various embodiments; specific embodiments are illustrated in the drawings and described in detail. However, this is not intended to limit the technology described below to specific embodiments, and it should be understood to include all modifications, equivalents, and substitutions that fall within its spirit and scope. Terms such as first, second, A, and B may be used to describe various components, but the components are not limited by these terms, which are used solely to distinguish one component from another. For example, without departing from the scope of the technology described below, a first component may be named a second component, and similarly, a second component may be named a first component. The term "and/or" includes any combination of, or any one of, the multiple related items described.
In this specification, singular expressions should be understood to include plural expressions unless the context clearly indicates otherwise, and terms such as "includes" should be understood to mean that the described features, numbers, steps, actions, components, parts, or combinations thereof exist, without excluding the existence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.

Before describing the drawings in detail, it should be clarified that the division of components in this specification is based only on the primary function each component performs. That is, two or more components described below may be combined into a single component, or a single component may be divided into two or more components with more subdivided functions. Furthermore, each component described below may additionally perform some or all of the functions of other components in addition to its own primary function, and some of the primary functions of each component may instead be performed exclusively by other components. Furthermore, in performing a method or operation, the processes constituting the method may occur in an order different from the specified order unless a specific order is clearly indicated by the context. That is, the processes may be performed in the specified order, substantially simultaneously, or in the reverse order.

The technique described below is a method for generating, from a text prompt, synthetic images in which the placement of objects is important. The technique described below is a method for generating training data, using a generative model, for building a separate deep learning model that performs image generation, object recognition, object detection, and the like. Various types of text-to-image generation models have been developed.
For example, text-to-image generation models include models based on Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and diffusion models. Stable Diffusion, a representative diffusion-based model, is built on the Latent Diffusion Model (LDM). The technology described below can be applied to any of these types of text-to-image generation models. The following description explains how the data processing device generates synthetic data using a generative model. The data processing device can be physically implemented as various types of devices, such as PCs, smart devices, network servers, and chipsets dedicated to data processing.

FIG. 1 is an example of a system (100) for building a deep learning model. FIG. 1 shows an example where the data processing device (110) is a computer terminal or a server. The data processing device (110) receives a certain text prompt from the user. The data processing device (110) generates a certain synthetic image based on the input text prompt. The data processing device (110) generates a scene graph from the text prompt. The