Search

KR-102963107-B1 - Device and Method for Generating Video Based on Prompt

KR 102963107 B1

Abstract

A prompt-based video generation device and method are disclosed. The disclosed method comprises the steps of: (a) converting a prompt entered by a user, inputting the converted prompt into an LLM, and obtaining from the LLM frame-by-frame text information and frame-by-frame graph information for the prompt; (b) generating a scene graph using the frame-by-frame text information; (c) classifying each attribute of the scene graph as a subject, an object, or an interaction; (d) obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information; (e) calculating bounding box information of the interaction using the bounding box coordinate information of the subject and of the object; (f) inputting each attribute of the scene graph into a graph neural network to obtain an embedding vector for each attribute; (g) synthesizing the features of the embedding vector of each attribute with the bounding box coordinates of each attribute to generate a composite token; and (h) inputting the composite token into a video generation neural network to generate a video. The disclosed device and method have the advantage that a video reflecting the user's intent can be generated from a prompt written by a non-expert user, and that a video clearly displaying the interaction between the subject and the object can be generated.
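Step (e) above, as detailed in claim 1 (together with the coordinate swap of claim 6), can be sketched as follows. This is a minimal illustrative sketch, not the patent's implementation; it assumes the claims' coordinate convention in which the upper coordinate of a box is numerically larger than the lower one (y increasing upward), under which the rule yields the intersection of the subject and object boxes:

```python
def interaction_bbox(subject_box, object_box):
    """Sketch of step (e): interaction bounding box from subject and object boxes.

    Boxes are (left, top, right, bottom). Following claim 1, the rule takes the
    max of the left coordinates, min of the top coordinates, min of the right
    coordinates, and max of the bottom coordinates; in a y-up convention this
    is the intersection of the two boxes.
    """
    sl, st, sr, sb = subject_box
    ol, ot, or_, ob = object_box
    left = max(sl, ol)      # larger of the upper-left left coordinates
    top = min(st, ot)       # smaller of the upper-left upper coordinates
    right = min(sr, or_)    # smaller of the lower-right right coordinates
    bottom = max(sb, ob)    # larger of the lower-right lower coordinates
    # Degenerate-case handling per claim 6: swap when the order is inverted,
    # e.g. when the subject and object boxes do not overlap horizontally.
    if left > right:
        left, right = right, left
    if bottom > top:
        top, bottom = bottom, top
    return (left, top, right, bottom)
```

For two overlapping boxes such as `(0, 10, 5, 0)` and `(3, 8, 9, 2)`, the result is their overlap region `(3, 8, 5, 2)`; the swaps of claim 6 only engage when the boxes do not overlap along an axis.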

Inventors

  • 김은솔
  • 이은기
  • 온석준

Assignees

  • 한양대학교 산학협력단 (Industry-University Cooperation Foundation of Hanyang University)

Dates

Publication Date
2026-05-08
Application Date
2024-12-13

Claims (14)

  1. A prompt-based video generation method performed by a prompt-based video generation device, the method comprising the steps of: (a) converting a prompt entered by a user, inputting the converted prompt into an LLM, and obtaining from the LLM frame-by-frame text information and frame-by-frame graph information for the prompt; (b) generating a scene graph using the frame-by-frame text information; (c) classifying each attribute of the scene graph as a subject, an object, or an interaction; (d) obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information; (e) calculating bounding box information of the interaction using the bounding box coordinate information of the subject and the bounding box coordinate information of the object; (f) inputting each attribute of the scene graph into a graph neural network to obtain an embedding vector for each attribute; (g) generating a composite token by synthesizing the features of the embedding vector of each attribute with the bounding box coordinates of each attribute; and (h) generating a video by inputting the composite token into a video generation neural network, wherein in step (e): the larger of the left coordinates of the upper-left corners of the subject bounding box and the object bounding box is determined as the left coordinate of the upper-left corner of the interaction bounding box; the smaller of the upper coordinates of the upper-left corners of the subject bounding box and the object bounding box is determined as the upper coordinate of the upper-left corner of the interaction bounding box; the smaller of the right coordinates of the lower-right corners of the subject bounding box and the object bounding box is determined as the right coordinate of the lower-right corner of the interaction bounding box; and the larger of the lower coordinates of the lower-right corners of the subject bounding box and the object bounding box is determined as the lower coordinate of the lower-right corner of the interaction bounding box.
  2. The method of claim 1, wherein step (a) converts the prompt entered by the user so as to include a detailed request prompt requesting the frame-by-frame text information and the frame-by-frame graph information, and an example prompt showing examples of frame-by-frame text and frame-by-frame graphs.
  3. The method of claim 1, wherein the frame-by-frame graph information includes information on the objects included in the frame-by-frame text and bounding box coordinate information of each of the objects.
  4. The method of claim 3, wherein step (d) determines each object as a subject or an object based on the classification result of step (c), and then obtains the subject bounding box coordinate information and the object bounding box coordinate information.
  5. (deleted)
  6. The method of claim 1, wherein step (e) swaps the left coordinate and the right coordinate of the interaction bounding box when the left coordinate is determined to be larger than the right coordinate, and swaps the lower coordinate and the upper coordinate when the lower coordinate is determined to be larger than the upper coordinate.
  7. The method of claim 1, wherein the bounding box coordinates of step (g) are obtained by Fourier mapping of the bounding box coordinates.
  8. A prompt-based video generation device comprising: a processor; and a memory connected to the processor, wherein the processor executes the steps of: (a) converting a prompt entered by a user, inputting the converted prompt into an LLM, and obtaining from the LLM frame-by-frame text information and frame-by-frame graph information for the prompt; (b) generating a scene graph using the frame-by-frame text information; (c) classifying each attribute of the scene graph as a subject, an object, or an interaction; (d) obtaining bounding box coordinate information of the subject and bounding box coordinate information of the object from the frame-by-frame graph information; (e) calculating bounding box information of the interaction using the bounding box coordinate information of the subject and the bounding box coordinate information of the object; (f) inputting each attribute of the scene graph into a graph neural network to obtain an embedding vector for each attribute; (g) generating a composite token by synthesizing the features of the embedding vector of each attribute with the bounding box coordinates of each attribute; and (h) generating a video by inputting the composite token into a video generation neural network, and wherein in step (e): the larger of the left coordinates of the upper-left corners of the subject bounding box and the object bounding box is determined as the left coordinate of the upper-left corner of the interaction bounding box; the smaller of the upper coordinates of the upper-left corners of the subject bounding box and the object bounding box is determined as the upper coordinate of the upper-left corner of the interaction bounding box; the smaller of the right coordinates of the lower-right corners of the subject bounding box and the object bounding box is determined as the right coordinate of the lower-right corner of the interaction bounding box; and the larger of the lower coordinates of the lower-right corners of the subject bounding box and the object bounding box is determined as the lower coordinate of the lower-right corner of the interaction bounding box.
  9. The device of claim 8, wherein step (a) converts the prompt entered by the user so as to include a detailed request prompt requesting the frame-by-frame text information and the frame-by-frame graph information, and an example prompt showing examples of frame-by-frame text and frame-by-frame graphs.
  10. The device of claim 8, wherein the frame-by-frame graph information includes information on the objects included in the frame-by-frame text and bounding box coordinate information of each of the objects.
  11. The device of claim 10, wherein step (d) determines each object as a subject or an object based on the classification result of step (c), and then obtains the subject bounding box coordinate information and the object bounding box coordinate information.
  12. (deleted)
  13. The device of claim 8, wherein step (e) swaps the left coordinate and the right coordinate of the interaction bounding box when the left coordinate is determined to be larger than the right coordinate, and swaps the lower coordinate and the upper coordinate when the lower coordinate is determined to be larger than the upper coordinate.
  14. The device of claim 8, wherein the bounding box coordinates of step (g) are obtained by Fourier mapping of the bounding box coordinates.
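Claims 7 and 14 state only that the bounding box coordinates in step (g) are obtained by Fourier mapping; the patent text here does not fix the exact mapping. As a hedged illustration, a common sinusoidal Fourier feature (positional encoding) mapping could look like the following, where the function name `fourier_map` and the band count are assumptions, not from the patent:

```python
import numpy as np

def fourier_map(box, num_bands=4):
    """Hypothetical sketch of the Fourier mapping referenced in claims 7 and 14.

    Maps 4 normalized bounding box coordinates in [0, 1] to a
    4 * 2 * num_bands feature vector of sines and cosines.
    """
    box = np.asarray(box, dtype=float)               # (left, top, right, bottom)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi    # frequency bands 2^k * pi
    angles = box[:, None] * freqs[None, :]           # shape (4, num_bands)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()
```

Such a mapping lifts each scalar coordinate to a higher-dimensional, smoothly varying representation before it is synthesized with the graph-neural-network embedding vector in step (g).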

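The prompt conversion of claims 2 and 9 (a detailed request prompt plus an example prompt prepended to the user's text) can be sketched as a simple template. This is an illustrative assumption of how such a converted prompt might be assembled; the function name, wording, and example frame are hypothetical:

```python
def build_converted_prompt(user_prompt: str) -> str:
    """Hypothetical sketch of the prompt conversion of claims 2 and 9."""
    # Detailed request prompt: asks the LLM for frame-by-frame text
    # and frame-by-frame graph information.
    detail_request = (
        "For the scene below, output frame-by-frame text describing each frame "
        "and a frame-by-frame graph listing each object with its bounding box "
        "coordinates (left, top, right, bottom)."
    )
    # Example prompt: shows the expected output format (values are made up).
    example = (
        "Example:\n"
        "Frame 1 text: a dog chases a ball\n"
        "Frame 1 graph: dog [10, 40, 60, 90]; ball [70, 60, 85, 75]"
    )
    return f"{detail_request}\n\n{example}\n\nScene: {user_prompt}"
```

The converted prompt, rather than the raw user text, is what the device feeds to the LLM in step (a), so that the LLM's reply can be parsed into the per-frame text and graph information used by the later steps.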
Description

Device and Method for Generating Video Based on Prompt

The present invention relates to a video generation device and method, and more specifically, to a device and method for generating video based on a prompt.

Various methods for generating videos based on prompts (text) entered by a user are being studied, and various types of video generation neural networks generate videos from prompts. However, user-entered prompts are abbreviated and fail to fully express the user's intent in text, and there are clear limits to generating the user's intended video from prompts that do not fully reflect the intended meaning. For a neural network to generate a video accurately, the specific locations of, and interactions between, subjects and objects must be clearly defined; since the text entered by general users does not include such definitions, prompt-based video generation often fails to meet user requirements. In particular, when generating videos from prompts entered by users without sufficient knowledge of prompt writing, the unnaturalness of the video becomes even more pronounced in long videos. Furthermore, if the object's behavior is not clearly specified in the prompt, the relationship between the subject and the object and its attributes are not clearly defined, so attributes of that relationship may be reversed or omitted. Therefore, a video generation device and method are required that can generate the video intended by the user even from a prompt written by a non-expert.

FIG. 1 is a block diagram illustrating the overall structure of a prompt-based video generation device according to one embodiment of the present invention. FIG. 2 is a diagram showing an example of a converted prompt produced by the prompt conversion module according to an embodiment of the present invention. FIG. 3 is a diagram showing an example of the information output by an LLM when a converted prompt according to an embodiment of the present invention is input to the LLM. FIG. 4 is a block diagram illustrating the structure of a graph generation module according to one embodiment of the present invention. FIG. 5 is a diagram showing an example of a scene graph generated by a scene graph generation module according to an embodiment of the present invention. FIG. 6 is a drawing showing an example of interaction bounding boxes set according to an embodiment of the present invention. FIG. 7 is a block diagram showing the detailed structure of a bounding box coordinate acquisition module according to one embodiment of the present invention. FIG. 8 is a diagram showing the operation of a feature synthesis module according to an embodiment of the present invention. FIG. 9 is a flowchart illustrating the overall flow of a prompt-based video generation method according to an embodiment of the present invention.

Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to facilitate a comprehensive understanding of the methods, devices, and/or systems described herein. However, it is merely illustrative, and the present invention is not limited thereto. In describing the embodiments of the present invention, detailed descriptions of known technology related to the present invention will be omitted where it is determined that they may unnecessarily obscure the essence of the embodiments. Furthermore, the terms described below are defined in consideration of their functions in the present invention and may vary depending on the intentions or practices of users or operators; such definitions should therefore be based on the content of this specification as a whole.

Terms used in the detailed description are intended merely to describe specific embodiments and should not be limiting. Unless explicitly stated otherwise, singular expressions include the plural. In this description, expressions such as "include" or "comprising" refer to certain characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof, and should not be interpreted as excluding the existence or possibility of one or more other characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof.

Referring to FIG. 1, a prompt-based video generation device according to one embodiment of the present invention includes a prompt conversion module (100), a graph generation module (200), a bounding box coordinate acquisition module (300), a graph neural network (400), a feature synthesis module (500), and a video generation neural network (600). In addition, the present invention