KR-20260062383-A - KINETIC TYPOGRAPHY VIDEO GENERATION SYSTEM AND CONTROL METHOD THEREOF, AND LEARNING METHOD OF KINETIC TYPOGRAPHY VIDEO GENERATION SYSTEM
Abstract
The present invention relates to a kinetic typography video generation system, a method for controlling the same, and a learning method for a kinetic typography video generation system. More specifically, the present invention relates to a learning method for the kinetic typography video generation model of such a system.
Inventors
- 박선미
- 전해곤
- 배인환
- 신승현
Assignees
- 광주과학기술원
Dates
- Publication Date: 2026-05-07
- Application Date: 2024-10-29
Claims (10)
- A learning method for a kinetic typography video generation system, the method comprising: inputting a learning-target typography video into an encoder; obtaining a latent vector for the video from the encoder; inputting the latent vector into temporal-spatial processing blocks that process spatial information and temporal information; injecting a static caption and a dynamic caption for text included in the video into the temporal-spatial processing blocks; obtaining, from the temporal-spatial processing blocks, a latent vector reflecting the spatial information and a latent vector reflecting the temporal information; inputting the latent vector reflecting the spatial information and the latent vector reflecting the temporal information into a decoder; obtaining a final typography video for the input from the decoder; and training a typography generation model using the final typography video and the learning-target typography video.
- The method of claim 1, wherein the temporal-spatial processing blocks include a spatial block for processing the spatial information and a temporal block for processing the temporal information, and wherein the spatial block and the temporal block exist as a pair.
- The method of claim 2, wherein the injecting includes: injecting the static caption, which describes the spatial information of the text, into the spatial block; and injecting the dynamic caption, which describes the temporal information of the text, into the temporal block, wherein the static caption includes a description of at least one of a visual appearance of the text and a characteristic of a background included in the video, and the dynamic caption includes a description of at least one of a motion, an appearance order, and a change pattern of the text over temporal changes of the video.
- The method of claim 3, wherein the spatial block includes a spatial downsampling block and a spatial upsampling block, and the temporal block includes a temporal downsampling block and a temporal upsampling block.
- The method of claim 3, wherein the injecting further includes injecting a word caption for the text into the temporal-spatial processing blocks, wherein the word caption includes a description of an overall structure and meaning of the text.
- The method of claim 5, wherein injecting the word caption includes: specifying, among the temporal-spatial processing blocks, at least one block into which the word caption is to be injected; and injecting the word caption into the specified block.
- The method of claim 2, wherein the spatial block outputs the latent vector reflecting the spatial information through a preconfigured spatial diffusion mechanism, the temporal block outputs the latent vector reflecting the temporal information through a preconfigured temporal diffusion mechanism, and the decoder generates the final typography video by combining the latent vectors output from the spatial block and the temporal block.
- The method of claim 1, wherein the training includes calculating, using a preset loss function, a loss between the learning-target typography video and the final typography video generated by the decoder, and training the typography generation model so that the loss is reduced.
- A kinetic typography video generation system comprising a memory and at least one processor, wherein the memory and the processor cooperate to: input a learning-target typography video into an encoder; obtain a latent vector for the video from the encoder; input the latent vector into temporal-spatial processing blocks that process spatial information and temporal information; inject a static caption and a dynamic caption for text included in the video into the temporal-spatial processing blocks; obtain, from the temporal-spatial processing blocks, a latent vector reflecting the spatial information and a latent vector reflecting the temporal information; input the latent vector reflecting the spatial information and the latent vector reflecting the temporal information into a decoder; obtain a final typography video for the input from the decoder; and train a typography generation model using the final typography video and the learning-target typography video.
- A program stored on a computer-readable recording medium and executed by one or more processors of an electronic device, the program including instructions for performing: inputting a learning-target typography video into an encoder; obtaining a latent vector for the video from the encoder; inputting the latent vector into temporal-spatial processing blocks that process spatial information and temporal information; injecting a static caption and a dynamic caption for text included in the video into the temporal-spatial processing blocks; obtaining, from the temporal-spatial processing blocks, a latent vector reflecting the spatial information and a latent vector reflecting the temporal information; inputting the latent vector reflecting the spatial information and the latent vector reflecting the temporal information into a decoder; obtaining a final typography video for the input from the decoder; and training a typography generation model using the final typography video and the learning-target typography video.
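The claims above recite a training pipeline in which a learning-target typography video is encoded into a latent vector, processed by paired spatial and temporal blocks that receive static and dynamic captions (and, per claims 5 and 6, a word caption injected only into specified blocks), and then decoded into a final typography video. The patent does not disclose source code; the following is a minimal PyTorch-style sketch of one possible realization, in which caption injection is modeled as cross-attention. All module names, dimensions, the fusion between blocks, and the assumption that captions arrive as text-encoder embeddings are illustrative choices, not the claimed implementation.

```python
# Hypothetical sketch of the claimed pipeline (claims 1-7, 9, 10); not the patented implementation.
import torch
import torch.nn as nn


class CaptionInjection(nn.Module):
    """Injects a caption embedding into a latent token sequence via cross-attention (assumed mechanism)."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim=dim, num_heads=4, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, latent: torch.Tensor, caption: torch.Tensor) -> torch.Tensor:
        # latent: (B, N, D) latent tokens; caption: (B, L, D) caption embedding from a text encoder
        out, _ = self.attn(query=latent, key=caption, value=caption)
        return self.norm(latent + out)


class TemporalSpatialBlock(nn.Module):
    """One paired spatial/temporal processing block (claim 2): the spatial branch receives the
    static caption and the temporal branch receives the dynamic caption (claim 3)."""
    def __init__(self, dim: int):
        super().__init__()
        self.spatial_mix = nn.Linear(dim, dim)
        self.temporal_mix = nn.Linear(dim, dim)
        self.static_inject = CaptionInjection(dim)
        self.dynamic_inject = CaptionInjection(dim)

    def forward(self, latent, static_cap, dynamic_cap):
        spatial_latent = self.static_inject(torch.relu(self.spatial_mix(latent)), static_cap)
        temporal_latent = self.dynamic_inject(torch.relu(self.temporal_mix(latent)), dynamic_cap)
        return spatial_latent, temporal_latent


class KineticTypographyModel(nn.Module):
    def __init__(self, dim: int = 256, num_blocks: int = 4, word_blocks=(0,)):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)      # stands in for a video encoder producing latent tokens
        self.blocks = nn.ModuleList([TemporalSpatialBlock(dim) for _ in range(num_blocks)])
        self.word_inject = CaptionInjection(dim)  # word caption (claims 5-6)
        self.word_blocks = set(word_blocks)       # indices of blocks that receive the word caption
        self.decoder = nn.Linear(2 * dim, dim)    # combines spatial and temporal latents (claim 7)

    def forward(self, video_tokens, static_cap, dynamic_cap, word_cap):
        latent = self.encoder(video_tokens)
        for i, block in enumerate(self.blocks):
            if i in self.word_blocks:             # claim 6: inject only into the specified block(s)
                latent = self.word_inject(latent, word_cap)
            spatial_latent, temporal_latent = block(latent, static_cap, dynamic_cap)
            latent = 0.5 * (spatial_latent + temporal_latent)  # assumed fusion between blocks
        return self.decoder(torch.cat([spatial_latent, temporal_latent], dim=-1))
```

In this reading, a static caption might describe "bold yellow letters on a dark background" while a dynamic caption might describe "the letters slide in one by one from the left", matching the division between spatial and temporal descriptions set out in claim 3.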
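Claim 8 specifies only that a loss, computed with a preset loss function between the learning-target video and the decoder output, is reduced during training. A minimal sketch of such a training step follows; the choice of mean-squared error and the optimizer are assumptions for illustration, since the claim does not fix either.

```python
# Hypothetical training step for claim 8; the loss function and optimizer are assumptions.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, video_tokens, static_cap, dynamic_cap, word_cap):
    """One update: reconstruct the learning-target typography video and reduce the loss."""
    optimizer.zero_grad()
    prediction = model(video_tokens, static_cap, dynamic_cap, word_cap)  # final typography video (latent form)
    loss = F.mse_loss(prediction, video_tokens)                          # loss against the learning target
    loss.backward()
    optimizer.step()
    return loss.item()
```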
Description
Kinetic Typography Video Generation System and Control Method Thereof, and Learning Method of Kinetic Typography Video Generation System

The present invention relates to a kinetic typography video generation system, a control method thereof, and a learning method for a kinetic typography video generation system. More specifically, the present invention relates to a learning method for the kinetic typography video generation model of such a system.

Kinetic typography is a dynamic motion graphic design technique that conveys information through the visual effect of text moving within a video. Its primary goal is to make a message visually stand out and improve its memorability; examples include the way text appears or moves on screen in advertisements and movie openings. Kinetic typography of this kind adjusts the shape (glyph), color, and texture of letters and transforms their positions over time.

Traditionally, designers created kinetic typography videos by manually setting the shape, color, and position of text and applying animation effects using commercial software (e.g., Adobe After Effects). However, this process is time-consuming and difficult because of the complexity involved.

Meanwhile, with the recent advancement of artificial intelligence, active research is being conducted to automate the generation of kinetic typography videos using AI-based generative models. However, conventional studies have focused on applying styles to a single character or on generating multiple characters simultaneously, and have concentrated only on animation effects without actual character movement. These methods make it difficult to move or edit characters individually. In other words, conventional generative models are not specialized for kinetic typography because they lack an understanding of character shapes and movements during video generation; moreover, since all characters are generated together, the ability to edit each character individually or apply animation effects to it is limited.

Accordingly, the present invention proposes a method for automatically generating visually attractive and easy-to-read kinetic typography videos by handling various elements such as the shape, color, size, position, and movement of characters.

FIG. 1 is a conceptual diagram illustrating a kinetic typography video generation system according to the present invention. FIG. 2 is a conceptual diagram illustrating a kinetic typography learning dataset according to the present invention. FIG. 3 is a flowchart illustrating a learning method of a kinetic typography video generation system according to the present invention. FIGS. 4, 5, and 6 are conceptual diagrams for explaining a learning method of a kinetic typography video generation system according to the present invention.

Hereinafter, embodiments disclosed in this specification will be described in detail with reference to the attached drawings. Identical or similar components are assigned the same reference numerals regardless of the figure numbers, and redundant descriptions thereof are omitted. The suffixes "module" and "part" used for components in the following description are assigned or used interchangeably solely for ease of drafting the specification and do not in themselves have distinct meanings or roles.
Furthermore, in describing the embodiments disclosed in this specification, if it is determined that a detailed description of related prior art could obscure the essence of the embodiments, that detailed description is omitted. The attached drawings are intended only to facilitate understanding of the embodiments disclosed in this specification; the technical concept disclosed herein is not limited by the attached drawings and should be understood to include all modifications, equivalents, and substitutions that fall within the spirit and technical scope of the present invention. Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by those terms; the terms are used solely to distinguish one component from another. When one component is said to be "connected" or "coupled" to another component, it may be directly connected or coupled to that other component, but other components may also exist in between. In contrast, when one component is said to be "directly connected" or "directly coupled" to another component, it should be understood that no other components exist in between. A singular expression includes a plural expression unless the context clearly indicates otherwise. In this application, terms such as "comprising" or "having" are intended to specify the presence of the features, numbers, steps, operations, components, parts, or combinations thereof described in the specification, and