KR-102961820-B1 - METHOD AND SYSTEM FOR LEARNING MODULAR MODEL FOR COLORING LINE IMAGE USING REFERENCE IMAGE IN DIFFUSION-BASED IMAGE GENERATION

KR 102961820 B1

Abstract

A method and system for training a modular model for coloring line art images using a reference image in diffusion-based image generation are disclosed. A model training method according to one embodiment comprises the steps of: receiving a pre-trained diffusion-based image generation model for coloring a line art image; generating a tag with an added trigger word by adding a trigger word representing the shape of an image to a tag extracted from the image, and further training the pre-trained diffusion-based image generation model using the generated tag; extracting a first weight of a text encoder of the further trained diffusion-based image generation model; transforming a first latent space of the pre-trained diffusion-based image generation model into a second latent space; generating a student model that performs coloring on the second latent space using a teacher model that performs coloring on the first latent space; extracting a second weight of a U-Net included in the student model; and updating the text encoder and the U-Net included in the pre-trained diffusion-based image generation model by combining the first weight with a third weight of the text encoder included in the pre-trained diffusion-based image generation model and combining the second weight with a fourth weight of the U-Net included in the pre-trained diffusion-based image generation model.
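The final step of the abstract amounts to extracting fine-tuned weights for the text encoder and the U-Net and merging each back into the corresponding base-model weight. The following is a minimal sketch of that merge step in Python/NumPy; the function name `merge_weights`, the `scale` blend factor, and the toy matrix shapes are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def merge_weights(base_weight, extracted_weight, scale=1.0):
    """Combine an extracted fine-tuned weight with the corresponding
    base-model weight (scale is a hypothetical blend factor)."""
    return base_weight + scale * extracted_weight

# First weight (text-encoder delta) combined with third weight (base text encoder).
third_weight = np.zeros((4, 4))          # stand-in for a base text-encoder matrix
first_weight = 0.1 * np.ones((4, 4))     # stand-in for the extracted delta
updated_text_encoder = merge_weights(third_weight, first_weight)

# Second weight (student U-Net delta) combined with fourth weight (base U-Net).
fourth_weight = np.eye(4)
second_weight = 0.05 * np.ones((4, 4))
updated_unet = merge_weights(fourth_weight, second_weight)

print(updated_text_encoder[0, 0])  # 0.1
print(updated_unet[0, 0])          # 1.05
```

In practice each "weight" here would be a full set of model parameters rather than a single matrix, but the combination per tensor follows the same additive pattern.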

Inventors

  • 김나영

Assignees

  • 네이버웹툰 유한회사

Dates

Publication Date
2026-05-07
Application Date
2024-11-27

Claims (16)

  1. A model training method of a computer device comprising at least one processor, the method comprising the steps of: receiving, by the at least one processor, a pre-trained diffusion-based image generation model for coloring a line art image; generating, by the at least one processor, a tag with an added trigger word by adding a trigger word representing the shape of an image to a tag extracted from the image, and further training the pre-trained diffusion-based image generation model using the generated tag; extracting, by the at least one processor, a first weight of a text encoder of the further trained diffusion-based image generation model; transforming, by the at least one processor, a first latent space of the pre-trained diffusion-based image generation model into a second latent space; generating, by the at least one processor, a student model that performs coloring on the second latent space using a teacher model that performs coloring on the first latent space; extracting, by the at least one processor, a second weight of a U-Net included in the student model; and updating, by the at least one processor, the text encoder and the U-Net included in the pre-trained diffusion-based image generation model by combining the first weight with a third weight of the text encoder included in the pre-trained diffusion-based image generation model and combining the second weight with a fourth weight of the U-Net included in the pre-trained diffusion-based image generation model.
  2. The model training method of claim 1, wherein the image includes a line art image including a line art layer, a reference image including a line art layer and a color layer, and a color image including a color layer, and wherein the step of further training the pre-trained diffusion-based image generation model comprises: adding a first trigger word indicating that the line art layer is included to a tag extracted from the line art image, adding a second trigger word indicating that the line art layer and the color layer are included to a tag extracted from the reference image, and adding a third trigger word indicating that the color layer is included to a tag extracted from the color image; and further training the pre-trained diffusion-based image generation model using the images and the tags to which the trigger words have been added.
  3. The model training method of claim 1, wherein the transforming step comprises: calculating a weighted sum of a first matrix generated using an alpha mask of an input image and a second matrix obtained by inverting the first matrix; and transforming the first latent space of the pre-trained diffusion-based image generation model into the second latent space based on the calculated weighted sum.
  4. The model training method of claim 3, wherein the step of calculating the weighted sum comprises setting alpha and beta values so as to reduce noise applied to a line art layer of the input image, and assigning the alpha value to the first matrix and the beta value to the second matrix as weights, respectively.
  5. The model training method of claim 4, wherein the alpha value is set to a random value between 0.1 and 0.9, and the beta value is set to 1.
  6. The model training method of claim 3, wherein the step of calculating the weighted sum comprises dilating a boundary of the alpha mask of the input image and calculating the weighted sum using the first matrix generated using the alpha mask with the dilated boundary.
  7. A line art coloring method of a computer device comprising at least one processor, the method comprising the step of coloring, by the at least one processor, an input line art image using a diffusion-based image generation model in which a text encoder and a U-Net have been updated, wherein the text encoder of the diffusion-based image generation model is updated by generating a tag with an added trigger word by adding a trigger word representing the shape of an image to a tag extracted from the image, further training a pre-trained diffusion-based image generation model using the generated tag, extracting a first weight of a text encoder of the further trained diffusion-based image generation model, and combining the extracted first weight with a second weight of the text encoder included in the pre-trained diffusion-based image generation model, and wherein the U-Net of the diffusion-based image generation model is updated by transforming a first latent space of the pre-trained diffusion-based image generation model into a second latent space, generating a student model that performs coloring on the second latent space using a teacher model that performs coloring on the first latent space, extracting a third weight of a U-Net included in the student model, and combining the third weight with a fourth weight of the U-Net included in the pre-trained diffusion-based image generation model.
  8. The line art coloring method of claim 7, wherein the step of coloring the line art image comprises adding a trigger word indicating that a line art layer and a color layer are included to a positive text prompt among text prompts input to the diffusion-based image generation model.
  9. The line art coloring method of claim 7, wherein the pre-trained diffusion-based image generation model is further trained using images and tags to which trigger words have been added, wherein the images include a line art image including a line art layer, a reference image including a line art layer and a color layer, and a color image including a color layer, and wherein a first trigger word indicating that the line art layer is included is added to a tag extracted from the line art image, a second trigger word indicating that the line art layer and the color layer are included is added to a tag extracted from the reference image, and a third trigger word indicating that the color layer is included is added to a tag extracted from the color image.
  10. The line art coloring method of claim 7, wherein the first latent space is transformed into the second latent space based on a weighted sum of a first matrix generated using an alpha mask of an input image and a second matrix obtained by inverting the first matrix.
  11. The line art coloring method of claim 10, wherein the weighted sum is calculated by assigning, as weights, an alpha value to the first matrix and a beta value to the second matrix, respectively, the alpha and beta values being set so as to reduce noise applied to a line art layer of the input image.
  12. The line art coloring method of claim 10, wherein the weighted sum is calculated using the first matrix generated using an alpha mask with a dilated boundary.
  13. A computer program stored on a computer-readable recording medium, the computer program being combined with a computer device to execute the method of any one of claims 1 to 12 on the computer device.
  14. A computer device comprising at least one processor implemented to execute computer-readable instructions, wherein the at least one processor is configured to: receive a pre-trained diffusion-based image generation model for coloring a line art image; generate a tag with an added trigger word by adding a trigger word representing the shape of an image to a tag extracted from the image, and further train the pre-trained diffusion-based image generation model using the generated tag; extract a first weight of a text encoder of the further trained diffusion-based image generation model; transform a first latent space of the pre-trained diffusion-based image generation model into a second latent space; generate a student model that performs coloring on the second latent space using a teacher model that performs coloring on the first latent space; extract a second weight of a U-Net included in the student model; and update the text encoder and the U-Net included in the pre-trained diffusion-based image generation model by combining the first weight with a third weight of the text encoder included in the pre-trained diffusion-based image generation model and combining the second weight with a fourth weight of the U-Net included in the pre-trained diffusion-based image generation model.
  15. The computer device of claim 14, wherein the image includes a line art image including a line art layer, a reference image including a line art layer and a color layer, and a color image including a color layer, and wherein, to further train the pre-trained diffusion-based image generation model, the at least one processor is configured to: add a first trigger word indicating that the line art layer is included to a tag extracted from the line art image, add a second trigger word indicating that the line art layer and the color layer are included to a tag extracted from the reference image, and add a third trigger word indicating that the color layer is included to a tag extracted from the color image; and further train the pre-trained diffusion-based image generation model using the images and the tags to which the trigger words have been added.
  16. The computer device of claim 14, wherein, to transform the first latent space into the second latent space, the at least one processor is configured to: calculate a weighted sum of a first matrix generated using an alpha mask of an input image and a second matrix obtained by inverting the first matrix; and transform the first latent space of the pre-trained diffusion-based image generation model into the second latent space based on the calculated weighted sum.
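Claims 3 to 6 describe constructing the latent-space transform from a weighted sum of a first matrix built from the (optionally boundary-dilated) alpha mask and a second matrix obtained by inverting it, with alpha drawn randomly from 0.1 to 0.9 and beta fixed to 1. The following Python/NumPy sketch illustrates that weighted sum under those stated constraints; the function names and the naive 4-neighbour dilation are illustrative stand-ins, not the patent's actual implementation.

```python
import numpy as np

def dilate(mask, iterations=1):
    """Naive binary dilation: a pixel becomes 1 if it or any 4-neighbour is 1."""
    out = mask.copy()
    for _ in range(iterations):
        padded = np.pad(out, 1)
        out = np.maximum.reduce([
            padded[1:-1, 1:-1],            # centre
            padded[:-2, 1:-1], padded[2:, 1:-1],   # up / down neighbours
            padded[1:-1, :-2], padded[1:-1, 2:],   # left / right neighbours
        ])
    return out

def weighted_mask_sum(alpha_mask, alpha=None, beta=1.0, dilate_iters=1, rng=None):
    """First matrix A from the dilated alpha mask (claim 6), second matrix
    B = 1 - A (claim 3), combined as alpha * A + beta * B (claims 4-5)."""
    rng = rng or np.random.default_rng(0)
    if alpha is None:
        alpha = rng.uniform(0.1, 0.9)  # random alpha in [0.1, 0.9] (claim 5)
    A = dilate(alpha_mask, dilate_iters)
    B = 1.0 - A
    return alpha * A + beta * B

mask = np.zeros((5, 5))
mask[2, 2] = 1.0                       # a single line pixel
W = weighted_mask_sum(mask, alpha=0.5)
print(W[2, 2], W[0, 0])                # 0.5 on the dilated line region, 1.0 elsewhere
```

The reduced weight on the line region is consistent with claim 4's goal of applying less noise to the line art layer than to the rest of the image.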

Description

Method and System for Training a Modular Model for Coloring Line Art Images Using a Reference Image in Diffusion-Based Image Generation

The following description relates to a method and system for training a modular model to color line art images using a reference image in diffusion-based image generation.

A growing number of generative-model-based technologies automatically color the remaining line art images of a work when a few colored reference line art images are provided. While the emergence of these technologies has led to significant improvements in coloring performance, they have the following limitations. (1) No matter how well a model is trained, performance degrades when the input data provided for coloring follows a different trend from the data used for training. (2) Typically, if the length of the video to be colored is N, reference images are provided at intervals of k, but no method for effectively using the provided data has been considered. (3) Coloring data has an alpha mask, so line and non-line regions are separated, but no method for exploiting this has been considered.

[Prior Art Document] Korean Registered Patent No. 10-2527900

FIG. 1 is a diagram illustrating an example of an overall view of a model training system according to one embodiment of the present invention. FIG. 2 is a diagram illustrating an example in which a matrix A generated using an alpha mask and a matrix B obtained by inverting matrix A are each visualized, according to an embodiment of the present invention. FIG. 3 is a diagram illustrating an example of expanding the boundary of an alpha mask using dilation, according to an embodiment of the present invention. FIG. 4 is a flowchart illustrating an example of a model training method according to an embodiment of the present invention. FIG. 5 is a flowchart illustrating an example of a line art coloring method according to an embodiment of the present invention. FIG. 6 is a block diagram illustrating an example of a computer device according to an embodiment of the present invention.

Hereinafter, embodiments will be described in detail with reference to the attached drawings.

A model training system according to embodiments of the present invention may be implemented by at least one computer device. In this case, a computer program according to one embodiment of the present invention may be installed and run on the at least one computer device, and the at least one computer device may perform a model training method according to embodiments of the present invention under the control of the running computer program. The computer program may be stored on a computer-readable recording medium to be combined with the at least one computer device and to execute the model training method on the computer device.

Low-Rank Adaptation (LoRA) can be integrated into the standard training process of Stable Diffusion by modifying only a small subspace of the weights of the original model during fine-tuning. Although the following describes embodiments using a Stable Diffusion model, the embodiments of the present invention can be applied to any image generated through a diffusion-based image generation model. Through LoRA, a weight update ΔW can be learned using the input latent noise and a timestep t, and can be applied to the original weights W of Stable Diffusion as shown in Equation 1 below.

[Equation 1] W′ = W + ΔW

When LoRA is applied, the predicted noise ε_θ(x_t, t) can be modified as shown in Equation 2 below.

[Equation 2] ε′_θ(x_t, t) = ε_{θ+Δθ}(x_t, t)

At this time, the loss function of LoRA can be expressed as Equation 3 below.

[Equation 3] L = E_{x_0, ε, t}[‖ε − ε_{θ+Δθ}(x_t, t)‖²]

Here, the parameters included in ΔW that are updated are the parameters of the U-Net layers and of the text encoder included in the Stable Diffusion model.

Meanwhile, for video coloring, if the total video length is M, reference images reflecting coloring on the line art images can be provided at intervals of N.
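In standard LoRA, the learned update ΔW is parameterized as a product of two low-rank factors, ΔW = B·A, which is then added to the frozen base weight as in Equation 1. The following Python/NumPy sketch illustrates that mechanism for a single weight matrix; the dimensions, rank, and initialization scale are illustrative assumptions rather than values taken from the patent.

```python
import numpy as np

d, k, r = 8, 8, 2                        # weight shape (d x k), low rank r << min(d, k)
rng = np.random.default_rng(0)

W = rng.standard_normal((d, k))          # frozen base weight of the original model
A = rng.standard_normal((r, k)) * 0.01   # trainable low-rank factor
B = np.zeros((d, r))                     # initialized to zero, so ΔW starts at 0

delta_W = B @ A                          # ΔW = B A, rank at most r
W_prime = W + delta_W                    # Equation 1: W' = W + ΔW

# Before any training step B is zero, so the adapted weight equals the base weight.
print(np.allclose(W_prime, W))  # True
```

Because only A and B are trained, the number of updated parameters is r·(d + k) per matrix instead of d·k, which is what makes the extracted text-encoder and U-Net weights in this document small enough to merge modularly.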
The images provided at this time are line art images and color images that can be expressed in RGBA (Red, Green, Blue, Alpha). In particular, the color images can be provided in a form in which the color layer added to the line art image can be separated. Since only the three RGB channels are input to the network, the transparency value alpha is not considered directly, and the RGB values of an image can therefore be converted as shown in Equation 4 below.

[Equation 4] C_out = α · C + (1 − α) · C_bg, for each channel C ∈ {R, G, B}

Here, α is the alpha channel value, which can have a value between 0 (completely transparent) and 1 (completely opaque). R_bg, G_bg, and B_bg represent the RGB values of the background color; for example, when the background is white, they can be set to 255, 255, and 255.

FIG. 1 is a diagram illustrating an example of an overall view of a model training system according to one embodiment of the present invention. In the embodiment of FIG. 1, a model training system (100), a line art coloring model (110), and a line art coloring system (120) are shown. The line art coloring model (110) may be a pre-trained Stable Diffusion model according to the prior art, which receives an input line art image