CN-121999090-A - Image coloring model construction method, image coloring method, device and medium

Abstract

The invention discloses an image coloring model construction method, an image coloring method, a device and a medium. The construction method comprises: acquiring a reference grayscale image, a target color image and a text description, and generating a corresponding category description; constructing a base image generation model; encoding the reference grayscale image data and the target color image data to construct grid latent features and a binary mask; extracting a color prior representation and injecting it, by means of the binary mask, into the reference area of the grid latent features; applying masked forward noising to the target coloring area, predicting the noise, and calculating a base denoising loss function; extracting, during denoising, the cross-attention maps of the color words and object words in the text description, and constraining their weight distribution through a color-object alignment loss function; and finally jointly optimizing the model parameters to obtain the image coloring model. The invention achieves accurate pixel-level region localization, automatically enriches and completes semantic color coverage across the whole image, and explicitly suppresses color-object misalignment during training and inference.

Inventors

BAO BINGKUN
YOU SISI
Niu Chaochao

Assignees

Nanjing University of Posts and Telecommunications (南京邮电大学)

Dates

Publication Date: 20260508
Application Date: 20260409

Claims (10)

1. An image coloring model construction method, characterized by comprising the following steps: S1, acquiring reference grayscale image data, corresponding target color image data and text description data, and generating a category description of the target color image data by using an existing image annotation model; S2, constructing a base image generation model based on a diffusion model, encoding the reference grayscale image data and the target color image data into latent features, combining them in a preset grid arrangement to construct grid latent features, and generating a binary mask corresponding to the grid latent features; S3, based on the text description data and the category description, extracting a semantically rich color prior representation by means of a color prior prediction module containing two categories of learnable query vectors, and injecting the semantically rich color prior representation into a reference area of the grid latent features by means of the binary mask, wherein the two categories of learnable query vectors comprise color-attribute object query vectors for aligning with the colors specified in the text and color-unspecified object query vectors for inferring unspecified colors from the category prior; S4, performing masked forward noising on a target coloring area of the grid latent features based on the binary mask, predicting the added noise by using the base image generation model, and calculating a base denoising loss function value; S5, during denoising training of the base image generation model, extracting the cross-attention maps, at the corresponding layers, of the color words in the text description data and their corresponding object words, and constraining the weight distribution of the cross-attention maps through a color-object alignment loss function so that it aligns with the ground-truth visual segmentation mask of the target; S6, jointly training and optimizing the parameters of the base image generation model based on the base denoising loss function value and the color-object alignment loss function, to obtain a text-guided grayscale image coloring model.
2. The image coloring model construction method according to claim 1, wherein step S3 specifically comprises: S31, encoding the text description data to obtain first word embedding features, and encoding the category description data to obtain second word embedding features; S32, constructing two feature interaction branches: in the first feature interaction branch, concatenating the first word embedding features with the color-attribute object query vectors, then computing a first interaction feature representation and extracting a first color prior feature; in the second feature interaction branch, concatenating the second word embedding features with the color-unspecified object query vectors, then computing a second interaction feature representation and extracting a second color prior feature; and concatenating the first color prior feature and the second color prior feature and generating the color prior representation through a mapping network; S33, multiplying the generated color prior representation by the binary mask, whose value is 1 in the reference area and 0 in the target coloring area, and then injecting it into the designated reference area of the grid latent features by feature addition in the latent space.
3. The image coloring model construction method according to claim 2, wherein step S4 specifically comprises: S41, keeping the feature values of the designated reference area in the grid latent features unchanged; S42, based on a randomly sampled diffusion time step, adding Gaussian noise to the entire spatial extent of the grid latent features to generate intermediate latent features, then extracting the noised target coloring area features by using the binary mask and spatially merging them with the un-noised reference area features, thereby constructing locally noised grid features; S43, predicting the added noise through the base image generation model based on the locally noised grid features, the injected color prior representation and the corresponding text condition description; S44, spatially restricting the area over which the loss is calculated by using the binary mask, calculating the mean square error between the predicted noise and the actually added Gaussian noise within the target coloring area, and taking the mean square error as the base denoising loss function value.
4. The image coloring model construction method according to claim 3, wherein step S5 specifically comprises: S51, extracting the target color words and the corresponding target object words from the text description data, and constructing a set of color-object word pairs with explicit semantic correspondence; S52, obtaining, by using a set of visual segmentation models, a ground-truth binary segmentation map of each target object of the color-object word pair set in the corresponding image; S53, applying interpolation downsampling and binarization to the ground-truth binary segmentation map according to the spatial resolutions of the different cross-attention layers in the base image generation model, to obtain target mask areas matching the cross-attention map dimensions of each corresponding layer; S54, extracting the cross-attention maps of the target color words and the target object words at the corresponding cross-attention layers, and calculating the spatial integral of the attention weights of each cross-attention map within the target mask area and its proportion of the total attention weight over the whole map; S55, calculating a color-object alignment loss term based on the attention weight proportion, wherein the alignment loss term penalizes attention weight falling outside the target mask area.
5. The image coloring model construction method according to claim 4, wherein the alignment loss term is calculated by the following formula: [formula not reproduced in the source text], wherein the symbols denote, respectively: the alignment loss term; the total number of color-object word pairs; the target mask area, at the current corresponding layer, of the object of a given word pair; the scalar attention weight, at a given spatial position, of the cross-attention map of a given word; and the total number of spatial positions of the latent feature. The total loss function is then calculated as: [formula not reproduced in the source text], wherein the symbols denote, respectively: the total loss function; the base denoising loss of the diffusion model; the alignment loss term calculated at a given cross-attention layer; the balancing hyperparameter; and the number of network layers to which the color-object alignment loss is applied.
6. The image coloring model construction method according to claim 1, further comprising an evaluation operation for verifying the image coloring model, specifically comprising: acquiring an extended image dataset containing complex semantic scenes and a multi-instance image dataset, wherein each image in the image datasets is annotated with a corresponding text description; setting perceptual fidelity, color richness and vision-text consistency as the evaluation dimensions of the image coloring task for the text-guided grayscale image coloring model; and constructing, based on the image datasets, a test prompt set containing different color-object binding relations and different color variants of the same object; wherein the test prompt set is used for staged evaluation of the color-object alignment effect during model training, and for comprehensive evaluation of the model's precise local color control capability and semantic decoupling capability after model training is completed.
7. An image coloring method, characterized by being applied to an image coloring model constructed by the method of any one of claims 1 to 6, comprising: step L1, receiving a target grayscale image and corresponding target text description data, wherein the target text description data comprises a color attribute description of a specific object in the target grayscale image; step L2, generating a target category description for the target grayscale image by using an existing image annotation model, encoding the target grayscale image into latent features, replicating them in the preset grid arrangement to construct initial grid latent features, setting the target coloring area of the initial grid latent features to random noise and the reference area to said latent features, and generating a corresponding binary target mask; step L3, inputting the target text description data and the target category description into the color prior prediction module, extracting a target color prior representation, and injecting the target color prior representation into the reference area of the initial grid latent features; step L4, inputting the initial grid latent features injected with the target color prior representation, together with the target text description data, into the trained base image generation model, and performing iterative masked denoising sampling on the target coloring area based on the binary target mask, wherein the features of the reference area remain unchanged during sampling; and step L5, decoding the denoised grid latent features, cropping out the target coloring area according to the binary target mask, and finally generating a target color image that is semantically consistent with the target text description data and accurately localized.
8. The image coloring method according to claim 7, wherein, in step L4, during the iterative masked denoising sampling of the target coloring area based on the binary target mask, the latent features are updated at each sampling time step as follows: [formula not reproduced in the source text], wherein the symbols denote, respectively: the grid latent features after the update at the given step; the target latent features generated by the diffusion model's denoising at the current step; the noise-free initial reference grid latent features obtained by encoding the target grayscale image; the injected target color prior representation; the binary target mask, whose value is 1 in the reference area and 0 in the target coloring area; and the Hadamard product.
9. A computer-readable storage medium, characterized in that a computer program is stored, which, when being executed by a processor, causes the processor to perform the steps of the image coloring method according to any one of claims 7 to 8.
10. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the image coloring method according to any one of claims 7 to 8.
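Step S2 of claim 1 combines the two latents in a "preset grid arrangement" and builds a matching binary mask. The sketch below is a minimal illustration, assuming a 1x2 side-by-side grid with the reference latent on the left and channel-first arrays; the layout, the function name `build_grid_latent`, and the array shapes are assumptions, not details given in the claims. The mask convention (1 in the reference area, 0 in the target coloring area) follows step S33 of claim 2.

```python
import numpy as np

def build_grid_latent(ref_latent, tgt_latent):
    """Concatenate reference and target latents side by side (assumed 1x2 grid).

    Both inputs are (C, H, W) arrays. Returns the (C, H, 2W) grid latent and a
    (1, H, 2W) binary mask: 1 over the reference area, 0 over the target area.
    """
    if ref_latent.shape != tgt_latent.shape:
        raise ValueError("reference and target latents must share a shape")
    width = ref_latent.shape[-1]
    grid = np.concatenate([ref_latent, tgt_latent], axis=-1)
    mask = np.zeros((1,) + grid.shape[1:], dtype=grid.dtype)
    mask[..., :width] = 1.0  # left half = reference (grayscale) area
    return grid, mask

ref = np.ones((4, 8, 8))   # stand-in for the encoded grayscale image
tgt = np.zeros((4, 8, 8))  # stand-in for the encoded color image
grid, mask = build_grid_latent(ref, tgt)
print(grid.shape, mask.shape)  # (4, 8, 16) (1, 8, 16)
```

Keeping the mask at one channel lets it broadcast over all latent channels in the later masking steps.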
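Steps S32 and S33 of claim 2 can be illustrated with a small NumPy sketch. The claims do not specify how the "interaction feature representation" or the "mapping network" is computed, so everything here is an assumption made for illustration: one unparameterized self-attention pass stands in for the interaction, and a mean-pool followed by a single matrix `w_map` stands in for the mapping network. Only the masked additive injection at the end mirrors S33 directly.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def interaction_branch(word_emb, queries):
    """Concatenate word embeddings with query vectors, mix them with one
    self-attention pass (assumed), and return the query positions."""
    tokens = np.concatenate([word_emb, queries], axis=0)          # (T+Q, D)
    attn = softmax(tokens @ tokens.T / np.sqrt(tokens.shape[1]))  # (T+Q, T+Q)
    mixed = attn @ tokens
    return mixed[-queries.shape[0]:]                              # (Q, D)

def color_prior(text_emb, class_emb, q_color, q_unspec, w_map):
    p1 = interaction_branch(text_emb, q_color)    # first color prior feature
    p2 = interaction_branch(class_emb, q_unspec)  # second color prior feature
    fused = np.concatenate([p1, p2], axis=0).mean(axis=0)  # (D,)
    return fused @ w_map  # (C,), stand-in for the mapping network

def inject_prior(grid_latent, prior_vec, mask):
    """S33: multiply the prior by the mask (1 = reference) and add it in latent space."""
    return grid_latent + prior_vec[:, None, None] * mask

rng = np.random.default_rng(0)
prior = color_prior(rng.standard_normal((5, 8)), rng.standard_normal((3, 8)),
                    rng.standard_normal((2, 8)), rng.standard_normal((2, 8)),
                    rng.standard_normal((8, 4)))
mask = np.zeros((1, 2, 4))
mask[..., :2] = 1.0
out = inject_prior(np.zeros((4, 2, 4)), prior, mask)
```

Because the prior is multiplied by the mask before the addition, the target coloring area of the grid latent is left untouched by the injection.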
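The masked forward noising and masked loss of claim 3 (S41 to S44) can be sketched as follows. The claims only say that Gaussian noise is added at a randomly sampled diffusion time step; the DDPM-style mixing rule with a cumulative coefficient `alpha_bar_t` used here is an assumption, as are the function names.

```python
import numpy as np

def masked_forward_noise(grid, mask, alpha_bar_t, rng):
    """Noise the whole grid, then keep the clean reference area (mask = 1)."""
    eps = rng.standard_normal(grid.shape)
    # S42: noise the entire grid latent (assumed DDPM-style forward process)
    noisy = np.sqrt(alpha_bar_t) * grid + np.sqrt(1.0 - alpha_bar_t) * eps
    # S41/S42: clean reference area merged with the noised target coloring area
    local = mask * grid + (1.0 - mask) * noisy
    return local, eps

def masked_mse(pred_eps, eps, mask):
    """S44: mean squared error restricted to the target coloring area (mask = 0)."""
    target = np.broadcast_to(1.0 - mask, eps.shape)
    return float((target * (pred_eps - eps) ** 2).sum() / target.sum())

rng = np.random.default_rng(0)
grid = np.ones((4, 4, 8))
mask = np.zeros((1, 4, 8))
mask[..., :4] = 1.0
local, eps = masked_forward_noise(grid, mask, alpha_bar_t=0.5, rng=rng)
loss = masked_mse(eps + 1.0, eps, mask)  # a predictor that is off by 1 everywhere
print(loss)  # 1.0
```

Errors made inside the reference area do not contribute to the loss, which is the effect of the spatial restriction described in S44.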
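Steps S53 to S55 of claim 4 describe resizing the segmentation mask to each attention resolution, measuring the share of attention that falls inside the mask, and penalizing the rest. The sketch below assumes block averaging as the "interpolation downsampling", a 0.5 binarization threshold, and a penalty of one minus the inside ratio averaged over maps; none of these specifics are stated in the claims.

```python
import numpy as np

def downsample_mask(seg, out_h, out_w, thresh=0.5):
    """S53: block-average a binary (H, W) map down to (out_h, out_w), then binarize.
    Assumes H and W are multiples of out_h and out_w."""
    h, w = seg.shape
    fh, fw = h // out_h, w // out_w
    pooled = seg[: out_h * fh, : out_w * fw].reshape(out_h, fh, out_w, fw).mean(axis=(1, 3))
    return (pooled >= thresh).astype(np.float64)

def inside_ratio(attn, mask, eps=1e-8):
    """S54: attention weight inside the mask as a share of the whole map."""
    return float((attn * mask).sum() / (attn.sum() + eps))

def alignment_loss(attn_maps, masks):
    """S55 (assumed form): average of 1 - inside ratio over all word maps."""
    return float(np.mean([1.0 - inside_ratio(a, m) for a, m in zip(attn_maps, masks)]))

seg = np.zeros((8, 8))
seg[:4, :4] = 1.0                    # object occupies the top-left quadrant
m = downsample_mask(seg, 4, 4)       # mask at a 4x4 attention resolution
uniform = np.full((4, 4), 1.0 / 16)  # attention spread evenly over the map
loss = alignment_loss([uniform], [m])
```

In this example a quarter of the uniform attention falls inside the mask, so the loss is close to 0.75; a map concentrated entirely inside the mask would give a loss close to 0. For a color-object pair, the color word's map and the object word's map would both be scored against the same object mask.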
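The formulas of claim 5 do not survive in this text; only their symbol definitions do. One form consistent with those definitions and with step S55 of claim 4 (penalizing attention weight outside the target mask area) is sketched below. The symbol names, the "one minus inside ratio" shape of the penalty, and the plain sum over layers are all assumptions, so this should be read as a plausible reconstruction rather than the claimed formula.

```latex
% Assumed notation:
%   N        total number of color-object word pairs
%   M_i      target mask area of the i-th object at the current layer
%   A_i(p)   scalar attention weight of the i-th word's cross-attention map at position p
%   P        total number of spatial positions of the latent feature
\mathcal{L}_{\mathrm{align}}
  = \frac{1}{N} \sum_{i=1}^{N}
    \left( 1 - \frac{\sum_{p \in M_i} A_i(p)}{\sum_{p=1}^{P} A_i(p)} \right)

% Assumed notation:
%   \mathcal{L}_{\mathrm{denoise}}     base denoising loss of the diffusion model
%   \mathcal{L}_{\mathrm{align}}^{(l)} alignment loss term at the l-th cross-attention layer
%   \lambda                            balancing hyperparameter
%   L                                  number of layers to which the alignment loss is applied
\mathcal{L}_{\mathrm{total}}
  = \mathcal{L}_{\mathrm{denoise}}
  + \lambda \sum_{l=1}^{L} \mathcal{L}_{\mathrm{align}}^{(l)}
```

Under this form the alignment term is 0 when all attention weight lies inside the mask and approaches 1 when none does.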
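Claim 6 calls for a test prompt set covering different color-object binding relations and different color variants of the same object. A minimal way to enumerate such prompts is sketched below; the prompt template ("a red car and a blue bus") and both function names are hypothetical, since the claim does not give a prompt format.

```python
from itertools import permutations

def color_variants(obj, colors):
    """Different color variants of the same object."""
    return [f"a {c} {obj}" for c in colors]

def binding_swaps(objects, colors):
    """Every assignment of distinct colors to the listed objects, so that the
    same words appear with different color-object bindings."""
    prompts = []
    for assigned in permutations(colors, len(objects)):
        parts = [f"a {c} {o}" for c, o in zip(assigned, objects)]
        prompts.append(" and ".join(parts))
    return prompts

print(color_variants("car", ["red", "blue"]))
# ['a red car', 'a blue car']
print(binding_swaps(["car", "bus"], ["red", "blue"]))
# ['a red car and a blue bus', 'a blue car and a red bus']
```

Swapped bindings use identical vocabulary, so a model that colors both prompts the same way is leaking color across objects rather than following the binding.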
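The masked sampling loop of claims 7 and 8 can be sketched as follows. The update formula of claim 8 is not reproduced in this text, so the rule used here, which blends the denoised latent in the target area with the reference latent plus the injected prior in the reference area via the Hadamard product, is an assumed reconstruction from the symbol definitions. `denoise_step` is a hypothetical callable standing in for one reverse step of the trained model, and cropping is done in latent space for brevity, whereas step L5 decodes before cropping.

```python
import numpy as np

def masked_update(z_denoised, z_ref, prior, mask):
    """Assumed claim 8 update: mask = 1 keeps the reference (plus prior),
    mask = 0 takes the model's denoised target latent."""
    return (1.0 - mask) * z_denoised + mask * (z_ref + prior)

def masked_sampling(denoise_step, z_ref, prior, mask, num_steps, rng):
    # Step L2: target coloring area starts as random noise, reference area as the latent.
    z = masked_update(rng.standard_normal(z_ref.shape), z_ref, prior, mask)
    for t in reversed(range(num_steps)):
        z_hat = denoise_step(z, t)                      # hypothetical reverse step
        z = masked_update(z_hat, z_ref, prior, mask)    # reference area stays fixed
    return z

def crop_target(z, mask):
    """Keep only the columns of a 1x2 grid where the mask is 0 (target area)."""
    return z[..., mask[0, 0, :] == 0]

rng = np.random.default_rng(0)
z_ref = np.ones((2, 2, 4))
prior = np.full((2, 2, 4), 0.1)
mask = np.zeros((1, 2, 4))
mask[..., :2] = 1.0
out = masked_sampling(lambda z, t: 0.5 * z, z_ref, prior, mask, num_steps=3, rng=rng)
colored = crop_target(out, mask)
```

Re-imposing the reference area after every step is what keeps the reference features unchanged throughout sampling, as step L4 requires.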

Description

Image coloring model construction method, image coloring method, device and medium

Technical Field

The invention relates to the technical field of computer vision data processing, and in particular to an image coloring model construction method, an image coloring method, a device and a storage medium that use diffusion-model-based image generation and editing with text semantic guidance and attention alignment.

Background

With the rapid development of deep learning and multimodal generation technologies, text-guided image coloring has been widely applied in computer vision. The technique converts grayscale images into color images according to text descriptions entered by the user, and is of great value in fields such as restoration of old black-and-white photographs, digital art creation and advertising. Compared with traditional automatic coloring, introducing a text description lets users control the coloring process more intuitively and flexibly, making customized image editing more convenient and user-friendly.
However, current techniques still face several bottlenecks in generating high-fidelity, semantically consistent color images.
First, accurate pixel-level region localization is difficult to achieve. Existing methods generally rely on additional network branches to encode the grayscale image condition. This implicit feature fusion is prone to spatial misalignment when handling complex structures, which leads to inaccurate localization of the target region, and training the additional control network incurs significant computational cost.
Second, coloring is incomplete because semantic coverage is very limited. The text description provided by a user is often incomplete and usually mentions only some of the salient objects in an image. Existing models can only assign colors to explicitly mentioned objects and lack color inference and semantic guidance for unmentioned areas, which easily leaves the background or secondary objects without color.
Third, color overflow and misalignment arise from the lack of explicit semantic alignment supervision. Existing methods based on pre-trained diffusion models mainly optimize the base noise prediction objective and do not constrain the spatial correspondence between color descriptors and specific object areas. In complex multi-object scenes, the model's cross-attention diverges very easily, causing serious distortions such as color misattribution (a color meant for one object being applied to another) or color overflow.
Therefore, the application provides an image coloring model construction method, an image coloring method, a device and a medium that achieve accurate pixel-level region localization, automatically enrich and complete semantic color coverage across the whole image, and explicitly suppress color-object misalignment during training and inference, so as to solve the above technical problems.

Disclosure of Invention

The main purpose of the invention is to provide an image coloring model construction method, an image coloring method, a device and a medium, so as to solve the technical problems identified in the background for text-guided grayscale image coloring, namely inaccurate region localization, missing semantic coverage, and color overflow and semantic misalignment in multi-object scenes, and thereby to achieve accurate pixel-level localization and complete color inference of the target region under text guidance and to markedly improve the color fidelity and text-semantic consistency of the generated color image.
The invention adopts the following technical scheme to solve the technical problems: An image coloring model construction method, executed by a computer device, comprises the following steps:
S1, acquiring reference grayscale image data, corresponding target color image data and text description data, and generating a category description of the target color image data by using an existing image annotation model, wherein the category description is used, in combination with the text description data, for extracting target semantic color prior features;
S2, constructing a base image generation model based on a diffusion model, encoding the reference grayscale image data and the target color image data into latent features, combining them in a preset grid arrangement to construct grid latent features, and generating a binary mask corresponding to the grid latent features, wherein the binary mask is used for spatially distinguishing the reference area, which is kept unchanged, from the target coloring area to be denoised;
S3, extracting a semantically rich color prior representation based on the text description data and the category description, using a color prior prediction module with learnable query vectors, and injecting the color prior representation