CN-121982145-A - Multi-scale controllable image generation method and system based on latent space diffusion model
Abstract
The invention relates to a multi-scale controllable image generation method and system based on a latent space diffusion model. The method comprises: acquiring an input image, a content condition image and a style condition image corresponding to a target image processing task, and preprocessing them; mapping the input image to a latent space and executing a diffusion modeling process there, with a denoising network gradually restoring the latent space features during the reverse diffusion process; inputting the content condition image and the style condition image into a condition injector, fusing content features and style features at multiple feature scales, and injecting the constructed multi-scale condition feature representation into the denoising network to guide the reverse diffusion process in the latent space; and finally performing condition-controlled decoding based on the latent space denoised features and the multi-scale condition features to generate the target image. Compared with the prior art, the invention improves the controllability and accuracy of image generation while enhancing adaptability to different task scenarios and the quality stability of the generated images.
Inventors
- Yue Menghan
- Wu Hao
- Yuan Jingrong
Assignees
- Beijing Normal University (北京师范大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-01-12
Claims (10)
- 1. A multi-scale controllable image generation method based on a latent space diffusion model, characterized by comprising the following processing stages: obtaining an input image, a content condition image and a style condition image corresponding to a target image processing task, and performing the necessary preprocessing operations on the images to obtain multi-source condition images suitable for subsequent feature modeling; mapping the input image to a latent space and executing a diffusion modeling process in the latent space, wherein a denoising network gradually restores the latent space features during the reverse diffusion process; meanwhile, inputting the content condition image and the style condition image into a condition injector, fusing the content features and style features at multiple feature scales, and injecting the constructed multi-scale condition feature representation into the denoising network to guide the reverse diffusion process in the latent space; and performing condition-controlled decoding on the latent space features based on the latent space features obtained in the latent space denoising process and the multi-scale condition feature representation, to generate a target image consistent with the input image in spatial resolution.
- 2. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 1, wherein when the target image processing task is an image harmonization task, the input image is a composite image, the content condition image is the foreground region of the composite image, the style condition image is the complete background image after image restoration processing, and the image restoration processing is used to supplement the background information occluded by the foreground in the composite image.
- 3. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 1, wherein the preprocessing comprises removing salt-and-pepper noise and Gaussian noise from the input image by Gaussian filtering and enhancing image details by adaptive histogram equalization, and the resolution unification processing comprises adjusting the content condition image and the style condition image to the same resolution as the input image by bilinear interpolation.
- 4. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 1, wherein the step-wise noise adding operation satisfies the following expression: $z_t = \sqrt{\alpha_t}\, z_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t$, where $z_t$ is the feature after the $t$-th noising step, $z_{t-1}$ is the feature after the $(t-1)$-th noising step, $\alpha_t$ is the parameter controlling the noise strength, and $\epsilon_t$ is the Gaussian noise injected at step $t$.
- 5. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 1, wherein the condition injector is a multi-scale content-style condition injector (MCSI), which performs normalized fusion of content features and style features at different scales, the fusion process satisfying the following expression: $F_i = \sigma(S_i)\cdot\dfrac{C_i-\mu(C_i)}{\sigma(C_i)} + \mu(S_i)$, where $F_i$ is the fusion feature of the $i$-th layer, $S_i$ is the style feature of the $i$-th layer, $C_i$ is the content feature of the $i$-th layer, $\sigma(\cdot)$ denotes the standard deviation function, and $\mu(\cdot)$ denotes the mean function.
- 6. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 1, wherein the latent space mapping module is a VAE encoder and the decoder is a VAE decoder, the input feature of each decoder layer satisfying the following relationship: $D_i^{\mathrm{in}} = \mathcal{Z}(F_i) + D_{i-1}^{\mathrm{out}}$, with $D_0^{\mathrm{in}} = \mathcal{Z}(F_0) + z_0$, where $D_i^{\mathrm{in}}$ is the input feature of the $i$-th decoder layer, $F_i$ is the fusion feature of the $i$-th layer, $D_{i-1}^{\mathrm{out}}$ is the output feature of the $(i-1)$-th decoder layer, $\mathcal{Z}(\cdot)$ denotes a zero-initialized convolution operation, $D_0^{\mathrm{in}}$ is the latent-layer input feature of the decoder, $F_0$ is the latent-layer fusion feature of the MCSI, and $z_0$ is the initial latent space feature.
- 7. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 5 or 6, wherein the MCSI and the VAE encoder share the same number of layers and the same initial weight configuration, the feature fusion process covers both shallow texture features and deep semantic features, the shallow features are used to retain image detail and texture information, the deep features are used to guarantee image semantic consistency, and skip connections are used to accurately align and superpose the multi-scale fusion features onto the corresponding decoder layers.
- 8. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 6, wherein the denoising network is a pre-trained denoising U-Net network, the optimization objective of which satisfies the following expression: $L = \mathbb{E}_{z_0,\, c,\, \epsilon \sim \mathcal{N}(0,1),\, t}\!\left[\left\lVert \epsilon - \epsilon_\theta(z_t, t, c) \right\rVert_2^2\right]$, where $z_0$ is the initial latent space feature output by the VAE encoder, $c$ is the fused information of the content condition and the style condition, $\epsilon$ is the Gaussian noise, $\epsilon_\theta$ is the pre-trained denoising U-Net network, $z_t$ is the latent space feature after $t$ noising steps, $t$ is the number of noising steps, $\mathbb{E}$ denotes the expectation, and $\lVert\cdot\rVert_2^2$ is the squared L2 norm.
- 9. The method for generating a multi-scale controllable image based on a latent space diffusion model according to claim 1, wherein the target image processing task is a fidelity style migration task, a white balance correction task or a medical image conversion task: for the fidelity style migration task, the input image and the content condition image are both the original image to be migrated, the style condition image is the target style image, and the target image retains the semantic content of the original image while carrying the style characteristics of the target style image; for the white balance correction task, the input image and the content condition image are both the color-deviated image, the style condition image is a reference image with normal white balance, and the reference image need not match the color-deviated image; for the medical image conversion task, the input image and the content condition image are both source-modality medical images, the style condition image is a target-modality medical image, and the source and target modalities comprise any pairwise combination of a T1-weighted MRI image, a contrast-enhanced T1-weighted MRI image, a T2-weighted MRI image and a T2-FLAIR MRI image.
- 10. A system for implementing the multi-scale controllable image generation method based on a latent space diffusion model according to any one of claims 1 to 9, comprising: a multi-source image acquisition and feature preprocessing module, configured to acquire the input image, the content condition image and the style condition image, and to execute preprocessing operations such as noise removal and resolution unification; a multi-scale condition fusion and guided denoising module, configured to map the input image to the latent space, perform denoising processing, and perform multi-scale feature fusion of the content condition and the style condition; and a condition-controlled injection and target image reconstruction module, configured to fuse the denoised features and the multi-scale fusion features through the decoder to generate and output a target image at the original resolution.
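The step-wise noising operation of claim 4 is the standard forward step of a DDPM-style diffusion process. Below is a minimal sketch under that reading, with $z_t = \sqrt{\alpha_t}\,z_{t-1} + \sqrt{1-\alpha_t}\,\epsilon_t$; the function and variable names are illustrative, not taken from the patent:

```python
import numpy as np

def forward_noise_step(z_prev, alpha_t, rng):
    """One forward diffusion step:
    z_t = sqrt(alpha_t) * z_{t-1} + sqrt(1 - alpha_t) * eps_t,
    where eps_t is freshly sampled Gaussian noise."""
    eps_t = rng.standard_normal(z_prev.shape)
    return np.sqrt(alpha_t) * z_prev + np.sqrt(1.0 - alpha_t) * eps_t

rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))          # toy latent feature map
z1 = forward_noise_step(z0, alpha_t=0.98, rng=rng)
```

Note that with `alpha_t = 1.0` the step is the identity, and as `alpha_t` decreases more noise is injected, which matches the claim's description of $\alpha_t$ as the noise-strength control parameter.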
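The fusion expression in claim 5 has the form of adaptive instance normalization (AdaIN). A minimal per-scale sketch follows, assuming the mean and standard deviation are computed per channel over the spatial dimensions (an assumption; the claim does not specify the axes):

```python
import numpy as np

def adain_fuse(content, style, eps=1e-5):
    """Fuse content and style features at one scale:
    F = sigma(S) * (C - mu(C)) / sigma(C) + mu(S).
    Statistics are per channel over the spatial axes; `eps` avoids
    division by zero for near-constant channels."""
    mu_c = content.mean(axis=(-2, -1), keepdims=True)
    sd_c = content.std(axis=(-2, -1), keepdims=True)
    mu_s = style.mean(axis=(-2, -1), keepdims=True)
    sd_s = style.std(axis=(-2, -1), keepdims=True)
    return sd_s * (content - mu_c) / (sd_c + eps) + mu_s

rng = np.random.default_rng(1)
C = rng.standard_normal((3, 16, 16))                # content features
S = 2.0 * rng.standard_normal((3, 16, 16)) + 5.0    # style features
F = adain_fuse(C, S)
```

The fused feature keeps the spatial layout of the content while adopting the first- and second-order channel statistics of the style, which is how the claim's "normalized fusion" balances content and style.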
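The decoder-input relation in claim 6 uses a zero-initialized convolution, as in ControlNet-style condition injection: at initialization the injected branch contributes nothing, so the pre-trained decoder's behavior is preserved and conditioning is learned gradually. A sketch with a zero-initialized 1×1 convolution (class and function names are illustrative):

```python
import numpy as np

class ZeroConv1x1:
    """1x1 convolution whose weights start at zero, so the condition
    injection is a no-op at initialization."""
    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))  # zero init

    def __call__(self, x):            # x: (C, H, W)
        c, h, w = x.shape
        return (self.weight @ x.reshape(c, -1)).reshape(c, h, w)

def decoder_input(fused_feat, prev_out, zero_conv):
    """D_i_in = ZeroConv(F_i) + D_{i-1}_out (skip-style injection)."""
    return zero_conv(fused_feat) + prev_out

zc = ZeroConv1x1(4)
F_i = np.ones((4, 8, 8))              # fusion feature of layer i
D_prev = np.full((4, 8, 8), 3.0)      # output of decoder layer i-1
D_in = decoder_input(F_i, D_prev, zc)
```

With untrained (zero) weights the decoder input equals the previous layer's output exactly; training then moves the weights away from zero to blend in the multi-scale condition.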
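The optimization objective in claim 8 is the usual noise-prediction loss. A sketch under the standard closed-form noising assumption $z_t = \sqrt{\bar\alpha_t}\,z_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$, with $\bar\alpha_t$ the cumulative product of the per-step $\alpha$'s (this closed form is conventional and not spelled out in the claim):

```python
import numpy as np

def noise_to_step_t(z0, alpha_bar_t, eps):
    """Closed-form jump to step t:
    z_t = sqrt(abar_t) * z_0 + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z0 + np.sqrt(1.0 - alpha_bar_t) * eps

def denoising_loss(eps, eps_pred):
    """Squared L2 norm between the injected noise and the noise
    predicted by the (conditional) denoising U-Net; averaging such
    terms over samples estimates the expectation in the objective."""
    return float(np.sum((eps - eps_pred) ** 2))

rng = np.random.default_rng(2)
z0 = rng.standard_normal((4, 8, 8))
eps = rng.standard_normal(z0.shape)
z_t = noise_to_step_t(z0, alpha_bar_t=0.6, eps=eps)
loss = denoising_loss(eps, eps_pred=np.zeros_like(eps))  # dummy predictor
```

A perfect predictor would drive the loss to zero; in training, the prediction $\epsilon_\theta(z_t, t, c)$ is additionally conditioned on the fused content/style information $c$.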
Description
Multi-scale controllable image generation method and system based on latent space diffusion model
Technical Field
The invention relates to the fields of computer vision, deep learning, and image generation and editing, and in particular to a multi-scale controllable image generation method and system based on a latent space diffusion model.
Background
Image generation and editing technology is an important research direction in the field of computer vision and is widely applied in scenarios such as digital content creation, medical image processing, and film and television production. Early image generation methods mainly relied on manual rules or traditional machine learning models and found it difficult to balance generation quality and control precision. With the development of deep learning, generative adversarial networks (GANs) have been widely used for image generation tasks, but they are prone to mode collapse and unstable generation results during training, which limits their application in high-reliability scenarios. In recent years, diffusion models have become the core technology in the field of image generation by virtue of their high-quality generation capability, but performing the diffusion process directly in pixel space incurs a heavy computational cost. To address this defect, the latent space diffusion model (LDM) was proposed: the LDM compresses an image into a low-dimensional latent space through a variational autoencoder (VAE) and then executes the diffusion process in the latent space, greatly reducing the computational cost while retaining high generation quality.
However, existing latent space diffusion models still fall short in controllability and multi-task adaptability. First, they adopt a single condition guiding mechanism: most LDMs support only single-scale condition injection, making it difficult to account for both the detail textures and the global semantics of images, which leads to "content distortion" or "stiff style fusion" in the generated results. Second, multi-task adaptability is poor: the condition module must be redesigned for each task, such as image harmonization or style migration, resulting in high development cost and weak generalization. Third, the balance between content and style is difficult to control: in tasks such as style migration and medical image conversion, the style easily overwhelms the content or is lost altogether, so the dual requirements of practical applications for fidelity and controllability cannot be met.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and to provide a multi-scale controllable image generation method and system based on a latent space diffusion model.
The aim of the invention can be achieved by the following technical scheme. The multi-scale controllable image generation method based on the latent space diffusion model comprises the following processing stages: obtaining an input image, a content condition image and a style condition image corresponding to a target image processing task, and performing the necessary preprocessing operations on the images to obtain multi-source condition images suitable for subsequent feature modeling; mapping the input image to a latent space and executing a diffusion modeling process in the latent space, wherein the denoising network gradually restores the latent space features during the reverse diffusion process; meanwhile, inputting the content condition image and the style condition image into a condition injector, fusing the content features and style features at multiple feature scales, and injecting the constructed multi-scale condition feature representation into the denoising network to guide the reverse diffusion process in the latent space; and performing condition-controlled decoding on the latent space features based on the latent space features obtained in the latent space denoising process and the multi-scale condition feature representation, to generate a target image consistent with the input image in spatial resolution. Further, when the target image processing task is an image harmonization task, the input image is a composite image, the content condition image is the foreground region of the composite image, and the style condition image is the complete background image after image restoration processing, where the image restoration processing supplements the background information occluded by the foreground in the composite image.
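The latent-space reverse diffusion described above can be sketched as a standard DDPM sampling loop. The update rule below is the conventional DDPM posterior mean with a simplified variance, which the patent does not specify, and `eps_model` is a placeholder for the condition-guided denoising U-Net:

```python
import numpy as np

def reverse_diffusion(z_T, alphas, eps_model, cond, rng):
    """Run the reverse (denoising) diffusion in latent space, guided by
    the fused multi-scale condition `cond`. `eps_model(z, t, cond)`
    stands in for the conditional denoising network's noise prediction."""
    alphas_bar = np.cumprod(alphas)
    z = z_T
    for t in range(len(alphas) - 1, -1, -1):
        eps_hat = eps_model(z, t, cond)
        # DDPM posterior mean for z_{t-1} given z_t and predicted noise
        mean = (z - (1.0 - alphas[t]) / np.sqrt(1.0 - alphas_bar[t])
                * eps_hat) / np.sqrt(alphas[t])
        if t > 0:
            # simplified variance choice; the last step is noise-free
            z = mean + np.sqrt(1.0 - alphas[t]) * rng.standard_normal(z.shape)
        else:
            z = mean
    return z

rng = np.random.default_rng(3)
alphas = np.linspace(0.99, 0.95, 10)               # toy noise schedule
dummy_eps = lambda z, t, c: np.zeros_like(z)       # placeholder predictor
z_T = rng.standard_normal((4, 8, 8))
z0_hat = reverse_diffusion(z_T, alphas, dummy_eps, cond=None, rng=rng)
```

The recovered latent `z0_hat` would then be passed, together with the multi-scale fusion features, to the VAE decoder for condition-controlled decoding to the original resolution.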
Further, the preprocessing comprises removing salt-and-pepper noise and Gaussian noise from the input image by Gaussian filtering and enhancing image details by adaptive histogram equalization, wherein the resolution unification processing adopts bilinear interpolation to adjust the content condition image and the style condition image to the same resolution as the input image.