CN-122023552-A - Diffusion model single image generation method and system based on frequency domain guidance

Abstract

The application discloses a diffusion model single image generation method and system based on frequency domain guidance, relating to the technical field of computer vision. The method comprises: acquiring a single training image and defining a reference network based on a diffusion model; establishing a composite frequency-domain constraint function; establishing an iterative structure refinement strategy; performing end-to-end training of the reference network based on the composite frequency-domain constraint function; and, after training is completed, performing reverse denoising sampling in combination with the iterative structure refinement strategy to generate new images whose distribution is similar to that of the training image. Through the dual guidance of frequency-domain constraints in the training stage and structure refinement in the image generation stage, the application improves the structural integrity and visual fidelity of the generated images without changing the core architecture of the reference network.

Inventors

WANG ZELONG
LI JIAN
Ling Chengyang
WANG XINGWANG
WANG YINGYING
ZHANG CHI
ZHENG ZHONGHUA
HAN RUI

Assignees

National University of Defense Technology, Chinese People's Liberation Army (中国人民解放军国防科技大学)

Dates

Publication Date: 20260512
Application Date: 20260130

Claims (10)

1. A diffusion model single image generation method based on frequency domain guidance, characterized by comprising the following steps: acquiring a single training image, and defining a reference network based on a diffusion model, wherein the reference network adopts a neural network architecture for noise prediction so as to realize a reverse denoising process from a noisy image to a clear image; establishing a composite frequency-domain constraint function, wherein the composite frequency-domain constraint function adds, in the training stage of the reference network, explicit constraints on the frequency-domain components of the image on top of the noise prediction loss, and comprises a constraint term for the low-frequency components of the image and a constraint term for the high-frequency components of the image; establishing an iterative structure refinement strategy, wherein the iterative structure refinement strategy uses low-frequency information of the original training image to correct the image generation process in real time in each reverse denoising step of the image generation stage, so as to suppress error accumulation; and performing end-to-end training of the reference network based on the composite frequency-domain constraint function, and, after training is completed, performing reverse denoising sampling in combination with the iterative structure refinement strategy to generate a new image whose distribution is similar to that of the training image.
2. The frequency domain guidance-based diffusion model single image generation method according to claim 1, characterized in that the reference network is a noise prediction network based on a U-Net architecture; the training process of the reference network comprises a fixed forward noising process and a reverse denoising process that has to be learned; in the forward noising process, Gaussian noise is gradually added to the training image over a plurality of discrete time steps; and in the reverse denoising process, the reference network receives the noised image and the time step as inputs and predicts the added noise.
3. The frequency domain guidance-based diffusion model single image generation method according to claim 1, wherein the constraint term for the low-frequency components of the image includes a low-frequency spatial-domain structure loss and a low-frequency domain amplitude loss; the low-frequency spatial-domain structure loss imposes a constraint on the low-frequency part of the predicted image in the spatial domain, wherein the low-frequency components of the spectrum are separated using a two-dimensional fast Fourier transform and the loss is calculated in combination with a preset low-pass filter mask and the L1 norm; and the low-frequency domain amplitude loss imposes a constraint on the Fourier amplitude spectrum of the low-frequency components, so as to supervise the energy distribution and contrast of the low-frequency components.
4. The frequency domain guidance-based diffusion model single image generation method according to claim 3, wherein the low-frequency spatial-domain structure loss is calculated by the following formula: [formula not reproduced in the source text]; wherein the terms of the formula denote, respectively: the low-frequency spatial-domain structure loss; the two-dimensional fast Fourier transform; the predicted training image; the real training image; element-wise multiplication; a preset ideal low-pass filter mask; and the L1 norm.
5. The frequency domain guidance-based diffusion model single image generation method according to claim 3, wherein the low-frequency domain amplitude loss is calculated by the following formula: [formula not reproduced in the source text]; wherein the terms of the formula denote, respectively: the low-frequency domain amplitude loss; the Fourier amplitude spectrum of an image; the predicted training image; the real training image; the frequency coordinates of the image after the Fourier transform; the mathematical expectation over the frequency coordinates; and a preset ideal low-pass filter mask.
6. The frequency domain guidance-based diffusion model single image generation method according to claim 1, wherein the constraint term for the high-frequency components of the image is a high-frequency domain logarithmic amplitude loss; the high-frequency domain logarithmic amplitude loss applies a logarithmic penalty to the relative difference in high-frequency amplitude, so as to encourage the generation of clear details.
7. The frequency domain guidance-based diffusion model single image generation method according to claim 6, wherein the high-frequency domain logarithmic amplitude loss is calculated by the following formula: [formula not reproduced in the source text]; wherein the terms of the formula denote, respectively: the high-frequency domain logarithmic amplitude loss; epsilon, a small constant that ensures numerical stability; the frequency coordinates of the image after the Fourier transform; the Fourier amplitude spectrum of an image; the predicted training image; the real training image; and the mathematical expectation over the frequency coordinates.
8. The frequency domain guidance-based diffusion model single image generation method according to claim 1, wherein the real-time correction process of the iterative structure refinement strategy comprises: calculating the difference, on the low-frequency amplitude spectrum, between the predicted training image obtained in the current denoising step and the real training image; updating the predicted training image by taking the difference as a structure correction gradient; and using the updated image in the calculation of the subsequent denoising step.
9. The frequency domain guidance-based diffusion model single image generation method according to claim 1, wherein, in the end-to-end training, a total loss function is formed as a weighted combination of the composite frequency-domain loss corresponding to the composite frequency-domain constraint function and a pixel-level noise prediction loss, and an optimizer is used to train the reference network iteratively, wherein the pixel-level noise prediction loss is an L1 loss or an L2 loss between the predicted noise and the real noise.
10. A frequency domain guidance-based diffusion model single image generation system for implementing the frequency domain guidance-based diffusion model single image generation method of any one of claims 1 to 9, the system comprising: a reference network module, configured to acquire a single training image and define a reference network based on a diffusion model, wherein the reference network adopts a neural network architecture for noise prediction so as to realize a reverse denoising process from a noisy image to a clear image; a composite frequency-domain constraint module, configured to establish a composite frequency-domain constraint function, wherein the composite frequency-domain constraint function adds, in the training stage of the reference network, explicit constraints on the frequency-domain components of the image on top of the noise prediction loss; a structure refinement module, configured to establish an iterative structure refinement strategy, wherein the iterative structure refinement strategy uses low-frequency information of the original training image to correct the image generation process in real time in each reverse denoising step of the image generation stage, so as to suppress error accumulation; and a training and generation module, configured to perform end-to-end training of the reference network based on the composite frequency-domain constraint function, and, after training is completed, to perform reverse denoising sampling in combination with the iterative structure refinement strategy to generate a new image whose distribution is similar to that of the training image.
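
The loss terms of claims 3 to 7 and the weighted total loss of claim 9 can be illustrated with a minimal NumPy sketch. The formulas themselves are not reproduced in this text, so the sketch is one possible reading and not the claimed formulas. The function names, the cutoff radius, the use of a mean absolute difference for each term, the choice of the complement of the low-pass mask as the high-frequency band, and all weights are assumptions made for illustration.

```python
import numpy as np


def ideal_low_pass_mask(height, width, cutoff=0.1):
    """Binary mask keeping frequencies within `cutoff` cycles/pixel of DC.

    Uses the unshifted layout of np.fft.fft2. The cutoff is a hypothetical
    value, not taken from the patent text.
    """
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(np.float64)


def low_freq_structure_loss(pred, real, mask):
    """L1-type distance between the masked (low-frequency) complex spectra."""
    diff = mask * np.fft.fft2(pred) - mask * np.fft.fft2(real)
    return float(np.mean(np.abs(diff)))


def low_freq_amplitude_loss(pred, real, mask):
    """Mean absolute amplitude difference over the low-frequency band."""
    gap = np.abs(np.abs(np.fft.fft2(pred)) - np.abs(np.fft.fft2(real)))
    return float(np.sum(mask * gap) / np.sum(mask))


def high_freq_log_amplitude_loss(pred, real, mask, eps=1e-8):
    """Mean absolute log-amplitude difference over the high-frequency band.

    Assumption: the high-frequency band is the complement of the low-pass mask.
    """
    high = 1.0 - mask
    log_pred = np.log(np.abs(np.fft.fft2(pred)) + eps)
    log_real = np.log(np.abs(np.fft.fft2(real)) + eps)
    return float(np.sum(high * np.abs(log_pred - log_real)) / np.sum(high))


def composite_frequency_loss(pred, real, cutoff=0.1, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three terms for 2-D (single-channel) images."""
    mask = ideal_low_pass_mask(*pred.shape, cutoff=cutoff)
    w_struct, w_amp, w_high = weights  # hypothetical weights
    return (w_struct * low_freq_structure_loss(pred, real, mask)
            + w_amp * low_freq_amplitude_loss(pred, real, mask)
            + w_high * high_freq_log_amplitude_loss(pred, real, mask))


def total_loss(noise_pred, noise, pred, real, freq_weight=0.1):
    """L2 noise prediction loss plus the weighted composite frequency loss."""
    noise_loss = float(np.mean((noise_pred - noise) ** 2))
    return noise_loss + freq_weight * composite_frequency_loss(pred, real)
```

In this sketch `pred` stands for the clean-image estimate derived from the predicted noise and `real` for the training image. For actual training, the same operations would have to be written with the FFT routines of an automatic differentiation framework so that gradients reach the network.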

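The real-time correction of claim 8 can be sketched in a similar way. The claim only states that the low-frequency amplitude difference between the predicted image and the training image is used as a structure correction gradient. The sketch below assumes one concrete form of that update: the low-frequency amplitude of the prediction is moved toward that of the training image while the predicted phase is kept. The names `low_pass_mask`, `refine_low_frequency`, `cutoff`, and `step_size` are hypothetical.

```python
import numpy as np


def low_pass_mask(height, width, cutoff=0.2):
    """Binary low-pass mask in the unshifted FFT layout (hypothetical cutoff)."""
    fy = np.fft.fftfreq(height)[:, None]
    fx = np.fft.fftfreq(width)[None, :]
    return (np.sqrt(fx ** 2 + fy ** 2) <= cutoff).astype(np.float64)


def refine_low_frequency(pred_x0, train_image, mask, step_size=0.5):
    """One structure-correction update of the predicted clean image.

    Assumed update rule: step the low-frequency amplitude of the prediction
    toward the amplitude of the training image and keep the predicted phase.
    """
    spec_pred = np.fft.fft2(pred_x0)
    amp_pred = np.abs(spec_pred)
    amp_train = np.abs(np.fft.fft2(train_image))

    # Difference on the low-frequency amplitude spectrum (zero outside the mask).
    amp_gap = mask * (amp_train - amp_pred)

    # Apply the difference as a correction step on the amplitude only.
    phase = np.exp(1j * np.angle(spec_pred))
    corrected = (amp_pred + step_size * amp_gap) * phase

    # The corrected spectrum stays conjugate-symmetric, so the imaginary part
    # of the inverse transform is only rounding noise and can be dropped.
    return np.real(np.fft.ifft2(corrected))
```

In a sampler, such a function would be applied to the clean-image estimate of each reverse denoising step, and the corrected estimate would then feed the computation of the next step. With `step_size=0.5` the low-frequency amplitude gap is halved per call under this assumed rule.
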
Description

Diffusion model single image generation method and system based on frequency domain guidance
Technical Field
The application relates to the technical field of computer vision, and in particular to a diffusion model single image generation method and system based on frequency domain guidance.
Background
Single image generation is an important research direction in computer vision. Its core aim is to learn the intrinsic visual statistics of a single image sample so as to generate a series of new images. In the prior art, single image generation methods fall mainly into non-parametric methods and parametric methods. Non-parametric methods synthesize a new image by sampling and recombining patches of the source image; they preserve local textures but have difficulty generating novel global structures. Parametric methods include methods based on generative adversarial networks (GANs) and methods based on denoising diffusion models. GAN-based methods suffer from unstable training, mode collapse, and damaged structural integrity. Diffusion-based methods offer high generation quality and stable training, but they usually adopt a model with a small receptive field to avoid overfitting to the single sample. As a result, the model perceives global structure poorly, and the generated images are prone to macroscopic structural distortion, collapse of key content, and blurred high-frequency details. Frequency-domain information can effectively decouple the low-frequency structural components and the high-frequency detail components of an image, which provides strong support for improving image generation quality.
In the prior art, frequency-domain information has been applied to tasks such as domain adaptation, image super-resolution, and style transfer, which demonstrates its effectiveness in guiding the image generation process and optimizing specific image attributes. Therefore, how to use frequency-domain information to compensate for the structural perception deficiency that a limited receptive field causes in diffusion models for single image generation has become a key technical problem for improving single image generation quality.
Disclosure of Invention
The application aims to provide a diffusion model single image generation method and system based on frequency domain guidance, which significantly improve the structural integrity and visual fidelity of the generated image without changing the core architecture of the reference network, through the dual guidance of frequency-domain constraints in the training stage and structure refinement in the image generation stage. To achieve the above object, the application provides the following technical solutions.
In a first aspect, the application provides a diffusion model single image generation method based on frequency domain guidance, which comprises the following steps.
Acquiring a single training image, and defining a reference network based on a diffusion model, wherein the reference network adopts a neural network architecture for noise prediction so as to realize a reverse denoising process from a noisy image to a clear image.
Establishing a composite frequency-domain constraint function, wherein the composite frequency-domain constraint function adds, in the training stage of the reference network, explicit constraints on the frequency-domain components of the image on top of the noise prediction loss, and comprises a constraint term for the low-frequency components of the image and a constraint term for the high-frequency components of the image.
Establishing an iterative structure refinement strategy, wherein the iterative structure refinement strategy uses low-frequency information of the original training image to correct the image generation process in real time in each reverse denoising step of the image generation stage, so as to suppress error accumulation.
Performing end-to-end training of the reference network based on the composite frequency-domain constraint function, and, after training is completed, performing reverse denoising sampling in combination with the iterative structure refinement strategy to generate a new image whose distribution is similar to that of the training image.
The reference network is a noise prediction network based on a U-Net architecture. The training process of the reference network comprises a fixed forward noising process and a reverse denoising process that has to be learned. In the forward noising process, Gaussian noise is gradually added to the training image over a plurality of discrete time steps, and the reference network receives the noised image and the time step as inputs in the reverse denoising process to