CN-118674636-B - High-quality image synthesis method based on diffusion model
Abstract
The invention provides a high-quality image synthesis method based on a diffusion model. Multiple foreground images and a background image are fused and converted into a highly realistic synthesized image through the diffusion model's process of gradually adding and removing noise. An edge contour correction model is introduced: the edge contour information of the foreground is used as a prompt to stabilize the diffusion process, ensuring that edge transitions of the synthesized image remain smooth and natural even where objects overlap. A ControlNet-based color conversion model is also introduced: the color information of the foreground and background is processed and fed to the ControlNet, whose output serves as a prompt for the diffusion model decoder, so that the colors of the synthesized image stay closer to reality even when multiple foreground images are present.
Inventors
- PENG XUAN
- LI HAOWEN
- ZHU JIASHENG
- HUANG KAI
- LONG BAIRUI
- LIAN DONGHUI
- HUANG WENNING
- SHI SIHUA
- ZENG XIANGHUI
- XU HUIYING
- ZHOU ANRAN
- ZHENG SIYING
- YANG CHENGEN
- ZHOU WENJUN
Assignees
- 广东省机场集团物流有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20240530
Claims (10)
- 1. A high-quality image synthesis method based on a diffusion model, characterized by comprising the following steps: S1, acquiring a background image and at least one foreground image, and extracting a mask image of each foreground image; pasting each foreground image into the background image to obtain a primary fusion image, and extracting an edge contour map of the primary fusion image (an illustrative sketch of this step is given after the claims); S2, establishing an edge contour correction model, the edge contour correction model comprising an adaptive encoder and an edge feature fusion module which are sequentially connected; inputting the mask image of each foreground image and the edge contour map of the primary fusion image together into the adaptive encoder to obtain a contour feature map; inputting the primary fusion image into the U-Net encoder of the diffusion model to obtain an intermediate feature map; inputting the contour feature map and the intermediate feature map together into the edge feature fusion module to obtain a contour fusion feature map; S3, establishing a color conversion model, the color conversion model comprising a Lab space color conversion module and a ControlNet module which are sequentially connected; inputting the background image, each foreground image and its mask image together into the Lab space color conversion module to obtain a synthesized color adaptation map; inputting the synthesized color adaptation map into the ControlNet module to obtain a color adaptation feature map; S4, taking the contour fusion feature map as a prompt of the U-Net encoder of the diffusion model and taking the color adaptation feature map as a prompt of the U-Net decoder of the diffusion model to obtain an improved diffusion model; and inputting the primary fusion image into the improved diffusion model to obtain a high-quality synthesized image.
- 2. The high-quality image synthesis method based on the diffusion model according to claim 1, wherein in step S1 the background image is specifically a baggage X-ray image, and the foreground image is specifically an X-ray image of a prohibited item.
- 3. The method according to claim 1, wherein in step S1 the edge contour map of the primary fusion image is extracted by using a Canny edge detector.
- 4. The method according to claim 1, wherein in step S2 the adaptive encoder comprises a PixelUnshuffle downsampling block, a first feature extraction block, a first downsampling block, a second feature extraction block, a second downsampling block, a third feature extraction block, a third downsampling block, and a fourth feature extraction block, which are sequentially connected, and each feature extraction block comprises one convolution layer and two residual blocks which are sequentially connected (see the sketch after the claims).
- 5. The method according to claim 1, wherein in step S2 the edge feature fusion module comprises a first Transformer layer, a second Transformer layer, and a fully connected layer; the input of the first Transformer layer is the intermediate feature map, the input of the second Transformer layer is the contour feature map, and the outputs of the first Transformer layer and the second Transformer layer are both connected to the input of the fully connected layer; the output of the fully connected layer is concatenated with the intermediate feature map and the contour feature map to obtain the contour fusion feature map (see the sketch after the claims).
- 6. The method according to claim 1, wherein in step S3 the step of obtaining the synthesized color adaptation map using the Lab space color conversion module comprises: S3.1, combining the background image with the mask image of each foreground image to obtain an image of each foreground position within the background; S3.2, converting the image of each foreground position within the background and each foreground image from the RGB color space to the Lab color space, correspondingly obtaining a background Lab image and each foreground Lab image; S3.3, matching the mean and standard deviation of each foreground Lab image to those of the background Lab image in the Lab color space, correspondingly obtaining a color adaptation map of each foreground Lab image; S3.4, converting the color adaptation map of each foreground Lab image from the Lab color space back to the RGB color space, and splicing it with the background image to obtain an image in which each foreground position within the background has undergone color adaptation conversion, which is taken as the synthesized color adaptation map (see the sketch after the claims).
- 7. The high-quality image synthesis method based on the diffusion model according to claim 6, wherein step S3.3 comprises: calculating the mean μ_C of the background Lab image and of each foreground Lab image in the Lab color space according to the color mean calculation formula μ_C = (1/n) Σ_{i=1..n} C_i, wherein C_i represents the value of the i-th pixel point and n is the number of pixel points; calculating the corresponding standard deviation σ_C from the mean μ_C according to σ_C = ((1/n) Σ_{i=1..n} (C_i − μ_C)²)^(1/2); and obtaining the color adaptation map of each foreground Lab image according to L′ = (σ_L^bg / σ_L^fg)(L − μ_L^fg) + μ_L^bg, a′ = (σ_a^bg / σ_a^fg)(a − μ_a^fg) + μ_a^bg, b′ = (σ_b^bg / σ_b^fg)(b − μ_b^fg) + μ_b^bg, wherein L′, a′ and b′ are respectively the luminance channel value, the a color channel value and the b color channel value of the color adaptation map of the foreground Lab image; L, a and b are respectively the luminance channel value, the a color channel value and the b color channel value of the foreground Lab image; μ_L^bg, μ_a^bg and μ_b^bg are respectively the luminance channel mean, the a color channel mean and the b color channel mean of the background Lab image; μ_L^fg, μ_a^fg and μ_b^fg are respectively the luminance channel mean, the a color channel mean and the b color channel mean of the foreground Lab image; σ_L^bg, σ_a^bg and σ_b^bg are respectively the luminance channel standard deviation, the a color channel standard deviation and the b color channel standard deviation of the background Lab image; and σ_L^fg, σ_a^fg and σ_b^fg are respectively the luminance channel standard deviation, the a color channel standard deviation and the b color channel standard deviation of the foreground Lab image.
- 8. The method of claim 1, wherein in step S3 the step of obtaining the color adaptation feature map using the ControlNet module comprises: first, converting the synthesized color adaptation map into a feature space with a preset encoder so that its size and dimension match the input layer of the diffusion model, according to the formula c_f = E(c), wherein c_f is the prompt condition, E(·) is the preset encoder, and c is the synthesized color adaptation map; copying the weights Θ of the U-Net encoder of the diffusion model to form a new trainable copy Θ_c; the ControlNet module comprises a first zero convolution layer, the trainable copy Θ_c, and a second zero convolution layer which are connected in sequence; taking the prompt condition c_f as the input of the ControlNet module, the ControlNet module outputs the color adaptation feature map y_c, specifically y_c = F(x; Θ) + Z(F(x + Z(c_f; Θ_z1); Θ_c); Θ_z2), wherein F denotes the U-Net encoder of the diffusion model, x is the input of the U-Net encoder, Θ is the parameter set of the U-Net encoder, Z denotes a zero convolution layer, Θ_z1 is the parameter of the first zero convolution layer, and Θ_z2 is the parameter of the second zero convolution layer (see the sketch after the claims).
- 9. The high-quality image synthesis method based on the diffusion model according to claim 1, wherein in the U-Net encoder of the diffusion model an intermediate layer of the U-Net encoder outputs the intermediate feature map, and the contour fusion feature map output by the edge feature fusion module is mapped back into the cross-attention layers of the U-Net encoder as the prompt of the U-Net encoder; in the U-Net decoder of the diffusion model, the output of the ControlNet module is connected behind each decoding block of the U-Net decoder, so that the color adaptation feature map output by the ControlNet module serves as the prompt of the U-Net decoder of the diffusion model, thereby obtaining the improved diffusion model.
- 10. The high-quality image synthesis method based on a diffusion model according to any one of claims 1 to 9, wherein the diffusion model is specifically a Stable Diffusion model.
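A minimal code sketch of step S1 (claims 1 and 3): pasting each foreground image into the background through its mask to obtain the primary fusion image, then extracting its edge contour map with a Canny edge detector. OpenCV is assumed; the paste positions, file names and Canny thresholds are illustrative and are not specified by the patent.

```python
import cv2
import numpy as np

def paste_foregrounds(background, foregrounds, masks, positions):
    """Paste each foreground into the background at the given top-left position,
    using its binary mask, and return the primary fusion image."""
    fused = background.copy()
    for fg, mask, (x, y) in zip(foregrounds, masks, positions):
        h, w = fg.shape[:2]
        roi = fused[y:y + h, x:x + w]
        m = (mask > 0)[..., None]                 # broadcast mask over channels
        fused[y:y + h, x:x + w] = np.where(m, fg, roi)
    return fused

def edge_contour_map(image, low=100, high=200):
    """Extract the edge contour map of the primary fusion image with Canny."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    return cv2.Canny(gray, low, high)

# Example usage (paths and position are placeholders):
# bg = cv2.imread("baggage_xray.png")
# fg = cv2.imread("contraband_xray.png")
# mask = cv2.imread("contraband_mask.png", cv2.IMREAD_GRAYSCALE)
# fusion = paste_foregrounds(bg, [fg], [mask], [(50, 80)])
# edges = edge_contour_map(fusion)
```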
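A minimal PyTorch sketch of the adaptive encoder of claim 4: a PixelUnshuffle downsampling block followed by four feature extraction blocks (each one convolution layer plus two residual blocks) separated by three downsampling blocks. Channel widths, the downsampling factors, and the use of strided convolutions for the downsampling blocks are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

def feature_extraction_block(in_ch, out_ch):
    # one convolution layer followed by two residual blocks (claim 4)
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1),
        ResidualBlock(out_ch),
        ResidualBlock(out_ch),
    )

class AdaptiveEncoder(nn.Module):
    def __init__(self, in_ch=2, base=64):
        super().__init__()
        self.pixel_unshuffle = nn.PixelUnshuffle(2)             # spatial /2, channels x4
        self.stage1 = feature_extraction_block(in_ch * 4, base)
        self.down1  = nn.Conv2d(base, base * 2, 3, stride=2, padding=1)
        self.stage2 = feature_extraction_block(base * 2, base * 2)
        self.down2  = nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1)
        self.stage3 = feature_extraction_block(base * 4, base * 4)
        self.down3  = nn.Conv2d(base * 4, base * 4, 3, stride=2, padding=1)
        self.stage4 = feature_extraction_block(base * 4, base * 4)

    def forward(self, mask_and_edges):
        # input: foreground mask and edge contour map stacked as channels (assumed)
        x = self.pixel_unshuffle(mask_and_edges)
        x = self.stage1(x)
        x = self.stage2(self.down1(x))
        x = self.stage3(self.down2(x))
        x = self.stage4(self.down3(x))
        return x                                                # contour feature map

# contour_features = AdaptiveEncoder()(torch.randn(1, 2, 512, 512))
```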
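A minimal PyTorch sketch of the edge feature fusion module of claim 5: the intermediate feature map and the contour feature map each pass through a Transformer layer, the two outputs are fused by a fully connected layer, and the result is concatenated with both input feature maps to give the contour fusion feature map. The feature dimension, the number of attention heads, and the flattening of feature maps into token sequences are assumptions.

```python
import torch
import torch.nn as nn

class EdgeFeatureFusion(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.tf_intermediate = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.tf_contour = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.fc = nn.Linear(2 * dim, dim)   # fully connected fusion layer

    def forward(self, intermediate, contour):
        # both inputs: (B, C, H, W) feature maps with C == dim
        b, c, h, w = intermediate.shape
        inter_seq = intermediate.flatten(2).transpose(1, 2)    # (B, H*W, C)
        cont_seq  = contour.flatten(2).transpose(1, 2)
        t1 = self.tf_intermediate(inter_seq)
        t2 = self.tf_contour(cont_seq)
        fused = self.fc(torch.cat([t1, t2], dim=-1))           # (B, H*W, C)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        # concatenate with both inputs to obtain the contour fusion feature map
        return torch.cat([fused, intermediate, contour], dim=1)

# out = EdgeFeatureFusion()(torch.randn(1, 256, 32, 32), torch.randn(1, 256, 32, 32))
```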
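A minimal sketch of the Lab color adaptation of claims 6 and 7: the masked foreground is converted to the Lab color space and its per-channel mean and standard deviation are matched to those of the background region, following the transfer formula of claim 7. OpenCV is assumed, and computing the background statistics over an optional background-region mask (rather than reproducing step S3.1 exactly) is a simplification.

```python
import cv2
import numpy as np

def lab_color_adaptation(background_bgr, foreground_bgr, fg_mask, bg_region_mask=None):
    """Match the Lab statistics of the masked foreground to those of the
    (optionally masked) background region, as in claims 6-7."""
    bg_lab = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)
    fg_lab = cv2.cvtColor(foreground_bgr, cv2.COLOR_BGR2LAB).astype(np.float32)

    bg_pixels = (np.ones(bg_lab.shape[:2], bool)
                 if bg_region_mask is None else bg_region_mask > 0)
    fg_pixels = fg_mask > 0

    adapted = fg_lab.copy()
    for ch in range(3):                                        # L, a, b channels
        bg_vals = bg_lab[..., ch][bg_pixels]
        fg_vals = fg_lab[..., ch][fg_pixels]
        mu_bg, sigma_bg = bg_vals.mean(), bg_vals.std()
        mu_fg, sigma_fg = fg_vals.mean(), fg_vals.std()
        # x' = (sigma_bg / sigma_fg) * (x - mu_fg) + mu_bg     (claim 7)
        adapted[..., ch][fg_pixels] = (
            (sigma_bg / max(sigma_fg, 1e-6)) * (fg_vals - mu_fg) + mu_bg)

    adapted = np.clip(adapted, 0, 255).astype(np.uint8)
    return cv2.cvtColor(adapted, cv2.COLOR_LAB2BGR)            # back to RGB space

# color_adapted_fg = lab_color_adaptation(bg, fg, mask)
```

The adapted foreground would then be pasted back into the background (as in step S3.4) to form the synthesized color adaptation map.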
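A minimal PyTorch sketch of the ControlNet connection of claim 8, y_c = F(x; Θ) + Z(F(x + Z(c_f; Θ_z1); Θ_c); Θ_z2): the U-Net encoder is kept frozen, a trainable copy of it is created, and two zero-initialized convolutions wrap the copy. The toy encoder, channel sizes, and the use of 1x1 kernels for the zero convolutions are assumptions standing in for the real U-Net encoder of a Stable Diffusion model.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels):
    # 1x1 convolution whose weights and bias start at zero, so the ControlNet
    # branch initially contributes nothing to the output
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlNetBranch(nn.Module):
    def __init__(self, unet_encoder, channels):
        super().__init__()
        self.trainable_copy = copy.deepcopy(unet_encoder)      # trainable copy Θ_c
        self.locked = unet_encoder                             # F with frozen weights Θ
        for p in self.locked.parameters():
            p.requires_grad_(False)
        self.zero_in = zero_conv(channels)                     # Z(·; Θ_z1)
        self.zero_out = zero_conv(channels)                    # Z(·; Θ_z2)

    def forward(self, x, c_f):
        # x: U-Net encoder input; c_f: prompt condition E(c), already encoded
        base = self.locked(x)
        control = self.trainable_copy(x + self.zero_in(c_f))
        return base + self.zero_out(control)                   # color adaptation feature map y_c

# toy_encoder = nn.Sequential(nn.Conv2d(4, 4, 3, padding=1), nn.SiLU())
# branch = ControlNetBranch(toy_encoder, channels=4)
# y_c = branch(torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64))
```

Because the zero convolutions start at zero, the branch initially leaves the diffusion model's output unchanged and only gradually learns to inject the color adaptation prompt during training.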
Description
High-quality image synthesis method based on diffusion model
Technical Field
The invention relates to the technical field of image processing and fusion, and in particular to a high-quality image synthesis method based on a diffusion model.
Background
In modern airport security operations, it is important to quickly and efficiently identify contraband in baggage using X-ray machines. Improving the performance of an X-ray detection system requires both continuously optimizing the detection algorithm and improving the recognition accuracy of the operator, and both improvements depend on high-quality image data. However, high-quality X-ray image datasets containing diverse contraband are relatively scarce, and developing efficient and accurate automated detection algorithms is a challenge when training data are inadequate. Although image synthesis technologies such as GANs and Stable Diffusion offer a possible route to synthesizing X-ray images, the synthesized X-ray images still show large color differences between foreground and background and unnatural edge transitions, so satisfactory realism cannot be achieved.
Image synthesis methods based on deep neural networks have made significant progress in recent years; these methods typically use large amounts of data to train models to learn how to harmonize images effectively. For example, the Deep Image Harmonization (DIH) network automatically adjusts foreground objects by predicting global color transforms to match the statistical features of the background. With the rise of generative models such as generative adversarial networks (GANs) and variational autoencoders (VAEs), image harmonization techniques have developed further. These models can synthesize high-quality images and, through end-to-end training, better capture the complex interactions between foreground and background; they exhibit superior performance particularly in style transfer and content preservation. However, despite the breakthroughs of GANs in image synthesis, they still face challenges in maintaining the authenticity and detail of image content, and their instability during training often results in composite images of varying quality.
Against this background, the diffusion model, as an emerging image synthesis technology, provides a new approach. Compared with conventional GANs and other generative models, a diffusion model synthesizes images through a gradual noise addition and removal process, which is more controllable, and the model is more stable during training. Diffusion models show great potential in the field of image harmonization, deliver better image quality, and have clear advantages in training and stability.
A prior patent document discloses an image-harmonization image editing method and system based on a diffusion model, which comprises: collecting images to construct a foreground data set and a background data set and building a composite image set; constructing an adaptive encoder from the foreground mask image and the composite image by using a pixel inverse recombination (PixelUnshuffle) layer and downsampling blocks; improving the structure of the U-Net encoder in the diffusion model; obtaining a denoised feature map with the improved diffusion model and fusing it with the composite feature map; processing the fused feature map with RFFT and IRFFT; mixing the processed feature map with the corresponding composite image by means of the foreground mask image to obtain a global feature map; and inputting the global feature map into a VGG model for training while optimizing the output with a style loss function to obtain the optimized global feature map. This prior-art scheme still has the following defects: 1) edge transitions are not smooth and lack local adaptability, because local image characteristics such as texture and illumination changes need to be taken into account in the edge region, especially where objects overlap; 2) the color difference between the foreground and the background is not handled well, so the colors of the composite image deviate noticeably from reality and the realism of the result is greatly reduced.
Disclosure of Invention
The invention provides a high-quality image synthesis method based on a diffusion model to overcome the defects of unsmooth edge transitions and low color realism in images synthesized by the prior art; it can effectively fuse images of prohibited items into baggage X-ray scan images while keeping clear contours and high color realism. In order to solve the above technical problems, the technical scheme of the invention is as follows: a high-quality