CN-121982128-A - Image generation method based on diffusion model
Abstract
The invention discloses an image generation method based on a diffusion model, relating to the fields of computer vision and remote sensing image processing. Through a direction- and size-aware attention mechanism, the method achieves coordinated, precise control over the position, direction, and size of targets in generated remote sensing images, markedly improving the geometric consistency between the generated image and the input layout. Built on a complete pipeline of layout rebalancing, conditional generation, and quality screening, it can automatically synthesize high-quality datasets, significantly improves the accuracy of downstream rotated object detectors on multiple benchmark datasets, and effectively addresses the problems of data scarcity, imbalanced distribution, and high annotation cost in remote sensing detection tasks, providing reliable data support for remote sensing object detection. The method can generate synthetic remote sensing datasets with consistent layouts and accurate geometric attributes, and is particularly suitable for producing remote sensing object detection training data covering diverse orientations and sizes.
Inventors
- YUE JIAN
- YE MAO
Assignees
- University of Electronic Science and Technology of China (电子科技大学)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-01
Claims (9)
- 1. An image generation method based on a diffusion model, characterized by comprising the following steps. Step 1: perform distribution statistics on the layout labels of an original remote sensing dataset, extracting the category distribution, target position distribution, target size distribution, and target direction-angle distribution; then apply a distribution adjustment to the layout labels to obtain a balanced layout-label distribution, namely: adjust the class sampling probability with inverse-frequency weighting on the category distribution, sample spatial positions from an approximately uniform distribution for the target position distribution, keep the target size distribution unchanged, and sample direction angles from a uniform distribution for the target direction-angle distribution. Step 2: randomly determine the number of targets for each synthetic image to be generated, and sequentially sample each target's category, position coordinates, size, and direction angle from the balanced layout-label distribution, obtaining the layout labels of the targets of the synthetic image to be generated (a code sketch of this sampling procedure follows the claims). Step 3: convert the layout labels obtained in step 2 into a spatial layout embedding and a semantic text embedding, which together serve as the multi-modal condition embedding. Step 4: conditioned on the multi-modal embedding, progressively transform Gaussian noise into the latent variables of a remote sensing image through a constructed direction- and angle-sensitive diffusion model, then decode the final latent variables into a remote sensing image, obtaining a generated synthetic image. The direction- and angle-sensitive diffusion model is a U-shaped network built from residual modules and layout cross-attention modules; it comprises an encoder formed by alternately stacking residual modules and layout cross-attention modules and a decoder formed in the same way, and the decoder output is passed through a direction-and-scale attention module to produce the denoised latent variables at each step. Step 5: evaluate the quality of the synthetic images generated in step 4 based on semantic-layout alignment and target classification confidence; images that simultaneously satisfy the semantic-layout alignment threshold and the target classification confidence threshold are kept as high-quality synthetic images, and the enhanced remote sensing dataset is obtained from the original remote sensing dataset together with all high-quality synthetic images.
- 2. The image generation method based on a diffusion model as claimed in claim 1, wherein in step 1 the distribution adjustment of the layout labels specifically comprises: adjusting the class sampling probability by inverse-frequency weighting to obtain the balanced category distribution $\tilde{p}(c) = \dfrac{1/f(c)}{\sum_{c'=1}^{C} 1/f(c')}$, where $f(c)$ denotes the extracted category distribution, $N_o$ and $N_s$ are the sizes of the original dataset and the synthetic dataset respectively, and $C$ is the total number of categories; sampling spatial positions from an approximately uniform distribution to obtain the balanced position distribution $p(x, y) = \dfrac{1}{A} = \dfrac{1}{W \cdot H}$, where $A$ denotes the total image area, $W$ and $H$ are the width and height of a single image, and $(x, y)$ are pixel coordinates; and sampling direction angles from a uniform distribution to obtain the balanced direction-angle distribution $p(\theta)$, where $\theta$ denotes the direction angle (see the rebalancing sketch after the claims).
- 3. The image generation method based on a diffusion model as claimed in claim 1, wherein in step 3 the spatial layout embedding is $e_{\mathrm{layout}} = \mathrm{MLP}\big(\mathrm{PE}(b) \oplus \tau(\text{label})\big)$, where $b$ denotes the position parameters, namely the coordinates of the four corner points of the target's rotated bounding box, determined from the target's position coordinates, size, and direction angle; $\mathrm{PE}(\cdot)$ denotes the sine-cosine positional encoding; $\tau(\cdot)$ denotes the text encoder of a contrastive language-image pre-trained (CLIP) model; $\oplus$ denotes concatenation; and $\mathrm{MLP}$ denotes a multi-layer perceptron (a sketch of this embedding follows the claims).
- 4. The diffusion-model-based image generation method of claim 1, wherein in step 3 the semantic text embedding is obtained as $t = \mathrm{LLM}\big(\mathcal{P}(\ell)\big)$, where $\mathrm{LLM}$ denotes the selected large language model, $\ell$ denotes the layout labels of the targets obtained in step 2, and $\mathcal{P}$ denotes the prompt template set for the large language model (an example template follows the claims).
- 5. The method of claim 1, wherein in step 4 the layout cross-attention module computes $F_{\mathrm{lca}} = \mathrm{softmax}\!\Big(\dfrac{QK_l^{\top}}{\sqrt{d}}\Big)V_l + \mathrm{softmax}\!\Big(\dfrac{QK_t^{\top}}{\sqrt{d}}\Big)V_t$, where $F_{\mathrm{lca}}$ denotes the layout cross-attention feature; $Q = FW_Q$ is the query mapped from the output feature $F$ of the residual module; $V_l = e_l W_V^{l}$ and $K_l = e_l W_K^{l}$ are the value and key features mapped from the layout embedding $e_l$; $V_t = e_t W_V^{t}$ and $K_t = e_t W_K^{t}$ are the value and key features mapped from the text embedding $e_t$; $W_Q$ is the query mapping matrix for the residual features; $W_V^{l}$ and $W_K^{l}$ are the value and key mapping matrices for the layout embedding; $W_V^{t}$ and $W_K^{t}$ are the value and key mapping matrices for the text embedding; and $\sqrt{d}$ denotes the set scaling factor (a sketch follows the claims).
- 6. The method of claim 1, wherein in step 4 the direction-and-scale attention module is configured to: obtain, through three prediction networks based on multi-layer convolutional neural networks, the convolution kernel parameters for the height $h$, width $w$, and direction angle $\theta$, where the inputs of each prediction network comprise the output $Z$ of the direction- and angle-sensitive diffusion model and the spatial layout embedding $e$, and the subscript $i \in \{h, w, \theta\}$ identifies the height, width, and direction-angle branches; scale the convolution kernel parameters to a preset value range to obtain scaled parameters $s_i$; construct the sampling grid of the current feature sampling points according to the parameters $s_h$ and $s_w$ corresponding to the height and width, rotate the sampling grid according to the parameter $s_\theta$ corresponding to the direction angle, and perform the convolution operation on the rotated sampling grid to obtain the corresponding output feature map $F_i$; convert the rotated bounding boxes determined by the layout labels of the targets obtained in step 2 into binary spatial masks $M_n$, in which pixels inside a rotated bounding box are set to 1 and all others to 0; and obtain the output of the direction-and-scale attention module as masked scaled dot-product attention over the queries $Q = FW_Q$, keys $K = FW_K$, and values $V = FW_V$, restricted by the masks $M_n$ and scaled by a factor $\lambda$ set to keep the computed result in a normal range, where $n$ indexes the targets, $N$ is the number of targets in a single image, and $W_Q$, $W_K$, and $W_V$ are the query, key, and value mapping matrices respectively (a sketch of the rotated sampling grid follows the claims).
- 7. The method for generating an image based on a diffusion model according to claim 6, wherein the network structure of each prediction network comprises, in order, a convolution layer 1, an activation function 1, a dropout layer 1, a convolution layer 2, an activation function 2, a dropout layer 2, a convolution layer 3, and an activation function 3.
- 8. The diffusion-model-based image generation method of claim 6, wherein the sampling grid of the current feature sampling points is constructed as $(x, y) = (s_w \,\Delta x,\ s_h \,\Delta y)$, where $(x, y)$ denote the coordinates of the sampling grid, $(\Delta x, \Delta y)$ are the base kernel offsets, and $s_h$ and $s_w$ are the scaled parameters corresponding to the height and width respectively.
- 9. The image generation method based on a diffusion model according to claim 1, wherein in step 5 the semantic-layout alignment is defined as $S = \cos\!\big(E_I(\hat{I}),\ E_T(t_\ell)\big)$, where $t_\ell$ is the text description of the layout labels $\ell$ of the targets obtained in step 2, $E_I$ denotes the image encoder of the CLIP model, $E_T$ denotes the text encoder of the CLIP model, and $\cos(\cdot,\cdot)$ denotes the cosine similarity (a screening sketch follows the claims).
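The layout rebalancing and per-image sampling of claims 1-2 can be pictured with a minimal Python sketch. The function names, the target-count range, the angle range $[0, 2\pi)$, and the exact inverse-frequency normalization are illustrative assumptions rather than the patent's disclosed values; drawing from an empirical size pool stands in for "keep the target size distribution unchanged".

```python
# Hedged sketch of layout rebalancing (claims 1-2): inverse-frequency class
# weights, uniform positions and angles, sizes drawn from the original data.
import numpy as np

def rebalance_classes(class_counts):
    """Inverse-frequency weighting: rare classes get higher sampling probability."""
    counts = np.asarray(class_counts, dtype=np.float64)
    inv = 1.0 / np.maximum(counts, 1.0)
    return inv / inv.sum()

def sample_layout(rng, class_probs, empirical_sizes, width, height, max_targets=8):
    """Sample one synthetic image layout: per-target class, position, size, angle."""
    n = rng.integers(1, max_targets + 1)          # random target count (step 2)
    classes = rng.choice(len(class_probs), size=n, p=class_probs)
    xs = rng.uniform(0, width, size=n)            # ~uniform spatial positions
    ys = rng.uniform(0, height, size=n)
    sizes = empirical_sizes[rng.integers(0, len(empirical_sizes), size=n)]
    angles = rng.uniform(0.0, 2 * np.pi, size=n)  # uniform direction angles
    return list(zip(classes, xs, ys, sizes[:, 0], sizes[:, 1], angles))

rng = np.random.default_rng(0)
probs = rebalance_classes([1200, 300, 50])        # 3 classes, heavily imbalanced
layout = sample_layout(rng, probs, np.array([[40, 20], [80, 32]]), 512, 512)
```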
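Claim 3's spatial layout embedding can be sketched as follows, in the spirit of GLIGEN-style grounding tokens: sine-cosine encoding of the four rotated-box corners, concatenated with a text embedding of the category and fused by an MLP. The corner derivation, frequency count, and all dimensions are assumptions, and a random tensor stands in for the CLIP text embedding of the label.

```python
# Hedged sketch of the spatial layout embedding (claim 3).
import math
import torch
import torch.nn as nn

def rotated_box_corners(cx, cy, w, h, theta):
    """Four (x, y) corners of a box centred at (cx, cy), rotated by theta."""
    dx, dy = w / 2, h / 2
    base = torch.tensor([[-dx, -dy], [dx, -dy], [dx, dy], [-dx, dy]])
    rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                        [math.sin(theta),  math.cos(theta)]])
    return base @ rot.T + torch.tensor([cx, cy])          # shape (4, 2)

def sinusoidal_encode(x, num_freqs=8):
    """Sine/cosine positional encoding of each scalar coordinate."""
    freqs = 2.0 ** torch.arange(num_freqs) * math.pi
    ang = x.flatten()[:, None] * freqs[None, :]
    return torch.cat([ang.sin(), ang.cos()], dim=-1).flatten()

corners = rotated_box_corners(100.0, 80.0, 40.0, 20.0, math.pi / 6)
pos_code = sinusoidal_encode(corners)                     # 8 coords * 16 = 128-d
text_emb = torch.randn(512)   # stand-in for the CLIP text embedding of the label
mlp = nn.Sequential(nn.Linear(128 + 512, 256), nn.SiLU(), nn.Linear(256, 256))
layout_token = mlp(torch.cat([pos_code, text_emb]))       # one layout token
```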
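Claim 4's prompt template, in a purely illustrative form: the wording of `PROMPT` and the layout tuple format are assumptions, as the patent does not disclose its template or chosen large language model.

```python
# Hypothetical prompt template turning layout labels into an LLM caption
# request (claim 4); the caption would then be embedded as the text condition.
import math

PROMPT = ("Describe a remote sensing image containing the following objects, "
          "with their positions and orientations: {objects}")

def build_prompt(layout):
    parts = [f"a {name} centred at ({x:.0f}, {y:.0f}), about {w:.0f}x{h:.0f} px, "
             f"oriented {math.degrees(theta):.0f} degrees"
             for name, x, y, w, h, theta in layout]
    return PROMPT.format(objects="; ".join(parts))

print(build_prompt([("plane", 100, 80, 40, 20, math.pi / 6)]))
```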
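Claim 5's layout cross-attention, sketched as one query stream from the residual features attending separately to the layout embeddings and the text embeddings. The additive fusion of the two attention results and all shapes are assumptions, since the source formula is not legible.

```python
# Hedged sketch of layout cross-attention (claim 5).
import torch
import torch.nn.functional as F

def layout_cross_attention(feat, layout_emb, text_emb, W_q, W_kl, W_vl, W_kt, W_vt):
    q = feat @ W_q                                    # queries from residual features
    kl, vl = layout_emb @ W_kl, layout_emb @ W_vl     # keys/values from layout emb.
    kt, vt = text_emb @ W_kt, text_emb @ W_vt         # keys/values from text emb.
    d = q.shape[-1] ** 0.5                            # scaling factor
    attn_l = F.softmax(q @ kl.T / d, dim=-1) @ vl
    attn_t = F.softmax(q @ kt.T / d, dim=-1) @ vt
    return attn_l + attn_t                            # assumed additive fusion

d = 64
feat, layout_emb, text_emb = torch.randn(256, d), torch.randn(8, d), torch.randn(77, d)
mats = [torch.randn(d, d) * d ** -0.5 for _ in range(5)]
out = layout_cross_attention(feat, layout_emb, text_emb, *mats)   # (256, 64)
```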
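Claims 6-8's rotated, scaled sampling grid and claim 7's predictor structure, in a hedged sketch: `make_predictor` mirrors the stated layer order with assumed channel widths, activation (SiLU), and dropout rate, and `rotated_grid` shows kernel offsets scaled by width/height parameters and then rotated by an angle parameter. How the predicted parameters are normalized to their preset ranges, and the masked attention that consumes the resulting feature maps, are not reproduced here.

```python
# Hedged sketch of the prediction network (claim 7) and the rotated sampling
# grid (claims 6 and 8).
import math
import torch
import torch.nn as nn

def make_predictor(in_ch, out_ch):
    """Claim 7's layer order: (conv, activation, dropout) x2, then conv, activation."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.SiLU(), nn.Dropout2d(0.1),
        nn.Conv2d(32, 32, 3, padding=1), nn.SiLU(), nn.Dropout2d(0.1),
        nn.Conv2d(32, out_ch, 3, padding=1), nn.SiLU())

def rotated_grid(s_w, s_h, theta, k=3):
    """k x k kernel offsets, scaled by (s_w, s_h) and rotated by theta."""
    offs = torch.arange(k, dtype=torch.float32) - k // 2
    gy, gx = torch.meshgrid(offs, offs, indexing="ij")
    grid = torch.stack([gx * s_w, gy * s_h], dim=-1)   # scale by width/height params
    rot = torch.tensor([[math.cos(theta), -math.sin(theta)],
                        [math.sin(theta),  math.cos(theta)]])
    return grid @ rot.T                                 # rotate the sampling grid

angle_branch = make_predictor(4, 1)                     # one of the three branches
print(rotated_grid(2.0, 1.0, math.pi / 4))
```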
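Finally, a sketch of the quality screen of claim 9 / step 5: a synthetic image is kept only if both the CLIP image-text cosine similarity and the per-target classification confidence clear their thresholds. The checkpoint name, both threshold values, and the source of `target_confidences` (e.g. a pretrained classifier run on the generated targets) are assumptions; the patent states only that both criteria must be satisfied simultaneously.

```python
# Hedged sketch of semantic-layout alignment plus confidence screening (claim 9).
import torch
import torch.nn.functional as F
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def passes_quality_screen(image, caption, target_confidences,
                          align_thresh=0.25, conf_thresh=0.8):
    inputs = processor(text=[caption], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # Cosine similarity between CLIP image and text embeddings (alignment score).
    align = F.cosine_similarity(out.image_embeds, out.text_embeds).item()
    # Both criteria must hold for the image to enter the enhanced dataset.
    return align >= align_thresh and min(target_confidences) >= conf_thresh
```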
Description
Image generation method based on diffusion model

Technical Field

The invention relates to the fields of computer vision and remote sensing image processing, and in particular to an image generation method based on a diffusion model.

Background

Object detection in remote sensing images has important application value in fields such as urban planning and environmental monitoring, but its performance is largely limited by the availability of large-scale, high-quality annotated data. At present, high data acquisition costs, laborious manual annotation, and the imbalanced distribution of target categories within datasets commonly constrain the training and practical deployment of detection models. To alleviate the scarcity of training data, synthesis-based data augmentation has become an effective approach: its core idea is to expand the dataset and balance the sample distribution by generating realistic images, thereby improving the generalization and robustness of downstream detection models. The technical evolution in this direction has roughly progressed from general-purpose image generation methods to generation methods specialized for the remote sensing domain.

Among general-purpose image generation methods, the technology has evolved from traditional data augmentation, through generative adversarial networks, to diffusion models. Traditional augmentation methods such as flipping and scaling are essentially transformations of existing images; they cannot create new objects or complex layouts, and their diversity gain is limited. Generative adversarial networks (GANs) can generate richer content, but inaccurate target placement and mode collapse are common, making generation quality unstable. Today's mainstream diffusion models achieve higher image fidelity through a step-by-step denoising process and can be controlled by conditional inputs such as text or layouts. However, these generic models share a fundamental shortcoming: they lack precise spatial control. When applied to remote sensing scenes that demand strict geometric consistency, they often generate targets that are misplaced relative to the specified layout, inconsistent in orientation, or distorted in size, and thus cannot meet the strict requirements of remote sensing object detection on position, direction, and scale.

To address the shortcomings of general-purpose methods, the remote sensing field has developed specialized data generation methods that aim to improve the positioning accuracy of generated targets by introducing spatial conditions such as layouts and bounding boxes. Early studies such as LostGAN initially verified the feasibility of layout-conditioned remote sensing image generation. In recent years, specialized frameworks based on diffusion models have become mainstream, with representative works including CC-Diff, MMO-IG, CRS-Diff, and AeroGen. Although these methods have advanced basic position control, their core attention mechanisms and feature sampling schemes still do not fully account for the intrinsic geometric characteristics of remote sensing targets.
They generally ignore the modeling of targets' arbitrary orientations and precise dimensions, so the generated images deviate from real scenes in direction and scale and exhibit poor layout consistency. This lack of control over direction and scale makes it difficult for synthetic data produced by existing specialized methods to effectively improve the performance of rotated object detection models. The patent application with publication number CN116051683A, entitled "a remote sensing image generation method, storage medium and device based on style self-organization", can quickly generate labeled remote sensing images, but it has inherent defects: it depends heavily on predefined, precise target mask contours, which limits flexibility and makes it difficult to generate targets with natural geometric morphology; more importantly, it entirely lacks a precise control mechanism for target direction and size, so the geometric consistency between generated targets and the real scene layout cannot be guaranteed. In addition, it strictly limits the number of denoising iterations in pursuit of generation speed, which constrains the final image quality and the practical value of the synthesized data in downstream tasks. In the patent application with publication number CN116486077A, entitled "remote sensing image semantic segmentation model sample set generation method and device", a pseudo label is constructed through an unsupervised semantic segme