CN-121860871-B - Multi-mode image synthesis method with decoupling of contrast parameters

CN-121860871-B

Abstract

The invention belongs to the technical field of deep learning and discloses a multi-modal image synthesis method with contrast parameter decoupling. The method addresses two problems: the scarcity of current multi-modal images, and the limited perception capability of single-modal images in complex environments. The multi-modal images generated by the method fully exploit the complementary semantic advantages among modalities, effectively expand multi-modal data sets, provide high-quality data support and drive for downstream tasks such as object classification, target detection, and disaster monitoring, and improve the robustness and reliability of models in complex environments.

Inventors

  • ZHAO WENDA
  • ZHANG YU

Assignees

  • Dalian University of Technology (大连理工大学)

Dates

Publication Date
2026-05-08
Application Date
2026-03-18

Claims (4)

  1. A multi-modal image synthesis method with contrast parameter decoupling, characterized in that it is constructed on a diffusion model and mainly comprises a perceptual compression module, a latent denoising UNet model, and a contrast parameter decoupling module, and comprises the following steps: the image encoder of a pre-trained variational autoencoder is used as the perceptual compression module and is responsible for projecting the original high-dimensional image pixels into a low-dimensional latent feature space to obtain a latent feature x; the diffusion forward noising process of the latent feature x is modeled as a Markov chain that gradually injects Gaussian noise according to the time step t to obtain a noisy feature x_t; the noisy feature x_t is then input into the denoising UNet model, into which the contrast parameter decoupling module is integrated, for reverse denoising, where the denoising UNet model learns the denoising distribution by minimizing the difference between predicted noise and actual noise to obtain the denoised feature x_0; during reverse denoising, a text condition T is introduced through a cross-attention mechanism and, together with the time-step embedding, serves as a guiding signal to control the generation of the image's semantic content; finally, the image decoder of the perceptual compression module is responsible for reconstructing the denoised feature x_0 back into image pixel space and outputting the final generated image.
  2. The multi-modal image synthesis method with contrast parameter decoupling as claimed in claim 1, characterized in that the contrast parameter decoupling module is inserted into each attention block of the latent denoising UNet model and establishes a new parameter space based on the parameter weights of each attention block; the optimized structure is built from one semantic matrix A and three independent attribute matrices B_m, both of dimension r × r, where r is the truncation rank of the parameter space, m ∈ {1, 2, 3}, 1 represents the attributes of modality 1, 2 the attributes of modality 2, and 3 the attributes of modality 3.
  3. The multi-modal image synthesis method with contrast parameter decoupling as claimed in claim 2, characterized in that, for the contrast parameter decoupling module inserted into the i-th attention block, W_i denotes the pre-trained parameter matrix of the i-th attention block; first, a singular value decomposition is performed on the pre-trained parameter matrix W_i to obtain three matrices U_i, Σ_i, V_i, where U_i is the singular matrix of the output, Σ_i a diagonal matrix storing the singular value intensities, and V_i the singular matrix of the input, expressed as follows: W_i = U_i Σ_i V_i^T = Σ_{k=1}^{K} u_{i,k} σ_{i,k} v_{i,k}^T, where K = min(h, n), h and n are respectively the height and width of the pre-trained parameter matrix W_i, u_{i,k} is the k-th left singular vector of the i-th attention block, σ_{i,k} the k-th singular value of the i-th attention block, and v_{i,k} the k-th right singular vector of the i-th attention block; then, based on the singular value distribution of each attention block, a core parameter matrix W_i^c is extracted from the pre-trained parameter matrix W_i as follows: W_i^c = U_{i,r} Σ_{i,r} V_{i,r}^T, where r is the rank of the core parameter matrix W_i^c, and U_{i,r}, Σ_{i,r}, V_{i,r} denote the left singular matrix, diagonal matrix, and right singular matrix truncated to their first r parameter dimensions; the core parameter matrix W_i^c is decomposed into two orthogonal bases in different directions: the product of the truncated diagonal matrix Σ_{i,r} and right singular matrix V_{i,r} is used as the orthogonal basis of the input end, and the left singular matrix U_{i,r} as the orthogonal basis of the output end; these two orthogonal bases in different directions serve as the geometric prior of the parameter-update space; then, a learnable semantic matrix A is introduced on the orthogonal basis of the input end, and an attribute matrix B_m on the orthogonal basis of the output end, the semantic matrix A serving as a semantic disentangler that captures the semantic information of the image and the attribute matrix B_m as an attribute adapter that stores the modal attribute information of the multi-modal images; the overall optimization framework of the contrast parameter decoupling module has the form: f_i = U_{i,r} B_m A Σ_{i,r} V_{i,r}^T, where f_i represents the structural paradigm of the i-th attention block; during optimization the diagonal matrix Σ_{i,r}, right singular matrix V_{i,r}, and left singular matrix U_{i,r} are kept frozen, and only the semantic matrix A and the attribute matrices B_m are optimized and updated.
  4. The multi-modal image synthesis method with contrast parameter decoupling as claimed in claim 3, characterized by the training of the contrast parameter decoupling module: the contrast parameter decoupling module is trained using a semantic loss L_s and an attribute loss L_a. For the semantic loss L_s, a self-supervised contrastive learning method is adopted: any latent feature x_i is defined as the input feature of the i-th attention block, and the semantic matrix A processes the feature as follows: s_i = A Σ_{i,r} V_{i,r}^T x_i, where s_i represents the semantic feature output by the semantic matrix A. With this processing, the respective semantic features of the three modalities are obtained; semantic features obtained from images of the same category but different modalities are designated as positive sample feature pairs, and semantic features obtained from images of different categories and different modalities are regarded as negative sample feature pairs. Thus, for any pairwise modalities u and v, the contrastive constraint loss between them is defined as follows: L_{u,v}^i = -log [ exp(sim(s_a^u, s_a^v; T)/τ) / Σ_{b=1}^{B} exp(sim(s_a^u, s_b^v; T)/τ) ], where exp(·) denotes the exponential function, sim(·, ·; T) denotes the cosine similarity between any two semantic features under the guidance of the text condition T, s_a^u and s_a^v represent the semantic features obtained by processing images of category a in modality u and modality v through the semantic matrix A, and s_b^v represents the semantic feature obtained by processing an image of category b in modality v through the semantic matrix A; s_a^u and s_a^v form a positive sample feature pair, while s_a^u and s_b^v form a negative sample feature pair, with a ≠ b; τ denotes the temperature hyper-parameter and B the total sample batch. Since the contrast parameter decoupling module is injected into all attention blocks of the latent denoising UNet model, denoted i ∈ {1, 2, ..., L}, where L represents the number of all attention blocks of the latent denoising UNet model, the contrastive constraint loss terms of all layers of the latent denoising UNet model are accumulated, and the final semantic loss L_s is expressed as: L_s = Σ_{i=1}^{L} (L_{1,2}^i + L_{1,3}^i + L_{2,3}^i), where Σ denotes the summation, and L_{1,2}^i, L_{1,3}^i, and L_{2,3}^i represent the contrastive constraint losses optimizing the i-th attention block between modality 1 and modality 2, modality 1 and modality 3, and modality 2 and modality 3, respectively, under the guidance of the text condition T. For the attribute loss L_a: through contrastive optimization across the modalities under the guidance of the text condition T, the semantic matrix A has learned to map out the invariant semantics; at this stage the semantic matrix A is frozen to keep the learned semantics unchanged, and the attribute matrices B_m are then optimized to accommodate the different modal attributes, constraining the attribute matrices B_m with the standard diffusion noise-prediction loss to achieve attribute reconstruction; the attribute loss is computed as follows: L_a = E_{x, ε, t} [ ‖ε − ε_θ(x_t, t, T)‖_2^2 ], where E denotes the mathematical expectation, ε is the Gaussian noise added in the diffusion forward noising process and obeys the distribution N(0, I), ε_θ(x_t, t, T) represents the denoising noise predicted by the latent denoising UNet model under the guidance of the text condition T and the noisy feature x_t at time step t, and ‖·‖_2^2 denotes the mean-square-error calculation. The overall optimization loss of the proposed contrast parameter decoupling module is determined by the semantic loss L_s and the attribute loss L_a; the total loss is expressed as: L = L_s + L_a.
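The forward noising step of claim 1 has a well-known closed form. As a minimal sketch (NumPy stands in for the actual latent tensors; the linear beta schedule is an assumption, since the claims do not specify one):

```python
import numpy as np

def make_alpha_bar(T=1000, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule; alpha_bar[t] is the cumulative product of (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def forward_noise(x0, t, alpha_bar, rng):
    """Closed form of the Markov forward-noising chain in claim 1:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, eps ~ N(0, I).
    Returns the noisy latent x_t and the injected noise eps (the UNet's
    prediction target in the noise-prediction loss of claim 4)."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps
```

Because the chain collapses to this closed form, training can sample any time step t directly instead of iterating t noising steps.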
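The parameter-space construction of claims 2 and 3 can be sketched as follows. This is a reading of the claims, not the patented implementation: the r × r shapes of A and B_m are assumptions (the claims leave them implicit), and identity initialization is an illustrative choice.

```python
import numpy as np

def build_decoupled_block(W, r):
    """Sketch of claims 2-3: SVD of a pre-trained attention weight W,
    truncation to the top-r singular directions, frozen orthogonal bases,
    and learnable semantic matrix A plus per-modality attribute matrices B_m."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)   # W = U @ diag(s) @ Vt
    Ur = U[:, :r]                                      # output-side basis (frozen)
    SVt_r = np.diag(s[:r]) @ Vt[:r, :]                 # input-side basis (frozen)
    A = np.eye(r)                                      # learnable semantic matrix
    B = {m: np.eye(r) for m in (1, 2, 3)}              # learnable, one per modality
    return Ur, SVt_r, A, B

def decoupled_forward(x, Ur, SVt_r, A, B, m):
    """Structural paradigm of one attention block: U_r @ B_m @ A @ (S_r V_r^T) @ x.
    Only A and B[m] are trained; Ur and SVt_r stay frozen."""
    return Ur @ (B[m] @ (A @ (SVt_r @ x)))
```

With A and B_m initialized to identity, the block reproduces the rank-r approximation of the frozen weight, so training starts from the pre-trained behavior and only then specializes A (shared semantics) and B_m (per-modality attributes).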
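The pairwise semantic loss of claim 4 is an InfoNCE-style contrastive objective. A minimal sketch for one modality pair (u, v), with plain cosine similarity standing in for sim(·, ·; T) (the text-condition guidance is omitted here as a simplifying assumption):

```python
import numpy as np

def pairwise_contrastive_loss(su, sv, tau=0.1):
    """Contrastive constraint between modalities u and v: row a of `su` and
    row a of `sv` are semantic features of the same category in the two
    modalities (positive pair); rows of other categories in `sv` act as
    negatives. tau is the temperature hyper-parameter."""
    su = su / np.linalg.norm(su, axis=1, keepdims=True)
    sv = sv / np.linalg.norm(sv, axis=1, keepdims=True)
    logits = su @ sv.T / tau                           # [B, B] cosine similarities
    log_den = np.log(np.exp(logits).sum(axis=1))       # denominator over the batch
    return float(np.mean(log_den - np.diag(logits)))   # -log softmax of positives
```

The full semantic loss would sum this term over the three modality pairs (1,2), (1,3), (2,3) and over all L attention blocks, as in claim 4.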

Description

Multi-mode image synthesis method with decoupling of contrast parameters

Technical Field

The invention belongs to the technical field of deep learning and relates to a multi-modal image synthesis method with contrast parameter decoupling.

Background

At present, the technology related to the invention comprises two aspects: first, text-condition-guided image generation technology, and second, disentangled representation learning technology. In recent years, the field of image generation has grown explosively, and its applications have penetrated deeply into areas such as image synthesis, sample expansion, attribute migration, and disentangled learning. With deep representation learning, the generation paradigm shows a diversified development trend; the current mainstream architectures include flow-based models, generative adversarial networks (GANs), autoregressive models, and diffusion probabilistic models. These four families each have advantages in probability density modeling and sampling strategy; while different, they all aim to find an optimal trade-off between the high fidelity and the diversity of generated samples, and together they form the cornerstone of contemporary computer-vision image generation. Among the many generation paradigms, the diffusion model stands out by virtue of its excellent generation quality and training stability. It is essentially a Markov-chain-based probabilistic process in which the forward process destroys data into pure noise by step-wise injection of Gaussian noise, while the reverse process trains a network to learn a reverse denoising distribution, thereby reconstructing structured data from random noise.
Compared with the mode-collapse and training-oscillation problems common to GANs, the diffusion architectures represented by Denoising Diffusion Probabilistic Models (DDPM) and score-based generative models exhibit superior distribution coverage and sample fidelity. To break through the computational-cost bottleneck of pixel-level diffusion models in high-dimensional space, latent diffusion models (LDMs), represented by Stable Diffusion, have been developed. This architecture innovatively introduces a perceptual compression mechanism that uses a variational autoencoder (VAE) to map an image from the high-dimensional pixel space to a low-dimensional latent space. The core denoising process is then performed in this efficient latent space by a U-Net network, which flexibly blends in conditional control signals such as text through a cross-attention mechanism. Finally, the denoised latent features are mapped back to pixel space by the VAE image decoder. Because denoising is performed in the latent space, this design greatly reduces computational complexity while markedly improving inference speed and generation stability. Text-condition-guided image generation aims to generate, from a natural-language description, a high-quality image that is semantically consistent with it, and has been a research hotspot at the intersection of computer vision and natural language processing in recent years. Early generation methods were based primarily on generative adversarial networks (GANs); representative works include StackGAN, AttnGAN, and DF-GAN, which fuse text features with image features through multi-stage refinement or attention mechanisms.
However, GAN-based methods often face unstable training and mode collapse, and find it difficult to generate diverse high-resolution images. In recent years, generation methods based on the denoising diffusion probabilistic model (DDPM) have developed rapidly. Text-to-image technologies represented by OpenAI's DALL-E 2, Google's Imagen, and Stability AI's Stable Diffusion (SD) model simulate a Markov-chain process in which an image is gradually restored from Gaussian noise to a clear image and, combined with a strong pre-trained text encoder such as CLIP (Contrastive Language-Image Pre-training), achieve fine control over image content and style. The invention adopts the SD diffusion model and, by exploiting the rich prior knowledge obtained from massive pre-training, significantly reduces the training difficulty of multi-modal image generation. Meanwhile, by means of the efficient latent space and the cross-attention mechanism, the method uses the text prompt as a unified control signal and ensures that the generated multi-modal images are strictly consistent in semantics. Disentangled representation learning eliminates entanglement and dependence between features by separ