CN-121810509-B - Modal attribute-structure decoupling infrared and visible image fusion method
Abstract
The invention belongs to the technical field of image information processing and relates to a modal attribute-structure decoupled infrared and visible image fusion method. The method is built on a pre-trained latent-space diffusion model, and the overall network framework comprises the encoder of a variational autoencoder, a U-Net denoising network, and the decoder of the variational autoencoder. Structure LoRA parameters and attribute LoRA parameters are introduced into the pre-trained latent-space diffusion model, and modal structure information and modal attribute information are each modeled as the differential change in the model's predicted noise before and after the LoRA parameters are introduced. By applying a contrastive constraint to these differential changes, structure information and attribute information are effectively decoupled in the diffusion noise space, preventing important complementary features of the different modalities from being lost during fusion. In the inference stage, content information from the different modalities undergoes effective cross-modal feature fusion, and a high-quality fused image is finally generated.
Inventors
- Zhao Wenda
- Cui Hengshuai
Assignees
- Dalian University of Technology (大连理工大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-03-11
Claims (3)
- 1. A modal attribute-structure decoupled infrared and visible image fusion method, characterized in that the method is built on a pre-trained latent-space diffusion model, and the overall network framework comprises the encoder $\mathcal{E}$ of a variational autoencoder, a U-Net denoising network, and the decoder $\mathcal{D}$ of the variational autoencoder. The encoder $\mathcal{E}$ of the variational autoencoder is implemented as follows: the infrared image $I_{ir}$ and the visible image $I_{vi}$ are separately input into the encoder $\mathcal{E}$ to obtain the corresponding latent representations: $z_m = \mathcal{E}(I_m)$, where $z_m$ denotes the low-dimensional latent-space representation of the features of modality $m$, and $m \in \{ir, vi\}$ denotes the infrared or visible modality. Then, in the forward diffusion process, $z_m$ is progressively noised over $t$ time steps to obtain the corresponding noisy latent representation $z_m^t$, computed as: $z_m^t = \sqrt{\bar{\alpha}_t}\, z_m + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$, where $\bar{\alpha}_t$ denotes the cumulative noise-scheduling coefficient controlling the proportion between the latent representation and the Gaussian noise at the $t$-th time step, $\alpha_t$ denotes the noise-scheduling coefficient corresponding to the $t$-th time step, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ denotes Gaussian noise drawn from a standard normal distribution. The U-Net denoising network is implemented as follows: the noisy latent representation $z_m^t$ is input into the U-Net denoising network, which outputs the noise prediction $\epsilon_{\theta}(z_m^t, t)$, where $\theta$ denotes the base parameters of the U-Net denoising network. Two types of LoRA parameters are introduced into the U-Net denoising network: a learnable structure LoRA parameter $\Delta\theta_s^m$ is introduced on the encoder side to model the structure information of modality $m$, and an attribute LoRA parameter $\Delta\theta_a^m$ is introduced on the decoder side to model the attribute information of modality $m$. When computing the structural noise increment, the decoder-side attribute LoRA parameters $\Delta\theta_a^m$ are frozen and only the encoder-side structure LoRA parameters $\Delta\theta_s^m$ are activated; noise prediction is performed on the noisy latent representation $z_m^t$ to obtain the corresponding prediction $\epsilon_{\theta+\Delta\theta_s^m}(z_m^t, t)$, which is differenced with the base noise prediction to obtain the structural noise increment of modality $m$: $\Delta\epsilon_m^s = \epsilon_{\theta+\Delta\theta_s^m}(z_m^t, t) - \epsilon_{\theta}(z_m^t, t)$, where $\Delta\epsilon_m^s$ denotes the structure information learned by the structure LoRA parameters under frozen base-model parameters, describing the change in the structure information of modality $m$ relative to the pre-trained latent-space diffusion model. When computing the attribute noise increment, the encoder-side structure LoRA parameters $\Delta\theta_s^m$ are frozen and only the decoder-side attribute LoRA parameters $\Delta\theta_a^m$ are activated; the attribute noise prediction $\epsilon_{\theta+\Delta\theta_a^m}(z_m^t, t)$ is obtained in the same way, and the attribute noise increment of modality $m$ is further computed: $\Delta\epsilon_m^a = \epsilon_{\theta+\Delta\theta_a^m}(z_m^t, t) - \epsilon_{\theta}(z_m^t, t)$, where $\Delta\epsilon_m^a$ denotes the attribute information learned by the attribute LoRA parameters under frozen pre-trained latent-space diffusion-model parameters, describing the change in the attribute information of modality $m$ relative to the pre-trained latent-space diffusion model. Then, cross-modal consistency alignment of structure information is achieved by minimizing the cosine distance between the structural noise increments of the infrared and visible modalities; the cross-modal alignment loss $\mathcal{L}_{align}$ is expressed as: $\mathcal{L}_{align} = 1 - \cos(\Delta\epsilon_{ir}^s, \Delta\epsilon_{vi}^s)$, where $\cos(\cdot,\cdot)$ denotes cosine similarity. Meanwhile, decoupling of structure information and attribute information within each of the infrared and visible modalities is achieved by minimizing the cosine similarity between the structural noise increment and the attribute noise increment of the same modality; the intra-modal decoupling loss $\mathcal{L}_{dec}$ is expressed as: $\mathcal{L}_{dec} = \sum_{m \in \{ir, vi\}} \cos(\Delta\epsilon_m^s, \Delta\epsilon_m^a)$. The contrastive constraint loss on the noise increments, $\mathcal{L}_{con}$, is expressed as: $\mathcal{L}_{con} = \mathcal{L}_{align} + \mathcal{L}_{dec}$. To ensure that the proposed network has stable image reconstruction capability, an image reconstruction loss $\mathcal{L}_{rec}$ is introduced to constrain the U-Net denoising network. In the reverse diffusion process, the U-Net denoising network progressively predicts the noise component from the denoised latent representation at each time step and iteratively denoises and updates the latent representation in combination with the preset noise-scheduling coefficients, so as to obtain the final latent representation $\hat{z}_m$.
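The forward-diffusion noising step and the structure/attribute noise increments of claim 1 can be sketched as follows. This is a minimal NumPy illustration, not the patented network: `unet_base`, `unet_struct_lora`, and `unet_attr_lora` are hypothetical stand-ins for the U-Net with different LoRA parameters active, and all shapes and coefficients are chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_latent(z, alpha_bar_t, eps):
    """Forward diffusion: z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps."""
    return np.sqrt(alpha_bar_t) * z + np.sqrt(1.0 - alpha_bar_t) * eps

def cosine(u, v):
    """Cosine similarity between two flattened tensors."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Hypothetical stand-in noise predictors (simple maps, not the real U-Net).
def unet_base(z_t):
    return 0.9 * z_t

def unet_struct_lora(z_t):   # base + encoder-side structure LoRA active
    return 0.9 * z_t + 0.05 * np.roll(z_t, 1)

def unet_attr_lora(z_t):     # base + decoder-side attribute LoRA active
    return 0.9 * z_t + 0.05 * np.tanh(z_t)

z_ir = rng.standard_normal((4, 8, 8))     # stand-in latent of an infrared image
eps = rng.standard_normal(z_ir.shape)
z_ir_t = noise_latent(z_ir, alpha_bar_t=0.5, eps=eps)

# Noise increments relative to the base model, as in claim 1.
d_struct = unet_struct_lora(z_ir_t) - unet_base(z_ir_t)
d_attr = unet_attr_lora(z_ir_t) - unet_base(z_ir_t)

# Intra-modal decoupling term: cosine similarity to be minimized in training.
decoupling_term = cosine(d_struct, d_attr)
```

In the actual method the increments are produced by the same U-Net evaluated with one LoRA branch activated and the other frozen; the subtraction against the base prediction is the key operation this sketch reproduces.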
- 2. The modal attribute-structure decoupled infrared and visible image fusion method according to claim 1, characterized in that the final latent representation $\hat{z}_m$ is reconstructed by the decoder $\mathcal{D}$ of the variational autoencoder to obtain the corresponding reconstructed image $\hat{I}_m = \mathcal{D}(\hat{z}_m)$, and the image reconstruction loss is computed against the input image $I_m$, defined as: $\mathcal{L}_{rec} = \| I_m - \hat{I}_m \|_1$, where the input image $I_m$ includes the infrared image $I_{ir}$ and the visible image $I_{vi}$, $m \in \{ir, vi\}$, and $\|\cdot\|_1$ denotes the absolute pixel-wise error between the input image $I_m$ and the reconstructed image $\hat{I}_m$.
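The per-pixel L1 reconstruction loss of claim 2 can be sketched as below; the images here are random NumPy arrays standing in for the VAE input and its reconstruction, chosen for illustration only.

```python
import numpy as np

def l1_reconstruction_loss(img, recon):
    """Mean absolute pixel-wise error between input and reconstructed image."""
    return float(np.mean(np.abs(img - recon)))

rng = np.random.default_rng(1)
img = rng.random((3, 32, 32))                        # stand-in input image
recon = img + 0.1 * rng.standard_normal(img.shape)   # stand-in VAE reconstruction
loss = l1_reconstruction_loss(img, recon)
```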
- 3. The modal attribute-structure decoupled infrared and visible image fusion method according to claim 2, characterized in that the total loss $\mathcal{L}_{total}$ of the proposed network is jointly optimized using the contrastive constraint loss $\mathcal{L}_{con}$ and the image reconstruction loss $\mathcal{L}_{rec}$, as follows: $\mathcal{L}_{total} = \mathcal{L}_{con} + \mathcal{L}_{rec}$. In the inference stage, the infrared image and the visible image are separately input into the encoder $\mathcal{E}$ of the variational autoencoder to obtain the low-dimensional latent-space feature representations $z_{ir}$ and $z_{vi}$ of the two modalities. Then, in the reverse diffusion process, the U-Net denoising network performs step-by-step noise prediction; during feature extraction at each layer of the U-Net denoising network, the infrared structure features $F_{ir}$ and the visible structure features $F_{vi}$ are obtained respectively, and the fused structure features at the corresponding layer are obtained by element-wise addition: $F_f = F_{ir} \oplus F_{vi}$, where $F_f$ denotes the fused structure features and $\oplus$ denotes element-wise addition. By injecting the learned visible-modality attribute LoRA parameters $\Delta\theta_a^{vi}$ to modulate the decoding process of the U-Net denoising network, the fused structure features are endowed with the attribute information of the visible modality, yielding the noise prediction $\epsilon_f$ modulated by the visible-modality attributes. Step by step in the reverse diffusion process, the modulated noise prediction $\epsilon_f$ progressively updates the latent representation to obtain the fused latent representation $z_f$. Finally, the fused latent representation $z_f$ is input into the decoder $\mathcal{D}$ of the variational autoencoder for decoding to generate the final fused image $I_f$, expressed as: $I_f = \mathcal{D}(z_f)$.
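The inference-stage fusion of claim 3 (element-wise addition of per-layer structure features, joint training objective) can be sketched as follows; the per-layer feature tensors are hypothetical stand-ins for U-Net activations, and the shapes are illustrative only.

```python
import numpy as np

def fuse_structure_features(f_ir, f_vi):
    """Claim 3: fused structure features via element-wise addition."""
    return f_ir + f_vi

def total_loss(l_align, l_dec, l_rec):
    """Joint objective: contrastive constraint (alignment + decoupling) plus reconstruction."""
    return (l_align + l_dec) + l_rec

rng = np.random.default_rng(2)
# Hypothetical per-layer U-Net structure features for each modality (3 layers).
f_ir = [rng.standard_normal((8, 16, 16)) for _ in range(3)]
f_vi = [rng.standard_normal((8, 16, 16)) for _ in range(3)]
fused = [fuse_structure_features(a, b) for a, b in zip(f_ir, f_vi)]
```

Element-wise addition keeps the fused features in the same tensor shape as each input branch, so the decoder-side attribute LoRA can modulate them without any architectural change.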
Description
Modal attribute-structure decoupling infrared and visible image fusion method
Technical Field
The invention belongs to the technical field of image information processing and relates to a modal attribute-structure decoupled infrared and visible image fusion method.
Background
At present, the technology related to the invention comprises two aspects: image fusion methods based on modal feature decomposition, and image fusion methods driven by diffusion models. Fusion methods based on modal feature decomposition separate the modal attribute information and modal structure information in images of different modalities by constructing shared feature branches and modality-specific feature branches with different structures, thereby realizing cross-modal feature fusion in a semantically consistent space. However, the modal decomposition paradigm of such methods relies on complex modal decomposition network structures, and the decomposition result is limited by hand-crafted prior assumptions, making it difficult to decouple modal features flexibly. In recent years, the denoising diffusion probabilistic model has emerged as a stable and controllable generative model: it gradually adds noise in the forward process to construct a Markov chain, and then approximates the reverse of the Markov diffusion process through noise prediction in the reverse process, thereby progressively generating the target image. Based on this principle, studies have introduced diffusion models into the task of infrared and visible image fusion. Existing diffusion-model-driven image fusion methods take images from different modalities as conditional inputs and guide the diffusion model to generate fusion results during denoising. However, these methods fuse the feature information of the different modalities directly and lack an effective decomposition of modal attribute information and modal structure information.
Because of domain differences among the features of different modalities, important complementary information is easily lost during the model's denoising process, degrading the quality of the fusion result. In summary, existing methods either rely on hand-designed decomposition networks or fuse cross-modal features directly, and thus cannot effectively fuse the complementary features of the two modalities, limiting the fusion effect. To address this problem, a potential solution is to realize adaptive decoupling and alignment of modal attribute and structure in the diffusion noise space through a noise-difference and contrastive-learning mechanism, without constructing a complex modal decomposition network, thereby effectively alleviating the domain differences of different modal features during denoising and fully fusing the complementary information among modalities. Therefore, the invention provides a modal attribute-structure decoupled infrared and visible image fusion method. First, learnable low-rank adaptation (LoRA) parameters, namely the attribute LoRA and the structure LoRA, are introduced on top of the base model (a pre-trained diffusion model). Then, during denoising, the model outputs the predicted noise after the attribute LoRA or the structure LoRA is introduced, and this prediction is differenced with the predicted noise of the base model, so that the attribute information and the structure information are represented as noise increments relative to the base model.
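One plausible instantiation of the contrastive constraint over these noise increments (pulling structural increments of the two modalities together while pushing structural and attribute increments of the same modality apart) can be sketched as follows; the increment tensors here are random stand-ins, and the exact loss form is an assumption consistent with the cosine-based losses in the claims.

```python
import numpy as np

def cos_sim(u, v):
    """Cosine similarity between two flattened tensors."""
    u, v = u.ravel(), v.ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def contrastive_constraint(d_ir_struct, d_vi_struct, d_ir_attr, d_vi_attr):
    """Align structural increments across modalities (minimize cosine distance);
    separate structure from attribute increments within each modality
    (minimize cosine similarity). Both terms are minimized jointly."""
    l_align = 1.0 - cos_sim(d_ir_struct, d_vi_struct)
    l_dec = cos_sim(d_ir_struct, d_ir_attr) + cos_sim(d_vi_struct, d_vi_attr)
    return l_align + l_dec

rng = np.random.default_rng(3)
shape = (4, 8, 8)
d_ir_s, d_vi_s = rng.standard_normal(shape), rng.standard_normal(shape)
d_ir_a, d_vi_a = rng.standard_normal(shape), rng.standard_normal(shape)
loss = contrastive_constraint(d_ir_s, d_vi_s, d_ir_a, d_vi_a)
```

When the two structural increments are identical and each is orthogonal to its attribute increment, this loss is zero, which is exactly the decoupled-and-aligned configuration the training objective drives toward.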
On this basis, contrastive learning constraints are constructed among the different noise increments, so that the structural noise increments of the infrared and visible modalities are drawn close to each other in the latent space while the attribute noise increment and the structural noise increment of the same modality are pushed apart, thereby realizing cross-modal structure alignment and effective attribute-structure decoupling in the diffusion noise space.
Disclosure of Invention
Aiming at the problem that the infrared and visible image fusion task struggles to effectively decompose modal attribute and structure information, a modal attribute-structure decoupled infrared and visible image fusion method is provided. The core idea of the method is to introduce an attribute LoRA and a structure LoRA into the base model and to explicitly represent the learned content information and style information as noise increments relative to the base model. Combined with a contrastive learning mechanism, alignment of infrared and visible modal content information and effective decoupling of content-style information are achieved in the noise space. Specifically, in the training stage, the denoising network of the base model is divided into an encoder part and a decoder part: the encoder introduces a structure LoRA to learn modal structure information, and the decoder introduces