CN-120070202-B - Multi-mode image fusion method, system, equipment and readable storage medium based on inverse diffusion model
Abstract
The invention relates to a multi-modal image fusion method based on an inverse diffusion model, which comprises the following steps: inverting a visible light image into the noise latent space using diffusion inversion; guiding the inversion of an infrared image using the features of the inverted visible light image; injecting the appearance attributes of the visible light into the infrared features through guidance in the reverse process of the diffusion model, thereby generating an infrared image with a visible-light style; and designing a specific fusion rule that fuses the inverted visible and infrared features in the attention layers during denoising, retaining the text-interaction capability of the model and supporting language-driven fusion control. The invention can directly generate high-quality fused images without additional training or fine-tuning. The resulting fused image is highly compatible with the base model, so the domain-gap problem between data domains is effectively alleviated and the performance of downstream machine perception tasks is markedly improved. The invention significantly reduces training cost and provides an efficient and innovative solution for cross-domain tasks.
Inventors
- JIANG JUNJUN
- LIANG PENGWEI
- WANG CHENYANG
- MA QING
- LIU XIANMING
- MA JIAYI
Assignees
- Harbin Institute of Technology (哈尔滨工业大学)
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2025-01-27
Claims (5)
- 1. A multi-modal image fusion method based on an inverse diffusion model, characterized by comprising the following steps:
  Step one: follow the framework of a diffusion model, wherein the diffusion model comprises a forward process of T steps and a reverse process of T steps, and the forward process gradually converts a clean image $x_0$ into white Gaussian noise; this can be expressed as:
  $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$ (1)
  wherein $x_t$ represents the image at time step $t$, $\bar{\alpha}_t$ refers to the hyper-parameter associated with time step $t$, and $\epsilon \sim \mathcal{N}(0, I)$ is a standard normal distribution. The reverse process of the diffusion model is expressed as:
  $x_{t-1} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^{t} + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t, t)$ (2)
  In equation (2), $\hat{x}_0^{t}$ is the prediction of $x_0$ at step $t$, $\epsilon_\theta(x_t, t)$ represents the predicted noise direction at $x_t$, and $\epsilon_\theta$ is the network model of the diffusion model. The visible features at step $t$, namely $\hat{x}_0^{vis,t}$, $\epsilon_t^{vis}$ and $f_t^{vis}$, are generated as:
  $\epsilon_t^{vis} = \epsilon_\theta(x_t^{vis}, t)$, $\hat{x}_0^{vis,t} = \big(x_t^{vis} - \sqrt{1-\bar{\alpha}_t}\, \epsilon_t^{vis}\big) / \sqrt{\bar{\alpha}_t}$ (3)
  In equation (3), the network model $\epsilon_\theta$ of the diffusion model predicts the noise $\epsilon_t^{vis}$ given the input $x_t^{vis}$; $\hat{x}_0^{vis,t}$ and the intermediate features $f_t^{vis}$ are generated while predicting the noise $\epsilon_t^{vis}$.
  Step two: merge the visible light features into the infrared image. According to equations (2) and (3), the visible-style infrared update step is defined as:
  $x_{t-1}^{ir} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^{ir,t} + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_t^{ir} + \lambda\, \Delta_t$ (4)
  In equation (4), $\lambda$ represents the weight factor, and $\Delta_t$ is the visual cue, which is crucial for guiding the noisy infrared image toward the visible direction in each time step. According to equations (2) and (3), $\Delta_t$ in equation (4) can also be expressed as:
  $x_{t-1}^{vis} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^{vis,t} + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_t^{vis}$ (5)
  $\Delta_t = x_{t-1}^{vis} - \big(\sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^{ir,t} + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_t^{ir}\big)$ (6)
  so that equation (4) can be rewritten as:
  $x_{t-1}^{ir} = (1-\lambda)\big(\sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^{ir,t} + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_t^{ir}\big) + \lambda\, x_{t-1}^{vis}$ (7)
  Step three: in order to maintain the visible appearance, the visible features are used as the basic component of the fused image, and the iterative generation process of the fused image $x^{f}$ can be expressed as:
  $x_{t-1}^{f} = \sqrt{\bar{\alpha}_{t-1}}\, \hat{x}_0^{f,t} + \sqrt{1-\bar{\alpha}_{t-1}}\, \epsilon_\theta(x_t^{f}, t)$ (8)
  A customized fusion rule is introduced to inject infrared information in a visible-like manner. The fusion rule, applied to the self-attention layers in the denoising network $\epsilon_\theta$, can be defined as:
  $K = \beta_1 K^{vis} + \beta_2 K^{ir}$, $V = \beta_1 V^{vis} + \beta_2 V^{ir}$ (9)
  wherein $\beta_1$ and $\beta_2$ represent two hyper-parameters that must satisfy the constraint $\beta_1 + \beta_2 = 1$. To ensure that the fused image primarily retains visible content, the query vector $Q$ is kept invariant during the iterative process; the keys $K$ and values $V$ in the self-attention layers preserve the visual characteristics of the image, and the infrared information is injected in the early steps of denoising.
- 2. The multi-modal image fusion method based on an inverse diffusion model of claim 1, wherein the overall quality of the resulting fused image is improved using classifier-free guidance, which requires two forward propagations through the denoising network:
  $\tilde{\epsilon} = \epsilon_\theta(x_t, t, \varnothing) + s\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)$ (10)
  wherein $c$ represents the text embedding from a pre-trained text encoder, $\epsilon_\theta(x_t, t, c)$ represents the noise predicted under the guidance of equation (9), $\epsilon_\theta(x_t, t, \varnothing)$ represents the unconditionally generated result, and $s$ is the strength of the guidance vector.
- 3. A multi-modal image fusion system based on an inverse diffusion model, comprising a computer module that applies the multi-modal image fusion method based on an inverse diffusion model according to any one of claims 1 to 2.
- 4. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 2 when the computer program is executed.
- 5. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any one of claims 1 to 2.
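The guided update in claim 1 can be sketched in NumPy under standard DDIM notation. This is a minimal illustration, not the patent's implementation: the function names, the weight symbol `lam`, and the toy inputs are assumptions, and the noise predictions would in practice come from a pretrained denoising network.

```python
import numpy as np

def ddim_step(x_t, eps_pred, abar_t, abar_prev):
    """One reverse (DDIM) step in the style of the claim's equations (2)-(3):
    recover the clean-image estimate, then step toward the previous timestep."""
    x0_hat = (x_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    x_prev = np.sqrt(abar_prev) * x0_hat + np.sqrt(1 - abar_prev) * eps_pred
    return x_prev, x0_hat

def guided_ir_step(x_t_ir, eps_ir, x_prev_vis, abar_t, abar_prev, lam=0.3):
    """Visible-style infrared update in the style of equations (4)-(7):
    the plain infrared step is pulled toward the visible trajectory by a
    weight factor lam (an assumed name for the claim's weight factor)."""
    x_prev_ir, _ = ddim_step(x_t_ir, eps_ir, abar_t, abar_prev)
    # Equivalent rewrite of (4) with the visual cue substituted in:
    # a convex blend of the infrared step and the visible step.
    return (1 - lam) * x_prev_ir + lam * x_prev_vis
```

With `lam = 0` the infrared image evolves untouched; with `lam = 1` it is pulled entirely onto the visible trajectory, matching the rewritten form in equation (7).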
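The attention-layer fusion rule of equation (9) and the classifier-free guidance of equation (10) can likewise be sketched as follows. This is an illustrative sketch only: the function names and dimensions are assumptions, and in the patented method the queries, keys, and values come from the self-attention layers of a denoising U-Net rather than raw arrays.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def fused_self_attention(q_vis, k_vis, v_vis, k_ir, v_ir, beta1=0.7, beta2=0.3):
    """Fusion rule in the style of equation (9): the visible query Q is kept
    fixed, while keys and values of the two modalities are mixed with
    hyper-parameters satisfying beta1 + beta2 = 1."""
    assert abs(beta1 + beta2 - 1.0) < 1e-9  # constraint stated in claim 1
    k = beta1 * k_vis + beta2 * k_ir
    v = beta1 * v_vis + beta2 * v_ir
    d = q_vis.shape[-1]
    attn = softmax(q_vis @ k.T / np.sqrt(d))
    return attn @ v

def cfg_noise(eps_cond, eps_uncond, s=7.5):
    """Classifier-free guidance in the style of equation (10): steer the
    unconditional prediction toward the text-conditioned one with strength s."""
    return eps_uncond + s * (eps_cond - eps_uncond)
```

Setting `beta2 = 0` reduces the rule to plain visible self-attention, and `s = 1` in `cfg_noise` returns the conditional prediction unchanged, which is how the two controls degrade gracefully.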
Description
Multi-mode image fusion method, system, equipment and readable storage medium based on inverse diffusion model
Technical Field
The invention belongs to the technical field of image processing, and particularly relates to a multi-modal image fusion method, system, equipment and readable storage medium based on an inverse diffusion model.
Background
Image fusion techniques are widely adopted for their ability to integrate the complementary information of multi-source images into a single fused image. Owing to the physical limitations inherent to the various sensors, each modality is limited in its ability to capture all the information in a scene. For example, visible light sensors, commonly used in everyday applications, can effectively capture scene details but are very sensitive to lighting conditions: their performance degrades significantly under poor lighting, such as at night or under overexposure. In contrast, infrared sensors are robust to different lighting and weather conditions and capture useful scene information, whether by day or by night. However, infrared images lack the detailed structural information provided by visible light sensors. By combining the complementary information of the two modalities, infrared and visible image fusion can generate a fused image that retains as much of the effective scene information as possible. With the development of generative learning techniques, many fusion methods have been developed for the infrared and visible image fusion task. Generally, these methods can be broadly divided into three categories: autoencoders (AEs), generative adversarial networks (GANs), and diffusion models (DMs). AE-based approaches typically employ complex network architectures to improve feature extraction.
The GAN-based approaches employ an adversarial framework in which the generator aims to produce a fused image that can fool the discriminator, which in turn strives to distinguish the generated image from real images. Recently, DM-based methods have attracted attention for their ability to produce high-quality results, and they are generally more stable to train than GANs. Despite the remarkable success of these approaches, a key challenge remains unsolved: adapting the fused image to downstream tasks. The spectrum captured by an infrared sensor differs from visible light, resulting in a significant difference in image appearance. During fusion, most existing methods enforce pixel-level similarity between the fused image and its source images, so the fused image incorporates the appearance characteristics of both the infrared and visible modalities. From the perspective of appearance attributes, infrared, visible, and fused images therefore belong to three different domains. While existing fusion methods perform well in traditional computer vision tasks, the advent of base models has introduced new challenges. In particular, fused images, which occupy an independent domain, are difficult to integrate seamlessly with pre-trained base models. These models are typically trained on large-scale visible image datasets that include infrared images only to some extent; few base models include fused images in their training data, so there is an inherent domain gap when fused images are applied directly to these models. When such a fused image is input directly into a pre-trained detection model, the result is less ideal than with a fused image that is more similar to visible light.
Disclosure of Invention
The invention aims to solve the problem that fused images are not directly adapted to high-level vision pre-trained models, and to this end provides a multi-modal image fusion method based on an inverse diffusion model. The technical scheme adopted to solve this problem is a multi-modal image fusion method based on an inverse diffusion model, comprising the following steps: Step one, inverting a visible light image into the noise latent space using diffusion inversion, and then guiding the inversion of an infrared image using the features of the inverted visible light image; Step two, injecting the appearance attributes of the visible light into the infrared features through guidance in the reverse process of the diffusion model, thereby generating an infrared image with a visible-light style; Step three, designing a specific fusion rule that fuses the inverted visible and infrared features in the attention layers during denoising, retaining the text-interaction capability of the model and supporting language-driven fusion control. Further, in step one, the framework of a denoising diffusion probabilistic model is followed, wherein the model comprises a T-step diffusion process and a T-step reverse process, and the forward process gradually converts a clean image $x_0$ into white Gaussian noise; this can be expressed as equation (1), wherein $x_t$ represents a