CN-122023149-A - Missing infrared mode image fusion method based on downstream task driven learning
Abstract
The invention relates to a missing-infrared-modality image fusion method based on downstream-task-driven learning, and belongs to the technical field of infrared and visible image fusion. The method trains a dictionary encoder and decoder on paired infrared and visible images; decomposes a visible image into visible coding coefficients and infers pseudo-infrared coding coefficients from them; performs multi-weight adaptive fusion of the pseudo-infrared and visible coding coefficients in the coefficient domain, jointly training the pseudo-infrared coefficient inference network and the adaptive coefficient fusion network during fusion; and reconstructs the final fused image. During reconstruction, the fused image is fed into a downstream task, and the inference and fusion networks are updated simultaneously through the task loss, so that the fused image better suits the downstream task. The invention solves the problem of using downstream-task guidance to make the coefficient inference and fusion networks better suit the downstream task when the infrared modality is missing, and obtains a fused image meeting downstream-task requirements from a single visible image.
Inventors
- ZHANG YAFEI
- MA MENG
- LI HUAFENG
- XIE MINGHONG
- DONG NENG
Assignees
- Kunming University of Science and Technology (昆明理工大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260413
Claims (8)
- 1. A missing-infrared-modality image fusion method based on downstream-task-driven learning, characterized by comprising the following steps: Step 1, acquiring a training data set of paired infrared and visible images; Step 2, training on the paired infrared and visible images to obtain a dictionary encoder and decoder; Step 3, feeding the visible image into the encoder to decompose it into visible coding coefficients, and passing the visible coding coefficients through a coarse-to-fine pseudo-infrared coding coefficient inference network to obtain pseudo-infrared coding coefficients; Step 4, performing multi-weight adaptive fusion of the pseudo-infrared and visible coding coefficients in the coefficient domain, jointly training the pseudo-infrared coefficient inference network and the adaptive coefficient fusion network during fusion, and finally reconstructing the fused image; Step 5, feeding the fused image into a downstream task during reconstruction, and updating the pseudo-infrared coefficient inference network and the adaptive coefficient fusion network through the task loss to obtain a fused image better suited to the downstream task.
- 2. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 1, wherein each infrared-visible image pair in the dictionary-training data set of Step 1 consists of a corresponding infrared image and visible image, each with a resolution of 512×512; the data set is preprocessed by randomly cropping the images to 256×256 and normalizing the cropped data to the range [0, 1].
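The Step 1 preprocessing described in claim 2 (random 256×256 crop of registered 512×512 pairs, then normalization to [0, 1]) can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name, the use of NumPy, and the choice of dividing 8-bit data by 255 for normalization are all assumptions.

```python
import numpy as np

def preprocess_pair(ir, vis, crop=256, rng=None):
    """Randomly crop a registered IR/visible pair to crop x crop and
    normalize pixel values to [0, 1] (sketch of the Step 1 preprocessing)."""
    rng = rng if rng is not None else np.random.default_rng()
    h, w = ir.shape[:2]
    # One crop window shared by both modalities keeps the pair registered.
    y = int(rng.integers(0, h - crop + 1))
    x = int(rng.integers(0, w - crop + 1))
    ir_c = ir[y:y + crop, x:x + crop].astype(np.float32)
    vis_c = vis[y:y + crop, x:x + crop].astype(np.float32)
    # The patent only says "normalize to between 0 and 1"; dividing 8-bit
    # intensities by 255 is one assumed way to do that.
    return ir_c / 255.0, vis_c / 255.0
```
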
- 3. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 1, wherein Step 2 comprises: simultaneously inputting each pair of infrared and visible images into an initialized multi-layer dictionary encoder, the encoder comprising shared dictionary parameters and a coefficient-solving module; the two modal images and their coding coefficients are passed through each layer for computation and update, finally yielding the corresponding coding coefficients of the two modalities; a mapping between the two modal coefficients is established through a deep network; the bimodal coefficients are then fed simultaneously into a multi-layer dictionary decoder, and the reconstructed infrared and visible images are obtained through multi-layer computation and update; the reconstruction quality is ensured by an explicit reconstruction loss defined over the visible and infrared images reconstructed by the dictionary decoder (formula omitted in translation); after multiple rounds of training, a dictionary encoder and decoder capable of simultaneously decomposing and reconstructing both the visible and infrared modalities are obtained.
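The reconstruction loss of claim 3 penalizes the difference between each reconstructed modality and its source image, but its exact form is lost in the translation. The sketch below assumes an L1 (mean absolute error) term per modality, summed over both; the norm choice and function name are assumptions, not the patent's definition.

```python
import numpy as np

def reconstruction_loss(vis, ir, vis_rec, ir_rec):
    """Assumed L1 form of the Step 2 dictionary reconstruction loss:
    sum of per-pixel reconstruction errors for both modalities."""
    return float(np.mean(np.abs(vis_rec - vis)) + np.mean(np.abs(ir_rec - ir)))
```
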
- 4. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 1, wherein Step 3 comprises: in the pseudo-infrared coding coefficient inference network, the visible image is input into the dictionary encoder and decomposed into visible-modality coding coefficients; the visible coding coefficients and the corresponding infrared coding coefficients then lie in a unified coefficient space, making modal conversion between coefficients possible; the conversion of the visible coding coefficients is decoupled into two stages, global infrared mapping and infrared detail correction, handled respectively by a coarse-grained coefficient modal conversion network and a detail correction network; the coarse conversion network is implemented as a U-Net combining convolution and attention layers (formula omitted in translation) and outputs the coarsely converted infrared coding coefficients; it learns the global visible-to-infrared coefficient mapping, models the overall cross-modal correspondence, and recovers the low-frequency components and global structure of the infrared image; a multi-expert detail correction network is then introduced (formula omitted in translation) to predict high-frequency detail residuals and compensate for image details not recovered by the coarse conversion network; the pseudo-infrared coding coefficients after modal conversion are obtained as the sum of the differently weighted expert outputs, and a pseudo-infrared image is obtained by passing these coefficients through the decoder.
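The coarse-to-fine inference of claim 4 can be sketched as a coarse global mapping followed by a weighted sum of expert residuals. The callables `coarse_fn`, `experts`, and `gate_fn` stand in for the convolution-attention U-Net, the expert networks, and the expert-weight generator; none of their architectures are given in the patent, so this is only the data flow under those assumptions.

```python
import numpy as np

def infer_pseudo_ir(c_vis, coarse_fn, experts, gate_fn):
    """Step 3 data-flow sketch: coarse global visible-to-IR coefficient
    mapping, then a weighted sum of expert detail residuals."""
    c_coarse = coarse_fn(c_vis)          # global mapping (U-Net in the patent)
    w = gate_fn(c_coarse)                # per-expert weights (assumed to sum to 1)
    residual = sum(wi * e(c_coarse) for wi, e in zip(w, experts))
    return c_coarse + residual           # pseudo-infrared coding coefficients
```
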
- 5. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 4, wherein Step 3 further comprises: during training of the modal conversion, the correction amplitude is limited by a dual infrared intensity-gradient constraint (formulas omitted in translation); thermal-intensity alignment is further enhanced through a heat loss weighted by a normalized thermal weight map (formula omitted in translation), defined in terms of the infrared image, the pseudo-infrared image obtained by passing the coarse infrared coding coefficients through the decoder, and the gradient operator.
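The heat loss and gradient constraint of claim 5 lost their formulas in translation; the sketch below assumes one plausible form consistent with the surrounding text: a min-max-normalized thermal weight map emphasizing hot pixels, an L1 heat loss weighted by that map, and an L1 gradient-difference constraint. The normalization and norm choices are assumptions.

```python
import numpy as np

def thermal_weight_map(ir):
    """Normalized thermal weight map: hotter (brighter) IR pixels get
    larger weights. Min-max normalization is an assumption."""
    w = ir - ir.min()
    return w / (w.max() + 1e-8)

def heat_loss(pseudo_ir, ir):
    """Thermal-intensity alignment weighted by the thermal map
    (assumed L1 form)."""
    return float(np.mean(thermal_weight_map(ir) * np.abs(pseudo_ir - ir)))

def gradient_constraint(pseudo_ir, ir):
    """Limit the correction amplitude by matching spatial gradients
    (assumed L1 form over both gradient directions)."""
    gy1, gx1 = np.gradient(pseudo_ir)
    gy2, gx2 = np.gradient(ir)
    return float(np.mean(np.abs(gx1 - gx2)) + np.mean(np.abs(gy1 - gy2)))
```
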
- 6. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 1, wherein Step 4 comprises: in the multi-weight adaptive coefficient fusion network, the visible and pseudo-infrared coding coefficients are compared and weighted according to the input content and sent to the fusion module MAFN, which performs multi-weight adaptive fusion in the coefficient domain and outputs fusion coefficients carrying the predicted visible-light structure and infrared thermal semantics (formula omitted in translation); during fusion, MAFN adaptively generates the weights of the different inputs according to their influence on the target fusion result; the fusion coefficients and the shared dictionary D are then sent into the trained dictionary decoder for consistent reconstruction, yielding the infrared-visible fused image under the missing-infrared condition.
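In the patent, MAFN is a learned network that produces the fusion weights. As a stand-in for that learned weighting, the sketch below uses a per-element softmax over coefficient magnitudes, so the modality with the stronger response at each position dominates; this weighting rule is an assumption, and only the weighted-sum structure reflects claim 6.

```python
import numpy as np

def adaptive_fuse(c_vis, c_pir):
    """Step 4 sketch: multi-weight adaptive fusion in the coefficient
    domain. A softmax over |coefficient| stands in for MAFN's learned,
    input-dependent weights (an assumption)."""
    a = np.stack([np.abs(c_vis), np.abs(c_pir)])
    e = np.exp(a - a.max(axis=0, keepdims=True))   # per-element softmax
    w = e / e.sum(axis=0, keepdims=True)
    return w[0] * c_vis + w[1] * c_pir             # fusion coefficients
```
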
- 7. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 1, wherein Step 5 comprises: after the fused image is obtained, it is fed as a single modality into a trained and frozen downstream task network, which outputs the task prediction (formula omitted in translation); the fusion network is treated as an intermediate representation guided by the task, and the different downstream tasks act as label agents: the fusion results for the different downstream tasks are constrained by the downstream-task labels, so that fusion quality is measured by downstream-task performance; furthermore, when training the adaptive coefficient fusion network, the earlier coarse conversion network is frozen to keep the global modal conversion stable, while the detail correction network participates in training; back-propagation of the downstream-task loss guides the correction of pseudo-infrared details to different degrees, so that the pseudo-infrared coding coefficients output by the network suit the downstream task; the subsequent adaptive coefficient fusion network cooperates in further adjustment, and this secondary training of inference and fusion jointly shapes the fusion output, producing fusion results corrected to different degrees.
- 8. The missing-infrared-modality image fusion method based on downstream-task-driven learning of claim 7, wherein Step 5 further comprises: the downstream-task loss simultaneously constrains the detail correction network in the pseudo-infrared coding coefficient inference network and the adaptive coefficient fusion network; the constraint loss differs for each downstream task, as do the network weights per task; the loss for the semantic segmentation task is defined (formula omitted in translation) as a weighted combination of a cross-entropy loss, which emphasizes pixel-wise classification correctness, and a Dice loss, which maximizes the overlap between the predicted segmentation and the ground-truth region given by the downstream-task labels; the loss for the object detection task is defined (formula omitted in translation) as a combination of a classification loss and a bounding-box regression loss; the gradients of the task losses act simultaneously on the pseudo-infrared inference and coefficient fusion processes, guiding the fusion network to generate fused images more favorable for downstream discrimination; in addition to the task losses, an intensity consistency loss and a gradient consistency loss are adopted (formulas omitted in translation), where max denotes the point-wise larger of the pixel intensities or gradient magnitudes, helping the fused image inherit both the infrared thermal-intensity peaks and the sharp visible-light edges; the final fusion loss combines these terms.
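The intensity and gradient consistency losses of claim 8 are described in enough detail to sketch: the fused image should match the point-wise maximum of the source intensities, and its gradient magnitude should match the point-wise larger source gradient magnitude. The L1 norms and the use of finite differences for the gradient operator are assumptions.

```python
import numpy as np

def grad_mag(img):
    """Gradient magnitude via finite differences (assumed operator)."""
    gy, gx = np.gradient(img)
    return np.hypot(gx, gy)

def intensity_consistency(fused, ir, vis):
    """Fused intensities should track the point-wise stronger source,
    preserving IR heat peaks and bright visible regions (assumed L1)."""
    return float(np.mean(np.abs(fused - np.maximum(ir, vis))))

def gradient_consistency(fused, ir, vis):
    """Fused gradients should track the point-wise larger source gradient
    magnitude, keeping the sharper edges of either modality (assumed L1)."""
    target = np.maximum(grad_mag(ir), grad_mag(vis))
    return float(np.mean(np.abs(grad_mag(fused) - target)))
```
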
Description
Missing infrared modality image fusion method based on downstream task driven learning
Technical Field
The invention relates to a missing-infrared-modality image fusion method based on downstream-task-driven learning, and belongs to the technical field of infrared and visible image fusion.
Background
In practical applications of infrared-visible fusion, the infrared modality provides more stable saliency information for thermal targets and low-light scenes, while the visible modality provides richer texture details and semantic structure. Conventional infrared-visible fusion methods generally assume that both modalities are available simultaneously and take visual quality or statistical metrics as the main optimization targets, producing fusion results with better appearance. In real deployments, however, infrared sensors are constrained by cost, power consumption, size, maintenance conditions, and imaging-environment interference; the infrared modality is often missing, severely degraded, or unsynchronized in acquisition, so traditional fusion models that rely on bimodal input cannot work stably. To cope with missing infrared, strategies based on modality completion or modality conversion have been proposed: a pseudo-infrared image is estimated from visible light and then fused, or cross-modal mapping is realized by a generative model to complete the missing information. However, these methods take pixel-level reconstruction consistency as the core supervision, which easily yields pseudo-infrared representations that look realistic but are task-irrelevant, leaving the fused image unstable with respect to target boundaries, thermal-target saliency, and small-target detection.
On the other hand, the recent idea of downstream-task-oriented fusion feeds the fusion result directly into a detection or segmentation network and constrains the fusion network in reverse through the task loss, so that fusion meets discrimination requirements; yet most existing task-driven fusion still depends on real infrared input and lacks a collaborative optimization mechanism for modality generation and fusion in the infrared-missing scenario.
Disclosure of the Invention
Aiming at the problems that the infrared modality is easily missing, degraded, or unavailable in practical applications, and that existing infrared-visible fusion methods depend heavily on real infrared input and cannot work stably when infrared is missing, the invention provides a missing-infrared-modality image fusion method based on downstream-task-driven learning. The invention uses the fusion result directly in the downstream task and exploits the downstream-task loss to constrain the modal conversion and fusion processes in reverse, realizing end-to-end collaborative learning of modality completion, fusion-representation generation, and task optimization under the missing-infrared condition, thereby effectively improving the robustness, discriminability, and practical value of the fusion result.
The technical scheme of the invention is a missing-infrared-modality image fusion method based on downstream-task-driven learning, comprising the following steps: Step 1, acquiring a training data set of paired infrared and visible images; Step 2, training on the paired infrared and visible images to obtain a dictionary encoder and decoder; Step 3, feeding the visible image into the encoder to decompose it into visible coding coefficients, and passing the visible coding coefficients through a coarse-to-fine pseudo-infrared coding coefficient inference network to obtain pseudo-infrared coding coefficients; Step 4, performing multi-weight adaptive fusion of the pseudo-infrared and visible coding coefficients in the coefficient domain, jointly training the pseudo-infrared coefficient inference network and the adaptive coefficient fusion network during fusion, and finally reconstructing the fused image; Step 5, feeding the fused image into a downstream task during reconstruction, and updating the inference and fusion networks through the task loss to obtain a fused image better suited to the downstream task. Further, in Step 1, each infrared-visible image pair in the dictionary-training data set consists of a corresponding infrared image and visible image, each with a resolution of 512×512; the data set is preprocessed by randomly cropping the images to 256×256 and normalizing the cropped data to the range [0, 1]. Further, Step 2 includes: pairs of infrared-visible images And Simultaneously inputting the two modal images a