CN-121999334-A - Multi-source image fusion method and system based on depth feature interaction

Abstract

The invention discloses a multi-source image fusion method and system based on depth feature interaction. The method comprises: acquiring a first source image and a second source image of the same scene; performing feature encoding on the first source image and the second source image respectively to obtain their respective multi-scale features; performing depth feature interaction based on the multi-scale features to obtain a deep fusion feature map; performing semantic decoding on the deep fusion feature map to generate a global semantic representation and at least one group of multi-scale intermediate semantic features; performing semantics-guided feature mapping and fusion on the deep fusion feature map using the global semantic representation and the multi-scale intermediate semantic features to obtain semantics-guided deep features; and performing feature decoding with the semantics-guided deep features as the initial decoding features, outputting the fused image obtained by the feature decoding. The invention can reduce modal conflict and information redundancy at multiple scales, and promotes semantic consistency of the fusion result through semantic constraints.

Inventors

  • NIU LING
  • ZHU BIAN
  • SHEN QIUHUI
  • XU LIANG
  • SHI FANG

Assignees

  • 周口师范学院 (Zhoukou Normal University)

Dates

Publication Date
2026-05-08
Application Date
2026-01-05

Claims (10)

  1. A multi-source image fusion method based on depth feature interaction, characterized by comprising the following steps: acquiring a first source image and a second source image of the same scene; performing feature encoding on the first source image and the second source image respectively to obtain their respective multi-scale features, and performing depth feature interaction based on the multi-scale features to obtain a deep fusion feature map; performing semantic decoding on the deep fusion feature map to generate a global semantic representation and at least one group of multi-scale intermediate semantic features; performing semantics-guided feature mapping and fusion on the deep fusion feature map by using the global semantic representation and the multi-scale intermediate semantic features to obtain semantics-guided deep features; and performing feature decoding with the semantics-guided deep features as the initial decoding features, and outputting the fused image obtained by the feature decoding.
  2. The depth feature interaction-based multi-source image fusion method of claim 1, further comprising preprocessing the first source image and the second source image before feature encoding them, wherein the preprocessing comprises at least alignment and normalization, the first source image is an infrared image, and the second source image is a visible-light image.
  3. The depth feature interaction-based multi-source image fusion method according to claim 2, wherein the depth feature interaction comprises performing frequency-domain interaction on the infrared features and the visible-light features at at least one scale during feature encoding, the frequency-domain interaction comprising: performing Fourier transforms on the infrared features and the visible-light features respectively to obtain their respective frequency-domain features, and performing weighted fusion on the amplitude information and phase information of the frequency-domain features; the weighted fusion at least comprises assigning weights by frequency band, such that in the low-frequency band the weight coefficient of infrared is higher than that of visible light, in the high-frequency band the weight coefficient of visible light is higher than that of infrared, and in the middle frequency band the weight coefficients of infrared and visible light are determined adaptively according to their response intensity or consistency in that band.
  4. The depth feature interaction-based multi-source image fusion method according to claim 3, wherein, when the amplitude information and the phase information are weighted and fused, a first frequency-domain uncertainty of the infrared frequency-domain features and a second frequency-domain uncertainty of the visible-light frequency-domain features are determined respectively, and the weight coefficients of infrared and visible light are adaptively adjusted according to the first and second frequency-domain uncertainties, so that frequency-domain features with higher uncertainty values are suppressed in the weighted fusion and frequency-domain features with lower uncertainty values are strengthened; the frequency-domain uncertainty characterizes the degree of noise, distortion, or fluctuation of the corresponding frequency-domain features at each frequency point or frequency band, and a larger frequency-domain uncertainty indicates lower reliability of the corresponding frequency-domain features.
  5. The depth feature interaction-based multi-source image fusion method according to claim 4, wherein determining the first frequency-domain uncertainty and the second frequency-domain uncertainty comprises generating a confidence map, distributed over frequency bands or frequency points, from band statistics or frequency-domain response intensities of the infrared and visible-light frequency-domain features, and using the confidence map to modulate the fusion weights of infrared and visible light at the corresponding bands or points.
  6. The depth feature interaction-based multi-source image fusion method according to claim 1, wherein performing semantic decoding on the deep fusion feature map comprises: taking the deep fusion feature map as input and performing progressive upsampling and feature reconstruction to obtain semantic feature maps at multiple scales, the multi-scale intermediate semantic features consisting at least of these semantic feature maps; and performing global statistics on at least one semantic feature map among the multi-scale intermediate semantic features to obtain the global semantic representation.
  7. The depth feature interaction-based multi-source image fusion method according to claim 1, wherein the multi-scale intermediate semantic features correspond to a plurality of decoding scales of the feature decoding, and during feature decoding, at at least one decoding scale, channel weights are generated from the intermediate semantic features of the corresponding scale, and semantic-consistency modulation is applied to the fusion features at that decoding scale according to the channel weights.
  8. The depth feature interaction-based multi-source image fusion method according to claim 1, wherein the feature decoding comprises layer-by-layer upsampling, and in at least one upsampling layer, the upsampled features of that layer are fused with the infrared features and visible-light features stored during the encoding stage to obtain decoding fusion features at that scale.
  9. The depth feature interaction-based multi-source image fusion method according to claim 8, wherein, when the upsampled features are fused with the infrared and visible-light features stored during the encoding stage, global statistics are computed on the features to be fused to obtain channel description vectors, channel weights for channel recalibration are generated from the channel description vectors, and convolutional reconstruction is performed after the channel responses of the features to be fused are recalibrated according to the channel weights, yielding the fusion features of the corresponding decoding scale.
  10. A multi-source image fusion system based on depth feature interaction, comprising: an acquisition unit for acquiring a first source image and a second source image of the same scene; an encoding unit for performing feature encoding on the first source image and the second source image respectively to obtain their respective multi-scale features; a depth feature interaction unit for performing depth feature interaction based on the multi-scale features to obtain a deep fusion feature map; a semantic decoding unit for performing semantic decoding on the deep fusion feature map to generate a global semantic representation and at least one group of multi-scale intermediate semantic features; a semantics-guided fusion unit for performing semantics-guided feature mapping and fusion on the deep fusion feature map using the global semantic representation and the multi-scale intermediate semantic features to obtain semantics-guided deep features; and a feature decoding unit for performing feature decoding with the semantics-guided deep features as the initial decoding features and outputting the fused image obtained by the feature decoding.
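The band-wise frequency-domain weighting of claim 3 can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the band cut-offs, the fixed low-band and high-band weights, and the magnitude-ratio rule for the middle band are all assumptions, and amplitude and phase are here fused jointly as complex spectra rather than weighted separately.

```python
import numpy as np

def frequency_band_fusion(ir_feat, vis_feat, low_cut=0.15, high_cut=0.5,
                          w_low_ir=0.7, w_high_ir=0.3):
    """Fuse two single-channel feature maps in the frequency domain.

    The low-frequency band favours the infrared map, the high-frequency
    band favours the visible map, and the middle band is weighted by
    relative spectral magnitude (one simple reading of the adaptive rule).
    """
    F_ir = np.fft.fft2(ir_feat)
    F_vis = np.fft.fft2(vis_feat)

    h, w = ir_feat.shape
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    radius = np.sqrt(fy**2 + fx**2)       # normalised frequency radius

    # Per-frequency infrared weight, assigned band by band.
    w_ir = np.empty_like(radius)
    low = radius < low_cut
    high = radius >= high_cut
    mid = ~(low | high)
    w_ir[low] = w_low_ir                  # infrared dominates low frequencies
    w_ir[high] = w_high_ir                # visible dominates high frequencies
    # Middle band: adaptive weight from relative response intensity.
    mag_ir, mag_vis = np.abs(F_ir), np.abs(F_vis)
    w_ir[mid] = mag_ir[mid] / (mag_ir[mid] + mag_vis[mid] + 1e-8)

    fused = w_ir * F_ir + (1.0 - w_ir) * F_vis
    return np.real(np.fft.ifft2(fused))
```

When both inputs are identical, the convex per-frequency weighting leaves the spectrum unchanged, so the fusion reduces to the identity, which is a useful sanity check.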
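One possible reading of the frequency-domain uncertainty and confidence map of claims 4 and 5 is sketched below. The variance of spectral magnitude within a band is used as a crude uncertainty proxy and mapped to a confidence of 1/(1+uncertainty); both choices, and the per-band (rather than per-frequency-point) granularity, are illustrative assumptions, not the patent's formulation.

```python
import numpy as np

def band_confidence_weights(F_ir, F_vis, bands):
    """Per-band fusion weights from frequency-domain uncertainty.

    The uncertainty of each source in a band is estimated as the variance
    of its spectral magnitude over that band (a proxy for noise or
    fluctuation); the weight of the less reliable source is reduced.

    bands: list of boolean masks over the spectrum, one mask per band.
    Returns an array of infrared weights with the same shape as F_ir.
    """
    w_ir = np.zeros(F_ir.shape)
    for mask in bands:
        u_ir = np.var(np.abs(F_ir[mask]))    # first frequency-domain uncertainty
        u_vis = np.var(np.abs(F_vis[mask]))  # second frequency-domain uncertainty
        c_ir = 1.0 / (1.0 + u_ir)            # higher uncertainty -> lower confidence
        c_vis = 1.0 / (1.0 + u_vis)
        w_ir[mask] = c_ir / (c_ir + c_vis)   # normalised infrared weight in (0, 1)
    return w_ir
```

The returned map plays the role of the confidence-modulated weight: it would multiply the infrared spectrum (and its complement the visible spectrum) before the inverse transform.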
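The channel recalibration of claim 9 (global statistics, channel description vector, channel weights, recalibration) resembles squeeze-and-excitation-style gating. The sketch below is one such reading with an assumed two-layer bottleneck whose weights are random here (they would be learned in practice); the reduction ratio and initialisation are illustrative, and the convolutional reconstruction step is omitted.

```python
import numpy as np

def channel_recalibrate(feat, reduction=4, rng=None):
    """Recalibrate a (C, H, W) feature map channel by channel.

    Global average pooling produces the channel description vector, a
    small ReLU bottleneck followed by a sigmoid produces per-channel
    weights in (0, 1), and the channel responses are rescaled by them.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    c = feat.shape[0]
    desc = feat.mean(axis=(1, 2))                    # channel description vector
    # Hypothetical bottleneck weights (learned parameters in a real model).
    w1 = rng.standard_normal((c // reduction, c)) / np.sqrt(c)
    w2 = rng.standard_normal((c, c // reduction)) / np.sqrt(c // reduction)
    hidden = np.maximum(w1 @ desc, 0.0)              # ReLU bottleneck
    weights = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))   # sigmoid gate in (0, 1)
    return feat * weights[:, None, None]             # channel-wise recalibration
```

In the decoding stage of claim 9, `feat` would be the concatenation or sum of the upsampled features with the stored infrared and visible-light features, and the recalibrated output would then pass through the convolutional reconstruction.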

Description

Multi-source image fusion method and system based on depth feature interaction

Technical Field

The invention relates to the technical field of image processing and computer vision, and in particular to a multi-source image fusion method and a multi-source image fusion system based on depth feature interaction.

Background

Multi-source image fusion aims to complementarily integrate information from different imaging sources (such as infrared and visible light) of the same scene, so as to obtain a fused image with both salient thermal targets and clear texture detail. In applications such as night-time monitoring, low-illumination sensing, target detection and tracking, security inspection, and driving assistance, the infrared image highlights heat-radiating targets and is insensitive to illumination changes, while the visible-light image presents rich texture and structural detail that aids understanding of the scene background. Effectively fusing the two improves the interpretability of the image and the robustness of subsequent vision tasks.

Existing image fusion methods have the following defects, and their fusion results are prone to thermal-target suppression, texture blurring, and increased artifacts:

1. Spectral complementary information is underused. Many methods perform simple feature concatenation, pixel-wise weighting, or single-pass attention fusion in the spatial domain, and find it difficult to explicitly describe and exploit the complementary relationship between infrared and visible light at the spectral level. Because the low-frequency part of an image is typically related to overall brightness/energy distribution and main contours, while the high-frequency part is related to edges and texture detail, a fusion strategy that does not differentiate and regulate the contributions of different frequency bands easily weakens thermally salient information or smooths away visible-light detail.

2. Cross-modal interaction in the encoding stage is insufficient or unstable. In existing end-to-end deep networks, cross-modal interaction is often reduced to addition or concatenation of features at a few layers, or to a one-off interaction at the bottleneck layer, making it difficult for the interaction to adapt as the scale changes. Moreover, when a source image is affected by noise, occlusion, overexposure, or smoke, the lack of a suppression mechanism for unreliable frequency bands/points easily introduces distortion and artifacts into the fusion.

3. The multi-level relationships among channels are ignored in the decoding stage. In encoder-decoder frameworks such as U-shaped networks, the decoding stage usually restores spatial resolution by upsampling and fusing skip-connection features. Existing methods often adopt channel-wise addition or a single attention step, and do not distinguish the hierarchical relationship of commonality and difference among channels during multi-scale decoding, which easily causes modal conflict and information redundancy, leading further to unclear structural detail, broken background texture, or edge artifacts.

4. Semantic-consistency understanding is insufficient. Fusion should not only be visually clear but also semantically consistent: key semantic structures such as thermal targets, roads, and buildings should not be destroyed or confused in the fusion result. Without guidance and constraints from global semantics, the fusion result may have enhanced local detail yet inconsistent overall semantics, degrading the performance of downstream tasks.

Therefore, there is a need for a multi-source image fusion scheme that explicitly exploits frequency-domain complementarity during the encoding phase and introduces semantic guidance and channel recalibration mechanisms during the decoding phase, so as to improve the thermal-target retention, texture-detail fidelity, and semantic consistency of the fused image while reducing artifacts and noise propagation.

Disclosure of Invention

The invention aims to provide a multi-source image fusion method and a multi-source image fusion system based on depth feature interaction, to solve prior-art problems such as insufficient spectral complementarity, coarse modeling of channel relationships, and difficulty in guaranteeing semantic consistency, thereby obtaining a fused image with better subjective visual quality and objective metrics. To achieve the above purpose, the present invention provides a multi-source image fusion method based on depth feature interaction, which specifically includes: acquiring a first source image and a second source