CN-121999335-A - Infrared and visible light image fusion method and system based on multi-branch cooperative enhancement

CN121999335A

Abstract

The invention discloses an infrared and visible light image fusion method and system based on multi-branch cooperative enhancement. The method builds a dual-branch architecture in which a global structure extraction branch (GSEM) and a local detail extraction branch (LDEM) work cooperatively. GSEM, built on ConvNeXt Block modules, captures the global low-frequency structural features of an image through depthwise separable convolution and residual connections, providing a stable global-consistency anchor for the fusion result. LDEM introduces a Triplet Attention mechanism that dynamically focuses on local high-frequency details from three dimensions, forming a progressive feature-enhancement logic and ensuring that the core information of both modalities is effectively retained. Through an end-to-end network architecture design and multi-stage loss constraints, the invention realizes efficient extraction and high-quality fusion of the bimodal information.

Inventors

  • CHEN MING
  • ZHANG WANGWEI
  • QIN XINYUE
  • CHENG JUNQIANG
  • WEN XIAOBO
  • CHU YANGYANG
  • LIANG SHUJUN

Assignees

  • Tongling University (铜陵学院)
  • Zhengzhou University of Light Industry (郑州轻工业大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-16

Claims (10)

  1. An infrared and visible light image fusion method based on multi-branch cooperative enhancement, characterized by comprising the following steps:
     S1, a modal feature encoder performs preliminary feature mapping on an input infrared original image and a visible light original image, mapping both into a unified feature space to obtain infrared coding features and visible light coding features;
     S2, the coding features output by S1 are input into a global structure extraction module, which is constructed based on ConvNeXt Blocks and combines depthwise separable convolution with a residual connection structure to capture the global low-frequency structural features of the image, yielding infrared basic structural features and visible light basic structural features;
     S3, the coding features output by S1 are input into a local detail extraction module, which uses a Triplet Attention mechanism to dynamically focus on modality-specific high-frequency detail information along the channel-width, height-channel and spatial dimensions, yielding infrared detail features and visible light detail features;
     S4, the six types of multi-scale features (the coding features output by S1, the basic structural features output by S2 and the detail features output by S3) are gathered and fed into a multi-source feature fusion layer, which sequentially applies three-stage fusion processing of channel concatenation, dimension compression and feature calibration to obtain calibration features;
     S5, the calibration features output by S4 are input into a fusion decoder, which models long-range dependencies through a double-layer Transformer structure and completes feature mapping and image reconstruction through a two-stage convolution layer, outputting a primary infrared and visible light fusion image; and
     S6, in the training stage a joint loss function is constructed and the feature extraction and fusion processes of the above steps are collaboratively optimized end to end, yielding the final infrared and visible light fusion image.
  2. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 1, wherein in S1 the modal feature encoder MFE adopts a lightweight convolutional network architecture, and the feature mapping process further comprises (an illustrative sketch of this encoder follows the claims):
     S1.1, the input single-channel infrared and visible light original images are mapped into a 64-channel feature space through 1×1 convolution, expressed as F_I = Conv_{1→64}^{1×1}(I) and F_V = Conv_{1→64}^{1×1}(V), where F_I and F_V denote the primary feature mapping results of the infrared and visible light original images after the 1×1 convolution, I is the infrared original image, V is the visible light original image, and in the encoding stage the convolution operation Conv_{c→c'}^{k×k} denotes a convolution layer with input channels c, output channels c' and kernel size k×k;
     S1.2, the mapped features then pass sequentially through a nonlinear activation function δ(·), a batch normalization operation BN(·) and a feature enhancement module C3 composed of a stack of n Bottleneck modules, extracting and enhancing the feature expression capability layer by layer, expressed as Φ_I = C3(BN(δ(F_I))) and Φ_V = C3(BN(δ(F_V))), where Φ_I denotes the infrared coding features and Φ_V denotes the visible light coding features;
     S1.3, the modal feature encoder MFE outputs the 64-channel infrared coding features and visible light coding features, corresponding to the encoding results of the infrared and visible light original images.
  3. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 1, wherein in S2 the global structure extraction module GSEM is configured to capture the global structural features shared between the infrared and visible light modalities, and the feature capture process further comprises (an illustrative sketch follows the claims):
     S2.1, a ConvNeXt enhancement module CNeB is formed based on a modified ConvNeXt Block; in the CNeB structure, the ConvNeXt Block employs a 7×7 depthwise separable convolution DWConv to model long-range spatial dependencies, and the feature transformation process is formally expressed as:
     X_1 = DWConv_{7×7}(X_in), X_2 = LN(X_1), X_3 = GELU(Conv_up^{1×1}(X_2)), X_4 = Conv_down^{1×1}(X_3), X_out = X_in + γ ⊙ X_4,
     where X_in denotes the input features of the CNeB structure, X_1 denotes the features after the 7×7 depthwise separable convolution, X_2 denotes the features after the layer normalization operation LN(·), X_3 denotes the features after the channel up-dimension convolution and the GELU activation function, X_4 denotes the features after the channel down-dimension convolution, X_out denotes the output features of the CNeB structure, Conv_up^{1×1} and Conv_down^{1×1} denote 1×1 convolution layers that expand and reduce the channel dimension respectively, and γ is a learnable scaling parameter;
     S2.2, a cross-stage partial connection mechanism CSP divides the input feature into two parts for parallel processing: one part extracts global features through the CNeB structure while the other part retains the original path features, and the two parts are finally concatenated along the channel dimension to form basic structural features with stronger global consistency, with the output expressed as Φ_I^B = GSEM(Φ_I) and Φ_V^B = GSEM(Φ_V), where Φ_I^B denotes the infrared basic structural features, Φ_V^B denotes the visible light basic structural features, Φ_I and Φ_V denote the infrared and visible light coding features, and GSEM(·) denotes the feature processing function of the global structure extraction module.
  4. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 1, wherein in S3 the local detail extraction module LDEM is configured to extract the discriminative high-frequency detail features of the infrared and visible light modalities, and the feature extraction process further comprises (an illustrative sketch follows the claims):
     S3.1, a Triplet Attention mechanism is introduced, capturing the dependencies of the channel-width CW, height-channel HC and spatial HW dimensions through three independent attention branches and realizing feature re-weighting along different spatial and channel directions; the calculation process is uniformly expressed as:
     x̃_CW = ω_CW ⊗ x, x̃_HC = ω_HC ⊗ x_1, x̃_HW = ω_HW ⊗ x_2, with each branch weight obtained as ω = σ(Conv_{2→1}^{k×k}(Z(·))),
     where ω_CW, ω_HC and ω_HW denote the attention weights of the channel-width, height-channel and spatial dimensions respectively, x̃_CW denotes the original input feature x weighted by the channel-width attention, x̃_HC and x̃_HW denote the dimension-permuted features x_1 and x_2 weighted by the height-channel and spatial attention respectively, x denotes the original input feature, x_1 and x_2 denote the features after dimension permutation, σ(·) is the Sigmoid activation function, Z(·) denotes the concatenation of the max-pooling and average-pooling results along the channel dimension, in the encoding stage Conv_{c→c'}^{k×k} denotes a convolution layer with input channels c, output channels c' and kernel size k×k, and ⊗ denotes element-wise multiplication;
     S3.2, the outputs of the three attention branches x̃_CW, x̃_HC and x̃_HW are fused by weighting to obtain the detail-enhanced feature representation, giving Φ_I^D = LDEM(Φ_I) and Φ_V^D = LDEM(Φ_V), where Φ_I^D denotes the infrared detail features, Φ_V^D denotes the visible light detail features, Φ_I and Φ_V denote the infrared and visible light coding features, and LDEM(·) denotes the feature processing function of the local detail extraction module.
  5. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 1, wherein S4 further comprises (an illustrative sketch of the fusion layer and decoder follows the claims):
     S4.1, the multi-source feature fusion layer concatenates the six types of features, namely the infrared coding features Φ_I, the infrared basic structural features Φ_I^B, the infrared detail features Φ_I^D, the visible light coding features Φ_V, the visible light basic structural features Φ_V^B and the visible light detail features Φ_V^D, along the channel dimension to form the 384-channel fused feature matrix Φ_cat = Concat(Φ_I, Φ_I^B, Φ_I^D, Φ_V, Φ_V^B, Φ_V^D), where Concat(·) denotes the feature concatenation operation along the channel dimension;
     S4.2, to reduce the computational overhead caused by the high-dimensional concatenation and to aggregate the channel information effectively, a 1×1 convolution performs the dimension compression operation to generate the 128-dimensional compressed features Φ_comp = Conv_{384→128}^{1×1}(Φ_cat), where in the encoding stage Conv_{c→c'}^{k×k} denotes a convolution layer with input channels c, output channels c' and kernel size k×k;
     S4.3, feature calibration is performed to further optimize the compressed features Φ_comp, yielding the calibration features Φ_cal.
  6. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 1, wherein S5 further comprises:
     S5.1, the fusion decoder takes the calibration features Φ_cal output by the multi-source feature fusion layer as input and adopts a double-layer Transformer structure to realize globally associated modeling and image reconstruction of the multi-modal features, each layer consisting of layer normalization LayerNorm, a multi-head self-attention mechanism MSA and a feed-forward network FFN, forming a progressive fusion process from global dependency modeling to modality-feature weighted optimization;
     S5.2, feature mapping and image reconstruction are realized through a two-stage convolution layer, and the primary infrared and visible light fusion image F is output, mathematically defined as F = FD(Φ_cal), where FD(·) denotes the fusion decoder.
  7. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 1, wherein in S6 the joint loss function is used to guide the model to balance structure fidelity against detail preservation, generating a fusion result with both salient targets and rich texture details; it consists of a structural similarity loss L_SSIM, a mean square error loss L_MSE, a gradient loss L_Grad and a correlation decomposition loss L_decomp, and the joint loss function L_total is defined as (an illustrative sketch of this loss follows the claims):
     L_total = α·L_SSIM + β·L_MSE + γ·L_Grad + μ·L_decomp,
     where I is the infrared original image, V is the visible light original image, and α, β, γ and μ are weight coefficients.
  8. The infrared and visible light image fusion method based on multi-branch cooperative enhancement according to claim 7, wherein in S6 the structural similarity loss L_SSIM imposes a structural consistency constraint against the maximum of the modal features and is defined as:
     L_SSIM = 1 − SSIM(F, max(I, V)),
     where F is the primary infrared and visible light fusion image, max(I, V) takes the maximum value of the corresponding pixels of the two modalities to highlight salient regions, and SSIM(·,·) is the structural similarity index; by minimizing 1 − SSIM, structural distortion or fracture of the fused image is avoided;
     the mean square error loss L_MSE measures the pixel-level difference to ensure brightness fidelity and is defined as:
     L_MSE = (1/(HW))·‖F − max(I, V)‖_F²,
     where H and W are the height and width of the image respectively and ‖·‖_F denotes the Frobenius norm;
     the gradient loss L_Grad emphasizes the preservation of edge and texture details and is defined as:
     L_Grad = (1/(HW))·‖ |∇F| − max(|∇I|, |∇V|) ‖_2,
     where ∇ denotes the gradient operator, the maximum of the two modal gradients is taken to highlight critical edges, and ‖·‖_2 is the L2 norm;
     the correlation decomposition loss L_decomp strengthens the modality specificity of the infrared and visible light features and is defined as:
     L_decomp = (CC(Φ_I^D, Φ_V^D))² / (CC(Φ_I^B, Φ_V^B) + σ),
     where Φ_I^D denotes the infrared detail features, Φ_I^B the infrared basic structural features, Φ_V^D the visible light detail features, Φ_V^B the visible light basic structural features, CC(·,·) is the correlation coefficient operator used to measure the similarity of feature distributions, and σ is a stability coefficient used to avoid a zero denominator and to strengthen attention to the detail features.
  9. An infrared and visible light image fusion system based on multi-branch cooperative enhancement, applicable to the infrared and visible light image fusion method based on multi-branch cooperative enhancement according to any one of claims 1 to 8, characterized by comprising:
     a modal feature encoder, configured to map the infrared original image and the visible light original image into a unified feature space, laying the foundation for subsequent cross-modal feature processing and fusion;
     a dual-branch feature enhancement unit, connected to the modal feature encoder and configured to extract multi-level features from the original image input and to realize effective separation and enhancement of the basic structure and detail information through structurally differentiated design, comprising two functionally complementary sub-modules, namely a global structure extraction module and a local detail extraction module, wherein the global structure extraction module outputs global low-frequency structural features and the local detail extraction module outputs local high-frequency detail information; and
     a fusion decoder, connected to the dual-branch feature enhancement unit, adopting a Restormer architecture and modeling long-range dependencies based on a Transformer mechanism, configured to dynamically balance the contributions of the infrared and visible light features, thereby realizing deep fusion of the multi-modal information and generating a high-fidelity image reconstruction.
  10. The infrared and visible light image fusion system based on multi-branch cooperative enhancement according to claim 9, wherein the global structure extraction module is constructed based on ConvNeXt Blocks, combining depthwise separable convolution with a residual connection structure to effectively capture the global low-frequency structural features of the image; and the local detail extraction module introduces a Triplet Attention mechanism to dynamically focus on high-frequency detail information from the three dimensions of channel, space and their interaction, so as to strengthen the modality-specific features and improve detail expression.
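
The following minimal PyTorch sketches illustrate, under stated assumptions, how the claimed modules could be realized; they are illustrative reconstructions, not the patent's reference implementation. First, the modal feature encoder MFE of claim 2: a 1×1 convolution lifts each single-channel image into a 64-channel space, followed by an activation, batch normalization and a C3 feature enhancement module built from n stacked Bottleneck blocks. The claim names neither the activation function nor the Bottleneck internals, so SiLU and a YOLOv5-style C3/Bottleneck design are assumptions here.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Residual bottleneck; the two-convolution form is an assumed design."""
    def __init__(self, ch):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.act = nn.SiLU()
    def forward(self, x):
        return x + self.act(self.conv2(self.act(self.conv1(x))))

class C3(nn.Module):
    """Feature enhancement module: split, n Bottlenecks on one path, concat, fuse."""
    def __init__(self, ch, n=1):
        super().__init__()
        self.cv1 = nn.Conv2d(ch, ch // 2, 1)
        self.cv2 = nn.Conv2d(ch, ch // 2, 1)
        self.m = nn.Sequential(*[Bottleneck(ch // 2) for _ in range(n)])
        self.cv3 = nn.Conv2d(ch, ch, 1)
    def forward(self, x):
        return self.cv3(torch.cat([self.m(self.cv1(x)), self.cv2(x)], dim=1))

class ModalFeatureEncoder(nn.Module):
    """MFE per claim 2: 1x1 conv (1 -> 64 channels), activation, BatchNorm, C3."""
    def __init__(self, n=1):
        super().__init__()
        self.proj = nn.Conv2d(1, 64, 1)   # S1.1: map into 64-channel space
        self.act = nn.SiLU()              # nonlinear activation (type assumed)
        self.bn = nn.BatchNorm2d(64)      # S1.2: batch normalization
        self.c3 = C3(64, n)               # S1.2: stack of n Bottleneck modules
    def forward(self, img):               # img: (B, 1, H, W)
        return self.c3(self.bn(self.act(self.proj(img))))
```

The same encoder class would map I and V to the 64-channel coding features Φ_I and Φ_V; whether the two modalities share weights is not specified in the claim.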
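
Next, a sketch of the global structure extraction module GSEM of claim 3: a ConvNeXt-style block (7×7 depthwise convolution, layer normalization, channel up- and down-projection with GELU, a learnable scale γ and a residual connection) wrapped in a CSP split where one half of the channels passes through the CNeB path and the other half keeps the original path. The 4× expansion ratio and the 1×1 split/fuse convolutions are assumptions not stated in the claim.

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    """CNeB inner block per claim 3, S2.1 (expansion ratio assumed to be 4)."""
    def __init__(self, ch, expand=4):
        super().__init__()
        self.dwconv = nn.Conv2d(ch, ch, 7, padding=3, groups=ch)  # 7x7 depthwise
        self.norm = nn.LayerNorm(ch)                               # LN over channels
        self.pw1 = nn.Linear(ch, expand * ch)                      # channel up-projection
        self.act = nn.GELU()
        self.pw2 = nn.Linear(expand * ch, ch)                      # channel down-projection
        self.gamma = nn.Parameter(1e-6 * torch.ones(ch))           # learnable scale
    def forward(self, x):
        y = self.dwconv(x).permute(0, 2, 3, 1)    # to (B, H, W, C) for LN/Linear
        y = self.pw2(self.act(self.pw1(self.norm(y))))
        y = (self.gamma * y).permute(0, 3, 1, 2)  # back to (B, C, H, W)
        return x + y                               # residual connection

class GSEM(nn.Module):
    """Claim 3, S2.2: CSP split; one half through CNeB, other half identity."""
    def __init__(self, ch=64, n=1):
        super().__init__()
        self.cv1 = nn.Conv2d(ch, ch // 2, 1)
        self.cv2 = nn.Conv2d(ch, ch // 2, 1)
        self.cneb = nn.Sequential(*[ConvNeXtBlock(ch // 2) for _ in range(n)])
        self.fuse = nn.Conv2d(ch, ch, 1)
    def forward(self, x):
        return self.fuse(torch.cat([self.cneb(self.cv1(x)), self.cv2(x)], dim=1))
```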
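
For the local detail extraction module LDEM of claim 4, here is a sketch of the Triplet Attention mechanism: each of the three branches applies a Z-pool (concatenated channel-wise max- and average-pooling), a convolution and a Sigmoid to produce an attention map, with two branches operating on dimension-permuted tensors. The 7×7 gate kernel and the equal-weight (averaging) fusion of the three branch outputs follow the published Triplet Attention design and are assumptions with respect to the claim.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooled maps along the (rotated) channel dim."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> conv -> Sigmoid, re-weighting the input (kernel size assumed)."""
    def __init__(self, k=7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Conv2d(2, 1, k, padding=k // 2)
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Claim 4: channel-width, height-channel and spatial attention branches."""
    def __init__(self):
        super().__init__()
        self.cw, self.hc, self.hw = AttentionGate(), AttentionGate(), AttentionGate()
    def forward(self, x):                                   # x: (B, C, H, W)
        # channel-width branch: rotate so H takes the channel role
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        # height-channel branch: rotate so W takes the channel role
        x_hc = self.hc(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        x_hw = self.hw(x)                                   # plain spatial branch
        return (x_cw + x_hc + x_hw) / 3.0                   # weighted fusion (averaging assumed)
```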
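
A sketch of the multi-source feature fusion layer and fusion decoder of claims 5 and 6: six 64-channel feature maps are concatenated into 384 channels, compressed to 128 by a 1×1 convolution, calibrated, passed through two pre-norm Transformer encoder layers (LayerNorm, MSA, FFN) over flattened spatial tokens, and reconstructed by a two-stage convolution. The calibration form (conv + BN + SiLU), the head count, the FFN width and the Sigmoid output range are assumptions; the system claim names Restormer, for which a vanilla encoder layer stands in here.

```python
import torch
import torch.nn as nn

class FusionDecoder(nn.Module):
    """Claims 5-6: concat 6x64 -> 384, 1x1 conv -> 128, calibration,
    double-layer Transformer, two-stage convolutional reconstruction."""
    def __init__(self, ch=64, d=128, heads=4):
        super().__init__()
        self.compress = nn.Conv2d(6 * ch, d, 1)      # S4.2: dimension compression
        self.calib = nn.Sequential(                  # S4.3: calibration (form assumed)
            nn.Conv2d(d, d, 3, padding=1), nn.BatchNorm2d(d), nn.SiLU())
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=heads,
                                           dim_feedforward=2 * d,
                                           batch_first=True, norm_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)  # S5.1
        self.recon = nn.Sequential(                  # S5.2: two-stage convolution
            nn.Conv2d(d, d // 2, 3, padding=1), nn.SiLU(),
            nn.Conv2d(d // 2, 1, 3, padding=1), nn.Sigmoid())
    def forward(self, feats):                        # feats: list of six (B, 64, H, W)
        x = self.calib(self.compress(torch.cat(feats, dim=1)))  # S4.1-S4.3
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C) token sequence
        x = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)
        return self.recon(x)                         # primary fused image F
```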
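
A sketch of the joint loss of claims 7 and 8, combining the SSIM, MSE, gradient and correlation decomposition terms against the per-pixel maximum of the two modalities. The Sobel gradient operator, the mean-reduced squared form of the L2 gradient term, the default weights and the stability value σ are assumptions; SSIM comes from the third-party pytorch-msssim package (`pip install pytorch-msssim`).

```python
import torch
import torch.nn.functional as nnf
from pytorch_msssim import ssim   # third-party SSIM implementation

def sobel_grad(x):
    """Gradient magnitude via Sobel filters (operator choice assumed)."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                      device=x.device).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    return nnf.conv2d(x, kx, padding=1).abs() + nnf.conv2d(x, ky, padding=1).abs()

def cc(a, b, eps=1e-6):
    """Correlation coefficient between two feature maps, per batch element."""
    a = a - a.mean(dim=(1, 2, 3), keepdim=True)
    b = b - b.mean(dim=(1, 2, 3), keepdim=True)
    return (a * b).sum(dim=(1, 2, 3)) / (
        a.pow(2).sum(dim=(1, 2, 3)).sqrt() * b.pow(2).sum(dim=(1, 2, 3)).sqrt() + eps)

def joint_loss(fused, I, V, det_i, det_v, base_i, base_v,
               alpha=1.0, beta=1.0, gamma=1.0, mu=1.0, sigma=1.01):
    target = torch.maximum(I, V)                        # max-of-modalities reference
    l_ssim = 1 - ssim(fused, target, data_range=1.0)    # structural consistency
    l_mse = nnf.mse_loss(fused, target)                 # pixel-level brightness fidelity
    l_grad = nnf.mse_loss(sobel_grad(fused),            # L2 gradient term (reduction assumed)
                          torch.maximum(sobel_grad(I), sobel_grad(V)))
    # correlation decomposition loss per claim 8 (exact form reconstructed)
    l_dec = (cc(det_i, det_v) ** 2 / (cc(base_i, base_v) + sigma)).mean()
    return alpha * l_ssim + beta * l_mse + gamma * l_grad + mu * l_dec
```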
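
Finally, a hypothetical end-to-end wiring of the sketches above, mirroring steps S1 to S5 (training would add joint_loss from the previous sketch). All names, the choice of modality-specific encoders and the sharing of GSEM/LDEM across modalities are assumptions.

```python
# Hypothetical forward pass combining the sketches above (S1-S5).
mfe_ir, mfe_vis = ModalFeatureEncoder(n=2), ModalFeatureEncoder(n=2)
gsem, ldem = GSEM(64), TripletAttention()
decoder = FusionDecoder(ch=64)

def fuse(I, V):                                      # I, V: (B, 1, H, W) in [0, 1]
    phi_i, phi_v = mfe_ir(I), mfe_vis(V)             # S1: coding features
    base_i, base_v = gsem(phi_i), gsem(phi_v)        # S2: basic structural features
    det_i, det_v = ldem(phi_i), ldem(phi_v)          # S3: local detail features
    # S4 + S5: concatenate the six feature types, compress, calibrate, decode
    return decoder([phi_i, base_i, det_i, phi_v, base_v, det_v])
```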

Description

Infrared and visible light image fusion method and system based on multi-branch cooperative enhancement

Technical Field

The invention relates to the technical field of computer vision, and in particular to an infrared and visible light image fusion method and system based on multi-branch cooperative enhancement.

Background

Infrared and visible light image fusion (IVIF) can generate a composite image containing both thermal radiation target information and detailed textures, and has important application value in scenes such as night monitoring and safe driving. Traditional fusion methods rely mainly on manually designed feature extraction and fusion rules, and find it difficult to fully model the complementary and nonlinear relations among multiple modalities in complex environments. With the rapid development of deep learning in computer vision, image fusion methods based on neural networks have gradually become the mainstream of research. These methods realize adaptive learning of feature extraction and fusion strategies through an end-to-end training mechanism and remarkably improve fusion performance and generalization capability, but three critical limitations remain:

1. Insufficient collaborative modeling of global and local features: most methods rely on a single branch for feature extraction, or adopt a dual-branch structure without explicitly dividing the modeling responsibilities between the global structure and local details, resulting in blurred global contours or distorted local details in the fusion result;

2. Incomplete feature dimensions in multi-modal fusion: existing methods generally consider only basic features or detail features during fusion and cannot cover modality-specific and shared features at the same time, so the scope of feature cooperation is limited and deep fusion of multi-scale, multi-type features is difficult to realize;

3. Disjoint feature decomposition and fusion stages: many methods treat feature decomposition and fusion as mutually independent processes and lack an end-to-end joint constraint mechanism, so that semantic deviations exist between the decomposed features and the fusion targets, affecting the consistency and stability of the final result.

Disclosure of Invention

The infrared and visible light image fusion method and system based on multi-branch cooperative enhancement provided by the invention realize more stable and more efficient feature extraction and fusion, and solve at least one of the above technical problems.
In order to solve the above technical problems, the invention adopts the following technical scheme. The infrared and visible light image fusion method based on multi-branch cooperative enhancement comprises the following steps:

S1, a modal feature encoder performs preliminary feature mapping on an input infrared original image and a visible light original image, mapping both into a unified feature space to obtain infrared coding features and visible light coding features;

S2, the coding features output by S1 are input into a global structure extraction module, which is constructed based on ConvNeXt Blocks and combines depthwise separable convolution with a residual connection structure to capture the global low-frequency structural features of the image, yielding infrared basic structural features and visible light basic structural features;

S3, the coding features output by S1 are input into a local detail extraction module, which uses a Triplet Attention mechanism to dynamically focus on modality-specific high-frequency detail information along the channel-width, height-channel and spatial dimensions, yielding infrared detail features and visible light detail features;

S4, the six types of multi-scale features (the coding features output by S1, the basic structural features output by S2 and the detail features output by S3) are gathered and fed into a multi-source feature fusion layer, which sequentially applies three-stage fusion processing of channel concatenation, dimension compression and feature calibration to obtain calibration features;

S5, the calibration features output by S4 are input into a fusion decoder, which models long-range dependencies through a double-layer Transformer structure and completes feature mapping and image reconstruction through a two-stage convolution layer, outputting a primary infrared and visible light fusion image; and

S6, in the training stage a joint loss function is constructed and the feature extraction and fusion processes of the above steps are collaboratively optimized end to end, yielding the final infrared and visible light fusion image.

Further, in S1, the modal feature encoder MFE adopts a lightweight convolutional network architecture, and the feature mapping process further includes: S1.1, mapping an input single-channel infr