
CN-122023977-A - Method and system for multi-mode fusion of visible light and infrared image based on transition state

CN 122023977 A

Abstract

The invention relates to a method and system for multi-modal fusion of visible light and infrared images based on transition states. A feature extraction network comprising an encoder and a decoder is constructed; after training, the original visible light and infrared images are input and the encoder outputs the corresponding low-frequency and high-frequency features. An improved multi-modal fusion model is embedded between the encoder and decoder of the feature extraction network to obtain an improved feature extraction network: transition-state features are derived from the low-frequency and high-frequency features output by the trained encoder and are fused by the decoder. Paired visible light and infrared images are input into the trained improved feature extraction network, which outputs the fused image. The method preserves the detail features specific to the visible light image while exploiting the robustness of the infrared image; it strengthens the model's ability to distinguish features of different frequencies, improves fusion quality and downstream task performance, raises the quality of the generated transition states, retains end-to-end trainability, and improves feature decoupling.
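As a rough illustration of the two-stage pipeline the abstract summarizes, a minimal PyTorch sketch follows. Every name and interface here (FusionPipeline, the encoder/fusion/decoder call signatures) is a hypothetical placeholder, not the patent's actual implementation.

```python
import torch
import torch.nn as nn

class FusionPipeline(nn.Module):
    """Hypothetical wrapper for the pipeline described in the abstract."""

    def __init__(self, encoder: nn.Module, fusion: nn.Module, decoder: nn.Module):
        super().__init__()
        self.encoder = encoder  # trained first on image reconstruction
        self.fusion = fusion    # improved multi-modal fusion model, embedded
        self.decoder = decoder  # between the trained encoder and decoder

    def forward(self, visible: torch.Tensor, infrared: torch.Tensor) -> torch.Tensor:
        # The trained encoder yields low-frequency (structure) and
        # high-frequency (detail) features for each modality.
        vis_low, vis_high = self.encoder(visible)
        ir_low, ir_high = self.encoder(infrared)
        # The fusion model derives transition-state features between the
        # two modalities from the four feature maps.
        transition = self.fusion(vis_low, vis_high, ir_low, ir_high)
        # The decoder fuses the transition-state features into one image.
        return self.decoder(transition)
```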

Inventors

  • Ye Lei
  • Wang Dihong
  • Liang Deyuan
  • Xu Chenqi

Assignees

  • Zhejiang University of Technology (浙江工业大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-19

Claims (10)

  1. A method for multi-modal fusion of visible light and infrared images based on transition states, characterized in that: a feature extraction network comprising an encoder and a decoder is constructed; after training, the original visible light image and the original infrared image are input, and the encoder outputs the corresponding low-frequency and high-frequency features; an improved multi-modal fusion model is embedded between the encoder and the decoder of the feature extraction network to obtain an improved feature extraction network, in which transition-state features are obtained from the low-frequency and high-frequency features output by the trained encoder and are fused by the decoder; and paired visible light and infrared images are input into the trained improved feature extraction network, which outputs a fused image.
  2. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 1, wherein the feature extraction network comprises a shared feature encoder, a dual-branch independent encoder and a decoder arranged in sequence; the shared feature encoder extracts common shallow features of the original visible light and infrared images; the dual-branch independent encoder comprises a base feature encoder and a detail feature encoder for each of the visible light image and the infrared image, the 2 base feature encoders extracting the corresponding low-frequency features from the common shallow features of the visible light and infrared images respectively, and the 2 detail feature encoders extracting the corresponding high-frequency features from those common shallow features respectively; and the low-frequency and high-frequency features of the visible light image are concatenated, the low-frequency and high-frequency features of the infrared image are concatenated, both are input into the decoder, and the reconstructed visible light and infrared images are output.
  3. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 2, wherein the feature extraction network is trained by calculating the corresponding losses between the original visible light and infrared images and the reconstructed visible light and infrared images, respectively.
  4. The method for multi-modal fusion of visible light and infrared images based on transition states, wherein the improved multi-modal fusion model comprises a convolution-attention fusion module, a linear transformation-grouping convolution module, a transition-state feature fusion module and a gating forward propagation module arranged in sequence; the convolution-attention fusion module is connected to the output of the dual-branch independent encoder of the trained feature extraction network, the low-frequency features of the visible light and infrared images are concatenated, the high-frequency features of the visible light and infrared images are concatenated, and both are input into the convolution-attention fusion module; and the output of the gating forward propagation module is connected to the input of the decoder of the trained feature extraction network.
  5. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 4, wherein the convolution-attention fusion module comprises a local branch module and a global branch module; the local branch module comprises a convolution layer and a deformable convolution layer arranged in sequence; the global branch module comprises a first convolution layer, a self-attention unit and a second convolution layer arranged in sequence, with deformable convolution layers arranged between the first convolution layer and each of the Q, K and V matrices of the self-attention unit, and a skip connection from the input of the first convolution layer to the input of the second convolution layer; and the output of the deformable convolution layer of the local branch is fused with the output of the second convolution layer and then output.
  6. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 4, wherein the linear transformation-grouping convolution module comprises a convolution layer and a grouping convolution layer, the output of the convolution layer being split by the grouping convolution layer into K transition states, yielding 2 transition-state image feature sets corresponding to the visible light and infrared images respectively.
  7. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 6, wherein the transition-state feature fusion module comprises a linear reshaping-segmentation unit, a multi-head attention unit, a linear layer and a residual connection layer arranged in sequence; the linear reshaping-segmentation unit comprises a convolution layer, a reshaping layer and a segmentation output head for each of the transition-state image feature sets of the visible light and infrared images; the transition-state image feature sets of the infrared and visible light branches are processed and output to the Q matrix of the multi-head attention unit, and the transition-state image feature sets of the visible light and infrared branches are processed and output to the K and V matrices of the multi-head attention unit; the output of the multi-head attention unit passes through the linear layer and the residual connection layer to obtain K state outputs corresponding to the upper-branch visible light image and the lower-branch infrared image; and these are aggregated and concatenated along the state dimension to obtain the fused encoding features of the transition-state images.
  8. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 7, wherein the gating forward propagation module comprises a linear expansion layer followed by a depth-separable convolution layer and an activation function layer; the output of the activation function layer is multiplied by the output of the linear expansion layer and input into a linear compression layer, and the output of the linear compression layer is fused with the original feature map through a residual connection to output the enhanced transition-state features.
  9. The method for multi-modal fusion of visible light and infrared images based on transition states of claim 8, wherein each enhanced transition state is fused with the primary features by a convolution layer, the result is concatenated along the channel dimension with the upsampled output of the decoder module of the previous layer, and the concatenated result is mapped to pixel space through a convolution layer, an activation function and a linear transformation layer to generate the final fused image.
  10. A system for multi-modal fusion of visible light and infrared images based on transition states, characterized by comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the processor for implementing the transition-state-based method for multi-modal fusion of visible light and infrared images of any one of claims 1-9.
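The sketches below illustrate one possible reading of the modules named in claims 2-9, in PyTorch. First, the claim 2 encoder layout: one shared shallow encoder, then per-modality base (low-frequency) and detail (high-frequency) encoders, plus the reconstruction decoder used for the claim 3 training losses. Channel widths, layer depths and activations are illustrative assumptions.

```python
import torch
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.LeakyReLU(0.1))

class DualBranchAutoencoder(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.shared = conv_block(1, ch)  # common shallow features (claim 2)
        # One base + one detail encoder per modality (2 of each, per claim 2).
        self.base = nn.ModuleDict({m: conv_block(ch, ch) for m in ("vis", "ir")})
        self.detail = nn.ModuleDict({m: conv_block(ch, ch) for m in ("vis", "ir")})
        self.decoder = nn.Sequential(conv_block(2 * ch, ch), nn.Conv2d(ch, 1, 1))

    def encode(self, x, modality):
        s = self.shared(x)
        return self.base[modality](s), self.detail[modality](s)  # low, high

    def forward(self, vis, ir):
        # Concatenate each modality's low- and high-frequency features and
        # decode, giving the reconstructions used for the claim 3 losses.
        rec = {}
        for name, img in (("vis", vis), ("ir", ir)):
            low, high = self.encode(img, name)
            rec[name] = self.decoder(torch.cat([low, high], dim=1))
        return rec["vis"], rec["ir"]
```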
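Next, the local branch of claim 5's convolution-attention fusion module: a convolution followed by a deformable convolution. The claim does not say how offsets are obtained; predicting them with a small convolution, as torchvision's DeformConv2d requires, is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class LocalBranch(nn.Module):
    """Hypothetical local branch: plain conv, then deformable conv."""

    def __init__(self, channels=64, k=3):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, k, padding=k // 2)
        # DeformConv2d needs a 2*k*k-channel offset map per spatial location.
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.conv(x)
        return self.deform(x, self.offset(x))
```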
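One reading of claim 6's linear transformation-grouping convolution module: a 1x1 convolution (the linear transformation) expands the features to K*C channels, a grouped convolution with groups=K processes each chunk independently, and the result is split channel-wise into K transition states. K and the channel width are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TransitionStateSplit(nn.Module):
    def __init__(self, channels=64, k_states=4):
        super().__init__()
        self.k = k_states
        self.expand = nn.Conv2d(channels, channels * k_states, 1)  # linear transform
        self.grouped = nn.Conv2d(channels * k_states, channels * k_states,
                                 3, padding=1, groups=k_states)    # per-state conv

    def forward(self, feats: torch.Tensor) -> list[torch.Tensor]:
        x = self.grouped(self.expand(feats))
        # Split channel-wise into K transition states for this modality.
        return list(torch.chunk(x, self.k, dim=1))
```

Applied once to the visible light features and once to the infrared features, this would yield the 2 transition-state image feature sets of claim 6.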
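For claim 7's transition-state feature fusion module, a sketch of the cross-modal attention: each state's feature map is reshaped to a token sequence, one branch supplies the query and the opposite branch the key/value of a multi-head attention, followed by a linear layer and a residual connection, with the fused states concatenated along the state (channel) dimension. Head count and shapes are assumptions.

```python
import torch
import torch.nn as nn

class TransitionStateFusion(nn.Module):
    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.proj = nn.Linear(channels, channels)

    @staticmethod
    def to_tokens(fmap):  # (B, C, H, W) -> (B, H*W, C)
        return fmap.flatten(2).transpose(1, 2)

    def fuse_one(self, q_map, kv_map):
        q, kv = self.to_tokens(q_map), self.to_tokens(kv_map)
        out, _ = self.attn(q, kv, kv)     # cross-modal multi-head attention
        out = self.proj(out) + q          # linear layer + residual connection
        return out.transpose(1, 2).reshape(q_map.shape)

    def forward(self, vis_states, ir_states):
        # One branch queries with infrared against visible-light keys/values,
        # the other the reverse; the K fused states of both branches are then
        # concatenated along the state dimension.
        fused = [self.fuse_one(i, v) for i, v in zip(ir_states, vis_states)]
        fused += [self.fuse_one(v, i) for v, i in zip(vis_states, ir_states)]
        return torch.cat(fused, dim=1)
```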
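Claim 8's gating forward propagation module maps fairly directly to code: a linear expansion, a depthwise-separable convolution plus activation that gates the expansion output by element-wise multiplication, a linear compression, and a residual connection to the input feature map. The expansion ratio and the GELU activation are assumptions.

```python
import torch
import torch.nn as nn

class GatedForward(nn.Module):
    def __init__(self, channels=64, ratio=2):
        super().__init__()
        hidden = channels * ratio
        self.expand = nn.Conv2d(channels, hidden, 1)      # linear expansion
        self.dwconv = nn.Sequential(                      # depthwise separable conv
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
            nn.Conv2d(hidden, hidden, 1))
        self.act = nn.GELU()                              # activation function layer
        self.compress = nn.Conv2d(hidden, channels, 1)    # linear compression

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        expanded = self.expand(x)
        gate = self.act(self.dwconv(expanded))
        # The gate multiplies the expansion output; the residual fuses the
        # result with the original feature map, yielding the enhanced
        # transition-state features of claim 8.
        return x + self.compress(gate * expanded)
```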
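Finally, a sketch of the claim 9 decoder head: each enhanced transition state is fused with the primary features by a convolution, concatenated channel-wise with the upsampled output of the previous decoder stage, then mapped to pixel space by a convolution, an activation and a linear (1x1) layer. The bilinear upsampling mode and channel widths are assumptions.

```python
import torch
import torch.nn as nn

class DecoderHead(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 3, padding=1)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, 1, 1))  # linear map to pixel space

    def forward(self, enhanced_state, primary, prev_decoder_out):
        # Fuse the enhanced transition state with the primary features, then
        # concatenate with the upsampled previous-layer decoder output.
        fused = self.fuse(torch.cat([enhanced_state, primary], dim=1))
        skip = self.up(prev_decoder_out)
        return self.head(torch.cat([fused, skip], dim=1))
```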

Description

Method and system for multi-mode fusion of visible light and infrared image based on transition state

Technical Field

The invention relates to the technical field of image and video recognition and understanding, and in particular to a method and system for multi-modal fusion of visible light and infrared images based on transition states.

Background

With the rapid development of intelligent driving technology, current application scenarios place higher demands on the environment perception capability of traffic scenes. Environmental awareness is a key prerequisite for core functions such as automatic driving, traffic monitoring and pedestrian protection. According to statistics, 80% of traffic accidents worldwide occur in low-light environments such as night driving, tunnels and dusk, with pedestrian collisions accounting for more than 65% of fatal accidents; in rain and fog in particular, the failure rate of conventional visible light cameras in vehicle identification reaches 40%. A single visible light sensor therefore struggles to stably acquire comprehensive environmental information, seriously affecting the accuracy and safety of system decisions. The visible light image sensor captures abundant detail such as color and texture and is suited to identifying traffic signals, lane lines and vehicle logos, but its imaging quality drops drastically at night, in poorly lit tunnels, in haze or in heavy rain, and key information such as pedestrian contours and obstacle edges is easily lost. In contrast, an infrared image sensor such as a thermal imager, by imaging the thermal radiation of objects, can clearly distinguish living objects (such as pedestrians and animals) from non-living objects (such as vehicles and buildings) in darkness and bad weather, and is a core path to breaking through the environmental perception bottleneck; however, infrared images lack color and texture detail, making information such as traffic signs and license plates difficult to identify accurately. Multi-modal fusion of visible light and infrared images has therefore become an important means of overcoming the environmental perception bottleneck in traffic scenes. By fusing the complementary information of the two modalities, the detail features of the visible light image can be preserved and the robustness of the infrared image exploited, providing more comprehensive environment perception for applications such as automatic driving path planning, abnormal event detection in traffic monitoring, and pedestrian collision early warning.
However, existing multi-modal fusion methods still have the following limitations in traffic scenes:

(1) Insufficient feature alignment: the modal gap between visible light and infrared images is large (imaging principle, pixel distribution, etc.), and traditional fusion methods such as weighted averaging and feature concatenation struggle to align low-level features accurately, causing blur and artifacts in the fusion result and degrading the edge integrity of targets such as vehicles and pedestrians.

(2) Frequency-domain feature coupling distortion: in a traffic scene, low-frequency structural features (such as road topology and overall vehicle shape) and high-frequency detail features (such as flashing turn signals and pedestrian gestures) must be preserved accurately at the same time, but existing deep models such as Transformers use a single encoder for both modalities, easily causing coupling distortion of frequency-domain features.

(3) Insufficient enhancement of key targets: pedestrians and vehicles are the targets that most require attention in a traffic scene, but existing fusion methods often treat global features uniformly and cannot selectively enhance the salient features of the target region, such as the thermal radiation signature of pedestrians or the contours of vehicles, reducing the accuracy of downstream tasks such as target detection.

(4) Loss of state continuity: dynamic processes such as a vehicle travelling at high speed or a pedestrian suddenly crossing require the perception system to capture the continuous evolution of the target state, but the prior art adopts discrete fusion strategies (such as dual-channel attention) that cannot model the continuous transition from infrared-feature dominance to visible-light-feature dominance.

To solve the above problems, a fusion method is needed that can precisely align visible light and infrared features, dynamically capture the rules of modal evolution, and selectively enhance key traffic targets.