
CN-122023986-A - Transformer-based infrared and visible light fusion target detection system

CN122023986A

Abstract

The invention belongs to the field of computer vision and image processing, and particularly relates to a Transformer-based target detection system that fuses infrared and visible light. The system is characterized in that the fusion coding unit contains a dual-path complementary attention fusion module with two parallel cross-attention sub-layers: the first queries the visible light features with the infrared features, and the second queries the infrared features with the visible light features, producing bidirectional cross-modal context features that are then combined by a feature aggregation sub-layer. In addition, the system refines features by adding modality-specific biases before cross-attention, and is trained with a loss function that contains a contrastive complementary regularization term. Through this explicit bidirectional cross-attention mechanism and the dedicated training constraint, the network is guided to adaptively discover and fuse complementary information between the infrared and visible light modalities, ultimately achieving highly robust target detection.
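The dual-path cross-attention described in the abstract can be sketched as follows. This is a minimal NumPy illustration under assumed shapes (single head, no batch, learned Q/K/V projections omitted); it is not the patented implementation, only an instance of the mechanism the abstract names: each modality queries the other, and the two context features are concatenated and linearly fused.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats, d):
    # query_feats: (N, d) tokens of the querying modality
    # context_feats: (M, d) tokens of the other modality (keys and values)
    scores = query_feats @ context_feats.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ context_feats

d = 64
ir = np.random.randn(100, d)    # infrared tokens (hypothetical shapes)
vis = np.random.randn(100, d)   # visible light tokens

# modality-specific learnable biases applied before cross-attention
b_ir, b_vis = np.random.randn(d) * 0.01, np.random.randn(d) * 0.01
ir_ref, vis_ref = ir + b_ir, vis + b_vis

# first sub-layer: infrared queries attend to visible light context
ctx_ir2vis = cross_attention(ir_ref, vis_ref, d)
# second sub-layer: visible light queries attend to infrared context
ctx_vis2ir = cross_attention(vis_ref, ir_ref, d)

# feature aggregation: channel concatenation + linear projection
W = np.random.randn(2 * d, d) * 0.01
fused = np.concatenate([ctx_ir2vis, ctx_vis2ir], axis=-1) @ W
print(fused.shape)  # (100, 64)
```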

Inventors

  • TIAN WEIXIN
  • CHEN SHIZUO

Assignees

  • China Three Gorges University (三峡大学)

Dates

Publication Date
2026-05-12
Application Date
2026-01-30

Claims (10)

  1. A Transformer-based infrared and visible light fusion target detection system, characterized by comprising: a feature extraction unit for extracting an infrared feature map and a visible light feature map from an input infrared image and visible light image respectively, obtaining feature maps of the two modalities; a fusion coding unit for deeply fusing the infrared feature map and the visible light feature map, comprising at least one dual-path complementary attention fusion module, wherein the dual-path complementary attention fusion module comprises an intra-modality attention refinement layer, a first cross-attention sub-layer, a second cross-attention sub-layer and a feature aggregation sub-layer; the intra-modality attention refinement layer is configured to perform a self-attention operation on an input modality feature map, applying a bias vector corresponding to the modality type to at least one of the query vector, key vector and value vector, and to output modality-refined features that serve as the sources of the query, key and value features in the corresponding cross-attention sub-layer; the first cross-attention sub-layer is configured to receive the infrared feature map as the source of query features and the visible light feature map as the source of key and value features, perform a first cross-attention operation based on these query, key and value features, and output a first cross-modal context feature; the second cross-attention sub-layer is configured to receive the visible light feature map as the source of query features and the infrared feature map as the source of key and value features, perform a second cross-attention operation based on these query, key and value features, and output a second cross-modal context feature; the feature aggregation sub-layer is configured to receive the first cross-modal context feature and the second cross-modal context feature and aggregate the two into a fused feature map; and a detection decoding unit for processing the fused feature map and outputting a target detection result.
  2. The Transformer-based infrared and visible light fusion target detection system according to claim 1, wherein the feature extraction unit comprises a parameter-sharing vision Transformer backbone network, and its processing procedure comprises: dividing the infrared image and the visible light image each into a sequence of image patches; mapping the infrared patch sequence and the visible light patch sequence into a first feature sequence and a second feature sequence respectively through the same linear projection layer; adding a learnable infrared modality embedding vector to the first feature sequence and a learnable visible light modality embedding vector to the second feature sequence; and inputting the first and second feature sequences, with the modality embedding vectors added, into subsequent shared Transformer encoding layers for processing, so as to extract the infrared feature map and the visible light feature map.
  3. The Transformer-based infrared and visible light fusion target detection system of claim 2, wherein a learnable position embedding vector is added to the first and second feature sequences respectively, after the linear projection and before the modality embedding vectors are added.
  4. The Transformer-based infrared and visible light fusion target detection system of claim 1, wherein the intra-modality attention refinement layer applies the bias vector to all of the query vector, the key vector and the value vector, and wherein the bias vectors applied to the query vector, the key vector and the value vector are three different learnable vectors.
  5. The Transformer-based infrared and visible light fusion target detection system of claim 1, wherein the feature aggregation sub-layer is specifically configured to perform a bi-directionally guided dynamic aggregation operation, comprising: generating, based on the first cross-modal context feature, a first spatial weight map for modulating visible light features; modulating the visible light feature map with the first spatial weight map to obtain a first enhancement feature; generating, based on the second cross-modal context feature, a second spatial weight map for modulating infrared features; modulating the infrared feature map with the second spatial weight map to obtain a second enhancement feature; and combining the first cross-modal context feature, the second cross-modal context feature, the first enhancement feature and the second enhancement feature to generate the fused feature map.
  6. The Transformer-based infrared and visible light fusion target detection system according to claim 5, wherein the combining specifically comprises: concatenating the first cross-modal context feature, the second cross-modal context feature, the first enhancement feature and the second enhancement feature along the channel dimension, and fusing them through a convolution layer to generate the fused feature map.
  7. The Transformer-based infrared and visible light fusion target detection system of claim 1, wherein the fusion coding unit comprises a plurality of sequentially connected dual-path complementary attention fusion modules; for the k-th dual-path complementary attention fusion module, where k is an integer greater than 1, the infrared feature map and visible light feature map it receives are obtained by passing the fused feature map output by the (k-1)-th dual-path complementary attention fusion module through a feature decoupling layer; the feature decoupling layer is configured to perform a learnable linear transformation on the input fused feature map to separate out two sets of feature components that respectively characterize the infrared modality and the visible light modality.
  8. The Transformer-based infrared and visible light fusion target detection system of claim 1, wherein the detection decoding unit is an iteratively refined decoder whose operation comprises: initializing a set of detection query vectors; performing multiple decoding iterations, in each of which the current round's detection query vectors interact with the fused feature map, the current round's target detection predictions are output, and residual vectors for updating the detection query vectors are generated; and adding the residual vectors to the current round's detection query vectors to obtain the detection query vectors used in the next iteration.
  9. The Transformer-based infrared and visible light fusion target detection system of claim 1, wherein the feature extraction unit, fusion coding unit and detection decoding unit are trained with a loss function comprising a contrastive complementary regularization term; the contrastive complementary regularization term is calculated from the first cross-modal context feature and the second cross-modal context feature, and is constructed as follows: selecting, from the first and second cross-modal context features, two feature vectors at corresponding spatial positions to form a positive sample pair; selecting two feature vectors at non-corresponding spatial positions to form a negative sample pair; during training, the contrastive complementary regularization term reduces the distance between the two feature vectors in each positive sample pair and increases the distance between the two feature vectors in each negative sample pair.
  10. A detection method using the Transformer-based infrared and visible light fusion target detection system according to any one of claims 1-9, comprising the steps of: S1, preparing and preprocessing a dataset; S2, constructing a multi-modal target detection network comprising the feature extraction unit, the fusion coding unit and the detection decoding unit; S3, training and optimizing the multi-modal target detection network; S4, verifying the accuracy of the multi-modal target detection network; and S5, using the multi-modal target detection network for target detection in real-world scenarios.
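The bi-directionally guided dynamic aggregation of claim 5 can be sketched in NumPy as follows. The sigmoid per-position gating and the matrix-multiply stand-in for a 1x1 convolution are assumptions for illustration; the claim fixes the data flow (context feature, spatial weight map, modulation, four-way channel concatenation, convolutional fusion) but not these particular operator choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
N, d = 49, 64                          # e.g. 7x7 spatial positions, channel dim
ctx_ir2vis = rng.normal(size=(N, d))   # first cross-modal context feature
ctx_vis2ir = rng.normal(size=(N, d))   # second cross-modal context feature
vis_feats = rng.normal(size=(N, d))    # visible light feature map (flattened)
ir_feats = rng.normal(size=(N, d))     # infrared feature map (flattened)

# spatial weight maps derived from the context features
# (assumed: per-position scalar gate via a learned projection + sigmoid)
w1 = rng.normal(size=(d, 1)) * 0.1
w2 = rng.normal(size=(d, 1)) * 0.1
weight_vis = sigmoid(ctx_ir2vis @ w1)  # (N, 1), modulates visible light
weight_ir = sigmoid(ctx_vis2ir @ w2)   # (N, 1), modulates infrared

enhanced_vis = weight_vis * vis_feats  # first enhancement feature
enhanced_ir = weight_ir * ir_feats     # second enhancement feature

# concatenate all four along channels, fuse with a 1x1-conv-like matrix
stacked = np.concatenate(
    [ctx_ir2vis, ctx_vis2ir, enhanced_vis, enhanced_ir], axis=-1)
W_fuse = rng.normal(size=(4 * d, d)) * 0.05
fused = stacked @ W_fuse
print(fused.shape)  # (49, 64)
```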
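The contrastive complementary regularization term of claim 9 can be sketched as an InfoNCE-style loss, since the claim specifies positives (corresponding spatial positions across the two cross-modal context features) and negatives (non-corresponding positions) but not an exact formula; this is one plausible instantiation, not the patented loss.

```python
import numpy as np

def contrastive_complementary_loss(ctx_a, ctx_b, tau=0.1):
    """InfoNCE-style sketch: positives are same-position feature pairs
    across the two cross-modal context maps; all other positions act
    as negatives. ctx_a, ctx_b: (N, d) flattened spatial features."""
    a = ctx_a / np.linalg.norm(ctx_a, axis=1, keepdims=True)
    b = ctx_b / np.linalg.norm(ctx_b, axis=1, keepdims=True)
    logits = a @ b.T / tau                        # (N, N) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positive pairs sit on the diagonal (corresponding spatial positions)
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
ctx1 = rng.normal(size=(49, 64))
ctx2 = ctx1 + 0.05 * rng.normal(size=(49, 64))  # nearly aligned positives
loss = contrastive_complementary_loss(ctx1, ctx2)
```

Minimizing this term pulls corresponding-position pairs together and pushes non-corresponding pairs apart, matching the stated role of the regularizer during training.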

Description

Transformer-based infrared and visible light fusion target detection system

Technical Field

The invention belongs to the field of computer vision and image processing, and particularly relates to a Transformer-based target detection system that fuses infrared and visible light.

Background

Fusion detection of infrared and visible light images aims to jointly exploit the infrared image's sensitivity to thermal radiation and the visible light image's rich texture detail, so as to improve target perception performance in complex environments. The technology has application value in security monitoring, autonomous driving, military reconnaissance, night operations and other scenarios. Multi-modal target detection methods based on deep learning are currently being widely studied. According to their fusion strategies and core architectures, the prior art schemes can be divided into the following main types. The first type comprises feature-level fusion methods based on convolutional neural networks. For example, Chinese patent document CN116452937A discloses a multi-modal target detection method based on dynamic convolution and an attention mechanism: after the bimodal features are separately extracted, weighted fusion is performed using channel and spatial attention mechanisms, and feature refinement is performed with a residual network. Such methods rely mainly on the local perceptual characteristics of convolution operations, and the fusion process is essentially a spatial or channel weighted combination of features, which makes it difficult to explicitly model the long-range semantic dependencies and structural complementarity relationships across modalities. The second type comprises fusion methods based on the Transformer architecture, which attempt to capture global context using self-attention mechanisms.
For example, Chinese patent document CN117542020A discloses a multi-modal object detection method based on infrared and visual images. On top of convolutional neural network feature extraction, the method introduces a local-window Transformer module to perform feature interaction and aggregation. However, its attention mechanisms are still focused on contextual modeling within a modality or a local region, without building explicit cross-modal interaction pathways. The dual-modality features are simply concatenated and fed into the attention module, and without a mechanism that guides the network to distinguish and exploit the asymmetric complementary information between the two modalities, information confusion and feature degradation can result. The third type comprises methods that introduce contrastive learning to optimize feature representations. For example, Chinese patent document CN118154844A discloses a multi-modal target detection method based on contrastive learning, which constructs positive and negative sample pairs and uses a contrastive loss function to pull intra-class features together and push inter-class features apart. However, the contrastive learning in that method is applied mainly at the level of the overall feature representation and is not specifically targeted at the key link of cross-modal complementarity, so it is difficult for it to directly guide the fusion network to focus on mining and exploiting discriminative complementary information between modalities. In summary, the following main problems remain in the prior art: the fusion mechanism provides insufficient guidance toward complementarity, since existing fusion methods are mostly limited to weighted combination or simple interaction of features and lack an explicit, structured design that guides the network to actively establish deep complementary relations between infrared and visible light features.
Generic attention mechanisms are poorly suited to heterogeneous fusion: when a universal self-attention mechanism processes bimodal features, it is difficult for it to adaptively distinguish modality-specific information from cross-modal complementary information, and key information is easily diluted or interfered with. The training objective lacks a dedicated constraint on the fusion process: the loss functions of existing methods usually take only the final detection accuracy as the optimization target and lack a regularization mechanism that directly constrains the fusion process and focuses it on mining effective complementary information, so the fusion result may be sub-optimal. Therefore, a new technical scheme is needed that can realize deep complementary fusion between the infrared and visible light modalities explicitly and efficiently through co-optimization at both the model architecture and training strategy levels, so as to improve the robustness of target detection.