
CN-122024167-A - Infrared target detection method and device based on double-channel feature fusion

CN 122024167 A

Abstract

The infrared target detection method comprises: acquiring an infrared image sequence; processing the infrared image sequence to generate a motion feature map sequence, the motion feature maps having the same spatial resolution and pixel alignment relation as the corresponding infrared images; performing multi-scale feature extraction on the infrared images with a pre-constructed appearance feature extraction model to obtain an appearance feature set; performing multi-scale feature extraction on the motion feature maps with a pre-constructed motion feature extraction model to obtain a motion feature set; performing bidirectional cross-modal fusion between the appearance features in the appearance feature set and the motion features in the motion feature set to obtain enhanced appearance features and enhanced motion features; and performing target decoding, through a Transformer-based detection decoder, on the fusion features obtained from the enhanced appearance features and the enhanced motion features to obtain a target detection result.

Inventors

  • GAO JIN
  • LIU NIAN
  • LI WENJUAN
  • WANG BO
  • HU WEIMING

Assignees

  • Institute of Automation, Chinese Academy of Sciences (中国科学院自动化研究所)

Dates

Publication Date
2026-05-12
Application Date
2026-02-11

Claims (10)

  1. An infrared target detection method, comprising: acquiring an infrared image sequence; processing the infrared image sequence to generate a motion feature map sequence, wherein each motion feature map in the motion feature map sequence has the same spatial resolution and pixel alignment relation as the corresponding infrared image in the infrared image sequence; performing multi-scale feature extraction on the infrared images by using a pre-constructed appearance feature extraction model to obtain an appearance feature set, and performing multi-scale feature extraction on the motion feature maps by using a pre-constructed motion feature extraction model to obtain a motion feature set; performing bidirectional cross-modal fusion between the appearance features in the appearance feature set and the motion features in the motion feature set to obtain enhanced appearance features and enhanced motion features; and performing target decoding on the fusion feature pyramid obtained from the enhanced appearance features and the enhanced motion features by a Transformer-based detection decoder to obtain a target detection result.
  2. The infrared target detection method of claim 1, wherein processing the infrared image sequence to generate a motion feature map sequence comprises processing the infrared image sequence with a recursive spatiotemporal differential filter model to generate the motion feature map sequence.
  3. The infrared target detection method of claim 2, wherein processing the infrared image sequence through the recursive spatiotemporal differential filter model comprises: performing adaptive threshold processing on the infrared image to obtain an input response state; processing the input response state through Gaussian kernel convolution and a lateral inhibition mechanism to obtain a filtered state; computing the center-surround contrast of the filtered state to obtain a contrast map and a contrast state; updating the temporal difference between the contrast map of the current frame and the historical state of the previous frame through exponential smoothing to obtain a temporal difference state; fusing the spatial motion component derived from the contrast state with the temporal motion component represented by the temporal difference state, and generating, through thresholding and nonlinear enhancement, a motion feature map pixel-aligned with the infrared image; and repeating these steps until all infrared images in the infrared image sequence have been traversed, thereby obtaining the motion feature map sequence.
  4. The infrared target detection method of claim 1, wherein the appearance feature extraction model and the motion feature extraction model adopt the same network architecture with independent parameters, the network architecture is a residual neural network, and the extracted multi-scale features comprise three levels P3, P4 and P5, with successively lower resolution and successively higher semantic level.
  5. The infrared target detection method of claim 1, wherein performing bidirectional cross-modal fusion between the appearance features in the appearance feature set and the motion features in the motion feature set to obtain enhanced appearance features and enhanced motion features comprises: at the same scale, taking the appearance features as queries and the motion features as keys and values to generate a first, appearance-to-motion attention output, and taking the motion features as queries and the appearance features as keys and values to generate a second, motion-to-appearance attention output; fusing the first attention output with the appearance features to generate the enhanced appearance features; and fusing the second attention output with the motion features to generate the enhanced motion features.
  6. The infrared target detection method of claim 5, wherein generating the first attention output and the second attention output at the same scale comprises: performing adaptive spatial pooling on the appearance features and the motion features respectively, reducing their spatial resolutions to a preset fixed size; adding position encodings to the pooled appearance features and the pooled motion features respectively, and flattening each into a one-dimensional feature sequence; taking the flattened appearance feature sequence as queries and the flattened motion feature sequence as keys and values, and computing the first attention output through a cross-attention mechanism; and taking the flattened motion feature sequence as queries and the flattened appearance feature sequence as keys and values, and computing the second attention output through a cross-attention mechanism.
  7. The infrared target detection method of claim 6, wherein the first attention output and the second attention output are computed with a multi-head attention mechanism, and wherein layer normalization and a feed-forward network are applied after the attention computation.
  8. The infrared target detection method of claim 1, wherein performing target decoding on the fusion feature pyramid obtained from the enhanced appearance features and the enhanced motion features by the Transformer-based detection decoder to obtain the target detection result comprises: concatenating the enhanced appearance features and the enhanced motion features into a fusion feature pyramid; passing the fusion feature pyramid sequentially through a top-down feature pyramid network and a bottom-up path aggregation network to generate an enhanced fusion feature pyramid; inputting the enhanced fusion feature pyramid to the Transformer-based detection decoder, wherein the detection decoder performs multiple rounds of cross-attention interaction with the enhanced fusion feature pyramid using a set of learnable object query vectors and updates the query representations layer by layer to obtain final object query representations; and mapping the final object query representations through a detection head to obtain the target detection result, wherein the target detection result comprises the bounding-box center coordinates, width, height and classification confidence of each detected target.
  9. The infrared target detection method of claim 8, wherein the detection head comprises two parallel fully connected sub-networks for bounding-box regression and target classification respectively, and mapping the final object query representations through the detection head to obtain the target detection result comprises: inputting the final object query representations to the parallel bounding-box regression fully connected sub-network and target classification fully connected sub-network respectively; processing the final object query representations with the bounding-box regression fully connected sub-network, outputting four-dimensional bounding-box parameters, and applying a Sigmoid activation to obtain the normalized bounding-box center abscissa, center ordinate, width and height; processing the final object query representations in a single-class target detection mode with the target classification fully connected sub-network, outputting a one-dimensional score, and applying a Sigmoid activation to obtain the classification confidence; inversely normalizing the normalized bounding-box center abscissa, center ordinate, width and height to obtain the bounding-box center abscissa, center ordinate, width and height in pixel units; and combining the pixel-unit bounding-box center abscissa, center ordinate, width and height with the classification confidence to form the target detection result.
  10. An infrared target detection device, comprising: a memory storing a program or instructions; and a processor, wherein the program or instructions, when executed by the processor, cause the processor to perform the infrared target detection method according to any one of claims 1-9.
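The bidirectional cross-modal fusion of claims 5-7 can be sketched in NumPy. This is a minimal single-head illustration only: the learned query/key/value projections, adaptive pooling, position encodings, multi-head splitting, layer normalization and feed-forward network of claims 6-7 are omitted, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d):
    # Scaled dot-product cross attention: query tokens attend to key_value tokens.
    scores = query @ key_value.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ key_value

def bidirectional_fusion(appearance, motion):
    """Single-head sketch of bidirectional cross-modal fusion (claims 5-6).

    appearance, motion: (N, d) flattened token sequences at the same scale.
    Returns residually fused enhanced features for both streams.
    """
    d = appearance.shape[1]
    # Appearance-to-motion: appearance as queries, motion as keys and values.
    attn_a2m = cross_attention(appearance, motion, d)
    # Motion-to-appearance: motion as queries, appearance as keys and values.
    attn_m2a = cross_attention(motion, appearance, d)
    # Fuse each attention output with the original features (residual form).
    enhanced_appearance = appearance + attn_a2m
    enhanced_motion = motion + attn_m2a
    return enhanced_appearance, enhanced_motion
```

In the full method, each scale (P3, P4, P5) would be fused independently with its own attention parameters.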

Description

Infrared target detection method and device based on double-channel feature fusion

Technical Field

The disclosure relates to the field of computer vision, and in particular to an infrared target detection method and device based on double-channel feature fusion.

Background

Targets in long-range infrared imaging generally have characteristics such as small size, low contrast and weak signal-to-noise ratio; they lack clear shape, texture or obvious brightness features and are easily submerged in complex background clutter, which poses great challenges for existing infrared small-target detection methods. Existing infrared small-target detection methods are mainly divided into single-frame methods and multi-frame methods. Single-frame methods use only the spatial information of a single image and can hardly distinguish targets from clutter in complex backgrounds. Although multi-frame methods can use temporal information to improve detection performance, they generally have the following problems: 1) Implicit motion modeling: existing multi-frame methods usually learn spatiotemporal features implicitly through deep neural networks, so motion information is not expressed explicitly enough and the discrimination capability of the motion features is limited. 2) Need for additional supervision: some methods improve motion characterization by introducing semantic motion descriptions (e.g., target position, velocity, direction) as additional supervision, which adds significant labeling cost. 3) Feature alignment: multi-frame methods must handle feature alignment between different frames, often requiring special alignment modules or optical-flow estimation networks, which increases model complexity. 4) Insufficient fusion of appearance and motion features: existing methods mostly fuse appearance and motion features by simple concatenation or addition and lack a deep feature interaction mechanism.

Disclosure of Invention

It is an object of the present disclosure to provide an infrared target detection method that eliminates the need for additional labeling. It is an object of the present disclosure to provide an infrared target detection method that eliminates the need for an alignment module. It is an object of the present disclosure to provide an infrared target detection method that improves the discrimination capability of features. According to a first aspect of the disclosure, an infrared target detection method comprises: obtaining an infrared image sequence; processing the infrared image sequence to generate a motion feature map sequence, the motion feature maps in the motion feature map sequence having the same spatial resolution and pixel alignment relation as the corresponding infrared images in the infrared image sequence; performing multi-scale feature extraction on the infrared images with a pre-constructed appearance feature extraction model to obtain an appearance feature set; performing multi-scale feature extraction on the motion feature maps with a pre-constructed motion feature extraction model to obtain a motion feature set; performing bidirectional cross-modal fusion between the appearance features in the appearance feature set and the motion features in the motion feature set to obtain enhanced appearance features and enhanced motion features; and performing target decoding on the fusion feature pyramid obtained from the enhanced appearance features and the enhanced motion features with a Transformer-based detection decoder to obtain a target detection result.
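The final detection-head mapping of claim 9 (Sigmoid activation followed by inverse normalization to pixel units) can be illustrated with a small NumPy sketch; the query count, image size and function name below are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_queries(box_logits, cls_logits, img_w, img_h):
    """Sketch of the detection-head mapping of claim 9.

    box_logits: (Q, 4) raw outputs of the bounding-box regression sub-network.
    cls_logits: (Q, 1) raw outputs of the single-class classification sub-network.
    Returns per-query (cx, cy, w, h) in pixel units plus classification confidence.
    """
    # Sigmoid yields normalized center-x, center-y, width and height in [0, 1].
    boxes = sigmoid(box_logits)
    # Inverse normalization back to pixel units.
    scale = np.array([img_w, img_h, img_w, img_h], dtype=float)
    boxes_px = boxes * scale
    # Sigmoid on the one-dimensional score gives the classification confidence.
    conf = sigmoid(cls_logits)[:, 0]
    return np.concatenate([boxes_px, conf[:, None]], axis=1)
```

For zero logits, each normalized coordinate is 0.5, so the decoded boxes sit at the image center at half the image size, with confidence 0.5.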
Optionally, processing the infrared image sequence to generate the motion feature map sequence may include processing the infrared image sequence through a recursive spatiotemporal differential filter model to generate the motion feature map sequence. Optionally, processing the infrared image sequence through the recursive spatiotemporal differential filter model may include: performing adaptive threshold processing on the infrared image to obtain an input response state; processing the input response state through Gaussian kernel convolution and a lateral inhibition mechanism to obtain a filtered state; computing the center-surround contrast of the filtered state to obtain a contrast map and a contrast state; updating the temporal difference between the contrast map of the current frame and the historical state of the previous frame through exponential smoothing to obtain a temporal difference state; fusing spatial motion
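The recursive spatiotemporal differential filter steps above can be sketched as follows. This is an illustrative NumPy approximation under assumptions of our own: the kernel sizes, smoothing factor and thresholds are not specified by the patent, and the lateral-inhibition and center-surround operations are modeled as difference-of-Gaussians.

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2 * sigma**2))
    return k / k.sum()

def conv2d(img, kernel):
    # Naive same-size 2-D convolution with edge padding, for the sketch only.
    kh, kw = kernel.shape
    p = np.pad(img, ((kh // 2, kh // 2), (kw // 2, kw // 2)), mode="edge")
    out = np.zeros_like(img, dtype=float)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = (p[i:i + kh, j:j + kw] * kernel).sum()
    return out

def motion_feature_maps(frames, tau=0.3, alpha=0.6, thresh=0.05):
    """Recursive spatiotemporal differential filter sketch (claim 3).

    frames: list of 2-D float arrays in [0, 1]. All constants are
    illustrative assumptions, not values from the patent.
    """
    g_center = gaussian_kernel(3, 0.8)
    g_surround = gaussian_kernel(7, 2.0)
    prev_state = None
    outputs = []
    for frame in frames:
        # 1) Adaptive thresholding -> input response state.
        response = np.where(frame > frame.mean() + tau * frame.std(), frame, 0.0)
        # 2) Gaussian convolution with lateral inhibition -> filtered state.
        filtered = conv2d(response, g_center) - 0.5 * conv2d(response, g_surround)
        # 3) Center-surround contrast -> contrast map / contrast state.
        contrast = np.clip(conv2d(filtered, g_center) - conv2d(filtered, g_surround), 0, None)
        # 4) Exponentially smoothed temporal difference with the previous state.
        temporal = np.zeros_like(contrast) if prev_state is None else alpha * np.abs(contrast - prev_state)
        prev_state = contrast
        # 5) Fuse spatial and temporal components; threshold and nonlinearly enhance.
        fused = contrast + temporal
        outputs.append(np.tanh(np.where(fused > thresh, fused, 0.0)))
    return outputs  # one motion feature map per frame, pixel-aligned
```

Each output map has the same spatial resolution as its input frame, matching the pixel alignment requirement of claim 1.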