CN-121767888-B - Multi-mode progressive fusion aerial small target image detection method

CN121767888B

Abstract

The invention relates to the technical field of computer vision and discloses a multi-mode progressive fusion aerial small target image detection method. The method comprises the steps of collecting aerial small target images, inputting them into an improved cross-modal target detection model, and outputting the position, category and confidence of each small target predicted in the images. The cross-modal target detection model comprises a backbone network, a neck network and a detection head. The backbone network comprises a visible light branch, an infrared branch, four dynamic enhancement modules and a multi-scale small target cross-screening mechanism module that finally fuses the visible light branch and the infrared branch; the neck network replaces the C3k2 modules of the YOLO11 network with the multi-scale small target cross-screening mechanism module. With bimodal data input simultaneously, the method constructs a cross-modal target detection model, extracts and fuses dual-stream features, and realizes same-dimension feature enhancement and cross-dimension feature fusion at different stages.

Inventors

  • CHENG QING
  • JIANG YAN
  • WANG DECHAO
  • GAO YUAN
  • QIU YUN
  • GAO ZENG
  • HU YAN

Assignees

  • Civil Aviation Flight University of China (中国民用航空飞行学院)

Dates

Publication Date
2026-05-12
Application Date
2026-03-02

Claims (2)

  1. A multi-mode progressive fusion aerial small target image detection method, characterized by comprising the following steps: inputting the aerial small target image into a cross-modal target detection model improved on the basis of the YOLO11 network, and outputting the position, the category and the confidence of each small target predicted in the aerial small target image; the cross-modal target detection model comprises a backbone network, a neck network and a detection head, wherein the backbone network comprises a visible light branch, an infrared branch, four dynamic enhancement modules and a multi-scale small target cross-screening mechanism module that finally fuses the visible light branch and the infrared branch;

     the processing flow of each dynamic enhancement module is as follows: the feature F_rgb of the visible light branch and the feature F_ir of the infrared branch input to the dynamic enhancement module satisfy F_rgb ∈ R^(B×C×H×W) and F_ir ∈ R^(B×C×H×W), wherein R is the real number set, B is the batch size, C is the number of channels, H is the height, and W is the width; a dynamic convolution kernel weight W_k is generated as W_k = Softmax(Conv_2(ReLU(Conv_1(AvgPool(F_rgb + F_ir))))), wherein the dimension of W_k corresponds to the size of the dynamic convolution kernel, Softmax is the Softmax activation function, Conv_2 is the second-layer convolution, ReLU is the ReLU activation function, Conv_1 is the first-layer convolution, and AvgPool is the global pooling operation; a multi-head cross attention mechanism takes the feature F_rgb of the visible light branch as the query Q and the feature F_ir of the infrared branch as the key K and the value V, calculates the similarity of Q and K, weights and aggregates V, and rearranges the result back to the spatial dimensions [B, C, H, W], namely F_att = Attention(Q, K, V), wherein F_att is the interaction feature and Attention is the multi-head cross attention operation; the interaction feature F_att is enhanced with the dynamic convolution kernel weight W_k, namely F_dyn = W_k ⊛ F_att, wherein F_dyn is the shared dynamic feature and ⊛ is the dynamic convolution operation; gating weights are calculated as [G_rgb, G_ir] = Sigmoid(Conv(Concat(F_rgb, F_ir))), wherein G_rgb is the weight of the visible light branch, G_ir is the weight of the infrared branch, Sigmoid is the Sigmoid activation function, Concat is the splicing operation, and Conv is the convolution operation; the shared dynamic feature is gated and added back to the original features as a residual, namely F'_rgb = F_rgb + G_rgb ⊙ F_dyn and F'_ir = F_ir + G_ir ⊙ F_dyn, wherein F'_rgb is the visible light branch feature output by the dynamic enhancement module and F'_ir is the infrared branch feature output by the dynamic enhancement module;

     the processing flow of each multi-scale small target cross-screening mechanism module is as follows: the image feature F input to the module is divided into a mainstream feature F_pri and a complementary feature F_comp, and the feature matrices exclusive to each are calculated: M(F_pri) = Σ_(n∈{3,5,9}) W_(n×1) * F_pri and (Q_pri, K_pri, V_pri) = φ(LN(M(F_pri))), wherein M(F_pri) is the aggregate feature of the mainstream feature, W_(n×1) with n = 3, 5, 9 denotes a vertical bar convolution with convolution kernel size n×1, LN is layer normalization, Q_pri, K_pri and V_pri are the query, key and value matrices exclusive to F_pri, and φ is a linear mapping realized by a 1×1 convolution; M(F_comp) = Σ_(n∈{3,5,9}) W_(1×n) * F_comp and (Q_comp, K_comp, V_comp) = φ(LN(M(F_comp))), wherein M(F_comp) is the aggregate feature of the complementary feature, W_(1×n) with n = 3, 5, 9 denotes a horizontal bar convolution with convolution kernel size 1×n, and Q_comp, K_comp and V_comp are the query, key and value matrices exclusive to F_comp; the mainstream attention map A_1 and the complementary attention map A_2 are calculated as A_1 = Softmax(Q_pri · K_comp^T / √d_n) and A_2 = Softmax(Q_comp · K_pri^T / √d_n), wherein Softmax is the Softmax activation function, d_n is the scaling factor, and ^T denotes matrix transposition; the mainstream feature subset Z_1 and the complementary feature subset Z_2 are calculated as Z_1 = ψ(A_1 · V_comp) and Z_2 = ψ(A_2 · V_pri), wherein ψ denotes a mapping layer that reduces the channel dimension from C to C/2; finally O_final = Concat(Z_1, Z_2), wherein O_final denotes the tensor output by the multi-scale small target cross-screening mechanism module.
  2. The multi-mode progressive fusion aerial small target image detection method according to claim 1, wherein the four dynamic enhancement modules in the backbone network are denoted DRFA-1, DRFA-2, DRFA-3 and DRFA-4 respectively, and the multi-scale small target cross-screening mechanism module is denoted MS_CSM; the visible light branch comprises, connected in sequence, IN-1, Multiin-1, Conv-2, C3k2-1, Conv-3, C3k2-2, Conv-4, C3k2-3, Conv-5, C3k2-4, SPPF-1 and C2PSA-1; the infrared branch comprises, connected in sequence, IN-2, Multiin-2, Conv-6, Conv-7, C3k2-5, Conv-8, C3k2-6, MS_CSM-1, Conv-9, C3k2-7, MS_CSM-2, Conv-10, C3k2-8, SPPF-2 and C2PSA-2; the output end of C3k2-1 and the output end of C3k2-5 are respectively connected with the input end of DRFA-1, and the output end of DRFA-1 is respectively connected with the input end of Conv-3 and the input end of Conv-8; the output end of C3k2-2 and the output end of C3k2-6 are respectively connected with the input end of DRFA-2, the output end of DRFA-2 is respectively connected with the input end of Conv-4 and the input end of MS_CSM-1, and the output end of C3k2-2 is also connected with the input end of MS_CSM-1; the output end of C3k2-3 and the output end of C3k2-7 are respectively connected with the input end of DRFA-3, the output end of DRFA-3 is respectively connected with the input end of Conv-5 and the input end of MS_CSM-2, and the output end of C3k2-6 is also connected with the input end of MS_CSM-2; the output end of C2PSA-1 and the output end of C2PSA-2 are respectively connected with the input end of DRFA-4.
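The per-step equations of the dynamic enhancement module in claim 1 culminate in a gated residual fusion. A minimal NumPy sketch of that final step follows; the names `drfa_fuse` and `w_gate` are illustrative and not from the patent, and the learned attention and dynamic-convolution stages are collapsed into a labeled placeholder so the sketch stays self-contained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def drfa_fuse(f_rgb, f_ir, w_gate):
    """Sketch of the gated residual fusion at the end of a dynamic
    enhancement (DRFA) module: a shared dynamic feature is gated per
    branch and added back to each branch as a residual.
    f_rgb, f_ir: [B, C, H, W] branch features.
    w_gate: illustrative scalar stand-in for the learned gating Conv."""
    # Placeholder for F_dyn = W_k (*) Attention(Q=F_rgb, K=F_ir, V=F_ir):
    # the shared dynamic feature is approximated by the modality mean.
    f_dyn = 0.5 * (f_rgb + f_ir)
    # Gates [G_rgb, G_ir] = Sigmoid(Conv(Concat(F_rgb, F_ir))), with the
    # convolution replaced by an elementwise scale for this sketch.
    g = sigmoid(w_gate * np.concatenate([f_rgb, f_ir], axis=1))
    g_rgb, g_ir = np.split(g, 2, axis=1)
    # Residual addition back to the original features.
    out_rgb = f_rgb + g_rgb * f_dyn
    out_ir = f_ir + g_ir * f_dyn
    return out_rgb, out_ir
```

In the actual module, the shared feature would come from the multi-head cross attention and dynamic convolution described in claim 1; the sketch only shows how the sigmoid gates route that shared feature back into each branch while preserving the [B, C, H, W] shape.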

Description

Multi-mode progressive fusion aerial small target image detection method

Technical Field

The invention relates to the technical field of computer vision, and in particular to a multi-mode progressive fusion aerial small target image detection method.

Background

Target detection is a leading direction in the field of computer vision, with great potential in tracking and positioning, emergency rescue and other applications. Along with the development of the low-altitude economy, the application of remote sensing aerial images in target detection algorithms has developed rapidly. However, because image sensors and data sources form and derive information in different ways, most detection networks are designed and built on single-modality image data (e.g., visible light images or infrared thermal imaging). Although good results are achieved in some scenes, the information extraction capability under scene migration is limited, and, lacking multi-modal feature complementation, target features are severely lost in downstream detection tasks.

In aerial small target detection, a small target is defined as a target with an extremely low pixel ratio and blurred visual characteristics in aerial images or videos (common industry threshold: pixel side length below 32×32, or below 0.1% of the whole image area). Because the aerial viewing angle is high and the shooting distance is long, such a target presents only a small number of pixels or a small color patch in a single frame, without obvious contour details.

At present, remote sensing aerial detection of small targets faces several challenges: (1) the aerial viewing angle is unique and the environment is complex, so targets are frequently occluded; static occlusion mainly originates from structures such as buildings, bridges and trees, while dynamic occlusion is caused by moving objects such as vehicles and pedestrians; (2) extreme lighting conditions such as strong light and shadow, and complex weather such as haze, rain and snow, increase the difficulty of target detection; (3) targets are numerous and small, and edge information features are easily lost as features pass through the fusion and detection layers. Therefore, recognition capability for remote sensing aerial small targets still needs to be optimized.

Visible light (RGB) images are good at capturing color and detail and characterize target edge information and detail features well while preserving visual features, but their performance degrades significantly in low light, at night and in complex weather. Infrared thermal imaging (IR) is unaffected by illumination and retains the dominant features of a target by perceiving temperature differences, but lacks feature texture, color characterization and rich background information, making it better suited to night monitoring scenes. In existing multi-modal aerial target detection, visible light features and infrared thermal imaging features are extracted and then fused, but the features are simply superimposed on the feature map; although a certain degree of feature complementation is achieved, problems such as semantic alignment and weight distribution between modalities are not fully considered, so the fusion effect is unstable in complex scenes.
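The small-target threshold quoted in the background (both sides under 32 pixels, or under 0.1% of the image area) can be written as a simple predicate. The function name below is illustrative and not part of the patent.

```python
def is_small_target(box_w, box_h, img_w, img_h):
    """True if a detection counts as a 'small target' under the
    thresholds quoted for aerial imagery: bounding-box sides both
    under 32 px, or box area under 0.1% of the image area."""
    under_32 = box_w < 32 and box_h < 32
    area_ratio = (box_w * box_h) / float(img_w * img_h)
    return under_32 or area_ratio < 0.001
```

For example, a 20×18 px box in a 1920×1080 frame qualifies under both criteria, while a 200×200 px box satisfies neither.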
Disclosure of the Invention

The invention aims to construct a cross-modal target detection model under the condition that bimodal data are input simultaneously, to extract and fuse dual-stream features, to realize same-dimension feature enhancement and cross-dimension feature fusion at different stages, and thereby to provide a multi-mode progressive fusion aerial small target image detection method. To achieve the above object, an embodiment of the invention provides the following technical solution: a multi-mode progressive fusion aerial small target image detection method comprising the following steps: inputting the aerial small target image into a cross-modal target detection model improved on the basis of the YOLO11 network, and outputting the position, the category and the confidence of each small target predicted in the aerial small target image; the cross-modal target detection model comprises a backbone network, a neck network and a detection head, wherein the backbone network comprises a visible light branch, an infrared branch, four dynamic enhancement modules and a multi-scale small target cross-screening mechanism module, the visible light branch and the infrared branch are fused, and the neck network replaces the C3k2 modules of the YOLO11 network with the multi-scale small target cross-screening mechanism module.