CN-121999394-A - Unmanned aerial vehicle small target detection system and method based on space and frequency domain cooperative enhancement
Abstract
The invention discloses an unmanned aerial vehicle small target detection system and method based on spatial and frequency domain collaborative enhancement. The method comprises: preprocessing an input image; performing multi-level feature extraction on the preprocessed image and providing multi-scale feature map data to a hybrid_Encoder; fusing cross-scale features and improving global modeling capacity with the hybrid_Encoder, in which an attention interaction FAPAM module performs global context modeling on the deepest feature map and enhances local edge and texture detail features, outputting the enhanced deepest feature map, and a multi-scale feature enhancement DD-AFFM module constructs a dual-domain collaborative feature fusion framework to perform multi-scale feature fusion, outputting a feature array formed by the enhanced deep feature maps; and decoding the global feature representation to generate specific target boxes and category predictions, with target query interaction and target instance prediction. The method effectively improves target detection precision in complex scenes and has strong cross-scene generalization capability.
Inventors
- CHEN MING
- WEN XIAOBO
- CHU YANGYANG
- ZHANG ZHIFENG
- LIANG SHUJUN
- CHENG JUNQIANG
- QI PING
- WANG FUCHENG
Assignees
- 铜陵学院
- 郑州轻工业大学
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-01-16
Claims (10)
- 1. An unmanned aerial vehicle small target detection system based on spatial and frequency domain collaborative enhancement, characterized by comprising: a multi-scale feature extraction Backbone unit for extracting multi-level feature representations from an input image; a global context modeling and multi-scale feature fusion hybrid_Encoder unit, connected to the multi-scale feature extraction Backbone unit and used for fusing cross-scale features and improving global modeling capacity, the hybrid_Encoder unit comprising an attention interaction FAPAM module and a multi-scale feature enhancement DD-AFFM module, the attention interaction FAPAM module being used for realizing collaborative optimization of global dependency modeling and local detail enhancement and improving the distinguishability and stability of small target features, and the multi-scale feature enhancement DD-AFFM module being used for constructing a dual-domain collaborative feature fusion framework that mines the fine-grained structural information of shallow features and the semantic information of deep features so as to strengthen the characterization of small target features; and a target query interaction and target instance prediction Decoder unit, connected with the hybrid_Encoder unit and used for decoding the global feature representation output by the hybrid_Encoder unit into specific target boxes and category predictions.
- 2. The unmanned aerial vehicle small target detection system of claim 1, wherein the attention interaction FAPAM module comprises a polarity-aware linear attention sub-module and an adaptive window frequency domain modulation AWFM sub-module; the polarity-aware linear attention sub-module is used for effectively modeling long-range dependency relationships while maintaining linear computational complexity, so as to enhance the model's global perception of small targets in complex backgrounds; the adaptive window frequency domain modulation AWFM sub-module is used for dynamically adjusting the response intensity of different frequency components, enhancing the representation of high-frequency details on the premise of preserving semantic consistency, so as to improve the discriminability of small target features.
- 3. The unmanned aerial vehicle small target detection system of claim 1, wherein the multi-scale feature enhancement DD-AFFM module comprises: an HFD and PD downsampling sub-module, used for alleviating the dilution or loss of detail information during downsampling, comprising a patch-wise linear fusion downsampling PD module and a hybrid fusion downsampling HFD module, wherein the PD module compensates in the channel dimension for the spatial resolution lost in downsampling so as to reduce the loss of small target information, and the HFD module fuses three paths, a convolution branch, a max-pooling branch and a linear mapping branch, so as to integrate shallow features into deep features and enrich deep small target features; a frequency domain aware modulation FAM sub-module, which introduces a frequency domain enhancement mechanism, performs local window frequency domain transformation on the input features, and realizes adaptive modulation of high- and low-frequency components through learnable weight parameters, so that the model can automatically and selectively strengthen high-frequency or low-frequency components according to the scene, highlighting key details while maintaining overall semantic consistency, so as to improve the model's feature extraction capability for small targets; and a bidirectional cross-layer feature adaptive fusion BiCAF sub-module, which introduces an adaptive weight mechanism for automatically adjusting the relative dependence on detail features and semantic features during fusion according to different scenes and target scales so as to flexibly regulate information flow, and which also introduces a residual connection structure to ensure the stability of feature semantics after cross-layer fusion and the continuity of gradient propagation, thereby enhancing the robustness and consistency of the fused features.
- 4. An unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement, realized by adopting the unmanned aerial vehicle small target detection system based on spatial and frequency domain cooperative enhancement according to any one of claims 1 to 3, characterized by comprising the following steps: S1, performing data preprocessing on an input image, sequentially executing data enhancement and data type conversion operations; S2, performing multi-level feature extraction on the preprocessed image, generating feature maps of different scales through a convolutional network, outputting a multi-scale feature set forming a feature pyramid, and providing multi-scale feature map data to the global context modeling and multi-scale feature fusion hybrid_Encoder unit; S3, fusing cross-scale features and improving global modeling capacity with the hybrid_Encoder unit: first, using the attention interaction FAPAM module to perform global context modeling on the deepest feature map and strengthen local edge and texture detail features, outputting the strengthened deepest feature map; then, using the multi-scale feature enhancement DD-AFFM module to construct a dual-domain collaborative feature fusion framework for multi-scale feature fusion, outputting a feature array formed by the strengthened deep feature maps; S4, decoding the global feature representation output by the hybrid_Encoder unit to generate specific target boxes and category predictions, and performing target query interaction and target instance prediction.
- 5. The unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement of claim 4, wherein S1 further comprises: S1.1, performing data enhancement on the input image: applying random photometric distortion to adjust brightness, contrast and color saturation, performing random scaling with black-pixel padding at the edges, and performing IoU-based intelligent cropping to ensure that effective target regions are retained; S1.2, after the data enhancement operation is completed, removing invalid bounding boxes and their corresponding labels, and increasing data diversity by applying random horizontal flipping; S1.3, uniformly scaling the augmented image to a fixed size of 640x640 pixels, converting the image into tensor format, and performing data type conversion; S1.4, verifying the validity of the bounding boxes again, removing invalid bounding boxes and their corresponding labels, and converting the bounding box coordinates from xyxy format to normalized cxcywh format.
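As a concrete illustration of the bounding-box handling in steps S1.2 and S1.4, the filtering and coordinate conversion can be sketched in NumPy as below; the function names are hypothetical, and the claim does not prescribe any particular implementation.

```python
import numpy as np

def drop_invalid(boxes):
    """Remove degenerate boxes (non-positive width or height),
    mirroring the invalid-box filtering of steps S1.2 and S1.4."""
    boxes = np.asarray(boxes, dtype=np.float64)
    keep = (boxes[:, 2] > boxes[:, 0]) & (boxes[:, 3] > boxes[:, 1])
    return boxes[keep]

def xyxy_to_norm_cxcywh(boxes, img_w, img_h):
    """Convert pixel [x1, y1, x2, y2] boxes to [cx, cy, w, h]
    normalized by the image size, as in step S1.4."""
    x1, y1, x2, y2 = np.asarray(boxes, dtype=np.float64).T
    return np.stack([(x1 + x2) / 2 / img_w,
                     (y1 + y2) / 2 / img_h,
                     (x2 - x1) / img_w,
                     (y2 - y1) / img_h], axis=1)

# A 160x160 box centered at (320, 320) in a 640x640 image maps to
# cx = cy = 0.5, w = h = 0.25; the zero-width box is dropped.
boxes = drop_invalid([[240, 240, 400, 400], [100, 100, 100, 200]])
```
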
- 6. The unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement according to claim 4, wherein in S3, the processing procedure of the attention interaction FAPAM module further comprises: S3.A1, introducing a polarity-aware linear attention mechanism to replace the multi-head self-attention of a standard Transformer: in the global modeling stage, the Query and Key features are divided into same-polarity and opposite-polarity interaction paths according to their sign relations, and the similarity maps are computed separately, so that the model captures the aggregation relations among positively correlated regions while preserving the separation between negatively correlated features; S3.A2, introducing an adaptive window frequency domain modulation mechanism: the features are divided into a plurality of adaptive windows, a frequency domain transformation is performed within each window to decompose the spatial features into high- and low-frequency components, and the model adaptively and dynamically adjusts the response intensity of the different frequency components according to the input scene.
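A minimal NumPy sketch of the polarity splitting described in S3.A1 follows. It assumes Q and K are split by sign into non-negative parts and that each polarity path uses a standard kernelized linear attention with equal path weighting; the exact kernel and weighting of the patented module are not specified in the claim, so these choices are illustrative assumptions.

```python
import numpy as np

def linear_attn(q, k, v, eps=1e-6):
    """Kernelized linear attention: cost O(N*d) rather than O(N^2)."""
    kv = k.T @ v                       # (d, d_v) summary of keys/values
    z = q @ k.sum(axis=0) + eps        # (N,) per-query normalizer
    return (q @ kv) / z[:, None]

def polarity_aware_attention(q, k, v):
    """Sketch of polarity-aware linear attention: Q and K are split by
    sign, then same-polarity and opposite-polarity paths are attended
    separately and averaged (equal weighting is an assumption)."""
    qp, qn = np.maximum(q, 0), np.maximum(-q, 0)
    kp, kn = np.maximum(k, 0), np.maximum(-k, 0)
    same = linear_attn(qp, kp, v) + linear_attn(qn, kn, v)  # same-polarity
    diff = linear_attn(qp, kn, v) + linear_attn(qn, kp, v)  # opposite-polarity
    return 0.5 * (same + diff)
```
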
- 7. The unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement according to claim 6, wherein in S3.A2, the adaptive window frequency domain modulation mechanism is computed in terms of the following quantities: O1 represents the intermediate output combined with the residual after the attention calculation, O2 represents the output after the feedforward step, O3 represents the result after the frequency domain computation, input represents the input, PolaLinearAttention(·) represents the polarity-aware attention, Norm(·) represents a normalization operation, FFN(·) represents a feedforward neural network, Patch(·) represents dividing the input feature map into windows, FFT(·) represents the fast Fourier transform, IFFT(·) represents the inverse fast Fourier transform, weight represents a learnable weight for adaptively fusing frequency domain components, Reshape(·) represents restoring the Fourier-transformed features to the input shape, and output represents the final output.
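The quantities defined in claim 7 suggest a window-wise Fourier modulation of the form IFFT(weight · FFT(Patch(x))). A minimal NumPy sketch, assuming a real-valued per-frequency gain map standing in for the learnable weight, is:

```python
import numpy as np

def awfm(feat, win=8, weight=None):
    """Sketch of adaptive window frequency-domain modulation: split the
    feature map into non-overlapping windows, FFT each window, scale the
    spectrum by a per-frequency gain (the learnable `weight`), and
    inverse-FFT back. `weight=None` uses an identity (no-op) modulation."""
    h, w = feat.shape
    assert h % win == 0 and w % win == 0
    if weight is None:
        weight = np.ones((win, win))   # identity modulation placeholder
    out = np.empty_like(feat)
    for i in range(0, h, win):
        for j in range(0, w, win):
            spec = np.fft.fft2(feat[i:i + win, j:j + win])
            out[i:i + win, j:j + win] = np.fft.ifft2(weight * spec).real
    return out
```

With the identity weight the transform round-trips exactly, so any learned deviation of `weight` from one directly amplifies or suppresses the corresponding frequency band within each window.
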
- 8. The unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement as claimed in claim 4, wherein in S3, the processing procedure of the multi-scale feature enhancement DD-AFFM module further comprises: S3.B1, using patch-wise linear fusion downsampling PD to divide the feature map into non-overlapping 2x2 windows and splice the feature channels within each window to obtain regional features, expressed in terms of the following quantities: input represents the input, UnFold(·) represents the extraction of local spatial features, converting spatial data into a flat vector sequence, Linear(·) represents a linear projection mapping the flat feature vectors into a new embedding space, Reshape(·) represents restoring the projected features to the output spatial shape, and output represents the final output; S3.B2, using hybrid fusion downsampling HFD to merge three paths, a convolution branch, a max-pooling branch and a linear mapping branch, integrating shallow features into deep features; S3.B3, introducing a frequency domain enhancement mechanism, performing local window frequency domain transformation on the input features, and realizing adaptive modulation of high- and low-frequency components through learnable weight parameters, so that the model can automatically and selectively strengthen high-frequency or low-frequency components according to the scene, alleviating the misjudgment of small targets caused by weakened textures in complex backgrounds and improving the model's ability to distinguish small targets, the process being expressed in terms of the following quantities: C1 represents the output after the 1x1 convolution, C2 represents the output after the frequency domain computation, Conv(·) represents a convolution operation, Patch(·) represents dividing the input feature map into windows, FFT(·) represents the fast Fourier transform, IFFT(·) represents the inverse fast Fourier transform, and weight represents a learnable weight for adaptive fusion of frequency domain components; S3.B4, introducing an adaptive weight mechanism to automatically adjust the relative dependence on detail features and semantic features during fusion according to different scenes and target scales, and introducing a residual connection structure to ensure the stability of feature semantics and the continuity of gradient propagation after cross-layer fusion, enhancing the dynamic fusion of cross-layer features and realizing bidirectional information interaction between shallow and deep features, the process being expressed in terms of the following quantities: F1 denotes the result of stitching the two features along the channel dimension, F2 denotes the output after the weight calculation, F_low denotes the shallow features, F_high denotes the deep features, Concat(·) denotes concatenation along the channel dimension, α denotes a learnable weight normalized by a sigmoid function, bottom2top denotes bottom-up feature fusion, and top2bottom denotes top-down feature fusion.
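Step S3.B1 (UnFold over 2x2 windows followed by a Linear projection) can be sketched in NumPy as below; the projection matrix stands in for the learned Linear layer and is an illustrative placeholder, not the patented parameterization.

```python
import numpy as np

def pd_downsample(feat, w_proj=None):
    """Sketch of patch-wise linear-fusion (PD) downsampling: split a
    (C, H, W) map into non-overlapping 2x2 windows, stack the four window
    positions onto the channel axis (C -> 4C), then apply a channel-wise
    linear projection, so spatial detail lost by downsampling is retained
    in the channel dimension."""
    c, h, w = feat.shape
    # UnFold: gather the 4 positions of each 2x2 window onto channels.
    unf = feat.reshape(c, h // 2, 2, w // 2, 2).transpose(2, 4, 0, 1, 3)
    unf = unf.reshape(4 * c, h // 2, w // 2)
    if w_proj is None:
        # Placeholder for the learned Linear layer (maps 4C -> 2C).
        w_proj = np.eye(2 * c, 4 * c) / 2.0
    # Linear + Reshape: 1x1 projection over the channel dimension.
    return np.einsum('oc,chw->ohw', w_proj, unf)
```
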
- 9. The unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement according to claim 8, wherein in S3.B2, the convolution branch is used for extracting local context, the max-pooling branch is used for maintaining spatial structural stability, and the linear mapping branch is used for enhancing fine-grained feature expression through global channel relation modeling.
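A minimal NumPy sketch of the three HFD branches named in claim 9 follows. The random weights are placeholders for the learned convolution and linear parameters, and the equal-weight average of the branches is an assumption, since the claims do not specify the fusion rule.

```python
import numpy as np

def hfd_downsample(feat, rng=np.random.default_rng(0)):
    """Sketch of hybrid fusion downsampling (HFD): three parallel 2x
    downsampling paths -- a strided 3x3 convolution for local context,
    a max-pool for structural stability, and a strided 1x1 linear
    mapping for channel relations -- averaged into one output."""
    c, h, w = feat.shape
    # Max-pooling branch: 2x2 max over non-overlapping windows.
    pool = feat.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))
    # Linear-mapping branch: 1x1 channel projection applied at stride 2.
    w_lin = rng.standard_normal((c, c)) * 0.1
    lin = np.einsum('oc,chw->ohw', w_lin, feat[:, ::2, ::2])
    # Convolution branch: stride-2 3x3 conv via naive loops, zero padding.
    w_conv = rng.standard_normal((c, c, 3, 3)) * 0.1
    pad = np.pad(feat, ((0, 0), (1, 1), (1, 1)))
    conv = np.zeros((c, h // 2, w // 2))
    for i in range(h // 2):
        for j in range(w // 2):
            patch = pad[:, 2 * i:2 * i + 3, 2 * j:2 * j + 3]
            conv[:, i, j] = np.einsum('ockl,ckl->o', w_conv, patch)
    return (conv + pool + lin) / 3.0
```
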
- 10. The unmanned aerial vehicle small target detection method based on spatial and frequency domain cooperative enhancement of claim 4, wherein S4 further comprises: S4.1, generating corresponding initial anchor boxes according to the resolution of each scale feature map output by the global context modeling and multi-scale feature fusion hybrid_Encoder unit; S4.2, flattening the multi-scale two-dimensional feature maps output by the hybrid_Encoder unit along the sequence dimension, and concatenating them into sequence data memory; S4.3, predicting offsets of the initial anchor boxes by applying a classification detection head and a bounding box regression head to the sequence data memory, taking out the top k feature vectors with the highest maximum class probability in the classification predictions, including the predictions of these k feature vectors in the loss calculation, and using them as the initialization of the object query vectors of the target query interaction and target instance prediction Decoder unit; S4.4, generating perturbed denoising query vectors (denoising queries) from the ground-truth bounding boxes GT to accelerate model convergence, inputting the sequence data memory, the object query vectors and the denoising query vectors into the Decoder unit, refining the query vectors through multiple Decoder layers, predicting from the output of each Decoder layer, and computing the loss on the output of each Decoder layer; S4.5, taking the prediction of the last Decoder layer as the model's final prediction, converting the final prediction from xywh format to xyxy format through post-processing, and multiplying by the image width and height to obtain the pixel coordinates of the boxes.
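The coordinate conversion in step S4.5, the inverse of the normalization done in preprocessing, can be sketched as:

```python
import numpy as np

def postprocess(pred, img_w, img_h):
    """Sketch of step S4.5: convert normalized [cx, cy, w, h]
    predictions to pixel [x1, y1, x2, y2] boxes by undoing the
    preprocessing normalization."""
    cx, cy, w, h = np.asarray(pred, dtype=np.float64).T
    x1 = (cx - w / 2) * img_w
    y1 = (cy - h / 2) * img_h
    x2 = (cx + w / 2) * img_w
    y2 = (cy + h / 2) * img_h
    return np.stack([x1, y1, x2, y2], axis=1)
```

For example, a prediction of [0.5, 0.5, 0.5, 0.5] on a 640x640 image yields the pixel box [160, 160, 480, 480].
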
Description
Unmanned aerial vehicle small target detection system and method based on space and frequency domain cooperative enhancement

Technical Field

The invention relates to the technical field of computer vision detection, in particular to an unmanned aerial vehicle small target detection system and method based on spatial and frequency domain cooperative enhancement.

Background

With the vigorous development of the low-altitude economy, unmanned aerial vehicles are widely applied in fields such as traffic monitoring and industrial inspection by virtue of their flexibility, wide field of view and low deployment cost. However, because shooting angles vary and imaging environments are complex, images acquired by unmanned aerial vehicles generally suffer from small target scale, diverse pose distributions and severe occlusion, which poses a serious challenge to the accuracy of existing target detection algorithms. Existing general-purpose target detection algorithms suffer from low detection precision, low recall and poor robustness in unmanned aerial vehicle scenes. The DETR model offered a promising direction for small target detection thanks to its end-to-end detection capability and strong global context modeling, but it predicts from a single-scale feature map, making it difficult to effectively characterize targets of extremely small scale. As an important milestone in making end-to-end detectors practical, RT-DETR successfully solves the efficiency problem of existing DETR-series models in real-time scenarios, achieving competitive real-time speed and high precision while maintaining a simple end-to-end NMS-free pipeline.
However, RT-DETR relies mainly on spatial domain feature interaction and lacks an explicit enhancement mechanism at the frequency domain level, so the model is limited in capturing high-frequency detail features, which in turn limits its small target detection accuracy in complex scenes such as unmanned aerial vehicle aerial photography.

Disclosure of Invention

The unmanned aerial vehicle small target detection system and method based on spatial and frequency domain cooperative enhancement provided by the invention realize a more comprehensive and more robust feature characterization of small targets, and solve at least one of the above technical problems. In order to solve the technical problems, the invention adopts the following technical scheme: an unmanned aerial vehicle small target detection system based on spatial and frequency domain collaborative enhancement includes: a multi-scale feature extraction Backbone unit for extracting multi-level feature representations from an input image; a global context modeling and multi-scale feature fusion hybrid_Encoder unit, connected to the multi-scale feature extraction Backbone unit and used for fusing cross-scale features and improving global modeling capacity, the hybrid_Encoder unit comprising an attention interaction FAPAM module and a multi-scale feature enhancement DD-AFFM module, the attention interaction FAPAM module being used for realizing collaborative optimization of global dependency modeling and local detail enhancement and improving the distinguishability and stability of small target features, and the multi-scale feature enhancement DD-AFFM module being used for constructing a dual-domain collaborative feature fusion framework that mines the fine-grained structural information of shallow features and the semantic information of deep features so as to strengthen the characterization of small target features; and a target query interaction and target instance prediction Decoder unit, connected with the hybrid_Encoder unit and used for decoding the global feature representation output by the hybrid_Encoder unit into specific target boxes and category predictions. Further, the attention interaction FAPAM module includes a polarity-aware linear attention sub-module and an adaptive window frequency domain modulation AWFM sub-module; the polarity-aware linear attention sub-module is used for effectively modeling long-range dependency relationships while maintaining linear computational complexity, so as to enhance the model's global perception of small targets in complex backgrounds; the adaptive window frequency domain modulation AWFM sub-module is used for dynamically adjusting the response intensity of different frequency components, enhancing the representation of high-frequency details on the premise of preserving semantic consistency, so as to improve the discriminability of small target features. Further, the multi-scale feature enhanceme