
CN-121982748-A - Fine-granularity feature aggregation-based end-to-end UAV infrared pedestrian detection method

CN121982748A

Abstract

The invention discloses an end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation, relating to the intersection of computer vision, infrared detection technology and intelligent edge computing. The method comprises the steps of: acquiring infrared pedestrian image data and preprocessing it; inputting the preprocessed data into a detection model; extracting multi-scale features of the preprocessed data through the backbone network of the detection model; performing multi-scale feature fusion on the shallow features extracted by the backbone network using an attention mechanism; performing multi-scale feature fusion learning on the deep features extracted by the backbone network using a re-parameterization technique; and optimizing the fused features through a multi-path attention mechanism detection head to output the detection result. The invention achieves high-precision, real-time pedestrian detection in complex infrared scenes.

Inventors

  • XU HUIYING
  • ZHU XINZHONG
  • LI HONGBO
  • LI YI
  • WANG RUIDONG
  • SHI WEI
  • SU WEIFENG
  • XU LINGLING
  • WANG ZHENGLONG

Assignees

  • 浙江师范大学
  • 北京极智嘉科技股份有限公司
  • 合肥极智嘉机器人有限公司
  • 杭州纵横通信股份有限公司

Dates

Publication Date
2026-05-05
Application Date
2026-02-03

Claims (10)

  1. An end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation, characterized by comprising the following steps: S1, acquiring infrared pedestrian image data, preprocessing the infrared pedestrian image data, and inputting the preprocessed infrared pedestrian image data into a detection model; S2, extracting multi-scale features of the preprocessed infrared pedestrian image data through a backbone network of the detection model; S3, performing multi-scale feature fusion on the shallow features extracted by the backbone network using an attention mechanism; S4, performing multi-scale feature fusion learning on the deep features extracted by the backbone network using a re-parameterization technique; and S5, optimizing the fused features through a multi-path attention mechanism detection head and outputting a detection result.
  2. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 1, wherein the infrared pedestrian image data is derived from an infrared thermal imaging device onboard an unmanned aerial vehicle, an autonomous mobile robot, or an autonomous agent.
  3. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 1, wherein the preprocessing includes image size adjustment, the adjusted image size being 640 x 640.
  4. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 1, wherein the backbone network is HGNetv2 after channel pruning; the backbone network comprises a Stem module, a plurality of HGBlock modules, an inter-layer downsampling module, a global average pooling layer, a fully connected layer and a convolution layer; the HGBlock module adopts a multi-branch structure, constructs basic features by stacking 3×3 convolutions, and extracts features in parallel through five convolution kernels of different sizes before fusing them.
  5. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 4, wherein the Stem module comprises a convolution layer, a batch normalization layer and a nonlinear activation function, and feature map dimensions are compressed between adjacent HGBlock modules by an inter-layer downsampling operation.
  6. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 1, wherein the attention mechanism is a large-size separable kernel attention module that constructs a large convolution kernel from a depthwise separable convolution, a dilated depthwise separable convolution, and a subsequent 1 x 1 convolution.
  7. The fine-grained feature aggregation-based end-to-end UAV infrared pedestrian detection method of claim 6, wherein the input feature map of the large-size separable kernel attention module is X ∈ R^(C×H×W), where C is the number of input channels and H and W are the height and width of the feature map, respectively, and the large-size separable kernel attention module decomposes the depthwise separable convolution into a combination of a depthwise separable convolution with a small convolution kernel and a dilated depthwise separable convolution with a dilated kernel.
  8. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 1, wherein the re-parameterization technique is online convolution re-parameterization, which comprises module linearization and module compression: module linearization removes the nonlinear layer, introduces a scaling layer and adds a post-addition normalization layer, and module compression merges the linear modules into a single convolution layer.
  9. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 8, wherein the module compression includes sequential structure simplification, which merges multi-level convolution layers by weight matrix mapping, and parallel structure simplification, which fuses weights by aligning the centers of different convolution kernels.
  10. The end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation of claim 1, wherein the multi-path attention mechanism is a multi-path coordinate attention mechanism, and the detection head comprises detection branches at 4 scales, the scales being 80 x 80, 40 x 40, 20 x 20 and 10 x 10, respectively.
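Claims 3 and 10 together imply the standard stride arithmetic of multi-scale detectors: a 640 x 640 input downsampled by strides of 8, 16, 32 and 64 yields exactly the four claimed grid sizes. A minimal sketch of that arithmetic (the stride values are inferred from the claimed sizes, not stated in the patent):

```python
def detection_grid_sizes(input_size: int, strides=(8, 16, 32, 64)):
    """Spatial size of each detection branch's feature grid for a square input."""
    return [input_size // s for s in strides]

# 640 x 640 input -> the four scales named in claim 10
print(detection_grid_sizes(640))  # [80, 40, 20, 10]
```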
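The decomposition in claims 6 and 7 follows the large-kernel attention idea of emulating one large depthwise convolution with a small depthwise convolution followed by a dilated one. The sketch below works out the effective receptive field of such a cascade; the particular sizes (5 x 5 followed by 7 x 7 with dilation 3) are illustrative assumptions, not values taken from the patent:

```python
def effective_kernel_size(k_small: int, k_dilated: int, dilation: int) -> int:
    """Receptive field of a k_small depthwise conv followed by a k_dilated
    depthwise conv with the given dilation (stride 1 throughout)."""
    dense = (k_dilated - 1) * dilation + 1  # dilated kernel viewed as a dense array
    return k_small + dense - 1              # support of the composed convolution

# Illustrative decomposition: a 5x5 DW-conv plus a 7x7 DW-conv with dilation 3
# covers the same receptive field as a single 23x23 depthwise convolution.
big = effective_kernel_size(5, 7, 3)

# Per-channel weight count, ignoring the trailing 1x1 convolution:
params_decomposed = 5 * 5 + 7 * 7  # 74 weights
params_direct = big * big          # 529 weights
```

The parameter comparison is the motivation for the decomposition: the cascade reaches the same receptive field with roughly a seventh of the weights.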
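The parallel structure simplification of claim 9 rests on the linearity of convolution: two branches summed at runtime can be trained separately but deployed as one kernel whose center absorbs the smaller kernel's weight. A single-channel NumPy sketch verifying this numerically (the 3 x 3 plus 1 x 1 branch pair is chosen for illustration):

```python
import numpy as np

def conv2d_same(x, k):
    """Naive single-channel cross-correlation with zero 'same' padding."""
    kh, kw = k.shape
    xp = np.pad(x, ((kh // 2,) * 2, (kw // 2,) * 2))
    out = np.empty_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
k3 = rng.standard_normal((3, 3))
k1 = rng.standard_normal((1, 1))

# Training-time form: two parallel branches, summed.
two_branch = conv2d_same(x, k3) + conv2d_same(x, k1)

# Deployment-time form: align the 1x1 kernel with the 3x3 center and fuse.
fused_kernel = k3.copy()
fused_kernel[1, 1] += k1[0, 0]
one_branch = conv2d_same(x, fused_kernel)

assert np.allclose(two_branch, one_branch)
# Sequential simplification is analogous: a scaling layer s followed by a
# convolution with kernel k collapses into one convolution with kernel s * k.
```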

Description

Fine-granularity feature aggregation-based end-to-end UAV infrared pedestrian detection method

Technical Field

The invention relates to the intersection of computer vision, infrared detection technology and intelligent edge computing, and in particular to an end-to-end UAV infrared pedestrian detection method based on fine-grained feature aggregation.

Background

The evolution of pedestrian detection technology can be summarized in three stages: from traditional feature engineering to deep learning, from visible-light-dominated sensing to multi-modal fusion, and from general scenes to specialized vertical scenes.

First stage: the traditional feature engineering era. Early pedestrian detection relied primarily on manually designed feature descriptors such as Haar features, Histogram of Oriented Gradients (HOG) and Local Binary Patterns (LBP). These methods detect targets by extracting shallow visual features such as edges and textures, combined with a sliding window and a classifier (such as an SVM). They are computationally simple and highly interpretable, but are severely limited by their feature expression capability: robustness is poor under complex backgrounds, occlusion and illumination changes, and the false detection and miss rates are too high for practical applications.

Second stage: the deep-learning-driven visible light detection era. With the breakthrough of Convolutional Neural Networks (CNNs), deep-learning-based object detection methods (such as Fast R-CNN, the YOLO series and SSD) greatly improved detection accuracy and speed. By learning multi-level feature representations end-to-end, these methods achieve performance on visible-light images that approaches or even exceeds human level.
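The HOG-style pipeline of the first stage can be illustrated with its core building block, a magnitude-weighted histogram of gradient orientations for one cell. This is a minimal NumPy sketch for illustration only, not the patent's method; the cell size and bin count are arbitrary choices:

```python
import numpy as np

def cell_hog(cell, n_bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations (0-180 deg)
    for one cell, the basic unit of a HOG descriptor."""
    gy, gx = np.gradient(cell.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0  # unsigned orientation
    bin_idx = np.minimum((ang / (180.0 / n_bins)).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    for b in range(n_bins):
        hist[b] = mag[bin_idx == b].sum()
    return hist

# Toy 8x8 cell containing a vertical edge: all gradient energy is horizontal,
# so it falls into the orientation bin around 0 degrees.
cell = np.zeros((8, 8))
cell[:, 4:] = 1.0
hist = cell_hog(cell)
```

A full HOG descriptor concatenates block-normalized cell histograms over a sliding window before classification with an SVM.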
However, the physical limitations of visible light imaging cause performance to degrade rapidly under insufficient illumination or poor visibility, such as at night, in backlight or in haze, so all-weather reliable perception cannot be achieved. This problem is particularly acute in scenes with high continuity requirements, such as security monitoring and autonomous driving.

Third stage: the rise of infrared and multi-modal perception. To break through the perceptual boundary of visible light, infrared thermal imaging was introduced as a key complementary modality. By capturing the thermal radiation of objects, infrared sensors work around the clock regardless of illumination. As deep-learning-based infrared target detection developed, early work mainly explored improving or fine-tuning visible light detection networks (such as YOLO and RetinaNet applied to infrared datasets). However, such direct migration methods face significant challenges:

1. Feature mismatch: infrared images lack texture and color information, have low contrast and high noise, so the feature priors learned through visible-light pre-training are difficult to transfer effectively.

2. Small-target performance bottleneck: under the high-altitude viewing angle of a UAV, pedestrians appear in infrared images as very low-resolution small targets with weak thermal signatures, and general-purpose detectors struggle to combine high recall with a low false alarm rate.

3. Model efficiency versus deployment: although academia has produced many complex models that improve accuracy (introducing attention mechanisms, multi-scale fusion, feature pyramid optimization, and so on), these methods tend to greatly increase computational complexity, conflicting seriously with the limited compute, memory and real-time requirements of edge platforms such as UAVs and robots.

Currently, infrared pedestrian detection research for mobile platforms such as UAVs and robots shows the following characteristics. Lightweight network design has become a hotspot: researchers compress models through neural architecture search, model pruning and knowledge distillation, but often at the cost of small-target detection accuracy. Multi-scale feature fusion is emphasized through improved FPN, PANet and similar structures to enhance small-target feature representation, but the fusion process easily produces feature redundancy and aliasing, and the computational cost is high. Attention mechanisms such as channel attention and spatial attention have been introduced to enhance key features, but existing approaches do not adequately model long-range dependencies in infrared scenes, and dynamic attention computation worsens latency. The reparameterization technology sta