CN-121811202-B - Aerial punctiform target detection method and system based on high-frequency perception and dynamic scaling loss


Abstract

The invention belongs to the technical field of computer vision and remote sensing image processing, and provides an aerial punctiform target detection method and system based on high-frequency perception and dynamic scaling loss. The method comprises: receiving an aerial image and extracting multi-scale spatial features containing high-frequency details using a high-resolution Mamba backbone network; fusing the multi-scale spatial features through a PANet neck network to generate a multi-scale fused feature map; enhancing the multi-scale fused feature map using a high-frequency refocusing module (HFRM); and, taking the enhanced multi-scale fused feature map as input, predicting the position and category of the target through a decoupled detection head. In the model training stage, the model is optimized with a dynamic scaling loss function, which dynamically adjusts the weights of the classification loss and the regression loss according to the ground-truth area of the target. The invention effectively addresses the detection problems caused by semantic collapse, loss of high-frequency information, and unstable training for punctiform targets in aerial images.

Inventors

ZHANG XIANGQING
ZHAO ZICHENG
ZHANG ZHIYI
JIAO TENGYU
YANG ZHANHAI

Assignees

Yan'an University (延安大学)

Dates

Publication Date: 20260508
Application Date: 20260309

Claims (6)

1. An aerial punctiform target detection method based on high-frequency perception and dynamic scaling loss, characterized by comprising the following steps: receiving an aerial image and extracting multi-scale spatial features containing high-frequency details using a high-resolution Mamba backbone network; fusing the multi-scale spatial features through a PANet neck network to generate a multi-scale fused feature map; enhancing the multi-scale fused feature map using a high-frequency refocusing module (HFRM), wherein the enhancement comprises converting the multi-scale fused feature map to the frequency domain through an FFT, separating low-frequency and high-frequency components with a binary mask, dynamically amplifying the high-frequency components through a channel attention mechanism, and returning the enhanced, recombined spectrum to the spatial domain through an IFFT; taking the enhanced multi-scale fused feature map as input and predicting the position and category of the target through a decoupled detection head; and optimizing the entire HFA-Mamba detection model framework during end-to-end training with a dynamic scaling loss function, wherein the loss function dynamically adjusts the weights of the classification loss and the regression loss according to the ground-truth area of the target. The dynamic scaling loss function comprises a regression loss weight function and a classification loss weight function. The regression loss weight function is a negative exponential decay function of the ground-truth target area A, with the calculation expression: [expression not reproduced in this text], wherein A is the area of the target region, the decay rate is controlled by a positive hyperparameter, ε is a small positive constant that prevents division by zero, i indexes the positive samples in one training iteration, and A_i is the area computed for the i-th target instance. The classification loss weight function is an exponential enhancement function of the ground-truth target area A, with the calculation expression: [expression not reproduced in this text], wherein A is the area of the target region, β_c and γ_c are positive hyperparameters, i indexes the positive samples in one training iteration, and A_i is the area computed for the i-th target instance.
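The area-dependent weighting in claim 1 can be illustrated with a small sketch. The two expressions themselves are not reproduced in this text, so the forms below are assumptions chosen only to match the wording (a regression weight that decays exponentially in A, and a classification weight with an exponential term in A). The function name `dynamic_scaling_weights`, the hyperparameter name `beta_r`, and the normalization by the mean positive-sample area are hypothetical.

```python
import math


def dynamic_scaling_weights(areas, beta_r=1.0, beta_c=1.0, gamma_c=1.0, eps=1e-6):
    """Per-target (regression, classification) loss weights from ground-truth areas.

    ASSUMED forms, not the patent's own expressions:
      w_reg(A) = exp(-beta_r * A / (mean(A_i) + eps))
      w_cls(A) = 1 + beta_c * exp(-gamma_c * A / (mean(A_i) + eps))
    """
    # Mean area over all positive samples in the iteration (assumed normalizer).
    mean_area = sum(areas) / len(areas)
    # eps keeps the division safe if every area is zero.
    scaled = [a / (mean_area + eps) for a in areas]
    w_reg = [math.exp(-beta_r * s) for s in scaled]
    w_cls = [1.0 + beta_c * math.exp(-gamma_c * s) for s in scaled]
    return w_reg, w_cls


# Three positive samples: a 2x2 dot, a 4x4 blob, and a 10x10 object.
w_reg, w_cls = dynamic_scaling_weights([4.0, 16.0, 100.0])
print([round(w, 3) for w in w_reg])
print([round(w, 3) for w in w_cls])
```

Under these assumed forms the largest target receives the smallest regression weight, which is one way to keep large targets from dominating the gradient. Whether the classification weight should rise or fall with area is not recoverable from the text, so that direction is also an assumption.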
2. The aerial punctiform target detection method based on high-frequency perception and dynamic scaling loss according to claim 1, wherein the high-resolution Mamba backbone network comprises an initial patch embedding layer and three hierarchical stages. The multi-scale spatial features containing high-frequency details are extracted as follows: the input image I is first downsampled by a convolution layer to obtain a feature map X_0, and the three hierarchical stages respectively output multi-scale feature maps with downsampling rates of 8, 16, and 32, wherein the output of the i-th stage is X_i = MambaStage_i(X_{i-1}), i = 1, 2, 3. Each MambaStage_i consists of a downsampling layer followed by N_i sequential HG-MFTBlocks, which model global context dependencies while preserving high-frequency details. Here X_{i-1} is the feature map output by the (i-1)-th hierarchical stage; X_i is the feature map output by the i-th hierarchical stage; for i = 1, 2, 3 the stage outputs correspond to the multi-scale feature maps output by the backbone network, at downsampling rates of 8, 16, and 32 relative to the input image; MambaStage_i denotes the overall mapping function of the i-th hierarchical stage, summarizing all operators inside the stage, and also its specific network structure; and N_i, the number of stacked HG-MFTBlocks in the i-th hierarchical stage, characterizes the depth of that stage.
3. The aerial punctiform target detection method based on high-frequency perception and dynamic scaling loss according to claim 1, wherein the PANet neck network fuses multi-scale features through a dual-path mechanism, specifically as follows: the top-down path starts from the deepest feature map C_5, progressively propagates high-level semantic information to the shallower layers through upsampling and lateral connections, and generates a semantically enhanced feature pyramid; the bottom-up path starts from the shallowest feature map P_3, injects low-level spatial details into the deeper layers through downsampling and connection, and finally outputs the multi-scale fused feature maps. The calculation process is as follows: [expressions not reproduced in this text], wherein N is an intermediate feature; the convolution at each level is used for channel alignment so that features can be fused; the scale-level index in the multi-scale feature pyramid distinguishes feature layers of different resolutions; P denotes the feature map output by the PANet top-down path at the indexed scale; the downsampling operator downsamples the higher-resolution intermediate feature to the spatial size of the current-scale feature; the intermediate fused feature map of the preceding scale is taken from the PANet bottom-up path; one convolution module for feature fusion and refinement represents the fusion transformation, and another represents the output refinement; and the index over the scale levels expresses that the outputs are generated by progressive fusion from shallow to deep layers along the bottom-up path.
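The dual-path fusion described in claim 3 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the patented network: the names `C3`, `C4`, `N3` to `N5`, the helper functions, the use of pointwise (1x1) channel mixing, addition in the top-down path, and concatenation in the bottom-up path are all hypothetical choices, and plain stride-2 subsampling stands in for a learned downsampling layer.

```python
import numpy as np


def conv1x1(x, weight):
    """Pointwise convolution: mix channels of a (C_in, H, W) map with a (C_out, C_in) matrix."""
    return np.einsum("oc,chw->ohw", weight, x)


def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def downsample2x(x):
    """Stride-2 subsampling of a (C, H, W) map (stand-in for a strided convolution)."""
    return x[:, ::2, ::2]


def panet(c3, c4, c5, lateral, fuse):
    """Top-down then bottom-up fusion; returns the fused maps (N3, N4, N5)."""
    # Top-down path: start at the deepest map and push semantics to shallower levels.
    p5 = conv1x1(c5, lateral[5])
    p4 = conv1x1(c4, lateral[4]) + upsample2x(p5)
    p3 = conv1x1(c3, lateral[3]) + upsample2x(p4)
    # Bottom-up path: start at the shallowest map and inject detail into deeper levels.
    n3 = p3
    n4 = conv1x1(np.concatenate([p4, downsample2x(n3)], axis=0), fuse[4])
    n5 = conv1x1(np.concatenate([p5, downsample2x(n4)], axis=0), fuse[5])
    return n3, n4, n5


rng = np.random.default_rng(0)
c3 = rng.standard_normal((8, 16, 16))   # stride-8 features
c4 = rng.standard_normal((16, 8, 8))    # stride-16 features
c5 = rng.standard_normal((32, 4, 4))    # stride-32 features
lateral = {3: rng.standard_normal((4, 8)),
           4: rng.standard_normal((4, 16)),
           5: rng.standard_normal((4, 32))}
fuse = {4: rng.standard_normal((4, 8)), 5: rng.standard_normal((4, 8))}
n3, n4, n5 = panet(c3, c4, c5, lateral, fuse)
print(n3.shape, n4.shape, n5.shape)
```

The lateral weights bring every level to a common channel width so that maps from different stages can be combined, which is the channel-alignment role the claim assigns to the per-level convolution.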
4. The aerial punctiform target detection method based on high-frequency perception and dynamic scaling loss according to claim 1, wherein the high-frequency refocusing module (HFRM) enhances the multi-scale fused feature map as follows: converting the multi-scale fused feature map to the frequency domain through a fast Fourier transform to obtain a complex spectrum tensor, and centering the complex spectrum tensor; separating the centered spectrum into a low-frequency component and a high-frequency component through a Hadamard product with a binary low-pass mask; designing a channel attention function that learns, from the magnitude spectrum of the high-frequency component, a channel-specific amplification factor W ∈ ℝ^C, that is, a real vector of dimension C whose i-th component is the intensity coefficient amplifying the high-frequency component of the i-th feature channel, wherein the amplification factor is learned adaptively by the channel attention function from the magnitude spectrum of the high-frequency component and is applied to every spatial position of the high-frequency component by broadcasting, so as to dynamically enhance the high-frequency information of different channels; recombining the enhanced high-frequency component with the original low-frequency component to obtain an enhanced spectrum; and converting the enhanced spectrum back to the spatial domain through inverse centering and an inverse fast Fourier transform to obtain a detail-enhanced multi-scale fused feature map.
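The frequency-domain steps of claim 4 map directly onto standard FFT routines. The NumPy sketch below is a hedged illustration: the circular shape of the low-pass mask, its `radius` parameter, and the names `low_pass_mask`, `hfrm`, and `gain_fn` are assumptions, and `gain_fn` merely stands in for the learned channel attention function.

```python
import numpy as np


def low_pass_mask(height, width, radius):
    """Binary mask that is 1 inside a centered disc of the given radius, else 0."""
    yy, xx = np.ogrid[:height, :width]
    dist = np.hypot(yy - height // 2, xx - width // 2)
    return (dist <= radius).astype(np.float64)


def hfrm(feature, gain_fn, radius):
    """Frequency-domain high-frequency enhancement of a (C, H, W) feature map.

    gain_fn maps the (C, H, W) high-frequency magnitude spectrum to a (C,)
    gain vector W (a placeholder for the learned channel attention).
    """
    _, height, width = feature.shape
    # FFT over the spatial axes, then shift the zero frequency to the center.
    spectrum = np.fft.fftshift(np.fft.fft2(feature, axes=(-2, -1)), axes=(-2, -1))
    mask = low_pass_mask(height, width, radius)
    low = spectrum * mask            # Hadamard product with the low-pass mask
    high = spectrum * (1.0 - mask)   # complementary high-frequency component
    gains = gain_fn(np.abs(high))    # channel-specific amplification factors W
    enhanced = low + gains[:, None, None] * high   # broadcast over H and W
    # Undo the centering and return to the spatial domain.
    out = np.fft.ifft2(np.fft.ifftshift(enhanced, axes=(-2, -1)), axes=(-2, -1))
    return out.real


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))
unchanged = hfrm(x, lambda mag: np.ones(mag.shape[0]), radius=2)
boosted = hfrm(x, lambda mag: np.full(mag.shape[0], 2.0), radius=2)
print(np.allclose(unchanged, x), np.allclose(boosted, x))
```

With unit gains the module reduces to an FFT followed by its inverse, so the input is recovered, which is a convenient sanity check. Because the zero-frequency bin lies inside the low-pass mask, amplifying the high-frequency component leaves each channel's spatial mean untouched.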
5. The aerial punctiform target detection method based on high-frequency perception and dynamic scaling loss according to claim 4, wherein the channel attention function comprises: applying global average pooling to the magnitude spectrum of the high-frequency component to extract a channel descriptor; and processing the channel descriptor with a multi-layer perceptron network composed of one-dimensional convolution layers to generate the channel-specific amplification factor W ∈ ℝ^C, that is, a real vector of dimension C whose i-th component is the intensity coefficient amplifying the high-frequency component of the i-th feature channel, where ℝ denotes the set of real numbers and C denotes the number of channels of the feature map.
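A minimal sketch of the channel attention function in claim 5 follows. The claim specifies only global average pooling followed by a perceptron built from one-dimensional convolutions. The two-layer depth, the ReLU, the `1 + sigmoid` output that keeps every gain between 1 and 2, and the names `conv1d_same` and `channel_gains` are assumptions made for illustration.

```python
import numpy as np


def conv1d_same(signal, kernel):
    """1-D cross-correlation with zero padding; output length equals input length."""
    pad = len(kernel) // 2
    padded = np.pad(signal, pad)
    return np.array([padded[i:i + len(kernel)] @ kernel for i in range(len(signal))])


def channel_gains(high_magnitude, kernel1, kernel2):
    """Map a (C, H, W) high-frequency magnitude spectrum to a (C,) gain vector W."""
    # Global average pooling over the two frequency axes gives one descriptor per channel.
    descriptor = high_magnitude.mean(axis=(1, 2))
    # Two 1-D convolutions slide across the channel axis (assumed depth and ReLU).
    hidden = np.maximum(conv1d_same(descriptor, kernel1), 0.0)
    logits = conv1d_same(hidden, kernel2)
    # Assumed output activation: gains in (1, 2), so high frequencies are never attenuated.
    return 1.0 + 1.0 / (1.0 + np.exp(-logits))


rng = np.random.default_rng(0)
magnitude = np.abs(rng.standard_normal((4, 5, 5)))
gains = channel_gains(magnitude, np.array([0.2, 0.5, 0.2]), np.array([0.1, 0.3, 0.1]))
print(gains.shape)
```

In a trained model the kernels would be learned parameters; fixed values are used here only so the sketch runs on its own.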
6. An aerial punctiform target detection system based on high-frequency perception and dynamic scaling loss, characterized by comprising the following modules: a feature extraction module for receiving an aerial image and extracting multi-scale spatial features containing high-frequency details using a high-resolution Mamba backbone network; a feature fusion module for fusing the multi-scale spatial features through a PANet neck network to generate a multi-scale fused feature map; a feature enhancement module for enhancing the multi-scale fused feature map using a high-frequency refocusing module (HFRM), wherein the enhancement comprises converting the multi-scale fused feature map to the frequency domain through an FFT, separating low-frequency and high-frequency components with a binary mask, dynamically amplifying the high-frequency components through a channel attention mechanism, and returning the enhanced, recombined spectrum to the spatial domain through an IFFT; a target detection output module that takes the enhanced multi-scale fused feature map as input and predicts the position and category of the target through a decoupled detection head; and a global detection optimization module for optimizing the entire HFA-Mamba detection model framework during end-to-end training with a dynamic scaling loss function, wherein the loss function dynamically adjusts the weights of the classification loss and the regression loss according to the ground-truth area of the target. The dynamic scaling loss function comprises a regression loss weight function and a classification loss weight function. The regression loss weight function is a negative exponential decay function of the ground-truth target area A, with the calculation expression: [expression not reproduced in this text], wherein A is the area of the target region, the decay rate is controlled by a positive hyperparameter, ε is a small positive constant that prevents division by zero, i indexes the positive samples in one training iteration, and A_i is the area computed for the i-th target instance. The classification loss weight function is an exponential enhancement function of the ground-truth target area A, with the calculation expression: [expression not reproduced in this text], wherein A is the area of the target region, β_c and γ_c are positive hyperparameters, i indexes the positive samples in one training iteration, and A_i is the area computed for the i-th target instance.

Description

Aerial punctiform target detection method and system based on high-frequency perception and dynamic scaling loss

Technical Field

The invention belongs to the technical field of computer vision and remote sensing image processing, and particularly relates to an aerial punctiform target detection method and system based on high-frequency perception and dynamic scaling loss.

Background

With the popularization of unmanned aerial vehicle technology, pedestrian detection in aerial images has become vital in fields such as public safety and emergency search and rescue. However, viewed from high altitude, human targets degrade into dots, short line segments, or small rectangles owing to extreme feature compression, and differ from the background only by weak intensity or texture cues, so the performance of conventional convolutional neural network detectors that rely on mid-level features drops sharply. This phenomenon, known as "semantic collapse," has become a central challenge in target detection. Although existing methods attempt to improve small-target detection through multi-scale feature fusion, attention mechanisms, or super-resolution techniques, two fundamental defects remain when processing aerial punctiform targets. First, the inherent low-pass filtering effect of deep networks weakens key high-frequency details layer by layer, yet the recognition of punctiform targets depends precisely on these fine edges and gray-level changes. Second, general-purpose training strategies and loss functions do not adequately account for extreme scale differences and annotation noise, so the loss gradients of large targets dominate the optimization process, while small targets produce sharply fluctuating gradients from slight annotation deviations, making training extremely unstable.
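The sensitivity of tiny targets to slight annotation deviations can be checked with a short calculation. The sketch below computes the intersection over union (IoU) between a square ground-truth box and the same box shifted by one pixel; the helper names `iou` and `iou_after_shift` and the chosen box sizes are illustrative assumptions.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def iou_after_shift(side, shift=1.0):
    """IoU between a side x side box and the same box moved `shift` pixels along x."""
    truth = (0.0, 0.0, side, side)
    moved = (shift, 0.0, side + shift, side)
    return iou(truth, moved)


for side in (2, 4, 64):
    print(side, round(iou_after_shift(side), 4))
```

A one-pixel shift leaves a 64 x 64 box with an IoU of about 0.97, but cuts a 2 x 2 box to one third, so the same labeling error that is negligible for a large target moves a punctiform target across typical matching thresholds.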
Therefore, how to effectively represent and amplify high-frequency features, and how to achieve stable and efficient supervised training on tiny targets, have become the two main bottlenecks restricting the performance of contemporary aerial pedestrian detection systems.

In the prior art, CNN-based detectors essentially act as low-pass filters because of their stacked convolution and downsampling layers, attenuating the high-frequency signals in the image, whereas the fine edges and gray-level variations of punctiform targets are exactly the high-frequency features needed to identify them. In addition, conventional regression loss functions (e.g., CIoU loss) treat targets of all scales identically, so a slightly larger target produces gradients orders of magnitude higher because of its wider absolute error range and thus dominates the optimization process, drowning out the supervisory signal for tiny targets. Meanwhile, the bounding-box annotations of punctiform targets are inherently ambiguous: an annotation offset of only one pixel can cause a drastic change in the intersection over union (IoU) value, producing extremely unstable gradients, so label noise seriously hinders effective convergence of the model.

Therefore, the invention provides an aerial punctiform target detection method and system based on high-frequency perception and dynamic scaling loss.

Disclosure of Invention

The invention aims to overcome the deficiencies of the prior art and to solve at least one of the technical problems presented in the background.
The technical scheme adopted to solve the technical problems is an aerial punctiform target detection method based on high-frequency perception and dynamic scaling loss, comprising the following steps: receiving an aerial image and extracting multi-scale spatial features containing high-frequency details using a high-resolution Mamba backbone network; fusing the multi-scale spatial features through a PANet neck network to generate a multi-scale fused feature map; enhancing the multi-scale fused feature map using a high-frequency refocusing module (HFRM), wherein the enhancement comprises converting the multi-scale fused feature map to the frequency domain through an FFT, separating low-frequency and high-frequency components with a binary mask, dynamically amplifying the high-frequency components through a channel attention mechanism, and returning the enhanced, recombined spectrum to the spatial domain through an IFFT; and taking the enhanced multi-scale fused feature map as input and predicting the position and category of the target through a decoupled detection head.

As a further technical scheme of the invention, the high-resolution Mamba backbone network comprises an initial patch embedding layer and three hierarchical stages. The multi-scale spatial features containing high-frequency details are extracted as follows: the input image I is first subjected to initial downsampling by a convolut