CN-122024113-A - Aerial photographing target detection method based on asymmetric implicit field and continuous decoding
Abstract
The invention discloses an aerial target detection method based on an asymmetric implicit field and continuous decoding, belonging to the technical field of computer vision. The method first extracts multi-scale feature tensors independently with a backbone network and caches them in fixed form; next, it generates query coordinates in a continuous two-dimensional space and applies high-frequency position encoding; it then projects the query coordinates onto the features of each scale for bilinear sampling, combines a high-frequency variance indicator with asymmetric weights adaptively assigned by a gating network, and fuses the samples into a single high-dimensional feature, thereby constructing an asymmetric implicit feature field; finally, it concatenates the position encoding with the fused feature, feeds the result to an implicit coordinate decoder, and outputs the target category and bounding-box size in parallel in the continuous coordinate domain. Because no global feature up-sampling is needed, the computational cost is greatly reduced, the quantization error introduced by traditional discrete grids is eliminated, and sub-pixel-level localization of extremely small aerial targets is achieved.
Inventors
- TANG XIN
- ZHU HU
- XU GUOXIA
- DENG LIZHEN
Assignees
- Nanjing University of Posts and Telecommunications (南京邮电大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-13
Claims (10)
- 1. An aerial target detection method based on an asymmetric implicit field and continuous decoding, characterized by comprising the following steps: Step S1, independently extracting a plurality of original feature tensors of different scales from an unmanned aerial vehicle aerial input image with a backbone network, and keeping them at their original resolutions, cached in GPU memory as static feature base maps for cross-scale sampling; Step S2, sampling in the two-dimensional continuous physical space corresponding to the input image to generate floating-point query coordinates, and mapping the floating-point query coordinates into high-dimensional high-frequency position feature vectors through a position encoding module; Step S3, performing cross-scale bilinear interpolation sampling on the static feature base map of each scale with the floating-point query coordinates, extracting a local feature vector at each scale, computing the local high-frequency feature variance as a gradient indicator, adaptively assigning asymmetric weights through a gating network, and fusing the per-scale local feature vectors into a single high-dimensional fused feature vector; Step S4, channel-concatenating the high-frequency position feature vector and the high-dimensional fused feature vector, feeding the result into a continuous-space implicit coordinate decoder, and outputting, in parallel in the continuous coordinate domain, the target category confidence and the continuous bounding-box size prediction corresponding to the floating-point query coordinate; Step S5, computing a composite total loss comprising a classification focal loss and a bounding-box regression loss from the network predictions and the ground-truth labels, and iteratively updating the network parameters by back-propagation; and Step S6, in the inference stage, performing parallel inference on an unknown single-frame aerial image with a dense, uniform floating-point coordinate grid, and outputting continuous single-target bounding boxes after threshold filtering and non-maximum suppression.
- 2. The aerial target detection method based on an asymmetric implicit field and continuous decoding according to claim 1, wherein the specific process of extracting the original feature tensors in step S1 is as follows: acquiring an unmanned aerial vehicle aerial RGB image as original input data, uniformly scaling it with bilinear interpolation to a standard size of 1024 pixels wide by 1024 pixels high, and normalizing the pixel values to the range 0 to 1 to obtain an input image tensor; feeding the input image tensor into a pre-trained ResNet-50 backbone feature extraction network from which the feature-pyramid up-sampling module has been truncated and discarded, and taking feature maps directly from the conv3_x, conv4_x, and conv5_x layers of ResNet-50; passing each of the three extracted feature maps through a standard convolutional layer with 256 output channels for dimensionality reduction, obtaining the final underlying support feature tensors, namely a shallow feature tensor, a middle-layer feature tensor, and a deep feature tensor; the three feature tensors keep their original resolutions independently and are cached in GPU memory.
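The truncated-backbone layout of claim 2 can be sanity-checked with simple stride arithmetic. A minimal sketch, assuming the standard ResNet-50 strides of 8, 16, and 32 for conv3_x, conv4_x, and conv5_x (the claim itself elides the exact tensor dimensions):

```python
# Feature-map shapes from a truncated ResNet-50, assuming the standard
# strides of 8/16/32 for conv3_x/conv4_x/conv5_x and a convolution that
# maps every level to 256 channels, as described in claim 2.

def feature_shapes(h, w, out_channels=256, strides=(8, 16, 32)):
    """Return (C, H, W) for each retained backbone level."""
    return [(out_channels, h // s, w // s) for s in strides]

shapes = feature_shapes(1024, 1024)
print(shapes)  # [(256, 128, 128), (256, 64, 64), (256, 32, 32)]
```

With a 1024x1024 input this yields the shallow, middle, and deep base maps that are cached at their native resolutions for later point-wise sampling.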
- 3. The aerial target detection method based on an asymmetric implicit field and continuous decoding according to claim 2, wherein the specific process of generating the floating-point query coordinates and the mapped high-frequency position feature vectors in step S2 is as follows: in the two-dimensional continuous coordinate system corresponding to the input image tensor, generating a training query point set comprising positive-sample coordinate points uniformly and randomly sampled in the continuous space inside the ground-truth target bounding boxes and negative-sample coordinate points uniformly and randomly sampled in the continuous background space outside the ground-truth boxes; every floating-point query coordinate p in the point set is strictly normalized by a linear transformation to the [-1, 1] interval; each generated floating-point query coordinate is fed independently into the Fourier position encoding module, which defines a position encoding function γ(p) computed as γ(p) = (sin(2^0 π p), cos(2^0 π p), …, sin(2^(L-1) π p), cos(2^(L-1) π p)), where L is the preset number of high-frequency bands and is set to 16.
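The Fourier position encoding of claim 3 can be sketched directly. This is a minimal NumPy illustration of the standard NeRF-style encoding with L = 16 bands; it assumes each of the two coordinates is expanded independently, so a 2-D point yields a 2 * 2 * 16 = 64-dimensional vector:

```python
import numpy as np

def fourier_encode(p, num_bands=16):
    """NeRF-style positional encoding of a 2-D point p in [-1, 1]^2.

    Each coordinate is expanded into sin/cos pairs at frequencies
    2^0 * pi ... 2^(L-1) * pi, giving a 2 * 2 * L dimensional vector.
    """
    p = np.asarray(p, dtype=np.float64)            # shape (2,)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi  # (L,)
    angles = p[:, None] * freqs[None, :]           # (2, L)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1).ravel()

vec = fourier_encode([0.25, -0.5])
print(vec.shape)  # (64,)
```

The high-frequency bands let a coordinate-conditioned decoder represent sharp, sub-pixel spatial variation that a raw (x, y) input could not.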
- 4. The aerial target detection method based on an asymmetric implicit field and continuous decoding according to claim 3, wherein the specific process of generating the single high-dimensional fused feature vector in step S3 is as follows: using a set of coordinate projection matrices, the continuous floating-point query coordinate is mapped separately into the spatial coordinate systems of the shallow feature tensor, the middle-layer feature tensor, and the deep feature tensor, and the corresponding relative floating-point query coordinates at each scale are computed; bilinear interpolation sampling is performed at the relative floating-point coordinates to extract three local feature vectors f_s, f_m, f_d at the corresponding scales; centered on the relative floating-point query coordinate, a 3×3 feature block is cropped from the shallow feature tensor, and the variance of the feature vectors of its 9 pixels is computed as the high-frequency indicator; the three extracted feature vectors are concatenated and fed into the gating network to compute the fusion weights: (α_s, α_m, α_d) = Softmax(W_g · [f_s; f_m; f_d] + b_g), where [·; ·; ·] denotes the tensor concatenation operation, W_g is the learnable weight matrix of the gating network's single fully connected layer, b_g is a learnable bias vector, Softmax is the normalizing activation function, and α_s, α_m, α_d are the adaptive weighting coefficients of the shallow, middle, and deep layers, respectively; using the computed scalar weights, element-wise weighted summation is performed in the continuous feature domain to generate the high-dimensional fused feature vector dedicated to the floating-point query coordinate: f_fuse = α_s f_s + α_m f_m + α_d f_d.
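The sampling-and-fusion step of claim 4 can be illustrated with a toy NumPy sketch. The feature pyramid, channel count, and gating weights below are random stand-ins (the claim does not fix them), and the 3×3 variance is computed separately as the high-frequency indicator, since the claim elides exactly how it enters the gate:

```python
import numpy as np

rng = np.random.default_rng(0)

def bilinear_sample(fmap, x, y):
    """Sample a (C, H, W) feature map at fractional pixel coords (x, y)."""
    C, H, W = fmap.shape
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, W - 1), min(y0 + 1, H - 1)
    dx, dy = x - x0, y - y0
    return ((1 - dx) * (1 - dy) * fmap[:, y0, x0]
            + dx * (1 - dy) * fmap[:, y0, x1]
            + (1 - dx) * dy * fmap[:, y1, x0]
            + dx * dy * fmap[:, y1, x1])

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fuse(feats, W_g, b_g):
    """Gated asymmetric fusion: softmax(W_g [f_s; f_m; f_d] + b_g)."""
    weights = softmax(W_g @ np.concatenate(feats) + b_g)  # (3,)
    return sum(w * f for w, f in zip(weights, feats)), weights

C = 8
maps = [rng.standard_normal((C, s, s)) for s in (32, 16, 8)]  # toy pyramid
feats = [bilinear_sample(m, 3.7, 5.2) for m in maps]
# High-frequency indicator: variance of the 3x3 patch around the
# (rounded) query point on the shallow map.
variance = maps[0][:, 4:7, 3:6].var()
W_g = rng.standard_normal((3, 3 * C)) * 0.1
fused, weights = fuse(feats, W_g, np.zeros(3))
print(fused.shape, round(weights.sum(), 6))  # (8,) 1.0
```

Because the weights are produced per query point, a high-variance (edge-rich) location can lean on the shallow map while smooth background leans on deeper semantics, which is the asymmetry the claim describes.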
- 5. The aerial target detection method based on an asymmetric implicit field and continuous decoding according to claim 4, wherein the specific process of mapping and outputting the predicted values in parallel in step S4 is as follows: for each query coordinate, the position encoding γ(p) and the high-dimensional fused feature vector are concatenated along the channel dimension to generate a joint input tensor; the joint input tensor is fed into an implicit coordinate decoder formed by stacking 3 fully connected layers, which outputs a deep hidden state vector after LayerNorm normalization and nonlinear mapping by ReLU activation; the deep hidden state vector is fed into two parallel single-layer fully connected branch networks: the classification prediction branch outputs a class logit vector whose dimension equals the total number of aerial target categories, and the bounding-box regression prediction branch outputs, taking the current floating-point query coordinate as the target center, the predicted relative width and relative height of the corresponding continuous bounding box as fractions of the whole image.
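The two-headed implicit decoder of claim 5 can be sketched as a small NumPy MLP. The hidden width, class count, and the sigmoid used to keep the relative width/height in (0, 1) are illustrative assumptions; the claim fixes only the 3-layer stack, LayerNorm + ReLU, and the two parallel heads:

```python
import numpy as np

rng = np.random.default_rng(1)

def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def decoder(joint, hidden_ws, cls_w, reg_w):
    """3-layer MLP with LayerNorm + ReLU, then two parallel heads."""
    h = joint
    for W in hidden_ws:
        h = np.maximum(layer_norm(W @ h), 0.0)   # FC -> LayerNorm -> ReLU
    logits = cls_w @ h                            # (C,) class logits
    wh = 1.0 / (1.0 + np.exp(-(reg_w @ h)))       # (2,) relative w, h in (0, 1)
    return logits, wh

D_pos, D_feat, D_hid, n_classes = 64, 8, 32, 10
joint = rng.standard_normal(D_pos + D_feat)       # concat of encoding + fused
hidden = [rng.standard_normal((D_hid, D_pos + D_feat)) * 0.1,
          rng.standard_normal((D_hid, D_hid)) * 0.1,
          rng.standard_normal((D_hid, D_hid)) * 0.1]
logits, wh = decoder(joint, hidden,
                     rng.standard_normal((n_classes, D_hid)) * 0.1,
                     rng.standard_normal((2, D_hid)) * 0.1)
print(logits.shape, wh.shape)  # (10,) (2,)
```

Since the decoder is a pure function of a continuous coordinate (via its encoding), it can be queried at any floating-point location, which is what removes the discrete-grid quantization step.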
- 6. The aerial target detection method based on an asymmetric implicit field and continuous decoding according to claim 5, wherein the specific process of computing the composite total loss in step S5 is as follows: the total loss function is constructed as L_total = λ_cls · L_cls + λ_reg · L_reg, where λ_cls is the classification loss weight coefficient, λ_reg is the regression loss weight coefficient, L_cls is the classification focal loss, and L_reg is the bounding-box regression loss; the classification focal loss L_cls is computed as L_cls = -(1 / N_pos) Σ_{i=1..N} Σ_{c=1..C} [ α (1 - p_{i,c})^γ y_{i,c} log p_{i,c} + (1 - α) p_{i,c}^γ (1 - y_{i,c}) log(1 - p_{i,c}) ], where N_pos is the total number of positive-sample query points, N is the total number of query points generated in a single training pass, C is the total number of aerial target categories, y_{i,c} is the binary ground-truth indicator, p_{i,c} is the prediction probability obtained by mapping the class logit vector output by the classification branch through the Sigmoid activation function, α is the positive/negative sample balance factor, and γ is the hard-sample modulating parameter; the bounding-box regression loss L_reg is computed as L_reg = (1 / |P|) Σ_{i ∈ P} [ SmoothL1(w_i - w_i*) + SmoothL1(h_i - h_i*) ], where P is the index set of all positive-sample query points, w_i* and h_i* are the relative width and height label values of the corresponding ground-truth bounding box, w_i is the relative width prediction, h_i is the relative height prediction, and SmoothL1 is the smoothed absolute error function.
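The composite loss of claim 6 follows the standard focal-loss and smooth-L1 forms; a minimal sketch (the weight coefficients, α = 0.25, γ = 2, and the sample values are illustrative defaults, not values fixed by the claim):

```python
import numpy as np

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-9):
    """Binary focal loss averaged over positive samples (standard form)."""
    pos = -alpha * (1 - p) ** gamma * y * np.log(p + eps)
    neg = -(1 - alpha) * p ** gamma * (1 - y) * np.log(1 - p + eps)
    n_pos = max(y.sum(), 1.0)
    return (pos + neg).sum() / n_pos

def smooth_l1(pred, target, beta=1.0):
    """Smoothed absolute error: quadratic below beta, linear above."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

p = np.array([0.9, 0.2, 0.7, 0.1])   # sigmoid probabilities per query point
y = np.array([1.0, 0.0, 1.0, 0.0])   # ground-truth binary indicators
total = 1.0 * focal_loss(p, y) + 1.0 * smooth_l1(
    np.array([0.10, 0.08]), np.array([0.12, 0.05]))
print(round(total, 4))
```

The (1 - p)^γ modulation down-weights easy background points, which matters here because the dense continuous sampling of step S2 produces far more negatives than positives.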
- 7. The aerial target detection method based on an asymmetric implicit field and continuous decoding according to claim 6, wherein the specific process of outputting continuous single-target bounding boxes in step S6 is as follows: receiving a real-time single-frame aerial image, feeding it into the trained backbone network, and extracting and caching the fixed shallow feature tensor, middle-layer feature tensor, and deep feature tensor; directly in the normalized two-dimensional coordinate system, generating with a fixed step of 0.02 a floating-point coordinate grid matrix containing 10000 fixed continuous coordinate points to form the inference query point set; feeding the inference query point set in parallel batches through the position encoding module to obtain high-frequency features, then through the asymmetric cross-scale implicit feature field construction module for dynamic sampling and weighted fusion, and feeding the fusion result into the implicit coordinate decoder to compute, in parallel, the class probability predictions and the width-height predictions corresponding to all fixed continuous coordinate points; traversing the class probability prediction distribution of all fixed continuous coordinate points, discarding background points whose maximum probability is below the preset confidence threshold of 0.45, restoring, for each retained target-center candidate point combined with its predicted relative width and relative height, the four absolute corner coordinates of the target's continuous bounding box, removing redundant prediction boxes with a non-maximum suppression algorithm whose intersection-over-union threshold is set to 0.5, and outputting the single-target category names and continuous-coordinate bounding-box data.
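The inference-stage grid and post-processing of claim 7 can be sketched as follows. The grid assumes the [-1, 1) normalized range, under which a 0.02 step gives exactly the 100 × 100 = 10000 points the claim states; the boxes and scores are toy values:

```python
import numpy as np

def make_grid(step=0.02, low=-1.0, high=1.0):
    """Dense query grid; with step 0.02 on [-1, 1) this yields
    100 x 100 = 10000 points, matching the count in claim 7."""
    n = int(round((high - low) / step))
    axis = low + step * np.arange(n)
    xs, ys = np.meshgrid(axis, axis)
    return np.stack([xs.ravel(), ys.ravel()], axis=1)  # (10000, 2)

def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def nms(boxes, scores, conf_thr=0.45, iou_thr=0.5):
    """Threshold filtering followed by greedy non-maximum suppression."""
    order = [i for i in np.argsort(-scores) if scores[i] >= conf_thr]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thr for j in keep):
            keep.append(i)
    return keep

grid = make_grid()
print(len(grid))  # 10000
boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.6])
print(nms(boxes, scores))  # [0, 2]
```

Each retained grid point becomes a box center; its predicted relative width and height are then expanded to the four absolute corners before the NMS pass above.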
- 8. An aerial target detection apparatus based on an asymmetric implicit field and continuous decoding, for implementing the aerial target detection method based on an asymmetric implicit field and continuous decoding as claimed in any one of claims 1 to 7, the apparatus comprising: a multi-scale feature extraction and caching module, configured to independently extract a plurality of original feature tensors of different scales from the input image with a truncated backbone network, and keep them at their original resolutions, cached in GPU memory for subsequent dynamic sampling; a continuous coordinate generation and encoding module, configured to sample in the continuous two-dimensional space of the image to generate floating-point query coordinates that flow to the feature field construction module, and to pass them through the position encoding module to produce high-dimensional position encodings that flow to the joint decoding module; an asymmetric feature field construction module, configured to receive the query coordinates, project them onto each cached scale feature tensor for bilinear interpolation sampling, compute the local high-frequency feature variance, output asymmetric weights through the gating network, and fuse the samples into a single high-dimensional feature; an implicit joint decoding module, configured to receive and concatenate the high-dimensional position encoding and the high-dimensional fused feature, feed them into the continuous-space implicit coordinate decoder, and output in parallel, in the continuous coordinate domain, the target category confidence and the continuous bounding-box size prediction corresponding to the floating-point query coordinate; and a model training and inference output module, configured to iteratively update all network parameters by back-propagation based on the composite total loss, perform parallel inference on an unknown single-frame aerial image with a dense, uniform floating-point coordinate grid at inference time, and output the final continuous single-target bounding boxes after threshold filtering and non-maximum suppression.
- 9. An electronic device, comprising a memory for storing computer program code and a processor for invoking and executing the computer program code to implement the aerial target detection method based on an asymmetric implicit field and continuous decoding as claimed in any one of claims 1 to 7.
- 10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor of an electronic device, implements the aerial target detection method based on an asymmetric implicit field and continuous decoding as claimed in any one of claims 1 to 7.
Description
Aerial photographing target detection method based on asymmetric implicit field and continuous decoding

Technical Field

The invention belongs to the technical field of computer vision and deep learning, in particular to unmanned aerial vehicle aerial image processing and target recognition technology, and specifically relates to an aerial target detection method based on an asymmetric implicit field and continuous decoding.

Background

Unmanned aerial vehicle aerial target detection is the core link in guaranteeing the efficient and accurate operation of high-altitude vision systems such as intelligent inspection, high-altitude reconnaissance, and disaster rescue. Its main task is to accurately locate and identify solid objects of interest (e.g., distant vehicles, pedestrians, and small facilities) in complex background images with high resolution and large viewing angles. However, the effectiveness of existing detection methods depends heavily on their ability to extract and resolve the features of targets with extreme scale variation and very small size. In a real aerial application environment, extracting only the global semantic features of an image is far from enough; the key is how to improve the recall rate of tiny targets, reduce computational power consumption, and guarantee real-time inference on edge devices (unmanned aerial vehicles) through optimal feature reconstruction and decoding decisions. To address this, existing target detection techniques, represented by mainstream deep convolutional neural networks, mainly adopt static detection strategies based on feature pyramids (FPN) and discrete grid anchor boxes.
While these approaches achieve automated target localization to some extent in conventional scenarios, they have two key limitations when dealing with modern complex aerial imaging scenes and the extremely limited computing power of on-board platforms. First, existing methods mostly rely on global, symmetric, dense cross-scale feature evaluation. To compensate for the loss of tiny-target features during the deep down-sampling of a backbone network, existing detectors mainly perform top-down global deconvolution and up-sampling through a feature pyramid network at decision time. This approach ignores the extremely sparse, local, high-frequency characteristics exhibited by small targets in aerial images. Blindly stretching and forcibly aligning entire feature maps not only introduces a large amount of interpolation noise (which further smooths and blurs weak small-target features) but also generates enormous redundant computation and memory overhead. A dense fusion model that performs well on a compute-rich server often drives the airborne computing node of an unmanned aerial vehicle to computational exhaustion and severe inference delay because of its extremely high complexity, and existing methods cannot extract cross-scale features efficiently and on demand. Second, most existing methods employ rigid decoding logic based on discrete pixel grids. The detection strategy consists of a series of predefined discrete grid divisions or rigid anchor boxes, for example forcing the model to predict within which discrete pixel cell the target center falls. This rigid strategy is ill-suited to the extremely small and continuously varying physical sizes of targets in aerial images, making localization decisions sub-optimal.
More importantly, this is a "quantization compromise" imposed by computational power and architectural limitations. Bounded by the physical granularity of the deep feature-map resolution, when a very small target is smaller than or comparable to a single grid cell, the target is very likely to "fall into" the grid gaps or boundaries, creating irreversible quantization errors and severe localization drift. The inherent drawbacks of such an underlying framework cannot fundamentally break through the physical limit of discrete pixel-level accuracy. Therefore, how to design an intelligent target detection method that surpasses traditional dense feature-pyramid stretching, accurately and asymmetrically fuses cross-scale high-frequency features, and at the same time breaks the constraint of discrete grids to make sub-pixel-level optimal localization decisions in a continuous coordinate domain, is a technical problem to be solved in the current unmanned aerial vehicle aerial vision field.

Disclosure of Invention

The invention aims to solve the problems that the features of very small targets are easily lost in deep networks, the up-sampling of the traditional feature pyramid (FPN) incurs huge computational cost, and the positioning is inaccurate due to quantization errors of