CN-121999206-A - Unmanned aerial vehicle small target detection method based on local window cross attention up-sampling and dynamic combined large kernel selection and application thereof

Abstract

The application relates to the technical field of deep learning, and in particular to an unmanned aerial vehicle (UAV) small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection, and to an application thereof. A dynamic combined large kernel selection module is designed to dynamically select the spatial feature information that favors small target detection; through selective convolution it adaptively expands the receptive field while preserving local detail, so that both the recall rate and the accuracy of detection are taken into account in the UAV small target detection task. The method addresses the problem of providing a small target detection method that matches the computing power available on existing UAVs, so as to reduce the number of parameters and the amount of computation in the detection process.

Inventors

  • YUN LIJUN
  • Chao Manxin
  • JIN XUESONG
  • CHEN ZAIQING
  • PENG CAN

Assignees

  • Yunnan Normal University (云南师范大学)

Dates

Publication Date
20260508
Application Date
20260317

Claims (9)

  1. An unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection, characterized in that the method is applied to an unmanned aerial vehicle, the unmanned aerial vehicle comprises a deep learning network, a neck network of the deep learning network comprises a local window cross-attention upsampling module and a dynamic combined large kernel selection module, and the method comprises the following steps:
     S10, acquiring an image dataset that is captured by the unmanned aerial vehicle and contains a plurality of small targets;
     S20, inputting the image dataset into the local window cross-attention upsampling module for upsampling, which comprises: determining high-resolution features and low-resolution features in the image dataset; performing point-wise convolution and linear deformation on the high-resolution features and the low-resolution features in parallel; generating a query matrix, a key matrix and a value matrix by depthwise separable three-dimensional convolution of the resulting preprocessed high-resolution features and preprocessed low-resolution features; and, based on the query matrix, the key matrix and the value matrix, performing self-attention weighted aggregation within a local window of size […] to obtain a high-low resolution combined feature, wherein […] denotes the last dimension into which all pixels within the window are expanded;
     S30, inputting the image dataset into the dynamic combined large kernel selection module, which comprises a plurality of depthwise separable convolutions combined in series: extracting the image dataset at multiple scales to obtain multi-scale features; performing attention modeling on the obtained multi-scale features in the channel dimension and the spatial dimension in sequence and multiplying the results element by element to obtain a weight map; and multiplying the weight map element by element with each multi-scale feature to obtain a selection feature; and
     S40, performing small target detection based on the high-low resolution combined feature and the selection feature, so as to mark the target position and the target category of each small target in the image dataset.
  2. The unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection of claim 1, wherein in S20, performing point-wise convolution and linear deformation on the high-resolution features and the low-resolution features in parallel comprises:
     S21, defining the high-resolution features […] and the low-resolution features […];
     S22, adjusting the number of channels of the high-resolution features and the low-resolution features by a point-wise convolution operation, and performing linear deformation […] to obtain the preprocessed high-resolution features […] and the preprocessed low-resolution features […]: […]; […]; […]; […];
     wherein […] and […] each denote a number of channels after point-wise convolution; […] denotes the sampling rate; […] denotes the local window corresponding to each low-resolution position; and […] indicates that all pixels within the window are expanded into the last dimension, […].
  3. The unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection of claim 2, wherein in S20, the query matrix Q, the key matrix K and the value matrix V are expressed as: […]; […]; […]; where […] denotes a depthwise separable three-dimensional convolution with a kernel size of 1×1; […] denotes the preprocessed high-resolution features; and […] denotes the preprocessed low-resolution features.
  4. The unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection of claim 3, wherein in S20, performing, based on the query matrix, the key matrix and the value matrix, self-attention weighted aggregation within a local window of size […] to obtain the high-low resolution combined feature comprises:
     S23, adjusting the dimension order of the query matrix Q, the key matrix K and the value matrix V as follows: […], so that the […] dimension participates in the attention score calculation while the […] dimension is used for broadcast parallel computation;
     S24, constructing an attention score matrix […] from the key matrix K and the query matrix Q: […]; where d denotes the dimension of the feature vector, and […] is used as a scaling factor to scale the similarity between the query vectors and the key vectors, preventing the numerical instability caused by excessively large inner products;
     S25, taking the product of the attention score matrix […] and the value matrix V as the processed feature […]: […]; wherein the […] dimension is the channel dimension;
     S26, passing the processed feature […] through linear deformation […] and pixel rearrangement […] respectively to obtain the high-resolution feature […]: […]; […]; wherein […];
     S27, applying the same operations as in step S26 to the value matrix V to obtain the position encoding […], and combining it with […] to output the high-low resolution combined feature […]: […]; wherein […] indicates that the input first passes through a multi-layer perceptron and is then joined to the input by a residual connection, and […] denotes the position encoding.
  5. The unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection of claim 1, wherein in S30, the multi-scale feature is expressed as: […]; where […] is the multi-scale feature; […] is the feature of the input image dataset; […] denotes the depthwise separable convolution of the i-th layer, with a size of 3×3; […] denotes that a 1×1 convolution is used at the first layer for channel transformation; and […] denotes the serial combination of convolutions.
  6. The unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection of claim 5, wherein in S30, performing attention modeling on the obtained multi-scale features in the channel dimension and the spatial dimension respectively and multiplying the results element by element to obtain the weight map comprises:
     S31, for the multi-scale features […], generating a key matrix […] and a value matrix […] separately using 1×1 convolutions: […]; […];
     S32, using the Softmax activation function to generate the spatial-dimension weights of each channel of the key matrix […], and matrix-multiplying them with the value matrix […] to obtain the global feature […] of each channel: […]; where […] denotes a linear transformation and […] denotes matrix multiplication; and, in the spatial dimension, using global average pooling […] and max pooling […] to generate […], which are aggregated into a single-channel spatial weight map SW using a convolutional layer: […];
     S33, multiplying the spatial weight map SW element by element with the global feature […], and generating, through the Sigmoid activation function, a weight map DW that fuses spatial and channel dual selection: […].
  7. The unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection of claim 1, wherein in S40, performing small target detection based on the high-low resolution combined feature and the selection feature comprises:
     S41, acquiring a plurality of feature maps C2, C3, C4 and C5 of different scales from a backbone network;
     S42, upsampling C5 by local window cross-attention upsampling, fusing the result with C4, and performing feature extraction by dynamic combined large kernel selection to obtain C4';
     S43, upsampling C4' by local window cross-attention upsampling, fusing the result with C3, and performing feature extraction by dynamic combined large kernel selection to obtain C3';
     S44, upsampling C3' by local window cross-attention upsampling, fusing the result with C2, and performing feature extraction by dynamic combined large kernel selection to obtain C2';
     S45, upsampling C5 by local window cross-attention upsampling, fusing the result with C4', and performing feature integration by 1×1 convolution to obtain C4'';
     S46, upsampling C4' by local window cross-attention upsampling, fusing the result with C3', and performing feature integration by 1×1 convolution to obtain C3'';
     S47, upsampling C3' by local window cross-attention upsampling, fusing the result with C2', and performing feature integration by 1×1 convolution to obtain C2'';
     S48, sending C4', C3' and C2' to a detection unit to perform the target detection task, the detection unit being provided with a classification branch and a bounding-box regression branch that are independent of each other, wherein the classification branch outputs the class confidence of the target based on the enhanced feature maps C4', C3' and C2', while the bounding-box regression branch predicts the spatial position of the target to obtain the center position of the bounding box and its size parameters;
     S49, mapping the regressed bounding-box parameters back to the original image coordinate system to generate candidate target bounding boxes, and applying a unified scale transformation to the candidate boxes that come from feature maps of different scales; and
     S410, screening the candidate bounding boxes by class confidence, removing redundant overlapping boxes through non-maximum suppression, retaining the best result as the small target detection result, and generating and drawing the target positions and target categories in the image dataset based on the small target detection result.
  8. A computer system comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection as claimed in any one of claims 1 to 7.
  9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the steps of the unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection as claimed in any one of claims 1 to 7.
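
Claim 4 describes scoring query vectors against key vectors, scaling the similarity, and using the resulting weights to aggregate the value matrix inside each local window. The following is a minimal pure-Python sketch of that aggregation for a single query and one window; it is an illustration, not the patented implementation. The function names, the toy vectors, the use of Softmax to normalize the scores, and the scaling factor 1/sqrt(d) are all assumptions, because the corresponding formulas are not reproduced in this record.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def window_attention(query, keys, values):
    """Aggregate `values` for one query vector inside one local window.

    query  : list of d floats
    keys   : n key vectors of d floats each (n = entries in the window)
    values : n value vectors of c floats each
    Returns (aggregated vector of c floats, list of n attention weights).
    """
    d = len(query)
    scale = 1.0 / math.sqrt(d)  # assumed scaling factor (standard choice)
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    channels = len(values[0])
    out = [
        sum(w * v[c] for w, v in zip(weights, values))
        for c in range(channels)
    ]
    return out, weights


# Hypothetical toy window with four entries (for example, a 2 x 2 window).
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
values = [[1.0], [0.0], [1.0], [0.0]]
out, weights = window_attention(query, keys, values)
```

In this toy case the keys that match the query receive the larger weights, so the aggregated value leans toward the values stored at those window positions. A full module would run this for every position in parallel, which is the role the claim assigns to the broadcast dimension.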

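Step S410 of claim 7 screens candidate boxes by class confidence and removes redundant overlapping boxes through non-maximum suppression. A minimal sketch of the common greedy, class-agnostic variant is given here for illustration only. The corner-format boxes (x1, y1, x2, y2), the thresholds 0.25 and 0.5, and the function names are assumptions and are not taken from the patent, whose regression branch outputs a box center and size.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, score_thresh=0.25, iou_thresh=0.5):
    """Return indices of the kept boxes, highest confidence first."""
    # Confidence screening, then sort the survivors by descending score.
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept = []
    for i in order:
        # Keep a box only if it does not overlap a stronger kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_thresh for j in kept):
            kept.append(i)
    return kept


# Hypothetical candidates: two overlapping boxes, one separate, one weak.
boxes = [(10, 10, 20, 20), (11, 11, 21, 21), (50, 50, 60, 60), (0, 0, 5, 5)]
scores = [0.9, 0.8, 0.7, 0.1]
kept = nms(boxes, scores)
```

With these inputs the second box is suppressed by the first (their overlap ratio is 81/119) and the fourth is dropped by the confidence screen. For small targets the overlap threshold matters more than usual, because a shift of a few pixels changes the overlap ratio of a tiny box considerably.
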
Description

Unmanned aerial vehicle small target detection method based on local window cross attention up-sampling and dynamic combined large kernel selection and application thereof

Technical Field

The application relates to the technical field of deep learning, and in particular to an unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection, and to an application thereof.

Background

Detecting small targets (also known as lightweight targets) from the viewpoint of an unmanned aerial vehicle is a technical difficulty that currently needs to be overcome. Conventional object detection uses the feature pyramid network (Feature Pyramid Networks, FPN) method, but its top-down feature fusion logic makes it difficult for higher layers to use the positioning information of lower layers. In a related approach, the path aggregation network (Path Aggregation Network, PANet) adds a bottom-up enhancement path on top of the FPN, forming a "bi-directional" top-down plus bottom-up feature pyramid.

However, in the course of designing and implementing the application, the inventors found that in UAV small target detection scenarios, because the computing resources of the unmanned aerial vehicle are limited, the traditional FPN + PANet detection scheme does not greatly improve detection precision in practice, while the number of parameters and the computational load increase markedly.

In view of the above, the application provides a small target detection method that matches the computing power available on existing unmanned aerial vehicles, aiming to reduce the number of parameters and the amount of computation in the detection process and thereby improve the detection precision for UAV small targets.

Disclosure of Invention

The main purpose of the application is to provide an unmanned aerial vehicle small target detection method based on local window cross-attention upsampling and dynamic combined large kernel selection, which addresses the problem of providing a small target detection method that matches the computing power available on existing unmanned aerial vehicles, so as to reduce the number of parameters and the amount of computation in the detection process.

To achieve this purpose, the method provided by the application is applied to an unmanned aerial vehicle; the unmanned aerial vehicle comprises a deep learning network, a neck network of the deep learning network comprises a local window cross-attention upsampling module and a dynamic combined large kernel selection module, and the method comprises the following steps:

S10, acquiring an image dataset that is captured by the unmanned aerial vehicle and contains a plurality of small targets;

S20, inputting the image dataset into the local window cross-attention upsampling module for upsampling, which comprises: determining high-resolution features and low-resolution features in the image dataset; performing point-wise convolution and linear deformation on the high-resolution features and the low-resolution features in parallel; generating a query matrix, a key matrix and a value matrix by depthwise separable three-dimensional convolution of the resulting preprocessed high-resolution features and preprocessed low-resolution features; and, based on the query matrix, the key matrix and the value matrix, performing self-attention weighted aggregation within a local window of size […] to obtain a high-low resolution combined feature, wherein […] denotes the last dimension into which all pixels within the window are expanded;

S30, inputting the image dataset into the dynamic combined large kernel selection module, which comprises a plurality of depthwise separable convolutions combined in series: extracting the image dataset at multiple scales to obtain multi-scale features; performing attention modeling on the obtained multi-scale features in the channel dimension and the spatial dimension in sequence and multiplying the results element by element to obtain a weight map; and multiplying the weight map element by element with each multi-scale feature to obtain a selection feature; and

S40, performing small target detection based on the high-low resolution combined feature and the selection feature, so as to mark the target position and the target category corresponding to each small targ