
CN-122023781-A - Unmanned aerial vehicle image reference detection method and system based on mixed-granularity experts

CN122023781A

Abstract

The invention provides an unmanned aerial vehicle image reference detection method based on mixed-granularity experts, belonging to the field of unmanned aerial vehicle image reference detection. The method comprises: splicing a visual sequence and a language sequence, and inputting the resulting cross-modal sequence into multi-layer stacked coding blocks, which output a cross-modal scale-comprehensive memory after multi-layer iteration; inputting the cross-modal scale-comprehensive memory and a target query into a multi-layer stacked scale-comprehensive decoder, which outputs a coarse query representation; inputting the coarse query representation and the cross-modal scale-comprehensive memory into a multi-layer stacked scale-sensitive decoder; and, after multi-layer iteration, feeding the output fine-granularity query representation into a regression prediction head to obtain the reference detection result. A corresponding detection system is further provided. The invention addresses the problem of low detection precision of existing methods in aerial photography scenes.

Inventors

  • HU GUYUE
  • SONG HAO
  • TONG YUXING
  • ZHAO JIAQI
  • ZHENG AIHUA
  • LI CHENGLONG
  • TANG JIN

Assignees

  • Anhui University (安徽大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-09

Claims (10)

  1. An unmanned aerial vehicle image reference detection method based on mixed-granularity experts, characterized by comprising the following steps: splicing a visual sequence of the aerial image and a language sequence of the text description to obtain a cross-modal sequence; inputting the cross-modal sequence into multi-layer stacked coding blocks, wherein each layer of coding block comprises an MHSA (multi-head self-attention) layer, a MoG (mixed-granularity) layer and an FFN (feed-forward network) layer which are sequentially connected; the cross-modal sequence enters the MHSA layer, the output enhanced features enter a plurality of parallel attention branches with different expansion (dilation) coefficients, the outputs of the attention branches enter the FFN layer after weighted fusion, and the cross-modal scale-comprehensive memory is output after multi-layer iteration; inputting the cross-modal scale-comprehensive memory and a target query into a multi-layer stacked scale-comprehensive decoder, wherein each layer of the scale-comprehensive decoder comprises an MHSA layer, an MHCA (multi-head cross-attention) layer, a MoG layer and an FFN layer which are sequentially connected; the target query enters the MHSA layer, the output updated query sequence and the cross-modal scale-comprehensive memory are spliced and then enter the MHCA layer, the output coarse-granularity alignment result enters the MoG layer, the output multi-granularity fused query features enter the FFN layer, and the coarse query representation is output after multi-layer iteration; and inputting the coarse query representation and the cross-modal scale-comprehensive memory into a multi-layer stacked scale-sensitive decoder, wherein each layer of the scale-sensitive decoder comprises an MHSA layer, an MHCA layer and an FFN layer which are sequentially connected; the coarse query representation enters the MHSA layer, the output coarse query sequence and the cross-modal scale-comprehensive memory are spliced and then enter the MHCA layer, the output fine-granularity alignment result enters the FFN layer, and after multi-layer iteration the output fine-granularity query representation enters a regression prediction head to obtain the reference detection result.
  2. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 1, wherein the cross-modal sequence enters the MHSA layer and the enhanced features are output as follows: the cross-modal sequence X ∈ R^(B×L×D) is linearly projected into a query vector Q = X·W_Q, a key vector K = X·W_K and a value vector V = X·W_V, with the corresponding attention score S = Q·K^T / √d_k, wherein B, L and D respectively denote the batch size, the sequence length and the embedding dimension, W_Q, W_K and W_V denote the linear projections for the query, key and value vectors, and d_k is the dimension of each key vector; a Softmax function is applied to the attention score S to obtain the attention weights, the value vector V is weighted and summed with the attention weights to obtain the output of each attention head, the outputs of all attention heads are spliced in the last dimension, and the spliced result is projected through a learnable linear transformation layer to obtain the enhanced features.
  3. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 1, wherein the output of each attention branch is calculated as: O_r = Ã_r·V, with Ã_r = A ⊙ M_r, wherein O_r is the output of the attention branch with granularity (expansion coefficient) r, Ã_r is the new attention weight, A is the attention weight, M_r is a binary mask, ⊙ denotes element-wise multiplication, and V is the value vector.
  4. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 3, wherein the binary mask M_r is: M_r(i, j) = 1[(i − j) mod r = 0], wherein i and j are token position indices, mod represents the modulo operation, 1[·] represents the indicator function, and r is the expansion coefficient.
  5. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 1, wherein the MoG layer further comprises a granularity fusion network, and the enhanced features pass through the granularity fusion network to obtain the weights g: g = Softmax(W_2·W_1·LN(AvgPool(F))), wherein W_1 and W_2 are learnable parameters, F is the enhanced feature, AvgPool(·) represents average pooling, and LN(·) represents layer normalization.
  6. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 5, wherein the outputs of the attention branches are weighted and fused as: Z = Σ_{r=1}^{N} g_r·O_r, wherein Z is the scale-synthesis sequence representation obtained by weighted fusion of the outputs of the attention branches, O_r is the output of each attention branch, g_r is the corresponding fusion weight, and N is the total number of parallel attention branches in the MoG layer.
  7. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts of claim 1, wherein the process of outputting the cross-modal scale-comprehensive memory after multi-layer iteration comprises the steps of: weighting and fusing the outputs of the attention branches to obtain a scale-synthesis sequence representation; the scale-synthesis sequence representation enters the FFN layer and is projected through a first fully connected linear layer, which expands its feature dimension to obtain expanded features; transforming the expanded features through a nonlinear activation function to obtain activated features; the activated features enter a second fully connected linear layer, which recompresses their feature dimension back to the original embedding dimension; and carrying out residual connection and layer normalization on the output of the second fully connected linear layer and the input of the FFN layer, outputting the feature representation of the layer, transmitting it as input to the coding block of the next layer, and iterating layer by layer until the last coding block outputs the cross-modal scale-comprehensive memory.
  8. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 1, wherein the process of outputting the coarse-granularity alignment result comprises the following steps: the updated query sequence enters the MHCA layer, where a linear projection layer maps the updated query sequence into a query vector Q and maps the cross-modal scale-comprehensive memory into a key vector K and a value vector V; calculating the dot product of the query vector Q and the transpose of the key vector K and dividing it by a scaling factor to obtain a correlation score matrix; applying a Softmax function to normalize the correlation score matrix to obtain cross-attention weights, and carrying out weighted summation of the value vector V with the cross-attention weights; and splicing the outputs of all attention heads and outputting the coarse-granularity alignment result through a linear transformation layer.
  9. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts as claimed in claim 1, wherein the process of outputting the fine-granularity alignment result comprises the following steps: the coarse query sequence enters the MHCA layer, where a linear projection layer maps the coarse query sequence into a query vector Q and maps the cross-modal scale-comprehensive memory into a key vector K and a value vector V; calculating the dot product of the query vector Q and the transpose of the key vector K and dividing it by a scaling factor to obtain a correlation score matrix; applying a Softmax function to normalize the correlation score matrix to obtain cross-attention weights, and carrying out weighted summation of the value vector V with the cross-attention weights; and splicing the outputs of all attention heads and outputting the fine-granularity alignment result through a linear transformation layer.
  10. An unmanned aerial vehicle image reference detection system based on mixed-granularity experts, characterized by comprising: a sequence splicing module for splicing the visual sequence of the aerial image and the language sequence of the text description to obtain a cross-modal sequence; a scale-comprehensive coding module for inputting the cross-modal sequence into multi-layer stacked coding blocks, wherein each layer of coding block comprises an MHSA layer, a MoG layer and an FFN layer which are sequentially connected; the cross-modal sequence enters the MHSA layer, the output enhanced features enter a plurality of parallel attention branches with different expansion coefficients, the outputs of the attention branches enter the FFN layer after weighted fusion, and the cross-modal scale-comprehensive memory is output after multi-layer iteration; a scale-comprehensive decoding module for inputting the cross-modal scale-comprehensive memory and a target query into a multi-layer stacked scale-comprehensive decoder, wherein each layer of the scale-comprehensive decoder comprises an MHSA layer, an MHCA layer, a MoG layer and an FFN layer which are sequentially connected; the target query enters the MHSA layer, the output updated query sequence and the cross-modal scale-comprehensive memory are spliced and then enter the MHCA layer, the output coarse-granularity alignment result enters the MoG layer, the output multi-granularity fused query features enter the FFN layer, and the coarse query representation is output after multi-layer iteration; and a scale-sensitive decoding module for inputting the coarse query representation and the cross-modal scale-comprehensive memory into a multi-layer stacked scale-sensitive decoder, wherein each layer of the scale-sensitive decoder comprises an MHSA layer, an MHCA layer and an FFN layer which are sequentially connected; the coarse query representation enters the MHSA layer, the output coarse query sequence and the cross-modal scale-comprehensive memory are spliced and then enter the MHCA layer, the output fine-granularity alignment result enters the FFN layer, and after multi-layer iteration the output fine-granularity query representation enters a regression prediction head to obtain the reference detection result.
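Claims 2 through 6 together describe a self-attention step whose weights are filtered by dilated binary masks (one per expansion coefficient) and whose branch outputs are fused by a learned gate. The following single-head numpy sketch illustrates that mixed-granularity attention idea only; the function and variable names (mog_attention, dilated_mask, w_gate) are illustrative assumptions rather than identifiers from the patent, the identity Q/K/V projections and the renormalization of the masked weights are simplifications, and the gate is a single-linear-layer stand-in for the granularity fusion network of claim 5.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dilated_mask(n, r):
    # binary mask in the spirit of claim 4: M[i, j] = 1 iff (i - j) mod r == 0
    idx = np.arange(n)
    return ((idx[:, None] - idx[None, :]) % r == 0).astype(float)

def mog_attention(x, dilation_rates, w_gate):
    # x: (n, d) cross-modal sequence; identity Q/K/V projections for brevity
    n, d = x.shape
    q, k, v = x, x, x
    a = softmax(q @ k.T / np.sqrt(d))          # shared attention weights A
    branches = []
    for r in dilation_rates:
        a_r = a * dilated_mask(n, r)           # A ⊙ M_r (claim 3)
        a_r = a_r / a_r.sum(-1, keepdims=True) # renormalize (an assumption)
        branches.append(a_r @ v)               # branch output O_r
    # simplified gate over pooled features (stand-in for claim 5's fusion net)
    g = softmax(w_gate @ x.mean(axis=0))       # one fusion weight per branch
    # weighted fusion of branch outputs (claim 6)
    return sum(g_r * o_r for g_r, o_r in zip(g, branches))
```

A branch with dilation 1 sees the full attention matrix (the mask is all ones), while larger dilations thin the attended positions out, so each branch captures a different granularity before the gate recombines them.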

Description

Unmanned aerial vehicle image reference detection method and system based on mixed-granularity experts

Technical Field

The invention relates to the technical field of unmanned aerial vehicle image reference detection, in particular to an unmanned aerial vehicle image reference detection method and system based on mixed-granularity experts.

Background

Unmanned aerial vehicle image reference detection is a multi-modal understanding task combining computer vision and natural language processing, which aims to accurately localize the corresponding target in an image captured by an unmanned aerial vehicle according to a language description. In short, the model must "understand" a text description, such as "a man wearing a blue coat riding a yellow bicycle beside a bus", and correctly find the target in a complex UAV-captured scene. The general pipeline of reference detection comprises data acquisition and labeling, language description generation, visual and language feature extraction, cross-modal semantic alignment modeling, and target localization prediction. However, most existing research focuses on image data from ground-level viewpoints and adapts poorly to the characteristics of the UAV viewpoint, such as small target scales, high inter-target similarity and complex scenes, making accurate grounding difficult in high-density target environments. In practical aerial intelligent perception and visual localization scenarios, owing to the unique attributes of aerial imaging, such as altitude changes, pitched viewing angles, large fields of view, extremely small target sizes, dense targets and uneven distribution, the same aerial image may simultaneously contain a large number of candidate targets with huge scale differences, while the text expressions generally carry complex semantics that are fine-grained, multi-attribute and multi-relational.
This poses a serious challenge for vision-language alignment and target localization, so the performance of existing reference detection models trained on ground-level close-range images degrades rapidly in aerial scenes. Therefore, there is a need to develop new visual-language grounding methods suited to aerial viewpoints, with both cross-scale perception and fine-grained semantic parsing capabilities. With the development of multi-modal learning and vision-language models, research on reference detection in unmanned aerial vehicle images has become particularly important. In the prior art, the Chinese patent application CN120013992A (natural language description based unmanned aerial vehicle multi-modal feature fusion target tracking method and system) proposes a scene-context feature pyramid network to detect objects. The network is a multi-scale feature extraction network based on the BiFPN framework: the bidirectional feature fusion path of the BiFPN structure is used to obtain an enhanced image; a Swin Transformer is used as the visual encoder to encode the enhanced image into visual features; the language model BERT is used as the language encoder to encode the natural language description into language feature vectors; the visual features are processed into visual feature vectors and locally aligned with the corresponding language feature vectors in a vision-language dual-modal manner; the aligned new language features are fully fused with the visual features to obtain multi-modal features; the tracking result obtained in the previous frame is used as a history feature and decoded together with the multi-modal features of the previous frame; the decoded result passes through a localization head to obtain the final tracking result; and the decoder is an improved Transformer decoder.
Disclosure of Invention

The invention aims to solve the technical problems of extremely small target sizes, severe scale changes, dense targets and abundant interference in aerial photography scenes, and the resulting low detection precision of existing image reference detection methods in such scenes. The invention solves these technical problems by adopting the following technical scheme. The unmanned aerial vehicle image reference detection method based on mixed-granularity experts comprises the following steps: splicing a visual sequence of the aerial image and a language sequence of the text description to obtain a cross-modal sequence; inputting the cross-modal sequence into multi-layer stacked coding blocks, wherein each layer of coding block comprises an MHSA layer, a MoG layer and an FFN layer which are sequentially connected; the cross-modal sequence enters the MHSA layer, the output enhanced features enter a plurality of parallel attention branches with different expansion coefficients, the outputs of the attention branches enter the FFN layer after weighted fusion, and after multi-layer iteration the coding blocks output the cross-modal scale comprehensive