CN-121982593-A - Unmanned aerial vehicle road surface type identification method and device based on dual-spectrum cross attention

CN121982593A

Abstract

The invention provides an unmanned aerial vehicle (UAV) road surface type identification method and device based on dual-spectrum cross attention. The method comprises: acquiring a UAV image and preprocessing it to obtain a visible light image and an infrared image; inputting the visible light image and the infrared image into a backbone network and outputting a fused feature map, wherein the backbone network adopts an optimized feature extraction model of the YOLOv target detection framework as its basic unit and comprises a feature extraction module, a cross-attention fusion module and a fast spatial pyramid pooling module; inputting the fused feature map into a neck network and outputting rural highway feature maps at multiple scales; and inputting the multi-scale rural highway feature maps into a detection head and outputting rural highway detection results. By improving the YOLOv model, the method accommodates the distinctive characteristics that UAV imagery presents compared with natural images, owing to its unique acquisition mode, and realizes rural highway pavement type target recognition.

Inventors

  • Cui Yingshou
  • Zhang Xiaozheng
  • Huang Yaohui
  • Shang Chunlu
  • Lu Zilin
  • Ma Jie

Assignees

  • 交通运输部科学研究院 (China Academy of Transportation Sciences)

Dates

Publication Date
2026-05-05
Application Date
2026-04-02

Claims (10)

  1. An unmanned aerial vehicle road surface type identification method based on dual-spectrum cross attention, characterized by comprising the following steps: acquiring an unmanned aerial vehicle image and preprocessing it to obtain a visible light image and an infrared image; inputting the visible light image and the infrared image into a backbone network and outputting a fused feature map, wherein the backbone network adopts an optimized feature extraction model of the YOLOv target detection framework as its basic unit and comprises a feature extraction module, a cross-attention fusion module and a fast spatial pyramid pooling module; inputting the fused feature map into a neck network and outputting rural highway feature maps at multiple scales, wherein the neck network adopts a top-down path and a bottom-up path to transmit semantic information and positioning information respectively, and comprises a small-scale rural highway information enhancement module; and inputting the multi-scale rural highway feature maps into a detection head and outputting a rural highway detection result, wherein the detection head predicts the position and category of each target in the feature maps in a decoupled prediction mode, and generates through a convolution layer a three-dimensional tensor comprising bounding box position information, confidence and category as the rural highway detection result.
  2. The method of claim 1, wherein the feature extraction module comprises two parallel feature extraction networks, and the step of inputting the visible light image and the infrared image into the backbone network and outputting a fused feature map comprises: extracting multi-scale feature maps of the visible light image and the infrared image respectively through the two parallel feature extraction networks; performing information extraction, interaction and fusion on the feature maps of the visible light image and the feature maps of the infrared image through the cross-attention fusion module to generate multi-scale rural highway fused feature maps; and performing information extraction, concatenation and fusion on the fused feature maps through the fast spatial pyramid pooling module to generate a final fused feature map.
  3. The method of claim 2, wherein the feature extraction network comprises a plurality of processing stages following the architectural paradigm of the target detection model, and each time one of these processing stages is passed, the spatial resolution of the feature map is halved and the number of channels is doubled.
  4. The method of claim 2, wherein the step of generating multi-scale rural highway fused feature maps by performing information extraction, interaction and fusion on the feature map of the visible light image and the feature map of the infrared image through the cross-attention fusion module comprises: adaptively integrating the information carried by the feature map of the visible light image and the feature map of the infrared image through a channel-dimension concatenation operation and a convolution module to obtain a preliminary fused feature map; downsampling the feature map of the visible light image, the feature map of the infrared image and the preliminary fused feature map respectively; performing linear projection on the downsampled feature map of the visible light image, feature map of the infrared image and preliminary fused feature map respectively through convolution layers to obtain a visible light query matrix, an infrared key matrix and a fused value matrix; converting the visible light query matrix, the infrared key matrix and the fused value matrix into sequence data, and computing scaled dot-product attention on the sequence data to obtain a visible light attention matrix and an infrared attention matrix; and adjusting the number of channels of the preliminary fused feature map through a convolution layer, and adding the adjusted preliminary fused feature map, the visible light attention matrix and the infrared attention matrix to obtain a multi-scale rural highway fused feature map.
  5. The method of claim 2, wherein the step of performing information extraction, concatenation and fusion on the fused feature map through the fast spatial pyramid pooling module to generate a final fused feature map comprises: extracting multi-scale rural highway information from the fused feature map layer by layer through a plurality of max pooling layers to obtain rural highway feature maps; extracting channel information of the fused feature map through parallel branches formed by a plurality of convolution modules; and concatenating the compressed fused feature map, the rural highway feature maps generated by the plurality of max pooling layers and the channel information of the fused feature map, and performing information fusion using a convolution module to generate the final fused feature map.
  6. The method of claim 1, wherein the step of inputting the fused feature map into a neck network and outputting multi-scale rural highway feature maps comprises: inputting the fused feature maps of a plurality of scales output by the backbone network into the small-scale rural highway information enhancement module, and processing them respectively by a first downsampling, a second downsampling and a convolution module, wherein the first downsampling and the second downsampling apply different feature scaling factors; integrating the processed multi-scale fused feature maps through a channel-dimension concatenation operation and a convolution module to obtain a simple fused feature map; fusing the detail information and semantic information of the simple fused feature map to obtain a query matrix, a key matrix, a value matrix and an attention matrix; upsampling the attention matrix, extracting information of a target fused feature map among the multi-scale fused feature maps through a convolution module, and combining the upsampled attention matrix with the information of the target fused feature map by element-wise addition to obtain multi-scale information; and performing linear projection and residual connection on the multi-scale information through a convolution layer to obtain a preliminary rural highway feature map, and determining the final rural highway feature maps based on the fused feature map output by the cross-attention fusion module and the preliminary rural highway feature map.
  7. The method of claim 1, wherein the step of inputting the multi-scale rural highway feature maps into a detection head and outputting rural highway detection results comprises: obtaining three-dimensional tensors at a plurality of scales through a convolution module and a convolution layer; and determining the rural highway detection result from the multi-scale three-dimensional tensors through non-maximum suppression post-processing.
  8. The method according to any one of claims 1 to 7, wherein after the step of inputting the multi-scale rural highway feature maps into a detection head and outputting a rural highway detection result, the method further comprises: identifying the pavement type of the rural highway based on the rural highway detection result.
  9. The method of claim 8, wherein the step of identifying the pavement type of the rural highway based on the rural highway detection result comprises: determining a rural highway feature image and position information based on the rural highway detection result; and identifying the pavement type of the rural highway based on the rural highway feature image, the position information and the spectral information of the rural highway.
  10. An unmanned aerial vehicle road surface type identification device based on dual-spectrum cross attention, characterized in that the device comprises: an unmanned aerial vehicle image preprocessing module for acquiring an unmanned aerial vehicle image and preprocessing it to obtain a visible light image and an infrared image; a backbone network processing module for inputting the visible light image and the infrared image into a backbone network and outputting a fused feature map, wherein the backbone network adopts an optimized feature extraction model of the YOLOv target detection framework as its basic unit and comprises a feature extraction module, a cross-attention fusion module and a fast spatial pyramid pooling module; a neck network processing module for inputting the fused feature map into a neck network and outputting rural highway feature maps at multiple scales, wherein the neck network adopts a top-down path and a bottom-up path to transmit semantic information and positioning information respectively, and comprises a small-scale rural highway information enhancement module; and a detection head processing module for inputting the multi-scale rural highway feature maps into a detection head and outputting rural highway detection results, wherein the detection head predicts the positions and categories of targets in the feature maps in a decoupled prediction mode, and generates through a convolution layer three-dimensional tensors comprising bounding box position information, confidence and category as the rural highway detection results.
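The cross-attention fusion described in claims 1 and 4 can be sketched in PyTorch as follows. This is a minimal illustrative implementation, not the patented architecture: the class name `CrossAttentionFusion`, the channel counts, the average-pooling downsampler and the nearest-neighbor upsampling are all assumptions, and only the visible-light attention branch is shown (the claims compute a symmetric infrared branch as well).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Illustrative sketch of claim 4: concatenate visible-light and infrared
    feature maps into a preliminary fusion, project the downsampled maps to a
    visible query, an infrared key and a fused value, compute scaled
    dot-product attention, and add the result back onto the fusion map.
    One attention direction only; layer sizes are assumptions."""

    def __init__(self, channels: int, down: int = 2):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, 1)  # preliminary fusion
        self.q_vis = nn.Conv2d(channels, channels, 1)     # visible-light query
        self.k_ir = nn.Conv2d(channels, channels, 1)      # infrared key
        self.v_fused = nn.Conv2d(channels, channels, 1)   # fused value
        self.down = down

    def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        b, c, h, w = vis.shape
        fused = self.fuse(torch.cat([vis, ir], dim=1))    # channel-dim concat + conv
        # Downsample before attention to keep the token sequence short.
        vis_s = F.avg_pool2d(vis, self.down)
        ir_s = F.avg_pool2d(ir, self.down)
        fused_s = F.avg_pool2d(fused, self.down)
        # Linear projections via 1x1 conv, then flatten H*W into a sequence.
        q = self.q_vis(vis_s).flatten(2).transpose(1, 2)      # (B, N, C)
        k = self.k_ir(ir_s).flatten(2).transpose(1, 2)        # (B, N, C)
        v = self.v_fused(fused_s).flatten(2).transpose(1, 2)  # (B, N, C)
        attn = F.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)  # scaled dot-product
        out = (attn @ v).transpose(1, 2).reshape(b, c, h // self.down, w // self.down)
        out = F.interpolate(out, size=(h, w), mode="nearest")  # back to input scale
        return fused + out  # residual add onto the preliminary fusion
```

The residual add at the end mirrors the claim's final step of summing the channel-adjusted preliminary fusion with the attention outputs.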
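Claim 7 determines the final detections from the multi-scale head tensors through non-maximum suppression post-processing. A minimal sketch of greedy NMS on score-ranked boxes, where the `(x1, y1, x2, y2)` box layout and the 0.5 IoU threshold are illustrative assumptions rather than values from the patent:

```python
import numpy as np

def iou(box: np.ndarray, boxes: np.ndarray) -> np.ndarray:
    """IoU between one box and an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5) -> list:
    """Greedy NMS: repeatedly keep the highest-scoring box and drop the
    remaining boxes that overlap it above the IoU threshold."""
    order = np.argsort(scores)[::-1]  # indices sorted by descending confidence
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        if rest.size == 0:
            break
        order = rest[iou(boxes[best], boxes[rest]) < iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [50, 50, 60, 60]], dtype=float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # the second box overlaps the first and is suppressed
```

In the claimed pipeline this filtering would run after the head's three-dimensional tensors (bounding box, confidence, category) are decoded at each scale.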

Description

Unmanned aerial vehicle road surface type identification method and device based on dual-spectrum cross attention

Technical Field

The invention relates to the technical field of neural networks, in particular to an unmanned aerial vehicle road surface type identification method and device based on dual-spectrum cross attention.

Background

Under high-definition remote sensing imagery, a fully convolutional network (FCN, Fully Convolutional Network) is used to identify and extract targets on the remote sensing image, and the rural highway pavement type in the current image is then judged by image classification. Although CNN (convolutional neural network) and FCN architectures have certain advantages in the field of road network extraction, they require large-scale training samples. Limited by factors such as manpower and funding, the sample size is often insufficient for training, so the full range of sample characteristics cannot be represented, leading to network overfitting.

Disclosure of Invention

In view of the above, the invention aims to provide an unmanned aerial vehicle road surface type identification method and device based on dual-spectrum cross attention, which identify UAV images by improving the YOLOv model, accommodate the distinctive characteristics that UAV imagery presents compared with natural images owing to its unique acquisition mode, and realize rural highway pavement type target recognition.
An embodiment of the invention provides an unmanned aerial vehicle road surface type identification method based on dual-spectrum cross attention, comprising: acquiring an unmanned aerial vehicle image and preprocessing it to obtain a visible light image and an infrared image; inputting the visible light image and the infrared image into a backbone network and outputting a fused feature map, wherein the backbone network adopts an optimized feature extraction model of the YOLOv target detection framework as its basic unit and comprises a feature extraction module, a cross-attention fusion module and a fast spatial pyramid pooling module; inputting the fused feature map into a neck network and outputting rural highway feature maps at multiple scales, wherein the neck network adopts a top-down path and a bottom-up path to transmit semantic information and positioning information respectively, and comprises a small-scale rural highway information enhancement module; and inputting the multi-scale rural highway feature maps into a detection head and outputting a rural highway detection result, wherein the detection head predicts the positions and categories of targets in the feature maps in a decoupled prediction mode and generates, through a convolution layer, three-dimensional tensors comprising bounding box position information, confidence and category as the rural highway detection result.
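The "fast spatial pyramid pooling" step above resembles the SPPF block popularized by the YOLO family; a minimal sketch under that assumption. The chained max-pools, the hidden channel width and the kernel size are illustrative choices, and the parallel convolution branch mentioned in the claims is omitted.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """YOLO-style fast spatial pyramid pooling sketch: a 1x1 compression
    conv, three chained max-pools that enlarge the receptive field while
    preserving spatial size, channel concatenation, and a fusing 1x1 conv."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.compress = nn.Conv2d(c_in, c_hidden, 1)
        # stride 1 with symmetric padding keeps H and W unchanged
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)
        self.fuse = nn.Conv2d(4 * c_hidden, c_out, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.compress(x)
        p1 = self.pool(x)   # each chained pool widens multi-scale context
        p2 = self.pool(p1)
        p3 = self.pool(p2)
        return self.fuse(torch.cat([x, p1, p2, p3], dim=1))
```

Chaining a single small pool three times is equivalent in receptive field to three parallel pools of growing kernel size, which is why this variant is called "fast".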
In an alternative embodiment of the application, the feature extraction module comprises two parallel feature extraction networks, and the step of inputting the visible light image and the infrared image into the backbone network and outputting a fused feature map comprises: extracting multi-scale feature maps of the visible light image and the infrared image respectively through the two parallel feature extraction networks; performing information extraction, interaction and fusion on the feature maps of the visible light image and the feature maps of the infrared image through the cross-attention fusion module to generate multi-scale rural highway fused feature maps; and performing information extraction, concatenation and fusion on the fused feature maps through the fast spatial pyramid pooling module to generate a final fused feature map. In an alternative embodiment of the present application, the feature extraction network includes a plurality of processing stages following the architectural paradigm of the target detection model; each time a processing stage is passed, the spatial resolution of the feature map is halved and the number of channels is doubled. In an alternative embodiment of the application, the step of performing information extraction, interaction and fusion on the feature map of the visible light image and the feature map of the infrared image through the cross-attention fusion module to generate a multi-scale rural highway fused feature map comprises: adaptively integrating the information carried by the feature map of the visible light image and the feature map of the infrared image through a channel-dimension concatenation operation and a convolution module to obtain a preliminary fused feature map; downsampling the feature map of the visible light image, the feature map of the infrared image and the preliminary fused feature map respectively; respectively perform