
CN-121998832-A - Unmanned aerial vehicle infrared image super-resolution method based on hierarchical cross-modal Mamba

CN121998832A

Abstract

The invention provides an unmanned aerial vehicle (UAV) infrared image super-resolution method based on a hierarchical cross-modal Mamba, comprising a spatial-focusing residual state space model with spatial attention, a hierarchical cross-modal bridging encoder, a multi-head collaborative reconstruction mechanism, and multiple loss-function constraint modules. The method takes a paired low-resolution infrared image and high-resolution visible-light image as input, captures bimodal features with the spatial-focusing residual state space model, resolves modal differences through three-stage interaction in the hierarchical cross-modal bridging encoder to achieve feature alignment, restores the global structural outline with a structure branch while an appearance branch synthesizes realistic high-frequency detail textures, and finally generates a structurally complete, detail-rich high-resolution UAV infrared image through training under multiple loss-function constraints.

Inventors

  • WANG HUAN
  • TAO FAZHAN
  • ZHU CHENQI
  • SUN LIFAN
  • LIU JIANGHUI
  • SI PENGJU
  • JIA MIAO
  • JI BAOFENG
  • LI YIWEI
  • ZHANG DONGKAI
  • LIU LEIPO
  • GAO SONG
  • WANG JUN

Assignees

  • Henan University of Science and Technology (河南科技大学)

Dates

Publication Date
2026-05-08
Application Date
2026-01-30

Claims (7)

  1. An unmanned aerial vehicle infrared image super-resolution method based on hierarchical cross-modal Mamba, characterized in that a spatial-focusing residual state space model with spatial attention is adopted to capture bimodal features; a hierarchical cross-modal bridging encoder is adopted to resolve modal differences through three-stage interaction and achieve feature alignment; and a multi-head collaborative reconstruction mechanism is adopted, in which a structure branch focuses on restoring the global structural outline of the input image and an appearance branch synthesizes realistic high-frequency detail textures.
  2. The degradation-aware cross-modal Mamba method for blind super-resolution of unmanned aerial vehicle infrared images according to claim 1, comprising the steps of: S1, inputting paired low-resolution infrared images and high-resolution visible-light images; S2, capturing bimodal features with the spatial-focusing residual state space model with spatial attention; S3, adopting the hierarchical cross-modal bridging encoder and resolving modal differences through three-stage interaction to achieve feature alignment; S4, adopting the structure branch to focus on restoring the global structural outline of the input image; S5, adopting the appearance branch to synthesize realistic high-frequency detail textures.
  3. The input of paired low-resolution infrared images and high-resolution visible-light images according to claim 2, wherein the target VGTSR dataset consists of manually aligned visible and infrared image pairs at the same 640 × 512 resolution, and also provides low-resolution infrared images at different scaling factors obtained through a degradation model.
  4. The spatial-focusing residual state space model with spatial attention according to claim 2, wherein the residual state space model in S2 is designed as F_{l+1} = F_l + α · CA(SFSSM(LN(F_l))), where F_l denotes the l-th layer features of the input visible or thermal infrared image, α is a learnable scaling factor that adjusts the information flow in the skip connection, SFSSM(·) denotes the spatial-focusing state space model, CA(·) denotes channel attention, and LN(·) denotes layer normalization; the spatial-focusing state space model in S2 is designed as SFSSM(F) = Conv(LN(SS2D(F))) ⊗ SA(F), where F denotes the input visible-light features, SS2D(·) denotes the two-dimensional state space model, LN(·) denotes layer normalization, Conv(·) denotes a convolution layer, and SA(·) denotes spatial attention. This dual-branch design strategy retains part of the spatial-dimension information of the original features while achieving effective fusion of multi-modal features, and effectively alleviates the feature redundancy and dimension mismatch problems common in Transformer-based methods.
  5. The hierarchical cross-modal bridging encoder according to claim 3, performing feature interaction between different modalities, wherein the feature interaction module in S3 is designed as F_0 = F_vis + F_ir, w_c = σ(FC(GAP(F_0)) + FC(GMP(F_0))), w_s = σ(Conv([GMP(F_0); GAP(F_0)])), F_fuse = F_0 ⊗ w_c ⊗ w_s, where F_vis and F_ir denote the visible-light and infrared features, F_0 denotes the initial fused features after summation, GAP(·) denotes the global average pooling operation, GMP(·) denotes the global max pooling operation, FC(·) denotes a fully connected layer, Conv(·) denotes a convolution layer, [·;·] denotes the feature stitching operation, σ(·) denotes an activation function, w_c denotes the channel weight vector, w_s denotes the spatial weight map, ⊗ denotes element-level multiplication, and F_fuse denotes the final fusion feature; the feature refinement module in S3 operates on F_l, the l-th layer features of the input visible or thermal infrared image; the feature enhancement module in S3 is designed as FE(F) = F ⊗ σ(Conv([MaxPool(F); AvgPool(F)])), where Conv(·) denotes a convolution layer, MaxPool(·) denotes maximum pooling, AvgPool(·) denotes average pooling, σ(·) denotes an activation function, and ⊗ denotes the matrix dot product.
  6. The adoption of the structure branch according to claim 4, focused on restoring the global structural outline of the input image, wherein the S4 structure reconstruction process is formally defined as Ŝ = G_s(F_enc), where S denotes the actual structural component, i.e. the structural component of the input image after smoothing, F_enc denotes the features output by the encoder, and G_s denotes the generator of the structure branch; to ensure that the reconstructed structure is consistent with the target at the pixel level, a structure loss L_s is defined as the ℓ1 distance between the predicted structure Ŝ and the true structure S, expressed as L_s = ‖Ŝ − S‖_1; to further encourage the generated structural distribution to approximate the real target distribution, a generative adversarial framework is introduced into the structure branch, whose adversarial loss can be expressed as L_adv^s = E[log D_s(S)] + E[log(1 − D_s(G_s(F_enc)))], where D_s denotes the discriminator of the structure branch; the generator and discriminator of the structure branch are trained jointly through the optimization min_{G_s} max_{D_s} λ_1 L_s + λ_2 L_adv^s, where λ_1 and λ_2 are the respective regularization coefficients.
  7. The synthesis of realistic high-frequency detail textures using the appearance branch according to claim 5, wherein in the training phase Gaussian sampling is used instead of bilinear sampling to expand the receptive field, the Gaussian sampling operation being defined as F_s = Σ_i w_i F_i, where F_i denotes a neighborhood feature point selected around the sampling center point and w_i denotes the Gaussian weight of the corresponding feature point; the Gaussian weight is computed from the spatial distance between the feature point and the sampling center point, expressed as w_i = exp(−(Δv_i² + Δh_i²)/(2σ²)), where Δv_i and Δh_i denote the distances of the sampling point from the center point in the vertical and horizontal directions respectively, and σ² is the variance of the Gaussian distribution, i.e. the decay rate of the weights; to further enforce modal semantic alignment, a sampling correction loss L_sc is proposed, computed from the cosine similarity between features extracted by a pretrained VGG network and intended to uniformly evaluate the consistency of all sampled-region features with the real target features in semantic space, defined as L_sc = s · (1/N) Σ_i (1 − cos(Φ(R(Δp_i)), Φ(R_gt))), with cos(a, b) = (a · b)/(‖a‖‖b‖ + ε), where cos(·,·) denotes the cosine similarity function, Φ(·) denotes the pretrained VGG feature extractor, Δp_i denotes a sampling offset predicted by the appearance stream, s is a fixed scaling factor, and ε is a constant ensuring stability of the numerical calculation; the appearance branch leverages the prior knowledge Ŝ provided by the structure branch, and the texture output finally generated by the appearance branch is T̂ = G_a(F_enc, Ŝ); to ensure sub-pixel accuracy of the super-resolution, an appearance reconstruction loss L_a is defined as the distance between the predicted texture T̂ and the real texture T, expressed as L_a = ‖T̂ − T‖_1; to enhance the visual realism of the generated texture, the appearance branch also employs an adversarial training strategy, whose adversarial loss is expressed as L_adv^a = E[log D_a(T)] + E[log(1 − D_a(G_a(F_enc, Ŝ)))], where G_a denotes the texture generator of the appearance branch and D_a denotes the corresponding discriminator.
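The residual state space update described in claim 4 can be sketched numerically. The following is a minimal NumPy sketch of the reconstructed formulation F_{l+1} = F_l + α · CA(SFSSM(LN(F_l))); the state-space scan and the attention gate are stood in for by simple placeholder operations (tanh and a sigmoid gate are illustrative stand-ins, not the patent's actual operators), and the function names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN(.): normalize each channel map over its spatial dimensions.
    mu = x.mean(axis=(-2, -1), keepdims=True)
    sd = x.std(axis=(-2, -1), keepdims=True)
    return (x - mu) / (sd + eps)

def channel_attention(x):
    # CA(.): squeeze-and-excitation style gate — global average pool
    # per channel, squashed through a sigmoid, scales the channel.
    gap = x.mean(axis=(-2, -1), keepdims=True)
    return x * (1.0 / (1.0 + np.exp(-gap)))

def sf_ssm(x):
    # SFSSM(.): placeholder for the spatial-focusing state space model;
    # a real implementation would run a 2-D selective scan here.
    return np.tanh(x)

def residual_state_space_block(f, alpha=0.5):
    # F_{l+1} = F_l + alpha * CA(SFSSM(LN(F_l))); alpha is the learnable
    # scaling factor on the skip connection (fixed here for illustration).
    return f + alpha * channel_attention(sf_ssm(layer_norm(f)))

feat = np.random.default_rng(0).standard_normal((8, 16, 16))  # C x H x W
out = residual_state_space_block(feat)
assert out.shape == feat.shape
```

The skip connection keeps the block shape-preserving, so it can be stacked layer by layer as the claim's per-layer indexing F_l suggests.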
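The Gaussian sampling weights of claim 7 follow directly from the stated formula w_i = exp(−(Δv² + Δh²)/(2σ²)). A small sketch, assuming the weights are normalized to sum to one before the weighted aggregation F_s = Σ_i w_i F_i (the claim does not state normalization explicitly):

```python
import numpy as np

def gaussian_sampling_weights(offsets, sigma=1.0):
    # offsets: (N, 2) array of (dv, dh) distances of each neighborhood
    # point from the sampling center; w_i = exp(-(dv^2+dh^2)/(2*sigma^2)).
    d2 = (offsets ** 2).sum(axis=1)
    w = np.exp(-d2 / (2.0 * sigma ** 2))
    return w / w.sum()  # assumed normalization so the weights sum to 1

def gaussian_sample(features, offsets, sigma=1.0):
    # features: (N, C) neighborhood feature points; F_s = sum_i w_i * F_i.
    w = gaussian_sampling_weights(offsets, sigma)
    return w @ features

# 3x3 neighborhood around the sampling center point
offs = np.array([(dv, dh) for dv in (-1, 0, 1) for dh in (-1, 0, 1)], float)
w = gaussian_sampling_weights(offs, sigma=1.0)
assert abs(w.sum() - 1.0) < 1e-9
assert w[4] == w.max()  # the center point carries the largest weight
```

Smaller σ concentrates the weight on the center point (approaching nearest-neighbor sampling), while larger σ spreads it across the neighborhood, which is how the claim's "decay rate of the weights" controls the effective receptive field.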

Description

Unmanned aerial vehicle infrared image super-resolution method based on hierarchical cross-modal Mamba

Technical Field

The invention relates to unmanned aerial vehicle (UAV) infrared image restoration and is suitable for reconstructing and restoring low-resolution UAV infrared images. In particular, it concerns a UAV infrared image super-resolution method based on a hierarchical cross-modal Mamba, which can effectively improve the resolution and detail quality of UAV infrared images and support downstream visual tasks such as target detection and trajectory tracking.

Background

UAV infrared image acquisition is easily disturbed by multiple interwoven factors: the inherent limitations of sensor hardware, atmospheric disturbance, and platform motion. As a result, the images commonly suffer from low spatial resolution, high noise intensity, and contrast attenuation, which severely restrict the performance of downstream visual tasks such as target detection and trajectory tracking; an efficient image enhancement technique is therefore needed to improve this situation. Single-image super-resolution is a key means of reconstructing low-resolution images, but it faces special challenges when applied to UAV infrared images. Unlike natural images with rich textures and clear edges, infrared images exhibit high background uniformity, weak edge contrast, and sparse high-frequency details, which makes traditional models difficult to adapt.
CNNs are limited by local receptive fields and cannot effectively model the long-range dependencies required by a globally consistent thermal-imaging scene; Transformers can capture global context, but the quadratic computational complexity of the self-attention mechanism makes them inefficient on the high-resolution images typical of UAV applications. Early single-image infrared super-resolution methods lack global context awareness and cannot accurately reconstruct textures or local structures in UAV scenes; instead, they may generate artificial high-frequency signals to fill in missing details, producing unnatural textures or repetitive artifacts that severely degrade image quality. In recent years, guided infrared super-resolution methods have introduced the abundant detail and texture information of visible-light images to compensate for the information missing from infrared images and to improve their clarity and detail characterization. However, because of significant modal differences, texture information transferred from the visible image to the infrared image may not be accurate enough; moreover, owing to the complex imaging conditions on a UAV platform, scene alignment between the infrared and visible images is not always accurate, leading to unnatural or unrealistic details in the recovered infrared image.

Disclosure of Invention

The invention provides an unmanned aerial vehicle infrared image super-resolution method based on a hierarchical cross-modal Mamba, comprising a spatial-focusing residual state space model with spatial attention, a hierarchical cross-modal bridging encoder, a multi-head collaborative reconstruction mechanism, and multiple loss-function constraint modules.
The method takes a paired low-resolution infrared image and high-resolution visible-light image as input, captures bimodal features with the spatial-focusing residual state space model, resolves modal differences through three-stage interaction in the hierarchical cross-modal bridging encoder to achieve feature alignment, restores the global structural outline with a structure branch while an appearance branch synthesizes realistic high-frequency detail textures, and finally generates a structurally complete, detail-rich high-resolution UAV infrared image through training under multiple loss-function constraints. In other words, bimodal features are captured with a spatial-focusing residual state space model with spatial attention; feature alignment across modal differences is achieved through the three-stage interaction of the hierarchical cross-modal bridging encoder; the global layout and realistic texture of the UAV infrared image are recovered through the structure and appearance branches of a multi-head collaborative reconstruction mechanism; and training under multiple loss-function constraints yields a structurally complete, detail-rich high-resolution UAV infrared image. Optionally, the method comprises the following steps: S1, inputting paired low-resolution infrared images and high-resolution visible light images; S2, capturing b