
CN-122023948-A - Multi-modal remote sensing image classification method under arbitrary modality-missing conditions

CN 122023948 A

Abstract

The invention discloses a multi-modal remote sensing image classification method under arbitrary modality-missing conditions. The method comprises: acquiring multi-modal remote sensing image data, including hyperspectral images, images of other modalities and ground-object type labels; constructing a hyperspectral image encoder for the hyperspectral images and encoders for the other modality images, and extracting the real specific features of each modality's images; constructing a modality-shared feature encoder and extracting the shared features of each modality's images; constructing a real specific feature loss and a shared feature alignment loss; generating missing modalities and their generated specific features; constructing the total loss of the one-stage pre-training; performing the two-stage fine-tuning; constructing the overall classification features; and obtaining the ground-object classification of the multi-modal remote sensing images through a classifier.

Inventors

  • Hang Renlong
  • Xi Yue
  • Wang Wenzhen

Assignees

  • Nanjing University of Information Science and Technology (南京信息工程大学)

Dates

Publication Date
2026-05-12
Application Date
2026-04-13

Claims (8)

  1. A multi-modal remote sensing image classification method under arbitrary modality-missing conditions, characterized in that, for a target area, the following steps S1-S8 are executed to construct and train a remote sensing image ground-object classification model and complete ground-object classification of the multi-modal remote sensing image:
     Step S1, acquiring a multi-modal remote sensing image of the target area, comprising a hyperspectral image, a synthetic aperture radar image and a digital surface model image;
     Step S2, constructing a hyperspectral image encoder for the hyperspectral image and extracting its real specific features; constructing image encoders for each of the other modalities and extracting the real specific features of their images; and constructing a modality-shared feature encoder and extracting the shared features of the hyperspectral image, of the synthetic aperture radar image and of the digital surface model image;
     Step S3, constructing a real specific feature loss based on the real specific features of the hyperspectral, synthetic aperture radar and digital surface model images, and constructing a shared feature alignment loss in a shared feature alignment task based on their shared features and a KL divergence loss;
     Step S4, constructing a missing modality generation module, randomly discarding modalities to produce missing modalities, generating the generated specific features of the missing modalities based on an attention mechanism, constructing the total loss of the one-stage pre-training from the generated specific features of the missing modalities and the corresponding real specific features from step S2, and performing the one-stage pre-training;
     Step S5, after the one-stage pre-training is finished, randomly discarding modality data from the training samples to generate training samples for the two-stage fine-tuning;
     Step S6, performing the two-stage fine-tuning: constructing a logic-guided gated fusion module, inputting the real specific features and the generated specific features of the modalities in a training sample, obtaining the overall specific features by combining a CNN gating network with a prompt-learning method, and constructing a gating weight loss;
     Step S7, fusing the shared features of the available modalities into overall shared features, adding the overall shared features and the overall specific features to obtain the overall classification features, and obtaining the ground-object classification of the multi-modal remote sensing image through a classifier;
     Step S8, after the two-stage fine-tuning is finished, training of the remote sensing image ground-object classification model is complete; the trained model is applied to complete ground-object classification of the multi-modal remote sensing image of the target area.
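The end-to-end flow of claim 1 can be sketched with toy tensors. Everything below (feature dimensions, linear stand-ins for the encoders, uniform weighting in place of the gated fusion, and the linear classifier head) is an illustrative assumption, not the patented architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 256           # feature dimension (assumed)
NUM_CLASSES = 10  # number of ground-object classes (assumed)

# Toy per-modality inputs: HSI, SAR, DSM patches flattened to vectors.
inputs = {m: rng.normal(size=64) for m in ("hsi", "sar", "dsm")}

# Stand-ins for the modality-specific encoders (S2) and the shared encoder.
spec_enc = {m: rng.normal(size=(D, 64)) / 8 for m in inputs}
shared_enc = rng.normal(size=(D, 64)) / 8

specific = {m: spec_enc[m] @ x for m, x in inputs.items()}
shared = {m: shared_enc @ x for m, x in inputs.items()}

# S7: fuse shared features by averaging, specific features by (here, uniform)
# weighting, add the two, and classify with a linear head.
overall_shared = np.mean(list(shared.values()), axis=0)
overall_specific = np.mean(list(specific.values()), axis=0)
overall = overall_shared + overall_specific

W = rng.normal(size=(NUM_CLASSES, D)) / 16
logits = W @ overall
pred = int(np.argmax(logits))
print(logits.shape, pred)
```

In the actual method the uniform average over specific features is replaced by the gated fusion of step S6, and missing modalities are filled in by the generation module of step S4.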
  2. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 1, wherein step S1 specifically comprises: Step S1.1, acquiring a multi-modal remote sensing image of the target area and the corresponding ground-object classification ground-truth labels, the ground-truth labels being ground-object types; Step S1.2, dividing the multi-modal remote sensing image and the corresponding ground-object classification ground-truth labels into a training set, a validation set and a test set according to a preset proportion.
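The split of step S1.2 is a routine index partition. A minimal sketch, assuming a 60/20/20 proportion (the actual preset ratio is not stated in the source):

```python
import numpy as np

def split_indices(n, train=0.6, val=0.2, seed=0):
    """Shuffle sample indices and split them into train/val/test by preset
    proportions (60/20/20 here is an illustrative assumption)."""
    idx = np.random.default_rng(seed).permutation(n)
    n_tr, n_va = int(n * train), int(n * val)
    return idx[:n_tr], idx[n_tr:n_tr + n_va], idx[n_tr + n_va:]

tr, va, te = split_indices(100)
print(len(tr), len(va), len(te))
```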
  3. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 1, wherein step S2 specifically comprises:
     Step S2.1, constructing, for the hyperspectral image, a hyperspectral image encoder comprising a three-dimensional convolution module and a two-dimensional convolution module. The three-dimensional convolution module comprises three 3-D convolution layers: the first layer uses a kernel of size (7,3,3) with stride 1 and padding (3,1,1), keeping the input and output sizes unchanged while increasing the channel number from 1 to 8; the second layer uses a kernel of size (7,1,1) with stride (3,1,1), performing no further spatial convolution but reducing the spectral dimension of the image, and increases the channel number from 8 to 16; the third layer is identical in structure to the second, further reducing the spectral dimension and increasing the channel number for images whose band count exceeds a preset threshold. Each 3-D convolution layer is followed by a 3-D batch normalization layer and a rectified linear unit, and the 3-D output is flattened into a 2-D feature-map format. The two-dimensional convolution module consists of three residual layers with 64, 128 and 256 channels in turn; finally, the hyperspectral image encoder outputs the real specific features of the hyperspectral image through a global average pooling layer and a fully connected layer.
     Step S2.2, constructing, for the synthetic aperture radar image and the digital surface model image respectively, a synthetic aperture radar image encoder and a digital surface model image encoder with identical structure: first, an initial 2-D convolution layer with a 3×3 kernel and stride 1 uniformly maps the channel number to 64; then three standard ResNet residual layers, each comprising two 3×3 convolution layers and a residual connection with stride 2 for downsampling, with 64, 128 and 256 channels in turn; finally, through a global average pooling layer and a fully connected layer, the two encoders output the real specific features of the synthetic aperture radar image and of the digital surface model image respectively.
     Step S2.3, constructing a modality-shared feature encoder that receives an image of any modality and maps it into the same shared feature space. The encoder comprises three parts: first, a modality adaptation layer that unifies the input into a 64-channel feature map, using a 1×1 2-D convolution layer to compress the band count of the hyperspectral image and a 3×3 convolution layer to extract features from and up-dimension the images of the other modalities; second, a shared backbone that adaptively converts images of all modalities into the shape (B, 64, H, W), where the first dimension B is the batch size, the second the channel number, the third H the image height and the fourth W the image width, and extracts deep features with a set of parameter-shared ResNet layers; and third, an output head consisting of a standard pooling layer and a fully connected layer, yielding the shared features of the hyperspectral image, of the synthetic aperture radar image and of the digital surface model image.
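The size-preserving and spectrum-reducing behaviour of the 3-D convolution layers in step S2.1 follows from the standard convolution output-size formula. A quick check, assuming symmetric padding (3,1,1) for the first layer and (3,0,0) for the second (the source states the padding only partially), with hypothetical input sizes:

```python
def conv_out(n, k, s, p):
    """Standard convolution output size: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# First 3-D layer: kernel (7,3,3), stride 1, padding (3,1,1) ->
# spectral and spatial sizes unchanged, as the claim states.
bands, h, w = 144, 11, 11  # hypothetical hyperspectral patch
out1 = (conv_out(bands, 7, 1, 3), conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1))

# Second 3-D layer: kernel (7,1,1), stride (3,1,1), padding (3,0,0) (assumed):
# no spatial convolution, spectral dimension reduced roughly threefold.
out2 = (conv_out(out1[0], 7, 3, 3),
        conv_out(out1[1], 1, 1, 0),
        conv_out(out1[2], 1, 1, 0))
print(out1, out2)
```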
  4. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 3, wherein step S3 specifically comprises:
     Step S3.1, based on the real specific features of the hyperspectral image, of the synthetic aperture radar image and of the digital surface model image, computing the real specific feature loss from the pairwise similarities:
     L_spec = (1 / (M(M−1))) · Σ_{i≠j} cos(f_i, f_j)
     where f_i denotes the real specific features of modality i, f_j the real specific features of modality j, cos(·,·) the cosine similarity, M the total number of modalities, and the sum is averaged over all ordered pairs;
     Step S3.2, aligning the shared features of the hyperspectral, synthetic aperture radar and digital surface model images through a KL divergence loss. The shared feature distribution of modality i is assumed to follow a Gaussian distribution N(μ_i, σ_i²), where μ_i and σ_i are the mean and standard deviation of the shared features of modality i, while the average statistical distribution over all modalities is taken as the central distribution N(μ̄, σ̄²), where μ̄ and σ̄ are the averages of the means and of the standard deviations of the shared features of all modalities. The KL divergence loss is:
     KL(N(μ_i, σ_i²) ‖ N(μ̄, σ̄²)) = log(σ̄/σ_i) + (σ_i² + (μ_i − μ̄)²) / (2σ̄²) − 1/2
     In the shared feature alignment task, with the average shared feature distribution as the center, the KL divergence loss pulls the shared feature distribution of each single modality toward the center, and the average of the losses between every modality and the center is recorded as the shared feature alignment loss L_align.
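The two losses of step S3 can be written directly in NumPy. The pairwise-cosine form of the specific loss and the closed-form Gaussian KL below follow the descriptions in steps S3.1 and S3.2; the feature dimension and sample values are arbitrary:

```python
import numpy as np

def specific_loss(feats):
    """Average pairwise cosine similarity between modality-specific features
    (step S3.1): minimizing it drives the specific features apart."""
    M = len(feats)
    total = 0.0
    for i in range(M):
        for j in range(M):
            if i != j:
                a, b = feats[i], feats[j]
                total += float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return total / (M * (M - 1))

def kl_to_center(mu_i, sigma_i, mu_c, sigma_c):
    """Closed-form KL(N(mu_i, sigma_i^2) || N(mu_c, sigma_c^2)), used to pull
    each modality's shared-feature distribution toward the center (S3.2)."""
    return (np.log(sigma_c / sigma_i)
            + (sigma_i**2 + (mu_i - mu_c)**2) / (2 * sigma_c**2) - 0.5)

rng = np.random.default_rng(1)
feats = [rng.normal(size=32) for _ in range(3)]
print(round(specific_loss(feats), 4))
print(kl_to_center(0.0, 1.0, 0.0, 1.0))  # identical distributions -> 0.0
```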
  5. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 4, wherein step S4 specifically comprises:
     Step S4.1, constructing the missing modality generation module, which receives the real specific features and shared features of all modalities and discards the images of randomly chosen modalities; the discarded modalities serve as missing modalities and the retained ones as available modalities;
     Step S4.2, constructing a reconstruction module that fuses the real specific features and shared features of all available modalities based on an attention mechanism, forming the key K and value V of the attention mechanism through a linear layer;
     Step S4.3, generating a descriptive text for the inherent characteristics of each modality and feeding it into a pre-trained text encoder to obtain the language features of the missing modality, used as the query Q of the attention mechanism;
     Step S4.4, performing a multi-head attention operation over the obtained query Q, key K and value V to obtain the context vector after interaction, and generating the generated specific features of missing modality i via a multi-layer perceptron;
     Step S4.5, computing the MSE loss between the generated specific features of missing modality i and the corresponding real specific features of modality i obtained in step S2.2, as the generated specific feature reconstruction loss L_rec; the one-stage pre-training is then performed, its total loss L_pre combining the real specific feature loss L_spec, the generated specific feature reconstruction loss L_rec and the shared feature alignment loss L_align, with λ an adjustable parameter.
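The generation path of steps S4.2-S4.5 is, at its core, cross-attention with a language-feature query. A single-head sketch with random stand-ins for the projected features, the text encoder output, and the MLP head (all assumed, dimensions arbitrary):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention, the core operation of step S4.4."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(2)
d = 64
# K, V: linearly projected specific + shared features of available modalities;
# Q: language features of the missing modality from a (stand-in) text encoder.
K = rng.normal(size=(4, d))
V = rng.normal(size=(4, d))
Q = rng.normal(size=(1, d))

ctx = attention(Q, K, V)                          # context vector (S4.4)
gen = np.tanh(ctx @ rng.normal(size=(d, d)) / 8)  # toy MLP head (assumed)
real = rng.normal(size=(1, d))                    # real specific feature
mse = float(np.mean((gen - real) ** 2))           # reconstruction loss (S4.5)
print(ctx.shape, gen.shape)
```

The patented module uses multi-head attention and a pre-trained text encoder; both are collapsed here into single-head attention and a random query for brevity.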
  6. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 5, wherein step S5 specifically comprises:
     Step S5.1, after the one-stage pre-training is finished, saving and freezing the parameters of each modality image encoder and of the missing modality generation module;
     Step S5.2, for the M modalities, randomly discarding the data of some modalities, the resulting data being recorded as multi-modality data; then, from the multi-modality data, randomly discarding further modality data, the resulting data being recorded as few-modality data.
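The two-step dropping of step S5.2 can be sketched as nested random discards. How many modalities are dropped at each step is not specified in the source; one per step is an illustrative assumption:

```python
import random

MODALITIES = ["hsi", "sar", "dsm"]

def drop(modalities, k, rng):
    """Randomly discard k modalities, returning (kept, dropped)."""
    dropped = set(rng.sample(modalities, k))
    return [m for m in modalities if m not in dropped], sorted(dropped)

rng = random.Random(0)
# Step S5.2: first drop -> "multi-modality" data; a second drop applied to
# the multi-modality data -> "few-modality" data.
multi, _ = drop(MODALITIES, 1, rng)
few, missing = drop(multi, 1, rng)
print(multi, few, missing)
```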
  7. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 6, wherein step S6 specifically comprises:
     Step S6.1, performing the two-stage fine-tuning, in which the specific features of each missing modality are reconstructed and substituted by the missing modality generation module trained during the one-stage pre-training, while the shared features of the missing modalities are simply discarded;
     Step S6.2, after the missing modality generation module has reconstructed the specific features, dividing the specific features of all modalities into original real specific features and reconstructed generated specific features; then constructing the logic-guided gated fusion module, which combines a CNN gating network with a prompt-learning method: a learnable vector is concatenated as a prompt to every specific feature, whether real or generated, all prompted specific features are fed into the CNN gating network to generate weight coefficients, and a weighted fusion yields the overall specific features;
     Step S6.3, constructing the gating weight loss L_gate, where i indexes the missing modalities of the few-modality data, j indexes its available modalities, M is the total number of modalities, and n is the total number of missing modalities of the few-modality data; the loss is expressed in terms of the weight of the generated specific features of modality i in the few-modality data, the weight of the real specific features of modality i in the multi-modality data, the weight of the real specific features of modality j in the multi-modality data, and the weight of the real specific features of modality j in the few-modality data.
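The prompt-plus-gating fusion of step S6.2 can be sketched with a linear scorer standing in for the CNN gating network (the scorer, the prompt length, and the feature dimension are all assumptions; the gating weight loss of step S6.3 is omitted because its exact formula is not reproduced in the source):

```python
import numpy as np

rng = np.random.default_rng(3)
d, p = 64, 8                 # feature and prompt lengths (assumed)
prompt = rng.normal(size=p)  # learnable prompt vector (fixed here)

# Specific features of 3 modalities: some real, some reconstructed/generated.
feats = [rng.normal(size=d) for _ in range(3)]

# Stand-in for the CNN gating network: a linear scorer over [feature; prompt]
# followed by a softmax, yielding one fusion weight per modality.
scorer = rng.normal(size=d + p) / 8
scores = np.array([scorer @ np.concatenate([f, prompt]) for f in feats])
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# Weighted fusion produces the overall specific features.
overall_specific = sum(w * f for w, f in zip(weights, feats))
print(np.round(weights, 3), overall_specific.shape)
```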
  8. The multi-modal remote sensing image classification method under arbitrary modality-missing conditions according to claim 7, wherein step S7 specifically comprises:
     Step S7.1, averaging the shared features of all available modalities to obtain the overall shared features; adding the overall shared features and the overall specific features to obtain the overall classification features; feeding them into the classifier to obtain the final logits; and computing the classification cross-entropy loss L_ce against the ground-object classification ground-truth labels;
     Step S7.2, constructing the distance loss L_dis between the few-modality data and the multi-modality data, where L_dis is defined in terms of the classification cross-entropy loss under the multi-modality data and the classification cross-entropy loss under the few-modality data; the overall loss of the two-stage fine-tuning stage is then computed as a combination of the classification cross-entropy loss L_ce, the gating weight loss L_gate and the distance loss L_dis between the few-modality and multi-modality data, with two adjustable parameters.
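The loss assembly of step S7 can be sketched numerically. The exact form of the distance loss is not reproduced in the source; the absolute difference of the two cross-entropy terms used below, and the values of the two adjustable parameters, are illustrative assumptions:

```python
import numpy as np

def cross_entropy(logits, label):
    """Softmax cross-entropy for a single sample."""
    z = logits - logits.max()
    return float(np.log(np.exp(z).sum()) - z[label])

rng = np.random.default_rng(4)
logits_multi = rng.normal(size=10)  # classifier output on multi-modality data
logits_few = rng.normal(size=10)    # classifier output on few-modality data
label = 3

ce_multi = cross_entropy(logits_multi, label)
ce_few = cross_entropy(logits_few, label)
l_gate = 0.1                        # gating weight loss (placeholder value)
l_dis = abs(ce_few - ce_multi)      # assumed stand-in for the distance loss

alpha, beta = 0.5, 0.5              # adjustable parameters (assumed values)
l_total = ce_multi + ce_few + alpha * l_gate + beta * l_dis
print(round(l_total, 3))
```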

Description

Multi-modal remote sensing image classification method under arbitrary modality-missing conditions

Technical Field

The invention relates to the field of satellite remote sensing imagery, and in particular to a multi-modal remote sensing image classification method under arbitrary modality-missing conditions.

Background

The acquisition of remote sensing data frequently suffers from adverse weather and sensor instability. Natural disasters such as earthquakes, floods and typhoons are often accompanied by severe overcast and rainy weather, under which traditional optical satellites (such as the Gaofen series and Landsat) can hardly observe the situation on the ground. Accurately identifying ground-object types is critical for emergency disaster response, agricultural monitoring, military reconnaissance and similar applications. Most existing multi-modal remote sensing data fusion models assume by default that complete, unbiased original images are available; such models can hardly cope with modality loss caused by extreme weather (cloud and fog occlusion) or particular time periods (night-time imaging), which severely degrades their performance. With the development of deep learning, AI models for ground-object classification of remote sensing images have emerged in great numbers, yet models suited to the modality-missing problem remain scarce. Several approaches to the modality-missing problem exist: early retrieval models search a pre-built database for similar samples to substitute for the missing ones, and generative models such as GANs and diffusion models offer a means of restoring the original image; however, most of these methods have shortcomings.
The retrieval approach must search and compare every processed sample against a huge database, which entails heavy computation, slow inference, and additional memory to store the database. Image restoration methods require complex model designs with an extremely large training cost: a typical diffusion model must be trained for four to five hundred epochs over thousands of diffusion time steps, making it unattractive in both training time and parameter count. In addition, many models are designed around bimodal missingness; with three or more modalities the model structure changes greatly and must be adjusted for every missing-modality combination, a problem particularly evident in distillation-based and hallucination-based networks. Therefore, how to design a method that not only solves the modality-missing problem but also controls the training cost and preserves the flexibility of the model, so as to adapt to any modality combination, has become a pressing problem in the current remote sensing image fusion field.

Disclosure of the Invention

The invention aims to provide a multi-modal remote sensing image classification method under arbitrary modality-missing conditions, which addresses the problem that existing methods depend heavily on complete-modality data and can hardly cope with modality loss caused by special weather (cloud and fog occlusion), particular time periods (night-time imaging) and the like.
To achieve these objectives, the invention provides a multi-modal remote sensing image classification method under arbitrary modality-missing conditions, in which, for a target area, the following steps S1-S8 are executed to construct and train a remote sensing image ground-object classification model and complete ground-object classification of the multi-modal remote sensing image: Step S1, acquiring a multi-modal remote sensing image of the target area, comprising a hyperspectral image, a synthetic aperture radar image and a digital surface model image; Step S2, constructing a hyperspectral image encoder for the hyperspectral image and extracting its real specific features, constructing image encoders for each of the other modalities and extracting the real specific features of their images, and constructing a modality-shared feature encoder and extracting the shared features of the hyperspectral image, of the synthetic aperture radar image and of the digital surface model image; Step S3, constructing a real specific feature loss based on the real specific features of the hyperspectral, synthetic aperture radar and digital surface model images, and constructing a shared feature alignment loss in a shared feature alignment task based on their shared features and a KL divergence loss; Step S4, constructing a missing modality generation module, randomly discarding modalities to produce missing modalities, and generating the generated specific features of the missing modalities based on an attention mechanism, constru