CN-122024083-A - Cross-modal remote sensing image change detection method

CN 122024083 A

Abstract

The invention relates to the technical field of remote sensing change detection, and in particular to a cross-modal remote sensing image change detection method. The method first acquires optical and SAR image pairs with different time phases and different imaging mechanisms, inputs them into a pre-trained domain adaptation network, maps the heterogeneous images into a shared feature space through adversarial learning, and outputs pseudo-homogeneous images that bridge the modal differences. The processed image pair is then input into a change detection network with an encoder-decoder structure; the encoder adopts a hybrid architecture in which CNN modules and enhanced state space model modules alternate hierarchically, achieving long-range dependency modeling while enhancing multi-scale context perception. The decoding process introduces a multi-output supervision strategy that imposes auxiliary losses at different levels to stabilize training, and the overall network adopts a focal loss function to alleviate the class imbalance problem. The method effectively improves the accuracy and robustness of cross-modal change detection.

Inventors

  • Huang Liang
  • Yang Tonggen
  • Tang Bohui
  • Zhang Man
  • Pu Siming

Assignees

  • Kunming University of Science and Technology (昆明理工大学)

Dates

Publication Date
2026-05-12
Application Date
2025-12-24

Claims (9)

  1. A method for detecting changes in cross-modal remote sensing images, characterized by comprising the following steps: acquiring a first-modality image and a second-modality image of the same geographic area at different time phases, wherein the imaging mechanisms of the first modality and the second modality are different; processing the first-modality image and the second-modality image through a domain adaptation network to reduce the modality differences between them, generating a pseudo-homogeneous image pair; and processing the pseudo-homogeneous image pair through a change detection network to generate a change detection image; wherein the change detection network comprises an encoder in which local feature extraction units and global context modeling units are alternately arranged; the global context modeling unit is constructed based on a state space model and integrates a multi-scale feature selection unit, which dynamically adjusts feature weights according to the multi-scale context information of the input features during the sequence modeling of the state space model.
  2. The method of claim 1, wherein the domain adaptation network is a generative adversarial network with a cycle-consistency constraint, trained with an adversarial loss L_a, a reconstruction loss L_r, and a cycle loss L_c, defined as: L_a = E_{y~p_data(y)}[log D_Y(y)] + E_{x~p_data(x)}[log(1 − D_Y(G(x)))]; L_r = E_{x~p_data(x)}[||F(x) − x||_1] + E_{y~p_data(y)}[||G(y) − y||_1]; L_c = E_{x~p_data(x)}[||F(G(x)) − x||_1] + E_{y~p_data(y)}[||G(F(y)) − y||_1]; where x ~ p_data(x) and y ~ p_data(y) denote the image distribution probabilities, ||·||_1 is the L1 norm, E denotes expectation, x and y are the two types of input image, G is the generator mapping domain X to domain Y and F is the generator mapping domain Y to domain X, D_Y is the discriminator for domain Y, G(x) is the output of image x after passing through generator G, and F(y) is the output of image y after passing through generator F.
  3. The method of claim 1, wherein the multi-scale feature selection unit adjusts feature weights by: extracting multi-scale spatial context information of the input features and generating a spatial attention map based on the multi-scale spatial context information; extracting global channel statistics of the input features and generating a channel attention map based on the global channel statistics; and modulating the features input to the state space model based on the spatial attention map and the channel attention map.
  4. The method of claim 1, wherein the change detection network further comprises a decoder that fuses the multi-scale features extracted by the encoder; at least one intermediate-layer feature of the decoder is supervised during the training phase, and the corresponding intermediate and final output supervision losses are used together to optimize the network parameters. For the four progressive refinement feature nodes f_i (i = 1, …, 4) generated in the decoder, a lightweight mapping operator φ_i is applied to each node to obtain an intermediate prediction, the intermediate prediction is aligned with the label resolution through an upsampling operator U_i, and a linear mapping operator Φ at the end performs channel-level fusion of the multi-scale predictions to obtain the final change map p_f: p_i = U_i(φ_i(f_i)), i = 1, …, 4; p_f = Φ([p_1, p_2, p_3, p_4]); where f_i denotes the feature nodes of the different hierarchy levels and p_i is the intermediate prediction of each feature node after upsampling to the label resolution. The training stage supervises each side output and the final output simultaneously, so that shallow and deep nodes directly receive semantic gradients, the backpropagation path is shortened, and convergence stability is improved.
  5. The method of claim 4, wherein the intermediate and final output supervision losses are calculated using a focal loss function: FL(p_t) = −α_t (1 − p_t)^γ log(p_t); where t indexes the pixel position, p_t represents the predicted probability of the change class, α_t represents a class-balancing weight factor, and γ is the focusing parameter.
  6. The method of claim 1, wherein the encoder is a CNN-MSAM encoder comprising MSAM blocks, in which the linear-complexity global interaction of the state space and the selective enhancement of multi-scale spatial-channel attention are fused into one unit. An MSAM block is formed by a global mixing layer and multi-scale attention aggregation branches. The global mixing layer first performs importance-weighted aggregation of the input sequence according to the state space to obtain a hidden state h, completes gating and channel mixing in the hidden-state space, and finally projects the hidden state h back to the token space, in the specific form: h_t = a_t ⊙ h_{t−1} + B x_t; y_t = C h_t; o_t = σ(W y_t); where a is the weight predicted from the input features, h is the shared hidden state, y is the output token sequence, W is the linear weight matrix, x is the input data sequence, and B and C are projection matrices; C h_t denotes the transformation by matrix C of the hidden state h, σ(W y_t) denotes the linear transformation of y by the weight matrix W followed by the activation function σ, B x_t denotes the linear transformation of the input x, a_t ⊙ h_{t−1} + B x_t denotes the hidden state formed by weighted combination with the input data, and o_t further processes the hidden state h to generate the final output.
  7. The method of claim 1, wherein the first modality is an optical imaging modality and the second modality is a synthetic aperture radar (SAR) imaging modality.
  8. The method of claim 1, further comprising preprocessing the images with cropping and data augmentation before network processing.
  9. The method of claim 4, wherein supervising the decoder intermediate-layer features comprises mapping the layer features into an intermediate prediction map and calculating a loss between the intermediate prediction map and the ground-truth label.
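The cycle-consistency constraint of claim 2 can be sanity-checked with plain Python. Here `G` and `F` stand in for the two generators and the L1 distance is taken over flattened pixel lists; this is a toy sketch of the loss only, not of the patent's networks:

```python
def l1(a, b):
    """Mean absolute (L1) distance between two flattened images."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)

def cycle_loss(x, y, G, F):
    """Cycle loss L_c = ||F(G(x)) - x||_1 + ||G(F(y)) - y||_1 (CycleGAN form)."""
    return l1(F(G(x)), x) + l1(G(F(y)), y)

# With mutually inverse generators the cycle loss vanishes,
# which is exactly the content-preservation constraint.
identity = lambda img: img
```

The loss is zero whenever `F` undoes `G` (and vice versa), so minimizing it pushes the two generators toward being inverses even without paired training data.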
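The multi-output supervision of claim 4 sums a loss over every upsampled side prediction plus the fused final map. A minimal sketch over flattened prediction lists; the per-pixel loss and the side-branch weight `side_w` are illustrative assumptions, not values given in the patent:

```python
def deep_supervised_loss(side_preds, final_pred, label, pixel_loss, side_w=0.4):
    """Total loss = loss(p_f) + side_w * sum_i loss(p_i).

    side_preds : list of intermediate predictions p_i, already upsampled
                 to the label resolution
    final_pred : fused final change map p_f
    """
    def map_loss(pred):
        # mean per-pixel loss between one prediction map and the label
        return sum(pixel_loss(p, t) for p, t in zip(pred, label)) / len(label)
    return map_loss(final_pred) + side_w * sum(map_loss(p) for p in side_preds)
```

Because every side branch receives its own gradient, shallow decoder nodes are trained directly rather than only through the final output, which is the stated motivation for the multi-output strategy.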
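The focal loss of claim 5 admits a compact per-pixel implementation. This sketch assumes the standard focal-loss form; the default `alpha_t` and `gamma` values are common choices, not values from the patent:

```python
import math

def focal_loss(p_t, alpha_t=0.25, gamma=2.0):
    """Focal loss for a single pixel: FL = -alpha_t * (1 - p_t)**gamma * log(p_t).

    p_t     : predicted probability of the true class (changed/unchanged)
    alpha_t : class-balancing weight
    gamma   : focusing parameter; gamma = 0 reduces to weighted cross-entropy
    """
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

Well-classified pixels (high `p_t`) are down-weighted by the `(1 - p_t)**gamma` factor, so training concentrates on the rare "changed" pixels of an imbalanced change map.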
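At its core, the hidden-state mixing in claim 6's global mixing layer is a gated linear recurrence. The following scalar-valued toy sketch (real MSAM blocks operate on vector channels with matrix projections; the function name and scalar parameters are illustrative) traces the update h_t = a_t·h_{t−1} + B·x_t followed by the projection back to token space:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def global_mix(xs, a_s, B=1.0, C=1.0, W=1.0):
    """Scalar sketch: h_t = a_t*h_{t-1} + B*x_t, o_t = sigma(W*(C*h_t)).

    xs  : input token sequence
    a_s : input-predicted weights a_t that gate how much past state is kept
    """
    h, outs = 0.0, []
    for x, a in zip(xs, a_s):
        h = a * h + B * x                  # importance-weighted state update
        outs.append(sigmoid(W * (C * h)))  # project hidden state back to tokens
    return outs
```

Because the state is carried forward linearly, one pass over the sequence gives every token access to all earlier tokens at linear cost, which is the long-range-dependency property the encoder relies on.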

Description

Cross-modal remote sensing image change detection method

Technical Field

The invention relates to the technical field of remote sensing change detection, and in particular to a cross-modal remote sensing image change detection method.

Background

Remote sensing change detection (CD) is a technology for analyzing and determining surface changes across different periods. As a core technology for monitoring surface dynamic processes, it plays an irreplaceable role in fields such as natural disaster assessment, urban expansion analysis, and disaster emergency response. Optical images are widely used as the primary data source for CD tasks owing to their high spatial resolution and rich spectral information. However, optical imaging systems are susceptible to illumination conditions and bad weather, which can produce invalid observations in the acquired images and lead to missed or false detections in CD tasks. In contrast, synthetic aperture radar (SAR) is an active microwave imaging system with all-weather observation capability that provides reliable ground observations even in extreme weather conditions. Given the strong complementarity between optical and SAR images, their combination has great potential for CD and has become a key task in remote sensing image interpretation. Early optical-SAR CD approaches focused on building mechanisms that map "incomparable" data into a "comparable" space. The classification-comparison strategy was proposed first; its core is to classify the two-phase images independently and compare their differences in the class space.
For example, Han et al. (2021) trained a hierarchical extreme learning machine (HELM) on samples to extract multi-temporal feature maps and generate classification results, then compared the feature classes to detect change; Lan et al. (2019) proposed a collaborative multi-temporal segmentation and hierarchical compound classification (CMS-HCC) method that generates region objects through multi-scale segmentation and builds a region-based multi-temporal hierarchical Markov random field (RMH-MRF) model to optimize the label configuration. However, such methods depend heavily on the classification accuracy of the remote sensing images: a drop in classification accuracy at the initial stage translates directly into a drop in CD accuracy at the later stage, and errors accumulate stepwise along the subsequent comparison chain. To avoid this accumulation of classification errors, a paradigm based on similarity metrics was developed, whose goal is to build invariant operators that measure image similarity through multi-modal image-to-image correlation, model image structural similarity, and map heterogeneous images into a common feature space for comparison. Traditional methods rely on hand-designed modality-invariant operators: for example, Lv et al. (2022) proposed an object-oriented sorted histogram similarity measure (OSSM) to measure the change magnitude between bi-temporal remote sensing images, and Te et al. (2024) proposed a new global structure graph (GSG) multi-modal CD method that expresses the structural information of each multi-temporal image by extracting "comparable" structural features between multi-modal data sets and then cross-mapping the structural information into the other data domain.
With the advent of deep learning, Chen et al. (2022) proposed an unsupervised multi-modal CD framework (SR-GCAE) based on structural-relationship-graph representation learning, which builds modality-invariant structural relationships at the object level and generates a difference map with similarity metrics, and Liu et al. (2025) further learn cross-modal public representations in a shared comparable subspace, discriminating changes in the feature space with metrics. However, these methods suffer from high false-positive rates, place heavy demands on the design of the similarity metric function, and generalize poorly in complex scenes. Going further, domain conversion methods based on image translation enable homogeneous comparison. Wei et al. (2023) designed a cross-mapped autoencoder (TCMA) that converts features into a common representation space and compares the two common-domain feature maps to obtain a difference map, and Luppino et al. (2020) implemented bi-directional optical-SAR conversion using CycleGAN, constraining content preservation with a cycle-consistency loss. In addition, domain conversion methods between optics and SAR based on diffusion models have emerged in recent years (Wang et al., 2024), providing a new tool for constructing high-fidelity homogeneous comparisons. In recent years, deep learning has driven optical and SAR remote sensing CD into a new stage. The convolutional neural network (CNN) became the initial solution by virtue of its strength in local feature extraction, for exampl