CN-121811266-B - Remote sensing image change detection method

Abstract

The application relates to the intersection of remote sensing image processing and computer vision, and discloses a change detection method for remote sensing images comprising the following steps: obtaining a first time-phase underlying pixel feature and a second time-phase underlying pixel feature from a first time-phase remote sensing image and a second time-phase remote sensing image; obtaining an enhanced visual feature from the first and second time-phase underlying pixel features; obtaining a semantic guidance feature from a first pixel-text similarity score map and a second pixel-text similarity score map; and performing pixel-level binary classification on multi-level fusion features of the enhanced visual feature and the semantic guidance feature, and outputting a change detection binary map. The application improves the precision and accuracy of remote sensing image change detection.

Inventors

  • HAO WENRUI
  • ZHANG JIANGMEI
  • LIU HAOLIN
  • YANG GUOWEI
  • CHEN ANYU
  • WANG JIAQI
  • ZHANG CAOLIN

Assignees

  • Southwest University of Science and Technology (西南科技大学)

Dates

Publication Date
2026-05-05
Application Date
2026-03-10

Claims (8)

  1. A change detection method for remote sensing images, characterized by comprising the following steps: Step 1: from a first time-phase remote sensing image and a second time-phase remote sensing image, obtain a first time-phase underlying pixel feature and a second time-phase underlying pixel feature; obtain a first pixel-text similarity score map and a second pixel-text similarity score map from a first text prompt lexicon and a second text prompt lexicon, wherein the first and second text prompt lexicons are text description sets constructed from preset remote sensing land-cover categories and represent the semantic category information of the land cover, and the first and second time-phase remote sensing images form a bi-temporal image pair. Step 2: obtain an enhanced visual feature from the first and second time-phase underlying pixel features. Step 3: obtain a semantic guidance feature from the first and second pixel-text similarity score maps, and obtain a total loss function for optimizing the semantic guidance feature. Step 4: obtain fusion features at multiple levels from the enhanced visual feature and the semantic guidance feature. Step 5: perform pixel-level binary classification on the multi-level fusion features and output a change detection binary map, which represents the final spatial distribution of land-cover change between the first and second time-phase remote sensing images; each pixel value in the binary map corresponds to a binary state of the surface location, comprising a first state marking a region where the land cover has changed and a second state marking a background region where it has not. Step 2 comprises the following steps: Step 21: from the first and second time-phase underlying pixel features, obtain a difference feature and a consistency feature through differencing and addition operations. Step 22: feed the difference feature into a difference branch, first refining it through a convolution layer, then applying global average pooling (AvgPool) and global max pooling (MaxPool) over the local spatial context to obtain the global statistics of the difference feature; feed the consistency feature into a consistency branch and extract its statistics with global average pooling and local feature pooling. Step 23: from the global statistics of the difference feature and the statistics of the consistency feature obtained in step 22, generate a difference-branch weight and a consistency-branch weight through a channel compression-expansion mechanism, i.e. compress and then expand the feature dimension of the statistics, and apply a Sigmoid activation function to output the final difference-branch and consistency-branch weights. Step 24: with a cross-fusion strategy, combine the difference feature and the consistency feature each with the other branch's weight to obtain an enhanced difference feature and an enhanced consistency feature; concatenate the enhanced difference feature and the enhanced consistency feature and fuse them through a convolution layer into the enhanced visual feature. Step 3 comprises the following steps: Step 31: from the first and second pixel-text similarity score maps, obtain a semantic change feature and a semantic consistency feature. Step 32: apply efficient channel attention (ECA) to the semantic change feature and the semantic consistency feature separately: obtain channel statistics by global average pooling (GAP) and generate attention weights with a one-dimensional convolution (Conv1D) of adaptive kernel size, yielding a channel-enhanced semantic change feature and a channel-enhanced semantic consistency feature. Step 33: exploiting the complementarity of changed and unchanged information, construct a fusion gate with a learnable fusion coefficient, perform weighted fusion of the two channel-enhanced semantic features, and output the semantic guidance feature. Step 34: obtain a total loss function that constrains the boundary precision of the change region and optimizes the semantic guidance feature.
  2. The method according to claim 1, wherein step 1 comprises: Step 11: load a CLIP-ViT pre-trained model, freeze the text encoder parameters to preserve image-text prior knowledge, and adapt the image encoder to the first and second time-phase remote sensing images; the CLIP-ViT pre-trained model comprises the text encoder and the image encoder. Step 12: input the first and second time-phase remote sensing images into the image encoder to extract the first and second time-phase underlying pixel features respectively, wherein both features contain a Transformer global token that characterizes global information and ensures that the first and second time-phase underlying pixel features share a unified feature dimension. Step 13: input the first and second text prompt lexicons into the text encoder to generate a text feature matrix containing land-cover category semantic information; compute the similarity between the text feature matrix and the first and second time-phase underlying pixel features respectively, obtaining the first and second pixel-text similarity score maps, which reflect the semantic-membership confidence distribution of each pixel in the first and second time-phase remote sensing images.
  3. The method according to claim 2, wherein in step 13 the first and second pixel-text similarity score maps are computed as the cosine similarity between the text feature matrix and the underlying pixel features of the corresponding time phase, the time-phase index taking the value 1 or 2.
  4. The method according to claim 1, wherein step 21 comprises: obtain the difference feature as the element-wise absolute difference of the two time-phase underlying pixel features, capturing pixel-level change information, and obtain the consistency feature as their sum, characterizing the background information common to the first and second time-phase remote sensing images, thereby explicitly separating change from invariance. Step 23 comprises: obtain the difference-branch weight by passing the global average pooling and global max pooling statistics of the convolved difference feature through a multi-layer perceptron and a Sigmoid activation; obtain the consistency-branch weight likewise from the global average pooling and local feature pooling statistics of the consistency feature. Step 24 comprises: obtain the enhanced difference feature and the enhanced consistency feature by element-wise multiplication with the cross-branch weights, and obtain the enhanced visual feature by applying a one-dimensional convolution operation to their concatenation.
  5. The method according to claim 1, wherein step 31 comprises: obtain the semantic change feature and the semantic consistency feature from the first and second pixel-text similarity score maps, the semantic change feature highlighting true change regions where the semantic class transforms, and the semantic consistency feature preserving the features of the background-invariant regions in the first and second time-phase remote sensing images. Step 32 comprises: obtain the channel-enhanced semantic change feature and the channel-enhanced semantic consistency feature by element-wise multiplication with attention weights produced by a one-dimensional convolution of kernel size k over the globally average-pooled channel statistics, where the kernel size k is adaptively determined from the number of channels via a channel-proportion parameter and a bias parameter, rounded to the nearest odd number. Step 33 comprises: obtain the semantic guidance feature through a fusion gate formed by a learnable fusion coefficient, which performs adaptive weighted fusion of the channel-enhanced semantic change feature and the channel-enhanced semantic consistency feature. Step 34 comprises: obtain the total loss function as the combination of a main loss and an auxiliary loss weighted by a loss-weight coefficient.
  6. The method according to claim 1, wherein step 4 comprises: Step 41: match the enhanced visual feature and the semantic guidance feature at multiple corresponding scales, adjusting spatial resolution by interpolation or downsampling so that the dimensions of the enhanced visual feature and the semantic guidance feature agree at each scale. Step 42: input the features matched in step 41 into a feature pyramid network (FPN) and concatenate feature maps of different levels through top-down upsampling and bottom-up lateral connections. Step 43: apply convolutional compression and activation to each concatenated level feature map and output fusion features at multiple levels, the number of levels being equal to the number of scales.
  7. The method according to claim 1, wherein step 5 comprises: Step 51: input the multi-level fusion features into a decoder for pixel-level binary classification and output the change detection binary map, the decoder consisting of multiple convolution layers and a Sigmoid activation function.
  8. The method according to claim 1, further comprising, after step 5: evaluating detection performance with multi-dimensional evaluation indices comprising precision, recall, F1 score, overall accuracy (OA), and intersection-over-union (IoU).
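Steps 21-24 of claim 1 (the dual-flow attention over difference and consistency features) can be sketched as a PyTorch module. This is a minimal illustration, not the filed implementation: channel counts, the reduction ratio, the 3x3 local pooling window, and the choice to share the compression-expansion MLP across the two pooled statistics are all assumptions the patent does not fix.

```python
import torch
import torch.nn as nn

class DualBranchAttention(nn.Module):
    """Sketch of the dual-flow attention module (claim 1, steps 21-24).

    Layer sizes and the reduction ratio are illustrative assumptions.
    """
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.diff_conv = nn.Conv2d(channels, channels, 3, padding=1)
        # Channel compression-expansion (squeeze-then-expand) MLPs, step 23.
        self.diff_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.cons_mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        # One reading of "local feature pooling": a small sliding average.
        self.local_pool = nn.AvgPool2d(3, stride=1, padding=1)
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
        d = torch.abs(f1 - f2)      # step 21: difference feature
        c = f1 + f2                 # step 21: consistency feature
        d = self.diff_conv(d)       # step 22: refine the difference branch
        # Steps 22-23: pooled statistics -> compress-expand -> Sigmoid weights.
        g_avg = d.mean(dim=(2, 3), keepdim=True)
        g_max = d.amax(dim=(2, 3), keepdim=True)
        w_d = torch.sigmoid(self.diff_mlp(g_avg) + self.diff_mlp(g_max))
        c_avg = c.mean(dim=(2, 3), keepdim=True)
        c_loc = self.local_pool(c).mean(dim=(2, 3), keepdim=True)
        w_c = torch.sigmoid(self.cons_mlp(c_avg) + self.cons_mlp(c_loc))
        # Step 24: cross fusion -- each feature takes the OTHER branch's weight.
        d_hat = d * w_c
        c_hat = c * w_d
        # Concatenate and fuse into the enhanced visual feature.
        return self.fuse(torch.cat([d_hat, c_hat], dim=1))
```

The cross-weighting (difference feature scaled by the consistency weight and vice versa) is the point of step 24: each branch is modulated by what the other branch considers salient.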
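Steps 31-33 (ECA channel attention with an adaptive kernel, followed by a gated fusion of the semantic change and consistency features) can likewise be sketched. The gamma and bias values are the common ECA defaults, and the absolute-difference/sum forms of the two semantic features are assumptions read from claim 5, not explicit formulas from the filing.

```python
import math
import torch
import torch.nn as nn

def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """Adaptive Conv1D kernel size from the channel count (step 32):
    |log2(C)/gamma + b| rounded to the nearest odd number. gamma and b
    are the channel-proportion and bias parameters; the values here are
    assumed defaults."""
    t = int(abs(math.log2(channels) / gamma + b))
    return t if t % 2 == 1 else t + 1

class ECA(nn.Module):
    """Efficient channel attention: GAP + 1-D conv + Sigmoid weighting."""
    def __init__(self, channels: int):
        super().__init__()
        k = eca_kernel_size(channels)
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = x.mean(dim=(2, 3))                    # GAP -> (B, C) channel stats
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # local cross-channel mixing
        return x * torch.sigmoid(w)[:, :, None, None]

class SemanticFusionGate(nn.Module):
    """Steps 31-33: channel-enhance both semantic features, then fuse
    them through a gate with a learnable coefficient."""
    def __init__(self, channels: int):
        super().__init__()
        self.eca_change = ECA(channels)
        self.eca_cons = ECA(channels)
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion coefficient

    def forward(self, s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
        change = torch.abs(s1 - s2)   # semantic change feature (assumed form)
        cons = s1 + s2                # semantic consistency feature (assumed form)
        change = self.eca_change(change)
        cons = self.eca_cons(cons)
        a = torch.sigmoid(self.alpha)  # keep the gate weight in (0, 1)
        return a * change + (1 - a) * cons  # semantic guidance feature
```

Squashing the learnable coefficient through a Sigmoid is one way to keep the gate a convex combination of the two enhanced features; the patent only states that the coefficient is learnable.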
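The evaluation indices of claim 8 follow the standard definitions over pixel-level confusion counts; the patent does not restate the formulas, so the ones below are the conventional ones.

```python
def change_detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1, overall accuracy (OA), and IoU from
    pixel-level confusion counts (claim 8), using the standard
    definitions with zero-division guards."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    oa = (tp + tn) / (tp + fp + fn + tn)
    iou = tp / (tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall,
            "f1": f1, "oa": oa, "iou": iou}
```

Here "positive" is the changed class, so IoU is computed for the change region only, which is the usual convention in change detection benchmarks.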

Description

Remote sensing image change detection method

Technical Field

The application relates to the intersection of remote sensing image processing and computer vision, and in particular to a change detection method for remote sensing images.

Background

Remote sensing image change detection identifies surface change patterns by comparing observations of the same area at different time phases, and is widely applied in natural resource monitoring, disaster prevention and mitigation, urban dynamic analysis, and related fields. With the development of deep learning, Siamese network architectures based on convolutional neural networks (CNNs) and the Transformer (a deep learning architecture based on self-attention) have become the mainstream approach. However, despite advances in feature extraction, the prior art faces the following serious challenges in complex scenes, which limit further improvement in detection accuracy. Lack of collaborative modeling of difference features and background information: existing difference-based (Diff-based) methods typically difference the bi-temporal features directly to highlight change regions. This focuses on where changes occur but ignores where they do not. Lacking explicit modeling of background consistency features and an interaction-suppression mechanism between difference and background features, the model struggles to filter environmental noise using background information, so false alarms arise readily in complex scenes. Visual features mixed with semantic information, lacking fine-grained semantic guidance: the visual features extracted by existing methods generally mix lower-level spatial texture information with higher-level semantic information.
Without external, purely semantic guidance, the network struggles to distinguish visual differences from semantic changes. The prior art finds it difficult to generate a pixel-level semantic confidence distribution as prior knowledge, so refined change perception cannot be achieved at the semantic level, and false detections and missed detections readily coexist. Insufficient multi-modal fusion depth: although some recent work introduces vision-language pre-trained models to assist detection, the fusion mostly remains at shallow interactions such as feature concatenation. The prior art lacks a mechanism for aligning and deeply fusing text semantic features with visual features at the multi-scale levels of a feature pyramid, so the complementary effect of multi-modal information at different resolutions is limited and fine change regions are hard to capture accurately.

Disclosure of Invention

In view of this, the application provides a change detection method for remote sensing images, which introduces text prior knowledge with a CLIP-ViT pre-trained model, performs explicit separation and cross fusion of difference and consistency features with a dual-flow attention module (DAM) to achieve effective separation of semantic and spatial information, and performs multi-scale fusion and decoding on the collaboratively modeled enhanced features, thereby achieving high-precision change detection.
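The pixel-text similarity score maps that carry this text prior (claims 1 and 3) amount to a per-pixel cosine similarity between image-encoder features and text-encoder features. A minimal sketch, with tensor layouts assumed (the patent gives only the cosine-similarity relation, not shapes):

```python
import torch
import torch.nn.functional as F

def pixel_text_similarity(pix: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between per-pixel features and text embeddings.

    pix: (B, C, H, W) underlying pixel features from the image encoder.
    txt: (K, C) text feature matrix for K land-cover classes.
    Returns a (B, K, H, W) pixel-text similarity score map, one channel
    per class, each value in [-1, 1].
    """
    pix = F.normalize(pix, dim=1)   # unit-normalize along the channel axis
    txt = F.normalize(txt, dim=1)   # unit-normalize each text embedding
    # Dot product of unit vectors = cosine similarity, per pixel per class.
    return torch.einsum("bchw,kc->bkhw", pix, txt)
```

Taking an argmax or softmax over the K channel then yields the "semantic-membership confidence distribution" of each pixel described in claim 2, step 13.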
The application discloses a change detection method for remote sensing images, comprising the following steps: Step 1: from a first time-phase remote sensing image and a second time-phase remote sensing image, obtain a first time-phase underlying pixel feature and a second time-phase underlying pixel feature; obtain a first pixel-text similarity score map and a second pixel-text similarity score map from a first text prompt lexicon and a second text prompt lexicon, wherein the first and second text prompt lexicons are text description sets constructed from preset remote sensing land-cover categories and represent the semantic category information of the land cover, and the first and second time-phase remote sensing images form a bi-temporal image pair. Step 2: obtain an enhanced visual feature from the first and second time-phase underlying pixel features. Step 3: obtain a semantic guidance feature from the first and second pixel-text similarity score maps, and obtain a total loss function for optimizing the semantic guidance feature. Step 4: obtain fusion features at multiple levels from the enhanced visual feature and the semantic guidance feature. Step 5: perform pixel-level binary classification on the fusion features of a plur