CN-121789066-B - Light-weight remote sensing geographic image change detection method and system based on Transformer
Abstract
The invention relates to the technical field of image processing and discloses a Transformer-based lightweight remote sensing geographic image change detection method and system. A first image and a second image of an image pair to be detected are respectively input into two parallel, weight-shared coding branches to obtain a plurality of initial feature maps of different scales for each image; feature enhancement is then performed to obtain a plurality of enhanced feature maps of different scales for each image. The two enhanced feature maps of the same scale from the first and second images are fused to obtain a preliminary fusion feature map for each scale, and the preliminary fusion feature map with the lowest resolution yields a preliminary prediction map. The first and second images are concatenated and then fused to obtain a bottom-layer feature map. The preliminary fusion feature maps of the respective scales and the bottom-layer feature map are fused in sequence while being weighted by the preliminary prediction map to obtain a target fusion map, on which classification prediction is performed to obtain a change region prediction map.
Inventors
- YANG YICHEN
- NI JINGEN
- ZHU ZICONG
Assignees
- Soochow University (苏州大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20260305
Claims (10)
- 1. A Transformer-based lightweight remote sensing geographic image change detection method, characterized by comprising the following steps: respectively inputting a first image and a second image of an image pair to be detected into two parallel, weight-shared Transformer coding branches to obtain a plurality of initial feature maps of different scales for each image; performing frequency separation on the initial feature map of each scale of each image to obtain the high-frequency components, then performing a spatial attention operation, skip connection and depth-separable convolution to obtain a plurality of enhanced feature maps of different scales for each image, which comprises performing the spatial attention operation on the high-frequency component, skip-connecting with the initial feature map, and convolving to output a high-frequency intermediate feature map; for the first image and the second image, concatenating the two enhanced feature maps of the same scale along the channel dimension to obtain a plurality of concatenated feature maps, and fusing each of them to obtain the preliminary fusion feature map of each scale; inputting the preliminary fusion feature map with the lowest resolution into a multi-layer perceptron (MLP) to obtain a preliminary prediction map; concatenating and then fusing the first image and the second image to obtain a bottom-layer feature map; fusing the preliminary fusion feature maps of the respective scales with the bottom-layer feature map in order of resolution from low to high while weighting with the preliminary prediction map to obtain a target fusion map, which comprises: upsampling the third-scale preliminary fusion feature map to the size of the second-scale preliminary fusion feature map, fusing the two, and weighting with the preliminary prediction map to obtain a first intermediate feature map; upsampling the first intermediate feature map to the size of the first-scale preliminary fusion feature map, fusing the two, and weighting with the preliminary prediction map to obtain a second intermediate feature map; and upsampling the second intermediate feature map to the size of the bottom-layer feature map and weighting with the preliminary prediction map to obtain the target fusion map; and performing classification prediction on the target fusion map to obtain a change region prediction map.
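The overall flow of claim 1 can be illustrated with a minimal NumPy sketch. Every learned sub-module (coding branch, enhancement, fusion network, MLP head) is replaced here by a shape-preserving stand-in (average pooling, identity, channel mean, nearest-neighbour upsampling), so the sketch only demonstrates the data flow and the resolution hierarchy of the claimed pipeline, not the actual trained network.

```python
import numpy as np

def encode(img):
    """Stand-in for a weight-shared coding branch: three scales via 2x2 average pooling."""
    pool = lambda f: f.reshape(f.shape[0], f.shape[1] // 2, 2, f.shape[2] // 2, 2).mean(axis=(2, 4))
    f1 = pool(img); f2 = pool(f1); f3 = pool(f2)
    return [f1, f2, f3]                                  # scales 1..3, high -> low resolution

def enhance(f):
    """Placeholder for frequency separation + spatial/channel attention enhancement."""
    return f

def fuse_pair(fa, fb):
    """Channel-dimension concat followed by a fusion step (channel mean stands in for the FFN)."""
    return np.concatenate([fa, fb], axis=0).mean(axis=0, keepdims=True)

def upsample2(f):
    """Nearest-neighbour stand-in for bilinear upsampling."""
    return f.repeat(2, axis=1).repeat(2, axis=2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def change_detect(img_a, img_b):
    fa = [enhance(f) for f in encode(img_a)]
    fb = [enhance(f) for f in encode(img_b)]
    fused = [fuse_pair(a, b) for a, b in zip(fa, fb)]    # per-scale preliminary fusion maps
    prelim = sigmoid(fused[2])                           # preliminary prediction map (MLP stand-in)
    bottom = fuse_pair(img_a, img_b)                     # full-resolution bottom-layer features
    f, p = fused[2], prelim
    for nxt in (fused[1], fused[0], bottom):             # low -> high resolution
        f, p = upsample2(f), upsample2(p)
        f = fuse_pair(f, nxt) * p                        # fuse, then weight by the preliminary map
    return sigmoid(f) > 0.5                              # classification -> change region map

rng = np.random.default_rng(0)
pred = change_detect(rng.standard_normal((3, 32, 32)), rng.standard_normal((3, 32, 32)))
```

The sketch makes visible why the preliminary prediction map must be upsampled alongside the features: it is produced at the lowest resolution but weights every fusion step up to full resolution.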
- 2. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 1, characterized in that inputting an image into the Transformer coding branch comprises the following steps: passing the image through a preset number of stacked feed-forward network layers connected in series along the forward propagation direction to obtain a first-scale initial feature map; passing the first-scale initial feature map through a preset number of stacked feed-forward network layers connected in series along the forward propagation direction to obtain a second-scale initial feature map; and passing the second-scale initial feature map through a preset number of stacked depth-separable convolution attention blocks and a preset number of stacked feed-forward network layers connected in series along the forward propagation direction to obtain a third-scale initial feature map.
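The three-stage scale hierarchy of claim 2 can be sketched as follows. The stride-2 downsampling (claim 3) and the stacked feed-forward/attention blocks are replaced by average pooling and a toy residual block respectively, so only the stage structure and feature-map sizes are illustrative.

```python
import numpy as np

def stride2_down(x):
    """Stand-in for the stride-2 convolution: 2x2 average pooling halves H and W."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def encoder_branch(img, n_blocks=2):
    """Three stages; each downsamples by 2 and applies `n_blocks` stacked blocks
    (a toy ReLU residual standing in for the FFN / attention blocks), yielding
    the first-, second- and third-scale initial feature maps."""
    block = lambda f: f + 0.1 * np.maximum(f, 0.0)
    feats, f = [], img
    for _ in range(3):
        f = stride2_down(f)
        for _ in range(n_blocks):
            f = block(f)
        feats.append(f)
    return feats

f1, f2, f3 = encoder_branch(np.zeros((3, 32, 32)))
```

Each stage halves the spatial resolution, so a 32×32 input yields 16×16, 8×8 and 4×4 initial feature maps.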
- 3. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 2, characterized in that the initial feature maps of the plurality of different scales for each image are expressed as: [formulas omitted]; wherein the symbols (omitted) respectively denote: the first image and the second image of the image pair to be detected; the first-scale, second-scale and third-scale initial feature maps of an image of the pair to be detected; the preset number of stacked layers; a feed-forward network, and a preset number of feed-forward network layers connected in series along the forward propagation direction; a convolution layer with a kernel size of 3 and a stride of 2; and a depth-separable convolution attention block, and a preset number of depth-separable convolution attention blocks connected in series along the forward propagation direction.
- 4. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 3, characterized in that the depth-separable convolution attention block is implemented by the following steps: applying depth-separable convolution attention to the input features and adding the result to the original input features to obtain intermediate features; passing the intermediate features through a feed-forward network and adding the result to the original intermediate features to obtain the output features of the depth-separable convolution attention block; wherein the output of the input features after the depth-separable convolution attention is expressed as: [formula omitted]; and wherein the symbols (omitted) respectively denote a linear mapping, bilinear interpolation upsampling, and a multi-head attention operation; the depth-separable convolution unit downsamples the input features to obtain the query Q, key K and value V required by the multi-head attention MHSA.
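Under the reading of claim 4 in which the depth-separable convolution unit downsamples the tokens before attention and bilinear interpolation restores the resolution afterwards, the block admits a single-head NumPy sketch. Average pooling stands in for the depth-separable convolution, nearest-neighbour repetition for bilinear upsampling, and all weight matrices are random illustrative values, not the patented parameters.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dsc_attention_block(x, wq, wk, wv, wo, w_ffn1, w_ffn2, r=2):
    """Sketch of the claimed block: tokens downsampled by factor r for Q/K/V,
    single-head attention, upsampling back, then two residual connections
    wrapping the attention and the feed-forward network."""
    c, h, w = x.shape
    ds = x.reshape(c, h // r, r, w // r, r).mean(axis=(2, 4))   # DSConv stand-in
    t = ds.reshape(c, -1).T                                     # (hw/r^2, C) tokens
    q, k, v = t @ wq, t @ wk, t @ wv
    att = softmax(q @ k.T / np.sqrt(c)) @ v                     # scaled dot-product attention
    att = (att @ wo).T.reshape(c, h // r, w // r)               # linear mapping
    up = att.repeat(r, axis=1).repeat(r, axis=2)                # bilinear-upsampling stand-in
    x = x + up                                                  # residual 1
    t2 = x.reshape(c, h * w).T
    ffn = np.maximum(t2 @ w_ffn1, 0.0) @ w_ffn2                 # two-layer FFN
    return x + ffn.T.reshape(c, h, w)                           # residual 2

rng = np.random.default_rng(2)
c = 4
x = rng.standard_normal((c, 8, 8))
mats = [0.1 * rng.standard_normal((c, c)) for _ in range(4)]
w1, w2 = 0.1 * rng.standard_normal((c, 2 * c)), 0.1 * rng.standard_normal((2 * c, c))
y = dsc_attention_block(x, *mats, w1, w2)
```

Computing attention over the downsampled tokens reduces the quadratic cost by a factor of r⁴, which is the lightweight design intent stated in the background.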
- 5. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 1, characterized in that performing frequency separation on the initial feature map of each scale of each image to obtain the high-frequency components comprises the following steps: sequentially performing average pooling and bilinear interpolation on the initial feature map of any scale of any input image to obtain a low-frequency component, and subtracting the low-frequency component from the initial feature map to obtain the corresponding high-frequency component.
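The frequency separation of claim 5 is straightforward to sketch. For brevity this uses nearest-neighbour rather than bilinear upsampling of the pooled map, which still satisfies the defining identity low + high = input.

```python
import numpy as np

def frequency_separate(x, k=2):
    """Split an (H, W) feature map into low- and high-frequency parts.

    Low frequency: k x k average pooling followed by upsampling back to the
    input size (nearest-neighbour here, standing in for bilinear interpolation).
    High frequency: input minus low frequency. Assumes H, W divisible by k.
    """
    h, w = x.shape
    pooled = x.reshape(h // k, k, w // k, k).mean(axis=(1, 3))  # average pooling
    low = pooled.repeat(k, axis=0).repeat(k, axis=1)            # upsample to (H, W)
    high = x - low
    return low, high

x = np.arange(16, dtype=float).reshape(4, 4)
low, high = frequency_separate(x, k=2)
```

By construction a constant (pure low-frequency) input yields a zero high-frequency component, which is exactly the behaviour the subsequent spatial attention relies on.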
- 6. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 5, characterized in that performing the spatial attention operation on the high-frequency component, skip-connecting with the initial feature map, and convolving to output the high-frequency intermediate feature map comprises the following steps: performing the spatial attention operation on the high-frequency component to obtain a spatial attention feature map, expressed as: [formula omitted]; skip-connecting the spatial attention feature map with the initial feature map, and convolving to output the high-frequency intermediate feature map, expressed as: [formula omitted]; wherein the symbols (omitted) respectively denote the high-frequency component, the activation function, a convolution with a kernel size of 7, concatenation along the channel dimension, average pooling, maximum pooling, and a convolution with a kernel size of 3.
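The symbol list of claim 6 (channel-wise average and maximum pooling, concatenation, a 7×7 convolution, an activation) matches the familiar CBAM-style spatial attention. A naive NumPy sketch with a random illustrative kernel (not the patented weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, w):
    """Naive 'same'-padded cross-correlation: x is (C, H, W), w is (C, k, k) -> (H, W)."""
    c, h, ww = x.shape
    k = w.shape[-1]
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    out = np.zeros((h, ww))
    for i in range(h):
        for j in range(ww):
            out[i, j] = np.sum(xp[:, i:i + k, j:j + k] * w)
    return out

def spatial_attention(high, kernel):
    """Channel-wise average and max maps of the (C, H, W) high-frequency
    component are stacked, convolved (7x7), and squashed by a sigmoid to give
    an (H, W) attention map that reweights every channel."""
    avg = high.mean(axis=0, keepdims=True)   # (1, H, W)
    mx = high.max(axis=0, keepdims=True)     # (1, H, W)
    att = sigmoid(conv2d_same(np.concatenate([avg, mx], axis=0), kernel))
    return high * att                        # broadcast over channels

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 16, 16))
kern = 0.1 * rng.standard_normal((2, 7, 7))
out = spatial_attention(feat, kern)
```

Because the sigmoid output lies in (0, 1), the operation can only attenuate responses, never amplify them; the skip connection in the claim restores the un-attenuated signal path.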
- 7. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 6, characterized in that a channel attention operation is performed on the high-frequency intermediate feature map, the result is skip-connected with the high-frequency intermediate feature map, and the corresponding enhanced feature map is output through depth-separable convolution, expressed as follows: performing the channel attention operation on the high-frequency intermediate feature map to obtain a channel attention feature map, expressed as: [formula omitted]; skip-connecting the channel attention feature map with the high-frequency intermediate feature map and outputting the enhanced feature map through depth-separable convolution, expressed as: [formula omitted]; wherein the remaining symbol (omitted) denotes a multi-layer perceptron.
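Likewise, the channel attention of claim 7 (with its multi-layer perceptron symbol) reads as the CBAM-style variant: spatially pooled vectors pass through a shared two-layer MLP, are summed, and a sigmoid yields per-channel weights. A NumPy sketch with random illustrative weights:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w1, w2):
    """Spatial average- and max-pooled vectors of the (C, H, W) map go through
    a shared two-layer MLP (w1, w2), are summed, and a sigmoid yields the
    per-channel attention weights."""
    avg = x.mean(axis=(1, 2))                       # (C,)
    mx = x.max(axis=(1, 2))                         # (C,)
    mlp = lambda v: w2 @ np.maximum(w1 @ v, 0.0)    # shared MLP, ReLU hidden layer
    weights = sigmoid(mlp(avg) + mlp(mx))           # (C,)
    return x * weights[:, None, None]

rng = np.random.default_rng(3)
x = rng.standard_normal((8, 4, 4))
w1 = 0.1 * rng.standard_normal((2, 8))              # hidden layer with reduction ratio 4
w2 = 0.1 * rng.standard_normal((8, 2))
y = channel_attention(x, w1, w2)
```

The reduction ratio of the hidden layer (4 here) is an illustrative choice; the patent does not specify one.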
- 8. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 1, characterized in that, for the first image and the second image, the two enhanced feature maps of the same scale are concatenated along the channel dimension to obtain a plurality of concatenated feature maps, which are respectively fused to obtain the preliminary fusion feature map of each scale, expressed as: [formula omitted]; wherein the symbols (omitted) respectively denote the preliminary fusion feature map of a given scale, a feed-forward network, concatenation along the channel dimension, and the enhanced feature maps of the first image and the second image at that scale.
- 9. The Transformer-based lightweight remote sensing geographic image change detection method according to claim 8, characterized in that: the third-scale preliminary fusion feature map is upsampled to the same size as the second-scale preliminary fusion feature map, the two are fused, and the result is weighted with the preliminary prediction map to obtain a first intermediate feature map, expressed as: [formula omitted]; the first intermediate feature map is upsampled to the same size as the first-scale preliminary fusion feature map, the two are fused, and the result is weighted with the preliminary prediction map to obtain a second intermediate feature map, expressed as: [formula omitted]; the second intermediate feature map is upsampled to the same size as the bottom-layer feature map and weighted with the preliminary prediction map to obtain the target fusion map, expressed as: [formula omitted]; wherein the symbols (omitted) respectively denote a feed-forward network, concatenation along the channel dimension, and bilinear interpolation.
- 10. A system based on the Transformer-based lightweight remote sensing geographic image change detection method according to any one of claims 1 to 9, comprising: a feature coding module, configured to respectively input a first image and a second image of an image pair to be detected into two parallel, weight-shared Transformer coding branches to obtain a plurality of initial feature maps of different scales for each image; a feature enhancement module, configured to perform frequency separation on the initial feature map of each scale of each image to obtain the high-frequency components, and to perform a spatial attention operation, skip connection and depth-separable convolution to obtain a plurality of enhanced feature maps of different scales for each image; a feature fusion module, configured to concatenate, for the first image and the second image, the two enhanced feature maps of the same scale along the channel dimension to obtain a plurality of concatenated feature maps, and to fuse them respectively to obtain the preliminary fusion feature map of each scale; a preliminary fusion module, configured to input the preliminary fusion feature map with the lowest resolution into a multi-layer perceptron (MLP) to obtain a preliminary prediction map; a bottom-layer feature acquisition module, configured to concatenate and then fuse the first image and the second image to obtain a bottom-layer feature map; a weighted fusion module, configured to fuse the preliminary fusion feature maps of the respective scales with the bottom-layer feature map in order of resolution from low to high while weighting with the preliminary prediction map to obtain a target fusion map; and a prediction module, configured to perform classification prediction on the target fusion map to obtain a change region prediction map.
Description
Light-weight remote sensing geographic image change detection method and system based on Transformer

Technical Field

The invention relates to the technical field of image processing, and in particular to a Transformer-based lightweight remote sensing geographic image change detection method and system.

Background

Change detection is an important task in remote sensing image analysis and is widely applied to land use/cover change monitoring, disaster assessment, urban expansion analysis and the like. Conventional change detection methods fall into two major classes: pixel-level and object-level. Pixel-level methods compare image pixels directly but are sensitive to noise and registration errors; object-level methods analyze segmented objects, which mitigates the drawbacks of pixel-level methods, but their performance depends heavily on segmentation quality. With the development of deep learning, change detection methods based on Convolutional Neural Networks (CNNs) have become mainstream; they extract multi-level features from bi-temporal image pairs and show accuracy and robustness superior to traditional methods. However, existing CNN-based approaches still have significant limitations. Since CNNs typically employ convolution kernels of fixed size, their receptive field is limited and it is difficult to model long-range contextual dependencies. In a change detection task, the bi-temporal images come from the same geographic region at different times, so effectively capturing the spatio-temporal dependency between the images is crucial for accurately distinguishing changed regions from unchanged regions. The deficiency of CNNs in global context modeling restricts their discrimination performance in complex scenes.
In recent years, the Transformer architecture has received a great deal of attention in computer vision due to its strong global context modeling capability. Its core self-attention mechanism can establish a connection between any two positions in a sequence, thereby effectively capturing global information. However, the computational complexity of global self-attention grows with the square of the input sequence length, which brings huge computational overhead and memory consumption when processing high-resolution remote sensing images, severely limiting the deployment feasibility of the model in practical application scenarios, particularly on edge devices with limited computing resources. Therefore, how to develop a lightweight change detection network that combines strong long-range context modeling with efficient computation, fully exploiting the global modeling advantage of the Transformer while effectively controlling its computational cost in the change detection task, is a technical problem to be solved.

Disclosure of Invention

Accordingly, the invention aims to solve the technical problems of poor detection accuracy and low efficiency in the prior art, which cannot balance global image feature modeling against computing resource cost.
In order to solve the above technical problems, the invention provides a Transformer-based lightweight remote sensing geographic image change detection method, comprising the following steps: respectively inputting a first image and a second image of an image pair to be detected into two parallel, weight-shared Transformer coding branches to obtain a plurality of initial feature maps of different scales for each image; performing frequency separation on the initial feature map of each scale of each image to obtain the high-frequency components, and performing a spatial attention operation, skip connection and depth-separable convolution to obtain a plurality of enhanced feature maps of different scales for each image; for the first image and the second image, concatenating the two enhanced feature maps of the same scale along the channel dimension to obtain a plurality of concatenated feature maps, and fusing each of them to obtain the preliminary fusion feature map of each scale; inputting the preliminary fusion feature map with the lowest resolution into a multi-layer perceptron (MLP) to obtain a preliminary prediction map; concatenating and then fusing the first image and the second image to obtain a bottom-layer feature map; fusing the preliminary fusion feature maps of the respective scales with the bottom-layer feature map in order of resolution from low to high while weighting with the preliminary prediction map to obtain a target fusion map