
CN-122024034-A - Monocular remote sensing image elevation estimation method based on local-global feature fusion

CN122024034A

Abstract

The invention discloses a monocular remote sensing image elevation estimation method based on local-global feature fusion. The method first constructs an elevation estimation model with a dual-branch encoder-decoder architecture: the dual-branch encoder uses a local branch to extract local features of the input image and a global branch to extract global features; a fusion module fuses the local and global features and outputs fused features; and the decoder decodes the fused features to complete the elevation estimation. The elevation estimation model is then trained, the trained model performs elevation estimation on the monocular remote sensing image to be tested, and the corresponding DSM (digital surface model) image is output.

Inventors

  • PAN HAIYAN
  • WANG JING
  • YANG SHUHU
  • HONG ZHONGHUA
  • YU HAO
  • ZHOU RUYAN
  • TAO JIANG
  • JIANG CHENCHEN
  • PENG BO
  • ZHANG YUN
  • HAN YANLING

Assignees

  • 上海海洋大学 (Shanghai Ocean University)

Dates

Publication Date
2026-05-12
Application Date
2025-12-10

Claims (6)

  1. A monocular remote sensing image elevation estimation method based on local-global feature fusion, characterized by comprising the following steps. Step 1: construct an elevation estimation model. The elevation estimation model adopts a dual-branch encoder-decoder architecture: the dual-branch encoder uses a local branch to extract local features of the input image and a global branch to extract global features of the input image, and a fusion module fuses the local features and the global features and outputs fused features. Step 2: train the elevation estimation model, perform elevation estimation on the monocular remote sensing image to be tested using the trained elevation estimation model, and output the corresponding DSM image.
  2. The monocular remote sensing image elevation estimation method based on local-global feature fusion according to claim 1, wherein the fusion module first concatenates the local features and the global features to generate joint features, then applies global average pooling to the joint features to compute a channel-level descriptor, feeds the descriptor into a multi-layer perceptron (MLP) network to generate local weights and global weights, and finally adaptively fuses the local features and the global features with the local and global weights to obtain the fused features.
  3. The monocular remote sensing image elevation estimation method based on local-global feature fusion according to claim 2, wherein the local features F_l and the global features F_g are adaptively fused to obtain the fused features F_fuse by the formula F_fuse = w_g ⊙ F_g + w_l ⊙ F_l, wherein w_g represents the global weight and w_l represents the local weight.
  4. The monocular remote sensing image elevation estimation method based on local-global feature fusion according to claim 1, wherein the local branch is constructed with a CNN network; the CNN network is built by stacking hierarchical convolution modules, each convolution module comprising two sequentially connected convolution layers paired with Batch Normalization and a ReLU activation function, and finally outputs, after dimension reduction by max pooling, local features with the same dimensions as the global features. The global branch is arranged as a multi-layer cascade structure in which each layer comprises a VSS block module, and every layer except the last is followed, after its VSS block module, by a Patch Merging module that performs the downsampling operation.
  5. The monocular remote sensing image elevation estimation method based on local-global feature fusion according to claim 4, wherein the decoder is arranged with the same hierarchy as the global branch; each layer comprises a VSS block module, and every layer except the first is followed, after its VSS block module, by a Patch Expanding module that performs the upsampling operation; the output features of each layer are fused with the same-size output features of the corresponding global-branch layer to serve as the input features of the next layer.
  6. The monocular remote sensing image elevation estimation method based on local-global feature fusion according to claim 1, wherein the overall loss function of the elevation estimation model is calculated as L = λ₁·L_L1 + λ₂·L_grad, with L_L1 = (1/N) Σᵢ √((dᵢ − d̂ᵢ)² + ε²) and L_grad = Σ_{s=1}^{S} (1/N_s) Σᵢ |∇_s dᵢ − ∇_s d̂ᵢ|, wherein λ₁ and λ₂ both represent hyperparameters, d̂ represents the estimated image output by the elevation estimation model, d represents the input ground-truth image, ε represents a numerical-stability constant, N represents the number of pixels in the real/estimated image used to compute the L1 loss, N_s represents the number of pixels in the gradient map in the s-th direction/scale, and S represents the total number of directions/scales.
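A loss of the kind described in claim 6 (an ε-smoothed L1 term plus a multi-scale gradient term, weighted by two hyperparameters) can be sketched in PyTorch as follows. This is a hedged sketch, not the patent's implementation: the Charbonnier-style smoothing, the finite-difference gradients, and the pooling-based scales `(1, 2, 4)` are illustrative assumptions filling in details the claim leaves symbolic.

```python
import torch
import torch.nn.functional as F


def elevation_loss(pred, target, lam1=1.0, lam2=1.0, eps=1e-3, scales=(1, 2, 4)):
    """Composite loss sketch: lam1 * L_L1 + lam2 * L_grad.

    lam1/lam2 are the two hyperparameters, eps the numerical-stability
    constant; pred/target are (B, 1, H, W) estimated and ground-truth DSMs.
    """
    # Pixel-wise L1 term, smoothed by eps for numerical stability
    l1 = torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

    # Gradient term over S scales: compare x/y finite differences of the
    # (down-sampled) estimated and ground-truth elevation maps
    grad = pred.new_zeros(())
    for s in scales:
        p = F.avg_pool2d(pred, s) if s > 1 else pred
        t = F.avg_pool2d(target, s) if s > 1 else target
        dx = (p[..., :, 1:] - p[..., :, :-1]) - (t[..., :, 1:] - t[..., :, :-1])
        dy = (p[..., 1:, :] - p[..., :-1, :]) - (t[..., 1:, :] - t[..., :-1, :])
        grad = grad + dx.abs().mean() + dy.abs().mean()

    return lam1 * l1 + lam2 * grad
```

With identical inputs the gradient term vanishes and the loss reduces to the ε floor of the smoothed L1 term, which is one quick sanity check on an implementation.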

Description

Monocular remote sensing image elevation estimation method based on local-global feature fusion

Technical Field

The invention belongs to the technical field of image processing, and particularly relates to a monocular remote sensing image elevation estimation method based on local-global feature fusion.

Background

Elevation (DSM) information is critical in scenarios such as urban modeling and disaster assessment. It is conventionally obtained by multi-view/multi-source methods such as LiDAR or stereo photogrammetry, but the cost and acquisition constraints of those methods make direct regression of elevation from a single optical image an important direction: monocular regression markedly reduces data cost and is suitable for large-scale, high-frequency mapping. In the absence of multi-view geometry, early monocular schemes typically estimated height from geometric and physical priors (such as occlusion analysis), or used shallow regression/statistical learning. With the rise of deep learning, CNN-based encoder-decoders (e.g., U-Net) became mainstream, improving the capture of local detail through skip connections, while Transformers and their variants strengthened global context modeling through the self-attention mechanism. To exploit the respective strengths of the two architectures, hybrid architectures have emerged, and existing hybrids mainly adopt two fusion strategies to coordinate the CNN and the Transformer. The first is a serial fusion scheme, whose implementation path is to extract low-level detail features with a CNN, aggregate them globally with a Transformer (or a variant thereof), and finally feed them into the decoder for reconstruction.
The scheme is simple to implement, but has a marked information-bottleneck problem: the low-level details captured by the CNN are easily drowned out or smoothed in the subsequent global aggregation, so edges and high-frequency structure are difficult to recover completely in the decoding stage. The second is a static parallel fusion scheme, whose implementation path is to run the CNN branch and the Transformer branch in parallel, combine their outputs at the bottleneck according to fixed rules (concatenation or point-wise addition), and then decode. Compared with the serial scheme, this preserves both streams of information, and its validity has been verified by representative works such as DepthFormer and PCTDepth. However, the core defect of this fusion mode is that it is fixed and spatially independent: whether in a large-roof/open area or a sharp-boundary/texture-dense area, the same fusion ratio is applied, and the weighting between local and global information cannot be adapted to the local content, a typical "one-size-fits-all" problem.

Disclosure of Invention

Aiming at the technical problems that local-detail fidelity and global structural consistency are difficult to reconcile in single-view remote sensing elevation estimation, that existing diffusion/large models carry high parameter counts and computational cost, and that a single encoder falls short in either global or local modeling, the invention provides a monocular remote sensing image elevation estimation method based on local-global feature fusion.
In order to achieve the above purpose, the present invention provides the following technical solution: a monocular remote sensing image elevation estimation method based on local-global feature fusion, comprising the following steps. Step 1: construct an elevation estimation model. The elevation estimation model adopts a dual-branch encoder-decoder architecture: the dual-branch encoder uses a local branch to extract local features of the input image and a global branch to extract global features of the input image, and a fusion module fuses the local and global features and outputs fused features. Step 2: train the elevation estimation model, perform elevation estimation on the monocular remote sensing image to be tested using the trained elevation estimation model, and output the corresponding DSM image. Furthermore, the fusion module first concatenates the local and global features to generate joint features, then applies global average pooling to the joint features to compute a channel-level descriptor, feeds the descriptor into a multi-layer perceptron (MLP) network to generate local weights and global weights, and finally adaptively fuses the local and global features with those weights to obtain the fused features.
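The fusion step just described (concatenate, global-average-pool to a channel descriptor, MLP to produce local/global weights, weighted blend) can be sketched in PyTorch roughly as follows. The module name, the reduction ratio, and the sigmoid gating are illustrative assumptions; the patent text does not fix these details.

```python
import torch
import torch.nn as nn


class AdaptiveFusion(nn.Module):
    """Adaptive local-global feature fusion sketch.

    Concatenates local and global feature maps, squeezes the joint map to a
    channel-level descriptor via global average pooling, maps the descriptor
    through an MLP to per-channel local/global weights, and blends the two
    feature maps with those weights.
    """

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.channels = channels
        # MLP over the pooled 2C descriptor -> 2C weights (C local, C global)
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, 2 * channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(2 * channels // reduction, 2 * channels),
            nn.Sigmoid(),  # keep each weight in (0, 1); a softmax pair is another option
        )

    def forward(self, f_local: torch.Tensor, f_global: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([f_local, f_global], dim=1)      # (B, 2C, H, W)
        desc = joint.mean(dim=(2, 3))                      # global average pool -> (B, 2C)
        w = self.mlp(desc)                                 # (B, 2C)
        w_local, w_global = w.split(self.channels, dim=1)  # (B, C) each
        w_local = w_local[:, :, None, None]                # broadcast over H, W
        w_global = w_global[:, :, None, None]
        return w_local * f_local + w_global * f_global     # fused features
```

Because the weights are computed per channel from pooled statistics, the blend adapts to image content, which is the property the invention contrasts against fixed concatenation or point-wise addition.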