CN-121982296-A - Remote sensing image map segmentation method based on mask self-encoder


Abstract

The invention discloses a remote sensing image map segmentation method based on a mask self-encoder, and relates to the technical field of computer vision and remote sensing images. The method uses an encoder, a decoder and an occupancy prediction head. The encoder extracts multi-level semantic features of an input remote sensing image. The decoder recovers the original image block content of masked areas from the partially observed features. The occupancy prediction head is deployed on a shallow branch of the encoder, predicts whether each image block corresponds to a truly existing ground object entity, and outputs an occupancy confidence map. The confidence map serves both as a supervision signal for training the occupancy perception module and as prior knowledge for dynamically adjusting the subsequent mask strategy and the reconstruction weight distribution, achieving 'perception-masking-reconstruction' closed-loop optimization. The invention addresses the problems that existing remote sensing semantic segmentation models depend heavily on large amounts of manually annotated data during training, that the masking strategy in the self-supervised pre-training stage lacks semantic awareness, and that the existence of ground objects is ignored during reconstruction.

Inventors

WEI ZHIHAO

Assignees

Beijing University of Technology (北京工业大学)

Dates

Publication Date: 20260505
Application Date: 20251110

Claims (4)

1. A remote sensing image map segmentation method based on a mask self-encoder, characterized by comprising an encoder based on a Transformer structure, a decoder for reconstructing image content, and a lightweight occupancy prediction head, wherein the encoder is used for extracting multi-level semantic features of an input remote sensing image; the decoder is used for recovering the original image block content of masked areas from partially observed features; the occupancy prediction head is arranged on a shallow branch of the encoder, predicts whether each image block corresponds to a truly existing ground object entity, and outputs an occupancy confidence map with values between 0 and 1; the confidence map is used both as a supervision signal for training the occupancy perception module and as prior knowledge for dynamically adjusting the subsequent mask strategy and the reconstruction weight distribution, realizing 'perception-masking-reconstruction' closed-loop optimization; and the method comprises the following specific steps:
S1, dividing a high-resolution remote sensing image into regular image blocks, and forming an input sequence after linear embedding;
S2, sending the input sequence to the encoder for feature extraction, and obtaining the occupancy confidence of each image block, generating an occupancy confidence map;
S3, executing a semantic-aware non-uniform mask strategy based on the obtained occupancy confidence map, simulating the occlusion phenomena common in real remote sensing scenes, so that in the pre-training stage the model learns to infer missing ground object structures from context information;
S4, after the non-uniform masking operation is completed, sending the unmasked image blocks into the deep layers of the encoder for further computation, and obtaining a global context feature representation;
S5, reconstructing the original pixel values or high-level features of all masked areas with an independent decoder, while the occupancy prediction head is itself optimized with a binary cross-entropy loss, the total loss being a weighted sum of the reconstruction loss and the occupancy prediction loss, thereby realizing multi-task joint training;
S6, after self-supervised pre-training is completed, transferring the encoder backbone to a downstream semantic segmentation task, removing the decoder and the occupancy prediction head, attaching a lightweight segmentation head to the output of the encoder, and fine-tuning on a small amount of labeled data, thereby achieving pixel-level classification of road, building and vegetation ground objects and finally outputting a structured remote sensing map.
2. The remote sensing image map segmentation method based on the mask self-encoder according to claim 1, characterized in that a non-uniform mask strategy is adopted, wherein image blocks with high occupancy confidence are given a high masking probability and the masking probability is reduced for low-confidence areas; the image blocks with high occupancy confidence are salient ground object areas, which include buildings or roads, and the low-confidence areas include open ground, water edges or noise areas.
3. The remote sensing image map segmentation method based on the mask self-encoder according to claim 2, characterized in that in step S5 an occupancy-aware weighting mechanism is introduced into the reconstruction loss calculation, wherein masked regions with high occupancy confidence are given a larger weight in the loss function and regions with low occupancy confidence are given a reduced weight; an occupancy-weighted L1 loss function is adopted, in whose expression the occupancy confidence of the i-th image block weights the difference between the reconstruction result and the original embedding.
4. The remote sensing image map segmentation method based on the mask self-encoder according to claim 1, characterized in that an attention mechanism module, the SE Block, is introduced into key skip-connection paths and is divided into two steps, Squeeze and Excitation; in the Squeeze step, global information is compressed into a channel descriptor, the two-dimensional feature map of each channel being converted into a single value through global average pooling (GAP); in the Excitation step, channel attention weights are generated from the compressed channel descriptors by a multi-layer perceptron (MLP) comprising a hidden layer and an output layer, in which the activation function is ReLU and there are two parameter matrices whose dimensions depend on the number of channels of the input features and the desired reduction ratio, the number of hidden-layer nodes usually being the number of input channels divided by the reduction ratio, which reduces computational complexity and improves efficiency; finally, the resulting channel attention weights are used to re-weight each channel of the original feature map, so that important channels receive more attention while less relevant channels are suppressed.
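For illustration, the mechanisms recited in claims 2 to 4 can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the patented implementation. The function names `sample_mask`, `occupancy_weighted_l1` and `se_block` are hypothetical, as are the default mask ratio, the random-key weighted sampling scheme, the normalization of the loss by the sum of the weights, and the sigmoid on the SE output layer; the claims do not fix any of these.

```python
import math
import random


def sample_mask(confidence, mask_ratio=0.5, seed=0):
    """Claim 2 (sketch): choose patches to mask, favouring high occupancy confidence.

    Assumption: weighted sampling without replacement via random keys
    u ** (1 / weight); the claims do not specify a sampling scheme.
    """
    rng = random.Random(seed)
    n_mask = round(mask_ratio * len(confidence))
    eps = 1e-6  # keeps zero-confidence patches selectable with tiny probability
    keys = [(rng.random() ** (1.0 / (c + eps)), i) for i, c in enumerate(confidence)]
    top = sorted(keys, reverse=True)[:n_mask]  # patches with the largest keys are masked
    return sorted(i for _, i in top)


def occupancy_weighted_l1(recon, target, confidence, masked):
    """Claim 3 (sketch): L1 loss on masked patches, weighted by occupancy confidence.

    Assumption: the loss is normalised by the sum of the weights.
    """
    total, weight_sum = 0.0, 0.0
    for i in masked:
        # Mean absolute error between reconstructed and original patch i.
        l1 = sum(abs(r - t) for r, t in zip(recon[i], target[i])) / len(target[i])
        total += confidence[i] * l1
        weight_sum += confidence[i]
    return total / max(weight_sum, 1e-8)


def se_block(feature_map, w1, b1, w2, b2):
    """Claim 4 (sketch): Squeeze-and-Excitation on a [C][H][W] nested list.

    w1 has shape [C // r][C] and w2 has shape [C][C // r] for reduction ratio r.
    Assumption: a sigmoid on the output layer keeps the weights in (0, 1).
    """
    # Squeeze: global average pooling turns each channel into one value.
    z = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: hidden layer with ReLU, then output layer with sigmoid.
    hidden = [max(0.0, sum(w * v for w, v in zip(row, z)) + b)
              for row, b in zip(w1, b1)]
    scale = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
             for row, b in zip(w2, b2)]
    # Re-weight every channel of the original feature map.
    return [[[v * s for v in row] for row in ch] for ch, s in zip(feature_map, scale)]
```

In this sketch, `sample_mask` would be called on the occupancy confidence map from step S2, and its output passed as `masked` to `occupancy_weighted_l1` in step S5. A real system would operate on tensors in a deep learning framework rather than on nested lists.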

Description

Remote sensing image map segmentation method based on mask self-encoder

Technical Field

The invention belongs to the technical field of computer vision and remote sensing images, and particularly relates to a remote sensing image map segmentation method based on a mask self-encoder. The method is suitable for ground object classification and fine segmentation of high-resolution remote sensing images on resource-limited platforms (such as unmanned aerial vehicles, mobile terminals and satellite-borne processing systems).

Background

Remote sensing image map segmentation is a key technology in computer vision and Geographic Information Systems (GIS), and is widely applied in land use classification, urban planning, environmental monitoring, disaster assessment, high-precision map construction for autonomous driving, and similar fields. Its core task is to accurately identify and segment semantic ground object categories such as roads, buildings, vegetation and water bodies from high-resolution remote sensing images and to generate structured geospatial information.

Traditional methods mainly combine manual feature extraction with a pixel-level classifier, such as a Support Vector Machine (SVM) or a random forest, but they generalize poorly in complex scenes and struggle with challenges such as illumination changes, shadow occlusion and the diversity of ground objects.

With the development of deep learning, Convolutional Neural Networks (CNNs) have been widely applied to remote sensing image segmentation and have markedly improved segmentation accuracy. Semantic segmentation models represented by U-Net, DeepLab and PSPNet realize multi-scale feature fusion and context modeling through an encoder-decoder structure, and achieve good performance on various remote sensing data sets.

However, these methods generally rely on large amounts of high-quality labeled data for supervised training, and pixel-level labeling of remote sensing images is costly and slow, which severely restricts the scalability and practical application of the models.

In recent years, self-supervised learning (Self-Supervised Learning, SSL) has become an important research direction for alleviating the dependence on annotation. The mask self-encoder (Masked Autoencoder, MAE) is an emerging visual self-supervision framework that can learn rich semantic and geometric features from unlabeled data by randomly masking local areas of an input image and training a network to reconstruct the original content. It has shown strong representation learning ability on natural images, but is still at an exploratory stage in remote sensing image processing. Existing methods mostly adopt the mask strategy of natural images directly, with uniform or random masking, and do not account for characteristics of remote sensing images such as uneven ground object distribution, large scale differences and frequent occlusion, so masked areas may contain key semantic information that cannot be modeled effectively.

More importantly, most current MAE methods rely solely on the decoder for pixel-level reconstruction after masking, and lack a semantic perception mechanism for deciding "which areas should be masked preferentially" or "whether the masked areas actually contain valid content". In other words, the model cannot determine whether a given area should contain a ground object that is occluded (for example by cloud cover or building shadows) or is simply blank background.

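The uniform random masking described above can be sketched as follows. This is a minimal illustration in plain Python, not code from the patent; the function name `random_mask` and the default mask ratio of 0.75 are assumptions made for the example. Every image block has the same chance of being masked, regardless of whether it contains a ground object.

```python
import random


def random_mask(num_patches, mask_ratio=0.75, seed=0):
    """Split patch indices into visible and masked sets, uniformly at random.

    Hypothetical helper: no patch is preferred over any other, which is
    the semantics-agnostic behaviour criticised in the text.
    """
    rng = random.Random(seed)
    order = list(range(num_patches))
    rng.shuffle(order)  # random permutation of patch indices
    n_mask = int(num_patches * mask_ratio)
    masked = sorted(order[:n_mask])   # hidden from the encoder
    visible = sorted(order[n_mask:])  # passed to the encoder
    return visible, masked
```

Because the selection ignores image content, a block of open water and a block containing a building are masked with equal probability.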
This mode of blind masking and generic reconstruction limits the model's ability to reason about complex occlusion scenes, and also weakens the transfer of features learned in the pre-training stage to the downstream segmentation task.

To improve the model's ability to understand spatial integrity, some studies have attempted to introduce geometric or structural priors, but these have not yet been effectively combined with an explicit occupancy state prediction mechanism. The term "occupancy" refers to whether a local area of an image corresponds to a truly existing ground object entity. If an occupancy prediction head (Occupancy Head) can be introduced into the masking process to estimate the semantic existence probability of each image block, intelligent masking can be implemented: for example, preferentially masking high-confidence known ground object regions to strengthen the model's reasoning ability, or preserving low-confidence regions to verify the model's completion ability. Meanwhile, using the occupancy prediction as a reconstruction weight in the decoding stage guides the model to focus on recovering truly existing occluded structures rather than fitting noise or blank areas.

Therefore, the prior art still faces the problems that the masking strate