CN-121999223-A - Semantic segmentation method and device for remote sensing image

Abstract

The application discloses a semantic segmentation method and device for remote sensing images, relating to the field of deep learning. Category names obtained from pixel-level semantic labels are input into a semantic segmentation model to obtain first optimized text features. The third-layer visual feature map of the encoder multi-layer visual feature map output by an image encoder is fused with the decoder fourth-layer visual feature map output by a first multi-scale visual state space block in an image decoder, yielding a first fused visual feature map. Cross-modal fusion is performed on the first fused visual feature map to obtain a semantically enhanced multi-modal enhanced feature map, which is input into the image decoder to obtain a semantic segmentation prediction map of a sample remote sensing image. The semantic segmentation model is trained on a calculated total loss, and a remote sensing image to be segmented is then input into the trained semantic segmentation model for semantic segmentation, thereby improving segmentation accuracy.
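The first step in the abstract, deriving category names from pixel-level semantic labels, can be illustrated with a minimal sketch. The id-to-name table `ID_TO_NAME`, the helper `category_names_from_labels`, and the toy label map are all hypothetical; the application does not list concrete classes or say how label ids map to names.

```python
# Hypothetical mapping from integer label ids to class names (illustrative only;
# the application does not specify concrete classes).
ID_TO_NAME = {0: "background", 1: "building", 2: "road", 3: "water", 4: "vegetation"}


def category_names_from_labels(label_map, id_to_name):
    """Return the names of the classes that occur in a 2-D pixel-level label map."""
    # Collect every distinct class id that appears in the label map.
    present_ids = {class_id for row in label_map for class_id in row}
    # Keep a stable order and skip ids without a known name.
    return [id_to_name[i] for i in sorted(present_ids) if i in id_to_name]


# Toy 3x4 label map standing in for the labels of one sample image.
label_map = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [3, 2, 2, 0],
]

names = category_names_from_labels(label_map, ID_TO_NAME)
print(names)
```

In this sketch only classes that actually occur in the labels are passed on to the text branch, which is one plausible reading of "category names obtained from pixel-level semantic labels".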

Inventors

Lin Yueni
Wang Xili

Assignees

Shaanxi Normal University (陕西师范大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-28

Claims (10)

1. A semantic segmentation method for a remote sensing image, characterized by comprising the following steps: acquiring a sample data set, wherein the sample data set comprises a sample remote sensing image and pixel-level semantic labels corresponding to the sample remote sensing image; acquiring at least one category name based on the pixel-level semantic labels to obtain a category name set; inputting the category names into a semantic segmentation model to obtain first optimized text features; inputting the sample remote sensing image into the semantic segmentation model to obtain an encoder multi-layer visual feature map output by an image encoder in the semantic segmentation model and a decoder fourth-layer visual feature map output by a first multi-scale visual state space block in an image decoder of the semantic segmentation model; performing feature fusion on a third-layer visual feature map in the encoder multi-layer visual feature map and the decoder fourth-layer visual feature map to obtain a first fused visual feature map; performing cross-modal fusion on the first fused visual feature map to obtain a semantically enhanced multi-modal enhanced feature map; inputting the semantically enhanced multi-modal enhanced feature map into the image decoder of the semantic segmentation model to obtain a semantic segmentation prediction map of the sample remote sensing image; calculating a total loss based on the semantic segmentation prediction map of the sample remote sensing image, the pixel-level semantic labels, the encoder multi-layer visual feature map and the first optimized text features; training the semantic segmentation model based on the total loss to obtain a trained semantic segmentation model; acquiring a remote sensing image to be segmented; and inputting the remote sensing image to be segmented into the trained semantic segmentation model to obtain a semantic segmentation prediction map of the remote sensing image to be segmented.
2. The semantic segmentation method for a remote sensing image according to claim 1, wherein inputting the category names into the semantic segmentation model to obtain the first optimized text features comprises: inputting the category names into a text encoder of the semantic segmentation model to obtain initial text features; and optimizing the initial text features through an implicit prompt optimization module in the semantic segmentation model to obtain the first optimized text features.
3. The semantic segmentation method for a remote sensing image according to claim 2, wherein optimizing the initial text features through the implicit prompt optimization module in the semantic segmentation model to obtain the first optimized text features comprises: randomly initializing a learnable matrix through the implicit prompt optimization module; projecting the learnable matrix into a target learnable matrix whose dimension is consistent with the dimension of the initial text features; fusing the initial text features and the target learnable matrix to obtain fused text features; and performing nonlinear transformation and adaptive re-weighting on the fused text features to obtain the first optimized text features.
4. The semantic segmentation method for a remote sensing image according to claim 1, wherein performing cross-modal fusion on the first fused visual feature map to obtain the semantically enhanced multi-modal enhanced feature map comprises: performing masked self-attention enhancement on the first fused visual feature map to obtain an enhanced visual feature map; projecting the first optimized text features through a linear mapping layer to obtain second optimized text features, wherein the channel dimension of the second optimized text features is consistent with the channel dimension of the first fused visual feature map; taking the enhanced visual feature map as a query matrix and the second optimized text features as a key matrix and a value matrix, and calculating a multi-modal fusion feature map; calculating, at each spatial position of the multi-modal fusion feature map, the cosine similarity between the pixel and the second optimized text features to obtain a pixel-category similarity score map; performing a weighted sum of the second optimized text features using the pixel-category similarity score map to obtain a multi-modal enhanced feature map; and summing the multi-modal enhanced feature map and the multi-modal fusion visual feature map through a learnable residual to obtain the semantically enhanced multi-modal enhanced feature map.
5. The semantic segmentation method for a remote sensing image according to claim 1, wherein inputting the semantically enhanced multi-modal enhanced feature map into the image decoder of the semantic segmentation model to obtain the semantic segmentation prediction map of the sample remote sensing image comprises: inputting the semantically enhanced multi-modal enhanced feature map into a second multi-scale visual state space block in the image decoder of the semantic segmentation model to obtain a decoder multi-layer visual feature map; performing feature fusion on the last-layer visual feature map in the decoder multi-layer visual feature map and the first-layer visual feature map in the encoder multi-layer visual feature map to obtain a second fused visual feature map; enhancing the second fused visual feature map through a spatial-channel aggregation enhancement module to obtain a second enhanced fused visual feature map; and inputting the second enhanced fused visual feature map into a segmentation head of the semantic segmentation model to obtain the semantic segmentation prediction map of the sample remote sensing image.
6. The semantic segmentation method for a remote sensing image according to claim 1, wherein the image decoder comprises a plurality of multi-scale visual state space blocks connected in series, each of the multi-scale visual state space blocks being configured to perform global spatial modeling, multi-scale context modeling, and upsampling resolution reconstruction on an input feature map, the input feature map being either a feature output by the image encoder or a fusion of a feature output by the image encoder and a feature output by a previous multi-scale visual state space block.
7. The semantic segmentation method for a remote sensing image according to claim 1, wherein calculating the total loss based on the semantic segmentation prediction map of the sample remote sensing image, the pixel-level semantic labels, the encoder multi-layer visual feature map and the first optimized text features comprises: calculating a semantic segmentation main loss based on the semantic segmentation prediction map of the sample remote sensing image and the pixel-level semantic labels; calculating a pixel-level contrast loss and a pixel-text contrast loss based on the encoder multi-layer visual feature map and the first optimized text features; and calculating the total loss based on the semantic segmentation main loss, the pixel-level contrast loss and the pixel-text contrast loss.
8. A semantic segmentation device for a remote sensing image, comprising: a first acquisition module, used for acquiring a sample data set, wherein the sample data set comprises a sample remote sensing image and pixel-level semantic labels corresponding to the sample remote sensing image; a second acquisition module, used for acquiring at least one category name based on the pixel-level semantic labels to obtain a category name set; a first input module, used for inputting the category names into a semantic segmentation model to obtain first optimized text features; a second input module, used for inputting the sample remote sensing image into the semantic segmentation model to obtain an encoder multi-layer visual feature map output by an image encoder in the semantic segmentation model and a decoder fourth-layer visual feature map output by a first multi-scale visual state space block in an image decoder of the semantic segmentation model; a first fusion module, used for performing feature fusion on a third-layer visual feature map in the encoder multi-layer visual feature map and the decoder fourth-layer visual feature map to obtain a first fused visual feature map; a second fusion module, used for performing cross-modal fusion on the first fused visual feature map to obtain a semantically enhanced multi-modal enhanced feature map; a third input module, used for inputting the semantically enhanced multi-modal enhanced feature map into the image decoder of the semantic segmentation model to obtain a semantic segmentation prediction map of the sample remote sensing image; a calculation module, used for calculating a total loss based on the semantic segmentation prediction map of the sample remote sensing image, the pixel-level semantic labels, the encoder multi-layer visual feature map and the first optimized text features; a training module, used for training the semantic segmentation model based on the total loss to obtain a trained semantic segmentation model; a third acquisition module, used for acquiring a remote sensing image to be segmented; and a fourth input module, used for inputting the remote sensing image to be segmented into the trained semantic segmentation model to obtain a semantic segmentation prediction map of the remote sensing image to be segmented.
9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the remote sensing image semantic segmentation method of any one of claims 1-7 when the computer program is executed.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which when executed by a processor implements the remote sensing image semantic segmentation method according to any one of claims 1-7.
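The cross-modal fusion steps recited in claim 4 can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the applicants' implementation: the masked self-attention step is omitted (the input `pixels` is taken to be the already enhanced visual feature map, flattened to one vector per pixel), the attention is plain scaled dot-product attention without learned projections, the similarity scores are used directly as weights, and the "learnable residual" is modelled as a single scalar `alpha` that would be a trained parameter in a real model. All function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]


def cosine(a, b, eps=1e-8):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)) + eps)


def cross_modal_fusion(pixels, text, alpha=0.5):
    """pixels: N vectors of size C (queries); text: K vectors of size C (keys and values).

    alpha stands in for the learnable residual weight (assumed to be a scalar).
    Returns the enhanced output (N x C) and the pixel-category score map (N x K).
    """
    scale = math.sqrt(len(text[0]))
    output, score_map = [], []
    for q in pixels:
        # Cross-attention: visual query against text keys, text values aggregated.
        attn = softmax([dot(q, k) / scale for k in text])
        fused = weighted_sum(attn, text)           # multi-modal fusion feature
        # Cosine similarity between the fused pixel feature and each class feature.
        scores = [cosine(fused, t) for t in text]  # pixel-category similarity
        enhanced = weighted_sum(scores, text)      # multi-modal enhanced feature
        # Residual sum gated by alpha.
        output.append([f + alpha * e for f, e in zip(fused, enhanced)])
        score_map.append(scores)
    return output, score_map


# Toy data: 3 pixels and 2 class text features, channel dimension 2.
pixels = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [[1.0, 0.0], [0.0, 1.0]]
output, score_map = cross_modal_fusion(pixels, text)
print(len(output), len(output[0]), len(score_map[0]))
```

In this toy setup a pixel whose feature points toward one class text feature receives a higher similarity score for that class, so the weighted sum pulls its representation further toward that class before the residual sum.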

Description

Semantic segmentation method and device for remote sensing image

Technical Field

The application relates to the technical field of deep learning, and in particular to a semantic segmentation method and device for remote sensing images.

Background

Semantic segmentation of remote sensing images is an important technology in remote sensing data analysis. It aims to assign a category label to each pixel so as to generate a detailed map of ground feature distribution. The technology provides the spatial distribution and category information of ground objects at pixel-level precision, and plays a key role in applications such as urban planning, land use classification and ecological environment protection.

Most existing remote sensing image semantic segmentation data sets contain only optical images and pixel-level labels, and lack corresponding text descriptions. Remote sensing images cover wide areas and complex scenes, and obtaining accurate text descriptions requires a large amount of manual labeling or a complex generation pipeline, so image-text multi-modal methods have lagged behind.

Existing methods try to introduce text information for multi-modal modeling to alleviate the lack of text, but they generally rely on manual labeling or a large language model to generate high-quality text descriptions, so the pipeline is complex, the cost is high, and the methods are difficult to apply at scale. Some methods instead use the class names of the data set directly as text prompts, generate text features through a frozen CLIP text encoder, and prepend a learnable vector to the class name. However, training is unstable, and text features and visual features are difficult to align effectively, which degrades segmentation accuracy.
Disclosure of Invention

In view of the above, the application provides a semantic segmentation method and device for remote sensing images, so as to improve segmentation accuracy.

The aim of the application can be achieved by the following technical solutions:

A first aspect of the application provides a semantic segmentation method for a remote sensing image, comprising the following steps: acquiring a sample data set, wherein the sample data set comprises a sample remote sensing image and pixel-level semantic labels corresponding to the sample remote sensing image; acquiring at least one category name based on the pixel-level semantic labels to obtain a category name set; inputting the category names into a semantic segmentation model to obtain first optimized text features; inputting the sample remote sensing image into the semantic segmentation model to obtain an encoder multi-layer visual feature map output by an image encoder in the semantic segmentation model and a decoder fourth-layer visual feature map output by a first multi-scale visual state space block in an image decoder of the semantic segmentation model; performing feature fusion on a third-layer visual feature map in the encoder multi-layer visual feature map and the decoder fourth-layer visual feature map to obtain a first fused visual feature map; performing cross-modal fusion on the first fused visual feature map to obtain a semantically enhanced multi-modal enhanced feature map; inputting the semantically enhanced multi-modal enhanced feature map into the image decoder of the semantic segmentation model to obtain a semantic segmentation prediction map of the sample remote sensing image; calculating a total loss based on the semantic segmentation prediction map of the sample remote sensing image, the pixel-level semantic labels, the encoder multi-layer visual feature map and the first optimized text features; training the semantic segmentation model based on the total loss to obtain a trained semantic segmentation model; acquiring a remote sensing image to be segmented; and inputting the remote sensing image to be segmented into the trained semantic segmentation model to obtain a semantic segmentation prediction map of the remote sensing image to be segmented.

In an alternative embodiment, inputting the category names into the semantic segmentation model to obtain the first optimized text features includes: inputting the category names into a text encoder of the semantic segmentation model to obtain initial text features; and optimizing the initial text features through an implicit prompt optimization module in the semantic segmentation model to obtain the first optimized text features.

In an alternative embodiment, optimizing the initial text features through the implicit prompt optimization module in the semantic segmentation model to obtain the first optimized text features includes: randomly initializing a learnable matrix through the implicit prompt optimization module, projecting the learnable matrix into a target learnable matrix, enabling the dimension of the target learnable matrix to be consistent with the dimension of