CN-122024082-A - Remote sensing semantic segmentation method based on vision-language model encoder and double-branch weighted decoder
Abstract
The invention relates to the technical field of artificial intelligence and discloses a remote sensing semantic segmentation method based on a vision-language model encoder and a double-branch weighted decoder. First, a large-scale remote sensing image-text dataset is constructed and the image and text encoders of a vision-language model are pre-trained. Next, an encoder-decoder remote sensing semantic segmentation framework is designed, comprising a double-branch weighted decoder structure with a multi-scale feature fusion decoder and a feature difference aggregation decoder. Third, the pre-trained vision-language model image encoder is migrated into the segmentation framework as the image feature extractor. The whole framework is then trained on a large-scale public remote sensing semantic segmentation dataset, with each branch of the double-branch decoder learned independently under a cross-entropy loss function. To improve segmentation accuracy, the performance of the double-branch decoder is further optimized through weight constraints. Finally, the trained framework weights are saved and loaded in the inference stage to generate a visualized segmentation result. The invention effectively combines a vision-language pre-training model with a double-branch decoder structure to improve the accuracy of remote sensing image segmentation tasks.
Inventors
- ZHU JIALE
- LIU FAN
- WANG YANFANG
Assignees
- 河海大学 (Hohai University)
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-11-12
Claims (7)
- 1. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder is characterized by comprising the following steps: Step 1, collecting public remote sensing image data, constructing a large-scale remote sensing image-text dataset, and pre-training an image encoder and a text encoder of a vision-language model; Step 2, constructing an encoder-decoder remote sensing semantic segmentation framework and designing a double-branch weighted decoder structure, wherein the structure comprises a multi-scale feature fusion decoder and a feature difference aggregation decoder; Step 3, migrating the vision-language model image encoder obtained in Step 1 into the remote sensing semantic segmentation framework of Step 2 as the encoder for extracting image features; Step 4, training the whole remote sensing semantic segmentation framework on a large-scale public remote sensing semantic segmentation dataset, with the two branches of the double-branch decoder learned independently using two cross-entropy (Cross-Entropy) losses as loss functions; Step 5, setting different weights to constrain the double-branch decoder and improve the segmentation accuracy of the whole segmentation framework; and Step 6, saving the remote sensing semantic segmentation framework weights trained in Step 5, and loading the weight parameters in the inference stage to obtain a visualized segmentation result.
- 2. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder according to claim 1, wherein the specific process of Step 1 is as follows: 1.1 Collect public remote sensing image data from multiple sources. These images may cover different resolutions, different scenes (e.g., cities, forests, farms, etc.), and remote sensing images acquired at different altitudes. The diversified data sources help the model adapt to varied remote sensing scenes, thereby improving its generalization capability across scenes. For each sample image X, a corresponding image text description Y_caption is generated using a multi-modal large language model (e.g., CLIP, BLIP, etc.). These textual descriptions provide a global semantic overview of each image, describing content such as topographical features, environmental conditions, and building types. Next, a similarity score between the sample image X and its text description Y_caption is calculated using an image-text contrastive pre-training model. The similarity score evaluates the consistency of the text description with the image content. High-quality image-text pairs are then screened out according to the similarity score, ensuring that the descriptions accurately reflect the key information of the image. 1.2 After the screening of high-quality image-text pairs is completed, the model is further fine-tuned on these data. By fine-tuning the visual encoder and the text encoder with an InfoNCE contrastive loss, the model gradually improves its ability to capture remote sensing image characteristics. The contrastive loss function is calculated as follows: L = -(1/N) Σ_{i=1}^{N} log( exp(sim(x_i, y_i)/τ) / Σ_{j=1}^{N} exp(sim(x_i, y_j)/τ) ), where N is the batch size, sim(·,·) represents the similarity function, and τ is a learnable temperature parameter.
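The InfoNCE objective in claim 2 can be illustrated with a minimal pure-Python sketch (not part of the claimed invention; the function name, the toy similarity matrix, and the restriction to the image-to-text direction are assumptions for illustration):

```python
import math

def info_nce(sim_matrix, tau=0.07):
    """InfoNCE loss over an N x N image-text similarity matrix.

    sim_matrix[i][j] is sim(x_i, y_j); matched pairs lie on the diagonal.
    Only the image-to-text direction is computed, for brevity.
    """
    n = len(sim_matrix)
    total = 0.0
    for i in range(n):
        logits = [sim_matrix[i][j] / tau for j in range(n)]
        m = max(logits)  # subtract the max to stabilise the softmax
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += -(logits[i] - log_denom)  # -log softmax of the matched pair
    return total / n

# With a uniform similarity matrix the loss equals log(N):
uniform = [[0.5] * 4 for _ in range(4)]
print(round(info_nce(uniform), 4))  # log(4) ≈ 1.3863
```

A well-aligned matrix (large diagonal, small off-diagonal entries) drives the loss below log(N), which is what fine-tuning the two encoders optimizes for.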
- 3. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder according to claim 1, wherein the specific process of Step 2 is as follows: 2.1 Construct an encoder-decoder remote sensing semantic segmentation framework, mainly comprising an image encoder and a double-branch weighted decoder structure. The image encoder is identical in structure to the image encoder of the vision-language model, and the double-branch weighted decoder structure comprises a multi-scale feature fusion decoder and a feature difference aggregation decoder. The multi-scale feature fusion decoder integrates features of different hierarchical scales to reduce the loss of semantic information in the multi-scale features. The feature difference aggregation decoder serves as an auxiliary branch that captures differences between features at different hierarchical levels, thereby enhancing feature complementarity. 2.2 Multi-scale feature fusion decoder: multi-scale fusion is performed by simple convolution and upsampling to obtain the fused multi-scale feature F_fuse. The specific calculation formula is as follows: F_fuse = conv( Σ_i up(F_i) ), where up represents the upsampling operation, conv represents the convolution function, and F_i denotes the encoder feature at level i. 2.3 Feature difference aggregation decoder: the differences between the multi-scale feature maps are captured by feature subtraction, yielding the difference feature D_i. The specific calculation formula is as follows: D_i = conv_{k×k}( |F_i - up(F_{i+1})| ), where |·| denotes the absolute value operation, up denotes 2-fold upsampling by bilinear interpolation, conv denotes the convolution function, and the convolution kernel k×k ∈ {1×1, 3×3, 5×5}.
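The core of the feature difference aggregation branch, |F_i - up(F_{i+1})|, can be sketched in pure Python on 2-D lists (illustrative only, not part of the claimed invention: the claim specifies bilinear upsampling and a trailing convolution, whereas this sketch uses nearest-neighbour upsampling and omits the convolution for brevity):

```python
def upsample2x(fmap):
    """2x nearest-neighbour upsampling of a 2-D feature map
    (the claim uses bilinear interpolation; nearest is a simplification)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in (0, 1)]  # duplicate each column
        out.append(wide)
        out.append(list(wide))                   # duplicate each row
    return out

def difference_feature(f_low, f_high):
    """|f_low - up(f_high)|: element-wise absolute difference between a
    shallow high-resolution map and the upsampled next (deeper) level."""
    up = upsample2x(f_high)
    return [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(f_low, up)]

f_high = [[1.0]]                   # 1x1 deep-level feature
f_low = [[1.0, 2.0], [0.0, 1.0]]   # 2x2 shallower-level feature
print(difference_feature(f_low, f_high))  # [[0.0, 1.0], [1.0, 0.0]]
```

Regions where the two levels agree produce zeros, so the difference map highlights exactly the scale-dependent detail the auxiliary branch is meant to capture.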
- 4. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder according to claim 1, wherein the specific process of Step 3 is as follows: the vision-language model image encoder obtained in Step 1 is migrated into the remote sensing semantic segmentation framework of Step 2 as the encoder for extracting image features. This implicitly utilizes the rich remote sensing prior knowledge in the pre-trained vision-language model and provides a good starting point for further training on a downstream remote sensing image semantic segmentation dataset.
- 5. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder according to claim 1, wherein the specific process of Step 4 is as follows: the two branches of the double-branch decoder are learned independently using cross-entropy (Cross-Entropy) loss as the loss function. The specific calculation formulas are as follows: L_1 = -(1/N) Σ_{i=1}^{N} y_i log(ŷ_i^(1)), L_2 = -(1/N) Σ_{i=1}^{N} y_i log(ŷ_i^(2)), where L_1 and L_2 represent the loss functions of the multi-scale fusion decoder and the feature difference aggregation decoder, respectively; y_i represents the true label of the i-th pixel; ŷ_i^(1) and ŷ_i^(2) respectively represent the predictions of the i-th pixel by the multi-scale fusion decoder and the feature difference aggregation decoder; and N represents the number of pixels in the training sample image.
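Per-branch pixel-wise cross-entropy can be sketched in pure Python (illustrative only; the toy labels and predicted distributions are invented, and predictions are assumed to be already-normalized class probabilities per pixel):

```python
import math

def pixel_cross_entropy(labels, probs):
    """Mean per-pixel cross-entropy: -(1/N) Σ_i log probs[i][labels[i]],
    where probs[i] is the predicted class distribution for pixel i."""
    n = len(labels)
    return -sum(math.log(probs[i][labels[i]]) for i in range(n)) / n

labels = [0, 1, 1]                                 # ground-truth class per pixel
branch1 = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]     # multi-scale fusion branch
branch2 = [[0.8, 0.2], [0.4, 0.6], [0.5, 0.5]]     # difference aggregation branch
L1 = pixel_cross_entropy(labels, branch1)
L2 = pixel_cross_entropy(labels, branch2)
print(L1 < L2)  # True: branch 1 is more confident on the correct classes
```

Each branch is scored against the same ground truth but through its own prediction, matching the claim's requirement that the two losses are computed independently.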
- 6. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder according to claim 1, wherein the specific process of Step 5 is as follows: different weights are set to constrain the double-branch decoder, balancing the influence of each branch decoder on the final prediction result, and the total loss of the whole segmentation framework is composed of the weighted sum of the branch losses. The specific calculation formula is as follows: L_total = αL_1 + (1-α)L_2, where L_1 and L_2 are the losses of the double-branch decoder in Step 4, and α and 1-α are their respective weights.
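The weighted combination in claim 6 is a one-liner; a small sketch (illustrative, with invented loss values) makes the boundary behaviour of α explicit:

```python
def total_loss(l1, l2, alpha):
    """Weighted sum L_total = alpha*L1 + (1-alpha)*L2 constraining the two
    decoder branches; alpha must lie in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    return alpha * l1 + (1.0 - alpha) * l2

# alpha = 1 recovers the fusion branch alone, alpha = 0 the difference branch:
print(total_loss(0.4, 0.8, 1.0))            # 0.4
print(total_loss(0.4, 0.8, 0.0))            # 0.8
print(round(total_loss(0.4, 0.8, 0.5), 4))  # 0.6
```

Sweeping α between these extremes is how the relative contribution of the main and auxiliary branches is tuned for segmentation accuracy.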
- 7. The remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder according to claim 1, wherein the specific process of Step 6 is as follows: the remote sensing semantic segmentation framework weights obtained by training in Step 5 are saved; in the inference stage, a test sample image is input into the segmentation framework to obtain a predicted segmentation result; finally, the predicted segmentation mask is visualized as the final output result of the invention.
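Visualizing a predicted mask typically means mapping each pixel's class id to a colour. A minimal sketch (the palette and class set are hypothetical; the patent does not fix specific classes or colours):

```python
# Hypothetical class palette for illustration only.
PALETTE = {0: (0, 0, 0),      # background
           1: (0, 255, 0),    # vegetation
           2: (0, 0, 255)}    # water

def colorize_mask(mask):
    """Turn an H x W map of class ids into an H x W map of RGB triples."""
    return [[PALETTE[c] for c in row] for row in mask]

mask = [[0, 1], [2, 1]]
print(colorize_mask(mask)[0][1])  # (0, 255, 0)
```

The coloured array can then be written out with any image library to produce the visualized segmentation result described in the claim.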
Description
Remote sensing semantic segmentation method based on vision-language model encoder and double-branch weighted decoder
Technical Field
The invention relates to a remote sensing semantic segmentation method, in particular to a remote sensing semantic segmentation method based on a vision-language model, and belongs to the technical field of computers.
Background
With the rapid development of remote sensing sensors and satellites, it has become more convenient to acquire high-resolution remote sensing images. Semantic segmentation of remote sensing images has been applied to geographic information applications including urban planning, land resource management, and disaster assessment. Traditional methods of semantic segmentation of remote sensing images rely mainly on manually designed feature extractors and classifiers, which leads to a lack of generalization capability across different remote sensing scenarios. Some classical segmentation models based on Convolutional Neural Networks (CNNs) have achieved significant success in remote sensing image segmentation tasks. However, most of these works use only deep features for prediction, ignoring scale variations. In addition, some studies have shown that simple fusion strategies may lead to redundant feature information. In recent years, vision-language models have received a great deal of attention in the field of computer vision. These models can acquire general knowledge by pre-training on large-scale image-text datasets, reducing reliance on task-specific data. However, directly applying a generic vision-language model to remote sensing downstream tasks may lead to performance degradation due to domain differences in image distribution. Some researchers fine-tune the vision-language model on a small-scale remote sensing image-text dataset, exhibiting excellent generalization performance in a range of downstream tasks.
Therefore, the invention designs an encoder-decoder remote sensing semantic segmentation framework based on a vision-language model.
Disclosure of Invention
The invention aims to solve the technical problem of providing a remote sensing semantic segmentation method based on a vision-language model encoder and a double-branch weighted decoder, providing an effective solution for applying a vision-language model to the remote sensing semantic segmentation task. The invention adopts the following technical scheme for solving the technical problem: the remote sensing semantic segmentation method based on the vision-language model encoder and the double-branch weighted decoder comprises the following steps: Step 1, collecting public remote sensing image data, constructing a large-scale remote sensing image-text dataset, and pre-training an image encoder and a text encoder of a vision-language model; Step 2, constructing an encoder-decoder remote sensing semantic segmentation framework and designing a double-branch weighted decoder structure, wherein the structure comprises a multi-scale feature fusion decoder and a feature difference aggregation decoder; Step 3, migrating the vision-language model image encoder obtained in Step 1 into the remote sensing semantic segmentation framework of Step 2 as the encoder for extracting image features; Step 4, training the whole remote sensing semantic segmentation framework on a large-scale public remote sensing semantic segmentation dataset, with the two branches of the double-branch decoder learned independently using two cross-entropy (Cross-Entropy) losses as loss functions; Step 5, setting different weights to constrain the double-branch decoder and improve the segmentation accuracy of the whole segmentation framework; and Step 6, saving the remote sensing semantic segmentation framework weights trained in Step 5, and loading the weight parameters in the inference stage to obtain a visualized
segmentation result. As a preferable scheme of the invention, the specific process of Step 1 is as follows: 1.1 Collect public remote sensing image data from multiple sources. These images may cover different resolutions, different scenes (e.g., cities, forests, farms, etc.), and remote sensing images acquired at different altitudes. The diversified data sources help the model adapt to varied remote sensing scenes, thereby improving its generalization capability across scenes. For each sample image X, a corresponding image text description Y_caption is generated using a multi-modal large language model (e.g., CLIP, BLIP, etc.). These textual descriptions provide a global semantic overview of each image, describing content such as topographical features, environmental conditions, and building types. Next, a similarity score between the sample image X and its text description Y_caption is calculated using an image-text contrastive pre-training model. The similarity score is used to evaluate