CN-120070895-B - Visual text guidance-based semantic segmentation method for few-label remote sensing image
Abstract
The application relates to a visual-text-guided semantic segmentation method and device for few-label remote sensing images. The method comprises: processing image data with the visual feature encoder of a visual-text model to obtain visual coding features; processing the support image coding features and the query image coding features respectively with a visual-text prior decoupling model to obtain a support image visual-text prior and a query image visual-text prior; obtaining mixed multi-level support image coding features with a high-confidence visual feature mixing model; obtaining visual relation prior values with a multi-level prior computing model; and decoding the priors with a multi-level prior decoding network to obtain the segmentation result of the input image.
Inventors
- JIANG ZHIYU
- YUAN YE
- YUAN YUAN
Assignees
- Northwestern Polytechnical University (西北工业大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20250227
Claims (10)
- 1. A visual-text-guided semantic segmentation method for few-label remote sensing images, characterized by comprising the following steps: inputting pre-acquired image data and text data into a semantic segmentation model, wherein the image data comprises a support image, a support image label and a query image, the text data comprises the class name to be segmented and a background class name, and the semantic segmentation model comprises a visual-text model and a ResNet network; processing the image data with the visual feature encoder of the visual-text model to obtain visual coding features, wherein the visual coding features comprise support image coding features and query image coding features, and the support image coding features comprise foreground features; processing the support image coding features and the query image coding features respectively with a pre-constructed visual-text prior decoupling model to obtain, correspondingly, the support image visual-text prior and the query image visual-text prior of the support image and the query image with respect to the class to be segmented; processing the query image visual-text prior with a pre-constructed high-confidence visual feature mixing model to obtain high-confidence features among the query image visual features, determining a query prototype from the high-confidence features and the multi-level query image coding features, and performing weighted summation of the query prototype and the foreground features in the multi-level support image coding features to obtain mixed multi-level support image coding features, wherein the multi-level support image coding features and the multi-level query image coding features are obtained by processing the support image and the query image with the ResNet network; processing the query image visual features and the mixed multi-level support image coding features with a preset multi-level prior computing model to obtain a multi-level cosine affinity prior and a multi-level Euclidean-distance-normalized discounted cumulative gain prior; and decoding the visual-text priors, the multi-level cosine affinity prior and the multi-level Euclidean-distance-normalized discounted cumulative gain prior with a preset multi-level prior decoding network to obtain the segmentation result of the input image.
- 2. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 1, wherein the background class name in the text data is processed with the visual-text model to obtain background-class text coding features, and wherein processing the support image coding features and the query image coding features respectively with the pre-constructed visual-text prior decoupling model to obtain, correspondingly, the support image visual-text prior and the query image visual-text prior with respect to the class to be segmented comprises: computing the support image visual-text similarity and the query image visual-text similarity as the cosine similarity between, respectively, the support image coding features and the query image coding features and the background-class text coding features; and inputting the support image coding features and the query image coding features into the visual-text prior decoupling model to obtain, correspondingly, the support image visual-text prior and the query image visual-text prior; wherein the visual-text prior decoupling model comprises, connected in sequence, a first 2D transposed convolution layer, a first normalization layer, a first activation layer, a second 2D transposed convolution layer, a second normalization layer, a second activation layer, a first 2D convolution layer, a third normalization layer, a third activation layer, a first max-pooling layer, a second 2D convolution layer, a fourth normalization layer, a second max-pooling layer, a third 2D convolution layer and a Sigmoid activation function layer.
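The cosine-similarity step of claim 2 can be sketched as follows. This is a minimal NumPy illustration; the tensor shapes, the feature dimension `d`, and the function name are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def visual_text_prior(image_feats, text_feat):
    """Cosine similarity between every spatial image feature and a text embedding.

    image_feats: (H, W, d) visual coding features of one image
    text_feat:   (d,) text coding feature (e.g. for the background class name)
    Returns an (H, W) similarity map in [-1, 1].
    """
    img = image_feats / (np.linalg.norm(image_feats, axis=-1, keepdims=True) + 1e-8)
    txt = text_feat / (np.linalg.norm(text_feat) + 1e-8)
    return img @ txt
```

Applied to the support and query coding features with the background-class text embedding, the same routine yields the two similarity maps the claim refers to before they are fed to the prior decoupling model.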
- 3. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 2, further comprising: training the visual-text prior decoupling model with the binary cross-entropy loss between the support image visual-text prior and the support image label to obtain the final visual-text prior decoupling model.
- 4. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 1, wherein processing the query image visual-text prior with the pre-constructed high-confidence visual feature mixing model to determine high-confidence features among the query image visual features, determining a query prototype from the high-confidence features and the multi-level query image coding features, and performing weighted summation of the query prototype and the foreground features in the multi-level support image coding features to obtain the mixed multi-level support image coding features comprises: inputting the support image and the query image respectively into the ResNet network to obtain, correspondingly, multi-level support image coding features and multi-level query image coding features; aligning the multi-level support image coding features and the multi-level query image coding features to the spatial resolution of the input image to obtain aligned multi-level query image coding features and aligned multi-level support image coding features; and performing the following operations for each level of the aligned multi-level query image coding features and the aligned multi-level support image coding features to obtain the mixed multi-level support image coding features: normalizing the query image visual-text prior to obtain a normalized prior; thresholding the normalized prior to obtain confidence values, wherein confidences above a preset threshold are set to 1 and confidences at or below the preset threshold are set to 0; obtaining a query prototype from the confidence values and the query image coding features of the corresponding level; and determining the mixed support image coding features of the corresponding level based on the support image label, the query prototype and the support image coding features of the corresponding level.
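The per-level operations of claims 4 and 5 can be sketched as below. This is a minimal NumPy sketch; the min-max normalization, the threshold value, the mixing weight `alpha`, and all function names are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def high_confidence_prototype(prior, query_feats, threshold=0.7):
    """Binarize the normalized prior and average query features over the confident region.

    prior:       (H, W) query image visual-text prior
    query_feats: (H, W, d) query image coding features of one level
    Returns a (d,) query prototype.
    """
    p = (prior - prior.min()) / (prior.max() - prior.min() + 1e-8)  # normalize to [0, 1]
    conf = (p > threshold).astype(query_feats.dtype)                # 1 above, 0 at/below
    return (query_feats * conf[..., None]).sum(axis=(0, 1)) / (conf.sum() + 1e-8)

def mix_support_features(support_feats, support_label, prototype, alpha=0.5):
    """Weighted mix of the query prototype into the target-class (foreground) region
    of the support features, as in claim 5."""
    fg = support_label[..., None].astype(support_feats.dtype)  # (H, W, 1) binary mask
    return support_feats * (1 - alpha * fg) + alpha * fg * prototype
```

Background pixels of the support features are left untouched; only the labeled target-class region is blended with the prototype.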
- 5. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 4, wherein determining the mixed support image coding features of the corresponding level based on the support image label, the query prototype and the support image coding features of the corresponding level comprises: locating the target-class region of the support image coding features of the corresponding level according to the support image label, and weighted-mixing the target-class region with the query prototype to obtain the mixed support image coding features of that level.
- 6. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 4, wherein, before processing the query image visual features and the mixed multi-level support image coding features with the preset multi-level prior computing model to obtain the multi-level cosine affinity prior and the multi-level Euclidean-distance-normalized discounted cumulative gain prior, the method comprises: interpolating and flattening the support image label to obtain a corrected support image label whose spatial resolution is consistent with the visual features; computing the Hadamard product of the mixed multi-level support image coding features and the corrected support image label, respectively, to obtain mask features of the corresponding levels; and computing, for each level, the cosine similarity between the mask features and the query image coding features, then sequentially performing masked mean computation along the corrected support image label dimension and dimension transformation on the cosine similarity to obtain the cosine affinity prior of each level.
- 7. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 6, wherein, before processing the query image visual features and the mixed multi-level support image coding features with the preset multi-level prior computing model to obtain the multi-level cosine affinity prior and the multi-level Euclidean-distance-normalized discounted cumulative gain prior, the method further comprises: computing, pixel by pixel, the Euclidean distances between the mixed multi-level support image coding features and the query image coding features to obtain the Euclidean distances of each level; counting the number of foreground pixels of the corrected support image label; determining, for each level, the coordinate subset of the mixed support image coding features nearest to the query image coding features, wherein the size of the coordinate subset equals the number of foreground pixels; scoring the coordinate subset for relevance, and computing from the relevance scores the discounted cumulative gain (DCG) and the ideal discounted cumulative gain (IDCG) that the query image visual features belong to the target class; and obtaining the multi-level Euclidean-distance-normalized discounted cumulative gain prior from the DCG and the IDCG.
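Claim 7's Euclidean-distance-normalized discounted cumulative gain prior can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the relevance scoring `1 / (1 + distance)`, the logarithmic rank discount, and an all-ones ideal relevance are illustrative choices for a DCG/IDCG computation and are not taken from the patent:

```python
import numpy as np

def ndcg_prior(support_feats, support_label, query_feats):
    """For each query pixel, rank the k nearest foreground support features by
    Euclidean distance (k = number of foreground label pixels), turn distances
    into relevance scores, and normalize the DCG by the ideal DCG."""
    mask = support_label.reshape(-1) > 0
    fg = support_feats.reshape(-1, support_feats.shape[-1])[mask]  # (k, d)
    k = len(fg)
    q = query_feats.reshape(-1, query_feats.shape[-1])             # (Nq, d)
    dist = np.linalg.norm(q[:, None, :] - fg[None, :, :], axis=-1) # (Nq, k)
    dist = np.sort(dist, axis=1)                  # nearest-first coordinate subset
    rel = 1.0 / (1.0 + dist)                      # relevance scoring (assumed form)
    discounts = 1.0 / np.log2(np.arange(2, k + 2))
    dcg = (rel * discounts).sum(axis=1)           # discounted cumulative gain
    idcg = discounts.sum()                        # ideal DCG: relevance 1 at every rank
    return (dcg / idcg).reshape(query_feats.shape[:2])
```

Because the relevance is capped at 1, the resulting prior lies in (0, 1]: query pixels whose features sit close to many foreground support features score near 1.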
- 8. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 3, wherein the preset multi-level prior decoding network comprises a 2D convolution layer, a group normalization layer and an activation function layer, and wherein decoding the visual-text priors, the multi-level cosine affinity prior and the multi-level Euclidean-distance-normalized discounted cumulative gain prior with the preset multi-level prior decoding network to obtain the segmentation result of the input image comprises: up-sampling and cross-layer connecting the visual-text priors, multi-level cosine affinity priors and multi-level Euclidean-distance-normalized discounted cumulative gain priors of the different levels with the multi-level prior decoding network to obtain the corresponding two-channel query image prediction probability map.
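The coarse-to-fine decoding of claim 8 can be sketched as follows. This is a minimal NumPy sketch, not the patented network: nearest-neighbour upsampling stands in for the learned upsampling, plain matrix multiplications stand in for the 2D convolution plus normalization layers, and the channel counts are assumptions:

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of an (H, W, C) feature map."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

def decode(priors, w1, w2):
    """Fuse per-level prior maps coarse-to-fine with upsampling and cross-layer
    (skip) concatenation, then map to a two-channel probability map.

    priors: list of (H_l, W_l, C) prior maps, coarsest first, each level exactly
            half the resolution of the next.
    w1:     (2C, C) weight of a 1x1-convolution stand-in applied after each concat.
    w2:     (C, 2) weight of the final two-channel projection.
    """
    x = priors[0]
    for p in priors[1:]:
        x = np.concatenate([upsample2x(x), p], axis=-1)  # cross-layer connection
        x = np.maximum(x @ w1, 0.0)                      # 1x1 conv + ReLU stand-in
    logits = x @ w2                                      # (H, W, 2)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)             # two-channel probability map
```

Each finer level contributes its own prior via the skip concatenation, so low-level detail and high-level semantics both reach the final two-channel prediction.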
- 9. The visual-text-guided few-label remote sensing image semantic segmentation method according to claim 8, further comprising: computing a scale-aware cross-entropy segmentation loss from the query image prediction probability map and the corrected support image label; obtaining the total loss function as the sum of the binary cross-entropy loss and the scale-aware cross-entropy segmentation loss; and training the ResNet network and the visual-text model with the total loss function to obtain the semantic segmentation model with determined network parameters.
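The total loss of claim 9 can be sketched as below. This is a minimal NumPy sketch; the patent does not define its scale-aware weighting, so a single scalar `scale_weight` stands in for it here, and all names are illustrative:

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-8):
    """BCE between the support image visual-text prior and the support label."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

def total_loss(support_prior, support_label, query_prob, query_target, scale_weight=1.0):
    """Sum of the prior-decoupling BCE loss and a cross-entropy segmentation loss
    on the two-channel prediction (scale_weight is a stand-in for the patent's
    scale-aware weighting).

    query_prob:   (2, H, W) two-channel prediction probability map
    query_target: (H, W) integer class indices (0 or 1)
    """
    bce = binary_cross_entropy(support_prior, support_label)
    picked = np.take_along_axis(query_prob, query_target[None], axis=0)
    ce = -scale_weight * np.log(np.clip(picked, 1e-8, None)).mean()
    return bce + ce
```

Both terms are averaged over pixels, so the sum keeps the two supervision signals on comparable scales.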
- 10. A visual-text-guided few-label remote sensing image semantic segmentation device, characterized by comprising: an input module, configured to input pre-acquired image data and text data into a semantic segmentation model, wherein the image data comprises a support image, a support image label and a query image, the text data comprises the class name to be segmented and a background class name, and the semantic segmentation model comprises a visual-text model and a ResNet network; an encoding module, configured to process the image data with the visual feature encoder of the visual-text model to obtain visual coding features, wherein the visual coding features comprise support image coding features and query image coding features, and the support image coding features comprise foreground features; a first prior computing module, configured to process the support image coding features and the query image coding features respectively with a pre-constructed visual-text prior decoupling model to obtain, correspondingly, the support image visual-text prior and the query image visual-text prior of the support image and the query image with respect to the class to be segmented; a feature mixing module, configured to process the query image visual-text prior with a pre-constructed high-confidence visual feature mixing model to obtain high-confidence features among the query image visual features, determine a query prototype from the high-confidence features and the multi-level query image coding features, and perform weighted summation of the query prototype and the foreground features in the multi-level support image coding features to obtain mixed multi-level support image coding features, wherein the multi-level support image coding features and the multi-level query image coding features are obtained by processing the support image and the query image with the ResNet network; a second prior computing module, configured to process the query image visual features and the mixed multi-level support image coding features with a preset multi-level prior computing model to obtain a multi-level cosine affinity prior and a multi-level Euclidean-distance-normalized discounted cumulative gain prior; and an image segmentation module, configured to decode the visual-text priors, the multi-level cosine affinity prior and the multi-level Euclidean-distance-normalized discounted cumulative gain prior with a preset multi-level prior decoding network to obtain the segmentation result of the input image.
Description
Visual text guidance-based semantic segmentation method for few-label remote sensing image

Technical Field

The embodiments of the application relate to the field of computer vision, and in particular to a visual-text-guided semantic segmentation method and device for few-label remote sensing images.

Background

Few-shot semantic segmentation of remote sensing images aims at pixel-level discrimination of a class from only a small number of annotated images. However, because remote sensing targets of the same category can differ greatly, a small amount of labeled data cannot cover all semantic information within the category, which degrades segmentation results. The method proposed here uses a visual-text model to extract a generalizable visual-text prior that alleviates intra-class differences; to further improve guidance robustness, it mixes high-confidence query features with the corresponding-level support image coding features, and it normalizes the discounted cumulative gain by Euclidean distance to improve the performance of non-parametric visual feature metrics. The technology has important applications in scenarios such as national defense, urban planning, and natural disaster prevention and control. Existing semantic segmentation techniques for few-label images can be divided into prototype-based methods and affinity-based methods. Although these methods improve the metric performance of few-shot semantic segmentation to some extent, they mitigate the intra-class difference problem only to a limited degree.
Disclosure of Invention

In view of this, the embodiments of the application provide a visual-text-guided semantic segmentation method and device for few-label remote sensing images, which aim to solve the problem that intra-class differences greatly affect the performance of few-shot semantic segmentation. To this end, the embodiments of the application provide a visual-text-guided few-label remote sensing image semantic segmentation method, which comprises: inputting pre-acquired image data and text data into a semantic segmentation model, wherein the image data comprises a support image, a support image label and a query image, the text data comprises the class name to be segmented and a background class name, and the semantic segmentation model comprises a visual-text model and a ResNet network; processing the image data with the visual feature encoder of the visual-text model to obtain visual coding features, wherein the visual coding features comprise support image coding features and query image coding features, and the support image coding features comprise foreground features; processing the support image coding features and the query image coding features respectively with a pre-constructed visual-text prior decoupling model to obtain, correspondingly, the support image visual-text prior and the query image visual-text prior of the support image and the query image with respect to the class to be segmented; processing the query image visual-text prior with a pre-constructed high-confidence visual feature mixing model to obtain high-confidence features among the query image visual features, determining a query prototype from the high-confidence features and the multi-level query image coding features, and performing weighted summation of the query prototype and the foreground features in the multi-level support image coding features to obtain mixed multi-level support image coding features, wherein the multi-level support image coding features and the multi-level query image coding features are obtained by processing the support image and the query image with the ResNet network; processing the query image visual features and the mixed multi-level support image coding features with a preset multi-level prior computing model to obtain a multi-level cosine affinity prior and a multi-level Euclidean-distance-normalized discounted cumulative gain prior; and decoding the visual-text priors, the multi-level cosine affinity prior and the multi-level Euclidean-distance-normalized discounted cumulative gain prior with a preset multi-level prior decoding network to obtain the segmentation result of the query image. Further, processing the support image coding features and the query image coding features respectively with the pre-constructed visual-text prior decoupling model to obtain, correspondingly, the support image visual-text prior and the query image visual-text prior with respect to the class to be segmented comprises: computing the support image visual-text similarity and the query image visual-text similarity as the cosine similarity between, respectively, the support image coding features and the query image coding features and the background-class text coding features, and inputting the support image coding features and the query image coding features into the visual-text prior decoupling model.