CN-121982500-A - Remote sensing image scene classification method based on Transformer


Abstract

The invention belongs to the technical field of image recognition, and particularly relates to a remote sensing image scene classification method based on a Transformer. The method addresses noise, cloud and shadow interference and unbalanced sample distribution in remote sensing images, as well as the heavy computation and insufficient multi-scale feature adaptation of existing Transformer models. It comprises the following steps: S1, denoising the remote sensing image, removing interference areas, unifying the resolution and performing standardized preprocessing; S2, optimizing the data through geometric enhancement and class balance enhancement; S3, segmenting the processed image into image blocks and inputting them into an improved Transformer comprising a hierarchical sparse attention module and multi-scale parallel branches, which efficiently extracts and fuses features of different granularity; S4, outputting a predicted probability distribution through a fully connected layer and a classification activation function, and determining the classification result based on a maximum probability criterion. The method balances classification accuracy and computational efficiency, is highly robust, and is suitable for application scenarios such as territorial spatial planning and environmental monitoring.

Inventors

WANG HAIBO
WANG LIANBEI
HE LANMAO

Assignees

上海银帆信息科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260128

Claims (6)

1. A remote sensing image scene classification method based on a Transformer, characterized by comprising the following steps: S1, preprocessing a remote sensing image, wherein the preprocessing comprises removing image noise, removing cloud, fog and shadow areas through threshold segmentation and morphological operations, unifying the image resolution to a preset size, and standardizing the pixel values based on the statistical mean and standard deviation of a remote sensing data set to obtain a preprocessed image; S2, performing image enhancement on the preprocessed image, wherein the image enhancement comprises geometric enhancement and class balance enhancement; S3, segmenting the preprocessed and enhanced remote sensing image into non-overlapping image blocks, mapping the non-overlapping image blocks into feature vectors of fixed dimension, adding position codes, and inputting the feature vectors into an improved Transformer backbone network, wherein the improved Transformer backbone network optimizes feature extraction efficiency through a hierarchical sparse attention module and at the same time constructs multi-scale parallel branches that respectively extract features of different granularity, fuse them, and output multi-scale fusion basic features; S4, passing the multi-scale fusion basic features through a fully connected layer and a classification activation function to output a scene classification prediction probability distribution for the remote sensing image, and determining the final scene classification result based on a maximum probability criterion.
2. The remote sensing image scene classification method based on a Transformer according to claim 1, wherein the preprocessing of the remote sensing image in step S1, which comprises removing image noise, removing cloud, fog and shadow areas through threshold segmentation and morphological operations, unifying the image resolution to a preset size, and standardizing the pixel values based on the statistical mean and standard deviation of a remote sensing data set to obtain a preprocessed image, is specifically implemented as follows: decomposing an original remote sensing image I into a plurality of scales through a Gaussian pyramid and, for each scale, calculating the local noise variance at each pixel from the 3 × 3 neighborhood centered on that pixel and the pixel mean of that neighborhood; constructing an adaptive weight based on the noise variance of each scale; denoising the image at each scale with a non-local means variant that uses an 11 × 11 search window; fusing the denoising results of all scales to obtain a denoised image, the fusion depending on the total number of scales L and on the mean denoised value of all points at each scale; determining adaptive thresholds for cloud, fog and shadow based on prior statistics of the remote sensing data set, and removing the corresponding regions through morphological optimization operations to obtain a region-removed image; resizing the region-removed image to the preset size by interpolating each target point from the four pixel coordinates surrounding it; and dividing the resized image into a plurality of non-overlapping sub-regions, calculating the mean and standard deviation of the m-th sub-region, and normalizing each pixel of that sub-region with them, a very small constant being used to avoid a denominator of 0, to obtain the preprocessed image.
3. The remote sensing image scene classification method based on a Transformer according to claim 1, wherein the image enhancement in step S2, which comprises geometric enhancement and class balance enhancement, is specifically implemented as follows: performing geometric enhancement on the preprocessed image, obtaining images at a plurality of different angles through rotation operations; performing class balance enhancement on all geometrically enhanced images, and extracting pixel features F from the images; calculating a weighted distance between the pixel feature vectors of two remote sensing image samples, where D is the total dimension of the pixel features and each dimension has its own dimension weight; for each sample in a small-sample class, calculating its weighted distance to the other samples of the same class and selecting the samples with the smallest distances to obtain a neighbor set; randomly selecting a sample from the neighbor set and generating an intermediate feature by interpolation; collecting the pixel value distribution of all original samples of each small-sample class to obtain a reasonable range for each channel, and restoring the interpolated feature to an image; calculating the pixel variance of the 3 × 3 neighborhood of each pixel in the restored image and, if the pixel variance is larger than the set average local neighborhood variance of the original samples of the small-sample class, re-interpolating until the condition is satisfied; and, for each small-sample class, repeating the above steps to generate new samples until the set sample balance number threshold is reached.
4. The remote sensing image scene classification method based on a Transformer according to claim 1, wherein the improved Transformer backbone network in step S3 optimizes feature extraction efficiency through the hierarchical sparse attention module specifically as follows: first, calculating a saliency score for each Patch of the segmented non-overlapping image blocks through a lightweight CNN (convolutional neural network); dividing the Patches into a high-saliency group and a low-saliency group based on the saliency scores, the grouping being decided by an adaptive threshold obtained from data set statistics, so as to generate a dynamic sparse mask; and merging the mask into the self-attention calculation to obtain hierarchical sparse attention output features.
5. The remote sensing image scene classification method based on a Transformer according to claim 4, wherein the rule of sparse connection between low-saliency Patches comprises: dividing all Patches of the low-saliency group into non-overlapping grid cells according to image space coordinates, each cell being denoted g and the grid size being adaptively determined by the total number of Patches in the low-saliency group; for each grid cell g, calculating a pixel feature similarity score over its internal Patches, which characterizes background homogeneity and is computed from the number of Patches in the g-th grid cell and from the pixel mean and standard deviation of each Patch; mapping the pixel feature similarity score to a connection retention interval according to a preset rule; and, for each grid cell, dynamically selecting anchor Patches according to the retention interval and retaining only the connections of the anchor Patches, thereby completing the sparse connection among the low-saliency Patches.
6. The remote sensing image scene classification method based on a Transformer according to claim 4, wherein the improved Transformer backbone network in step S3 constructs multi-scale parallel branches, respectively extracts features of different granularity, fuses them, and outputs multi-scale fusion basic features specifically as follows: adaptively dividing a remote sensing image block into non-overlapping image blocks of different scales based on prior ground feature scale characteristics of the remote sensing image; extracting the features at each scale by the method of optimizing feature extraction efficiency through the hierarchical sparse attention module; selecting, from the fine-scale high-saliency group, the Patches in the top 5% by saliency score as a cross-scale anchor set to obtain a corresponding feature set; for each fine-scale Patch, calculating its center coordinate in the original image, mapping the center coordinate to a coarse-scale coordinate according to the scale ratio, finding the coarse-scale anchor that contains the coarse-scale coordinate, and mapping the coarse-scale anchor to the set of fine-scale Patches of that anchor; enhancing the Patch feature vectors of the coarse and fine scales, determining the rectangular region in the original image corresponding to each Patch of each scale, and assigning the fixed-dimension feature vector of the Patch to all pixel positions in that rectangular region to obtain corresponding global features; calculating the information quantity of each scale from the global features, converting the information quantities into normalized weights through Softmax, and performing a weighted summation of the global features of the two scales to obtain fused features; and applying attention gating to the fused features to filter noise, and finally outputting the multi-scale fusion basic features.
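
The Softmax-weighted fusion of the two scales described in claim 6 can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the function names are hypothetical, the mean per-channel variance used as the "information quantity" is an assumption (the claim does not define that measure in this text), and the final attention gating step is omitted.

```python
import numpy as np

def softmax(values):
    """Numerically stable softmax over a 1-D sequence."""
    v = np.asarray(values, dtype=np.float64)
    e = np.exp(v - v.max())
    return e / e.sum()

def patch_to_global(patch_feats, grid_hw, patch_size):
    """Assign each Patch's feature vector to every pixel of its rectangle.

    patch_feats: (grid_h * grid_w, D) array in row-major Patch order.
    Returns an (grid_h * patch_size, grid_w * patch_size, D) global feature map.
    """
    grid_h, grid_w = grid_hw
    grid = patch_feats.reshape(grid_h, grid_w, -1)
    return np.repeat(np.repeat(grid, patch_size, axis=0), patch_size, axis=1)

def information_quantity(global_feat):
    # ASSUMPTION: mean per-channel variance stands in for the undefined
    # "information quantity" of a scale.
    return float(global_feat.var(axis=(0, 1)).mean())

def fuse_scales(fine_feats, coarse_feats, image_size, fine_patch, coarse_patch):
    """Weighted sum of the fine- and coarse-scale global features."""
    g_fine = patch_to_global(fine_feats, (image_size // fine_patch,) * 2, fine_patch)
    g_coarse = patch_to_global(coarse_feats, (image_size // coarse_patch,) * 2, coarse_patch)
    weights = softmax([information_quantity(g_fine), information_quantity(g_coarse)])
    fused = weights[0] * g_fine + weights[1] * g_coarse
    return fused, weights
```

Because both global feature maps are expanded to the pixel grid of the original image, features from Patches of different sizes can be summed position by position without any further resampling.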

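The dynamic sparse mask of claims 4 and 5 can be sketched in NumPy as follows. This is an illustrative reading only: the saliency scores and the threshold are taken as inputs rather than produced by a lightweight CNN and data set statistics, a fixed stride `anchor_stride` (a hypothetical parameter) replaces the per-grid retention interval, and identity query, key and value projections are used for brevity.

```python
import numpy as np

def sparse_attention(x, saliency, threshold, anchor_stride=4):
    """Self-attention restricted by a saliency-based sparse mask.

    x: (N, D) Patch features; saliency: (N,) scores; threshold: scalar.
    Returns the attended features (N, D) and the boolean mask (N, N),
    where mask[i, j] is True if Patch i may attend to Patch j.
    """
    n, d = x.shape
    high = saliency >= threshold

    # ASSUMPTION: among low-saliency Patches, only every `anchor_stride`-th
    # one is kept as an anchor whose connections are retained.
    low_idx = np.flatnonzero(~high)
    anchors = np.zeros(n, dtype=bool)
    anchors[low_idx[::anchor_stride]] = True

    # Keys that remain visible: all high-saliency Patches plus the anchors.
    key_allowed = high | anchors
    mask = np.tile(key_allowed, (n, 1))
    np.fill_diagonal(mask, True)  # every Patch always attends to itself

    # Scaled dot-product attention with masked positions set to -inf.
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)
    return weights @ x, mask
```

Keeping the diagonal of the mask set guarantees that every row has at least one finite score, so the softmax never divides by zero even when a Patch is neither salient nor an anchor.
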
Description

Remote sensing image scene classification method based on Transformer

Technical Field

The invention belongs to the technical field of image recognition, and particularly relates to a remote sensing image scene classification method based on a Transformer.

Background

Remote sensing image scene classification is an important branch of image recognition technology and has key application value in fields such as territorial spatial planning, environmental monitoring and disaster emergency response. With the development of remote sensing technology, the ground feature information in high-resolution remote sensing images has become increasingly complex: the images contain varied scene types such as buildings, vegetation and water bodies, they are often affected by interference such as noise, cloud, fog and shadow, and some scene categories suffer from small sample sizes and unbalanced data distribution, all of which pose serious challenges to accurate classification. Conventional classification methods struggle to balance the efficiency and the precision of feature extraction. Transformer-based methods have advantages in capturing global features, but they still face pain points such as heavy computation and insufficient multi-scale feature adaptation when processing remote sensing images, so an optimized technical scheme is needed to meet practical application demands.

Disclosure of Invention

Aiming at the technical problems described in the background, the invention provides a remote sensing image scene classification method based on a Transformer.
In order to achieve the above purpose, the technical scheme adopted by the invention comprises the following steps: S1, preprocessing a remote sensing image, wherein the preprocessing comprises removing image noise, removing cloud, fog and shadow areas through threshold segmentation and morphological operations, unifying the image resolution to a preset size, and standardizing the pixel values based on the statistical mean and standard deviation of a remote sensing data set to obtain a preprocessed image; S2, performing image enhancement on the preprocessed image, wherein the image enhancement comprises geometric enhancement and class balance enhancement; S3, segmenting the preprocessed and enhanced remote sensing image into non-overlapping image blocks, mapping the non-overlapping image blocks into feature vectors of fixed dimension, adding position codes, and inputting the feature vectors into an improved Transformer backbone network, wherein the improved Transformer backbone network optimizes feature extraction efficiency through a hierarchical sparse attention module and at the same time constructs multi-scale parallel branches that respectively extract features of different granularity, fuse them, and output multi-scale fusion basic features; S4, passing the multi-scale fusion basic features through a fully connected layer and a classification activation function to output a scene classification prediction probability distribution for the remote sensing image, and determining the final scene classification result based on a maximum probability criterion.
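
The block segmentation and embedding of step S3 and the classification of step S4 can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the method of the invention: the improved Transformer backbone is replaced by simple mean pooling over the Patch tokens, softmax is assumed as the classification activation function, and all function and parameter names are hypothetical.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch blocks.

    Returns an (N, patch * patch * C) array in row-major block order.
    """
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    blocks = image.reshape(h // patch, patch, w // patch, patch, c)
    blocks = blocks.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return blocks.reshape(-1, patch * patch * c)

def embed(patches, w_embed, pos):
    """Map each block to a fixed-dimension vector and add position codes."""
    return patches @ w_embed + pos

def classify(features, w_fc, b_fc):
    """Fully connected layer, softmax, and the maximum probability criterion.

    ASSUMPTION: mean pooling over tokens stands in for the backbone output.
    """
    logits = features.mean(axis=0) @ w_fc + b_fc
    e = np.exp(logits - logits.max())
    probs = e / e.sum()
    return probs, int(np.argmax(probs))
```

For an 8 × 8 three-channel image and a patch size of 4, `patchify` yields 4 blocks of 48 values each; the predicted scene class is the index of the largest entry in `probs`.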
The preprocessing, which comprises removing image noise, removing cloud, fog and shadow areas through threshold segmentation and morphological operations, unifying the image resolution to a preset size, and standardizing the pixel values based on the statistical mean and standard deviation of a remote sensing data set to obtain a preprocessed image, comprises the following steps: decomposing an original remote sensing image I into a plurality of scales through a Gaussian pyramid and, for each scale, calculating the local noise variance at each pixel from the 3 × 3 neighborhood centered on that pixel and the pixel mean of that neighborhood; constructing an adaptive weight based on the noise variance of each scale; denoising the image at each scale with a non-local means variant that uses an 11 × 11 search window; fusing the denoising results of all scales to obtain a denoised image, the fusion depending on the total number of scales L and on the mean denoised value of all points at each scale; determining adaptive thresholds for cloud, fog and shadow based on prior statistics of the remote sensing data set, and removing the corresponding regions through morphological optimization operations to obtain a region-removed image; resizing the region-removed image to the preset size by interpolating each target point from the four pixel coordinates surrounding it; dividing the resized image into a plurality of non-overlapping sub-regions and normalizing each pixel of the m-th sub-region, wherein a very small constant avoids a denominator of 0, and respectively