CN-121982432-A - Self-explanatory multi-view remote sensing image classification method
Abstract
The invention discloses a self-explanatory multi-view remote sensing image classification method. The method comprises: acquiring multi-view remote sensing images and preprocessing them; inputting the preprocessed remote sensing images into an interpreter to generate a pixel-level attribution mask for each view; performing a masking operation on the remote sensing image of each view based on that view's pixel-level attribution mask, generating a foreground-preserved image and a complementary background image for each view; inputting the foreground-preserved images of all views into a main classifier to obtain a prediction probability distribution over scene categories; inputting the complementary background images of all views into an auxiliary classifier to obtain a background-interference prediction result; and updating the parameters of the interpreter and the main classifier through back propagation based on a total loss function, until a preset convergence condition is met. The method effectively overcomes the inherent shortcomings of existing non-interpretable models and post-hoc interpretation methods in reliability, computational efficiency, and scene generalization.
Inventors
- ZHANG HAOPENG
- YAO XUDONG
- JIANG ZHIGUO
- SUN DONGDONG
Assignees
- Tianmushan Laboratory (天目山实验室)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2026-04-07
Claims (9)
- 1. A self-explanatory multi-view remote sensing image classification method, comprising: acquiring multi-view remote sensing images and preprocessing them; inputting the preprocessed remote sensing images into an interpreter to generate a pixel-level attribution mask for each view; performing a masking operation on the remote sensing image of each view based on that view's pixel-level attribution mask, generating a foreground-preserved image and a complementary background image for each view; inputting the foreground-preserved images of all views into a main classifier to obtain a prediction probability distribution over scene categories, and inputting the complementary background images of all views into an auxiliary classifier to obtain a background-interference prediction result; constructing a total loss function based on the prediction probability distribution, the background-interference prediction result, and the pixel-level attribution masks; and updating parameters of the interpreter and the main classifier through back propagation based on the total loss function until a preset convergence condition is met.
- 2. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein the multi-view remote sensing images include an aerial-view image and a ground-view image.
- 3. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein the preprocessing includes resizing the multi-view remote sensing images to a uniform size.
- 4. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein the interpreter adopts a multi-path parallel architecture, providing an independent interpretation network for the remote sensing image of each view so as to generate the pixel-level attribution masks of all views in parallel (a sketch of such an interpreter follows the claims).
- 5. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein the interpreter is implemented using a semantic segmentation network architecture.
- 6. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein performing the masking operation on the remote sensing image of each view based on that view's pixel-level attribution mask, to generate a foreground-preserved image and a complementary background image for each view, specifically comprises: processing the remote sensing image of each view with that view's pixel-level attribution mask, preserving the high-response regions and replacing the removed regions with a preset base reference image, to obtain the foreground-preserved image of the corresponding view; and processing the remote sensing image of each view with the inverse of its pixel-level attribution mask, preserving the background regions and replacing the removed regions with the preset base reference image, to obtain the complementary background image of the corresponding view (see the masking sketch after the claims).
- 7. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein parameters of the auxiliary classifier remain frozen during training.
- 8. The self-explanatory multi-view remote sensing image classification method as claimed in claim 1, wherein the main classifier and the auxiliary classifier adopt an early fusion strategy: low-level features are extracted from the foreground-preserved image of each view, the per-view low-level features are concatenated along the channel dimension, and the concatenated features are input into a backbone network for high-level semantic abstraction and classification (see the fusion sketch after the claims).
- 9. The self-explanatory multi-view remote sensing image classification method of claim 2, wherein the total loss function includes a sufficiency loss, a completeness loss, and a compactness loss, expressed as (see the loss sketch after the claims):
  $L_{\mathrm{total}} = \lambda_1 L_{\mathrm{sufficiency}} + \lambda_2 L_{\mathrm{completeness}} + \lambda_3 L_{\mathrm{compactness}}$
  $L_{\mathrm{sufficiency}} = L_{\mathrm{CE}}\big(F(x_a^{fg}, x_g^{fg}),\, t\big)$
  $L_{\mathrm{completeness}} = -L_{\mathrm{CE}}\big(G(x_a^{bg}, x_g^{bg}),\, t\big)$
  $L_{\mathrm{compactness}} = \max\Big(0,\ \frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W}\mathrm{Mask}(i,j) - \tau\Big)$
  wherein $L_{\mathrm{total}}$ represents the total loss function; $L_{\mathrm{sufficiency}}$, $L_{\mathrm{completeness}}$, and $L_{\mathrm{compactness}}$ represent the sufficiency loss, the completeness loss, and the compactness loss; $\lambda_1$, $\lambda_2$, and $\lambda_3$ represent their respective weight coefficients; $L_{\mathrm{CE}}$ represents the standard cross entropy loss; $F(\cdot)$ represents the prediction process of the foreground-preserved images after being input into the main classifier, $x_a^{fg}$ the foreground-preserved image of the aerial-view image, $x_g^{fg}$ the foreground-preserved image of the ground-view image, and $t$ the ground-truth label; $G(\cdot)$ represents the prediction process of the complementary background images after being input into the auxiliary classifier, $x_a^{bg}$ the complementary background image of the aerial-view image, and $x_g^{bg}$ that of the ground-view image; $\mathrm{Mask}(i,j)$ represents the value of the pixel-level attribution mask at position $(i,j)$, $H$ the height of the mask, and $W$ the width of the mask; and $\tau$ represents a preset threshold.
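The following non-limiting PyTorch sketches illustrate the claimed components; all class and function names, layer widths, and defaults are our assumptions rather than text from the patent. First, the multi-path interpreter of claims 4 and 5: one independent segmentation-style network per view, each emitting a soft pixel-level attribution mask in [0, 1], with the paths evaluated independently of one another (and hence parallelizable).

```python
import torch.nn as nn

class MultiViewInterpreter(nn.Module):
    """Illustrative multi-path interpreter (claims 4-5): one independent
    segmentation-style network per view, each producing a pixel-level
    attribution mask in [0, 1] at the input resolution."""

    def __init__(self, num_views=2, width=32):
        super().__init__()
        # A tiny encoder-decoder stands in for a full semantic segmentation
        # network; the patent does not fix a specific architecture.
        def make_path():
            return nn.Sequential(
                nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
                nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
                nn.Conv2d(width, 1, 1),
                nn.Sigmoid())  # soft mask in [0, 1]
        self.paths = nn.ModuleList([make_path() for _ in range(num_views)])

    def forward(self, views):
        # views: list of (B, 3, H, W) tensors, one per viewing angle;
        # each path runs independently of the others.
        return [path(x) for path, x in zip(self.paths, views)]
```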
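Next, the masking operation of claim 6 amounts to two complementary blends of the input image with the preset base reference image. A minimal sketch, assuming soft masks in [0, 1] that broadcast over the channel dimension; the function name and tensor layout are our assumptions.

```python
import torch

def apply_attribution_mask(image, mask, base):
    """Generate the foreground-preserved and complementary background images
    of claim 6 for one view.

    image: (B, C, H, W) remote sensing image of the view
    mask:  (B, 1, H, W) pixel-level attribution mask in [0, 1]
    base:  (B, C, H, W) preset base reference image filling removed regions
    """
    # Foreground-preserved image: keep high-response regions and replace
    # the removed regions with the base reference image.
    foreground = mask * image + (1.0 - mask) * base
    # Complementary background image: the inverse mask keeps the background,
    # again filling the removed regions with the base reference image.
    background = (1.0 - mask) * image + mask * base
    return foreground, background
```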
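The early fusion strategy of claim 8 can be sketched as one shallow stem per view extracting low-level features, a channel-wise concatenation, and a shared backbone for high-level semantic abstraction. All layers and widths are placeholders.

```python
import torch
import torch.nn as nn

class EarlyFusionClassifier(nn.Module):
    """Illustrative early-fusion classifier (claim 8) for two views."""

    def __init__(self, num_classes, stem_channels=64):
        super().__init__()
        def make_stem():  # shallow stem: low-level features of one view
            return nn.Sequential(
                nn.Conv2d(3, stem_channels, 7, stride=2, padding=3),
                nn.BatchNorm2d(stem_channels), nn.ReLU(inplace=True))
        self.stem_aerial, self.stem_ground = make_stem(), make_stem()
        # The shared backbone consumes the channel-wise concatenation.
        self.backbone = nn.Sequential(
            nn.Conv2d(2 * stem_channels, 128, 3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes))

    def forward(self, x_aerial, x_ground):
        f_a = self.stem_aerial(x_aerial)      # low-level aerial features
        f_g = self.stem_ground(x_ground)      # low-level ground features
        fused = torch.cat([f_a, f_g], dim=1)  # concatenate along channels
        return self.backbone(fused)           # logits over scene categories
```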
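Finally, a hedged sketch of the three-term objective of claim 9. The sufficiency term is the standard cross entropy on the main classifier's foreground prediction; the sign of the completeness term (driving the cross entropy on the background up, so the background stays uninformative) and the hinge form of the compactness term are our reading of the claim, not verbatim from the patent.

```python
import torch
import torch.nn.functional as F

def total_loss(main_logits_fg, aux_logits_bg, masks, targets,
               lam1=1.0, lam2=1.0, lam3=1.0, tau=0.5):
    """Illustrative three-term objective (claim 9).

    main_logits_fg: main-classifier logits for the fused foreground images
    aux_logits_bg:  frozen auxiliary-classifier logits for the backgrounds
    masks:          list of (B, 1, H, W) attribution masks, one per view
    targets:        ground-truth scene labels
    tau:            preset threshold on the mean mask activation
    """
    # Sufficiency: the foreground alone must support correct classification.
    l_suff = F.cross_entropy(main_logits_fg, targets)
    # Completeness: the complementary background should not be classifiable;
    # the negative sign (maximizing background CE) is our assumption.
    l_comp = -F.cross_entropy(aux_logits_bg, targets)
    # Compactness: penalize mean mask activation exceeding the threshold tau.
    mean_act = torch.stack([m.mean() for m in masks]).mean()
    l_cpt = torch.clamp(mean_act - tau, min=0.0)
    return lam1 * l_suff + lam2 * l_comp + lam3 * l_cpt
```

Consistent with claims 1 and 7, only the interpreter and the main classifier would receive gradient updates, e.g. an optimizer built over `list(interpreter.parameters()) + list(main_classifier.parameters())`, while the auxiliary classifier's parameters stay frozen.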
Description
Self-explanatory multi-view remote sensing image classification method

Technical Field
The invention belongs to the technical fields of remote sensing image processing, computer vision, and pattern recognition, and in particular relates to a self-explanatory multi-view remote sensing image classification method.

Background
With the rapid development of remote sensing technology, the volume of remote sensing image data has grown explosively, playing a key role in fields such as land-cover mapping, environmental monitoring, and natural resource surveys. Remote sensing image scene classification aims to extract high-level semantic features from high-resolution images and determine their categories, and is a cornerstone of intelligent remote sensing data interpretation. Traditional scene classification methods rely mainly on a single overhead view and show significant limitations in complex scenes, such as vegetation occlusion, cloud interference, and a lack of fine-grained ground texture detail; the aerial view alone often fails to capture the key discriminative features, increasing prediction uncertainty.

To overcome the limitations of a single view, multi-view remote sensing image classification techniques that fuse aerial and ground views have emerged. In the prior art, complementary ground-view information is introduced and an early or late fusion strategy is used to effectively supplement the vertical structural characteristics of targets, significantly improving classification accuracy. However, despite the great success of existing deep-learning-based multi-view classification models in predictive performance, they essentially operate as opaque "black boxes". In high-risk, safety-critical applications such as disaster assessment and military reconnaissance, users care not only about prediction accuracy but, even more urgently, about understanding the basis of the model's decisions. This lack of transparency leaves users unable to determine whether a decision is based on the discriminative features of the target or on background noise, severely impairing trust in the model. Studying the interpretability of multi-view classification decision processes has therefore become a critical issue to be addressed.

Current model interpretability methods mainly adopt a "post-hoc interpretation" strategy: after model training is completed, an attribution map is generated by perturbing the input image or analyzing gradients and activation values, and the contribution of each pixel to the prediction result is displayed as a heat map. For example, perturbation-based methods locate critical regions by occluding specific areas of the input image and monitoring the resulting output changes.

Existing multi-view remote sensing image classification and interpretation methods have the following notable shortcomings. ① Existing models lack inherent interpretability: the currently mainstream multi-view classification models (such as multi-branch fusion networks built on convolutional networks) only output classification labels and cannot provide a decision basis in real time during inference. While post-hoc interpretation tools can be applied, they do not fundamentally resolve the opacity of the model's internal decision mechanism, which limits deployment in remote sensing applications requiring high reliability. ② Existing perturbation-based post-hoc interpretation methods must mask the input image many times and repeatedly run forward inference to observe output changes, producing enormous computational redundancy on high-resolution, multi-view remote sensing data and making real-time processing difficult. ③ Out-of-distribution inputs cause inaccurate attribution: post-hoc methods often introduce unnatural artifacts when occluding or perturbing an image, so the input deviates from the distribution seen during training. Such distribution shift can induce abnormal model responses, and the resulting attribution map then fails to faithfully reflect the model's true regions of interest under the original input. ④ Interpretation and classification performance are decoupled: the prior art generally treats "classification" and "interpretation" as two separate steps, and the interpretation process does not participate in the model's training optimization. The model is therefore never guided to focus on semantically meaningful discriminative regions and may classify using shortcut features in the background, achieving adequate accuracy but poor generalization and robustness.