CN-121980235-A - Cross-modal fusion-based local climate zone classification method
Abstract
The disclosure provides a local climate zone classification method based on cross-modal fusion, and relates to the technical field of geographic information science. In a specific embodiment, data of three modalities are respectively input into feature encoders that have the same structure but do not share parameters; each feature encoder is composed of a multi-scale convolution module and N channel-spatial attention residual modules cascaded in sequence. The N channel-spatial attention residual modules perform, on the output features of the multi-scale convolution module, residual feature extraction operations in which channel-dimension and spatial-dimension attention are applied serially in sequence, yielding the l-th layer features of the three modalities of data. Through N cascaded cross-modal Transformer modules, progressive explicit cross-modal fusion operations are performed on the l-th layer features of the three modalities of data to obtain the l-th layer fusion features. Finally, a Token-level feature fusion module performs adaptive weighting and fusion operations in the Token dimension on the N-th layer features of the multispectral satellite remote sensing image data and the N-th layer fusion features, generating a feature representation with global discrimination capability for classification. According to the technical scheme, the classification precision of local climate zones can be improved.
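The overall data flow of the abstract can be illustrated with a minimal NumPy sketch. Everything here is an assumption for illustration: the token/channel dimensions, the placeholder `encoder` (standing in for the multi-scale convolution plus attention residual encoder), and the simplified single-layer fusion are not the patent's implementation; 17 is the standard number of LCZ classes.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative shapes: T tokens, C channels per token, 17 standard LCZ classes.
T, C, n_classes = 16, 32, 17

def encoder(x, w):
    # Placeholder for the multi-scale convolution + attention residual encoder:
    # same structure for every modality, separate (non-shared) weights w.
    return np.tanh(x @ w)

# Three modalities: multispectral image (MS), land cover (LC), building height (BH).
x_ms, x_lc, x_bh = (rng.normal(size=(T, C)) for _ in range(3))
w_ms, w_lc, w_bh = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))
f_ms, f_lc, f_bh = encoder(x_ms, w_ms), encoder(x_lc, w_lc), encoder(x_bh, w_bh)

def cross_modal_fuse(main, aux1, aux2):
    # Stand-in for one cross-modal Transformer layer: the main modality
    # queries the two auxiliary modalities; the results are summed.
    d = main.shape[-1]
    a1 = softmax(main @ aux1.T / np.sqrt(d)) @ aux1
    a2 = softmax(main @ aux2.T / np.sqrt(d)) @ aux2
    return main + a1 + a2

fused = cross_modal_fuse(f_ms, f_lc, f_bh)

def token_fuse(f_ms, f_fused):
    # Stand-in for the Token-level fusion module: global descriptors per
    # branch, softmax weights over branches, weighted sum.
    g = np.stack([f_ms.mean(axis=0), f_fused.mean(axis=0)])  # (2, C)
    w = softmax(g.sum(axis=1))                                # (2,)
    return (w[:, None] * g).sum(axis=0)                       # (C,)

z = token_fuse(f_ms, fused)
w_cls = rng.normal(size=(C, n_classes)) / np.sqrt(C)
probs = softmax(z @ w_cls)       # per-class prediction probabilities
print(probs.shape)
```

The sketch only fixes the shapes and the order of operations (encode, cross-modal fusion, token fusion, classification head); the actual modules are detailed in the claims.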
Inventors
- SU CHEN
- ZHANG LINLIN
- MENG QINGYAN
- WU JIAHAO
- HU XINLI
Assignees
- Aerospace Information Research Institute, Chinese Academy of Sciences (中国科学院空天信息创新研究院)
Dates
- Publication Date
- 20260505
- Application Date
- 20251223
Claims (4)
- 1. A local climate zone classification method based on cross-modal fusion, characterized by comprising the following steps: inputting multispectral satellite remote sensing image data, land cover data and building height data respectively into feature encoders that have the same structure but do not share parameters, each feature encoder being composed of a multi-scale convolution module and N channel-spatial attention residual modules cascaded in sequence; performing, by the N channel-spatial attention residual modules, residual feature extraction operations on the output features of the multi-scale convolution module in which channel-dimension and spatial-dimension attention are applied serially in sequence, to obtain the l-th layer features of the three modalities of data, where l = 1, 2, …, N and N is a positive integer; performing, through N cascaded cross-modal Transformer modules, progressive explicit cross-modal fusion operations on the l-th layer features of the three modalities of data to obtain the l-th layer fusion features, where the l-th layer fusion features are obtained by fusing the (l-1)-th layer fusion features with the l-th layer features of the three modalities of data; and performing, through a Token-level feature fusion module, adaptive weighting and fusion operations in the Token dimension on the N-th layer features of the multispectral satellite remote sensing image data and the N-th layer fusion features, generating a feature representation with global discrimination capability, and feeding the feature representation with global discrimination capability into a classification head to obtain the prediction probability of each category.
- 2. The method of claim 1, wherein the channel-spatial attention residual module comprises two CBR blocks, a channel-spatial attention module and a residual connection unit, wherein the channel-spatial attention module comprises a cascaded channel attention sub-module and spatial attention sub-module, wherein the channel attention sub-module comprises a spatial-dimension pooling layer, an addition unit, a Softmax layer, a multiplication unit and a residual connection unit, and wherein the spatial attention sub-module comprises a channel-dimension pooling layer, an addition unit, a flattening layer, a Softmax layer and a multiplication unit; correspondingly, the residual feature extraction operation with channel-dimension and spatial-dimension attention applied in series comprises the following steps: performing two rounds of sequential convolution, batch normalization and linear activation operations on the current-layer features through the two CBR blocks to obtain convolution-normalization-activation features; performing, through the channel-spatial attention module, serial attention operations in the channel dimension and the spatial dimension on the convolution-normalization-activation features, comprising: performing spatial-dimension global average pooling and global maximum pooling operations on the convolution-normalization-activation features through the spatial-dimension pooling layer to obtain a spatial global average feature and a spatial global peak feature; adding the spatial global average feature and the spatial global peak feature element by element and processing the sum through a Softmax function, via the addition unit and Softmax layer of the channel attention sub-module, to obtain the channel attention weights; multiplying the channel attention weights with the convolution-normalization-activation features element by element, and adding the product to the convolution-normalization-activation features element by element, via the multiplication unit and residual connection unit of the channel attention sub-module, to obtain channel-attention residual-enhanced features; performing channel-dimension global average pooling and global maximum pooling operations on the channel-attention residual-enhanced features through the channel-dimension pooling layer to obtain a channel global average feature and a channel global peak feature; adding the channel global average feature and the channel global peak feature element by element, flattening the sum, and processing it through a Softmax function, via the addition unit, flattening layer and Softmax layer of the spatial attention sub-module, to obtain the spatial attention weights; multiplying the spatial attention weights with the channel-attention residual-enhanced features element by element through the multiplication unit of the spatial attention sub-module to obtain channel-spatial attention adaptively enhanced features; and adding the current-layer features, the convolution-normalization-activation features and the channel-spatial attention adaptively enhanced features element by element through the residual connection unit to obtain the next-layer features.
- 3. The method of claim 2, wherein the cross-modal Transformer module comprises a concatenation projection layer, a first layer-normalization unit, a cross-attention sub-module, a first Dropout/DropPath layer, a first residual connection unit, a second layer-normalization unit, a two-layer feedforward neural network with a GELU activation function, a second Dropout/DropPath layer and a second residual connection unit, and wherein the cross-attention sub-module comprises a first fully connected layer, a dual-path parallel scaled dot-product attention mechanism, a second fully connected layer and an addition unit; correspondingly, the cross-modal explicit fusion operation comprises the following steps: concatenating the current-layer features of the multispectral satellite remote sensing image data with the previous-layer fusion features through the concatenation projection layer to obtain main-modality features, wherein if no previous-layer fusion features exist, the main-modality features are the current-layer features of the multispectral satellite remote sensing image data; performing layer normalization on the main-modality features, the current-layer features of the land cover data and the current-layer features of the building height data through the first layer-normalization unit; taking the layer-normalized current-layer features of the land cover data and of the building height data as auxiliary-modality features, and explicitly modeling and fusing the interaction between the main-modality features and the auxiliary-modality features through the cross-attention sub-module, comprising: performing a fully connected operation on the main-modality features and the auxiliary-modality features through the first fully connected layer; using the main-modality features as the query and the two auxiliary-modality features respectively as the key and value of one path of the dual-path parallel scaled dot-product attention mechanism, and obtaining two cross-modal fusion sub-features based on the scaled dot-product attention formula; performing a fully connected operation on the two cross-modal fusion sub-features through the second fully connected layer; adding the two fully connected cross-modal fusion sub-features element by element through the addition unit of the cross-attention sub-module to obtain the cross-modal fusion features; regularizing the cross-modal fusion features through the first Dropout/DropPath layer; adding the regularized cross-modal fusion features and the main-modality features element by element through the first residual connection unit to obtain residual cross-modal fusion features; performing layer normalization on the residual cross-modal fusion features through the second layer-normalization unit; applying the GELU-activated two-layer feedforward neural network to the layer-normalized residual cross-modal fusion features to obtain residual cross-modal fusion enhanced features; regularizing the residual cross-modal fusion enhanced features through the second Dropout/DropPath layer; and adding the regularized residual cross-modal fusion enhanced features and the residual cross-modal fusion features element by element through the second residual connection unit to obtain the current-layer fusion features.
- 4. The method of claim 3, wherein the Token-level feature fusion module comprises a Token-dimension pooling layer, a channel stacking unit, a parameter-sharing two-layer perceptron, a Softmax layer and a multiplication unit; correspondingly, the adaptive weighting and fusion operation in the Token dimension comprises the following steps: performing Token-dimension global average pooling operations on the N-th layer features of the multispectral satellite remote sensing image data and on the N-th layer fusion features through the Token-dimension pooling layer, to obtain global descriptors of the multispectral satellite remote sensing image data and of the fusion features; stacking the global descriptors of the multispectral satellite remote sensing image data and of the fusion features along the channel dimension through the channel stacking unit, processing the stacked result with the parameter-sharing two-layer perceptron, and then applying the Softmax function through the Softmax layer to obtain the fusion weights; and multiplying the fusion weights and the global descriptors element by element through the multiplication unit to obtain the feature representation with global discrimination capability.
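The channel-spatial attention residual operation of claim 2 can be sketched in NumPy. This is a shape-level illustration only: the feature-map size `(C, H, W)` is arbitrary, the CBR output is simulated with random values, and the convolution/batch-normalization steps themselves are omitted.

```python
import numpy as np

def softmax(x):
    # Softmax over a flat vector, numerically stabilized.
    e = np.exp(x - x.max())
    return e / e.sum()

def channel_spatial_attention_residual(x_prev, x_cbr):
    # x_prev: current-layer input features; x_cbr: output of the two CBR
    # (convolution, batch normalization, activation) blocks, shape (C, H, W).
    C, H, W = x_cbr.shape

    # Channel attention: spatial global average + max pooling, element-wise
    # sum, Softmax -> channel weights; multiply, then residual add.
    ch_avg = x_cbr.mean(axis=(1, 2))                 # (C,)
    ch_max = x_cbr.max(axis=(1, 2))                  # (C,)
    ch_w = softmax(ch_avg + ch_max)                  # (C,)
    ch_out = x_cbr * ch_w[:, None, None] + x_cbr     # channel-attention residual

    # Spatial attention: channel-wise average + max pooling, sum, flatten,
    # Softmax -> spatial weights; multiply.
    sp_avg = ch_out.mean(axis=0)                     # (H, W)
    sp_max = ch_out.max(axis=0)                      # (H, W)
    sp_w = softmax((sp_avg + sp_max).ravel()).reshape(H, W)
    enhanced = ch_out * sp_w[None, :, :]

    # Final residual: input features + CBR features + enhanced features.
    return x_prev + x_cbr + enhanced

rng = np.random.default_rng(1)
x_prev = rng.normal(size=(8, 4, 4))
x_cbr = rng.normal(size=(8, 4, 4))
y = channel_spatial_attention_residual(x_prev, x_cbr)
print(y.shape)
```

Note the triple residual sum at the end mirrors the last step of claim 2, where the current-layer features, the CBR output and the attention-enhanced features are added element by element.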
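The dual-path scaled dot-product cross-attention at the core of claim 3 can be sketched as follows. The query/key/value projection layers, layer normalization and Dropout/DropPath regularization are omitted for brevity; dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_path_cross_attention(q_main, kv_lc, kv_bh):
    # q_main: main-modality features (T, d), used as the query;
    # kv_lc, kv_bh: the two auxiliary modalities, each serving as the key
    # and value of one parallel path (claim 3).
    d = q_main.shape[-1]
    path1 = softmax(q_main @ kv_lc.T / np.sqrt(d)) @ kv_lc  # sub-feature 1
    path2 = softmax(q_main @ kv_bh.T / np.sqrt(d)) @ kv_bh  # sub-feature 2
    fused = path1 + path2      # element-wise addition of the two paths
    return q_main + fused      # first residual connection

rng = np.random.default_rng(2)
q = rng.normal(size=(16, 32))
out = dual_path_cross_attention(q,
                                rng.normal(size=(16, 32)),
                                rng.normal(size=(16, 32)))
print(out.shape)
```

Each path is a standard scaled dot-product attention; the design lets the multispectral branch explicitly query the land cover and building height branches in parallel before their contributions are merged.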
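The Token-dimension adaptive weighting of claim 4 can be sketched as below. The shared two-layer perceptron is reduced to two weight matrices with a ReLU in between, and the Softmax is taken across the two branches; both choices are illustrative assumptions, as the claim does not fix the MLP width or the Softmax axis.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_level_fusion(f_ms, f_fused, w1, w2):
    # f_ms, f_fused: (T, C) token features of the multispectral branch and
    # the fused branch. w1, w2: weights of the parameter-sharing two-layer
    # perceptron, applied identically to both global descriptors.
    g = np.stack([f_ms.mean(axis=0), f_fused.mean(axis=0)])  # (2, C) descriptors
    h = np.maximum(g @ w1, 0.0) @ w2                          # shared MLP, (2, C)
    w = softmax(h, axis=0)                                    # weights across branches
    return (w * g).sum(axis=0)                                # (C,) fused representation

rng = np.random.default_rng(3)
T, Ch = 16, 32
out = token_level_fusion(rng.normal(size=(T, Ch)),
                         rng.normal(size=(T, Ch)),
                         rng.normal(size=(Ch, Ch)) / np.sqrt(Ch),
                         rng.normal(size=(Ch, Ch)) / np.sqrt(Ch))
print(out.shape)
```

The Token-dimension global average pooling collapses each branch to one descriptor, so the learned weights select adaptively, per channel, between the raw multispectral features and the cross-modal fusion features.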
Description
Cross-modal fusion-based local climate zone classification method
Technical Field
The disclosure relates to the technical field of geographic information science, and in particular to a local climate zone classification method based on cross-modal fusion.
Background
Currently, local climate zone (Local Climate Zone, LCZ) classification and mapping methods fall mainly into three types: remote sensing (RS) based methods, geographic information system (GIS) based methods, and hybrid methods fusing RS and GIS. RS methods remain dominant and can be further subdivided, according to processing scale, into pixel-level, object-level and scene-level methods. Pixel-level methods rely on the spectral features of individual pixels for classification; however, such methods typically ignore spatial context information and are susceptible to noise interference. Object-level methods generate spatially continuous objects through image segmentation, thereby enhancing spatial consistency. Scene-level methods take a whole image as input and extract high-level spatial features by means of deep learning models such as convolutional neural networks (Convolutional Neural Networks, CNNs) or Transformers, and have become a current research hotspot. These methods can automatically learn spatial patterns in images, overcoming the limitations of traditional machine learning methods in feature extraction. For example, a CNN-based model raised LCZ classification accuracy to over 80% in 8 German cities, showing its great potential for high-accuracy classification. The MSMLA-Net model enhances spectral and spatial feature expression by introducing a multi-scale attention module, and outperformed three existing mainstream LCZ classifiers in tests on 6 Korean cities.
The LCZNet model adopts multi-scale convolution filters to extract spatial features and combines them with an SE residual module to enhance channel feature fusion; its classification accuracy improved by about 20% over the WUDAPT standard in tests on three major economic regions of China. The rise of the Transformer architecture has brought breakthroughs in many fields. The Transformer has strong global modeling capability, can capture long-range dependencies through the self-attention mechanism, and shows excellent performance in various tasks. In the field of LCZ classification, Lin et al. proposed a multimodal Transformer network dedicated to LCZ classification. The network effectively captures hierarchical contextual relations through a multi-scale embedding mechanism, and further improves feature discrimination and classification precision through multi-modal fusion. Zhong et al. constructed the LCZ-MFKNet model based on the So2Sat LCZ42 dataset, fusing multi-level features with prior knowledge: global features are extracted with a Swin Transformer, local features with SM-ResNet, and feature fusion is completed through an improved iSE module, realizing efficient and accurate LCZ classification. By integrating multi-source heterogeneous information and multi-scale semantic features, Transformer-based models markedly improve the precision and efficiency of LCZ mapping and show broad application prospects in urban climate zoning. GIS methods mainly classify by physical parameters such as building density, average height and sky view factor; they are strongly interpretable but limited by their dependence on high-quality multi-source data. To overcome the limitations of any single method, hybrid methods fusing RS and GIS have gradually developed in recent years.
For example, some studies first use RS means for preliminary classification and then introduce GIS parameters for fine correction, while others use RS and GIS to independently classify land cover type and building type, respectively, and finally integrate the two results to improve overall performance. While RS-driven deep learning approaches have achieved significant results in LCZ classification, bottlenecks remain in multi-source data fusion. On the one hand, most current methods still adopt shallow feature-level concatenation, making it difficult to fully mine the deep semantic correlations among multiple modalities and easily introducing redundant or noisy information; on the other hand, an effective cross-modal collaboration mechanism is lacking, so information from different data sources is difficult to make complementary, limiting the generalization capability of models in complex urban environments. At present, the common multi-modal fusion methods in LCZ classification mapping