CN-120612604-B - Multi-modal collaborative feature classification method and system based on masked contrastive learning
Abstract
The invention discloses a multi-modal collaborative feature classification method and system based on masked contrastive learning, belonging to the technical field of remote sensing image classification. The method comprises: acquiring hyperspectral data and LiDAR point cloud data, and deriving hyperspectral image blocks and LiDAR image blocks from them; extracting features from the hyperspectral image blocks and the LiDAR image blocks to obtain hyperspectral image feature blocks and LiDAR image feature blocks; fusing the hyperspectral image feature blocks and the LiDAR image feature blocks to obtain hyperspectral image features, LiDAR image features, a first cross-modal attention map and a second cross-modal attention map; obtaining a first mask fusion feature and a second mask fusion feature from these; calculating a contrastive loss based on the hyperspectral image features, the LiDAR image features, the first mask fusion feature and the second mask fusion feature, and updating the model parameters; and classifying the multi-modal data of the ground objects to be classified with the parameter-updated model. The invention improves the accuracy and stability of the cross-modal learning task.
Inventors
- LIU SICHENG
- XIA ZHANGUO
- PING HUAN
- LIU QIHAN
- LI CHUNLEI
Assignees
- China University of Mining and Technology (中国矿业大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2025-06-04
Claims (7)
- 1. A multi-modal collaborative feature classification method based on masked contrastive learning, characterized by comprising the following steps: S1, acquiring hyperspectral data and LiDAR point cloud data, and preprocessing them to obtain a hyperspectral image block and a LiDAR image block; S2, adopting a multi-scale feature extraction module to respectively extract features from the hyperspectral image block and the LiDAR image block to obtain a hyperspectral image feature block and a LiDAR image feature block; S3, respectively carrying out fusion processing on the hyperspectral image feature block and the LiDAR image feature block by adopting a CGAFT module to obtain hyperspectral image features, LiDAR image features, a first cross-modal attention map and a second cross-modal attention map; S4, obtaining a first mask fusion feature and a second mask fusion feature based on the hyperspectral image feature block, the LiDAR image feature block, the first cross-modal attention map and the second cross-modal attention map; S5, calculating a contrastive loss based on the hyperspectral image features, the LiDAR image features, the first mask fusion feature and the second mask fusion feature, and updating the model parameters in combination with a cross-entropy function, wherein the model comprises the multi-scale feature extraction module and the CGAFT module; S6, classifying the multi-modal data of the ground objects to be classified with the parameter-updated model to obtain a classification result. The first mask fusion feature and the second mask fusion feature are obtained as follows: calculating a combined activation strength for each feature location based on the first cross-modal attention map and the second cross-modal attention map; applying the encoder of the CGAFT module to the hyperspectral image feature block and the LiDAR image feature block M times respectively, and feeding the encoder output into an FC network to obtain a feature representation F with c channels; computing a query Q, a key K and a value V from the feature representation F, and computing an attention score from Q, K and V; converting the attention score into a probability distribution reflecting the probability that features with high combined activation strength are masked; and masking the hyperspectral image feature block and the LiDAR image feature block according to this masking strategy to obtain the first mask fusion feature and the second mask fusion feature.
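The attention-guided masking step of claim 1 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the per-location attention scores, the mask ratio, and zero-filling as the mask value are all assumptions made here for clarity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_guided_mask(feat, attn1, attn2, mask_ratio=0.25, seed=0):
    """Mask feature locations with probability proportional to their
    combined cross-modal activation strength (claim 1, step S4)."""
    strength = attn1 + attn2              # combined activation strength per location
    probs = softmax(strength)             # probability that a location is masked
    rng = np.random.default_rng(seed)
    n_mask = int(mask_ratio * len(probs))
    idx = rng.choice(len(probs), size=n_mask, replace=False, p=probs)
    masked = feat.copy()
    masked[idx] = 0.0                     # zero-fill assumed as the mask value
    return masked, idx
```

Locations with high combined activation thus have a higher chance of being masked, which is the stated goal of converting the attention score into a masking probability distribution.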
- 2. The multi-modal collaborative feature classification method based on masked contrastive learning of claim 1, wherein the preprocessing comprises spatial alignment, format unification and resampling, enhancement and normalization, and dicing operations.
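The preprocessing of claim 2 can be sketched as below. The patch size, min-max normalization, and per-pixel dicing are illustrative assumptions; the sketch also assumes the spatial alignment and resampling of the two modalities have already been done.

```python
import numpy as np

def preprocess(hsi, lidar, patch=11):
    """Normalize each modality to [0, 1] and dice into co-registered
    patches (claim 2). Assumes hsi: (H, W, bands) and lidar: (H, W)
    are already spatially aligned and resampled to the same grid."""
    hsi = (hsi - hsi.min()) / (hsi.max() - hsi.min() + 1e-12)
    lidar = (lidar - lidar.min()) / (lidar.max() - lidar.min() + 1e-12)
    r = patch // 2
    H, W = lidar.shape
    hsi_blocks, lidar_blocks = [], []
    for y in range(r, H - r):          # dicing: one patch per interior pixel
        for x in range(r, W - r):
            hsi_blocks.append(hsi[y - r:y + r + 1, x - r:x + r + 1])
            lidar_blocks.append(lidar[y - r:y + r + 1, x - r:x + r + 1])
    return np.stack(hsi_blocks), np.stack(lidar_blocks)
```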
- 3. The multi-modal collaborative feature classification method based on masked contrastive learning according to claim 1, wherein the hyperspectral image feature block is obtained as follows: carrying out a convolution operation on the hyperspectral image block using a 3D convolution with a kernel size of 9×3×3, then simultaneously performing feature extraction with a group of two-dimensional convolutions with kernel sizes of 1×1, 3×3 and 5×5, and finally fusing the outputs of the two-dimensional convolutions by element-wise addition to obtain the hyperspectral image feature block.
- 4. The multi-modal collaborative feature classification method based on masked contrastive learning according to claim 1, wherein the LiDAR image feature block is obtained as follows: carrying out a convolution operation on the LiDAR image block using a two-dimensional convolution with a kernel size of 3×3, then simultaneously performing feature extraction with a group of two-dimensional convolutions with kernel sizes of 1×1, 3×3 and 5×5, and finally fusing the outputs of the two-dimensional convolutions by element-wise addition to obtain the LiDAR image feature block.
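The multi-scale structure shared by claims 3 and 4 (parallel 1×1 / 3×3 / 5×5 convolutions fused by element-wise addition) can be illustrated with a single-channel numpy sketch. The naive "same"-padding convolution helper and the random kernels are illustrative assumptions; a real implementation would use learned multi-channel kernels.

```python
import numpy as np

def conv2d(x, k):
    """Naive single-channel 2D convolution with 'same' zero padding."""
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    H, W = x.shape
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * k)
    return out

def multiscale_block(x, seed=0):
    """Parallel 1x1 / 3x3 / 5x5 convolutions fused by element-wise
    addition, as in claims 3 and 4 (kernels randomly initialized here)."""
    rng = np.random.default_rng(seed)
    k1 = rng.normal(size=(1, 1))
    k3 = rng.normal(size=(3, 3))
    k5 = rng.normal(size=(5, 5))
    # element-wise addition fuses the three receptive-field scales
    return conv2d(x, k1) + conv2d(x, k3) + conv2d(x, k5)
```

Because all three branches use "same" padding, their outputs share the input's spatial size, which is what makes the element-wise addition well defined.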
- 5. The multi-modal collaborative feature classification method based on masked contrastive learning according to claim 1, wherein the hyperspectral image features and the LiDAR image features are obtained as follows: flattening the hyperspectral image feature block and the LiDAR image feature block and stitching each with a learnable classification token to generate a first feature sequence and a second feature sequence; and adding a learnable position embedding to the first feature sequence and the second feature sequence to obtain the hyperspectral image features and the LiDAR image features.
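The tokenization in claim 5 can be sketched as follows; the tensor layout (channels-first feature block) and the shapes of the classification token and position embedding are assumptions for illustration.

```python
import numpy as np

def tokenize(feat, cls_token, pos_embed):
    """Flatten a (C, H, W) feature block into H*W tokens, prepend a
    learnable classification token, and add a learnable position
    embedding (claim 5)."""
    C, H, W = feat.shape
    tokens = feat.reshape(C, H * W).T            # (H*W, C): one token per location
    tokens = np.vstack([cls_token, tokens])      # stitch the classification token
    return tokens + pos_embed                    # add the position embedding
```

In training, `cls_token` and `pos_embed` would be learnable parameters updated by backpropagation; here they are plain arrays supplied by the caller.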
- 6. The multi-modal collaborative feature classification method based on masked contrastive learning according to claim 1, wherein the model parameters are updated based on the contrastive loss as follows:

$L_{total} = L_{CE} + \alpha L_{Contrastive}$

$L_{Contrastive} = L_{HSI\text{-}LiDAR} + L_{HSI\text{-}MaskedHSI} + L_{LiDAR\text{-}MaskedLiDAR}$

$L_{HSI\text{-}LiDAR} = -\dfrac{1}{N_b}\sum_{k=1}^{N_b}\log\dfrac{\exp\big(\mathrm{sim}(z_k, z_k')/\tau\big)}{\sum_{i=1}^{N_b}\mathbb{1}_{[i\neq k]}\exp\big(\mathrm{sim}(z_k, z_i')/\tau\big)}$

$L_{LiDAR\text{-}MaskedLiDAR} = -\dfrac{1}{N_b}\sum_{k=1}^{N_b}\log\dfrac{\exp\big(\mathrm{sim}(\tilde{l}_k, l_k)/\tau\big)}{\sum_{i=1}^{N_b}\mathbb{1}_{[i\neq k]}\exp\big(\mathrm{sim}(\tilde{l}_k, l_i)/\tau\big)}$

with $L_{HSI\text{-}MaskedHSI}$ defined analogously on the masked and unmasked HSI features. Here $L_{total}$ denotes the total loss function, $L_{CE}$ the cross-entropy function, $\alpha$ the weight of the contrastive loss, $L_{Contrastive}$ the total contrastive loss, $L_{HSI\text{-}LiDAR}$ the HSI-LiDAR contrastive loss, $L_{HSI\text{-}MaskedHSI}$ the HSI-MaskedHSI contrastive loss, and $L_{LiDAR\text{-}MaskedLiDAR}$ the LiDAR-MaskedLiDAR contrastive loss; $N_b$ is the number of images in one batch during training; $z_k$ ($z_i'$) is the high-dimensional representation generated by the kth (ith) sample in the corresponding batch; $\mathbb{1}_{[i\neq k]}$ is an indicator that equals 1 when the ith high-dimensional representation is generated by a different sample than the kth and 0 otherwise; $\tilde{l}_k$ ($\tilde{l}_i$) denotes the LiDAR high-dimensional features generated from the kth (ith) masked sample in the corresponding batch; and $l_i$ denotes the LiDAR high-dimensional features generated from the ith unmasked sample in the corresponding batch.
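A minimal numpy sketch of an InfoNCE-style contrastive term and the total-loss combination described in claim 6. The temperature value, cosine similarity, and contrastive weight used here are standard choices assumed for illustration, not values taken from the patent.

```python
import numpy as np

def info_nce(z_a, z_b, tau=0.07):
    """InfoNCE-style contrastive loss between two batches of embeddings.
    Row k of z_a and row k of z_b form the positive pair; the other rows
    of z_b act as negatives."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / tau                    # (Nb, Nb) scaled cosine sims
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits)
    return float(-np.mean(np.log(np.diag(p) / p.sum(axis=1))))

def total_loss(l_ce, l_hsi_lidar, l_hsi_mhsi, l_lidar_mlidar, alpha=0.1):
    """Total loss: cross-entropy plus the weighted sum of the three
    contrastive terms, as in claim 6."""
    return l_ce + alpha * (l_hsi_lidar + l_hsi_mhsi + l_lidar_mlidar)
```

As expected for a contrastive objective, the loss is smaller when the two batches are aligned row-by-row than when the pairing is shuffled.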
- 7. A multi-modal collaborative ground-object classification system based on masked contrastive learning for implementing the method of any one of claims 1-6, comprising a pre-training module and a classification module; the pre-training module comprises a data processing unit, a feature extraction unit, a fusion processing unit, a mask fusion unit and a loss calculation unit; the data processing unit is used for acquiring hyperspectral data and LiDAR point cloud data, and preprocessing them to obtain a hyperspectral image block and a LiDAR image block; the feature extraction unit is used for respectively extracting features from the hyperspectral image block and the LiDAR image block with a multi-scale feature extraction module to obtain a hyperspectral image feature block and a LiDAR image feature block; the fusion processing unit is used for respectively fusing the hyperspectral image feature block and the LiDAR image feature block with a CGAFT module to obtain hyperspectral image features, LiDAR image features, a first cross-modal attention map and a second cross-modal attention map; the mask fusion unit is used for obtaining a first mask fusion feature and a second mask fusion feature based on the hyperspectral image feature block, the LiDAR image feature block, the first cross-modal attention map and the second cross-modal attention map; the loss calculation unit is used for calculating a contrastive loss based on the hyperspectral image features, the LiDAR image features, the first mask fusion feature and the second mask fusion feature, and for updating the model parameters in combination with a cross-entropy function; and the classification module is used for classifying the multi-modal data of the ground objects to be classified with the parameter-updated model to obtain classification results.
Description
Multi-modal collaborative feature classification method and system based on masked contrastive learning

Technical Field

The invention belongs to the technical field of remote sensing image classification, and particularly relates to a multi-modal collaborative ground-feature classification method and system based on masked contrastive learning.

Background

In recent years, remote sensing technology has developed rapidly, and remote sensing images play an irreplaceable role in applications such as disaster detection, agricultural management and urban development planning. With the advent of diverse remote sensing data, multi-modal learning has attracted increasing attention from researchers. Among these multi-modal data, HSI (Hyperspectral Imaging) data provides detailed spectral information for identifying particular objects on the ground, while LiDAR (Light Detection and Ranging) data provides elevation information for the same region. However, HSI faces challenges in distinguishing land objects that have the same spectral characteristics but different heights, while LiDAR data struggles to distinguish land objects of different materials but the same height; joint classification using HSI and LiDAR data can exploit their complementary information to improve classification accuracy. Feature fusion of cross-modal data has therefore attracted wide attention and is widely applied to ground-object classification of multi-source remote sensing images. In recent years, many machine learning techniques have been applied to the joint classification of HSI and LiDAR data, including support vector machines (SVMs), random forests, rotation forests (RoFs), and the like.
In the traditional machine learning framework, feature extraction and selection are first performed separately on the two modalities, the two sets of modal features are fused into a comprehensive feature set, and finally a classifier is used to classify the ground objects. However, conventional methods built on these techniques rely on the quality of hand-crafted features, fail to mine deep features, and do not fit the complex nonlinear relationships of features in HSI and LiDAR data well, which limits their application scenarios. Deep learning based approaches are of great interest because of their excellent ability to extract features automatically. However, building an efficient HSI and LiDAR classification model is not trivial. One key reason is that deep learning based models typically require a large number of labeled samples to achieve satisfactory accuracy, and such samples are expensive and scarce in ground-object classification. Furthermore, these methods typically use attention mechanisms to bridge the semantic gap between modalities, coordinating feature similarity between modalities by dynamically assigning weights to key information. The association between modalities typically depends on the same class labels appearing in both modalities, which makes it difficult for the attention mechanism to explore deeper relationships between the two. Contrastive learning is a self-supervised learning technique that aims to extract meaningful representations from unlabeled data. Typically, a proxy task is used to distinguish positive from negative samples, and sample features are acquired automatically for training the model.
Long-term research has produced many well-known contrastive learning frameworks for training encoders with strong representation extraction capability, such as SimCLR, MoCo and BYOL, and extensions of this work have generated many contrastive learning frameworks for multi-modal learning, such as CMC and FactorCL. A contrastive learning method generally includes two phases, a pre-training phase and a fine-tuning phase. In the pre-training phase, positive and negative samples constructed by data augmentation serve as supervision for learning features from the input data, and the high-quality feature extraction module learned in pre-training is transferred to a downstream task for fine-tuning. With the development of contrastive learning, researchers have proposed many methods for contrastive learning on multi-modal remote sensing data, mainly by increasing the diversity of samples through different data augmentation strategies. Masked Image Modeling (MIM) is an advanced self-supervised learning method that learns effective visual representations by randomly masking partial regions of an input image and training the model to recover the masked portions from the unmasked information. In its implementation, an input image is first divided into a series of non-overlapping blocks of fixed size, and then masking is applied to part of the image blocks
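The MIM-style masking described above (divide into fixed-size non-overlapping blocks, mask a random subset) can be sketched as follows; the block size, mask ratio and zero-fill value are generic MIM defaults assumed here, not parameters stated in this document.

```python
import numpy as np

def random_patch_mask(img, patch=4, mask_ratio=0.5, seed=0):
    """Masked-image-modeling style masking: split a (H, W) image into
    fixed-size non-overlapping blocks and zero out a random subset."""
    rng = np.random.default_rng(seed)
    H, W = img.shape[:2]
    ys, xs = H // patch, W // patch           # grid of non-overlapping blocks
    n_blocks = ys * xs
    n_mask = int(round(mask_ratio * n_blocks))
    masked_ids = rng.choice(n_blocks, size=n_mask, replace=False)
    out = img.copy()
    for pid in masked_ids:
        y, x = divmod(pid, xs)
        out[y * patch:(y + 1) * patch, x * patch:(x + 1) * patch] = 0.0
    return out, masked_ids
```

A reconstruction model would then be trained to recover the zeroed blocks from the surviving ones, which is the proxy task MIM uses to learn visual representations.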