CN-121982532-A - Text-guided Mamba contrast learning hyperspectral image classification method and system
Abstract
The invention discloses a text-guided Mamba contrast learning hyperspectral image classification method and system. The method comprises: obtaining hyperspectral image data and category text descriptions; generating multiple views of each sample through enhancement; extracting text semantic features with a text encoder; constructing a pre-training framework comprising parallel contrast branches; extracting and fusing the spectral and spatial features of each view with a Mamba visual encoder and performing cross-modal fusion with the corresponding text features; calculating a contrast loss and a classification loss from the fused features and constructing a joint loss function to optimize the model; reusing the pre-trained parameters for supervised fine-tuning; and finally realizing pixel-level classification of the image to be classified. The method achieves linear-complexity modeling with the Mamba structure and, by combining text semantic guidance with a contrast learning mechanism, effectively alleviates the high computational complexity, insufficient use of semantic information, and scarcity of labeled data in existing methods, thereby markedly improving classification accuracy and generalization capability.
Inventors
- Zu Baokai
- Yang Zhengrui
- Wang Hongyuan
- Li Yafang
Assignees
- Beijing University of Technology
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-01-23
Claims (10)
- 1. A text-guided Mamba contrast learning hyperspectral image classification method, characterized by comprising the following steps: Step S1, data preparation and text-guided feature construction: acquiring hyperspectral image data and the corresponding category text descriptions, performing enhancement processing on the hyperspectral image data to generate two enhanced views of the same training sample, and encoding the category text descriptions into text semantic features with a text encoder; Step S2, model pre-training: constructing a pre-training framework comprising two parallel contrast branches, inputting the different enhanced views of the same training sample into the different contrast branches, extracting spectral features and spatial features through the corresponding Mamba visual encoders, and fusing the spectral and spatial features into visual features; calculating a contrast loss based on the semantically enhanced visual features output by the different contrast branches, constructing a joint loss function by combining the classification losses, and synchronously optimizing the model parameters so as to align visual-text pairs with similar semantics and separate visual-text pairs with dissimilar semantics; Step S3, model fine-tuning: based on labeled samples in the hyperspectral image data, reusing the pre-trained model parameters for initialization and performing supervised training of the model through a classification loss function to adapt it to the target classification task; and Step S4, classification inference: inputting the hyperspectral image to be classified into the fine-tuned model and outputting a pixel-level classification result.
- 2. The Mamba contrast learning hyperspectral image classification method according to claim 1, wherein in step S1 the generation of the enhanced views specifically includes: extracting, for the hyperspectral image data, a local three-dimensional image block of a given size around each center pixel; performing enhancement processing on each local three-dimensional image block to generate two enhanced views of the same training sample; and reshaping the two enhanced views so that, combined with a learnable position encoding, they are projected into the spectral and spatial feature sequences required by the subsequent Mamba visual encoder.
- 3. The Mamba contrast learning hyperspectral image classification method according to claim 1, wherein in step S1 the text semantic feature encoding specifically includes: for each category name of the hyperspectral image data, generating an initial category text with a large language model and, guided by category text examples, expanding it into a category text description containing inherent category attributes and inter-category relation information; segmenting each expanded text description of each category into subword units with a byte-pair encoding algorithm to form a standardized text sequence; inputting the segmented standardized text sequence into a CLIP text encoder and outputting a text embedding vector for each category; and collecting the text embedding vectors of all categories into a text semantic feature set used for subsequent cross-modal fusion with the visual features, wherein the dimension of each vector is the text embedding dimension and the number of vectors equals the total number of categories of the hyperspectral image data.
- 4. The Mamba contrast learning hyperspectral image classification method according to claim 2, wherein in step S2 the extraction of spectral and spatial features by the Mamba visual encoder specifically includes: inputting the two enhanced views generated by the enhancement processing correspondingly into the two parallel contrast branches of the pre-training framework; in each branch, inputting the spectral feature sequence into the spectral branch of the Mamba visual encoder, capturing spectral correlations through the Mamba structure, and outputting the spectral features of the branch; inputting the spatial feature sequence into the spatial branch of the Mamba visual encoder, capturing spatial correlations through the Mamba structure, and outputting the spatial features of the branch; and fusing the spectral and spatial features output by the same contrast branch into the branch-specific visual features used for subsequent cross-modal fusion.
- 5. The Mamba contrast learning hyperspectral image classification method according to claim 1, wherein in step S2 the optimization with the joint loss function specifically includes: in each parallel contrast branch, performing cross-modal fusion of the visual features with the corresponding text semantic features through a matrix-multiplication strategy to obtain semantically enhanced visual features; based on the semantically enhanced visual features output by the two contrast branches, calculating a contrast loss that aligns visual-text pairs with similar semantics and separates visual-text pairs with dissimilar semantics; inputting the semantically enhanced visual features of each contrast branch into the corresponding fully-connected classification head to obtain class predictions, and calculating a cross-entropy classification loss against the true class labels to obtain the classification loss of each of the two contrast branches; combining the contrast loss and the classification losses of the two contrast branches in a weighted manner to construct the joint loss function; and, based on the value of the joint loss function, synchronously optimizing the Mamba visual encoder and the feature-fusion parameters until the loss value converges.
- 6. The Mamba contrast learning hyperspectral image classification method according to claim 5, wherein in the joint loss function the weighted combination of the contrast loss and the two contrast-branch classification losses satisfies: L = L_con + λ1·L_cls1 + λ2·L_cls2; wherein L_con is the contrast loss, L_cls1 and L_cls2 respectively represent the cross-entropy losses of the two contrast branches, and the weighting parameters λ1 and λ2 balance the contributions of the classification losses and the contrast loss to model optimization.
- 7. The Mamba contrast learning hyperspectral image classification method according to claim 1, wherein in step S3 the fine-tuning of the model specifically includes: selecting samples with category labels from the hyperspectral image data set as labeled samples for fine-tuning; adding a fully-connected classification head at the output of the pre-trained Mamba visual encoder and reusing the pre-training-stage model parameters for initialization; freezing the core parameters of the pre-trained Mamba visual encoder so that they remain unchanged during fine-tuning, while setting the classification parameters of the newly added fully-connected head as trainable; inputting the labeled samples into the model, computing the error between the predictions and the true labels through a cross-entropy classification loss function, and optimizing the trainable parameters; and continuing training until the model's adaptation to the target classes is stable, obtaining the fine-tuned hyperspectral image classification model.
- 8. The Mamba contrast learning hyperspectral image classification method according to claim 1, wherein in step S4 the classification inference specifically includes: preprocessing the hyperspectral image to be classified according to the enhancement processing and formatting of step S1 to obtain input data adapted to the model; inputting the data to be classified into the pre-trained and fine-tuned model, extracting spectral and spatial features through the Mamba visual encoder, fusing them into visual features, and performing cross-modal fusion with the corresponding text semantic features; inputting the cross-modal fused, semantically enhanced visual features into the classification head to obtain a class prediction for each pixel; and assembling the class predictions of all pixels according to the original spatial positions of the hyperspectral image to be classified, and outputting the pixel-level classification result.
- 9. A text-guided Mamba contrast learning hyperspectral image classification system applying the Mamba contrast learning hyperspectral image classification method according to any one of claims 1 to 8, comprising: a data preprocessing module, used for acquiring hyperspectral image data and the corresponding category text descriptions, performing enhancement processing on the hyperspectral image data to generate two enhanced views of the same training sample, and encoding the category text descriptions into text semantic features with a text encoder; a model training module, used for constructing a pre-training framework comprising two parallel contrast branches, inputting the different enhanced views of the same training sample into the different contrast branches, extracting spectral and spatial features through the Mamba visual encoder, and fusing them into visual features; calculating a contrast loss based on the semantically enhanced visual features output by the different contrast branches, constructing a joint loss function by combining the classification losses, and synchronously optimizing the model parameters so as to align visual-text pairs with similar semantics and separate visual-text pairs with dissimilar semantics; and, based on labeled samples in the hyperspectral image data, reusing the pre-trained model parameters for initialization and performing supervised training of the model through the classification loss function so as to adapt it to the target classification task; and a classification inference module, used for inputting the hyperspectral image to be classified into the fine-tuned model and outputting a pixel-level classification result.
- 10. The Mamba contrast learning hyperspectral image classification system according to claim 9, wherein the data preprocessing module further includes a text expansion unit that, for each category name of the hyperspectral image data, uses a template strategy fused with domain-specific knowledge to generate an expanded text description containing category-specific attributes and inter-category relationship information for text semantic feature encoding.
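The view-generation and sequence-construction steps recited in claims 1, 2, and 4 can be illustrated with a short sketch. This is not part of the claimed method: the patch size, the flip-plus-noise augmentation, and the band-tokens/pixel-tokens sequence layout below are all assumptions of this example (Python/NumPy), since the claims leave the concrete enhancement operations and sequence format open.

```python
import numpy as np

def extract_patch(hsi, row, col, size=9):
    """Extract a local 3-D block (size x size x bands) around a center
    pixel, padding the image borders by reflection."""
    pad = size // 2
    padded = np.pad(hsi, ((pad, pad), (pad, pad), (0, 0)), mode="reflect")
    return padded[row:row + size, col:col + size, :]

def augment(patch, rng):
    """One stochastic enhancement: random horizontal/vertical flip plus
    small Gaussian noise (an assumed choice of operations)."""
    out = patch.copy()
    if rng.random() < 0.5:
        out = out[::-1, :, :]
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    return out + rng.normal(0.0, 0.01, out.shape)

def two_views(hsi, row, col, size=9, seed=0):
    """Generate the two enhanced views of the same training sample."""
    rng = np.random.default_rng(seed)
    patch = extract_patch(hsi, row, col, size)
    return augment(patch, rng), augment(patch, rng)

def to_sequences(view):
    """Reshape a view into a spectral sequence (bands as tokens) and a
    spatial sequence (pixels as tokens) for the Mamba encoder branches;
    a learnable position encoding would be added to each in the model."""
    h, w, b = view.shape
    spectral_seq = view.reshape(h * w, b).T  # (bands, pixels)
    spatial_seq = view.reshape(h * w, b)     # (pixels, bands)
    return spectral_seq, spatial_seq
```

In this layout each Mamba branch scans one token axis: the spectral branch scans the band dimension, the spatial branch the pixel dimension, matching the parallel spectral/spatial branches of claim 4.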
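The cross-modal fusion and joint loss of claims 5 and 6 can likewise be sketched. The concrete choices here are assumptions, not taken from the patent: fusion is plain matrix multiplication of visual features with the class text embeddings, the contrast loss is a symmetric InfoNCE form with temperature `tau`, and `lam1`/`lam2` stand in for the two weighting parameters of claim 6.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(visual, text):
    """Cross-modal fusion by matrix multiplication: project visual
    features (N, d) onto class text embeddings (C, d) -> (N, C)."""
    return visual @ text.T

def info_nce(z1, z2, tau=0.1):
    """Symmetric InfoNCE-style contrast loss between the two branches'
    semantically enhanced features for the same batch: matching rows
    are positives, all other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / tau
    idx = np.arange(len(z1))
    p12 = softmax(logits, axis=1)[idx, idx]
    p21 = softmax(logits.T, axis=1)[idx, idx]
    return -0.5 * (np.log(p12).mean() + np.log(p21).mean())

def cross_entropy(logits, labels):
    """Per-branch classification loss against the true class labels."""
    p = softmax(logits, axis=1)
    return -np.log(p[np.arange(len(labels)), labels]).mean()

def joint_loss(z1, z2, labels, lam1=1.0, lam2=1.0):
    """Weighted combination of the contrast loss and the two branches'
    cross-entropy losses, in the spirit of claim 6."""
    return (info_nce(z1, z2)
            + lam1 * cross_entropy(z1, labels)
            + lam2 * cross_entropy(z2, labels))
```

Because the fused features already have one score per category, the same (N, C) tensor serves both as input to the contrast loss and as the class logits for the cross-entropy terms in this sketch; the patent's classification head may instead be a separate fully-connected layer.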
Description
Text-guided Mamba contrast learning hyperspectral image classification method and system

Technical Field

The invention relates to the technical field of hyperspectral image classification, and in particular to a text-guided Mamba contrast learning hyperspectral image classification method and system.

Background

A hyperspectral image (HSI) contains hundreds of continuous spectral bands, can accurately describe the spectral characteristics of ground objects, and has important application value in fields such as land-cover classification, environmental monitoring, and agricultural remote sensing. The core goal of hyperspectral image classification is to accurately distinguish different ground-object categories by mining the joint spectral-spatial features contained in the data. Current mainstream classification methods are mainly built on deep learning frameworks; among them, methods based on the Transformer architecture have notable advantages in capturing long-range dependencies, global structural information, and semantic associations, making them particularly suited to application scenarios with high demands on structural information, such as remote sensing imagery. However, existing Transformer-based methods have an inherent defect: the self-attention mechanism has quadratic computational complexity, which becomes the core bottleneck for extending them to long-sequence modeling of hyperspectral images. When the input sequence length of the hyperspectral data grows significantly, the computational overhead and memory footprint of the model increase dramatically, so that training efficiency drops sharply, deployment costs rise steeply, and processing of ultra-long hyperspectral sequences may become infeasible.
Meanwhile, existing hyperspectral image classification methods still depend heavily on discrete class labels during supervised training. Discrete labels convey only the category membership of a sample and can hardly reflect the rich semantic meaning and fine-grained differences each category carries in the real world, which limits the model's ability to capture complex semantic relations and markedly restricts classification performance when the data distribution is complex or the similarity between categories is high. In summary, existing hyperspectral image classification methods suffer from the core problems of high computational complexity and insufficient use of semantic information, and a technical scheme offering both efficient modeling and accurate semantic characterization is needed to meet the requirements of practical application scenarios.

Disclosure of Invention

Aiming at the defects of the prior art, the invention provides a text-guided Mamba contrast learning hyperspectral image classification method and system.
The invention discloses a text-guided Mamba contrast learning hyperspectral image classification method, comprising the following steps: Step S1, data preparation and text-guided feature construction: acquiring hyperspectral image data and the corresponding category text descriptions, performing enhancement processing on the hyperspectral image data to generate two enhanced views of the same training sample, and encoding the category text descriptions into text semantic features with a text encoder; Step S2, model pre-training: constructing a pre-training framework comprising two parallel contrast branches, inputting the different enhanced views of the same training sample into the different contrast branches, extracting spectral and spatial features through the corresponding Mamba visual encoders, and fusing them into visual features; calculating a contrast loss based on the semantically enhanced visual features output by the different contrast branches, constructing a joint loss function by combining the classification losses, and synchronously optimizing the model parameters so as to align visual-text pairs with similar semantics and separate visual-text pairs with dissimilar semantics; Step S3, model fine-tuning: based on labeled samples in the hyperspectral image data, reusing the pre-trained model parameters for initialization and performing supervised training of the model through a classification loss function to adapt it to the target classification task; and Step S4, classification inference: inputting the hyperspectral image to be classified into the fine-tuned model and outputting a pixel-level classification result. As a further improvement of the present invention, in step S1 the generation of the enhanced views specifically includes: extracting, for the hyperspectral image data, a local three-dimensional image block of a given size around each center pixel; Performing enhance