CN-121999142-A - Cross-modal text enhancement-based hypersurface hyperspectral reconstruction model, training method thereof and electronic equipment
Abstract
The invention discloses a cross-modal text enhancement-based hypersurface hyperspectral reconstruction model, a training method thereof and an electronic device, and relates to the field of hypersurface hyperspectral reconstruction. The hyperspectral reconstruction model comprises an image encoder, a text encoder, a cross-modal alignment module, a cross-modal condition control module, an encoder hierarchical enhancement module and a hybrid spatial-spectral decoder. It realizes joint characterization and collaborative reconstruction of the multispectral image acquired by the hypersurface and text semantic prompts, effectively addresses the missing semantics caused by existing methods relying on image features alone, and provides a clearly structured, functionally complete technical framework for hyperspectral reconstruction.
Inventors
- Ju Fayin
- Li Ning
Assignees
- 浙江优众新材料科技有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2026-04-08
Claims (10)
- 1. A cross-modal text enhancement-based hypersurface hyperspectral reconstruction model, comprising: an image encoder for receiving the multispectral image acquired by the hypersurface, processing the multispectral image through a plurality of cascaded convolution layer groups and generating an image feature representation; a text encoder for receiving the spatial distribution prompt, the spectral variation prompt, or a combination thereof corresponding to the multispectral image and generating corresponding text semantic features; a cross-modal alignment module for calculating cosine similarity between the image feature representation and the text semantic features to obtain a similarity score; a cross-modal condition control module for generating, based on the text projection features, spatial guidance features and spectral guidance features through a spatial-spectral dual-path attention mechanism, performing weighted fusion of the spatial guidance features and the spectral guidance features using the similarity score, and outputting cross-modal enhanced image features; an encoder hierarchical enhancement module for performing multi-scale spatial-spectral modeling and cross-modal dynamic updating on the cross-modal enhanced image features and outputting a plurality of hierarchical fusion features; and a hybrid spatial-spectral decoder for constructing a target hyperspectral feature map based on the plurality of hierarchical fusion features.
- 2. The cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in claim 1, wherein the cross-modal condition control module comprises: a text projection unit, composed of a plurality of cascaded convolution layers, for projecting the text semantic features onto the same spatial-channel dimensions as the image feature representation to obtain text projection features; a spatial attention unit for performing spatial-dimension attention calculation on the image feature representation with the text projection features as the query, and generating spatial guidance features; a spectral attention unit for performing channel-dimension attention calculation on the image feature representation with the text projection features as the query, and generating spectral guidance features; and a dynamic fusion unit for performing a weighted summation of the spatial guidance features and the spectral guidance features according to the similarity score and outputting the cross-modal enhanced image features (an illustrative sketch of the alignment and condition control modules is given after the claims).
- 3. The cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in claim 2, wherein the encoder hierarchical enhancement module includes a plurality of transformation units cascaded in sequence, and a cross-modal update unit disposed between each pair of adjacent transformation units; the plurality of transformation units cascaded in sequence comprises: basic transformation units, each for performing local spatial structure extraction and inter-channel spectral dependency modeling on its input features; an enhancement transformation unit for extracting multi-scale spatial-spectral joint context information from its input features, constructing a global context based on that information to generate attention weights, and adaptively modulating its input features by means of the attention weights; and core transformation units, each for modeling long-range dependencies among the spectral channels of its input features and capturing structural consistency in the spatial dimension; the first transformation unit receives the cross-modal enhanced image features as input, and each remaining transformation unit receives as input the output of the preceding transformation unit processed by the corresponding cross-modal update unit.
- 4. The cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as set forth in claim 3, wherein the cross-modal update unit is configured to receive and align the output features of the preceding transformation unit and the text semantic features, generate a fusion weight according to the similarity score, perform weighted fusion of the aligned output features and the aligned text semantic features based on the fusion weight, and provide the fusion result as the input of the next transformation unit (see the corresponding sketch after the claims).
- 5. The cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in claim 4, wherein the plurality of hierarchical fusion features output by the encoder hierarchical enhancement module comprises: a first-level fusion feature, which is the output of the second basic transformation unit after being processed by the cross-modal update unit disposed between the second basic transformation unit and the enhancement transformation unit; a second-level fusion feature, which is the output of the enhancement transformation unit after being processed by the cross-modal update unit disposed between the enhancement transformation unit and the first core transformation unit; a third-level fusion feature, which is the output of the second core transformation unit after being processed by the cross-modal update unit disposed between the second core transformation unit and the third core transformation unit; and a fourth-level fusion feature, which is the output of the last core transformation unit (a cascade sketch covering claims 3 to 5 is given after the claims).
- 6. The cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in claim 5, wherein the enhancement transformation unit specifically comprises: a multi-scale convolution module for performing depthwise separable convolutions of different kernel sizes in parallel on its input features, so as to capture the spatial-spectral joint context information at each corresponding scale; a feature splicing module for concatenating the spatial-spectral joint context information of all scales along the channel dimension to generate a fused feature whose channel count is stacked and whose spatial dimensions are unchanged; a global context extraction module for performing average pooling and max pooling on the fused feature respectively to obtain a first global statistical feature and a second global statistical feature; an attention weight generation module for fusing the first and second global statistical features, sequentially applying a plurality of cascaded learnable linear transformations to the fused result, and then applying a nonlinear normalization operation to generate an attention weight with the same spatial-channel dimensions as the input features of the multi-scale convolution module; and a feature modulation module for multiplying the attention weight element-wise with the input features of the multi-scale convolution module to obtain modulated features, and adding the modulated features to the input features of the multi-scale convolution module to obtain the output features of the enhancement transformation unit (an illustrative sketch follows the claims).
- 7. The cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in claim 6, wherein the hybrid spatial-spectral decoder comprises: a local feature processing module for strengthening, through feature enhancement layers, the correlation between local spatial structure and spectral channels, obtaining local enhanced features in one-to-one correspondence with the hierarchical fusion features; a global feature modeling module for upsampling each hierarchical fusion feature and its corresponding local enhanced feature to a preset spatial resolution, spatially flattening each upsampled feature into a feature sequence, concatenating all feature sequences along the sequence dimension to form a global feature vector, linearly projecting the global feature vector to generate query, key and value features, computing attention weights from the similarity between the query and key features, and performing weighted aggregation of the value features by the attention weights to output a global fusion feature; and a mapping module for mapping the global fusion feature into the target hyperspectral feature map through a convolution layer (an illustrative decoder sketch follows the claims).
- 8. A method of training a cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in any one of claims 1 to 7, comprising: constructing a text prompt library that comprises spatial distribution prompts describing the spatial distribution characteristics of the multispectral image and spectral variation prompts describing its spectral variation characteristics; acquiring a multispectral image captured by the hypersurface, the text prompt corresponding to the multispectral image, and registered ground-truth hyperspectral data, constructing a triplet sample therefrom, and preprocessing the triplet samples to form a training set, wherein the text prompt is a spatial distribution prompt, a spectral variation prompt, or a combination of the two; and performing end-to-end training of the hyperspectral reconstruction model on the training set, wherein in each training iteration a multispectral image and its corresponding text prompt are input into the hyperspectral reconstruction model to obtain a target hyperspectral feature map, the target hyperspectral feature map is channel-adapted, and a joint loss function is computed from the adapted target hyperspectral feature map and the corresponding ground-truth hyperspectral data so as to update the model parameters.
- 9. The method of training a cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as claimed in claim 8, wherein the joint loss function comprises: a pixel-level mean squared error loss, constraining the deviation between the channel-adapted target hyperspectral feature map and the ground-truth hyperspectral data in the intensity value of each spectral channel at each pixel; and a spectral angle matching loss, constraining the directional consistency between the spectral vectors formed, at each pixel, by the spectral channel intensity values of the channel-adapted target hyperspectral feature map and of the ground-truth hyperspectral data; the joint loss function is a weighted sum of the pixel-level mean squared error loss and the spectral angle matching loss (a loss sketch follows the claims).
- 10. An electronic device comprising a processor and a memory storing a program, wherein the program comprises instructions that, when executed by the processor, cause the processor to perform the method of training a cross-modal text enhancement-based hypersurface hyperspectral reconstruction model as set forth in claim 8.
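The following minimal PyTorch sketch illustrates how the cross-modal alignment module and the cross-modal condition control module of claims 1 and 2 could be realized. It is not the patented implementation: the channel width, the text feature dimension, the linear projection used before the cosine similarity, and the mapping of the similarity score to a (0, 1) fusion gate are all assumptions made for illustration.

```python
# Illustrative sketch of the cross-modal alignment and condition control modules
# (claims 1-2). All layer counts and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalConditionControl(nn.Module):
    def __init__(self, channels: int = 64, text_dim: int = 512):
        super().__init__()
        # Assumed linear projection so the cosine-similarity alignment of claim 1
        # can compare the pooled image feature with the text feature.
        self.align = nn.Linear(text_dim, channels)
        # Text projection unit (claim 2): cascaded 1x1 convolutions mapping the
        # broadcast text feature to the image's spatial-channel dimensions.
        self.text_proj = nn.Sequential(
            nn.Conv2d(text_dim, channels, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 1),
        )
        self.scale = channels ** -0.5

    def forward(self, img_feat: torch.Tensor, text_feat: torch.Tensor) -> torch.Tensor:
        # img_feat: (B, C, H, W) image feature representation
        # text_feat: (B, D) pooled text semantic feature from the text encoder
        B, C, H, W = img_feat.shape
        # Cross-modal alignment module: similarity score mapped into (0, 1).
        img_vec = F.adaptive_avg_pool2d(img_feat, 1).flatten(1)              # (B, C)
        sim = F.cosine_similarity(img_vec, self.align(text_feat), dim=1)     # (B,)
        gate = (sim.view(B, 1, 1, 1) + 1.0) / 2.0
        # Text projection feature with the same layout as img_feat.
        t = self.text_proj(text_feat.view(B, -1, 1, 1).expand(-1, -1, H, W))
        # Spatial attention unit: text projection as query over spatial positions.
        q_s = t.flatten(2).transpose(1, 2)                                    # (B, HW, C)
        k_s = v_s = img_feat.flatten(2).transpose(1, 2)                       # (B, HW, C)
        attn_s = torch.softmax(q_s @ k_s.transpose(1, 2) * self.scale, dim=-1)
        spatial_guide = (attn_s @ v_s).transpose(1, 2).reshape(B, C, H, W)
        # Spectral attention unit: text projection as query over channels.
        q_c, k_c, v_c = t.flatten(2), img_feat.flatten(2), img_feat.flatten(2)
        attn_c = torch.softmax(q_c @ k_c.transpose(1, 2) * (H * W) ** -0.5, dim=-1)
        spectral_guide = (attn_c @ v_c).reshape(B, C, H, W)
        # Dynamic fusion unit: similarity-weighted sum of the two guidance features.
        return gate * spatial_guide + (1.0 - gate) * spectral_guide
```

For a (2, 64, 32, 32) image feature and a (2, 512) text feature, the module returns a (2, 64, 32, 32) cross-modal enhanced image feature.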
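A hedged sketch of the cross-modal update unit of claim 4 follows: the output of a transformation unit is blended with a spatially broadcast text feature using a weight derived from the similarity score. The 1x1 alignment convolution and the score-to-weight mapping are illustrative choices, not specified by the claims.

```python
# Hypothetical sketch of the cross-modal update unit (claim 4).
import torch
import torch.nn as nn

class CrossModalUpdateUnit(nn.Module):
    def __init__(self, channels: int = 64, text_dim: int = 512):
        super().__init__()
        self.text_align = nn.Conv2d(text_dim, channels, 1)   # align text to image layout
        self.to_weight = nn.Sequential(nn.Linear(1, channels), nn.Sigmoid())

    def forward(self, unit_out, text_feat, sim_score):
        # unit_out: (B, C, H, W); text_feat: (B, D); sim_score: (B,) cosine similarity
        B, C, H, W = unit_out.shape
        text_map = self.text_align(text_feat.view(B, -1, 1, 1).expand(-1, -1, H, W))
        # Fusion weight generated from the similarity score (one weight per channel).
        w = self.to_weight(sim_score.view(B, 1)).view(B, C, 1, 1)
        # Weighted fusion; the result feeds the next transformation unit.
        return w * unit_out + (1.0 - w) * text_map
```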
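The enhancement transformation unit of claim 6 can be pictured as below: parallel depthwise separable convolutions at several kernel sizes, channel-wise concatenation, average and max global pooling, a small MLP with a sigmoid standing in for the nonlinear normalization, and residual modulation of the input. The kernel sizes, the hidden width, and the reading of the attention weight as a per-channel weight broadcast over space are assumptions.

```python
# Hedged sketch of the enhancement transformation unit (claim 6).
import torch
import torch.nn as nn

class EnhancementTransformUnit(nn.Module):
    def __init__(self, channels: int = 64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # Multi-scale convolution module: one depthwise-separable branch per scale.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels),
                nn.Conv2d(channels, channels, 1),
            )
            for k in kernel_sizes
        ])
        fused = channels * len(kernel_sizes)
        # Attention weight generation: cascaded linear transforms plus sigmoid.
        self.mlp = nn.Sequential(
            nn.Linear(fused, channels), nn.ReLU(inplace=True),
            nn.Linear(channels, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        # Feature splicing module: concatenate multi-scale contexts along channels.
        ctx = torch.cat([b(x) for b in self.branches], dim=1)        # (B, 3C, H, W)
        # Global context extraction: average- and max-pooled statistics.
        g_avg = ctx.mean(dim=(2, 3))                                  # (B, 3C)
        g_max = ctx.amax(dim=(2, 3))                                  # (B, 3C)
        # Attention weight, broadcast over space (an assumed reading of the claim).
        w = self.mlp(g_avg + g_max).view(B, C, 1, 1)
        # Feature modulation module: element-wise modulation plus residual connection.
        return x * w + x
```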
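Putting the pieces together, the sketch below outlines the encoder hierarchical enhancement module of claims 3 to 5: transformation units cascaded in sequence, a cross-modal update unit between adjacent units, and hierarchical fusion features tapped after the stages named in claim 5. It reuses EnhancementTransformUnit and CrossModalUpdateUnit from the sketches above; the plain convolutional blocks standing in for the basic and core transformation units, and the stage counts, are placeholders rather than the patented designs.

```python
# Illustrative cascade of the encoder hierarchical enhancement module (claims 3-5).
# Depends on EnhancementTransformUnit and CrossModalUpdateUnit defined above.
import torch
import torch.nn as nn

class EncoderHierarchyEnhancement(nn.Module):
    def __init__(self, channels: int = 64, text_dim: int = 512):
        super().__init__()
        def block():   # stand-in for a basic/core transformation unit
            return nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1),
                                 nn.ReLU(inplace=True))
        # Two basic units, one enhancement unit, three core units (illustrative).
        self.units = nn.ModuleList([block(), block(), EnhancementTransformUnit(channels),
                                    block(), block(), block()])
        self.updates = nn.ModuleList([CrossModalUpdateUnit(channels, text_dim)
                                      for _ in range(len(self.units) - 1)])
        # Stages whose cross-modally updated output becomes a fusion feature
        # (after the 2nd basic, the enhancement, and the 2nd core unit, per claim 5).
        self.tap_after = {1, 2, 4}

    def forward(self, x, text_feat, sim_score):
        fused_levels = []
        for i, unit in enumerate(self.units):
            x = unit(x)
            if i < len(self.updates):
                x = self.updates[i](x, text_feat, sim_score)
                if i in self.tap_after:
                    fused_levels.append(x)
        fused_levels.append(x)   # fourth-level feature: output of the last core unit
        return fused_levels
```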
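The global path of the hybrid spatial-spectral decoder of claim 7 could look like the following: each hierarchical fusion feature is upsampled to a common resolution, flattened into a token sequence, the sequences are concatenated along the sequence dimension, single-head attention aggregates them, and a 1x1 convolution maps the result to the target hyperspectral feature map. The target resolution, band count, and the averaging of tokens back onto one grid are assumptions; the local feature processing module is omitted for brevity.

```python
# Illustrative sketch of the global modeling and mapping path of the decoder (claim 7).
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridDecoderGlobal(nn.Module):
    def __init__(self, channels: int = 64, out_bands: int = 31, target_hw: int = 32):
        super().__init__()
        self.target_hw = target_hw
        self.q = nn.Linear(channels, channels)
        self.k = nn.Linear(channels, channels)
        self.v = nn.Linear(channels, channels)
        self.head = nn.Conv2d(channels, out_bands, 1)   # mapping module

    def forward(self, level_feats):
        # level_feats: list of hierarchical fusion features, each (B, C, Hi, Wi)
        hw = self.target_hw
        tokens = []
        for f in level_feats:
            f = F.interpolate(f, size=(hw, hw), mode="bilinear", align_corners=False)
            tokens.append(f.flatten(2).transpose(1, 2))               # (B, hw*hw, C)
        seq = torch.cat(tokens, dim=1)                                # concat along sequence
        q, k, v = self.q(seq), self.k(seq), self.v(seq)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        fused = attn @ v                                              # (B, L*hw*hw, C)
        # Average the aggregated tokens back onto one spatial grid (assumed choice),
        # then map to the target hyperspectral feature map with a convolution.
        B, _, C = fused.shape
        fused = fused.view(B, len(level_feats), hw * hw, C).mean(dim=1)
        fused = fused.transpose(1, 2).reshape(B, C, hw, hw)
        return self.head(fused)
```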
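Finally, the joint loss of claims 8 and 9 combines a pixel-level mean squared error with a spectral angle matching term as a weighted sum; a short sketch follows, with the weighting factor lambda_sam chosen purely for illustration.

```python
# Sketch of the joint loss (claims 8-9): MSE plus spectral angle matching.
import torch
import torch.nn.functional as F

def joint_loss(pred: torch.Tensor, gt: torch.Tensor, lambda_sam: float = 0.1) -> torch.Tensor:
    # pred, gt: (B, bands, H, W) channel-adapted prediction and ground-truth cube
    mse = torch.mean((pred - gt) ** 2)
    # Spectral angle between the per-pixel spectral vectors (along the band axis).
    cos = F.cosine_similarity(pred, gt, dim=1).clamp(-1 + 1e-7, 1 - 1e-7)
    sam = torch.mean(torch.acos(cos))
    return mse + lambda_sam * sam
```

With pred and gt of shape (2, 31, 64, 64), joint_loss(pred, gt) returns a scalar suitable for backpropagation.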
Description
Cross-modal text enhancement-based hypersurface hyperspectral reconstruction model, training method thereof and electronic device

Technical Field

The invention relates to the field of hypersurface hyperspectral reconstruction, and in particular to a hyperspectral reconstruction model based on cross-modal text enhancement, a training method thereof and an electronic device.

Background

Hyperspectral imaging technology has important application value in fields such as material identification, biomedical detection and environmental monitoring, because it can simultaneously acquire the spatial information and the continuous, fine spectral information of a target. Traditional hyperspectral systems rely on complex optical structures, are large in size and weight, and struggle to meet the requirements of lightweight platforms. In recent years, miniature spectral imaging chips based on hypersurfaces (such as real-time hyperspectral imaging chips) have achieved fast and compact acquisition of multispectral images through micro-nano structure engineering, significantly advancing the miniaturization and integration of hyperspectral imaging systems. However, such hypersurface chips typically output multispectral images with a limited number of bands and low spectral resolution, which still need to be mapped to the target hyperspectral space by a subsequent computational reconstruction algorithm. Most existing reconstruction methods adopt general-purpose deep learning models that extract features from image data alone, lacking explicit semantic modeling of the scene's spatial distribution characteristics and spectral variation patterns, which causes reconstruction results to deviate in preserving the consistency of spatial details and spectral curves. In addition, most existing fusion mechanisms adopt fixed weights or simple concatenation strategies, making adaptive and selective fusion of cross-modal information difficult and easily introducing information redundancy or semantic drift that degrades reconstruction accuracy. Although some approaches attempt to introduce attention mechanisms to enhance feature expression, they often neither model the spatial and spectral dimensions differentially nor establish a dynamic association between external semantic cues and image features, resulting in limited spatial-spectral joint characterization capability. Therefore, there is a need for a hyperspectral reconstruction method that combines spatial and spectral semantic priors and adaptively fuses features accordingly, so as to ensure both the spatial sharpness and the spectral fidelity of the reconstructed image.
Disclosure of the Invention

In order to improve the spatial detail fidelity and spectral curve accuracy of hyperspectral reconstruction, the invention provides a hypersurface hyperspectral reconstruction model based on cross-modal text enhancement, which comprises: an image encoder for receiving the multispectral image acquired by the hypersurface, processing the multispectral image through a plurality of cascaded convolution layer groups and generating an image feature representation; a text encoder for receiving the spatial distribution prompt, the spectral variation prompt, or a combination thereof corresponding to the multispectral image and generating corresponding text semantic features; a cross-modal alignment module for calculating cosine similarity between the image feature representation and the text semantic features to obtain a similarity score; a cross-modal condition control module for generating, based on the text projection features, spatial guidance features and spectral guidance features through a spatial-spectral dual-path attention mechanism, performing weighted fusion of the spatial guidance features and the spectral guidance features using the similarity score, and outputting cross-modal enhanced image features; an encoder hierarchical enhancement module for performing multi-scale spatial-spectral modeling and cross-modal dynamic updating on the cross-modal enhanced image features and outputting a plurality of hierarchical fusion features; and a hybrid spatial-spectral decoder for constructing a target hyperspectral feature map based on the plurality of hierarchical fusion features. Further, the cross-modal condition control module includes: a text projection unit, composed of a plurality of cascaded convolution layers, for projecting the text semantic features onto the same spatial-channel dimensions as the image feature representation to obtain text projection features; a spatial attention unit for performing spatial-dimension attention calculation on the image feature representation, taking the text projection