CN-122019821-A - Remote sensing image-text cross-modal retrieval method and system based on a multi-modal large model
Abstract
The invention discloses a remote sensing image-text cross-modal retrieval method and system based on a multi-modal large model. The method comprises constructing and training a retrieval model whose framework covers initial modal feature extraction, fine-grained image-text representation enhancement, a group-aware image-text similarity scoring function, and a multi-view contrastive learning strategy. Initial representations of the text and the image are extracted and the two kinds of fine-grained representations are enhanced separately: for the image, three learning tasks (feature reconstruction, scene classification and background-information alignment) guide the separation of each image block's background information from the target features, and graph learning strengthens the relations between targets; for the text, keyword information is emphasized. Relevance scores are then calculated to generate the similarity of image-text pairs. The system comprises a model building unit, a model training unit and a retrieval unit. The method and system improve cross-modal retrieval precision and can be applied to the field of cross-modal retrieval.
Inventors
- YU CHENYUN
- ZHENG YE
- Cai Liankai
Assignees
- Sun Yat-sen University (中山大学)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-27
Claims (9)
- 1. A remote sensing image-text cross-modal retrieval method based on a multi-modal large model, characterized by comprising the following steps: constructing a data set from image data and text data; constructing a retrieval model from a feature extraction module, a representation enhancement module and a similarity perception calculation module; training the retrieval model on the data set to obtain a trained retrieval model; inputting the image-text data to be retrieved into the trained retrieval model; performing feature extraction on the image-text data to be retrieved with the feature extraction module to obtain initial modal features; performing fine-grained enhancement on the initial modal features with the representation enhancement module to obtain enhanced features; and calculating a group-aware similarity score, a probe similarity score and a global semantic similarity score from the enhanced features and the initial modal features with the similarity perception calculation module, thereby generating the similarity of image-text pairs (a minimal sketch of this pipeline is given after the claims).
- 2. The remote sensing image-text cross-modal retrieval method based on the multi-modal large model according to claim 1, wherein the step of performing feature extraction on the image-text data to be retrieved with the feature extraction module specifically comprises: encoding the image of the image-text data with an image encoder to obtain initial image features, the initial image features comprising global features and local image-block features; and encoding the text of the image-text data with a text encoder to obtain initial text features, the initial text features comprising sentence features and word features.
- 3. The remote sensing image-text cross-modal retrieval method based on the multi-modal large model according to claim 1, wherein the step of performing fine-grained enhancement on the initial modal features with the representation enhancement module to obtain enhanced features specifically comprises: introducing a squeeze-and-excitation network, a feature decoupling network and a graph learning module, and performing decoupling-reasoning enhancement on the initial modal features of the image to obtain the enhanced features of the image; and selecting keywords of the text, and enhancing the initial modal features of the text by combining the squeeze-and-excitation network with a gating unit to obtain an enhanced fine-grained text representation.
- 4. The remote sensing image-text cross-modal retrieval method based on the multi-modal large model according to claim 3, wherein the feature decoupling network works as follows: encoding the global representation of the image and the enhanced local representations generated by the squeeze-and-excitation network into an implicit feature space to obtain first features; and decoupling and optimizing the first features under a feature reconstruction task, a scene classification task and a background-information alignment task to obtain key target features.
- 5. The remote sensing image-text cross-modal retrieval method based on the multi-modal large model as claimed in claim 4, wherein the graph learning module works as follows: the key target features are used as initial graph-node representations, the cosine similarity between nodes is used as edge information, and a fully connected graph is generated; based on the fully connected graph, representation learning is performed through a graph convolutional network, a gated recurrent unit is adopted to capture dependency relations, and the enhanced features of the image are generated (see the graph-reasoning sketch after the claims).
- 6. The remote sensing image-text cross-modal retrieval method based on the multi-modal large model as claimed in claim 1, wherein the group-aware similarity score is calculated over the following quantities (an illustrative form is sketched after the claims): K is the total number of keywords, B is the number of images in the batch, t_i is the enhanced fine-grained representation of the i-th text, and v_j is the fine-grained representation of the j-th image.
- 7. The remote sensing image-text cross-modal retrieval method based on the multi-modal large model as claimed in claim 6, wherein the loss function of the training process is a weighted combination of a triplet ranking loss L_rank, an image-text contrastive learning loss L_cl and an image-feature decoupling loss L_dec, with the corresponding weight parameters (see the loss sketch after the claims); I denotes a single image and 𝕀 a set of images, α denotes the margin parameter, T denotes a single text and 𝕋 a set of texts, and Î and T̂ denote, respectively, the image and the text that are the most similar negative samples in the current batch; L_cl comprises L_i2t, the contrastive learning loss for retrieving texts from an image, and L_t2i, the contrastive learning loss for retrieving images from a text; L_dec comprises L_rec, the reconstruction loss, L_ce, the cross-entropy loss, and L_bg, the background consistency loss of the image.
- 8. A remote sensing image-text cross-modal retrieval system based on a multi-modal large model, characterized by comprising: a data set construction unit for constructing a data set from image data and text data; a model construction unit for constructing a retrieval model from a feature extraction module, a representation enhancement module and a similarity perception calculation module; a model training unit for training the retrieval model on the data set to obtain a trained retrieval model; and a retrieval unit for inputting the image-text data to be retrieved into the trained retrieval model, performing feature extraction on the image-text data to be retrieved with the feature extraction module to obtain initial modal features, performing fine-grained enhancement on the initial modal features with the representation enhancement module to obtain enhanced features, and calculating a group-aware similarity score, a probe similarity score and a global semantic similarity score from the enhanced features and the initial modal features with the similarity perception calculation module to generate the similarity of image-text pairs.
- 9. A remote sensing image-text cross-modal retrieval device based on a multi-modal large model, characterized by comprising: at least one processor; and at least one memory for storing at least one program; wherein, when the at least one program is executed by the at least one processor, the at least one processor is caused to implement the remote sensing image-text cross-modal retrieval method based on a multi-modal large model as claimed in any one of claims 1 to 7.
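To make the claims above concrete, the sketches below use PyTorch; they are minimal illustrations under assumed shapes and module choices, not the patented implementation. The first sketch mirrors the claim-1 pipeline: stand-in linear encoders play the role of the frozen VLP backbone (e.g. CLIP), placeholder MLPs play the role of the representation enhancement module, and a global cosine score stands in for one of the three similarity scores. All names and dimensions (RetrievalModel, dim=512, 768-dimensional patches, 300-dimensional tokens) are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RetrievalModel(nn.Module):
    """Minimal skeleton of the three-module retrieval model in claim 1.
    The encoder stubs stand in for a pre-trained VLP backbone (e.g. CLIP);
    all dimensions and module names are illustrative assumptions."""

    def __init__(self, dim=512):
        super().__init__()
        # Stand-ins for the image/text encoders (feature extraction module).
        self.image_encoder = nn.Linear(768, dim)   # patches -> local image features
        self.text_encoder = nn.Linear(300, dim)    # tokens  -> word features
        # Placeholder MLPs for the representation enhancement module.
        self.img_enhance = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.txt_enhance = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, patches, tokens):
        # Initial modal features; the global feature is the mean of local features.
        v_local = self.image_encoder(patches)      # (B, P, dim)
        t_local = self.text_encoder(tokens)        # (B, W, dim)
        v_global, t_global = v_local.mean(1), t_local.mean(1)
        # Fine-grained enhancement of both modalities.
        v_fine = self.img_enhance(v_local)
        t_fine = self.txt_enhance(t_local)
        # Global semantic similarity, one of the three scores named in claim 1.
        sim_global = F.cosine_similarity(v_global.unsqueeze(1), t_global.unsqueeze(0), dim=-1)
        return v_fine, t_fine, sim_global          # sim_global: (B, B)
```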
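The next sketch illustrates the graph learning module of claim 5: key target features become graph nodes, pairwise cosine similarities form the edge weights of a fully connected graph, one graph-convolution step propagates information, and a gated recurrent unit fuses each node's old and updated state. The single-layer configuration and layer sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphReasoning(nn.Module):
    """Sketch of claim 5: fully connected graph over key target features,
    cosine-similarity edges, one GCN step, and a GRU gate that captures
    the dependency between old and updated node states."""

    def __init__(self, dim=512):
        super().__init__()
        self.gcn = nn.Linear(dim, dim)   # shared graph-convolution weight
        self.gru = nn.GRUCell(dim, dim)  # gated recurrent update

    def forward(self, nodes):            # nodes: (N, dim) key target features
        # Fully connected graph whose edge weights are cosine similarities.
        normed = F.normalize(nodes, dim=-1)
        adj = F.softmax(normed @ normed.t(), dim=-1)   # (N, N), row-normalised
        # One round of graph convolution over the weighted neighbours.
        message = F.relu(self.gcn(adj @ nodes))
        # The GRU decides how much of the message updates each node.
        return self.gru(message, nodes)                # enhanced image features
```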
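Since the claim-6 formula itself is not reproduced in the source text, the following is only one plausible instantiation of a group-aware score over the quantities the claim defines: each keyword representation is matched against every image-block representation, and the best-matching block scores are averaged over keywords. The aggregation choice (max over blocks, mean over keywords) is an assumption, not the patented definition.

```python
import torch
import torch.nn.functional as F

def group_aware_score(text_feats, image_feats):
    """Hypothetical group-aware similarity between one text and one image.

    text_feats:  (K, D) enhanced fine-grained representations, one per keyword.
    image_feats: (P, D) fine-grained representations of the image's blocks.
    """
    t = F.normalize(text_feats, dim=-1)
    v = F.normalize(image_feats, dim=-1)
    sim = t @ v.t()                      # (K, P) keyword-to-block cosine scores
    return sim.max(dim=1).values.mean()  # best block per keyword, averaged
```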
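Finally, a sketch of the claim-7 objective: a hardest-negative triplet ranking loss and a symmetric image-text contrastive loss, combined with an externally computed decoupling loss through weight parameters. The standard forms used for the ranking and contrastive terms, and the weighting scheme, are assumptions where the source does not reproduce the formulas.

```python
import torch
import torch.nn.functional as F

def retrieval_loss(sim, l_dec, margin=0.2, lam1=1.0, lam2=1.0):
    """Sketch of the claim-7 training objective under standard-form assumptions.

    sim:   (B, B) similarity matrix, sim[i, j] = s(I_i, T_j); matched pairs
           lie on the diagonal.
    l_dec: scalar image-feature decoupling loss (L_rec + L_ce + L_bg),
           computed elsewhere.
    """
    B = sim.size(0)
    pos = sim.diag()                                   # s(I, T) for matched pairs
    mask = torch.eye(B, dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, float('-inf'))
    hard_t = neg.max(dim=1).values                     # hardest negative text per image
    hard_i = neg.max(dim=0).values                     # hardest negative image per text
    # Triplet ranking loss with margin and hardest in-batch negatives.
    l_rank = (F.relu(margin - pos + hard_t) + F.relu(margin - pos + hard_i)).mean()
    # Symmetric contrastive loss: image->text (rows) and text->image (columns).
    targets = torch.arange(B, device=sim.device)
    l_cl = F.cross_entropy(sim, targets) + F.cross_entropy(sim.t(), targets)
    return l_rank + lam1 * l_cl + lam2 * l_dec
```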
Description
Remote sensing image-text cross-modal retrieval method and system based on a multi-modal large model

Technical Field

The invention relates to the field of cross-modal retrieval, and in particular to a remote sensing image-text cross-modal retrieval method and system based on a multi-modal large model.

Background

With the rapid development of Earth observation technology, remote sensing image data has grown explosively, providing massive and diverse data support for fields such as geographic information analysis, environmental monitoring, natural disaster assessment and urban planning. However, the high dimensionality, complexity and diversity of remote sensing data pose serious challenges for information extraction and content understanding. Specifically, remote sensing images feature small target scales, complex backgrounds and significant noise interference, while the text descriptions paired with them are often broad and abstract. In addition, problems such as semantic imbalance and large distribution differences across multi-modal data further aggravate the difficulty of accurate image-text matching.

At present, remote sensing image-text retrieval methods mainly comprise deep-learning-based methods and methods based on Vision-Language Pre-trained (VLP) large models. The former usually extract image features with a backbone network such as a Convolutional Neural Network (CNN) or a Vision Transformer (ViT), and then learn a common space for image-text representations by designing a network structure and a modal alignment method, thereby matching cross-modal data. However, the performance of such methods depends largely on the size and quality of the training set, and the entire model (including the multi-modal feature extraction and representation alignment modules) must be retrained, resulting in high overall computational and training costs. In contrast, VLP models learn more versatile and richer semantic representations through joint pre-training on large-scale cross-modal data sets, so VLP-based cross-modal retrieval methods can more effectively shorten the distribution distance between text and image features. Moreover, in practice such methods typically rely on model tuning (updating a relatively small number of network parameters, or freezing the model for use only as an initial feature extractor), and thus incur relatively little computational and training overhead. However, cross-modal remote sensing image-text retrieval techniques based on VLP models still face several challenges. First, the targets in remote sensing images are fine and the background information of images sharing a scene label is highly similar, which makes it difficult to accurately identify the small differences between images. Second, existing methods focus on global semantic learning and pay little attention to the fine-grained characteristics of images and texts, so their capability for accurate image-text matching in remote sensing scenes is poor.
Disclosure of Invention

In view of this, in order to solve the technical problem that existing VLP-based cross-modal methods mainly consider instance-level feature alignment and neglect global modal correlation, resulting in low cross-modal image-text retrieval accuracy, in a first aspect the invention provides a remote sensing image-text cross-modal retrieval method based on a multi-modal large model, comprising: constructing and training a retrieval model whose framework comprises four core parts, namely initial modal feature extraction, fine-grained image-text representation enhancement, a group-aware image-text similarity scoring function, and a multi-view contrastive learning strategy. The overall flow of the retrieval model is as follows. First, the initial representations of the text and the image (covering both global and fine-grained representations) are extracted with a pre-trained multi-modal large model (e.g. CLIP). On the one hand, an image decoupling-reasoning enhancement module is designed: three learning tasks, namely feature reconstruction, scene classification and background-information alignment, serve as guidance to effectively separate the background information of each image block from the target features, while graph learning strengthens the relations between targets, providing a clearer and more accurate data basis for subsequent modal matching (a sketch of this decoupling design is given below). On the other hand, an important-word labeling mechanism is introduced for the text modality to strengthen the key semantic information in the text description. On this foundation of fine-grained modal representation enhancement, the scheme designs a group-aware image-text similarity scoring function and realizes modal alignment from a macroscopic angle.
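As a rough illustration of the decoupling-reasoning module described above, the sketch below projects image-block features into a latent space, splits them into target and background parts, and supervises the split with the three stated tasks: feature reconstruction, scene classification, and background-information alignment (here approximated by pulling each block's background toward the image-level mean, on the premise that blocks of one image share a background). The split strategy, head shapes, and loss forms are assumptions, not the patented design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecouplingNetwork(nn.Module):
    """Sketch of the image decoupling-reasoning idea: split each block's
    latent feature into target and background parts, guided by three
    auxiliary tasks. Dimensions and heads are illustrative."""

    def __init__(self, dim=512, n_scenes=30):
        super().__init__()
        self.encode = nn.Linear(dim, 2 * dim)       # latent space, then split
        self.decode = nn.Linear(dim, dim)           # feature reconstruction head
        self.scene_head = nn.Linear(dim, n_scenes)  # scene classification head

    def forward(self, local_feats, scene_labels):
        # local_feats: (B, P, dim) image-block features; scene_labels: (B,)
        z = self.encode(local_feats)
        target, background = z.chunk(2, dim=-1)     # decoupled representations
        # 1) Reconstruction: target + background should recover the input.
        l_rec = F.mse_loss(self.decode(target + background), local_feats)
        # 2) Scene classification on the pooled target features.
        logits = self.scene_head(target.mean(dim=1))
        l_ce = F.cross_entropy(logits, scene_labels)
        # 3) Background alignment: each block's background is pulled toward
        #    the image-level mean background.
        mean_bg = background.mean(dim=1, keepdim=True).expand_as(background)
        l_bg = F.mse_loss(background, mean_bg)
        return target, l_rec + l_ce + l_bg          # key target features, L_dec
```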