CN-122019820-A - Zero-shot remote sensing composed image retrieval method based on multi-modal fusion
Abstract
The invention relates to a zero-shot remote sensing composed image retrieval method based on multi-modal fusion, comprising the following steps. First, a zero-shot query text generator is designed that can automatically generate complete query text samples from attributes, so that a remote sensing composed image reference dataset of triples can be constructed; this solves the problem that existing datasets in the remote sensing field cannot support multi-modal retrieval. Second, the invention provides a composed image retrieval method with sequential text-image training, which effectively exploits the bimodal information of images and texts through a multi-stage training strategy and avoids the limitation of traditional methods that rely on text alone. In addition, the sequential text-image training can filter out conflicting information between the images and the text, preserve the fine-grained features of the images, and significantly improve retrieval precision.
Inventors
- HUANG TAO
- XUE SHIWEN
- LIANG ZHECHUN
- WANG ZHENYU
- DONG WEISHENG
Assignees
- Xidian University (西安电子科技大学)
- Hangzhou Institute of Technology, Xidian University (西安电子科技大学杭州研究院)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-09
Claims (10)
- 1. A zero-shot remote sensing composed image retrieval method based on multi-modal fusion, characterized by comprising the following steps: encoding a current query text and a current query image to obtain text embedding information and image embedding information, respectively; extracting fine-grained features from the image embedding information based on a pre-constructed fine-grained image attention model to obtain fine-grained image information, wherein the fine-grained image attention model is trained on query image samples over a multi-head self-attention network structure and is used to filter out image embedding information that does not match the text embedding information; and matching the text embedding information and the fine-grained image information against candidate images, respectively, to obtain a target image.
- 2. The method according to claim 1, wherein extracting fine-grained features from the image embedding information based on the pre-constructed fine-grained image attention model to obtain the fine-grained image information comprises: inputting the image embedding information into the fine-grained image attention model; and re-weighting the features of the image embedding information through the multi-head self-attention network structure in the fine-grained image attention model to determine the fine-grained image information.
- 3. The method according to claim 1, wherein matching the text embedding information and the fine-grained image information against the candidate images, respectively, to obtain the target image comprises: encoding the candidate images in advance to obtain candidate embedding information; calculating a first cosine distance based on the text embedding information and the candidate embedding information; calculating a second cosine distance based on the fine-grained image information and the candidate embedding information; weighting and fusing the first cosine distance and the second cosine distance according to a preset weighting coefficient to obtain a comprehensive distance for each candidate image; and screening the target image out of the candidate images according to the comprehensive distance.
- 4. The method according to any one of claims 1 to 3, further comprising: dividing an initial remote sensing image set into a plurality of image subsets based on image similarity, and constructing image pairs within each image subset, wherein each image pair comprises a query image sample and a target image sample; generating a query text sample matched with the image pair by using a zero-shot query text generator, wherein the zero-shot query text generator comprises a text generation sub-model and a contrastive language-image pre-training sub-model; calculating a first loss value by using the query text sample and a first loss function, and training a multiple self-masking network structure by updating the first loss value to obtain a multiple self-masking projection model; and calculating a second loss value based on the multiple self-masking projection model, the query image sample and a second loss function, and training the multi-head self-attention network structure by updating the second loss value to obtain the fine-grained image attention model.
- 5. The method according to claim 4, wherein dividing the initial remote sensing image set into a plurality of image subsets based on image similarity and constructing image pairs within each image subset comprises: extracting features from each remote sensing image in the initial remote sensing image set to obtain corresponding image feature vectors; dividing the initial remote sensing image set based on the image similarity among the image feature vectors to obtain the plurality of image subsets, wherein the image similarity of the remote sensing images within the same image subset falls within a preset similarity range; and randomly selecting one remote sensing image from each image subset as the query image sample, and taking the remaining remote sensing images in that image subset in turn as target image samples to form image pairs with the query image sample.
- 6. The method according to claim 4, wherein generating the query text sample matched with the image pair by using the zero-shot query text generator comprises: inputting the target image sample and a structured text template into the text generation sub-model to obtain candidate query texts and corresponding fluency scores; calculating the cross-modal similarity between each candidate query text and the target image sample by using the contrastive language-image pre-training sub-model; and screening the query text sample out of the candidate query texts based on the fluency scores and the cross-modal similarities.
- 7. The method according to claim 6, wherein calculating the first loss value by using the query text sample and the first loss function comprises: encoding the query text sample with a text encoder to generate reference text embedding information; adding noise to the reference text embedding information to generate noisy text embedding information; performing multi-template masking and projection reconstruction on the noisy text embedding information by using the multiple self-masking structure to generate a reconstructed query text, wherein the multiple self-masking structure comprises a plurality of parallel self-masking branches, each self-masking branch corresponding to one masking template; encoding the reconstructed query text to generate reconstructed text embedding information; and substituting the reference text embedding information and the reconstructed text embedding information into the first loss function to obtain the first loss value.
- 8. The method according to claim 4, wherein calculating the second loss value based on the multiple self-masking projection model, the query image sample and the second loss function comprises: encoding the query image sample to obtain image sample embedding information; inputting the image sample embedding information into the multiple self-masking projection model to obtain conflict information between the query image sample and the query text sample; inputting the image sample embedding information into the multi-head self-attention structure to obtain fine-grained image sample embedding information; and substituting the image sample embedding information, the conflict information and the fine-grained image sample embedding information into the second loss function to obtain the second loss value.
- 9. A computer readable storage medium, characterized in that the computer readable storage medium stores an executable program, wherein the executable program, when run, controls a device in which the storage medium is located to perform the method of any one of claims 1 to 8.
- 10. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
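As an informal illustration (not part of the claims), the text-screening step of claim 6 combines a fluency score from the text generation sub-model with a cross-modal similarity from the contrastive language-image pre-training sub-model. The sketch below is a minimal Python stand-in: the weighting `alpha`, the candidate texts, and the score values are all hypothetical assumptions.

```python
def select_query_text(candidates, alpha=0.5):
    """Screen the query text sample out of the candidate query texts by
    maximising a weighted sum of fluency score and cross-modal similarity,
    as described in claim 6. `alpha` balances the two scores."""
    best = max(
        candidates,
        key=lambda c: alpha * c["fluency"] + (1 - alpha) * c["clip_sim"],
    )
    return best["text"]


# Hypothetical candidates: a fluent text and a scrambled one with equal
# cross-modal similarity; the fluent candidate should win.
candidates = [
    {"text": "an airport with two runways", "fluency": 0.9, "clip_sim": 0.6},
    {"text": "two runways airport an with", "fluency": 0.2, "clip_sim": 0.6},
]
print(select_query_text(candidates))  # → "an airport with two runways"
```

In practice the fluency score would come from the language model's own likelihood and the similarity from a CLIP-style encoder; here both are toy numbers.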
Description
Zero-shot remote sensing composed image retrieval method based on multi-modal fusion
Technical Field
The invention relates to the technical field of remote sensing image retrieval, and in particular to a zero-shot remote sensing composed image retrieval method based on multi-modal fusion.
Background
Applications of remote sensing in the field of earth observation are receiving increasing attention. The large amount of remote sensing data provides rich material for computer vision tasks. However, managing and extracting relevant images from such data is a significant challenge, and the ability to retrieve remote sensing images efficiently and quickly is critical. Composed image retrieval (CIR) addresses this problem by searching for and retrieving images from a remote sensing image archive. Unlike traditional unimodal queries, CIR is a more challenging vision-language task that involves paired images and text: the goal is to retrieve a target image by modifying an existing reference image according to a textual query. Because language is the most natural medium of human communication, CIR enhances the retrieval process, allowing users to acquire the required images more effectively by refining queries in language.
In the field of remote sensing composed image retrieval (RSCIR), the prior art faces three core challenges:
- (1) Lack of a standardized reference dataset. For example, the attribute-labelled remote sensing image retrieval dataset PatternCom only supports attribute-based retrieval and lacks text query statements and structured triples, which makes it difficult to evaluate the performance of existing methods on remote sensing composed image retrieval.
- (2) Existing zero-shot composed image retrieval (ZS-CIR) methods retrieve through the text modality alone and ignore the fine-grained information in the image, so retrieval precision is low, particularly in the remote sensing domain.
- (3) Existing ZS-CIR methods usually use a single projection module to process multi-position masks, which cannot fully capture semantic diversity and further degrades retrieval performance.
It should be noted that the information disclosed in the background section above is only intended to enhance understanding of the background of the invention and may therefore include information that does not form prior art already known to a person of ordinary skill in the art.
Disclosure of Invention
The invention provides a zero-shot remote sensing composed image retrieval method based on multi-modal fusion, a computer readable storage medium and a computer program product, which can effectively overcome the defects of the prior art. Other features and advantages of the invention will be apparent from the following detailed description, or may be learned by practice of the invention.
According to a first aspect of the present invention, there is provided a zero-shot remote sensing composed image retrieval method based on multi-modal fusion, the method comprising: encoding a current query text and a current query image to obtain text embedding information and image embedding information, respectively; extracting fine-grained features from the image embedding information based on a pre-constructed fine-grained image attention model to obtain fine-grained image information, wherein the fine-grained image attention model is trained on query image samples over a multi-head self-attention network structure and is used to filter out image embedding information that does not match the text embedding information; and matching the text embedding information and the fine-grained image information against candidate images, respectively, to obtain a target image. In some exemplary embodiments, extracting fine-grained features from the image embedding information based on the pre-constructed fine-grained image attention model to obtain the fine-grained image information includes: inputting the image embedding information into the fine-grained image attention model; and re-weighting the features of the image embedding information through the multi-head self-attention network structure in the fine-grained image attention model to determine the fine-grained image information.
In some exemplary embodiments, matching the text embedding information and the fine-grained image information against the candidate images, respectively, to obtain the target image includes: encoding the candidate images in advance to obtain candidate embedding information; calculating a first cosine distance based on the text embedding information and the candidate embedding information; calculating a second cosine distance based on the fine-grained image information and the candidate embedding information
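The weighted distance fusion of claim 3 can be sketched as follows: two cosine distances are computed per candidate, fused with a preset weighting coefficient, and the candidate with the smallest comprehensive distance is selected. The embeddings, candidate set and weight `w` below are toy assumptions.

```python
import numpy as np


def cosine_dist(a, b):
    """Cosine distance: 1 minus the cosine similarity of vectors a and b."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))


def retrieve(text_emb, fine_img_emb, candidates, w=0.5):
    """Fuse the text-to-candidate and fine-grained-image-to-candidate
    cosine distances with weight w and return the index of the candidate
    with the smallest comprehensive distance (claims 1 and 3)."""
    scores = [
        w * cosine_dist(text_emb, c) + (1 - w) * cosine_dist(fine_img_emb, c)
        for c in candidates
    ]
    return int(np.argmin(scores))


# Toy 2-D embeddings: the third candidate lies between the text and the
# fine-grained image embedding, so it minimises the fused distance.
text_emb = np.array([1.0, 0.0])
fine_img_emb = np.array([0.0, 1.0])
candidates = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
print(retrieve(text_emb, fine_img_emb, candidates))  # → 2
```

Setting `w` closer to 1 favours the text query, closer to 0 the fine-grained image features; the patent leaves the coefficient as a preset hyperparameter.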