CN-121980054-A - Cross-modal remote sensing image retrieval method and device
Abstract
The invention discloses a cross-modal remote sensing image retrieval method and device, belonging to the technical fields of intelligent remote sensing image analysis and cross-modal information retrieval. A sample set containing bi-temporal remote sensing image pairs and corresponding change description texts is constructed. A multi-scale dual-branch text-guided visual change feature extraction module jointly models each bi-temporal image pair to generate a visual change feature representation; a frozen pre-trained text encoder provides semantic priors, and the multi-scale local change features are weight-scaled to highlight text-related regions. A text semantic modeling module generates a global text feature representation, and semantic alignment of the change text and the bi-temporal image change features in a shared space is achieved by jointly optimizing a constraint module containing a global cross-modal contrastive loss and a local semantic alignment loss. The invention realizes retrieval of remote sensing image pairs from natural language change descriptions, and improves the accuracy of change semantic expression and the reliability of cross-modal matching.
Inventors
- YU JIEXIAO
- FU YUJIE
- LIU JING
Assignees
- Tianjin University (天津大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-02-03
Claims (10)
- 1. A cross-modal remote sensing image retrieval method, characterized by comprising the following steps: acquiring a pre-change remote sensing image, a post-change remote sensing image and a corresponding change description text of the same geographic area; performing feature extraction and joint modeling on the pre-change and post-change remote sensing images to generate a multi-scale visual change feature representation; processing the change description text with a frozen pre-trained text encoder to obtain text semantic features, and weight-scaling the multi-scale visual change feature representation according to the text semantic features to generate an image change embedding vector; semantically encoding the change description text to generate a global text embedding vector; and achieving semantic alignment of the image change embedding vector and the global text embedding vector in a shared feature space by jointly optimizing a global cross-modal contrastive loss and a local semantic alignment loss.
- 2. The method of claim 1, wherein performing feature extraction and joint modeling on the pre-change and post-change remote sensing images comprises: constructing a global change feature extraction branch and a local change feature extraction branch; performing temporal fusion of the overall features of the bi-temporal images through the global change feature extraction branch to generate a global change feature representation characterizing scene-level change semantics; and extracting local features of the bi-temporal images through the local change feature extraction branch, computing local feature differences at multiple spatial scales, and generating a multi-scale local change feature representation (an illustrative sketch of the two branches follows the claims).
- 3. The method of claim 2, wherein generating the global change feature representation comprises: extracting global visual feature vectors of the pre-change remote sensing image and the post-change remote sensing image respectively; and inputting the two global visual feature vectors into a temporal attention fusion module for weighted fusion, outputting the global change feature representation.
- 4. The method of claim 2, wherein generating the multi-scale local change feature representation comprises: obtaining local patch feature sequences of the pre-change remote sensing image and the post-change remote sensing image; at two or more different resolution scales, inputting the bi-temporal local patch feature sequences into a change Transformer encoder for interactive encoding; computing the differences of the encoded bi-temporal features at corresponding positions to obtain a change feature sequence at each scale; and aligning and fusing the change feature sequences across the different scales to form a unified multi-scale local change feature representation.
- 5. The method of claim 1, wherein weight-scaling the multi-scale visual change feature representation according to the text semantic features comprises: computing the relevance between the text semantic features and each local feature in the multi-scale visual change feature representation; generating weight information according to the relevance; and performing weighted aggregation on the multi-scale visual change feature representation with the weight information, highlighting the visual change regions related to the text semantics (an illustrative sketch follows the claims).
- 6. The method of claim 1, wherein semantically encoding the change description text comprises: converting the change description text into a token sequence and producing an embedded representation; modeling the contextual semantics of the embedded representation with a self-attention-based text encoder; and extracting from the encoded sequence a global text feature representation characterizing the overall text semantics as the global text embedding vector (an illustrative sketch follows the claims).
- 7. The method of claim 1, wherein the global cross-modal contrastive loss is an InfoNCE loss based on multiple positive examples, and the local semantic alignment loss is constructed through a cross-modal cross-attention mechanism that takes the text token representations as queries and the multi-scale visual change feature representation as keys and values, computing the attention distribution between the text and the change regions to supervise the alignment (an illustrative sketch follows the claims).
- 8. A cross-modal remote sensing image retrieval apparatus for implementing the method of any one of claims 1-7, the apparatus comprising: an image and text input module for acquiring a pre-change remote sensing image, a post-change remote sensing image and a corresponding change description text of the same geographic area; an image feature joint modeling module for performing feature extraction and joint modeling on the pre-change and post-change remote sensing images to generate a multi-scale visual change feature representation; a text-guided visual feature scaling module for processing the change description text with a frozen pre-trained text encoder to obtain text semantic features, and weight-scaling the multi-scale visual change feature representation according to the text semantic features to generate an image change embedding vector; a text semantic encoding module for semantically encoding the change description text to generate a global text embedding vector; and a cross-modal semantic alignment training module for achieving semantic alignment of the image change embedding vector and the global text embedding vector in a shared feature space by jointly optimizing a global cross-modal contrastive loss and a local semantic alignment loss.
- 9. A computer terminal device, comprising: one or more processors; and a memory coupled to the one or more processors for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the steps of the method of any one of claims 1-7.
- 10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1-7.
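To make the dual-branch extraction of claims 2-4 concrete, here is a minimal PyTorch sketch, assuming ViT-style patch tokens on a square grid, two resolution scales, and illustrative module names and dimensions; none of these choices are specified by the patent.

```python
import torch
import torch.nn as nn

class TemporalAttentionFusion(nn.Module):
    """Weighted fusion of the two global visual feature vectors (claim 3)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, g_pre, g_post):            # each: (B, D)
        pair = torch.stack([g_pre, g_post], 1)   # (B, 2, D)
        w = torch.softmax(self.score(pair), 1)   # temporal attention weights
        return (w * pair).sum(1)                 # global change feature (B, D)

class MultiScaleLocalChange(nn.Module):
    """Interactive encoding of bi-temporal patch tokens at two scales,
    position-wise differencing, then cross-scale fusion (claim 4)."""
    def __init__(self, dim, scales=(14, 7)):
        super().__init__()
        self.scales = scales
        make = lambda: nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoders = nn.ModuleList(
            [nn.TransformerEncoder(make(), num_layers=2) for _ in scales])
        self.fuse = nn.Linear(dim * len(scales), dim)

    def forward(self, t_pre, t_post):            # each: (B, N, D), N = 14*14
        feats = []
        for s, enc in zip(self.scales, self.encoders):
            pre, post = self._pool(t_pre, s), self._pool(t_post, s)
            joint = enc(torch.cat([pre, post], 1))   # joint interactive encoding
            pre_e, post_e = joint.chunk(2, dim=1)
            feats.append(post_e - pre_e)         # change sequence at scale s
        # align all sequences to the finest scale, then fuse channel-wise
        n = feats[0].shape[1]
        aligned = [nn.functional.interpolate(f.transpose(1, 2), size=n).transpose(1, 2)
                   for f in feats]
        return self.fuse(torch.cat(aligned, dim=-1))  # (B, N, D)

    @staticmethod
    def _pool(tokens, side):
        # resample a square token grid to side x side tokens
        b, n, d = tokens.shape
        g = int(n ** 0.5)
        x = tokens.transpose(1, 2).reshape(b, d, g, g)
        x = nn.functional.adaptive_avg_pool2d(x, side)
        return x.flatten(2).transpose(1, 2)
```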
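The text-guided weight scaling of claim 5 can be sketched as single-query attention pooling: dot-product relevance between the text semantic feature and each local change feature, softmax weighting, then weighted aggregation. This is one plausible reading, assuming the text feature and change tokens share a common dimension.

```python
import torch
import torch.nn as nn

class TextGuidedScaling(nn.Module):
    """Scales multi-scale local change features by their relevance to the
    text semantics from the frozen pre-trained encoder (claim 5)."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the text semantic feature
        self.k = nn.Linear(dim, dim)   # projects the local change features

    def forward(self, text_feat, change_tokens):     # (B, D), (B, N, D)
        q = self.q(text_feat).unsqueeze(1)            # (B, 1, D)
        k = self.k(change_tokens)                     # (B, N, D)
        rel = (q * k).sum(-1) / k.shape[-1] ** 0.5    # text-region relevance (B, N)
        w = torch.softmax(rel, dim=-1).unsqueeze(-1)  # weight information
        # weighted aggregation highlights text-related change regions
        return (w * change_tokens).sum(1)             # image change embedding (B, D)
```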
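For the text semantic encoding of claim 6, a minimal self-attention encoder with a CLS-style pooled output might look as follows; the vocabulary size, depth and CLS pooling are assumptions, since the claim only requires token embedding, self-attention context modeling, and extraction of a global representation.

```python
import torch
import torch.nn as nn

class TextSemanticEncoder(nn.Module):
    """Tokenised change description -> global text embedding vector (claim 6)."""
    def __init__(self, vocab=30522, dim=512, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab, dim)
        self.pos_emb = nn.Parameter(torch.zeros(1, max_len + 1, dim))
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)

    def forward(self, token_ids):                    # (B, L)
        x = self.tok_emb(token_ids)                  # embedded representation
        cls = self.cls.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1)               # prepend CLS token
        x = x + self.pos_emb[:, : x.shape[1]]
        x = self.encoder(x)                          # contextual self-attention
        return x[:, 0]                               # global text embedding (B, D)
```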
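Claim 7 names a multi-positive InfoNCE loss and a cross-attention-based local alignment loss. The sketch below implements multi-positive InfoNCE over an image-text similarity matrix, and a cross-attention term with text tokens as queries and change features as keys/values; the cosine-matching form of the local supervision is an assumption, as the claim does not fix the exact objective.

```python
import torch
import torch.nn.functional as F

def multi_positive_infonce(img_emb, txt_emb, pos_mask, tau=0.07):
    """Global contrastive loss where each image pair may match several
    captions; pos_mask[i, j] = 1.0 if text j describes image pair i.
    A symmetric text-to-image term can be added analogously."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                       # (B_img, B_txt)
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # average log-likelihood over all positives of each image pair
    loss = -(pos_mask * log_prob).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss.mean()

def local_alignment_loss(text_tokens, change_tokens):
    """Cross-modal cross-attention: text tokens as queries, change features
    as keys and values; encourages each attended change feature to agree
    with its querying text token."""
    d = text_tokens.shape[-1]
    attn = torch.softmax(
        text_tokens @ change_tokens.transpose(1, 2) / d ** 0.5, dim=-1)
    attended = attn @ change_tokens                    # (B, L, D)
    return (1 - F.cosine_similarity(attended, text_tokens, dim=-1)).mean()
```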
Description
Cross-modal remote sensing image retrieval method and device

Technical Field

The invention belongs to the technical fields of intelligent remote sensing image analysis and cross-modal information retrieval, and particularly relates to a cross-modal remote sensing image retrieval method and device.

Background

With the rapid development of remote sensing imaging technology and Earth observation platforms, the scale of high-resolution multi-temporal remote sensing image data continues to grow, and such data plays a vital role in fields such as urban expansion monitoring, land use change analysis and disaster assessment. In practical applications, users often want to quickly and accurately retrieve, from a massive library of historical and current remote sensing images, the bi-temporal image pair whose before-and-after content reflects a described geographic change, by inputting a natural language description such as "a residential area is newly built in a certain region" or "a road is covered by vegetation". Such retrieval based on change semantics aims to improve the intuitiveness and efficiency of remote sensing data queries. However, existing solutions can hardly support this need efficiently. Mainstream remote sensing image retrieval methods are mostly built around single static images, their core being to establish a matching relation between the visual content of an image and a text description; they therefore ignore the critical temporal dimension of remote sensing data and cannot model or respond to semantic queries about "change" events. On the other hand, although conventional remote sensing change detection techniques can identify and locate the specific pixels or regions that have changed by comparing bi-temporal images, their goal is to generate a change map or segmentation map; this is a closed visual analysis process that is not associated with natural language semantics, so related image pairs cannot be retrieved directly from open-domain change description text. The prior art thus has an obvious gap: it lacks a technical means that deeply fuses change description text semantics with bi-temporal remote sensing image change features, so as to achieve cross-modal change retrieval, i.e., retrieving image pairs by text. The root of the problem is that two challenges must be overcome simultaneously: accurately extracting and expressing the complex change information between bi-temporal images, and performing fine-grained alignment and matching of visual changes and textual change semantics in the same feature space. Solving this cross-modal temporal semantic alignment problem is the key to efficient and accurate remote sensing change retrieval.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a cross-modal remote sensing image retrieval method and device, which are used for solving the problems existing in the prior art.
In order to achieve the above object, the invention provides a cross-modal remote sensing image retrieval method, comprising the following steps: acquiring a pre-change remote sensing image, a post-change remote sensing image and a corresponding change description text of the same geographic area; performing feature extraction and joint modeling on the pre-change and post-change remote sensing images to generate a multi-scale visual change feature representation; processing the change description text with a frozen pre-trained text encoder to obtain text semantic features, and weight-scaling the multi-scale visual change feature representation according to the text semantic features to generate an image change embedding vector; semantically encoding the change description text to generate a global text embedding vector; and achieving semantic alignment of the image change embedding vector and the global text embedding vector in a shared feature space by jointly optimizing a global cross-modal contrastive loss and a local semantic alignment loss. Optionally, performing feature extraction and joint modeling on the pre-change and post-change remote sensing images includes: constructing a global change feature extraction branch and a local change feature extraction branch; performing temporal fusion of the overall features of the bi-temporal images through the global change feature extraction branch to generate a global change feature representation characterizing scene-level change semantics; and extracting local features of the bi-temporal images through the local change feature extraction branch, computing local feature differences at multiple spatial scales, and generating a multi-scale local change feature representation.
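Once training has aligned both modalities in the shared space, retrieval itself reduces to nearest-neighbor ranking of precomputed embeddings. A minimal usage sketch, assuming precomputed gallery embeddings and a hypothetical `retrieve` helper that is not part of the patent:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_emb, gallery_embs, k=10):
    """Rank bi-temporal image-pair embeddings by cosine similarity to a
    change-text query embedding in the shared feature space."""
    q = F.normalize(query_emb, dim=-1)            # (D,)
    g = F.normalize(gallery_embs, dim=-1)         # (M, D)
    scores = g @ q                                # (M,) cosine similarities
    return torch.topk(scores, min(k, g.shape[0]))  # top-k scores and indices
```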