CN-122024089-A - Method and system for positioning image sequence under cross-view angle
Abstract
The invention belongs to the fields of artificial intelligence and computer vision, and discloses a method and a system for positioning an image sequence across viewing angles. The method comprises: obtaining street-view sequence images and corresponding satellite images; constructing a cross-view image-sequence visual positioning model; training the model using the street-view sequence images and the corresponding satellite images; and inputting the street-view sequence images to be tested into the trained model, so as to realize satellite-image retrieval.
Inventors
- ZHENG XIN
- LI WEIGANG
- YIN QIAN
- FENG MINGZHE
- BAO RUI
Assignees
- Beijing Normal University (北京师范大学)
Dates
- Publication Date
- 20260512
- Application Date
- 20260206
Claims (8)
- 1. A method for cross-view image sequence positioning, the method comprising: obtaining street-view sequence images and corresponding satellite images; constructing a cross-view image-sequence visual positioning model; training the cross-view image-sequence visual positioning model using the street-view sequence images and the corresponding satellite images; and inputting the street-view sequence images to be tested into the trained cross-view image-sequence visual positioning model, so as to realize satellite-image retrieval.
- 2. The method of claim 1, wherein the cross-view image-sequence visual positioning model comprises a feature extraction module, a feature fusion module and an aggregation pooling module; the feature extraction module is used for extracting preliminary feature representations of the street-view sequence images and the corresponding satellite images through the backbone network ConvNeXt-B; the feature fusion module consists of a hierarchical graph neural network (GNN) fusion module and is used for embedding the street-view features extracted by the backbone network as frame-level nodes and generating scene-level high-level node embeddings by aggregation; the aggregation pooling module is used for aggregating and pooling the scene-level high-level node embeddings through the graph neural network again, extracting the final sequence-level embedding for computation and fusion, followed by metric learning.
- 3. The method of claim 2, wherein embedding the street-view features extracted by the backbone network as frame-level nodes and generating scene-level high-level node embeddings by aggregation comprises computing X′ = SᵀX; wherein S is the assignment matrix and X is the feature matrix; B denotes the batch size (B sequences are processed at once), C is the number of new aggregation nodes in a layer, and F is the feature dimension of the nodes, which is unchanged during aggregation.
- 4. The method of claim 2, wherein training the cross-view image-sequence visual positioning model using the street-view sequence images and the corresponding satellite images comprises: performing semantic feature-embedding learning of satellite features and street-view features using the InfoNCE loss function of contrastive learning; and training with the similarity distance between the learned feature vectors and the geographic coordinates as a loss function, so as to constrain and optimize the model.
- 5. A system for positioning an image sequence across viewing angles, the system implementing the method of any of claims 1-4 and comprising an acquisition module, a construction module, a training module and a positioning module; the acquisition module is used for acquiring street-view sequence images and corresponding satellite images; the construction module is used for constructing the cross-view image-sequence visual positioning model; the training module is used for training the cross-view image-sequence visual positioning model using the street-view sequence images and the corresponding satellite images; the positioning module is used for inputting the street-view sequence images to be tested into the trained cross-view image-sequence visual positioning model, so as to realize satellite-image retrieval.
- 6. The system of claim 5, wherein the cross-view image-sequence visual positioning model comprises a feature extraction module, a feature fusion module and an aggregation pooling module; the feature extraction module is used for extracting preliminary feature representations of the street-view sequence images and the corresponding satellite images through the backbone network ConvNeXt-B; the feature fusion module consists of a hierarchical graph neural network (GNN) fusion module and is used for embedding the street-view features extracted by the backbone network as frame-level nodes and generating scene-level high-level node embeddings by aggregation; the aggregation pooling module is used for aggregating and pooling the scene-level high-level node embeddings through the graph neural network again, extracting the final sequence-level embedding for computation and fusion, followed by metric learning.
- 7. The system of claim 6, wherein embedding the street-view features extracted by the backbone network as frame-level nodes and generating scene-level high-level node embeddings by aggregation comprises computing X′ = SᵀX; wherein S is the assignment matrix and X is the feature matrix; B denotes the batch size (B sequences are processed at once), C is the number of new aggregation nodes in a layer, and F is the feature dimension of the nodes, which is unchanged during aggregation.
- 8. The system of claim 6, wherein training the cross-view image-sequence visual positioning model using the street-view sequence images and the corresponding satellite images comprises: performing semantic feature-embedding learning of satellite features and street-view features using the InfoNCE loss function of contrastive learning; and training with the similarity distance between the learned feature vectors and the geographic coordinates as a loss function, so as to constrain and optimize the model.
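Claims 3 and 7 describe pooling frame-level nodes into C new scene-level nodes with an assignment matrix while keeping the feature dimension F unchanged. The following NumPy sketch illustrates only that shape contract; the softmax normalization of the assignment matrix and the DiffPool-style product X′ = SᵀX are assumptions for illustration, since the patent's formula image is not reproduced in this text.

```python
import numpy as np

def aggregate_nodes(X, S):
    """Pool N frame-level nodes into C scene-level nodes per sequence.

    X: (B, N, F) frame-level node embeddings (B sequences per batch)
    S: (B, N, C) assignment scores (learned in the patent; random here)
    Returns: (B, C, F) scene-level embeddings -- F is unchanged.
    """
    # Normalize assignments over the C new nodes (softmax per frame node)
    S = np.exp(S - S.max(axis=-1, keepdims=True))
    S = S / S.sum(axis=-1, keepdims=True)
    # X' = S^T X : each new node is a weighted mix of the frame nodes
    return np.einsum('bnc,bnf->bcf', S, X)

B, N, C, F = 2, 7, 3, 16          # batch, frames, new nodes, feature dim
rng = np.random.default_rng(0)
X = rng.normal(size=(B, N, F))
S = rng.normal(size=(B, N, C))
out = aggregate_nodes(X, S)
print(out.shape)                   # (2, 3, 16): feature dimension preserved
```

Stacking such layers yields the hierarchy the claims describe: frame-level nodes are merged into ever fewer, higher-level nodes while F stays fixed.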
Description
Method and system for positioning image sequence under cross-view angle

Technical Field

The invention belongs to the fields of artificial intelligence and computer vision, and particularly relates to a method and a system for positioning an image sequence across viewing angles.

Background

In cross-view image positioning, sequence images contain more contextual information than single-frame images. However, during image retrieval the existing methods fuse this information by average pooling and the like, neglecting both the associations within the sequence and the constraint relationship with external geographic coordinates, so this information is lost during geographic positioning; fusion methods based on temporal attention, moreover, require a fixed input sequence. The invention instead fuses multi-frame images with a hierarchical graph neural network based on the similarity between sequence images, which greatly enhances the use of spatial structure information and adapts to unordered input, and trains with a consistency-loss constraint between image features and geographic coordinates, thereby enhancing the model's geographic consistency and retrieval performance.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a method and a system for positioning an image sequence under a cross-viewing angle. Within a cross-view image-sequence model, the method provides a hierarchical-scene-based fusion of multi-frame images and a geographic-consistency-loss training constraint, and can be used for tasks such as image retrieval and cross-view geographic positioning.
In order to achieve the above object, the present invention provides the following solutions. A method for cross-view image sequence positioning comprises: obtaining street-view sequence images and corresponding satellite images; constructing a cross-view image-sequence visual positioning model; training the model using the street-view sequence images and the corresponding satellite images; and inputting the street-view sequence images to be tested into the trained model, so as to realize satellite-image retrieval. Preferably, the cross-view image-sequence visual positioning model comprises a feature extraction module, a feature fusion module and an aggregation pooling module. The feature extraction module is used for extracting preliminary feature representations of the street-view sequence images and the corresponding satellite images through the backbone network ConvNeXt-B; the feature fusion module consists of a hierarchical graph neural network (GNN) fusion module and is used for embedding the street-view features extracted by the backbone network as frame-level nodes and generating scene-level high-level node embeddings by aggregation; the aggregation pooling module is used for aggregating and pooling the scene-level high-level node embeddings through the graph neural network again, extracting the final sequence-level embedding for computation and fusion, followed by metric learning.
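At inference time, the retrieval step above (inputting the street-view sequence to be tested into the trained model to retrieve a satellite image) reduces to nearest-neighbour search between the sequence-level embedding and a gallery of satellite embeddings. A minimal sketch follows, assuming cosine similarity and random stand-in vectors; in the patent's pipeline the embeddings would come from the trained ConvNeXt-B/GNN model.

```python
import numpy as np

def retrieve(seq_embedding, sat_gallery):
    """Return satellite-image indices ranked by cosine similarity.

    seq_embedding: (D,) final sequence-level street-view embedding
    sat_gallery:   (M, D) embeddings of M candidate satellite images
    """
    q = seq_embedding / np.linalg.norm(seq_embedding)
    g = sat_gallery / np.linalg.norm(sat_gallery, axis=1, keepdims=True)
    scores = g @ q                        # (M,) cosine similarities
    return np.argsort(-scores)            # best match first

rng = np.random.default_rng(2)
gallery = rng.normal(size=(5, 16))
query = gallery[3] + 0.01 * rng.normal(size=16)   # near-duplicate of item 3
ranking = retrieve(query, gallery)
print(ranking[0])                         # index 3: the true match ranks first
```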
Preferably, the method for embedding the street-view features extracted by the backbone network as frame-level nodes and generating scene-level high-level node embeddings by aggregation comprises computing X′ = SᵀX; wherein S is the assignment matrix and X is the feature matrix; B denotes the batch size (B sequences are processed at once), C is the number of new aggregation nodes in a layer, and F is the feature dimension of the nodes, which is unchanged during aggregation. Preferably, the method for training the cross-view image-sequence visual positioning model using the street-view sequence images and the corresponding satellite images comprises: performing semantic feature-embedding learning of satellite features and street-view features using the InfoNCE loss function of contrastive learning; and training with the similarity distance between the learned feature vectors and the geographic coordinates as a loss function, so as to constrain and optimize the model. The invention also provides a system for positioning an image sequence across viewing angles, which implements the above method and comprises an acquisition module, a construction module, a training module and a positioning module; the acquisition module is used for acquiring street-view sequence images and corresponding satellite images; the construction module is used for constructing the cross-view image-sequence visual positioning model; the training module is used for training the cross-view image-sequence visual positioning model by using the street view sequence image and the correspondi
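The InfoNCE objective named above can be sketched as follows for a batch of paired street-view and satellite embeddings. The cosine-similarity logits and the temperature value tau=0.07 are conventional contrastive-learning choices assumed here, and the geographic-coordinate consistency term is omitted; neither detail is specified in the text above.

```python
import numpy as np

def info_nce(street, sat, tau=0.07):
    """InfoNCE loss over a batch of paired embeddings.

    street, sat: (B, D) embeddings; row i of each is a matching pair.
    """
    street = street / np.linalg.norm(street, axis=1, keepdims=True)
    sat = sat / np.linalg.norm(sat, axis=1, keepdims=True)
    logits = street @ sat.T / tau          # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal: street i should retrieve satellite i
    return float(-np.mean(np.diag(log_prob)))

rng = np.random.default_rng(1)
z = rng.normal(size=(4, 8))
loss_matched = info_nce(z, z)      # perfectly matched pairs -> low loss
loss_opposed = info_nce(z, -z)     # anti-matched pairs -> high loss
print(loss_matched < loss_opposed)  # True
```

Minimizing this loss pulls each street-view sequence embedding toward its paired satellite embedding while pushing it away from the other satellites in the batch, which is exactly the behaviour the retrieval step relies on.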