CN-121999038-A - Cross-view geographic positioning method and system based on multi-mode data

CN121999038A

Abstract

The invention provides a cross-view geographic positioning method and system based on multi-modal data. The method comprises: preprocessing an optical image and a satellite image, respectively, to obtain a preprocessed optical image and a preprocessed satellite image; extracting features from the preprocessed optical image using the intermediate layers of a first feature extraction network with an introduced normalization layer to obtain optical features; extracting features from the preprocessed satellite image using a second feature extraction network and a feature pyramid network to obtain SAR features; performing weighted feature fusion on the optical features and the SAR features to obtain a fused feature image, which is stored in a feature library; and computing the similarity between an image to be queried and the fused feature images in the feature library to obtain geographic coordinates. The invention can effectively fuse multi-modal data, handle view transformations robustly, and perform geographic positioning efficiently.

Inventors

  • LIANG ZHIKAI
  • ZHAO YUZHE

Assignees

  • Beijing Institute of Special Machinery (北京特种机械研究所)

Dates

Publication Date
2026-05-08
Application Date
2025-12-10

Claims (10)

  1. A cross-view geographic positioning method based on multi-modal data, the method comprising: preprocessing an optical image and a satellite image, respectively, to obtain a preprocessed optical image and a preprocessed satellite image; extracting features from the preprocessed optical image using an intermediate layer of a first feature extraction network with an introduced normalization layer to obtain optical features; extracting features from the preprocessed satellite image using a second feature extraction network and a feature pyramid network to obtain SAR features; performing weighted feature fusion on the optical features and the SAR features to obtain a fused feature image, and storing the fused feature image in a feature library; and computing the similarity between an image to be queried and the fused feature images in the feature library to obtain geographic coordinates.
  2. The cross-view geographic positioning method based on multi-modal data as claimed in claim 1, wherein extracting features from the preprocessed optical image using the intermediate layer of the first feature extraction network with an introduced normalization layer to obtain the optical features comprises: inputting the preprocessed optical image into a ResNet network and extracting intermediate feature maps with resolutions of 56×56, 28×28, and 14×14 at layer 2, layer 3, and layer 4 of the ResNet network, respectively; and introducing a normalization layer after each extracted intermediate feature map, and mapping the features in the oblique view to a standard top-view space using an affine transformation matrix to obtain the optical features.
  3. The cross-view geographic positioning method based on multi-modal data as set forth in claim 2, wherein extracting features from the preprocessed satellite image using the second feature extraction network and the feature pyramid network to obtain the SAR features comprises: inputting the satellite image into a DenseNet network for multi-layer feature extraction to obtain a multi-layer feature map; up-sampling the high-level feature maps in the multi-layer feature map to the same resolution as the bottom-level feature map via bilinear interpolation in a feature pyramid network to obtain aligned feature maps; and adjusting the channels of the aligned feature maps with a 1×1 convolution and fusing them by element-wise summation to obtain the SAR features.
  4. The cross-view geographic positioning method based on multi-modal data as claimed in claim 1, wherein performing weighted feature fusion on the optical features and the SAR features to obtain a fused feature image comprises: letting the optical features be F_O and the SAR features be F_S; projecting F_O and F_S into a query space, a key space, and a value space, respectively: Q_O = W_Q F_O, K_S = W_K F_S, V_S = W_V F_S, where W_Q, W_K, and W_V are learnable projection matrices; computing the attention weight as α = softmax(Q_O K_S^T / √d_k), where α is the attention weight, softmax is the normalization function, and d_k is the dimension of the key vectors; weighting the SAR features by the attention weight, expressed as F_S' = αV_S; and fusing the attention-weighted SAR features with the original optical features to obtain the fused feature image, expressed as F_fused = F_O + λF_S', where F_fused is the fused feature and λ is a learnable fusion weight parameter.
  5. The cross-view geographic positioning method based on multi-modal data as claimed in claim 1, wherein computing the similarity between the image to be queried and the fused feature images in the feature library to obtain the ranked retrieval list and the geographic coordinates comprises: taking an acquired vehicle camera image as the image to be queried and performing feature extraction to obtain a query feature image; computing the similarity between the query feature image and all fused feature images in the feature library; ranking the similarities to obtain the Top-K candidate matching images; extracting geometric correspondences from the candidate image pairs using the RANSAC algorithm and computing a fundamental matrix; and, based on the fundamental matrix and the intrinsic parameters of the vehicle camera, recovering the pose of the vehicle camera by SVD decomposition, and obtaining the geographic coordinates through coordinate transformation in combination with the geographic coordinates of the satellite images in the GPS database.
  6. A cross-view geolocation system based on multi-modal data, the system comprising: a preprocessing module for preprocessing an optical image and a satellite image, respectively, to obtain a preprocessed optical image and a preprocessed satellite image; a first feature extraction module for extracting features from the preprocessed optical image using an intermediate layer of a first feature extraction network with an introduced normalization layer to obtain optical features; a second feature extraction module for extracting features from the preprocessed satellite image using a second feature extraction network and a feature pyramid network to obtain SAR features; a feature fusion module for performing weighted feature fusion on the optical features and the SAR features to obtain a fused feature image and storing the fused feature image in a feature library; and a coordinate acquisition module for computing the similarity between an image to be queried and the fused feature images in the feature library to obtain geographic coordinates.
  7. The cross-view geolocation system based on multi-modal data of claim 6, wherein the first feature extraction module is specifically configured to: input the preprocessed optical image into a ResNet network and extract intermediate feature maps with resolutions of 56×56, 28×28, and 14×14 at layer 2, layer 3, and layer 4 of the ResNet network, respectively; and introduce a normalization layer after each extracted intermediate feature map, and map the features in the oblique view to a standard top-view space using an affine transformation matrix to obtain the optical features.
  8. The cross-view geolocation system based on multi-modal data of claim 6, wherein the second feature extraction module is specifically configured to: input the satellite image into a DenseNet network for multi-layer feature extraction to obtain a multi-layer feature map; up-sample the high-level feature maps in the multi-layer feature map to the same resolution as the bottom-level feature map via bilinear interpolation in a feature pyramid network to obtain aligned feature maps; and adjust the channels of the aligned feature maps with a 1×1 convolution and fuse them by element-wise summation to obtain the SAR features.
  9. The cross-view geographic positioning system based on multi-modal data of claim 6, wherein the feature fusion module is specifically configured to: let the optical features be F_O and the SAR features be F_S; project F_O and F_S into a query space, a key space, and a value space, respectively: Q_O = W_Q F_O, K_S = W_K F_S, V_S = W_V F_S, where W_Q, W_K, and W_V are learnable projection matrices; compute the attention weight as α = softmax(Q_O K_S^T / √d_k), where α is the attention weight, softmax is the normalization function, and d_k is the dimension of the key vectors; weight the SAR features by the attention weight, expressed as F_S' = αV_S; and fuse the attention-weighted SAR features with the original optical features to obtain the fused feature image, expressed as F_fused = F_O + λF_S', where F_fused is the fused feature and λ is a learnable fusion weight parameter.
  10. The cross-view geographic positioning system based on multi-modal data of claim 6, wherein the coordinate acquisition module is specifically configured to: take an acquired vehicle camera image as the image to be queried and perform feature extraction to obtain a query feature image; compute the similarity between the query feature image and all fused feature images in the feature library; rank the similarities to obtain the Top-K candidate matching images; extract geometric correspondences from the candidate image pairs using the RANSAC algorithm and compute a fundamental matrix; and, based on the fundamental matrix and the intrinsic parameters of the vehicle camera, recover the pose of the vehicle camera by SVD decomposition, and obtain the geographic coordinates through coordinate transformation in combination with the geographic coordinates of the satellite images in the GPS database.
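The feature-pyramid fusion of claims 3 and 8 (bilinear up-sampling of a high-level map to the bottom-level resolution, channel adjustment by 1×1 convolution, element-wise summation) can be sketched in NumPy as follows. This is a minimal illustration, not the patent's implementation; the channel counts, map resolutions, and random weights are assumptions.

```python
import numpy as np

def bilinear_upsample(fmap, out_h, out_w):
    """Bilinearly resize a (C, H, W) feature map to (C, out_h, out_w)."""
    c, h, w = fmap.shape
    ys = np.linspace(0.0, h - 1.0, out_h)
    xs = np.linspace(0.0, w - 1.0, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]  # vertical interpolation weights
    wx = (xs - x0)[None, :]  # horizontal interpolation weights
    top = fmap[:, y0][:, :, x0] * (1 - wx) + fmap[:, y0][:, :, x1] * wx
    bot = fmap[:, y1][:, :, x0] * (1 - wx) + fmap[:, y1][:, :, x1] * wx
    return top * (1 - wy) + bot * wy

def conv1x1(fmap, weight):
    """A 1x1 convolution is a per-pixel channel projection with a (C_out, C_in) weight."""
    return np.einsum('oc,chw->ohw', weight, fmap)

rng = np.random.default_rng(0)
low = rng.standard_normal((64, 56, 56))    # bottom-level feature map (assumed shape)
high = rng.standard_normal((128, 14, 14))  # high-level feature map (assumed shape)
w = rng.standard_normal((64, 128)) * 0.1   # hypothetical 1x1-conv weight

aligned = bilinear_upsample(high, 56, 56)  # align resolutions
fused_sar = low + conv1x1(aligned, w)      # fuse by element-wise summation
```

In a trained network the 1×1-convolution weight would of course be learned rather than random; the sketch only shows how the shapes line up.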
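The weighted fusion of claims 4 and 9 is standard scaled dot-product cross-attention: optical features supply the queries, SAR features supply keys and values. A minimal NumPy sketch, assuming a token layout of (N, d) matrices and illustrative random weights (not values from the patent):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fusion(F_O, F_S, W_Q, W_K, W_V, lam=0.5):
    """Fuse SAR tokens F_S (M, d) into optical tokens F_O (N, d)."""
    Q_O = F_O @ W_Q.T                            # Q_O = W_Q F_O
    K_S = F_S @ W_K.T                            # K_S = W_K F_S
    V_S = F_S @ W_V.T                            # V_S = W_V F_S
    d_k = K_S.shape[-1]
    alpha = softmax(Q_O @ K_S.T / np.sqrt(d_k))  # alpha = softmax(Q_O K_S^T / sqrt(d_k))
    F_S_prime = alpha @ V_S                      # F_S' = alpha V_S
    return F_O + lam * F_S_prime, alpha          # F_fused = F_O + lambda F_S'

rng = np.random.default_rng(1)
d = 32
F_O = rng.standard_normal((10, d))   # optical features (assumed shape)
F_S = rng.standard_normal((12, d))   # SAR features (assumed shape)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
F_fused, alpha = cross_attention_fusion(F_O, F_S, W_Q, W_K, W_V)
```

In the claims, W_Q, W_K, W_V and λ are learnable parameters trained with the rest of the network; here they are fixed random stand-ins so the arithmetic can be followed end to end.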

Description

Cross-view geographic positioning method and system based on multi-mode data

Technical Field

The invention relates to the technical field of cross-view geographic positioning, and in particular to a cross-view geographic positioning method and system based on multi-modal data.

Background

In environments where GPS signals are blocked or otherwise unavailable (e.g., urban canyons, underground facilities, and disaster sites), accurately determining the geographic location of a target is an important problem. Traditional geolocation methods rely primarily on GPS signals and perform poorly in such constrained environments. In recent years, cross-view geographic positioning, which localizes a target by matching images from different platforms and viewing angles, has become an important means of acquiring position information. However, existing cross-view geographic positioning methods still face several challenges. On one hand, the appearance differences between images of different views are pronounced: the same target building differs greatly in viewing angle, scale, and imaging characteristics between a vehicle's front-facing camera view and a satellite image, so traditional pixel-by-pixel matching is ineffective. On the other hand, feature extraction lacks robustness: methods that rely only on single-modality information (such as optical images) fail to exploit the auxiliary information in multi-source data and adapt poorly to variables such as seasonal change, weather conditions, and imaging time.
Meanwhile, an effective multi-modal fusion mechanism is currently lacking. Although the remote sensing field offers rich data such as optical images, SAR data, and text labels, the degree to which this multi-modal information is fused and exploited is low, and existing methods cannot fully realize the complementary advantages of multi-modal data. In addition, there are shortcomings in computational efficiency: existing deep learning methods have high computational complexity and struggle to meet real-time positioning requirements.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a cross-view geographic positioning method and system based on multi-modal data to solve the existing technical problems. To this end, the invention provides the following technical solutions. In a first aspect, the present invention provides a cross-view geographic positioning method based on multi-modal data, the method comprising: preprocessing an optical image and a satellite image, respectively, to obtain a preprocessed optical image and a preprocessed satellite image; extracting features from the preprocessed optical image using an intermediate layer of a first feature extraction network with an introduced normalization layer to obtain optical features; extracting features from the preprocessed satellite image using a second feature extraction network and a feature pyramid network to obtain SAR features; performing weighted feature fusion on the optical features and the SAR features to obtain a fused feature image, and storing the fused feature image in a feature library; and computing the similarity between an image to be queried and the fused feature images in the feature library to obtain geographic coordinates.
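The retrieval step of the method above (similarity between a query feature and every fused feature in the library, followed by Top-K ranking) can be sketched as a cosine-similarity search in NumPy. The descriptor dimensionality, library size, and use of cosine similarity are illustrative assumptions; the patent does not fix a particular similarity measure here.

```python
import numpy as np

def top_k_matches(query_feat, library_feats, k=5):
    """Rank feature-library entries by cosine similarity to the query descriptor."""
    q = query_feat / np.linalg.norm(query_feat)
    lib = library_feats / np.linalg.norm(library_feats, axis=1, keepdims=True)
    sims = lib @ q                 # cosine similarity to every library entry
    order = np.argsort(-sims)[:k]  # indices of the Top-K most similar entries
    return order, sims[order]

rng = np.random.default_rng(0)
library = rng.standard_normal((100, 64))              # fused feature library (assumed)
query = library[42] + 0.01 * rng.standard_normal(64)  # noisy view of entry 42
order, scores = top_k_matches(query, library, k=5)
```

In the full pipeline, the Top-K candidates returned here would then be verified geometrically (RANSAC over correspondences and fundamental-matrix estimation) before the final conversion to geographic coordinates.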
In an embodiment, extracting features from the preprocessed optical image using the intermediate layer of the first feature extraction network with an introduced normalization layer to obtain the optical features comprises: inputting the preprocessed optical image into a ResNet network and extracting intermediate feature maps with resolutions of 56×56, 28×28, and 14×14 at layer 2, layer 3, and layer 4 of the ResNet network, respectively; and introducing a normalization layer after each extracted intermediate feature map, and mapping the features in the oblique view to a standard top-view space using an affine transformation matrix to obtain the optical features. In an embodiment, extracting features from the preprocessed satellite image using the second feature extraction network and the feature pyramid network to obtain the SAR features comprises: inputting the satellite image into a DenseNet network for multi-layer feature extraction to obtain a multi-layer feature map; up-sampling the high-level feature maps in the multi-layer feature map to the same resolution as the bottom-level feature map via bilinear interpolation in a feature pyramid network to obtain aligned feature maps; and adjusting the channels of the aligned feature maps with a 1×1 convolution and fusing them by element-wise summation to obtain the SAR features. In an embodiment, the performing weighted feature fusion on the opti