CN-121999047-A - Visual positioning method and system
Abstract
A visual positioning method and system. An image to be detected, comprising a query image and a reference image, is acquired and input into a pre-trained machine learning model to obtain a position probability distribution map and a direction vector field aligned with it. Candidate peaks are determined based on the position probability value of each grid coordinate in the position probability distribution map; non-maximum suppression (NMS) and sub-pixel refinement are applied to the candidate peaks to obtain a target grid coordinate, which is converted through a mapping relation into a positioning position coordinate. Candidate directions are determined based on the direction vector of each grid coordinate in the direction vector field; the candidate directions are weighted-averaged and normalized to unit length to obtain a direction angle, which is converted into a positioning angle. The method achieves high precision, high robustness, and high efficiency.
Inventors
- Zhang Guowen
- Ying Jiangyong
- Zhang Yilong
- Zhai Guangtao
- Duan Huiyu
- Zheng Yuli
- Jin Jianing
Assignees
- Tianyi Shilian Technology Co., Ltd. (天翼视联科技股份有限公司)
- Shanghai Jiao Tong University (上海交通大学)
Dates
- Publication Date
- 20260508
- Application Date
- 20260129
Claims (10)
- 1. A visual positioning method, comprising: S1, acquiring an image to be detected, wherein the image to be detected comprises a query image and a reference image; S2, inputting the image to be detected into a pre-trained machine learning model to obtain a position probability distribution map and a direction vector field aligned with the position probability distribution map, wherein each grid coordinate in the position probability distribution map corresponds to a position probability value and each grid coordinate corresponds to a direction vector; S3, determining candidate peaks from the position probability distribution map based on the position probability value of each grid coordinate, and performing non-maximum suppression (NMS) and sub-pixel refinement on the plurality of candidate peaks to obtain target grid coordinates, so that the target grid coordinates are converted based on a mapping relation to obtain positioning position coordinates; S4, determining candidate directions from the direction vector field based on the direction vector corresponding to each grid coordinate, performing weighted averaging and unit normalization on the plurality of candidate directions to obtain a direction angle, and converting the direction angle to obtain a positioning angle; S5, determining the visual pose according to the positioning position coordinates and the positioning angle.
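Claim 1 leaves the grid-to-position "mapping relation" of step S3 unspecified; as an illustrative sketch only, assuming a simple affine mapping with hypothetical `origin` and `cell_size` parameters of the reference image grid:

```python
import numpy as np

# Hypothetical sketch: converting a (possibly sub-pixel) target grid
# coordinate from the probability map into a positioning coordinate.
# `origin` and `cell_size` are assumed parameters, not from the patent.
def grid_to_position(grid_xy, origin, cell_size):
    """Affine map from grid coordinates to world position coordinates."""
    return origin + np.asarray(grid_xy, dtype=float) * cell_size

pos = grid_to_position((12.5, 7.25),
                       origin=np.array([100.0, 200.0]),
                       cell_size=0.5)
```

Any projective or geo-referenced transform could stand in for the affine mapping sketched here.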
- 2. The visual positioning method of claim 1, wherein the pre-trained machine learning model is trained by: inputting the query image into a query image decoder to obtain a one-dimensional direction-aware feature vector; inputting the reference image into a reference image decoder to obtain a two-dimensional feature map; fusing the one-dimensional direction-aware feature vector with the two-dimensional feature map to obtain an initial position probability distribution map and an initial direction vector field; and optimizing the initial position probability distribution map and the initial direction vector field based on loss calculation to obtain the position probability distribution map and the direction vector field aligned with it; wherein obtaining the one-dimensional direction-aware feature vector comprises: preprocessing the query image, inputting the preprocessed query image into a convolutional neural network or a translation-equivariant vision Transformer backbone network, and extracting a multi-channel intermediate feature map; dividing the feature map from a Cartesian coordinate system into a plurality of discrete direction intervals, with the center of the preprocessed query image or a preset horizon as reference, and performing regional pooling on the features in each direction interval to obtain a direction feature vector; and inputting the direction feature vector into a group of fully-connected layers and/or sub-networks with attention mechanisms, performing weighted modeling of the responses in different directions to obtain an encoded direction-aware feature vector, and normalizing the encoded direction-aware feature vector to obtain the one-dimensional direction-aware feature vector; wherein inputting the reference image into the reference image decoder to obtain the two-dimensional feature map comprises:
preprocessing the reference image, inputting the preprocessed reference image into a convolutional neural network or a backbone network with a multi-scale feature extraction structure, and extracting local texture and global scene semantic information layer by layer to obtain a multi-scale feature map; upsampling or downsampling the multi-scale feature map and fusing by feature pyramid fusion, pixel-wise weighted summation, and/or channel concatenation to obtain a fused two-dimensional feature map; and compressing or reconstructing the channel dimension of the fused two-dimensional feature map through a 1×1 convolution or linear projection layer to obtain a new feature map, and normalizing the new feature map to obtain the two-dimensional feature map; wherein fusing the one-dimensional direction-aware feature vector with the two-dimensional feature map to obtain the initial position probability distribution map and the initial direction vector field comprises: determining that the dimension of the one-dimensional direction-aware feature vector matches the dimension of the two-dimensional feature map; fusing the one-dimensional direction-aware feature vector with the two-dimensional feature map, inputting the fused feature map into a convolutional layer to output an unnormalized position score map, and normalizing the position score map by a softmax operation to obtain the initial position probability distribution map; taking the fused feature map or the two-dimensional feature map as input, outputting a direction vector at each position through a direction prediction head, the direction vector encoding the camera orientation; and normalizing each direction vector to unit length to obtain the initial direction vector field; wherein optimizing the initial position probability distribution map and the initial direction vector field based on loss calculation to obtain the
position probability distribution map and the direction vector field aligned with the position probability distribution map comprises: labeling the true position of each sample with a ground-truth label y and computing a cross-entropy loss L_CE = -Σ y·log(P), where L_CE denotes the cross-entropy loss, y the ground-truth label, and P the initial position probability distribution map, the cross-entropy loss measuring the difference between the ground-truth labels and the initial position probability distribution map; determining the predicted direction vector d_pred at the true position and computing an L2 loss L_dir = ||d_pred − d_gt||², where L_dir denotes the L2 loss and d_gt the camera orientation, the L2 loss measuring the difference between the predicted direction vector at the true position and the initial direction vector field; and computing a total loss L = λ1·L_CE + λ2·L_dir, where λ1 denotes the balance coefficient of the cross-entropy loss and λ2 the balance coefficient of the L2 loss, and back-propagating the total loss to obtain the position probability distribution map and the direction vector field aligned with the position probability distribution map.
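The loss terms of claim 2 can be sketched numerically. In this illustrative sketch, `lam_ce` and `lam_dir` stand for the balance coefficients of the two losses, and the one-hot cross-entropy form is an assumption consistent with a ground-truth position label:

```python
import numpy as np

# Hedged sketch of claim 2's losses: a cross-entropy loss on the position
# probability map, an L2 loss on the predicted direction at the true
# position, and their weighted total. Names are illustrative.
def total_loss(prob_map, true_idx, pred_dir, true_dir,
               lam_ce=1.0, lam_dir=1.0):
    ce = -np.log(prob_map[true_idx] + 1e-12)   # one-hot cross-entropy
    l2 = np.sum((pred_dir - true_dir) ** 2)    # squared L2 on direction
    return lam_ce * ce + lam_dir * l2

prob = np.array([[0.1, 0.7],
                 [0.1, 0.1]])
# direction exactly correct, so only the position term contributes
loss = total_loss(prob, (0, 1), np.array([1.0, 0.0]), np.array([1.0, 0.0]))
```

In training, this scalar would be back-propagated through the model as the claim describes.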
- 3. The visual positioning method of claim 2, further comprising: driving the machine learning model, based on noise injection and an increased model temperature, to perform G stochastic forward inference passes to obtain G pose prediction results; scoring each pose prediction result with a reward function and computing the average score over the plurality of pose prediction results; determining a target score of each pose prediction result relative to the average score; and increasing the prediction probability of the machine learning model when the target score is positive and decreasing it when the target score is negative.
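The group scoring of claim 3 reduces to subtracting the group-mean reward from each prediction's reward, so that positive targets raise and negative targets lower the prediction probability. A minimal sketch (the reward values are placeholders, and the reward function itself is not specified by the claim):

```python
import numpy as np

# Sketch of claim 3's target scores: each of the G stochastic pose
# predictions is scored by a reward function (assumed external), and the
# target score is the reward minus the group mean.
def target_scores(rewards):
    rewards = np.asarray(rewards, dtype=float)
    return rewards - rewards.mean()

adv = target_scores([1.0, 0.5, 1.5, 1.0])   # group mean is 1.0
```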
- 4. The visual positioning method of claim 1, wherein determining candidate peaks from the position probability distribution map based on the position probability value of each grid coordinate comprises: for each grid coordinate, determining the position probability values of the other grid coordinates within a preset range centered on that grid coordinate; comparing the position probability value of the grid coordinate with those of the other grid coordinates; and, when the position probability value of the grid coordinate is the maximum and exceeds a preset threshold, taking it as a candidate peak, thereby obtaining a plurality of candidate peaks and a plurality of candidate points.
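A minimal sketch of the candidate-peak test in claim 4, assuming a square window of radius `radius` and an illustrative `threshold` (both values are assumptions, not from the patent):

```python
import numpy as np

# Sketch of claim 4: a grid coordinate is a candidate peak when its
# probability is the maximum inside a (2r+1)x(2r+1) window centered on
# it and exceeds a preset threshold.
def candidate_peaks(prob_map, radius=1, threshold=0.1):
    H, W = prob_map.shape
    peaks = []
    for i in range(H):
        for j in range(W):
            v = prob_map[i, j]
            if v <= threshold:
                continue
            window = prob_map[max(0, i - radius):i + radius + 1,
                              max(0, j - radius):j + radius + 1]
            if v >= window.max():   # maximal within its neighbourhood
                peaks.append((i, j))
    return peaks
```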
- 5. The visual positioning method of claim 4, wherein performing non-maximum suppression (NMS) and sub-pixel refinement on the plurality of candidate peaks to obtain the target grid coordinates comprises: judging whether the distance between each pair of candidate points is smaller than a preset pixel spacing, and collecting the candidate points whose spacing is smaller than that pixel spacing; taking the candidate point with the highest candidate peak value among them as a coarse position, and determining the grid coordinates within a preset range of the coarse position; and performing a weighted summation over the position probability values of the grid coordinates within the preset range to obtain a refined continuous coordinate, and determining the target grid coordinates from the refined continuous coordinate.
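The sub-pixel refinement of claim 5 amounts to a probability-weighted centroid over a window around the coarse position. A sketch with the NMS step omitted and an assumed window radius:

```python
import numpy as np

# Sketch of claim 5's refinement: the refined (continuous) coordinate is
# the probability-weighted centroid of the grid coordinates in a small
# window around the coarse peak. `radius` is an assumed parameter.
def refine_peak(prob_map, peak, radius=1):
    i0, j0 = peak
    num = np.zeros(2)
    den = 0.0
    for i in range(i0 - radius, i0 + radius + 1):
        for j in range(j0 - radius, j0 + radius + 1):
            if 0 <= i < prob_map.shape[0] and 0 <= j < prob_map.shape[1]:
                w = prob_map[i, j]
                num += w * np.array([i, j], dtype=float)
                den += w
    return num / den   # continuous, sub-pixel coordinate
```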
- 6. The visual positioning method of claim 5, wherein determining candidate directions from the direction vector field based on the direction vector of each grid coordinate comprises: determining the neighboring grid coordinates of the target grid coordinates, and taking the direction vectors of the neighboring grid coordinates as the candidate directions.
- 7. The visual positioning method according to claim 6, wherein performing weighted averaging and unit normalization on the plurality of candidate directions to obtain the direction angle comprises: performing a weighted average of the plurality of candidate directions to obtain a weighted average direction vector; normalizing the weighted average direction vector to unit length to obtain a processed unit vector; and obtaining the direction angle as θ = atan2(d_y, d_x), where atan2 denotes the four-quadrant arctangent function, d_x the direction vector component of the target grid coordinates on the X-axis, and d_y the direction vector component of the target grid coordinates on the Y-axis.
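The computation in claim 7 can be sketched directly. The weights here are assumed to come from the position probabilities of the neighbouring grid cells, which the claim does not specify:

```python
import numpy as np

# Sketch of claim 7: weighted-average the candidate direction vectors,
# normalise to unit length, and take the four-quadrant arctangent
# theta = atan2(d_y, d_x). Weight values are illustrative.
def direction_angle(dirs, weights):
    d = np.average(np.asarray(dirs, dtype=float), axis=0, weights=weights)
    d = d / np.linalg.norm(d)        # unit normalization
    return np.arctan2(d[1], d[0])    # angle in radians

theta = direction_angle([[1.0, 0.0], [0.0, 1.0]], [1.0, 1.0])
```

For equal weights on the +X and +Y unit vectors, the averaged direction bisects the first quadrant.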
- 8. A visual positioning system, comprising: an acquisition module for acquiring an image to be detected, wherein the image to be detected comprises a query image and a reference image; an obtaining module for inputting the image to be detected into a pre-trained machine learning model to obtain a position probability distribution map and a direction vector field aligned with the position probability distribution map, wherein each grid coordinate in the position probability distribution map corresponds to a position probability value and each grid coordinate corresponds to a direction vector; a first processing module for determining candidate peaks from the position probability distribution map based on the position probability value of each grid coordinate, and performing non-maximum suppression (NMS) and sub-pixel refinement on the plurality of candidate peaks to obtain target grid coordinates, so that the target grid coordinates are converted based on a mapping relation to obtain positioning position coordinates; a second processing module for determining candidate directions from the direction vector field based on the direction vector corresponding to each grid coordinate, performing weighted averaging and unit normalization on the plurality of candidate directions to obtain a direction angle, and converting the direction angle to obtain a positioning angle; and a determining module for determining the visual pose according to the positioning position coordinates and the positioning angle.
- 9. An electronic device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor, wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
- 10. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1 to 7.
Description
Visual positioning method and system Technical Field The invention belongs to the technical field of computers, and in particular relates to a visual positioning method and a visual positioning system. Background In the related art, the mainstream technologies for solving the positioning problem include GNSS and vision-based positioning methods. Cross-view pose estimation is a research hotspot in the field of visual positioning. It generally adopts a Siamese-like network structure with two deep learning branches that extract features from a ground query image and an aerial reference image respectively. The model is then trained with a metric learning loss function (e.g., a contrastive loss) that pulls matching views closer together in feature space than non-matching views. However, the goal of metric learning is merely to bring "matched" images closer in feature space and push "unmatched" images farther apart; under rapid illumination changes, seasonal change, severe occlusion, or extremely large viewpoint differences, such implicit similarity can be very fragile, causing matching to fail, the trained model to be inaccurate, and the visual positioning to be inaccurate in turn. Disclosure of Invention In view of the above shortcomings of the prior art, an object of the present invention is to provide a visual positioning method and system.
The visual positioning method comprises: S1, obtaining an image to be detected, the image to be detected comprising a query image and a reference image; S2, inputting the image to be detected into a pre-trained machine learning model to obtain a position probability distribution map and a direction vector field aligned with the position probability distribution map, wherein each grid coordinate in the position probability distribution map corresponds to a position probability value and each grid coordinate corresponds to a direction vector; S3, determining candidate peaks from the position probability distribution map based on the position probability value of each grid coordinate, performing non-maximum suppression (NMS) and sub-pixel refinement on the plurality of candidate peaks to obtain target grid coordinates, and converting the target grid coordinates based on a mapping relation to obtain positioning position coordinates; S4, determining candidate directions from the direction vector field based on the direction vector corresponding to each grid coordinate, performing weighted averaging and unit normalization on the plurality of candidate directions to obtain a direction angle, and converting the direction angle to obtain a positioning angle; S5, determining the visual pose according to the positioning position coordinates and the positioning angle.
Further, the pre-trained machine learning model is trained by: inputting the query image into a query image decoder to obtain a one-dimensional direction-aware feature vector; inputting the reference image into a reference image decoder to obtain a two-dimensional feature map; fusing the one-dimensional direction-aware feature vector with the two-dimensional feature map to obtain an initial position probability distribution map and an initial direction vector field; and optimizing the initial position probability distribution map and the initial direction vector field based on loss calculation to obtain the position probability distribution map and the direction vector field aligned with it. Inputting the query image into the query image decoder to obtain the one-dimensional direction-aware feature vector comprises: preprocessing the query image, inputting the preprocessed query image into a convolutional neural network or a translation-equivariant vision Transformer backbone network, and extracting a multi-channel intermediate feature map; dividing the feature map from a Cartesian coordinate system into a plurality of discrete direction intervals and performing regional pooling on the features in each interval to obtain a direction feature vector; and inputting the direction feature vector into fully-connected layers and/or attention sub-networks, performing weighted modeling of the responses in different directions, and normalizing the result to obtain the one-dimensional direction-aware feature vector. Inputting the reference image into the reference image decoder to obtain the two-dimensional feature map comprises: preprocessing the reference image, inputting the preprocessed reference image into a convolutional neural network or a backbone network with a multi-scale feature extraction structure, and extracting local texture and global scene semantic information layer by layer to obtain a multi-scale feature map; upsampling or downsampling the multi-scale feature map and fusing by feature pyramid fusion, pixel-wise weighted summation, and/or channel concatenation to obtain a fused two-dimensional feature map; carrying out compression or