CN-122023755-A - Visual positioning method and system combining deep learning model and geometric model

CN122023755A

Abstract

The application relates to the technical field of visual positioning and discloses a visual positioning method and system combining a deep learning model and a geometric model. The method comprises: constructing, according to a historically collected multi-view image sequence and corresponding camera pose information, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information; based on a real-time image acquired at the current moment, extracting a semantic feature map and a geometric feature map of the real-time image, screening out key point positions, and generating corresponding first descriptors in combination with the image structure; analyzing the evolution rule of the motion corresponding to the first descriptors, extracting trend features and instantaneous features of the motion process, fusing the two, and predicting a first pose of a next frame of image; and determining the spatial coordinates corresponding to each key point position and optimizing the spatial coordinates according to the mapping error to obtain a visual positioning result.

Inventors

  • Yang Long
  • Xiong Jiao
  • Zang Sudong
  • Zhu Wenfeng
  • Li Jian
  • Zhang Yongpeng
  • Wang Kangbin

Assignees

  • 上海恒泽辅汇智能科技有限公司
  • 上海恒启智向智能科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-04-15

Claims (10)

  1. A visual positioning method combining a deep learning model and a geometric model, comprising: constructing, according to a historically collected multi-view image sequence and corresponding camera pose information, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information; based on a real-time image acquired at the current moment, extracting a semantic feature map and a geometric feature map of the real-time image through a preset feature extraction model, screening out key point positions, and generating corresponding first descriptors in combination with the image structure; analyzing the evolution rule of the motion corresponding to the first descriptors, extracting trend features and instantaneous features of the motion process, fusing the trend features and the instantaneous features, and predicting a first pose of a next frame of image; and mapping the first pose to the first information field, determining the spatial coordinate corresponding to each key point position, and optimizing the spatial coordinates according to the mapping error to obtain a visual positioning result.
  2. The visual positioning method combining a deep learning model and a geometric model according to claim 1, wherein the constructing, according to a historically collected multi-view image sequence and corresponding camera pose information, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information comprises: calculating a sparse three-dimensional point cloud of the scene according to the historically collected multi-view image sequence and the corresponding camera pose information, determining a spatial boundary of the scene, dividing the scene into voxel grids according to the spatial boundary, and setting a corresponding first information vector at the center of each voxel grid; and constructing, based on the first information vectors, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information.
  3. The visual positioning method combining a deep learning model and a geometric model according to claim 2, wherein the constructing, based on the first information vectors, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information comprises: constructing, based on the first information vectors, a corresponding spatial feature vector by combining the distance between each position query point in space and the center of the voxel grid; analyzing, according to the spatial feature vector and through a preset joint coding model, the geometric density value, semantic category probability and feature descriptor corresponding to each position query point, and constructing a first coding vector; emitting a sampling ray from each pixel, accumulating the first coding vectors through which the sampling ray passes, predicting a second coding vector of the corresponding pixel, and comparing the first coding vectors with the second coding vector to obtain a joint loss value; and optimizing the first coding vectors according to the joint loss value, and constructing the first information field from the optimized first coding vectors.
  4. The visual positioning method combining a deep learning model and a geometric model according to claim 1, wherein the extracting, based on a real-time image acquired at the current moment, a semantic feature map and a geometric feature map of the real-time image through a preset feature extraction model, screening out key point positions, and generating corresponding first descriptors in combination with the image structure comprises: extracting, based on the real-time image acquired at the current moment, a semantic feature map and a geometric feature map of the real-time image through the preset feature extraction model; and analyzing the feature gradient of each pixel by combining the semantic feature map and the geometric feature map, screening out the key point positions, and generating the corresponding first descriptors in combination with the image structure.
  5. The visual positioning method combining a deep learning model and a geometric model according to claim 4, wherein the extracting, based on the real-time image acquired at the current moment, the semantic feature map and the geometric feature map of the real-time image through the preset feature extraction model comprises: respectively inputting the bottom-layer feature map into a semantic analysis branch and a geometric analysis branch, the semantic analysis branch outputting the semantic feature map through deepened network layers, and the geometric analysis branch outputting the geometric feature map through convolution layers that maintain spatial resolution, wherein the feature extraction model comprises the semantic analysis branch and the geometric analysis branch.
  6. The visual positioning method combining a deep learning model and a geometric model according to claim 5, wherein the combining the semantic feature map and the geometric feature map, analyzing the feature gradient of each pixel, screening out key point positions, and generating corresponding first descriptors in combination with the image structure comprises: according to the geometric feature map, calculating, for each pixel and within a preset window centered on that pixel, the gradients of the pixels in the window, and constructing a structure tensor; performing eigen-decomposition on the structure tensor, and taking the eigenvector corresponding to the largest eigenvalue as the gradient main direction of the pixel; according to the semantic feature map, analyzing the confidence probability that each pixel belongs to a static semantic category, and screening out the pixel positions with the highest confidence probability as key point positions; and for each key point position, constructing a rotation matrix according to the corresponding gradient main direction, screening out a feature region on the bottom-layer feature map, centered on the key point and oriented according to the rotation matrix, and generating the first descriptor from the feature region.
  7. The visual positioning method combining a deep learning model and a geometric model according to claim 1, wherein the analyzing the evolution rule of the motion corresponding to the first descriptors, extracting and fusing the trend features and instantaneous features of the motion process, and predicting the first pose of the next frame of image comprises: analyzing the evolution rule of the motion corresponding to the first descriptors, integrating the key point features in each frame of image, and arranging them in time order to obtain a feature sequence; extracting the trend features and instantaneous features of the motion process based on the feature sequence; and fusing the trend features and the instantaneous features, and predicting the first pose of the next frame of image.
  8. The visual positioning method combining a deep learning model and a geometric model according to claim 7, wherein the fusing the trend features and the instantaneous features and predicting the first pose of the next frame of image comprises: calculating the motion dynamics of the scene from adjacent frame images, and determining, based on the motion dynamics, the fusion weight coefficients corresponding to the trend features and the instantaneous features; weighting and summing the trend features and the instantaneous features according to the fusion weight coefficients to obtain a fused feature; and mapping the fused feature to relative motion parameters between adjacent frames, and obtaining the first pose of the next frame of image by combining these parameters with the predicted pose of the current frame image.
  9. The visual positioning method combining a deep learning model and a geometric model according to claim 1, wherein the mapping the first pose to the first information field, determining the spatial coordinates corresponding to each key point position, and optimizing the spatial coordinates according to the mapping error to obtain the visual positioning result comprises: mapping each key point to the first information field according to the first pose of the current frame and the camera parameters, and matching the corresponding feature descriptors to obtain the corresponding spatial coordinates; calculating, based on the first pose, the predicted coordinates of the spatial coordinates projected onto the image plane, and calculating the mapping error between the predicted coordinates and the key point coordinates; and optimizing the spatial coordinates according to the mapping error to obtain the visual positioning result.
  10. A visual positioning system combining a deep learning model and a geometric model, wherein the visual positioning system is used to implement the visual positioning method combining a deep learning model and a geometric model according to any one of claims 1 to 9, the visual positioning system comprising: a first information field construction module, configured to construct, according to a historically collected multi-view image sequence and corresponding camera pose information, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information; a first descriptor calculation module, configured to extract, based on a real-time image acquired at the current moment, a semantic feature map and a geometric feature map of the real-time image through a preset feature extraction model, screen out key point positions, and generate corresponding first descriptors in combination with the image structure; a first pose analysis module, configured to analyze the evolution rule of the motion corresponding to the first descriptors, extract the trend features and instantaneous features of the motion process, fuse them, and predict the first pose of the next frame of image; and a visual positioning module, configured to map the first pose into the first information field, determine the spatial coordinates corresponding to each key point position, and optimize the spatial coordinates according to the mapping error to obtain the visual positioning result.
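The gradient-main-direction step of claim 6 admits a compact numerical sketch. This is a minimal illustration, not the patent's implementation: it assumes the per-pixel window gradients are already available as arrays (the windowing, the semantic screening and the descriptor generation are omitted), and `keypoint_orientation` is a hypothetical helper name.

```python
import numpy as np

def keypoint_orientation(grad_x, grad_y):
    """Estimate a pixel's gradient main direction from its window.

    grad_x, grad_y: 1-D arrays of x/y gradients of the pixels inside
    the preset window centered on the pixel. Returns the eigenvector
    of the structure tensor belonging to its largest eigenvalue.
    """
    # Structure tensor: sum of outer products of the gradient vectors.
    J = np.array([[np.sum(grad_x * grad_x), np.sum(grad_x * grad_y)],
                  [np.sum(grad_x * grad_y), np.sum(grad_y * grad_y)]])
    eigvals, eigvecs = np.linalg.eigh(J)  # eigenvalues in ascending order
    return eigvecs[:, -1]                 # direction of strongest variation

# A window whose gradients all point along x yields a direction of (±1, 0).
direction = keypoint_orientation(np.array([1.0, 2.0, 1.5]), np.zeros(3))
```

In a full pipeline this direction would parameterize the rotation matrix of claim 6, making the descriptor sampling region rotation-aware.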
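The fusion step of claims 7 and 8 can likewise be sketched. The sigmoid weight mapping and the caller-supplied `regressor` below are assumptions standing in for mappings the claims leave unspecified, and the planar (x, y, heading) pose is purely illustrative.

```python
import numpy as np

def fuse_and_predict(trend_feat, instant_feat, motion_dynamics,
                     current_pose, regressor):
    """Fuse trend and instantaneous features with a dynamics-driven
    weight, then predict the next frame's pose (claims 7-8, sketched)."""
    # Hypothetical weight mapping: the more dynamic the scene, the more
    # weight the instantaneous feature receives.
    w = 1.0 / (1.0 + np.exp(-motion_dynamics))
    fused = (1.0 - w) * trend_feat + w * instant_feat
    # 'regressor' stands in for the learned mapping from the fused
    # feature to the relative motion between adjacent frames.
    delta = regressor(fused)
    return current_pose + delta  # compose relative motion with current pose

# Toy check: with a zero regressor the predicted pose is unchanged.
pose = np.array([1.0, 2.0, 0.1])  # (x, y, heading)
pred = fuse_and_predict(np.ones(4), np.zeros(4), 0.0, pose,
                        lambda f: np.zeros(3))
```

Weighting the two feature streams by scene dynamics lets the predictor lean on the smooth trend in near-static scenes and on instantaneous cues when motion is abrupt.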

Description

Visual positioning method and system combining deep learning model and geometric model

Technical Field

The application relates to the technical field of visual positioning, and in particular to a visual positioning method and a visual positioning system combining a deep learning model and a geometric model.

Background

Currently, fields such as augmented reality, autonomous driving and robotic navigation require high-precision visual positioning techniques. Conventional visual localization methods include geometric-model-based methods and deep-learning-based methods. In scenes with weak textures, repeated textures, intense illumination changes or motion blur, traditional feature points are easily lost or mismatched; because only geometric information is used, the semantic content of the scene cannot be understood, and positioning failure easily occurs in dynamic environments. Deep learning methods have a certain robustness to illumination changes and dynamic objects, but their generalization capability is limited, making it difficult to meet the requirements of high-precision applications. The prior art has the following problems: sparse point cloud data stores only three-dimensional coordinates and simple descriptors and cannot encode the geometric density and semantic categories of the scene, so the positioning process lacks semantic constraints; and the SLAM system assumes a static environment, so when a dynamic object exists in the scene it may be added and optimized as a static landmark, causing pose drift. The application provides a visual positioning method and a visual positioning system combining a deep learning model and a geometric model, aiming to solve at least one of the above problems.
Disclosure of Invention

In view of the defects in the prior art, the application aims to provide a visual positioning method and a visual positioning system combining a deep learning model and a geometric model, which can effectively solve the problems noted in the background. The specific technical scheme of the application is as follows: a visual positioning method combining a deep learning model and a geometric model, comprising: constructing, according to a historically collected multi-view image sequence and corresponding camera pose information, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information; based on a real-time image acquired at the current moment, extracting a semantic feature map and a geometric feature map of the real-time image through a preset feature extraction model, screening out key point positions, and generating corresponding first descriptors in combination with the image structure; analyzing the evolution rule of the motion corresponding to the first descriptors, extracting the trend features and instantaneous features of the motion process, fusing them, and predicting a first pose of a next frame of image; and mapping the first pose to the first information field, determining the spatial coordinate corresponding to each key point position, and optimizing the spatial coordinates according to the mapping error to obtain a visual positioning result.
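The final optimization step above minimizes a mapping (reprojection) error. As a minimal sketch of that residual under a standard pinhole camera model, the snippet below projects a matched 3-D point with a predicted pose and measures its pixel distance to the observed key point; the intrinsic matrix `K` and all numbers are illustrative assumptions, and the patent does not fix this exact formulation.

```python
import numpy as np

def mapping_error(point_3d, pose_R, pose_t, K, keypoint_uv):
    """Pixel distance between the projection of a matched 3-D point
    (under the predicted pose) and the observed key point."""
    p_cam = pose_R @ point_3d + pose_t  # world frame -> camera frame
    uvw = K @ p_cam                     # camera frame -> image plane
    uv = uvw[:2] / uvw[2]               # perspective division
    return np.linalg.norm(uv - keypoint_uv)

# Illustrative intrinsics: focal length 500 px, principal point (320, 240).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
# A point on the optical axis projects to the principal point, error 0.
err = mapping_error(np.array([0.0, 0.0, 2.0]), np.eye(3), np.zeros(3),
                    K, np.array([320.0, 240.0]))
```

Summing this residual over all key points gives the objective that the spatial coordinates (and, in many systems, the pose itself) are optimized against.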
Specifically, the constructing, according to the historically collected multi-view image sequence and the corresponding camera pose information, a first information field that simultaneously encodes the geometric density information, semantic category probability information and local feature descriptor information of the scene includes: calculating a sparse three-dimensional point cloud of the scene according to the historically collected multi-view image sequence and the corresponding camera pose information, determining a spatial boundary of the scene, dividing the scene into voxel grids according to the spatial boundary, and setting a corresponding first information vector at the center of each voxel grid; and constructing, based on the first information vectors, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information. Specifically, the constructing, based on the first information vectors, a first information field that simultaneously encodes scene geometric density information, semantic category probability information and local feature descriptor information includes: constructing, based on the first information vectors, a corresponding spatial feature vector by combining the distance between each position query point in space and the center of the voxel grid; analyzing, according to the spatial feature vector and through a preset joint coding model, the geometric density value, semantic category probability and feature descriptor corresponding to the position query point, and constructing a first coding vector; transmitting sampling rays from