CN-121982243-A - Simultaneous localization and mapping method based on monocular event camera

CN 121982243 A

Abstract

The invention discloses a simultaneous localization and mapping method based on a monocular event camera. The method comprises: receiving the event stream of a monocular event camera, and performing time alignment and window division on the event stream to generate an event tensor; reconstructing an image from the event tensor to generate a reconstructed image; representing the three-dimensional scene with a differentiable explicit scene representation method, and projecting and rendering the three-dimensional scene in combination with the current camera pose to generate a predicted image; optimizing the current camera pose by minimizing the difference between the current reconstructed image and the predicted image at the current view angle; within a sliding window containing a plurality of key frames, optimizing the parameters of the three-dimensional scene over multiple rounds by minimizing a basic mapping loss function, the basic mapping loss function comprising the difference between the reconstructed image of each key frame and the predicted image at the corresponding view angle; and outputting a camera trajectory and a three-dimensional map. The invention realizes online simultaneous localization and mapping using only a monocular event camera.
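The time-alignment and window-division step summarized above can be illustrated with a minimal sketch of accumulating an event window into a binned tensor. The bin layout, array shapes, and polarity convention here are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def events_to_tensor(events, t_start, t_end, num_bins, height, width):
    """Accumulate one event window into a (num_bins, H, W) tensor.

    `events` is an (N, 4) array of (x, y, t, polarity) rows with polarity
    in {-1, +1}. Timestamps in [t_start, t_end) are linearly assigned to
    temporal bins (an illustrative choice, not the patent's exact scheme).
    """
    tensor = np.zeros((num_bins, height, width), dtype=np.float32)
    if len(events) == 0:
        return tensor
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    t = events[:, 2]
    p = events[:, 3]
    # Time alignment: normalize timestamps into bin indices [0, num_bins)
    b = ((t - t_start) / (t_end - t_start) * num_bins).astype(int)
    b = np.clip(b, 0, num_bins - 1)
    # Sum signed polarities per (bin, pixel); ufunc.at handles repeats
    np.add.at(tensor, (b, y, x), p)
    return tensor
```

Each window of the stream produces one such tensor, which the downstream reconstruction network consumes.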

Inventors

  • TAN JUNBO
  • DUAN HONGBO
  • CHEN XU
  • WANG XUEQIAN

Assignees

  • Tsinghua Shenzhen International Graduate School (清华大学深圳国际研究生院)

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. The simultaneous localization and mapping method based on a monocular event camera is characterized by comprising the following steps: S1, receiving an event stream of the monocular event camera, and performing time alignment and window division on the event stream to generate an event tensor; S2, reconstructing an image from the event tensor to generate a reconstructed image; S3, representing the three-dimensional scene by a differentiable explicit scene representation method, and projecting and rendering the three-dimensional scene in combination with the current camera pose to generate a predicted image; S4, optimizing the current camera pose by minimizing the difference between the current reconstructed image and the predicted image at the current view angle; S5, in a sliding window containing a plurality of key frames, optimizing the parameters of the three-dimensional scene over multiple rounds by minimizing a basic mapping loss function, the basic mapping loss function comprising the difference between the reconstructed image of each key frame and the predicted image at the corresponding view angle; and S6, outputting a camera trajectory based on the optimized current camera pose, and outputting a three-dimensional map based on the optimized parameters of the three-dimensional scene.
  2. The simultaneous localization and mapping method of claim 1, wherein step S2 comprises performing generative event reconstruction on the event tensor to produce the reconstructed image, the generative event reconstruction specifically comprising: S21, processing the event tensor with a convolutional neural network to reconstruct an initial pseudo-grayscale image; S22, refining the initial pseudo-grayscale image with a diffusion model to generate a pseudo-grayscale image, the pseudo-grayscale image being the reconstructed image.
  3. The simultaneous localization and mapping method of claim 1, wherein representing the three-dimensional scene with the differentiable explicit scene representation method in step S3 specifically comprises parameterizing the three-dimensional scene with a three-dimensional Gaussian scene representation based on 3D Gaussian Splatting, the representation comprising a plurality of Gaussian units, each Gaussian unit comprising a position parameter describing spatial position, a shape parameter describing geometry, and an appearance parameter describing visual appearance.
  4. The simultaneous localization and mapping method of claim 1, wherein projecting and rendering the three-dimensional scene in combination with the current camera pose to generate the predicted image in step S3 comprises projecting the three-dimensional scene onto a two-dimensional image plane according to the given current camera pose and rendering the predicted image.
  5. The simultaneous localization and mapping method of claim 1, wherein step S4 specifically comprises: calculating a photometric consistency loss and a photovoltage contrast loss from the difference between the current reconstructed image and the predicted image at the current view angle; forming a combined loss function as a weighted combination of the photometric consistency loss and the photovoltage contrast loss; and optimizing the current camera pose by minimizing the combined loss function.
  6. The simultaneous localization and mapping method of claim 5, wherein calculating the photometric consistency loss from the difference between the current reconstructed image and the predicted image at the current view angle comprises calculating the photometric consistency loss from the difference in photometric intensity between the current reconstructed image and the predicted image at the current view angle.
  7. The simultaneous localization and mapping method of claim 5, wherein calculating the photovoltage contrast loss from the difference between the current reconstructed image and the predicted image at the current view angle comprises: mapping the image intensities of the current reconstructed image and the predicted image at the current view angle to the photovoltage domain, respectively; constructing a reference event map from the photovoltage change between two consecutive reconstructed images, and constructing a simulated event map from the photovoltage change between two consecutive predicted images; and calculating the photovoltage contrast loss from the difference between the simulated event map and the reference event map in the logarithmic brightness change domain.
  8. The simultaneous localization and mapping method of claim 1, wherein step S5 comprises, within a sliding window containing a plurality of key frames, optimizing the parameters of the three-dimensional scene by minimizing a basic mapping loss function, the basic mapping loss function comprising a photometric consistency loss calculated from the difference between the reconstructed image of each key frame and the predicted image at the corresponding view angle, and an isotropy regularization term.
  9. The simultaneous localization and mapping method of claim 1, further comprising: generating at least one virtual camera pose near the current camera motion trajectory by pose extrapolation; projecting and rendering the three-dimensional scene in combination with the virtual camera pose to generate a virtual rendered image; denoising the virtual rendered image to generate a pseudo-observation image; and progressively optimizing the parameters of the three-dimensional scene using a loss between the pseudo-observation image and the virtual rendered image.
  10. A computer-readable storage medium storing a computer program, wherein the computer program is configured to be run by a processor to perform the monocular-event-camera-based simultaneous localization and mapping method of any one of claims 1 to 9.
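The combined tracking loss of claims 5-7 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the loss weights, the event threshold, the L1 distance, and the hard sign-threshold event maps are all assumptions; "photovoltage domain" is taken to mean the logarithmic-brightness domain of the event camera:

```python
import numpy as np

def log_brightness(img, eps=1e-3):
    # Map image intensity to the logarithmic-brightness ("photovoltage") domain.
    return np.log(img + eps)

def event_map(prev_img, curr_img, threshold=0.1):
    # Signed event map: +/-1 where the log-brightness change between two
    # consecutive frames exceeds the (illustrative) contrast threshold.
    d = log_brightness(curr_img) - log_brightness(prev_img)
    return np.sign(d) * (np.abs(d) >= threshold)

def tracking_loss(recon_prev, recon_curr, pred_prev, pred_curr,
                  w_photo=1.0, w_event=0.5):
    """Weighted combination of photometric consistency loss and
    photovoltage contrast loss (weights are illustrative)."""
    # Photometric consistency: intensity difference at the current view.
    photo = np.mean(np.abs(recon_curr - pred_curr))
    # Photovoltage contrast: reference vs. simulated event maps compared
    # in the logarithmic brightness change domain.
    ref = event_map(recon_prev, recon_curr)   # from reconstructed images
    sim = event_map(pred_prev, pred_curr)     # from predicted (rendered) images
    contrast = np.mean(np.abs(sim - ref))
    return w_photo * photo + w_event * contrast
```

In step S4 the camera pose would be updated by minimizing this combined loss with respect to the pose through the differentiable renderer.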

Description

Simultaneous localization and mapping method based on a monocular event camera

Technical Field

The invention relates to the technical field of computer graphics, and in particular to a simultaneous localization and mapping method based on a monocular event camera.

Background

Simultaneous localization and mapping (SLAM) technology is key to enabling a mobile platform to navigate autonomously in an unknown environment. A new class of explicit, differentiable scene representations, represented by 3D Gaussian Splatting (3DGS), has been introduced into the visual SLAM framework to form 3DGS-SLAM, owing to its ability to achieve high-quality, real-time scene rendering. Existing 3DGS-SLAM schemes are mainly based on traditional frame cameras (RGB or RGB-D cameras). However, under high-speed motion, severe illumination variation, or extremely high dynamic range, a frame camera may produce serious motion blur, overexposure, or underexposure, causing photometric-consistency constraints to fail and, in turn, significant degradation in the pose tracking and mapping performance of the SLAM system. Event cameras are a new type of bio-inspired vision sensor that asynchronously outputs brightness-change events at the pixel level, with microsecond temporal resolution, high dynamic range (HDR), low latency, and low power consumption, and are therefore well suited to address the challenges described above. Some prior art utilizes event cameras, such as event-based visual odometry (e.g., EVO, ESVO) and event-to-image reconstruction methods (e.g., E2VID, FireNet).
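The background's claim that event cameras asynchronously output brightness-change events can be made concrete with the standard single-pixel event-generation model. This sketch is textbook background, not part of the patent; the contrast threshold value is an illustrative assumption:

```python
def emit_events(log_intensities, contrast_threshold=0.2):
    """Standard event-generation model for one pixel: an event of polarity
    +1 or -1 fires each time the log intensity moves by the contrast
    threshold from the reference level set at the last event."""
    events = []
    ref = log_intensities[0]
    for t, L in enumerate(log_intensities[1:], start=1):
        # A large brightness change emits several events, which is why
        # event streams preserve detail under HDR and fast motion.
        while L - ref >= contrast_threshold:
            ref += contrast_threshold
            events.append((t, +1))
        while ref - L >= contrast_threshold:
            ref -= contrast_threshold
            events.append((t, -1))
    return events
```

Because events encode log-intensity changes rather than absolute intensity, the sensor neither saturates in bright regions nor blurs under fast motion, which is the property the claimed method exploits.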
However, event-based visual odometry typically outputs only the camera trajectory and does not build a dense map, while event-to-image reconstruction is typically decoupled from the SLAM process: the reconstructed images contain noise and artifacts, and the methods are mostly run offline, requiring a known or externally provided camera pose. The prior art therefore lacks a SLAM solution that can accomplish both highly robust pose estimation and high-quality dense map construction online (i.e., in real time or near real time) relying on a monocular event camera alone. How to combine the advantages of the event camera with the capability of 3DGS-SLAM, and to solve problems of event-data reconstruction quality, pose-tracking robustness, and map-optimization completeness, has become a technical problem to be solved in the field.

The foregoing background is provided only to facilitate an understanding of the principles and concepts of the application; it is not necessarily prior art to the present application, and it is not to be taken as an admission that the foregoing content was publicly disclosed before the filing date of this application.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a simultaneous localization and mapping method based on a monocular event camera, which realizes online simultaneous localization and mapping using only the monocular event camera while outputting a high-precision camera trajectory and a high-quality three-dimensional map.
In order to achieve the above purpose, the present invention adopts the following technical scheme. In a first aspect, the invention discloses a simultaneous localization and mapping method based on a monocular event camera, comprising the following steps: S1, receiving an event stream of a monocular event camera, and performing time alignment and window division on the event stream to generate an event tensor; S2, reconstructing an image from the event tensor to generate a reconstructed image; S3, representing the three-dimensional scene by a differentiable explicit scene representation method, and projecting and rendering the three-dimensional scene in combination with the current camera pose to generate a predicted image; S4, optimizing the current camera pose by minimizing the difference between the current reconstructed image and the predicted image at the current view angle; S5, in a sliding window containing a plurality of key frames, optimizing the parameters of the three-dimensional scene over multiple rounds by minimizing a basic mapping loss function, the basic mapping loss function comprising the difference between the reconstructed image of each key frame and the predicted image at the corresponding view angle; and S6, outputting a camera trajectory based on the optimized current camera pose, and outputting a three-dimensional map based on the optimized parameters of the three-dimensional scene. Preferably, the step S2 comprises the step of carrying out the generated event reconstruction on th