
CN-121982256-A - 3DGS-based sparse view pose-free scene reconstruction method and system

CN121982256A

Abstract

A 3DGS-based sparse view pose-free scene reconstruction method and system belong to the fields of computer vision and three-dimensional reconstruction, and address the problems of dependence on accurate camera pose priors, frequent failure under sparse views or weakly textured environments, and poor geometric consistency. The method constructs a dual-stream perception module containing a semantic stream and a geometric stream, extracting semantic features with a visual foundation model and explicit geometric correspondences with a dense feature matching network, so that relative camera poses are regressed without any pose prior; it introduces a geometry-guided depth refinement module combining a latent diffusion model framework with Plücker ray encoding, eliminating the scale ambiguity of monocular depth estimation through a depth residual prediction mechanism; and it performs end-to-end optimization with a self-supervised loss function comprising rendering consistency, reprojection, and epipolar geometry constraints, built on differentiable Gaussian rasterization. The invention achieves high-fidelity three-dimensional reconstruction without extrinsic ground-truth supervision and improves the geometric stability and rendering quality of complex scenes.

Inventors

  • GONG DAOXIONG
  • NIE XU

Assignees

  • 北京工业大学

Dates

Publication Date
2026-05-05
Application Date
2026-01-28

Claims (10)

  1. A 3DGS-based sparse view pose-free scene reconstruction method, characterized by comprising the following steps: S1, acquiring a sparse multi-view two-dimensional image sequence of the scene to be reconstructed, wherein the image sequence comprises context views and a reference view with unknown camera extrinsics, and preprocessing the images to unify resolution and format; S2, constructing a dual-stream perception module comprising a semantic stream and a geometric stream, extracting pixel-level semantic features of the context views with a pre-trained visual foundation model, and extracting explicit geometric correspondences between the context views and the target view with a pixel-level dense feature matching network; S3, based on the explicit geometric correspondences, predicting the relative camera pose of each context view with respect to the reference view through a lightweight pose regression network, the relative pose comprising a rotation matrix and a translation vector, with auxiliary correction from the semantic features; S4, constructing a geometry-guided depth optimization module from the pixel-level semantic features and a predicted initial depth map, wherein the module is based on a U-Net depth refinement network, injects the predicted relative camera pose and multi-view features as conditions, and eliminates the scale ambiguity of monocular depth estimation through a residual prediction mechanism applied to the depth map; S5, based on the refined depth map and semantic features, regressing the attribute parameters of three-dimensional Gaussian primitives through a Gaussian generation head network, the attribute parameters comprising a three-dimensional center position, an anisotropic covariance matrix, color coefficients, and opacity, thereby constructing a dense three-dimensional Gaussian field representation of the scene; S6, projecting the three-dimensional Gaussian field onto the target view plane with differentiable Gaussian rasterization to generate a rendered image, computing a self-supervised loss function from the difference between the rendered image and the real target image, and jointly optimizing the network parameters through back propagation to achieve end-to-end pose-free scene reconstruction; wherein the self-supervised loss function comprises a rendering consistency loss, a reprojection loss, and an epipolar geometry constraint loss.
  2. The method according to claim 1, wherein the implementation of the dual-stream perception module in step S2 comprises: in the semantic stream, a pre-trained ViT (Vision Transformer) architecture is adopted as the encoder to extract high-dimensional semantic features of each image, and the feature representations are updated through cross-view attention layers that exchange information among the context views using softmax-normalized dot-product attention, so as to infer the implicit three-dimensional geometric relationship (a cross-view attention sketch follows the claims); in the geometric stream, a LoFTR (Local Feature Transformer) network is used as the backbone, dense correspondences between the context view and the target view are established on the coarse-granularity feature map, and the top-N explicit matching point pairs are selected according to matching confidence to form a geometric matching set for subsequent pose estimation.
  3. The method according to claim 2, wherein the prediction of the relative camera pose in step S3 specifically comprises: flattening the geometric matching set into a vector and feeding it, together with the global feature descriptor of the semantic stream, into a pose regressor formed by a multi-layer perceptron; outputting the rotation parameters as a quaternion q and a translation vector t; normalizing the quaternion and converting it into a rotation matrix R, which is combined with the translation vector into a transformation matrix T = [R | t] describing the rigid-body transformation from the context view coordinate system to the target view coordinate system, thereby replacing traditional implicit pose regression (see the pose-regressor sketch after the claims).
  4. The method according to claim 1, wherein the implementation of the geometry-guided depth optimization module in step S4 comprises: constructing a depth refinement network with a U-Net architecture based on LDM optimization, whose inputs are the initial depth map, the fused semantic features, and ray features encoded in Plücker coordinates; the ray features are computed from the pixel coordinates, the predicted camera intrinsics, and the relative pose, yielding for each imaging ray a direction vector d and a moment vector m (see the Plücker-encoding sketch after the claims); the depth residual prediction mechanism is executed in the latent feature space, exploiting the network's contextual understanding of multi-modal features to directly regress a depth residual term that is added to the initial depth map, thereby correcting local geometric errors and global scale inconsistencies of the initial depth map; in this process the ray features are injected into the feature decoding stage of the latent-diffusion-model U-Net as an explicit three-dimensional geometric prior, forcing the output refined depth map to satisfy multi-view geometric consistency constraints.
  5. The method according to claim 1, wherein the self-supervised loss function in step S6 is defined as L = λ_rc · L_rc + λ_rp · L_rp + λ_ep · L_ep, wherein L_rc is the rendering consistency loss, a weighted combination of an L2 pixel loss and an LPIPS perceptual loss, constraining the visual similarity between the rendered image and the real image; L_rp is the bidirectional reprojection loss, in which the pixels of the target view are back-projected onto the context views according to the predicted depth and pose, and the photometric error constrains the consistency of the geometric structure; L_ep is the epipolar geometry constraint loss, which reduces scale drift in pose estimation by constructing a fundamental matrix; and λ_rc, λ_rp, λ_ep are adjustable weight coefficients for the respective loss terms (a loss sketch follows the claims).
  6. The method according to claim 1, wherein the calculation of the center positions of the three-dimensional Gaussian primitives in step S5 specifically comprises: converting pixel coordinates into normalized image-plane coordinates with the inverse of the camera intrinsic matrix, extending them into three-dimensional points in the camera coordinate system according to the refined depth values, and transforming these points into the target coordinate system with the predicted relative pose matrix to obtain the center positions of the three-dimensional Gaussian primitives (an unprojection sketch follows the claims); meanwhile, the Gaussian generation head network regresses and outputs a covariance matrix, color coefficients, and opacity; the center positions and the attribute parameters together form the three-dimensional Gaussian representation used to construct the scene to be reconstructed.
  7. A system for implementing the method of claim 1, comprising: a data acquisition and preprocessing module for acquiring and preprocessing a sparse multi-view image sequence of the scene to be reconstructed; a dual-stream feature extraction module comprising a semantic feature extraction unit and a geometric matching extraction unit, used respectively to extract DINOv semantic features and LoFTR explicit geometric correspondences from the images; a pose estimation module for regressing the relative camera poses between views based on the explicit geometric correspondences and the semantic features; a geometry-guided depth optimization module integrating the LDM U-Net and the Plücker ray encoder, used to apply geometry-guided fine correction to the initial depth map based on the predicted poses and semantic features; a three-dimensional Gaussian point cloud generation module for regressing the spatial and attribute parameters of the three-dimensional Gaussian primitives from the corrected depth map and image features, and aggregating them into a Gaussian point cloud model expressing the complete three-dimensional scene; and a result acquisition module for obtaining the final three-dimensional scene model or synthesizing novel view images from the generated three-dimensional Gaussian primitives.
  8. The system of claim 7, further comprising a standardized data interface module for automatically initializing pinhole camera model parameters when no camera calibration parameters are available, and for converting the predicted pose sequences and reconstructed point cloud data into a standard triplet of data files, including cameras.bin, images.txt, and points3D.ply, for compatibility with subsequent three-dimensional visualization or editing tools (an export sketch follows the claims).
  9. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method according to any one of claims 1 to 6.
  10. An electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any one of claims 1 to 6 when executing the program.
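
Illustrative sketches (non-limiting)

The sketches below illustrate the main computations named in the claims. They are minimal PyTorch examples written under stated assumptions, not the patented implementation; all module names, layer widths, and feature dimensions are hypothetical choices for illustration.

The first sketch shows the softmax-normalized dot-product cross-view attention of the semantic stream in claim 2, where tokens of one context view query tokens of another to update the feature representation; the 384-dimensional token size is an assumption.

```python
import torch
import torch.nn as nn

class CrossViewAttention(nn.Module):
    """Softmax-normalized dot-product attention between two views (claim 2 sketch).

    Tokens of view A query tokens of view B so that cross-view information
    flows into the semantic features; the dimension is illustrative only.
    """
    def __init__(self, dim: int = 384):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # attention weights: softmax-normalized scaled dot products
        attn = torch.softmax(
            self.q(tokens_a) @ self.k(tokens_b).transpose(-2, -1) * self.scale,
            dim=-1,
        )
        return tokens_a + attn @ self.v(tokens_b)  # residual feature update
```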
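
Next, a sketch of the pose regressor of claim 3: the flattened top-N matches and a global semantic descriptor pass through an MLP that outputs a quaternion and a translation vector; the quaternion is normalized, converted to a rotation matrix, and assembled into a 4x4 rigid transform. The match count N and layer widths are assumptions.

```python
import torch
import torch.nn as nn

def quaternion_to_rotation_matrix(q: torch.Tensor) -> torch.Tensor:
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    q = q / q.norm(dim=-1, keepdim=True)  # normalization step of claim 3
    w, x, y, z = q.unbind(dim=-1)
    return torch.stack([
        1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y),
        2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x),
        2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y),
    ], dim=-1).reshape(*q.shape[:-1], 3, 3)

class PoseRegressor(nn.Module):
    """MLP mapping flattened matches + a semantic descriptor to [R | t] (claim 3 sketch)."""
    def __init__(self, n_matches: int = 256, sem_dim: int = 384):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(n_matches * 4 + sem_dim, 512), nn.ReLU(),
            nn.Linear(512, 256), nn.ReLU(),
            nn.Linear(256, 7),  # 4 quaternion + 3 translation parameters
        )

    def forward(self, matches: torch.Tensor, sem_desc: torch.Tensor) -> torch.Tensor:
        # matches: (B, N, 4) pixel-pair coordinates; sem_desc: (B, sem_dim)
        out = self.mlp(torch.cat([matches.flatten(1), sem_desc], dim=1))
        q, t = out[:, :4], out[:, 4:]
        R = quaternion_to_rotation_matrix(q)
        T = torch.eye(4, device=out.device).repeat(out.shape[0], 1, 1)
        T[:, :3, :3], T[:, :3, 3] = R, t  # rigid transform context -> target
        return T
```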
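
The ray features of claim 4 are Plücker coordinates: per pixel, a direction vector d and a moment vector m = o × d, where o is the camera center. A sketch, assuming a 3x3 intrinsic matrix and a 4x4 camera-to-reference transform as inputs:

```python
import torch
import torch.nn.functional as F

def plucker_ray_features(K: torch.Tensor, T: torch.Tensor,
                         H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker coordinates of the imaging rays (claim 4 sketch).

    K: 3x3 intrinsics; T: 4x4 camera-to-reference transform.
    Returns an (H, W, 6) map: ray direction d and moment vector m = o x d.
    """
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)  # (H, W, 3)
    d_cam = pix @ torch.linalg.inv(K).T          # back-project pixel coordinates
    R, t = T[:3, :3], T[:3, 3]
    d = F.normalize(d_cam @ R.T, dim=-1)         # ray directions in the reference frame
    o = t.expand_as(d)                            # camera center = common ray origin
    m = torch.cross(o, d, dim=-1)                 # moment vector m = o x d
    return torch.cat([d, m], dim=-1)              # (H, W, 6) ray feature map
```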
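
Claim 5's self-supervised objective combines rendering consistency (L2 + LPIPS), bidirectional reprojection, and an epipolar term. A sketch, assuming the warped context views and matched point sets are already computed; lpips_fn stands in for any perceptual-loss callable, and the weight values are placeholders:

```python
import torch
import torch.nn.functional as F

def self_supervised_loss(rendered, target, warped_ctx_a, warped_ctx_b,
                         lpips_fn, F_mat, pts_src, pts_tgt,
                         w_rc=1.0, w_rp=0.5, w_ep=0.1):
    """L = w_rc*L_rc + w_rp*L_rp + w_ep*L_ep (claim 5 sketch; weights are placeholders).

    rendered/target: (B, 3, H, W) images; warped_ctx_*: context views warped
    into the target frame via predicted depth and pose; pts_src/pts_tgt:
    (N, 3) homogeneous matched points; F_mat: 3x3 fundamental matrix.
    """
    # rendering consistency: L2 pixel term plus LPIPS perceptual term
    l_rc = F.mse_loss(rendered, target) + lpips_fn(rendered, target).mean()
    # bidirectional reprojection: photometric error of the warped context views
    l_rp = F.l1_loss(warped_ctx_a, target) + F.l1_loss(warped_ctx_b, target)
    # epipolar constraint: x_tgt^T F x_src should vanish under the correct pose
    l_ep = (pts_tgt.unsqueeze(1) @ F_mat @ pts_src.unsqueeze(-1)).abs().mean()
    return w_rc * l_rc + w_rp * l_rp + w_ep * l_ep
```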
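
Claim 6 lifts each pixel to a Gaussian center: normalized image-plane coordinates via the inverse intrinsic matrix, scaled by the refined depth, then rigidly transformed into the target frame. A minimal sketch:

```python
import torch

def gaussian_centers(depth: torch.Tensor, K: torch.Tensor,
                     T: torch.Tensor) -> torch.Tensor:
    """Lift every pixel to a 3D Gaussian center (claim 6 sketch).

    depth: (H, W) refined depth map; K: 3x3 intrinsics; T: 4x4 relative pose.
    """
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u + 0.5, v + 0.5, torch.ones_like(u)], dim=-1)
    # normalized image-plane coordinates via the inverse intrinsic matrix
    rays = pix @ torch.linalg.inv(K).T
    pts_cam = rays * depth.unsqueeze(-1)          # scale by the refined depth
    # rigid transform into the target coordinate system
    pts = pts_cam @ T[:3, :3].T + T[:3, 3]
    return pts.reshape(-1, 3)                     # (H*W, 3) Gaussian centers
```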
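
Finally, a sketch of the points3D.ply export mentioned in claim 8 (the file-name triplet cameras.bin / images.txt / points3D.ply is reconstructed from the garbled original); only the point-cloud writer is shown:

```python
import numpy as np

def export_points3d_ply(path: str, xyz: np.ndarray, rgb: np.ndarray) -> None:
    """Write Gaussian centers and colors as an ASCII points3D.ply (claim 8 sketch).

    xyz: (N, 3) float positions; rgb: (N, 3) colors in [0, 255].
    """
    header = "\n".join([
        "ply", "format ascii 1.0", f"element vertex {len(xyz)}",
        "property float x", "property float y", "property float z",
        "property uchar red", "property uchar green", "property uchar blue",
        "end_header",
    ])
    rows = [f"{x} {y} {z} {r} {g} {b}"
            for (x, y, z), (r, g, b) in zip(xyz, rgb.astype(np.uint8))]
    with open(path, "w") as f:
        f.write(header + "\n" + "\n".join(rows) + "\n")
```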

Description

3DGS-based sparse view pose-free scene reconstruction method and system

Technical Field

The invention belongs to the technical fields of computer vision, deep learning, and three-dimensional reconstruction, and particularly relates to a 3DGS (3D Gaussian Splatting)-based sparse view pose-free scene reconstruction method and system. The method is particularly suitable for achieving high-quality three-dimensional scene reconstruction and novel view synthesis, without access to accurate camera poses, in complex scenes with sparse views and large baseline differences, by combining implicit neural representation with explicit geometric constraints.

Background

Three-dimensional scene reconstruction is a fundamental task in computer vision, augmented reality, and autonomous robot navigation, and how to represent and render three-dimensional scenes efficiently and at high quality has always been at the core of research in this field. In recent years, implicit representation methods exemplified by neural radiance fields (NeRF) have achieved photorealistic rendering quality, but their high training and inference costs limit practical application. 3D Gaussian Splatting (3DGS), an emerging explicit scene representation, models the scene with anisotropic Gaussian ellipsoids, preserving extremely high rendering fidelity while achieving real-time rendering speed through an efficient rasterization pipeline. Combining high quality with high efficiency, it is rapidly becoming the mainstream technical route for three-dimensional scene reconstruction and novel view synthesis.

However, while 3DGS performs well in controlled environments, its standard training procedure has a very high entry threshold: it depends heavily on accurate camera extrinsics as prior input. In normal operation, these camera poses must be pre-computed from the image sequence by a structure-from-motion (SfM) or simultaneous localization and mapping (SLAM) algorithm. In real open-world scenes, however, conventional SfM algorithms often suffer feature-matching failures or convergence difficulties when facing weakly textured regions, sparse shooting viewpoints, or complex dynamic environments, so that accurate initial poses cannot be obtained and the subsequent 3DGS reconstruction is directly blocked.

To remove the dependence on pre-computed poses, researchers have tried a variety of solutions for three-dimensional reconstruction under unknown pose conditions. The prior art mainly follows two paths. One class of methods attempts to jointly optimize the camera poses during training of the neural field; its solving process depends heavily on the initial camera pose estimates, and without accurate initial values the optimization easily falls into local minima, leaving the reconstructed scene geometry disordered.
Another class of deep-learning-based methods attempts to regress poses directly through encoders. Although this increases inference speed, such methods typically rely on pose ground truth that is difficult to obtain for supervised training, and, lacking explicit multi-view geometric constraints in the network design, often fail to guarantee cross-view geometric consistency when processing large-disparity or sparse-view data. This lack of geometric constraints directly leads to unstable depth estimation, blurred object edges, ghosting, and even structural collapse in the reconstruction results, seriously affecting the accuracy and robustness of three-dimensional reconstruction.

In summary, how to fully exploit the efficient expressive power of 3DGS and achieve high-quality, geometrically consistent, and robust three-dimensional scene reconstruction without camera pose priors and without ground-truth supervision is a key technical problem to be solved in the field of three-dimensional computer vision.

Disclosure of Invention

The invention aims to provide a sparse view pose-free scene reconstruction method and system based on three-dimensional Gaussian representation, so as to solve the problems in the prior art of poor three-dimensional reconstruction quality, easily collapsing geometric structure, and strong dependence on ground-truth data under conditions of missing accurate camera poses, sparse viewpoints, or weak textures. To achieve this purpose, the invention adopts the following technical scheme: a sparse view pose-free scene reconstruction method based on three-dimensional Gaussian representation comprises the following steps: