CN-121982216-A - Feedforward large scene reconstruction method based on three-dimensional Gaussian splatting

CN121982216A

Abstract

The invention relates to the technical field of image processing, and in particular to a feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting. The method comprises the steps of: obtaining a multi-view image set; constructing initial camera tokens carrying view-identification information; performing intra-frame and cross-view feature interaction through an alternating-attention geometric transformer, which outputs updated camera tokens and image feature tokens; predicting the pose, depth map, and depth confidence map of each frame through decoding branches; back-projecting pixels into three-dimensional space using the predicted parameters; regressing Gaussian attributes and confidence values from the image features; and finally eliminating redundant Gaussian primitives through voxel aggregation and rendering the target view. The method achieves efficient, high-precision three-dimensional reconstruction of large scenes without requiring camera intrinsic/extrinsic priors or a complicated sparse-reconstruction process, significantly reduces computational cost, and resolves the geometric drift and video-memory overflow to which traditional methods are prone in large scenes.

Inventors

  • Li Yuqi
  • Jin Zirui
  • Zheng Wenxing
  • Wang Ziyao

Assignees

  • Ningbo University (宁波大学)

Dates

Publication Date
2026-05-05
Application Date
2026-01-30

Claims (10)

  1. A feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting, characterized by comprising the following steps: S1, acquiring a multi-view image set of a scene to be reconstructed, performing feature encoding on each view image to obtain image tokens, and constructing an initial camera token for each view; S2, inputting the image tokens and the initial camera tokens into a preset alternating-attention geometric transformer, updating the tokens through intra-frame fusion and cross-view global fusion, and outputting updated camera tokens and updated image tokens that contain scene geometry information; S3, based on the updated camera tokens, predicting the camera intrinsic and extrinsic parameters of each frame through a pose-solving branch, and simultaneously, based on the updated image tokens, predicting the depth map and depth confidence map of each frame through a depth-prediction branch; S4, using the predicted camera intrinsic and extrinsic parameters and depth maps, back-projecting the pixels of each view into three-dimensional space, determining the three-dimensional Gaussian center position corresponding to each pixel, and predicting the attribute parameters of each three-dimensional Gaussian primitive in combination with the updated image tokens; S5, spatially bucketing the original three-dimensional Gaussian primitives generated by all views according to a preset voxel size, and performing weighted aggregation of the original Gaussian primitive attributes within each voxel according to the Gaussian confidences therein, to obtain a sparse voxel-level three-dimensional Gaussian set; and S6, rendering the voxel-level three-dimensional Gaussian set with a differentiable renderer to obtain a rendered image at the target viewing angle.
  2. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S1, adjacent-view images in the multi-view image set share an overlapping region that meets a preset overlap requirement, the exposure and sharpness of the images lie within preset threshold ranges, and the multi-view image set, when used as input data, includes no camera intrinsic parameters, camera extrinsic parameters, or shooting-order information.
  3. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S1, constructing an initial camera token for each view comprises: initializing, for each view image, a learnable vector of preset dimension as the initial camera token of that view; and fusing the initial camera token with the view-index encoding of that view, so that the initial camera token carries identification information distinguishing different viewing angles.
  4. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S2, outputting updated camera tokens and updated image tokens containing scene geometry information comprises: alternately performing intra-frame self-attention computation and cross-view global-fusion computation in a plurality of feature-processing layers of the alternating-attention geometric transformer; the intra-frame self-attention computation performs information interaction among the image tokens within a single image, so as to extract the spatial-semantic features of each view; the cross-view global-fusion computation performs global attention interaction among the image tokens and camera tokens of all views, and, by capturing the parallax features and geometric correspondences of the overlapping regions between different views, aggregates the global geometric-constraint information of the scene into the camera token of each view and fuses complementary multi-view information into the image feature tokens of each view.
  5. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S3, predicting the camera intrinsic and extrinsic parameters of each frame through the pose-solving branch comprises: inputting the updated camera tokens into the pose-solving branch, regressing via nonlinear mapping the rotation vector and translation vector of each frame relative to a reference view, and converting the rotation vector and translation vector into camera intrinsic and extrinsic parameter matrices.
  6. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S3, predicting the depth map and depth confidence map of each frame through the depth-prediction branch comprises: inputting the updated image tokens into a decoding network of the depth-prediction branch, restoring the feature dimensions to the original image resolution through upsampling, and outputting the corresponding depth map; and simultaneously outputting a depth confidence value for each pixel position through a confidence prediction head of the decoding network.
  7. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S4, determining the three-dimensional Gaussian center position corresponding to each pixel comprises: constructing, from the predicted intrinsic and extrinsic parameters of each view's camera, a projective transformation matrix from the image coordinate system to the world coordinate system; and back-projecting the two-dimensional coordinates and corresponding depth value of each pixel in each view's depth map into three-dimensional world space using the projective transformation matrix, obtaining the three-dimensional spatial coordinates of each pixel and taking them as the center position of the corresponding original three-dimensional Gaussian primitive.
  8. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S4, predicting the attribute parameters of each three-dimensional Gaussian primitive in combination with the updated image tokens comprises: inputting the updated image tokens into an attribute regression head, decoding them, and regressing the attribute parameters of the three-dimensional Gaussian primitive at each pixel position; and extracting, through a confidence prediction branch, the initial Gaussian confidence of each pixel from the updated image tokens.
  9. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, wherein in step S5, obtaining the sparse voxel-level three-dimensional Gaussian set comprises: dividing the three-dimensional scene into a plurality of voxel grids according to a preset spatial resolution, and mapping the original three-dimensional Gaussian primitives generated by all views into the corresponding voxels according to their center positions; within each voxel containing multiple original three-dimensional Gaussian primitives, constructing a Softmax distribution over the Gaussian confidences of those primitives and computing the aggregation weight of each primitive; and weighted-summing the attributes of all original three-dimensional Gaussian primitives in the voxel with the aggregation weights, generating a single voxel-level three-dimensional Gaussian primitive per voxel.
  10. The feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting according to claim 1, further comprising supervising the training phase of the alternating-attention geometric transformer with rendering-reconstruction losses, the training comprising: rendering with the predicted voxel-level three-dimensional Gaussian set and computing the photometric loss between the rendered image and the real image; screening out high-confidence pixel regions according to the depth confidence map and applying a multi-view geometric-consistency loss to those regions, so as to constrain the consistency of spatial-point projections across viewing angles; and generating pseudo-labels with a pre-trained depth-estimation model and a pose-estimation algorithm, and using the pseudo-labels as supervision signals for distillation training of the model.
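As an illustration of the confidence-weighted voxel aggregation recited in claims 1 and 9, the following is a minimal Python sketch. It is not part of the patent disclosure: the Gaussian attribute set is reduced to center and opacity for brevity, and the function and parameter names are assumptions made for illustration.

```python
import math
from collections import defaultdict

def aggregate_gaussians(centers, opacities, confidences, voxel_size):
    """Merge per-pixel Gaussian primitives into one primitive per voxel,
    weighting each primitive by a Softmax over its predicted confidence
    (sketch of step S5; only center and opacity are aggregated here)."""
    buckets = defaultdict(list)
    # Bucket each primitive by the voxel grid cell containing its center.
    for c, o, conf in zip(centers, opacities, confidences):
        key = tuple(int(math.floor(x / voxel_size)) for x in c)
        buckets[key].append((c, o, conf))
    merged = []
    for prims in buckets.values():
        # Softmax over the confidences in this voxel gives aggregation weights.
        m = max(conf for _, _, conf in prims)
        exps = [math.exp(conf - m) for _, _, conf in prims]
        z = sum(exps)
        weights = [e / z for e in exps]
        # Weighted sum of attributes yields a single voxel-level primitive.
        center = tuple(sum(w * c[i] for w, (c, _, _) in zip(weights, prims))
                       for i in range(3))
        opacity = sum(w * o for w, (_, o, _) in zip(weights, prims))
        merged.append((center, opacity))
    return merged
```

Primitives whose centers fall into the same voxel are collapsed into one, so the number of surviving Gaussians is bounded by the number of occupied voxels rather than by pixels times views.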

Description

Feedforward large scene reconstruction method based on three-dimensional Gaussian splatting

Technical Field

The invention relates to the technical field of image processing, and in particular to a feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting.

Background

Three-dimensional reconstruction and novel-view synthesis are key foundational technologies in computer vision, graphics, virtual reality, and related fields, and are widely applied to digital twinning, AR/VR, film and television production, cultural-relic digitization, autonomous-driving simulation, indoor and outdoor mapping, and other scenarios. In the prior art, mainstream high-quality three-dimensional reconstruction and novel-view synthesis generally adopt a two-stage or multi-stage pipeline of first estimating camera poses and then optimizing the scene representation: camera intrinsic and extrinsic parameters are first obtained through structure-from-motion/multi-view stereo (SfM/MVS), after which a representation such as a neural field (e.g., NeRF) or a three-dimensional Gaussian point cloud (3D Gaussian Splatting, 3DGS) is iteratively optimized per scene. Such methods generally have the following problems:

  1. Additional pose solving or strict calibration is needed, making it difficult to adapt to unconstrained multi-view images that are uncalibrated, pose-free, and captured in no fixed order.
  2. Per-scene optimization is time-consuming and costly, often requiring long optimization runs, so rapid "shoot then use" output is difficult to realize.
  3. Some feedforward methods assign a Gaussian primitive to every pixel, so the number of Gaussians grows approximately linearly with the number of viewing angles, creating video-memory and computation pressure.
  4. Reliable three-dimensional supervision is lacking: three-dimensional annotation of real scenes is expensive and possibly noisy, and without such supervision the model easily overfits a small number of context views, causing geometric inconsistency, depth layering, view misalignment, and other problems that degrade rendering quality.

Therefore, a method and system are needed that, given only pose-free multi-view images as input, rapidly predict camera parameters and the three-dimensional Gaussian representation end to end and complete rendering, while providing both dense view-expansion capability and training stability.

Disclosure of Invention

To remedy the above shortcomings, the invention provides a feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting, which aims to solve the technical problems that the existing three-dimensional Gaussian splatting technology depends heavily on high-precision camera pose priors, that the redundancy of point-cloud data in large scenes severely causes video-memory overflow, and that end-to-end rapid feedforward reconstruction cannot be realized.
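The pixel back-projection used in step S4 of claim 1 can be sketched as follows. This is an illustrative Python fragment, not part of the patent disclosure: it assumes a pinhole intrinsic model with focal lengths fx, fy and principal point cx, cy, and takes R, t to be the camera-to-world rotation matrix and translation produced by the pose-solving branch; the function name is a placeholder.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy, R, t):
    """Lift one pixel with its predicted depth into world space; the
    resulting point serves as the center of that pixel's Gaussian primitive.
    R is a 3x3 camera-to-world rotation (nested lists), t a 3-vector."""
    # Pixel -> camera coordinates: scale the normalized ray by the depth.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    cam = (x, y, depth)
    # Camera -> world: X_w = R @ X_c + t.
    return tuple(sum(R[i][j] * cam[j] for j in range(3)) + t[i]
                 for i in range(3))
```

Applying this to every pixel of every view's depth map yields the set of original three-dimensional Gaussian centers that step S5 subsequently buckets into voxels.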
The invention provides a feedforward large-scene reconstruction method based on three-dimensional Gaussian splatting, comprising the following steps: S1, acquiring a multi-view image set of a scene to be reconstructed, performing feature encoding on each view image to obtain image tokens, and constructing an initial camera token for each view; S2, inputting the image tokens and the initial camera tokens into a preset alternating-attention geometric transformer, updating the tokens through intra-frame fusion and cross-view global fusion, and outputting updated camera tokens and updated image tokens that contain scene geometry information; S3, based on the updated camera tokens, predicting the camera intrinsic and extrinsic parameters of each frame through a pose-solving branch, and simultaneously, based on the updated image tokens, predicting the depth map and depth confidence map of each frame through a depth-prediction branch; S4, using the predicted camera intrinsic and extrinsic parameters and depth maps, back-projecting the pixels of each view into three-dimensional space, determining the three-dimensional Gaussian center position corresponding to each pixel, and predicting the attribute parameters of each three-dimensional Gaussian primitive in combination with the updated image tokens; S5, spatially bucketing the original three-dimensional Gaussian primitives generated by all views according to a preset voxel size, and performing weighted aggregat