EP-4738268-A1 - APPARATUS, DATA CARRIER, COMPUTER PROGRAM, TRAINING METHOD, AND METHOD FOR 3D SCENE RECONSTRUCTION
Abstract
Embodiments of the present disclosure relate to an apparatus, a data carrier, a computer program, a training method, and a method for 3D scene reconstruction using a machine-learning-based model. The model comprises a series of extrapolation of latent features (ELF) blocks and a feature extractor including multiple transformer blocks. The method comprises obtaining a first latent feature representation of a scene for reference images of the scene from a first transformer block of the feature extractor applied to the reference images and obtaining a second latent feature representation of the scene from a second transformer block of the feature extractor applied to the reference images. Additionally, the method comprises obtaining, based on the first latent feature representation, a first extrapolated representation using a first ELF block, and obtaining, based on the second latent feature representation, a second extrapolated representation using a second ELF block. Further, the method comprises generating one or more novel views of the scene based on the extrapolated representations.
Inventors
- Kästingschäfer, Marius
- Bernhard, Sebastian
- Najafli, Eyvaz
Assignees
- AUMOVIO Germany GmbH
Dates
- Publication Date
- 20260506
- Application Date
- 20241029
Claims (12)
- A method (200) for 3D scene reconstruction using a machine-learning-based model, wherein the model comprises a series of extrapolation of latent features, ELF, blocks and a feature extractor including multiple transformer blocks, the method (200) comprising: obtaining (210) a first latent feature representation of a scene for reference images of the scene from a first transformer block of the feature extractor applied to the reference images; obtaining (220) a second latent feature representation of the scene from a second transformer block of the feature extractor applied to the reference images; obtaining (230), based on the first latent feature representation, a first extrapolated representation using a first ELF block; obtaining (240), based on the second latent feature representation, a second extrapolated representation using a second ELF block; and generating (250) one or more novel views of the scene based on the extrapolated representations.
- The method (200) of claim 1, wherein the first transformer block is upstream to the second transformer block and the first ELF block is upstream to the second ELF block.
- The method (200) of claim 1 or 2, wherein obtaining the first and second extrapolated representation comprises: obtaining initial virtual views of the scene based on the reference images and camera parameters for recording the reference images; generating the first and the second extrapolated representation based on the initial virtual views as input for initializing the ELF blocks and the first latent feature representation as conditioning signal for the first ELF block and the second latent feature representation as conditioning signal for the second ELF block.
- The method (200) of any one of the preceding claims, wherein generating the novel views comprises: projecting the first and second extrapolated representation to a reduced number of dimensions; and fusing the projected extrapolated representations for generating the novel views based on the fused extrapolated representations.
- The method (200) of claim 4, wherein generating the novel views further comprises: obtaining a depth map from the fused projected extrapolated representations; obtaining a feature map based on an outcome of the series of ELF blocks; and generating the novel views based on the depth map and the feature map of the scene.
- The method (200) of any one of the preceding claims, wherein the method (200) further comprises: obtaining one or more further latent feature representations of the scene for the reference images from one or more further transformer blocks of the feature extractor applied to the reference images; and obtaining, based on the further latent feature representations, one or more further extrapolated representations using respective further ELF blocks.
- The method (200) of any one of the preceding claims, wherein the method (200) further comprises providing the novel views for controlling a cyber-physical system.
- The method (200) of claim 7, wherein the cyber-physical system comprises or corresponds to an autonomous or assisted driving system for a vehicle, or a robot.
- A training method (500) for training a machine-learning-based model for 3D scene reconstruction, wherein the model comprises a feature extractor including multiple transformer blocks and a series of extrapolation of latent features, ELF, blocks, the training method comprising: training (510) the feature extractor using training data including labelled image data; obtaining (520) a first latent feature representation of sample images as well as of target views of a scene from a first transformer block of the trained feature extractor applied to the sample images and target views; obtaining (530) a second latent feature representation of sample images as well as of target views of the scene from a second transformer block of the trained feature extractor applied to the sample images and target views; and training (540) a first and a second ELF block based on: initial virtual views for the scene; the first latent feature representation of the sample images as conditioning signal for the first ELF block and the first latent feature representation of the target views as ground truth for the first ELF block; and the second latent feature representation of the sample images as conditioning signal for the second ELF block and the second latent feature representation of the target views as ground truth for the second ELF block.
- A computer program comprising instructions which, when the computer program is executed by a computer, cause the computer to carry out a method (200, 500) of any one of the claims 1 to 9 and/or provide a machine-learning-based model obtainable by a training method (500) of claim 9.
- A computer-readable data carrier having stored thereon the computer program of claim 10.
- An apparatus (600) comprising: one or more interfaces (610) for communication; and a data processing circuit (620) configured to execute a method of any one of the claims 1 to 9 and/or provide a machine-learning-based model obtainable by a training method of claim 9.
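The inference pipeline of claims 1 to 5 can be sketched as follows. This is a minimal toy illustration, not the patented implementation: all function and variable names (transformer_block, elf_block, init_views, etc.) are assumptions introduced for illustration, and simple linear maps with nonlinearities stand in for the actual transformer attention, ELF extrapolation, and rendering machinery.

```python
import numpy as np

rng = np.random.default_rng(0)

def transformer_block(x, w):
    # Stand-in for a transformer block: a residual nonlinear map
    # (real blocks use attention + MLP layers).
    return x + np.tanh(x @ w)

def elf_block(init_view, cond, w):
    # Stand-in for an extrapolation-of-latent-features (ELF) block:
    # refines initial virtual views, conditioned on a latent feature
    # representation (claim 3: views initialize, latents condition).
    return np.tanh((init_view + cond) @ w)

D = 16                                      # illustrative latent width
ref_images = rng.standard_normal((4, D))    # 4 reference "images" as feature rows
init_views = rng.standard_normal((4, D))    # initial virtual views from camera params

# Feature extractor: two stacked transformer blocks; each depth yields one
# latent feature representation (claim 2: the first block is upstream).
w1, w2 = rng.standard_normal((2, D, D)) * 0.1
latent1 = transformer_block(ref_images, w1)   # first latent representation
latent2 = transformer_block(latent1, w2)      # second latent representation

# Series of ELF blocks, one per latent representation (claims 1 and 3).
we1, we2 = rng.standard_normal((2, D, D)) * 0.1
extrap1 = elf_block(init_views, latent1, we1)
extrap2 = elf_block(init_views, latent2, we2)

# Claim 4: project each extrapolated representation to fewer dimensions,
# then fuse the projections.
proj = rng.standard_normal((D, D // 2)) * 0.1
fused = np.concatenate([extrap1 @ proj, extrap2 @ proj], axis=-1)

# Claim 5: a depth map from the fused projections plus a feature map from
# the outcome of the ELF series yield the novel views (toy rendering step).
depth_map = fused @ (rng.standard_normal((D, 1)) * 0.1)
feature_map = extrap2
novel_view = feature_map * depth_map
```

Note the hierarchical structure this sketch mirrors: each ELF block is paired with one transformer depth, which is what allows the blocks to be trained separately against per-depth targets, as in the training method of claim 9.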
Description
Embodiments of the present disclosure relate to an apparatus, a data carrier, a computer program, a training method, and a method for 3D scene reconstruction. 3D scene reconstruction has been applied in various applications and in different fields of technology including, e.g., automotive applications (e.g., driving/parking assistance systems). The reconstruction of 3D scenes from 2D images is a long-standing challenge in computer vision. Recent advancements in novel view synthesis (NVS) due to the use of neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS) have renewed interest in the problem. NeRFs and 3DGS learn the 3D scene representation given camera images and pose information. Recent related approaches either focus on novel view synthesis directly, on reducing the number of images required for training, or on reconstructing 3D occupancy:
- Novel view synthesis methods (I.): These methods primarily focus on improving the visual quality of the obtained reconstructions. Recently, Gaussian Splats have replaced more implicit representation methods in terms of speed and reconstruction performance.
- Semantic occupancy prediction (II.): Deriving consistent and reliable 3D representations from 2D RGB image inputs or 2.5D lidar inputs is particularly important within autonomous driving. Methods in this domain usually do not rely on differentiable volume rendering but apply other methods to parameterize a voxelized space.
- Sparse 3D reconstruction (III.): The original NeRF method relied on many RGB images for regularizing the learned scene representation. Many methods attempt to reduce the number of images required to learn the scene, allowing few-image or, very recently, single-image-to-3D reconstructions.
Novel view synthesis methods are limited in the following regards:
- Performance of most methods drops dramatically when using fewer images.
- They require absolute pose information, usually estimated using additional compute-intensive structure-from-motion algorithms such as COLMAP.
- They are often limited in their ability to perform inference over occluded or unobserved parts of the scene.
- They require extensive per-scene training, which increases total usage time and makes the methods unfeasible for few-image, single-shot inference-time usage.
Semantic occupancy prediction methods face the following problems:
- They are limited in their visual fidelity and spatial resolution. Since those models do not aim at reconstructing the 3D scene faithfully but instead focus on occupancy (whether a point in space has non-zero density) and semantic segmentation (to which of the predefined classes the density belongs), they are far from photorealistic or high-fidelity reconstructions.
- They have a fixed spatial resolution due to the lack of scene contraction. They model the scene up to a certain distance (e.g., a 20-meter radius in all directions) but are unable to capture details or scene parts further away.
Sparse 3D reconstruction methods are limited by the following factors:
- They are often only suited for inward-facing camera setups with significant view overlap. Ego-exo view generalization, as required for autonomous driving applications, is thus difficult for those methods.
- Existing methods mainly tackle bounded scenes, particularly single objects with transparent or white scene backgrounds.
- Many methods rely on expensive inference-time optimization (for example, common within most diffusion models).
- Existing approaches focus on pixel-aligned explicit 3D primitive prediction, which restricts the ability of those models to generalize to potentially unseen regions.
Furthermore, reconstructing large scenes from a handful of observations is difficult when the number of primitives used to represent the scene is bounded above by the number of input views. Hence, there may be a demand for an improved concept for reconstructing 3D scenes. This demand may be satisfied by the subject-matter of the appended independent claims. In particular, the approach proposed herein may tackle at least one of the challenges or drawbacks mentioned above. Optional embodiments are disclosed in the appended dependent claims.
Embodiments of the proposed approach are based on the finding that a hierarchical model architecture allows a separate training of blocks for latent feature extrapolation. Embodiments of the present disclosure provide a method for 3D scene reconstruction using a machine-learning-based model. The model comprises a series of extrapolation of latent features (ELF) blocks and a feature extractor including multiple transformer blocks. The method comprises obtaining a first latent feature representation of a scene for reference images of the scene from a first transformer block of the feature extractor applied to the reference images and obtaining a second latent feature representation of the scene from a second transformer block of the feature extractor applied to the reference images. Additionally, the method comprises obtaining, based on the first latent feature representatio