US-12626459-B2 - Method and system for enabling exposure-guided three-dimensional reconstruction
Abstract
Disclosed is a method for enabling exposure-guided three-dimensional (3D) reconstruction, including (i) receiving Low Dynamic Range (LDR) input image captured by camera; (ii) rendering High Dynamic Range (HDR) image corresponding to input image, using HDR 3D reconstruction model pre-trained for HDR image reconstruction of 3D environment; (iii) applying HDR to LDR tone-mapping operator on HDR image, for producing tone-mapped LDR image; (iv) determining first loss function (FLF) between input image and tone-mapped LDR image, FLF comprising weighted pixel value differences between pixels of input image and corresponding pixels of tone-mapped LDR image; (v) determining whether input image has saturated pixel(s), saturated pixel(s) comprises: highlight saturated pixel, shadow saturated pixel; (vi) de-weighting pixel value difference corresponding to saturated pixel in FLF; (vii) back-propagating gradient of FLF with respect to model parameters, through differentiable render function of HDR 3D reconstruction model, for adjusting model parameters in way that reduces FLF.
Inventors
- Kimmo Roimela
- Anton Brandl
Assignees
- Varjo Technologies Oy
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-06-26
Claims (15)
- 1 . A method for enabling exposure-guided three-dimensional (3D) reconstruction, the method comprising: (i) receiving an input image captured by a camera, wherein the input image is a Low Dynamic Range (LDR) image; (ii) rendering a High Dynamic Range (HDR) image corresponding to the input image, using a HDR 3D reconstruction model that is pre-trained for HDR image reconstruction of a 3D environment; (iii) applying a HDR to LDR tone-mapping operator on the HDR image, for producing a tone-mapped LDR image; (iv) determining a first loss function between the input image and the tone-mapped LDR image, the first loss function comprising weighted pixel value differences between pixels of the input image and their corresponding pixels of the tone-mapped LDR image; (v) determining whether the input image has at least one saturated pixel, wherein the at least one saturated pixel comprises at least one of: a highlight saturated pixel, a shadow saturated pixel; (vi) de-weighting at least one pixel value difference corresponding to the at least one saturated pixel, in the first loss function, when it is determined that the input image has the at least one saturated pixel; and (vii) back-propagating a gradient of the first loss function with respect to one or more model parameters, through a differentiable render function of the HDR 3D reconstruction model, for adjusting the one or more model parameters in a way that reduces the first loss function, wherein the differentiable render function comprises an HDR image-rendering function and the HDR to LDR tone-mapping operator.
- 2 . The method of claim 1 , wherein the steps (ii)-(vii) are performed iteratively until the first loss function is minimized.
- 3 . The method of claim 1 , wherein the step (vi) of de-weighting the at least one pixel value difference comprises: determining a derivative of the HDR to LDR tone-mapping operator, for each pixel of the input image; and adjusting a weight of the at least one pixel value difference such that the weight is proportional to the derivative of the HDR to LDR tone-mapping operator, the weight lying within a first threshold from zero.
- 4 . The method of claim 1 , wherein the step (vi) of de-weighting the at least one pixel value difference comprises adjusting a weight of the at least one pixel value difference according to a value of at least one of: a first ramp function, a second ramp function, for the at least one saturated pixel, wherein the first ramp function ranges between 0 to 1 such that the value of the first ramp function is equal to 0 when a pixel value of a saturated pixel is equal to a maximum LDR pixel value, and the value of the first ramp function is equal to 1 when the pixel value of the saturated pixel is equal to a first threshold LDR pixel value corresponding to the maximum LDR pixel value, and the second ramp function ranges between 0 to 1 such that the value of the second ramp function is equal to 0 when a pixel value of a saturated pixel is equal to a minimum LDR pixel value, and the value of the second ramp function is equal to 1 when the pixel value of the saturated pixel is equal to a second threshold LDR pixel value corresponding to the minimum LDR pixel value.
- 5 . The method of claim 1 , wherein the step (vi) of de-weighting the at least one pixel value difference comprises: adjusting a weight of a pixel value difference to zero, when a pixel value of a corresponding saturated pixel is equal to a maximum LDR pixel value or to a minimum LDR pixel value; and adjusting a weight of a pixel value difference to one, when a pixel value of a corresponding saturated pixel is equal to any other pixel value.
- 6 . The method of claim 1 , wherein the step (vi) of de-weighting the at least one pixel value difference comprises: adjusting a weight of a pixel value difference corresponding to a pixel of the tone-mapped LDR image whose pixel value is greater than a first threshold LDR pixel value corresponding to a maximum LDR pixel value, to zero; and adjusting a weight of a pixel value difference corresponding to a pixel of the tone-mapped LDR image whose pixel value is lesser than a second threshold LDR pixel value corresponding to a minimum LDR pixel value, to zero, wherein the first threshold LDR pixel value and the second threshold LDR pixel value are defined by values of one or more parameters of the HDR to LDR tone-mapping operator.
- 7 . The method of claim 1 , further comprising estimating a camera tone reproduction curve by: initializing a value of at least one trainable parameter of the HDR to LDR tone-mapping operator to a default value; generating a training dataset comprising ground truth LDR images captured by the camera and HDR images corresponding to the ground truth LDR images; and performing at least one iteration of: applying the HDR to LDR tone-mapping operator on the HDR images, for producing tone-mapped LDR images; determining a second loss function between the ground truth LDR images and the tone-mapped LDR images, the second loss function comprising weighted pixel value differences between pixels of the ground truth LDR images and their corresponding pixels of the tone-mapped LDR image; and back-propagating the second loss function through the HDR to LDR tone-mapping operator for adjusting the value of the at least one trainable parameter in a way that reduces the second loss function, wherein when the second loss function is minimized upon performing the at least one iteration, the HDR to LDR tone-mapping operator best approximates the camera tone reproduction curve, and wherein the HDR to LDR tone-mapping operator that best approximates the camera tone reproduction curve is used at step (iii) and/or step (vi).
- 8 . The method of claim 1 , further comprising: determining an exposure level of the input image by analysing metadata corresponding to the input image; and determining an exposure gain factor that corresponds to an exposure value difference between the exposure level of the input image and an absolute exposure level of the HDR 3D reconstruction model, wherein when performing the step (iii) of applying the HDR to LDR tone-mapping operator on the HDR image, pixel values of the pixels of the HDR image are scaled according to the exposure gain factor.
- 9 . The method of claim 1 , wherein an exposure gain factor is a trainable parameter of the HDR 3D reconstruction model, and wherein the method further comprises iteratively adjusting a value of the exposure gain factor while training the HDR 3D reconstruction model, wherein when performing the step (iii) of applying the HDR to LDR tone-mapping operator on the HDR image, pixel values of the pixels of the HDR image are scaled by the exposure gain factor.
- 10 . The method of claim 1 , further comprising training the HDR 3D reconstruction model for HDR image reconstruction of the 3D environment, by implementing a first training process, wherein the first training process comprises: receiving a plurality of input images ( 118 ) representing the 3D environment from a plurality of viewpoints and view directions, wherein the plurality of input images are LDR images that are captured by at least one camera at a plurality of exposure levels; analysing metadata corresponding to the plurality of input images for determining the plurality of exposure levels of the plurality of input images; mapping the plurality of exposure levels to an HDR colour space, for generating a plurality of HDR images corresponding to the plurality of input images; and training a neural network using the plurality of HDR images, for generating a HDR 3D model of the 3D environment.
- 11 . The method of claim 1 , further comprising training the HDR 3D reconstruction model for HDR image reconstruction of the 3D environment, by implementing a second training process, wherein the second training process comprises: receiving a plurality of input images representing the 3D environment from a plurality of viewpoints and view directions, wherein the plurality of input images are LDR images that are captured by at least one camera at a plurality of exposure levels; and using a Structure-from-Motion (SfM) technique for creating one of: a HDR 3D point cloud, a HDR 3D voxel grid, of the 3D environment, from the plurality of input images.
- 12 . The method of claim 1 , further comprising training the HDR 3D reconstruction model for HDR image reconstruction of the 3D environment, by implementing a third training process, wherein the third training process comprises: receiving a plurality of input images representing the 3D environment from a plurality of viewpoints and view directions, wherein the plurality of input images are LDR images that are captured by at least one camera at a plurality of exposure levels; analysing metadata corresponding to the plurality of input images for determining the plurality of exposure levels of the plurality of input images; determining an exposure gain of each input image amongst the plurality of input images, relative to a reference exposure level; using a Structure-from-Motion (SfM) technique for creating one of: a HDR 3D point cloud, a HDR 3D voxel grid, of the 3D environment, from the plurality of input images; finding a plurality of image points corresponding to the one of: the HDR 3D point cloud, the HDR 3D voxel grid; for each image point amongst the plurality of image points, scaling pixel values of said image point by the exposure gain of its corresponding input image; and for each point in the HDR 3D point cloud or for each voxel in the HDR 3D voxel grid, using an average of the scaled pixel values of one or more image points corresponding to said point or said voxel, as a pixel value of said point or said voxel.
- 13 . The method of claim 1 , wherein the HDR 3D reconstruction model is one of: a neural network, a voxel grid, a point cloud.
- 14 . A system for enabling exposure-guided three-dimensional (3D) reconstruction, the system comprising at least one processor configured to: (i) receive an input image captured by a camera, wherein the input image is a Low Dynamic Range (LDR) image, and wherein the at least one processor is communicably coupled to the camera or to a data repository that is communicably coupled to the camera; (ii) render a High Dynamic Range (HDR) image corresponding to the input image, using a HDR 3D reconstruction model that is pre-trained for HDR image reconstruction of a 3D environment; (iii) apply a HDR to LDR tone-mapping operator on the HDR image, to produce a tone-mapped LDR image; (iv) determine a first loss function between the input image and the tone-mapped LDR image, the first loss function comprising weighted pixel value differences between pixels of the input image and their corresponding pixels of the tone-mapped LDR image; (v) determine whether the input image has at least one saturated pixel, wherein the at least one saturated pixel comprises at least one of: a highlight saturated pixel, a shadow saturated pixel; (vi) de-weight at least one pixel value difference corresponding to the at least one saturated pixel, in the first loss function, when it is determined that the input image has the at least one saturated pixel; and (vii) back-propagate a gradient of the first loss function with respect to one or more model parameters, through a differentiable render function of the HDR 3D reconstruction model, for adjusting the one or more model parameters in a way that reduces the first loss function, wherein the differentiable render function comprises an HDR image-rendering function and the HDR to LDR tone-mapping operator.
- 15 . The system of claim 14 , wherein the at least one processor is further configured to iteratively perform the steps (ii)-(vii) until the first loss function is minimized.
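The following is an illustrative, non-authoritative sketch of one optimization step corresponding to steps (ii)-(vii) of claim 1 (and claim 14), written in Python with PyTorch. The model method `render_hdr`, the Reinhard-style operator, and the saturation thresholds are assumptions made solely for this example; the claims do not prescribe any particular rendering function, tone-mapping operator, or framework.

```python
# Sketch only: one exposure-guided optimization step (claim 1, steps ii-vii).
import torch

def reinhard_tone_map(hdr):
    # Stand-in HDR-to-LDR tone-mapping operator (step iii); not the claimed operator.
    return hdr / (1.0 + hdr)

def saturation_weights(ldr, low=0.02, high=0.98):
    # Steps (v)-(vi): de-weight highlight-saturated and shadow-saturated pixels so that
    # their pixel value differences contribute nothing to the first loss function.
    saturated = (ldr <= low) | (ldr >= high)
    return torch.where(saturated, torch.zeros_like(ldr), torch.ones_like(ldr))

def exposure_guided_step(model, camera_pose, input_ldr, optimizer):
    # Steps (ii)-(iii): render an HDR image for the input view and tone-map it to LDR.
    hdr = model.render_hdr(camera_pose)        # hypothetical differentiable render function
    tone_mapped = reinhard_tone_map(hdr)
    # Step (iv): weighted per-pixel differences between input image and tone-mapped image.
    weights = saturation_weights(input_ldr)
    loss = (weights * (input_ldr - tone_mapped).abs()).mean()
    # Step (vii): back-propagate the gradient of the loss through the differentiable
    # render function (rendering plus tone mapping) to adjust the model parameters.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice (claims 2 and 15), such a step would be repeated iteratively until the first loss function is minimized.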
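A minimal sketch of the de-weighting variants described in claims 4-6 is given below. The threshold values and the linear shape of the ramps are assumptions chosen for illustration; the claims only require ramps between the threshold LDR pixel values and the maximum/minimum LDR pixel values, or a hard zero/one weighting.

```python
# Sketch only: saturation-dependent weights for the first loss function (claims 4-6).
import torch

def ramp_weights(ldr, max_val=1.0, min_val=0.0, high_thresh=0.9, low_thresh=0.1):
    # Highlight ramp (claim 4): weight 1 at the first threshold LDR value, falling
    # linearly to 0 at the maximum LDR value; pixels below the threshold keep weight 1.
    w_high = torch.clamp((max_val - ldr) / (max_val - high_thresh), 0.0, 1.0)
    # Shadow ramp (claim 4): weight 1 at the second threshold LDR value, falling
    # linearly to 0 at the minimum LDR value; pixels above the threshold keep weight 1.
    w_low = torch.clamp((ldr - min_val) / (low_thresh - min_val), 0.0, 1.0)
    return w_high * w_low

def hard_weights(tone_mapped, high_thresh=0.9, low_thresh=0.1):
    # Claim 6 variant: zero weight wherever the tone-mapped pixel value exceeds the first
    # threshold or falls below the second threshold; weight 1 everywhere else.
    keep = (tone_mapped < high_thresh) & (tone_mapped > low_thresh)
    return keep.float()
```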
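Claim 7 describes fitting a trainable HDR-to-LDR tone-mapping operator to the camera tone reproduction curve. The sketch below shows one plausible way to do this in PyTorch; the specific parametrization (a trainable log-exposure and gamma around a Reinhard-style curve) is an assumption for illustration only.

```python
# Sketch only: estimating a camera tone reproduction curve (claim 7).
import torch

class TrainableToneMap(torch.nn.Module):
    def __init__(self):
        super().__init__()
        # Trainable parameters initialized to default values, as in claim 7.
        self.log_exposure = torch.nn.Parameter(torch.zeros(1))
        self.gamma = torch.nn.Parameter(torch.ones(1))

    def forward(self, hdr):
        scaled = hdr * torch.exp(self.log_exposure)
        ldr = scaled / (1.0 + scaled)              # Reinhard-style compression (assumed)
        return torch.clamp(ldr, 1e-6, 1.0) ** self.gamma

def fit_tone_curve(tone_map, hdr_images, gt_ldr_images, iters=500, lr=1e-2):
    # Iteratively minimize a second loss function between ground-truth LDR images and
    # tone-mapped LDR images, back-propagating through the tone-mapping operator.
    opt = torch.optim.Adam(tone_map.parameters(), lr=lr)
    for _ in range(iters):
        loss = torch.stack([((tone_map(h) - l) ** 2).mean()
                            for h, l in zip(hdr_images, gt_ldr_images)]).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return tone_map  # approximates the camera tone reproduction curve once converged
```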
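Claims 8 and 9 scale the rendered HDR pixel values by an exposure gain factor before tone mapping. A short sketch under assumed conventions follows; the base-2 relation between exposure values and linear gain, and the sign of the difference, are photographic conventions assumed for this example rather than limitations of the claims.

```python
# Sketch only: exposure gain factor from an exposure-value difference (claims 8-9).
def exposure_gain_from_ev(image_ev, model_ev):
    # One EV step corresponds to a doubling or halving of linear exposure, so the gain
    # corresponding to the EV difference is a power of two (sign convention assumed).
    return 2.0 ** (model_ev - image_ev)

# Before tone mapping (step iii), the rendered HDR pixel values would be scaled, e.g.:
#   hdr_scaled = exposure_gain_from_ev(ev_from_metadata, model_ev) * hdr
# In the variant of claim 9, the gain (or its logarithm) would instead be a trainable
# parameter of the HDR 3D reconstruction model, adjusted during training.
```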
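Finally, claim 12 averages exposure-scaled pixel values over the image points that observe each 3D point (or voxel). The data layout below (a mapping from point identifiers to observed pixel values and gains) is purely hypothetical and is used only to make the averaging step concrete.

```python
# Sketch only: averaging exposure-scaled observations per 3D point (claim 12).
import numpy as np

def hdr_point_colors(point_to_observations):
    # point_to_observations: {point_id: [(pixel_value, exposure_gain), ...]} (assumed layout)
    colors = {}
    for pid, obs in point_to_observations.items():
        # Scale each observed pixel value by the exposure gain of its source image,
        # then use the average as the HDR pixel value of the point or voxel.
        scaled = [np.asarray(pixel, dtype=float) * gain for pixel, gain in obs]
        colors[pid] = np.mean(scaled, axis=0)
    return colors
```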
Description
TECHNICAL FIELD

The present disclosure relates to methods for enabling exposure-guided three-dimensional (3D) reconstruction. Moreover, the present disclosure relates to systems for enabling exposure-guided 3D reconstruction.

BACKGROUND

In recent times, advancements in three-dimensional (3D) reconstruction techniques have revolutionized the recreation (i.e., reconstruction) of detailed digital 3D models of 3D environments from images captured using common devices such as smartphones and cameras. For example, 3D reconstruction techniques such as Neural Radiance Fields (NeRF) and Gaussian Splatting have showcased impressive capabilities in accurately reconstructing 3D real-world scenes from said images. However, a significant challenge arises when existing 3D reconstruction techniques are employed to reconstruct 3D scenes with high dynamic range (HDR) illumination using said images. Notably, the aforesaid common devices can only capture low dynamic range (LDR) slices of the (true) HDR illumination of the 3D scenes; as an example, parts of said images may be over-exposed (too bright) or under-exposed (too dark). Thus, the vast illumination range of the 3D scenes is inaccurately captured in the LDR images and, as a result, inaccurately reconstructed in the 3D models.

For example, said common devices may capture two images depicting the same scene at different exposure levels. One image may show a window as overexposed (appearing as a solid white colour with no visible details of the outside scene), while the other image shows the outside scene properly exposed (thus revealing detailed information beyond the window). This difference in exposure levels, and its effect on visual detail, leads to ambiguities and errors when reconstructing the scene outside the window using existing 3D reconstruction techniques. Since half of the evidence suggests that the window and the scene outside it should be solid white, while the other half shows actual detail, the 3D reconstruction technique has no single solution to aim for. This usually results in artifacts such as floating clouds of matter (i.e., floaters) in thin air, as the optimization algorithm (i.e., the 3D reconstruction algorithm) attempts to recreate a 3D model that can explain both images (leading to overfitting).

Existing methods attempt to address the aforesaid challenge of reconstructing 3D scenes with HDR illumination through various approaches. For example, inverse tone mapping is traditionally applied to bracketed stacks of LDR images to recover HDR data and the camera tone curve. As another example, neural network models have recently been used to recover HDR information from single LDR images. However, both of these methods are unable to recover accurate and sufficient HDR data from the LDR images. In yet another example, approaches based on exposure compensation during 3D reconstruction utilize per-pixel scaling as part of their training loops to unify exposure levels across input images, but still struggle to mitigate saturation in the input images effectively. In still another example, 3D reconstruction techniques that focus on view-dependent colour reconstruction in standard red-green-blue (sRGB) space are employed. In 3D Gaussian Splatting, this is implemented by representing said colour space using spherical harmonics, but it often results in noticeable reconstruction artifacts and inaccuracies, especially in complex scenes with diverse lighting conditions.

Therefore, in light of the foregoing discussion, there exists a need to overcome the aforementioned drawbacks.

SUMMARY

The aim of the present disclosure is to provide a method and a system for enabling exposure-guided three-dimensional (3D) reconstruction, for reconstructing accurate HDR 3D models of 3D environments using LDR images. The aim of the present disclosure is achieved by a method for enabling exposure-guided 3D reconstruction and a system for enabling exposure-guided 3D reconstruction as defined in the appended independent claims, to which reference is made. Advantageous features are set out in the appended dependent claims.

Throughout the description and claims of this specification, the words “comprise”, “include”, “have”, and “contain” and variations of these words, for example “comprising” and “comprises”, mean “including but not limited to”, and do not exclude other components, items, integers or steps not explicitly disclosed also to be present. Moreover, the singular encompasses the plural unless the context otherwise requires. In particular, where the indefinite article is used, the specification is to be understood as contemplating plurality as well as singularity, unless the context requires otherwise.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates a process flow of a method for enabling exposure-guided three-dimensional (3D) reconstruction, in accordanc