JP-7855795-B2 - Scalable 3D scene representation using neural field modeling
Inventors
- Su, Guan-Ming
- Yin, Peng
- Choudhury, Anustup Kumar Atanu
- Lu, Taoran
Assignees
- Dolby Laboratories Licensing Corporation
Dates
- Publication Date
- 2026-05-08
- Application Date
- 2023-09-05
- Priority Date
- 2022-09-08
Claims (17)
- In an encoder, a method for generating a scalable 3D scene representation, the method comprising: accessing a first set of images (102) in a first format for a scene; generating a first 3D scene representation (107) of the scene based on the first set of images; accessing a second set of images (104) in a second format for the scene; generating a second 3D scene representation (112) of the scene based on the second set of images, wherein the second 3D scene representation is better than the first 3D scene representation according to one or more quality criteria; generating output image residuals (122) based on the first 3D scene representation and the second 3D scene representation using an original set of viewing positions and a new set of viewing positions; training a residual neural field network (125) using the output image residuals to generate predicted residual images that approximate the output image residuals; transmitting the first 3D scene representation (107) for the scene as a base layer; and transmitting information for the trained residual neural field network as an enhancement layer.
- The method according to claim 1, further comprising reformatting the output of the first 3D scene representation or the second 3D scene representation before generating the output image residuals.
- The method according to claim 2, wherein reformatting includes image upscaling, image downscaling, frame dropping, frame interpolation, or dynamic range/color gamut expansion.
- The method according to claim 1, wherein the one or more quality criteria comprise one or more of: PSNR scalability, dynamic range scalability, color gamut scalability, spatial resolution scalability, and temporal frame-rate scalability.
- The method according to claim 1, wherein the first set of images is identical to the second set of images.
- The method according to claim 1, wherein the first set of images differs from the second set of images in dynamic range or bit depth, color gamut, spatial resolution, or frame rate.
- The method according to claim 1, wherein a 3D scene representation comprises one of: a multiview plus depth (MVD) representation, a multiplane image (MPI) representation, or a neural radiance field (NeRF) neural network representation.
- The method according to claim 5, wherein the first 3D scene representation comprises a first NeRF model, the second 3D scene representation comprises a second NeRF model, and the second NeRF model renders images of better quality than the first NeRF model, and wherein generating the output image residuals comprises: computing first image residuals $R^{t_g} = I^{t_g} - \hat{I}_{BL}^{t_g}$, and computing second image residuals $R^{t_n} = \hat{I}_{EL}^{t_n} - \hat{I}_{BL}^{t_n}$, where $t_g$ denotes an original camera pose, $t_n$ denotes a new camera pose, $\hat{I}_{BL}^{t}$ and $\hat{I}_{EL}^{t}$ denote images rendered based on the first NeRF model and the second NeRF model, respectively, for spatial position $(x, y, z)$ and viewing direction $(\theta, \phi)$, and $I^{t_g}$ denotes an image in the first set of images.
- The method according to claim 8, wherein during training the parameters of the residual neural field network are generated by $\Phi_r^{*} = \arg\min_{\Phi_r} \sum_{t} D\left(R^{t}, \hat{R}^{t}(\Phi_r)\right)$, where $\Phi_r^{*}$ denotes the optimal set of parameters of the residual neural field network, $\hat{R}^{t}(\Phi_r)$ denotes the output of the trained residual neural field network at view $t$, $R^{t}$ denotes the image residual at view $t$, and $D(\cdot)$ denotes a loss function to be minimized during training (an executable sketch of this residual computation and training objective follows the claims).
- In a decoder, a method for generating an output 3D scene, the method comprising: receiving a base layer bitstream (107) comprising a first 3D scene representation (107) of a scene; receiving an enhancement layer bitstream (127) comprising information for reconstructing a trained residual neural field network; and, given an observer position: generating a first 3D output (132) of the scene based on the first 3D scene representation; generating image residuals (145) using the observer position and the trained residual neural field network; and combining the first 3D output of the scene with the image residuals to generate an enhanced 3D output of the scene (a decoder-side sketch follows the claims).
- The method according to claim 10, further comprising reformatting the first 3D output of the scene or the image residuals before combining them.
- The method according to claim 11, wherein reformatting includes image upscaling, image downscaling, frame dropping, or frame interpolation.
- The method according to claim 1, wherein the information for the trained residual neural field network includes one or more of: a quality parameter (nnr_purpose_idc) specifying one or more of the quality criteria; camera viewport information (viewport_camera_info_present_flag and related parameters); a first-model parameter (nnr_bl_idc) for the first 3D representation model; the number of hidden layers of the residual neural field (part of the NN topology, which can be carried in an NNR bitstream or by external means based on nnr_mode_idc); an input position-encoding method (nnr_position_encoding_freq[i]); an activation function (part of the NN topology, which can be carried in the NNR bitstream or by external means based on nnr_mode_idc); parameters related to residual rescaling (nnr_normalized_weight, nnr_abs_normalized_offset, and nnr_sign_normalized_offset); descriptors for input coordinate parameters (nnr_input_dimension_minus3); and descriptors for output parameters (nnr_colour_primaries, nnr_output_pic_width_in_luma_samples, nnr_output_pic_height_in_luma_samples) (an illustrative grouping of these syntax elements follows the claims).
- The method according to claim 13, wherein the information is transmitted as part of supplemental enhancement information (SEI) messaging.
- The method according to claim 1, wherein training the residual neural field network (125) using the output image residuals is performed at a first spatial resolution, and further comprises training the residual neural field network with, as input, the output image residuals at a second spatial resolution lower than the first spatial resolution and, as output, the predicted residual image at the first spatial resolution.
- A non-transitory computer-readable storage medium storing computer-executable instructions for executing, with one or more processors, the method according to any one of claims 1 to 15.
- An apparatus comprising a processor, wherein the apparatus is configured to perform the method according to any one of claims 1 to 15.
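For readers who want to see claims 8 and 9 in executable form, the following is a minimal sketch, assuming NumPy arrays hold the rendered and captured images; the function names and the L1 choice for the loss $D(\cdot)$ are illustrative assumptions, not part of the claims.

```python
import numpy as np

def first_image_residual(I_tg, I_bl_tg):
    # R^{t_g} = I^{t_g} - I_hat_BL^{t_g}: an image from the first set at an
    # original camera pose t_g minus the first (base-layer) NeRF rendering.
    return I_tg - I_bl_tg

def second_image_residual(I_el_tn, I_bl_tn):
    # R^{t_n} = I_hat_EL^{t_n} - I_hat_BL^{t_n}: second (enhancement) NeRF
    # rendering minus first NeRF rendering at a new camera pose t_n.
    return I_el_tn - I_bl_tn

def training_objective(residuals, predictions):
    # sum_t D(R^t, R_hat^t(Phi_r)): the quantity minimized over Phi_r in
    # claim 9. D is mean absolute error here purely as an illustrative choice.
    return sum(np.abs(r - p).mean() for r, p in zip(residuals, predictions))
```

New camera poses $t_n$ have no captured ground truth, so the second residual uses the higher-quality second NeRF rendering as the reference; that is why both formulas subtract the same base-layer rendering.

On the decoder side (claims 10 and 11), reconstruction is the base-layer rendering plus the residual field's prediction at the same observer position. A minimal sketch follows, assuming render_base_layer and residual_field are callables reconstructed from the base-layer and enhancement-layer bitstreams; these names, and the clipping step, are hypothetical.

```python
import numpy as np

def decode_view(render_base_layer, residual_field, observer_pose):
    # First 3D output (132) of the scene from the base layer.
    base = render_base_layer(observer_pose)
    # Predicted image residuals (145) from the trained residual neural field.
    residual = residual_field(observer_pose)
    # Per claim 11, either term may first be reformatted (e.g. upscaled);
    # omitted here. Combination is pixel-wise addition, clipped to a
    # normalized code-value range as an assumed post-step.
    return np.clip(base + residual, 0.0, 1.0)
```

Finally, the nnr_* syntax elements of claim 13 can be pictured as a single container. The grouping, types, and defaults below are illustrative assumptions, not the normative SEI/NNR syntax.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResidualNNRMetadata:
    """Illustrative grouping of the claim-13 syntax elements describing a
    trained residual neural field network (not normative syntax)."""
    nnr_purpose_idc: int = 0                 # which quality criterion the EL improves
    viewport_camera_info_present_flag: bool = False
    nnr_bl_idc: int = 0                      # identifies the first (base-layer) 3D model
    nnr_mode_idc: int = 0                    # whether NN topology (hidden-layer count,
                                             # activation) is in an NNR bitstream or external
    nnr_position_encoding_freq: List[int] = field(default_factory=list)
    nnr_normalized_weight: int = 0           # residual rescaling
    nnr_abs_normalized_offset: int = 0
    nnr_sign_normalized_offset: int = 0
    nnr_input_dimension_minus3: int = 0      # input coordinates beyond (x, y, z)
    nnr_colour_primaries: int = 1
    nnr_output_pic_width_in_luma_samples: int = 0
    nnr_output_pic_height_in_luma_samples: int = 0
```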
Description
Cross-Reference to Related Applications: This application claims the benefit of priority from U.S. Provisional Application No. 63/404,885, filed on 8 September 2022, which is incorporated herein by reference in its entirety.

This disclosure relates generally to images. More specifically, embodiments of the present invention relate to scalable 3D scene representation using a dual-layer approach in which the information in the upper layer is modeled using neural fields.

In recent years, there has been growing interest in the efficient modeling and representation of 3D scenes. 3D scenes can be used in a variety of applications, including volumetric imaging, virtual reality, and augmented reality. While deep-learning techniques have shown promising results in 3D scene representation and reconstruction, not all devices can handle the computational load associated with such approaches. As appreciated by the inventors, it is desirable to provide scalable 3D scene representations under diverse scalability criteria; improved techniques for 3D scene representation are therefore described herein.

The term "metadata" as used herein relates to any auxiliary information transmitted as part of an encoded bitstream that assists a decoder in rendering a decoded image or 3D scene. Such metadata may include, but is not limited to, color space or color gamut information, reference display parameters, camera parameters, neural network parameters, and the like.

The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, issues identified with respect to one or more approaches should not be assumed to have been recognized in any prior art on the basis of this section, unless otherwise indicated.

Embodiments of the present invention are illustrated by way of example, and not by way of limitation, in the accompanying drawings, in which like reference numerals refer to like elements. The drawings show, each according to an embodiment of the present invention:
- an example of an encoder for scalable 3D scene representation under a general scalability framework;
- an example of a decoder for scalable 3D scene representation under a general scalability framework;
- an example of an encoder for scalable 3D scene representation under PSNR criteria;
- an example of a decoder for scalable 3D scene representation under PSNR criteria;
- an example of an encoder for scalable 3D scene representation using a multiplane image (MPI) representation under PSNR criteria;
- an example of a decoder for scalable 3D scene representation using an MPI representation under PSNR criteria.

Example embodiments that relate to scalable 3D scene representation are described herein. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the various embodiments of the present invention. It will be apparent, however, that the various embodiments of the present invention may be practiced without these specific details.
In other instances, well-known structures and devices are not described in exhaustive detail, in order to avoid unnecessarily occluding, obscuring, or obfuscating embodiments of the present invention.

Overview

Example embodiments described herein relate to scalable 3D scene representation. In one embodiment, in an encoder, to generate a scalable 3D scene representation, a processor: accesses a first set of images (102) in a first format for a scene; generates a first 3D scene representation (107) of the scene based on the first set of images; accesses a second set of images (104) in a second format for the scene; generates a second 3D scene representation (112) of the scene based on the second set of images, wherein the second 3D scene representation is better than the first 3D scene representation according to one or more quality criteria; generates output image residuals (122) based on the first 3D scene representation and the second 3D scene representation using an original set of viewing positions and a new set of viewing positions; trains a residual neural field network (125) using the output image residuals to generate predicted residual images that approximate the output image residuals; transmits the first 3D scene representation (107) for the scene as a base layer; and transmits information for the trained residual neural field network as an enhancement layer, as sketched below.
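To make the overview concrete, here is a minimal end-to-end sketch of enhancement-layer training, assuming PyTorch, with per-view samples of $(x, y, z, \theta, \phi)$ as inputs and RGB residuals as targets. The toy MLP, the frequency positional encoding, the L2 loss, and the training hyperparameters are all illustrative assumptions rather than the patented method.

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    # Frequency encoding of (x, y, z, theta, phi) inputs; claim 13 signals an
    # encoding via nnr_position_encoding_freq[i], but this exact sin/cos form
    # is an assumption.
    out = [x]
    for k in range(num_freqs):
        out += [torch.sin((2.0 ** k) * x), torch.cos((2.0 ** k) * x)]
    return torch.cat(out, dim=-1)

class ResidualField(nn.Module):
    """Toy residual neural field: maps an encoded position/direction sample
    to a predicted RGB residual (a real system would render per ray)."""
    def __init__(self, in_dim=5, num_freqs=6, hidden=64):
        super().__init__()
        self.num_freqs = num_freqs
        enc_dim = in_dim * (1 + 2 * num_freqs)
        self.mlp = nn.Sequential(
            nn.Linear(enc_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, pose):
        return self.mlp(positional_encoding(pose, self.num_freqs))

def train_enhancement_layer(poses, residual_targets, steps=1000):
    # Fit Phi_r by minimizing sum_t D(R^t, R_hat^t); D is L2 here as an
    # illustrative choice. The trained weights form the enhancement layer.
    net = ResidualField()
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((net(poses) - residual_targets) ** 2).mean()
        loss.backward()
        opt.step()
    return net
```

An MLP over frequency-encoded coordinates is consistent with the topology hints signaled in claim 13 (hidden-layer count, activation function, nnr_position_encoding_freq[i]); the enhancement-layer bitstream would then carry the trained weights, for example in NNR-compressed form.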