EP-4742676-A1 - ENCODING AND DECODING A REPRESENTATION OF A 3D SCENE
Abstract
The invention provides a bitstream, an encoding apparatus and method for encoding the bitstream, and a decoding apparatus and method for decoding the bitstream. The bitstream comprises a first plurality of encoded data structures and at least one further plurality of encoded data structures. Each data entry in each encoded data structure of the first plurality comprises parameters for a respective 3D Gaussian splat in a first set of 3D Gaussian splats representing a 3D scene at a first moment in time. Each data entry in each encoded data structure of each further plurality comprises parameters for a respective 3D Gaussian splat in a further set of 3D Gaussian splats representing the 3D scene at a further moment in time. Each data structure in each further plurality corresponds to a data structure in the first plurality. Corresponding data structures have the same number of data entries as one another.
Inventors
- VAREKAMP, CHRISTIAAN
- KROON, BART
Assignees
- Koninklijke Philips N.V.
Dates
- Publication Date: 2026-05-13
- Application Date: 2024-11-11
Claims (15)
- A method (800) for decoding a dynamic representation of a 3D scene, the method comprising: receiving a bitstream (915) comprising: a first plurality of encoded data structures (500) for a first frame (210, 610, 710) of a 3D scene including a first data structure and a second data structure, wherein each data entry in each encoded data structure in the first plurality comprises a set of parameters for a respective 3D Gaussian splat in a first set of 3D Gaussian splats (215) representing the 3D scene at a first moment in time; and a second plurality of encoded data structures for a second frame (220, 620, 720) of the 3D scene, wherein each data entry in each encoded data structure in the second plurality comprises a set of parameters for a respective 3D Gaussian splat in a second set of 3D Gaussian splats (225) representing the 3D scene at a second, different moment in time, and wherein each data structure in the second plurality corresponds to a respective data structure in the first plurality and has the same number of data entries as the corresponding data structure in the first plurality; determining the number of data structures per frame and the number of data entries in each data structure; decoding a first subset of data structures in the first plurality of encoded data structures; and decoding a second subset of data structures in the second plurality of encoded data structures, wherein, for each data structure in the second subset, the corresponding data structure is included in the first subset.
- The method (800) of claim 1, wherein: the first data structure in the first subset is decoded by a first decoder; the second data structure in the first subset is decoded by a second, different decoder; and each data structure in the second subset is decoded by the decoder that decoded the corresponding data structure in the first subset.
- The method (800) of claim 1 or 2, wherein: at least one data structure corresponds to a region of the 3D scene; the method further comprises defining a target viewport for each frame; and for any data structure that corresponds to a region of the 3D scene, the data structure is only included in the first or second subset if the data structure corresponds to a region of the 3D scene that is determined to be visible in the target viewport and/or is identified as being required for decoding a data structure corresponding to a region of the 3D scene that is visible in the target viewport.
- The method (800) of any of claims 1 to 3, wherein the method comprises: defining a target viewport for each frame; processing at least some of the decoded data structures in the first subset to render a first image frame in the target viewport for the first frame; and processing at least some of the decoded data structures in the second subset to render a second image frame in the target viewport for the second frame.
- The method (800) of claim 4, wherein: the first data structure in the first subset is rendered by a first renderer; the second data structure in the first subset is rendered by a second, different renderer; and each data structure in the second subset is rendered by the renderer that rendered the corresponding data structure in the first subset.
- The method (800) of claim 4 or 5, wherein: at least one data structure corresponds to a region of the 3D scene; and for each image frame, a data structure that corresponds to a region of the 3D scene is processed to render the image frame only if the data structure is determined to correspond to a region of the 3D scene that is visible in the target viewport.
- The method (800) of any of claims 1 to 6, wherein: at least one data structure in the first plurality contains a reference data entry and at least one further data entry, wherein the set of parameters for each further data entry is a set of first residual parameters; and the step of decoding the first subset of data structures in the first plurality of encoded data structures comprises, for each data structure containing a reference data entry and at least one further data entry: identifying the reference data entry; and for each further data entry, adding the set of first residual parameters to the set of parameters in the reference data entry.
- The method (800) of any of claims 1 to 7, wherein: at least one set of parameters in at least one data structure in the second plurality is a set of second residual parameters; and the step of decoding the second subset of data structures in the second plurality of encoded data structures comprises, for each data entry having a set of second residual parameters, adding the set of second residual parameters to the set of parameters for the corresponding data entry in the corresponding data structure in the first plurality.
- The method (800) of claim 8, wherein: the bitstream (915) further comprises, for at least one data structure in the second plurality, at least one shared residual parameter; and the step of decoding the second subset of data structures in the second plurality of encoded data structures comprises, for each data structure having at least one shared residual parameter, transforming the set of parameters for each data entry in the corresponding data structure in the first plurality based on the at least one shared residual parameter before adding the set of second residual parameters.
- A method (100) for encoding a dynamic representation of a 3D scene, the method comprising: obtaining a sequence of frames (200) of a 3D scene, the sequence comprising: a first frame (210, 610, 710) including a first set of 3D Gaussian splats (215) representing the 3D scene at a first moment in time; and a second frame (220, 620, 720) including a second set of 3D Gaussian splats (225) representing the 3D scene at a second, different moment in time; for the first frame: defining a first plurality of data structures (500); assigning each 3D Gaussian splat in the first set to one of the data structures in the first plurality; for each 3D Gaussian splat in the first set, adding a set of parameters for the 3D Gaussian splat to the data structure to which the 3D Gaussian splat is assigned; and encoding, into a bitstream (915), each data structure in the first plurality; and for the second frame: defining a second plurality of data structures, wherein each data structure in the second plurality corresponds to a respective data structure in the first plurality, and wherein each data structure in the second plurality has the same number of data entries as the corresponding data structure in the first plurality; assigning each 3D Gaussian splat in the second set to one of the data structures in the second plurality; for each 3D Gaussian splat in the second set, adding a set of parameters for the 3D Gaussian splat to the data structure to which the 3D Gaussian splat is assigned; and encoding, into the bitstream, each data structure in the second plurality.
- The method (100) of claim 10, wherein any 3D Gaussian splat in the second set (225) that corresponds to a 3D Gaussian splat in the first set (215) is assigned to a corresponding data entry in the data structure in the second plurality that corresponds to the data structure in the first plurality to which the corresponding 3D Gaussian splat in the first set was assigned.
- The method (100) of claim 10 or 11, wherein at least one data structure corresponds to a region of the 3D scene.
- The method (100) of claim 12, wherein at least one data structure corresponds to a region defining an object in the 3D scene.
- The method (100) of any of claims 10 to 13, wherein: the first frame (210, 610, 710) and the second frame (220, 620, 720) each include at least one further 3D Gaussian splat; and the method further comprises: identifying which of the 3D Gaussian splats included in the first frame form the first set of 3D Gaussian splats (215); and identifying which of the 3D Gaussian splats included in the second frame form the second set of 3D Gaussian splats (225).
- A bitstream (915) comprising: a first plurality of encoded data structures (500) for a first frame (210, 610, 710) of a 3D scene, wherein each data entry in each encoded data structure in the first plurality comprises a set of parameters for a respective 3D Gaussian splat in a first set of 3D Gaussian splats (215) representing the 3D scene at a first moment in time; and a second plurality of encoded data structures for a second frame (220, 620, 720) of the 3D scene, wherein each data entry in each encoded data structure in the second plurality comprises a set of parameters for a respective 3D Gaussian splat in a second set of 3D Gaussian splats (225) representing the 3D scene at a second, different moment in time, and wherein each data structure in the second plurality corresponds to a respective data structure in the first plurality and has the same number of data entries as the corresponding data structure in the first plurality.
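The correspondence between data structures across frames (claims 1, 10 and 15) and the residual decoding of claim 8 can be illustrated with a minimal sketch. The flat list-of-lists layout, the function name, and the example values below are assumptions for illustration only, not the bitstream syntax defined by the patent.

```python
from typing import List

# A "data structure" is modeled here as a list of data entries, each entry
# being a flat list of splat parameters. This layout is a hypothetical
# illustration, not the encoded format defined by the patent.
DataStructure = List[List[float]]

def decode_second_frame(first: DataStructure,
                        residuals: DataStructure) -> DataStructure:
    """Reconstruct second-frame parameters in the spirit of claim 8: add
    each entry's residual parameters to the parameters of the corresponding
    entry in the corresponding first-frame data structure. Corresponding
    data structures must have the same number of data entries."""
    if len(first) != len(residuals):
        raise ValueError("corresponding data structures must have the "
                         "same number of data entries")
    return [[p + r for p, r in zip(entry, res)]
            for entry, res in zip(first, residuals)]

# Example: one data structure with two entries of three parameters each.
first_frame = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
second_residuals = [[1.0, 0.0, -1.0], [0.0, 0.5, 0.0]]
second_frame = decode_second_frame(first_frame, second_residuals)
# second_frame == [[2.0, 2.0, 2.0], [4.0, 5.5, 6.0]]
```

Because corresponding data structures have the same number of data entries, the output arrays can be sized before any decoding takes place, which is the property the description relies on for memory pre-allocation.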
Description
FIELD OF THE INVENTION

The invention relates to the field of representations of 3D scenes and, in particular, to encoding and decoding 3D Gaussian splat data.

BACKGROUND OF THE INVENTION

Six-degrees-of-freedom (6DoF) immersive video allows a scene to be viewed from different positions and orientations. The creation of 6DoF immersive video uses multiple cameras to capture images of a scene from different viewpoints. The captured images are then processed to generate an image at a new viewpoint.

Various approaches have been used to generate an image at a new viewpoint. For instance, images of a scene from different viewpoints may be processed to estimate a depth map for each image, and the depth maps may be used to project each point in the 3D scene to the imaging plane of a virtual camera at the new viewpoint. This method can lead to low-quality images if there are inaccuracies in the depth estimation process (caused, for example, by light reflection or a lack of local variation in color).

Another technique involves the use of a neural radiance field (NeRF) algorithm to model the scene. This approach provides high-quality images of scenes from new viewpoints, but is much less efficient than traditional depth estimation methods. A NeRF algorithm is trained for a specific scene using images of the scene from different viewpoints as ground truth; this process can take up to several hours, while rendering a new image using the trained NeRF algorithm typically takes up to several seconds per frame.

3D Gaussian Splatting is a more recent technique for representing a 3D scene, first described in B. Kerbl et al. (2023), "3D Gaussian Splatting for Real-Time Radiance Field Rendering", ACM Transactions on Graphics, 42(4):139(1-14). In this technique, the 3D scene is represented by a set of 3D Gaussian splats, the parameters of which are optimized to generate a 3D representation of the scene. This provides high-quality images that can be rendered much faster than NeRF-based methods.
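For orientation, the per-splat parameter set optimized in the cited Kerbl et al. paper can be sketched as a simple record. The field names and list-based types below are illustrative assumptions, not definitions from the patent or the cited paper.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GaussianSplat:
    # All field names are hypothetical; the cited paper optimizes
    # comparable quantities for each splat in the scene.
    position: List[float]   # (x, y, z) center of the Gaussian
    scale: List[float]      # per-axis scale of the covariance ellipsoid
    rotation: List[float]   # orientation quaternion (w, x, y, z)
    opacity: float          # alpha value used when blending splats
    sh_coeffs: List[float]  # spherical-harmonic color coefficients

# A single example splat with arbitrary illustrative values.
splat = GaussianSplat(position=[0.0, 1.0, 2.0],
                      scale=[0.1, 0.1, 0.2],
                      rotation=[1.0, 0.0, 0.0, 0.0],
                      opacity=0.8,
                      sh_coeffs=[0.5, 0.5, 0.5])
```

A scene is then a large collection of such records, and the claims concern how collections of these parameter sets are grouped into data structures and encoded frame by frame.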
Various methods have been proposed for encoding 3D Gaussian Splatting data for a static scene; however, these methods are time-consuming and are therefore not well-suited to encoding 3D Gaussian Splatting data for a dynamic scene.

SUMMARY OF THE INVENTION

The invention is defined by the claims. According to examples in accordance with an aspect of the invention, there is provided a method for decoding a dynamic representation of a 3D scene, the method comprising: receiving a bitstream comprising: a first plurality of encoded data structures for a first frame of a 3D scene including a first data structure and a second data structure, wherein each data entry in each encoded data structure in the first plurality comprises a set of parameters for a respective 3D Gaussian splat in a first set of 3D Gaussian splats representing the 3D scene at a first moment in time; and a second plurality of encoded data structures for a second frame of the 3D scene, wherein each data entry in each encoded data structure in the second plurality comprises a set of parameters for a respective 3D Gaussian splat in a second set of 3D Gaussian splats representing the 3D scene at a second, different moment in time, and wherein each data structure in the second plurality corresponds to a respective data structure in the first plurality and has the same number of data entries as the corresponding data structure in the first plurality; determining the number of data structures per frame and the number of data entries in each data structure; decoding a first subset of data structures in the first plurality of encoded data structures; and decoding a second subset of data structures in the second plurality of encoded data structures, wherein, for each data structure in the second subset, the corresponding data structure is included in the first subset.
Having the same number of data structures per frame and a fixed number of data entries in corresponding data structures across the frames, and determining the number of data structures per frame and the number of data entries in each data structure prior to decoding the data structures, ensures that only decoders capable of decoding the bitstream are used, and enables the memory for the parameters of the 3D Gaussian splats to be pre-allocated on a CPU and GPU of a decoding apparatus. The ability to pre-allocate memory for parameters of 3D Gaussian splats allows performance tests to be carried out, enabling a balance to be struck between power consumption and image quality. In this way, the decoding apparatus is able to use a "once initialized" data structure, rather than a dynamic one, thus improving the performance of the decoding apparatus.

The number of data structures per frame may be defined in the bitstream or by external means. Similarly, the number of data entries in each data structure may be defined in the bitstream or by external means.

In some examples, the first data structure in the first subset is decoded by a first deco