EP-3777222-B1 - METHOD, APPARATUS AND STREAM FOR VOLUMETRIC VIDEO FORMAT
Inventors
- Fleureau, Julien
- Chupeau, Bertrand
- Tapie, Thierry
- Thudor, Franck
Dates
- Publication Date
- 20260506
- Application Date
- 20190327
Claims (15)
- A method of encoding data representative of a 3D scene (10), the method comprising: - encoding (201), into at least a first track, a first texture image (40) obtained by projecting points of the 3D scene visible from a first viewpoint (20), the first texture image being arranged in a plurality of first tiles (81 to 88), a part of the 3D scene being associated with each first tile; - for each first tile: • obtaining a group of patches, a patch being a 2D parametrization of a group of 3D points, consistent in depth, obtained by projecting points of the part of the 3D scene associated with the first tile on a picture visible from a second viewpoint located in a space of view (11) centred on the first viewpoint (20), the 2D parametrization encoding a distance between the second viewpoint and the projected points; • arranging patches of the group of patches in at least one second tile of a second image (130, 151, 152), the at least one second tile being associated with the first tile; wherein the total number of second tiles of the second image is greater than the total number of first tiles of the first image; - encoding the second image into at least a second track; and - encoding, into at least a third track, at least an instruction to extract at least a part of the first image from the at least a first track and of the second image from the at least a second track.
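The core of claim 1 — grouping patches per first tile and packing them into second tiles — can be illustrated with a minimal sketch. All names (`Patch`, `Tile`, `arrange`) and the shelf-packing heuristic are assumptions for illustration only; the claim does not prescribe any particular packing algorithm.

```python
from dataclasses import dataclass, field

@dataclass
class Patch:
    """Illustrative stand-in for a 2D parametrization of a
    depth-consistent group of 3D points (hypothetical, not patent text)."""
    width: int
    height: int
    depth_data: list  # per-pixel distance from the second viewpoint

@dataclass
class Tile:
    width: int
    height: int
    patches: list = field(default_factory=list)
    cursor_x: int = 0   # state for a naive shelf-packing heuristic
    cursor_y: int = 0
    shelf_h: int = 0

    def try_place(self, patch: Patch) -> bool:
        """Place a patch with simple shelf packing; return False if it
        does not fit in this tile."""
        if self.cursor_x + patch.width > self.width:
            # current shelf is full: start a new shelf below it
            self.cursor_x = 0
            self.cursor_y += self.shelf_h
            self.shelf_h = 0
        if (patch.width > self.width or
                self.cursor_y + patch.height > self.height):
            return False
        self.patches.append((self.cursor_x, self.cursor_y, patch))
        self.cursor_x += patch.width
        self.shelf_h = max(self.shelf_h, patch.height)
        return True

def arrange(patches, tile_size):
    """Arrange one first tile's patch group into one or more second tiles."""
    tiles = [Tile(*tile_size)]
    for p in patches:
        if not tiles[-1].try_place(p):
            tiles.append(Tile(*tile_size))
            if not tiles[-1].try_place(p):
                raise ValueError("patch larger than a second tile")
    return tiles
```

Because each first tile may produce more patches than fit in one second tile, `arrange` opens additional second tiles as needed — which is consistent with the claim's requirement that the second image contain more tiles than the first.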
- A device (19) configured to encode data representative of a 3D scene (10), the device comprising a memory (194) associated with at least one processor (192) configured to: - encode, into at least a first track, a first texture image (40) obtained by projecting points of the 3D scene visible from a first viewpoint (20), the first texture image being arranged in a plurality of first tiles (81 to 88), a part of the 3D scene being associated with each first tile; - for each first tile: • obtain a group of patches, a patch being a 2D parametrization of a group of 3D points, consistent in depth, obtained by projecting points of the part of the 3D scene associated with the first tile on a picture visible from a second viewpoint located in a space of view (11) centred on the first viewpoint (20), the 2D parametrization encoding depth data representative of a distance between the second viewpoint and the projected points; • arrange patches of the group of patches in at least one second tile of a second image (130, 151, 152), the at least one second tile being associated with the first tile; wherein the total number of second tiles of the second image is greater than the total number of first tiles of the first image; - encode the second image into at least a second track; and - encode, into at least a third track, at least an instruction to extract at least a part of the first image from the at least a first track and of the second image from the at least a second track.
- The method according to claim 1 or the device according to claim 2, wherein said each patch further comprises third data representative of texture information associated with parts of the 3D points of the group viewed from the second viewpoint, the third data being encoded into said at least a second track.
- The method according to claim 1 or the device according to claim 2, further comprising: - for each first tile: • obtaining a group of patches, a patch being obtained by projecting a part of points of the part of the 3D scene associated with the first tile and viewed from the second viewpoint on a picture encoding texture data of the projected points; • arranging patches of the group of patches in at least one third tile of a third image; and - encoding the third image in at least a fourth track.
- The method according to one of claims 1, 3 and 4 or the device according to one of claims 2 to 4, wherein when a patch is bigger than a second tile into which the patch is to be arranged, then the patch is partitioned into a plurality of sub-patches smaller than the second tile.
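The sub-patch partitioning of claim 5 can be sketched as a simple grid split: an oversized patch is cut into rectangles no larger than the target second tile. The function name and return format are illustrative assumptions, not part of the claims.

```python
def split_patch(patch_w, patch_h, tile_w, tile_h):
    """Partition an oversized patch into a grid of sub-patch rectangles
    (x, y, w, h), each fitting within a tile of size tile_w x tile_h.
    Illustrative helper only; the claim does not fix a partition scheme."""
    sub = []
    y = 0
    while y < patch_h:
        h = min(tile_h, patch_h - y)  # clamp the last row of sub-patches
        x = 0
        while x < patch_w:
            w = min(tile_w, patch_w - x)  # clamp the last column
            sub.append((x, y, w, h))
            x += w
        y += h
    return sub
```

The sub-patches tile the original patch exactly, so no projected points are lost in the split.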
- The method according to one of claims 1, 3, 4 and 5 or the device according to one of claims 2 to 5, wherein patches are arranged according to a priority order depending on a visual importance of the patches, the visual importance of a patch depending on the points of the 3D scene projected on the patch.
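Claim 6 only requires that patches be arranged in an order reflecting their visual importance, which depends on the projected points. A minimal sketch of such an ordering follows; the specific metric (projected area weighted by inverse mean depth) is an assumption chosen for illustration, not a metric stated in the patent.

```python
def by_visual_importance(patches):
    """Sort (area, mean_depth) patch descriptors so the most visually
    important are packed first. The importance metric here is a
    hypothetical example: larger and closer patches rank higher."""
    def importance(p):
        area, mean_depth = p  # pixel area, mean distance to the viewpoint
        return area / max(mean_depth, 1e-6)  # guard against zero depth
    return sorted(patches, key=importance, reverse=True)
```

Packing important patches first means that, if space in the second tiles runs out, only the least visually significant patches are affected.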
- The method according to one of claims 1 and 3 to 6 or the device according to one of claims 2 to 6, wherein the second tiles have a same size that is fixed for a plurality of temporally successive second images.
- A method of decoding data representative of a 3D scene (10), the method comprising: - decoding, from at least a third track, at least an instruction to extract a first image from at least a first track and a second image from at least a second track; - decoding the first image from the at least a first track, the first image being obtained by projecting points of the 3D scene visible from a first viewpoint (20), the first image being arranged in a plurality of first tiles (81 to 88), a part of the 3D scene being associated with each first tile; - decoding the second image from the at least a second track, the second image being arranged in a plurality of second tiles, a second tile being associated with a first tile and comprising patches, a patch of a second tile being a 2D parametrization of a group of 3D points, consistent in depth, obtained by projecting points of the part of the 3D scene associated with the first tile and visible from a second viewpoint located in a space of view (11) centred on the first viewpoint (20), the 2D parametrization encoding a distance between the second viewpoint and the projected points.
- A device (19) configured for decoding data representative of a 3D scene, the device comprising a memory (194) associated with at least one processor (192) configured to: - decode, from at least a third track, at least an instruction to extract a first image from at least a first track and a second image from at least a second track; - decode the first image from the at least a first track, the first image being obtained by projecting points of the 3D scene visible from a first viewpoint (20), the first image being arranged in a plurality of first tiles (81 to 88), a part of the 3D scene being associated with each first tile; - decode the second image from the at least a second track, the second image being arranged in a plurality of second tiles, a second tile being associated with a first tile and comprising patches, a patch of a second tile being a 2D parametrization of a group of 3D points, consistent in depth, obtained by projecting points of the part of the 3D scene associated with the first tile and visible from a second viewpoint located in a space of view (11) centred on the first viewpoint (20), the 2D parametrization encoding a distance between the second viewpoint and the projected points.
- The method according to claim 8 or the device according to claim 9, wherein third data representative of texture information associated with parts of the 3D points of the group comprised in said each patch viewed from the second viewpoint is further decoded from the at least a second track.
- The method according to claim 8 or the device according to claim 9, comprising decoding a third image from at least a fourth track, the third image being arranged in a plurality of third tiles, a third tile being associated with a first tile and comprising patches, a patch of a third tile being obtained by projecting a part of points of the part of the 3D scene associated with the first tile when viewed from the second viewpoint on a picture encoding texture data of the projected points.
- The method according to one of claims 8, 10 and 11 or the device according to one of claims 9 to 11, wherein at least a part of the 3D scene is rendered according to the first and second image.
- A bitstream carrying data representative of a 3D scene, the data comprising, - in at least a first track, a first image being obtained by projecting points of the 3D scene visible from a first viewpoint, the first image being arranged in a plurality of first tiles, a part of the 3D scene being associated with each first tile; - in at least a second track, a second image arranged in a plurality of second tiles, a second tile being associated with a first tile and comprising patches, a patch of a second tile being a 2D parametrization of a group of 3D points, consistent in depth, obtained by projecting points of the part of the 3D scene associated with the first tile associated with the second tile visible from a second viewpoint located in a space of view (11) centred on the first viewpoint (20), the 2D parametrization encoding a distance between the second viewpoint and the projected points; and - in at least a third syntax element, at least an instruction to extract at least a part of the first image from the first track and of the second image from the second track.
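The claimed bitstream layout — tiled first image, tiled second image, and an extractor element pointing into both — can be mirrored by a small container sketch. All field names and the `(index, kind)` instruction encoding are hypothetical conveniences, not a real file-format API or the patent's syntax.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class VolumetricStream:
    """Illustrative container mirroring the claimed track layout."""
    first_tracks: List[bytes]    # tiled first (texture) image payloads
    second_tracks: List[bytes]   # tiled second (patch/depth) image payloads
    extractor: List[Tuple[int, str]]  # instructions: (track index, "first"/"second")

    def extract(self) -> List[bytes]:
        """Follow the extractor instructions to gather the parts of the
        first and second images referenced by the third syntax element."""
        parts = []
        for idx, kind in self.extractor:
            src = self.first_tracks if kind == "first" else self.second_tracks
            parts.append(src[idx])
        return parts
```

A decoder targeting 3DoF rendering could follow only the instructions pointing at first tracks, while a volumetric decoder follows both — which is the dual-decodability the description calls for.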
- A non-transitory processor readable medium having stored therein instructions for causing a processor to perform at least the steps of the method according to one of claims 1 and 3 to 7.
- A non-transitory processor readable medium having stored therein instructions for causing a processor to perform at least the steps of the method according to one of claims 8 and 10 to 12.
Description
1. Technical field
The present disclosure relates to the domain of volumetric video content. The present disclosure is also understood in the context of the encoding and/or the formatting of the data representative of the volumetric content, for example for the rendering on end-user devices such as mobile devices or Head-Mounted Displays.
2. Background
This section is intended to introduce the reader to various aspects of art, which may be related to various aspects of the present disclosure that are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, these statements are to be read in this light, and not as admissions of prior art. Recently there has been a growth of available large field-of-view content (up to 360°). Such content is potentially not fully visible by a user watching the content on immersive display devices such as Head Mounted Displays, smart glasses, PC screens, tablets, smartphones and the like. That means that at a given moment, a user may only be viewing a part of the content. However, a user can typically navigate within the content by various means such as head movement, mouse movement, touch screen, voice and the like. It is typically desirable to encode and decode this content. Immersive video, also called 360° flat video, allows the user to watch all around himself through rotations of his head around a still point of view. Rotations only allow a 3 Degrees of Freedom (3DoF) experience. Even if 3DoF video is sufficient for a first omnidirectional video experience, for example using a Head-Mounted Display device (HMD), 3DoF video may quickly become frustrating for the viewer who would expect more freedom, for example by experiencing parallax.
In addition, 3DoF may also induce dizziness because a user never only rotates his head but also translates it in three directions, and these translations are not reproduced in 3DoF video experiences. A large field-of-view content may be, among others, a three-dimensional computer graphic imagery scene (3D CGI scene), a point cloud or an immersive video. Many terms might be used to designate such immersive videos: Virtual Reality (VR), 360, panoramic, 4π steradians, immersive, omnidirectional or large field of view, for example. Volumetric video (also known as 6 Degrees of Freedom (6DoF) video) is an alternative to 3DoF video. When watching a 6DoF video, in addition to rotations, the user can also translate his head, and even his body, within the watched content and experience parallax and even volumes. Such videos considerably increase the feeling of immersion and the perception of the scene depth and prevent dizziness by providing consistent visual feedback during head translations. The content is created by means of dedicated sensors allowing the simultaneous recording of color and depth of the scene of interest. The use of a rig of color cameras combined with photogrammetry techniques is a common way to perform such a recording. While 3DoF videos comprise a sequence of images resulting from the un-mapping of texture images (e.g. spherical images encoded according to latitude/longitude projection mapping or equirectangular projection mapping), 6DoF video frames embed information from several points of view. They can be viewed as a temporal series of point clouds resulting from a three-dimensional capture. Two kinds of volumetric videos may be considered depending on the viewing conditions. A first one (i.e. complete 6DoF) allows a complete free navigation within the video content whereas a second one (aka 3DoF+) restricts the user viewing space to a limited volume, allowing limited translation of the head and parallax experience.
This second context is a valuable trade-off between free navigation and the passive viewing conditions of a seated audience member. 3DoF videos may be encoded in a stream as a sequence of rectangular color images generated according to a chosen projection mapping (e.g. cubical projection mapping, pyramidal projection mapping or equirectangular projection mapping). This encoding has the advantage of making use of standard image and video processing standards. 3DoF+ and 6DoF videos require additional data to encode the depth of colored points of point clouds. The kind of rendering (i.e. 3DoF or volumetric rendering) for a volumetric scene is not known a priori when encoding the scene in a stream. To date, streams are encoded for one kind of rendering or the other. There is a lack of a stream, and associated methods and devices, that can carry data representative of a volumetric scene that can be encoded at once and decoded either as a 3DoF video or as a volumetric video (3DoF+ or 6DoF). Moreover, the amount of data to be transported, e.g. for the rendering on end-user devices, may be very large, significantly increasing the needs in bandwidth ov
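The equirectangular projection mapping mentioned above is the standard longitude/latitude mapping of a viewing direction onto a rectangular image. A minimal sketch, assuming a y-up coordinate frame and image columns spanning the full longitude range:

```python
import math

def equirect_project(x, y, z, width, height):
    """Map a 3D viewing direction onto equirectangular pixel coordinates:
    longitude -> column (u), latitude -> row (v). Coordinate conventions
    (y-up, z-forward) are an assumption for this sketch."""
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)            # longitude in [-pi, pi]
    lat = math.asin(y / norm)         # latitude in [-pi/2, pi/2]
    u = (lon / (2 * math.pi) + 0.5) * width   # wrap longitude to [0, width]
    v = (0.5 - lat / math.pi) * height        # top row = north pole
    return u, v
```

The forward direction (0, 0, 1) lands at the image center, which is the usual convention for 360° flat video.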