US-12627803-B2 - Method and apparatus for encoding and decoding one or more views of a scene
Abstract
Methods are provided for encoding and decoding image or video data comprising two or more views (10) of a scene. The encoding method comprises obtaining (11), for each of the two or more views, a respective block segmentation mask (12) of the view and block image data (13) of the view. The method further comprises generating (14) at least one packed frame (40) containing the two or more block segmentation masks and the block image data of the two or more views; and encoding (15) the at least one packed frame into at least one bitstream (16). Each view is divided into blocks of pixels (30), and the block segmentation mask indicates which blocks of pixels belong to an area of interest (31) in the view. The block image data comprises the blocks of pixels that belong to the area of interest. Also provided are a corresponding encoder, decoder, and bitstream.
Inventors
- Christiaan Varekamp
Assignees
- KONINKLIJKE PHILIPS N.V.
Dates
- Publication Date
- 20260512
- Application Date
- 20210927
- Priority Date
- 20201002
Claims (20)
- 1 . A method comprising: obtaining a block segmentation mask and block image data of a view for each of a plurality of views; generating at least one packed frame, wherein the at least one packed frame comprises: the plurality of block segmentation masks; and the block image data of the plurality of views; and encoding the at least one packed frame into at least one bitstream, wherein each view is divided into blocks of pixels, wherein each of the block segmentation masks indicates which blocks of pixels belong to an area of interest, wherein the area of interest comprises a portion of the view, wherein the block image data comprises the blocks of pixels of the area of interest, wherein the at least one packed frame comprises a first contiguous part and a second contiguous part, wherein the first contiguous part comprises the block segmentation masks of the plurality of views; and wherein the second contiguous part comprises the block image data of the plurality of views.
- 2 . The method of claim 1 , wherein the blocks of pixels of the block image data of a view are the same size.
- 3 . The method of claim 1 , wherein the block image data comprises different views, wherein the block image data is packed in the at least one packed frame in a block-interleaved arrangement, wherein a first block of pixels of a first view is followed consecutively by a second block of pixels of a second view.
- 4 . The method of claim 1 , wherein the block image data comprises different views, wherein the block image data is packed in the at least one packed frame in a row-interleaved arrangement, wherein the blocks of pixels of a first row of a first view are followed consecutively by the blocks of pixels of a second row of a second view.
- 5 . The method of claim 1 , wherein encoding the at least one packed frame into the at least one bitstream comprises using a video compression algorithm.
- 6 . The method of claim 5 , further comprising choosing a quality factor of the video compression algorithm, wherein at least one of the block segmentation masks is reconstructable from the at least one bitstream with an error rate that is dependent upon the quality factor.
- 7 . The method of claim 5 , further comprising choosing a number of quantization levels, wherein the quantization levels are used in the video compression algorithm, wherein the block segmentation masks are reconstructable from the at least one bitstream with an error rate that is dependent upon the quantization levels.
- 8 . The method of claim 1 , further comprising: quantizing each of the block segmentation masks to a first number of quantization levels; and quantizing the block image data to a second number of quantization levels, wherein the first number is different from the second number.
- 9 . The method of claim 1 , wherein the at least one packed frame comprises a depth part, wherein the depth part comprises depth data of the plurality of views.
- 10 . A non-transitory computer-readable medium comprising a computer program that, when executed on a processor, performs the method as claimed in claim 1 .
- 11 . The method of claim 5 , further comprising choosing a number of quantization levels, wherein the quantization levels are used in the video compression algorithm, wherein the block segmentation masks are reconstructable from the at least one bitstream with an error rate that is dependent upon the quantization levels.
- 12 . A method of decoding comprising: receiving at least one bitstream, wherein the at least one bitstream comprises at least one packed frame, wherein the at least one packed frame comprises: a first contiguous part comprising a block segmentation mask of each of a plurality of views; and a second contiguous part comprising block image data of each of the plurality of views, wherein each view is divided into blocks of pixels, wherein the block image data comprises the blocks of pixels of an area of interest, wherein each of the block segmentation masks indicates the locations of the blocks of pixels of an area of interest, wherein the area of interest comprises a portion of the view; decoding the at least one bitstream so as to obtain the at least one packed frame; and reconstructing at least one of the plurality of views by arranging the block image data according to the locations.
- 13 . An encoder comprising: an input circuit, wherein the input circuit is arranged to obtain a block segmentation mask for each of a plurality of views, wherein the input circuit is arranged to obtain block image data for each of a plurality of views, wherein each view is divided into blocks of pixels, wherein each of the block segmentation masks indicates which blocks of pixels belong to an area of interest, wherein the area of interest comprises a portion of the view, wherein the block image data comprises the blocks of pixels of the area of interest; a packing circuit, wherein the packing circuit is arranged to generate at least one packed frame, wherein the at least one packed frame comprises: a first contiguous part comprising the plurality of block segmentation masks; and a second contiguous part comprising the block image data of the plurality of views; and a video encoder circuit, wherein the video encoder circuit is arranged to encode the at least one packed frame into at least one bitstream.
- 14 . The encoder of claim 13 , wherein the blocks of pixels of the block image data of a view are the same size.
- 15 . A decoder comprising: an input circuit, wherein the input circuit is arranged to receive at least one bitstream, wherein the at least one bitstream comprises at least one packed frame comprising: a first contiguous part comprising a block segmentation mask of each of a plurality of views; and a second contiguous part comprising block image data of each of the plurality of views, wherein each view is divided into blocks of pixels, wherein the block image data comprises the blocks of pixels of an area of interest, wherein each of the block segmentation masks indicates the locations of the blocks of pixels of an area of interest, wherein the area of interest comprises a portion of the view; a video decoder circuit, wherein the video decoder circuit is arranged to decode the at least one bitstream so as to obtain the at least one packed frame; and a reconstruction circuit, wherein the reconstruction circuit is arranged to reconstruct at least one of the plurality of views by arranging the block image data according to the locations.
- 16 . The decoder of claim 15 , wherein the blocks of pixels of the block image data of a view are the same size.
- 17 . A method comprising: obtaining a block segmentation mask and block image data of a view for each of a plurality of views; generating at least one packed frame, wherein the at least one packed frame comprises: a first contiguous part comprising the plurality of block segmentation masks; and a second contiguous part comprising the block image data of the plurality of views; and encoding the at least one packed frame into at least one bitstream, wherein each view is divided into blocks of pixels, wherein the block image data comprises the blocks of pixels of each area of interest in each view, wherein each area of interest comprises a portion of the view, wherein each block segmentation mask is a map of each view, wherein each block of pixels in each view is represented by a single pixel in the block segmentation mask.
- 18 . The method of claim 17 , wherein the block image data comprises different views, wherein the block image data is packed in the at least one packed frame in at least one of: a block-interleaved arrangement and a row-interleaved arrangement, wherein one or more first blocks of pixels of a first view is followed consecutively by one or more second blocks of pixels of a second view.
- 19 . A non-transitory computer-readable medium comprising a computer program that, when executed on a processor, performs the method as claimed in claim 17 .
- 20 . A method of decoding comprising: receiving at least one bitstream, wherein the at least one bitstream comprises at least one packed frame, wherein the at least one packed frame comprises: a first contiguous part comprising a block segmentation mask of each of a plurality of views; and a second contiguous part comprising block image data of each of the plurality of views, wherein each view is divided into blocks of pixels, wherein the block image data comprises the blocks of pixels of each area of interest in each view, wherein each area of interest comprises a portion of the view, wherein each block segmentation mask is a map of each view, wherein each block of pixels in each view is represented by a single pixel in the block segmentation mask; decoding the at least one bitstream so as to obtain the at least one packed frame; and reconstructing at least one of the plurality of views by arranging the block image data according to the locations indicated by the block segmentation masks.
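The packing and reconstruction steps recited in the claims above can be illustrated with a small sketch. This is a hypothetical example, not the claimed implementation: the function names, the dictionary layout of the packed frame, and the 2x2-pixel block size are all assumptions made for illustration. It shows a packed frame with a first contiguous part (the block segmentation mask, one pixel per block) and a second contiguous part (only the blocks of pixels belonging to the area of interest, in scan order), and how a decoder can place each transmitted block back at the location its mask pixel indicates.

```python
# Illustrative sketch only (assumed names and block size); not the claimed
# implementation. One view, 2x2-pixel blocks, mask has one pixel per block.

BLOCK = 2  # block size in pixels (assumption for this example)

def to_blocks(view, block=BLOCK):
    """Split a 2-D list of pixels into a row-major grid of block x block tiles."""
    h, w = len(view), len(view[0])
    return [[[row[bx:bx + block] for row in view[by:by + block]]
             for bx in range(0, w, block)]
            for by in range(0, h, block)]

def pack(view, mask, block=BLOCK):
    """First contiguous part: the mask. Second contiguous part: only the
    blocks whose mask pixel is 1 (the area of interest), in scan order."""
    grid = to_blocks(view, block)
    data = [grid[by][bx]
            for by in range(len(mask))
            for bx in range(len(mask[0]))
            if mask[by][bx]]
    return {"mask_part": mask, "data_part": data}

def reconstruct(frame, block=BLOCK, fill=0):
    """Place each transmitted block at the location its mask pixel indicates."""
    mask = frame["mask_part"]
    h, w = len(mask) * block, len(mask[0]) * block
    view = [[fill] * w for _ in range(h)]
    tiles = iter(frame["data_part"])
    for by, mask_row in enumerate(mask):
        for bx, flag in enumerate(mask_row):
            if flag:
                tile = next(tiles)
                for dy in range(block):
                    for dx in range(block):
                        view[by * block + dy][bx * block + dx] = tile[dy][dx]
    return view

view = [[1, 2, 0, 0],
        [3, 4, 0, 0],
        [0, 0, 5, 6],
        [0, 0, 7, 8]]
mask = [[1, 0],
        [0, 1]]  # one mask pixel per 2x2 block of the view
frame = pack(view, mask)
assert reconstruct(frame) == view
```

Note that only two of the four blocks are carried in the second contiguous part; the mask acts as the implicit metadata that tells the decoder where they belong.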
Description
CROSS-REFERENCE TO PRIOR APPLICATIONS

This application is the U.S. National Phase application under 35 U.S.C. § 371 of International Application No. PCT/EP2021/076447, filed on Sep. 27, 2021, which claims the benefit of EP Patent Application No. EP 20199751.7, filed on Oct. 2, 2020. These applications are hereby incorporated by reference herein.

FIELD OF THE INVENTION

The present invention relates to the coding of image or video data for one or more views of a scene. It relates particularly to methods and apparatuses for encoding and decoding video sequences for virtual reality (VR) or immersive video applications captured from multiple viewpoints.

BACKGROUND OF THE INVENTION

Virtual reality can be a very immersive way to view images or video of a scene. When using virtual reality to view captured images or video of a scene, multiple cameras are usually required to capture many views of the scene from varying angles to allow the viewer to move around within the virtual reality scene. The more views that are captured from different angles, the more freedom the viewer has to move within the virtual reality scene, and the more accurate rendered views of the scene can be. However, increasing the number of views that are captured increases the amount of data that must be processed and transmitted. For a limited bandwidth, this can reduce the image or video quality of the virtual reality scene experienced by the viewer, as the data must be more highly compressed.

Multiple views of a scene are often encoded together with metadata that indicates to the decoder how to recover the original views. Efficient encoding often requires computationally expensive determination steps and causes latency, as the transmission of data to the viewer is delayed. There may be a trade-off between efficiency (in terms of bitrate or pixel rate for a given bandwidth) and latency.
For live-streamed video, latency is a particular concern: the viewer wants to experience the virtual reality scene without delay, particularly in two-way streaming scenarios such as video conferencing.

SUMMARY OF THE INVENTION

It would be desirable to encode and decode one or more views of a scene efficiently, in terms of both computational effort and data rate (bandwidth). The invention is defined by the claims.

According to an aspect of the invention, there is provided a method of encoding image or video data, according to claim 1. For each view, the block segmentation mask indicates the locations of the blocks of pixels that belong to the area of interest. There may be more than one area of interest in any given view.

Embodiments of the method can facilitate simple and low-latency encoding of multi-view video. The block segmentation masks can, in effect, provide implicit metadata that allows a decoder to reconstruct one or more of the views quickly and easily from the at least one packed frame. Meanwhile, the pixel rate can be reduced, because only a part of each view (namely, the area of interest) is encoded and transmitted.

In some embodiments, the at least one packed frame may be a single packed frame. The at least one bitstream may be a single bitstream.

For each block of pixels in a view, there may be a corresponding pixel in the block segmentation mask that indicates whether or not the block of pixels belongs to the area of interest. Thus, there may be as many pixels in the block segmentation mask as there are blocks of pixels in the respective view.

In some embodiments, there may be more than one pixel in the block segmentation mask corresponding to a block of pixels in a view. For example, a block of pixels in a view may have a corresponding block of pixels in the block segmentation mask that indicates whether or not the block of pixels in the view belongs to the area of interest.
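The one-pixel-per-block relationship described above can be sketched as follows. This is an assumption-laden illustration rather than anything taken from the claims: the function name, the criterion that a single foreground pixel marks a whole block as belonging to the area of interest, and the 2x2-pixel block size are all hypothetical choices for the example.

```python
# Hypothetical sketch: derive a block segmentation mask (one mask pixel per
# block of the view) from a per-pixel foreground map. Names, block size, and
# the "any foreground pixel" criterion are assumptions for illustration.

BLOCK = 2  # block size in pixels (assumption for this example)

def block_segmentation_mask(foreground, block=BLOCK):
    """One mask pixel per block x block tile of the view: 1 if any pixel of
    the tile is foreground (area of interest), else 0."""
    h, w = len(foreground), len(foreground[0])
    return [[int(any(foreground[y][x]
                     for y in range(by, by + block)
                     for x in range(bx, bx + block)))
             for bx in range(0, w, block)]
            for by in range(0, h, block)]

foreground = [[0, 0, 0, 0],
              [0, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]]
mask = block_segmentation_mask(foreground)
# A 4x4 view yields a 2x2 mask: as many mask pixels as there are blocks.
assert mask == [[1, 0],
                [0, 1]]
```

The resulting mask is exactly the kind of per-block map that can be carried in the first contiguous part of the packed frame.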
The block of pixels in the block segmentation mask may be smaller than the block of pixels in a view.

In each block segmentation mask, each pixel may comprise a pixel value indicating whether or not the corresponding block of pixels is part of the area of interest. The pixel value may be a luminance value, or another pixel value, such as a chrominance, depth, or transparency value. A pixel value used to indicate blocks belonging to an area of interest may be separated from a pixel value used to indicate blocks not in the area of interest by unused levels. The unused levels can create robustness to small deviations in the pixel value that may be introduced by applying traditional lossy video compression techniques to the packed frame. Provided that the resulting ranges of pixel values remain distinct and separable, in spite of such deviations, it may be possible to reconstruct the block segmentation mask without error at the decoder.

In some embodiments, there may be more than one area of interest. The pixel values of the block segmentation mask may act as indices for the areas of interest. For example, a first area of interest may be labelled, in the block segmentation mask,