US-20260129233-A1 - MOTION-COMPENSATED COMPRESSION OF DYNAMIC VOXELIZED POINT CLOUDS
Abstract
Disclosed herein are exemplary embodiments of innovations in the area of point cloud encoding and decoding. Example embodiments can reduce the computational complexity and/or computational resource usage during 3D video encoding by selectively encoding one or more 3D-point-cloud blocks using an inter-frame coding (e.g., motion compensation) technique that allows for previously encoded/decoded frames to be used in predicting current frames being encoded. Alternatively, one or more 3D-point-cloud blocks can be encoded using an intra-frame encoding approach. The selection of which encoding mode to use can be based, for example, on a threshold that is evaluated relative to rate-distortion performance for both intra-frame and inter-frame encoding. Still further, embodiments of the disclosed technology can use one or more voxel-distortion-correction filters to correct distortion errors that may occur during voxel compression. Such filters are uniquely adapted for the particular challenges presented when compressing 3D image data. Corresponding decoding techniques are also disclosed.
Inventors
- Philip A. Chou
- Ricardo de Queiroz
Assignees
- Microsoft Technology Licensing, LLC
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-29
Claims (20)
- 1 . One or more computer-readable media having programmed thereon encoded data in a bitstream for at least part of a sequence of three-dimensional (“3D”) video frames, the encoded data for the at least part of the sequence including encoded data for one or more occupied 3D-point-cloud blocks in a current 3D video frame among the 3D video frames of the sequence, wherein each of the one or more occupied 3D-point-cloud blocks includes multiple voxels of the current 3D video frame, at least one of the multiple voxels in each of the one or more occupied 3D-point-cloud blocks being an occupied voxel, wherein a given occupied 3D-point-cloud block, among the one or more occupied 3D-point-cloud blocks, is an x×y×z region of voxels, each of x, y, and z being an integer value greater than or equal to 2, the encoded data for the one or more occupied 3D-point-cloud blocks including one or more syntax elements signaling a mode for the one or more occupied 3D-point-cloud blocks, wherein the signaled mode is one of multiple available modes, the multiple available modes including an intra-frame mode and an inter-frame mode, the encoded data for the at least part of the sequence being usable to cause a video decoder, when processing the encoded data for the at least part of the sequence in a computer system having one or more processing units, to perform operations that include: based on the one or more syntax elements signaling the mode, determining that the signaled mode for the given occupied 3D-point-cloud block is the inter-frame mode, the given occupied 3D-point-cloud block being decoded using inter-frame prediction according to the inter-frame mode as the signaled mode; based at least in part on the signaled mode for the given occupied 3D-point-cloud block being the inter-frame mode, determining difference information for the given occupied 3D-point-cloud block, the difference information indicating, in 3D space, differences of the given occupied 3D-point-cloud block with 
respect to a predicted 3D-point-cloud block determined from a reference 3D video frame stored in a reference frame buffer; and based at least in part on the signaled mode for the given occupied 3D-point-cloud block being the inter-frame mode, applying the difference information for the given occupied 3D-point-cloud block to the predicted 3D-point-cloud block determined from the reference 3D video frame stored in the reference frame buffer.
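The inter-frame decoding operations recited in claim 1 can be sketched as follows. This is a minimal illustration only, not the claimed implementation: the function name, the dense-array frame representation, and the additive residual model are all assumptions (the claim says only that difference information is "applied" to the predicted block).

```python
import numpy as np

def reconstruct_inter_block(reference, residual, block_origin, motion, n):
    """Decode one occupied 3D-point-cloud block in inter-frame mode:
    locate the predicted n*n*n block in the reference 3D video frame
    (displaced by the decoded motion vector), then apply the difference
    information. Additive application is an assumption for illustration."""
    x, y, z = (o + d for o, d in zip(block_origin, motion))
    predicted = reference[x:x + n, y:y + n, z:z + n]
    return predicted + residual
```

In an actual decoder the reference frame would come from the reference frame buffer and the motion vector and residual from the parsed bitstream.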
- 2 . The one or more computer-readable media of claim 1 , wherein the signaled mode is: signaled for the current 3D video frame; or signaled on a block-by-block basis for the one or more occupied 3D-point-cloud blocks, respectively.
- 3 . The one or more computer-readable media of claim 1 , wherein the difference information is prediction residuals for the given occupied 3D-point-cloud block.
- 4 . A method comprising: storing, in a reference frame buffer, a reconstructed version of a previous three-dimensional (“3D”) video frame of a sequence of 3D video frames, for use as a reference 3D video frame in inter-frame prediction; encoding a current 3D video frame among the 3D video frames of the sequence, including: determining one or more occupied 3D-point-cloud blocks in the current 3D video frame, wherein each of the one or more occupied 3D-point-cloud blocks includes multiple voxels of the current 3D video frame, at least one of the multiple voxels in each of the one or more occupied 3D-point-cloud blocks being an occupied voxel, and wherein a given occupied 3D-point-cloud block, among the one or more occupied 3D-point-cloud blocks, is an x×y×z region of voxels, each of x, y, and z being an integer value greater than or equal to 2; and encoding each of the one or more occupied 3D-point-cloud blocks using one of multiple available encoding modes, the multiple available encoding modes including an intra-frame mode and an inter-frame mode, the encoding the each of the one or more occupied 3D point-cloud blocks including, for the given occupied 3D-point-cloud block, the given occupied 3D-point-cloud block being encoded using inter-frame prediction according to the inter-frame mode: determining difference information for the given occupied 3D-point-cloud block, the difference information indicating, in 3D space, differences of the given occupied 3D-point-cloud block with respect to a predicted 3D-point-cloud block determined from the reference 3D video frame stored in the reference frame buffer; and applying the difference information for the given occupied 3D-point-cloud block to the predicted 3D-point-cloud block determined from the reference 3D video frame stored in the reference frame buffer; and outputting, as part of a bitstream, encoded data for the current 3D video frame, the encoded data for the current 3D video frame including encoded data for the 
one or more occupied 3D-point-cloud blocks.
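The first encoding step of the method in claim 4, determining the occupied 3D-point-cloud blocks, can be sketched as below. The names are hypothetical, and the sketch assumes the frame's geometry is available as a dense boolean occupancy grid; real coders typically use sparser structures such as octrees.

```python
import numpy as np

def occupied_blocks(occupancy, n):
    """Split a voxelized frame into n*n*n cubes and yield the origins of
    the blocks containing at least one occupied voxel -- the 'occupied
    3D-point-cloud blocks' that the coder actually encodes."""
    dx, dy, dz = occupancy.shape
    for x in range(0, dx, n):
        for y in range(0, dy, n):
            for z in range(0, dz, n):
                if occupancy[x:x + n, y:y + n, z:z + n].any():
                    yield (x, y, z)
```

Unoccupied blocks need not be signaled at all, which is one reason block-based coding of sparse point clouds is attractive.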
- 5 . The method of claim 4 , wherein the encoding the current 3D video frame further comprises applying one or more voxel-distortion-correction filters to at least part of the current 3D video frame in an inter-frame prediction loop.
- 6 . The method of claim 5 , wherein the one or more voxel-distortion-correction filters comprise: a filter implementing a morphological process; or a filter implementing an adaptive smoothing process.
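One plausible form of the morphological filter recited in claim 6 is a closing (dilation followed by erosion) of the reconstructed occupancy grid, which fills small holes opened by geometry distortion. The sketch below is an assumption, not the claimed filter: it uses a 6-connected kernel in pure NumPy, and `np.roll` wraps at the frame borders, which is acceptable when the subject is padded away from them.

```python
import numpy as np

def _dilate(occ):
    """Binary dilation of a voxel occupancy grid, 6-connected kernel."""
    out = occ.copy()
    for axis in range(3):
        for shift in (1, -1):
            out |= np.roll(occ, shift, axis=axis)
    return out

def _erode(occ):
    """Binary erosion: a voxel survives only if it and all 6 face
    neighbours are occupied."""
    out = occ.copy()
    for axis in range(3):
        for shift in (1, -1):
            out &= np.roll(occ, shift, axis=axis)
    return out

def morphological_closing(occ):
    """One voxel-distortion-correction pass: dilation then erosion fills
    small interior holes that geometry quantization can introduce."""
    return _erode(_dilate(occ))
```

An adaptive smoothing filter, the claim's alternative, would instead average voxel attributes over a neighborhood whose size adapts to local geometry.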
- 7 . The method of claim 4 , wherein the encoded data for the one or more occupied 3D-point-cloud blocks includes a syntax element signaling a selected encoding mode for the one or more occupied 3D-point-cloud blocks.
- 8 . The method of claim 7 , wherein the selected encoding mode is: signaled for the current 3D video frame; or signaled on a block-by-block basis for the one or more occupied 3D-point-cloud blocks, respectively.
- 9 . The method of claim 4 , wherein the difference information is prediction residuals for the one or more occupied 3D-point-cloud blocks.
- 10 . The method of claim 9 , wherein the prediction residuals indicate motion for the one or more occupied 3D-point-cloud blocks.
- 11 . The method of claim 4 , wherein the encoding the given occupied 3D-point-cloud block further comprises: using motion information for the given occupied 3D-point-cloud block to identify the predicted 3D-point-cloud block in the reference 3D video frame stored in the reference frame buffer; and encoding the motion information for the given occupied 3D-point-cloud block.
- 12 . The method of claim 4 , further comprising selecting the one of the multiple available encoding modes based at least in part on a correspondence-based distortion metric, a projection-based distortion metric, or a combination of both the correspondence-based distortion metric and the projection-based distortion metric.
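The rate-distortion-based mode selection of claim 12 can be illustrated with a Lagrangian cost comparison, J = D + lambda * R. This formulation is a common one but an assumption here; the claim requires only that the choice depend on a correspondence-based and/or projection-based distortion metric, which would supply the distortion terms below.

```python
def select_mode(rate_intra, dist_intra, rate_inter, dist_inter, lam):
    """Pick the encoding mode for one occupied 3D-point-cloud block by
    comparing Lagrangian rate-distortion costs J = D + lam * R."""
    j_intra = dist_intra + lam * rate_intra
    j_inter = dist_inter + lam * rate_inter
    return "intra" if j_intra <= j_inter else "inter"
```

Larger values of `lam` weight rate more heavily, pushing the encoder toward the cheaper inter-frame mode at the cost of geometry accuracy.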
- 13 . A computer system comprising one or more processing units and memory, wherein the computer system implements a decoder system comprising: an input buffer configured to store encoded data, from a bitstream, for at least part of a sequence of three-dimensional (“3D”) video frames, the encoded data for the at least part of the sequence including encoded data for one or more occupied 3D-point-cloud blocks in a current 3D video frame among the 3D video frames of the sequence, wherein each of the one or more occupied 3D-point-cloud blocks includes multiple voxels of the current 3D video frame, at least one of the multiple voxels in each of the one or more occupied 3D-point-cloud blocks being an occupied voxel, and wherein a given occupied 3D-point-cloud block, among the one or more occupied 3D-point-cloud blocks, is an x×y×z region of voxels, each of x, y, and z being an integer value greater than or equal to 2, the encoded data for the one or more occupied 3D-point-cloud blocks including one or more syntax elements signaling a mode for the one or more occupied 3D-point-cloud blocks; a reference frame buffer configured to store reconstructed voxelized point cloud data; and a video decoder configured to decode the 3D video frames of the sequence by performing operations, the operations comprising: storing, in the reference frame buffer, a reconstructed version of a previous 3D video frame among the 3D video frames of the sequence, for use as a reference 3D video frame in motion-compensated prediction; and decoding the current 3D video frame, including reconstructing the one or more occupied 3D-point-cloud blocks in accordance with the signaled mode, wherein the signaled mode is one of multiple available modes, the multiple available modes including an intra-frame mode and an inter-frame mode, and wherein the reconstructing the one or more occupied 3D-point-cloud-blocks includes: based on the one or more syntax elements signaling the mode, determining that the 
signaled mode for the given occupied 3D-point-cloud block is the inter-frame mode, the given occupied 3D-point-cloud block being decoded using inter-frame prediction according to the inter-frame mode as the signaled mode; based at least in part on the signaled mode for the given occupied 3D-point-cloud block being the inter-frame mode, determining difference information for the given occupied 3D-point-cloud block, the difference information indicating, in 3D space, differences of the given occupied 3D-point-cloud block with respect to a predicted 3D-point-cloud block determined from the reference 3D video frame stored in the reference frame buffer; and based at least in part on the signaled mode for the given occupied 3D-point-cloud block being the inter-frame mode, applying the difference information for the given occupied 3D-point-cloud block to the predicted 3D-point-cloud block determined from the reference 3D video frame stored in the reference frame buffer.
- 14 . The computer system of claim 13 , wherein the signaled mode is signaled for the current 3D video frame.
- 15 . The computer system of claim 13 , wherein the signaled mode is signaled on a block-by-block basis for the one or more occupied 3D-point-cloud blocks, respectively.
- 16 . The computer system of claim 13 , wherein the difference information is prediction residuals for the one or more occupied 3D-point-cloud blocks.
- 17 . The computer system of claim 16 , wherein the prediction residuals indicate motion for the one or more occupied 3D-point-cloud blocks.
- 18 . The computer system of claim 13 , wherein the decoding the current 3D video frame further comprises, for the given occupied 3D-point-cloud block: decoding motion information for the given occupied 3D-point-cloud block; and using the motion information for the given occupied 3D-point-cloud block to identify the predicted 3D-point-cloud block in the reference 3D video frame stored in the reference frame buffer.
- 19 . The computer system of claim 13 , wherein the decoding the current 3D video frame further comprises applying one or more voxel-distortion-correction filters to at least part of the current 3D video frame in an inter-frame prediction loop.
- 20 . The computer system of claim 19 , wherein the one or more voxel-distortion-correction filters comprise: a filter implementing a morphological process; or a filter implementing an adaptive smoothing process.
Description
CROSS REFERENCE TO RELATED APPLICATIONS
This application is a continuation of U.S. patent application Ser. No. 17/682,018, filed Feb. 28, 2022, which is a continuation of U.S. patent application Ser. No. 15/168,019, filed May 28, 2016, now U.S. Pat. No. 11,297,346, the disclosure of which is hereby incorporated by reference.
FIELD
The disclosed technology concerns compression schemes for voxelized point clouds as may be used in 3D communication systems, such as augmented-reality or virtual-reality systems.
BACKGROUND
With the emergence of inexpensive consumer electronic systems for both 3D capture and 3D rendering, visual communication is on the threshold of advancing beyond traditional 2D video to immersive 3D communication systems. Dynamic 3D scene capture can be implemented using color plus depth (RGBD) cameras, while 3D visualization can be implemented using stereoscopic monitors or near-eye displays to render the subject within a virtual or augmented reality. The processing for capture and display can be done in real time using powerful graphics processing units (GPUs). However, representing a complex, dynamic 3D scene generates a large amount of data. Compression is therefore a highly desirable part of enabling these emerging immersive 3D systems for communication. Further, despite improvements in computer hardware, compression of 3D video is extremely time-consuming and resource-intensive in many encoding scenarios. Accordingly, improved compression methods that reduce computational complexity (including computational speed and resource usage) while still maintaining acceptable visual quality are highly desirable.
SUMMARY
In summary, the detailed description presents innovations for compressing 3D video data.
The innovations described herein can help reduce the bit rate and/or distortion of 3D video encoding by selectively encoding one or more 3D-point-cloud blocks using an inter-frame coding (e.g., motion compensation) technique that allows for previously encoded/decoded frames to be used in predicting current frames being encoded. This reduction in the bit rate required for compression allows an encoder/decoder to more quickly perform compression/decompression of a point cloud frame and also reduces computational resource usage, both of which can be useful in real-time encoding/decoding scenarios. Alternatively, one or more 3D-point-cloud blocks can be encoded using an intra-frame encoding approach. The selection of which encoding mode to use can be based, for example, on a threshold that is evaluated relative to rate-distortion performance for both intra-frame and inter-frame encoding. Still further, embodiments of the disclosed technology can use one or more voxel-distortion-correction filters to correct distortion errors that may occur during voxel compression. Such filters are uniquely adapted for the particular challenges presented when compressing 3D image data. Corresponding decoding techniques are also disclosed herein. Dynamic point clouds present a new frontier in visual communication systems. Although some advances have been made with respect to compression schemes for point clouds, few (if any) advances have been made with respect to using temporal redundancies as part of an effective point cloud compression scheme. Embodiments of the disclosed technology enable the encoding of dynamic voxelized point clouds at low bit rates. In embodiments of the disclosed technology, an encoder breaks the voxelized point cloud at each frame into 3D blocks (cubes) of voxels (also referred to as “3D-point-cloud blocks”). 
Each 3D-point-cloud block is either encoded in intra-frame mode or is replaced by a motion-compensated version of a 3D-point-cloud block in the previous frame. The decision can be based (at least in part) on a rate-distortion metric. In this way, both the geometry and the color can be encoded with distortion, allowing for reduced bit rates. In certain embodiments, in-loop filtering is also employed to reduce (e.g., minimize) compression artifacts caused by distortion in the geometry information. Simulations reveal that embodiments of the disclosed motion-compensated coder can efficiently extend the compression range of dynamic voxelized point clouds to rates below what intra-frame coding alone can accommodate, trading rate for geometry accuracy. The innovations can be implemented as part of a method, as part of a computing device adapted to perform the method, or as part of a tangible computer-readable media storing computer-executable instructions for causing a computing device to perform the method. The various innovations can be used in combination or separately. The foregoing and other objects, features, and advantages of the invention will become more apparent from the following detailed description, which proceeds with reference to the accompanying figures.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 shows six example viewpoints of a voxelized point cloud for an imaged subject. FIG. 2