US-12621458-B2 - Systems and methods for cross-view motion vector prediction
Abstract
The various implementations described herein include methods and systems for coding video. In one aspect, a method includes receiving a multi-view video bitstream that includes a first block in a first frame corresponding to a first view and a second block in a second frame corresponding to a second view. The method identifies a first set of reference frames in the first view for the first block. The method obtains a set of motion vectors corresponding to a second set of reference frames in the second view for the second block. In accordance with a determination that the first set of reference frames share a display time with the second set of reference frames, the method derives a motion vector predictor (MVP) for the first block corresponding to the first view using the set of motion vectors corresponding to the second view, and decodes the first block using the derived MVP.
Inventors
- Xin Zhao
- Han Gao
- Liang Zhao
- Shan Liu
Assignees
- Tencent America LLC
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2024-05-07
Claims (18)
- 1 . A method of video decoding performed at a computing system having memory and one or more processors, the method comprising: receiving a multi-view video bitstream comprising a plurality of blocks, wherein the plurality of blocks includes a first block in a first frame corresponding to a first view and a second block in a second frame corresponding to a second view; identifying a first set of reference frames in the first view for the first block; obtaining a set of motion vectors corresponding to a second set of reference frames in the second view for the second block; when the first set of reference frames share a display time with the second set of reference frames, deriving a motion vector predictor (MVP) for the first block corresponding to the first view using the set of motion vectors corresponding to the second view; when the first set of reference frames do not share the display time with the second set of reference frames, deriving the MVP for the first block corresponding to the first view using scaled information from the set of motion vectors corresponding to the second view; and decoding the first block using the derived MVP.
- 2 . The method of claim 1 , wherein the second block is identified using a disparity vector, wherein the disparity vector is derived from a set of neighboring blocks that are coded using the first view as a reference frame.
- 3 . The method of claim 1 , wherein the set of motion vectors are scaled according to a scaling factor that is proportional to a ratio of temporal distances.
- 4 . The method of claim 1 , further comprising constructing an MVP list corresponding to the first block, wherein the MVP list includes a temporal motion vector predictor (TMVP) candidate and/or a view-based motion vector predictor (VMVP) candidate.
- 5 . The method of claim 4 , wherein the MVP list includes the TMVP candidate at a first index and the VMVP candidate at a second index.
- 6 . The method of claim 4 , wherein the MVP list is restricted from having more than a predefined number of TMVP and/or VMVP candidates.
- 7 . The method of claim 4 , wherein the MVP list includes one or more motion vectors corresponding to the second view.
- 8 . The method of claim 1 , further comprising: constructing a first MVP list corresponding to multiple views; and constructing a second MVP list corresponding to a current frame, wherein the MVP for the first block is derived using the first MVP list or the second MVP list in accordance with a signaled indicator in the multi-view video bitstream.
- 9 . The method of claim 1 , wherein the MVP for the first block is derived using motion vectors corresponding to a plurality of blocks in the second frame corresponding to the second view.
- 10 . The method of claim 9 , wherein coordinates for the plurality of blocks in the second frame are predefined or derived by the computing system.
- 11 . The method of claim 1 , further comprising obtaining an additional motion vector for the first block, the additional motion vector indicating a position displacement for the second set of reference frames.
- 12 . The method of claim 1 , wherein the MVP for the first block is derived based on a motion vector bank associated with the second block.
- 13 . The method of claim 1 , wherein the MVP for the first block corresponding to the first view is derived using the set of motion vectors corresponding to the second view in accordance with a first indicator in the multi-view video bitstream indicating that motion vectors from different views are to be used for the first block.
- 14 . A computing system, comprising: control circuitry; memory; and one or more sets of instructions stored in the memory and configured for execution by the control circuitry, the one or more sets of instructions comprising instructions for: receiving video data comprising a plurality of blocks, wherein the plurality of blocks includes a first block in a first frame corresponding to a first view and a second block in a second frame corresponding to a second view; identifying a first set of reference frames in the first view for the first block; obtaining a set of motion vectors corresponding to a second set of reference frames in the second view for the second block; when the first set of reference frames share a display time with the second set of reference frames, selecting a motion vector for the first block from the set of motion vectors corresponding to the second view; when the first set of reference frames do not share the display time with the second set of reference frames, deriving the motion vector for the first block using scaled information from the set of motion vectors corresponding to the second view; and encoding the first block using the selected motion vector.
- 15 . The computing system of claim 14 , further comprising identifying the second block using a disparity vector, wherein the set of motion vectors are obtained in accordance with identifying the second block.
- 16 . The computing system of claim 14 , wherein the set of motion vectors correspond to a plurality of blocks in the second view, the plurality of blocks including the second block.
- 17 . A method for generating a video bitstream, the method comprising obtaining a multi-view video bitstream, including: receiving video data comprising a plurality of blocks, wherein the plurality of blocks includes a first block in a first frame corresponding to a first view and a second block in a second frame corresponding to a second view; identifying a first set of reference frames in the first view for the first block; obtaining a set of motion vectors corresponding to a second set of reference frames in the second view for the second block; when the first set of reference frames share a display time with the second set of reference frames, selecting a motion vector for the first block from the set of motion vectors corresponding to the second view; when the first set of reference frames do not share the display time with the second set of reference frames, deriving the motion vector for the first block using scaled information from the set of motion vectors corresponding to the second view; encoding the first block using the selected motion vector; and transmitting the multi-view video bitstream.
- 18 . The method of claim 17 , wherein obtaining the multi-view video bitstream further comprises identifying the second block using a disparity vector, wherein the set of motion vectors are obtained in accordance with identifying the second block.
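Claims 4 through 7 describe an MVP list that holds a TMVP candidate at a first index and a VMVP candidate at a second index, restricted to a predefined number of candidates of each type. The following is a minimal, hypothetical sketch of such a list construction; the function name, the cap values, and the candidate representation are illustrative assumptions, not the claimed implementation:

```python
# Assumed caps; the claims only state that the numbers are predefined.
MAX_TMVP = 1
MAX_VMVP = 1

def build_mvp_list(tmvp_candidates, vmvp_candidates):
    """Build an MVP candidate list with TMVP candidates first, then VMVP.

    Hypothetical sketch: TMVP candidates occupy the first index (or indices),
    VMVP candidates the next, and each type is capped at a predefined count.
    """
    mvp_list = []
    mvp_list.extend(tmvp_candidates[:MAX_TMVP])  # TMVP at the first index
    mvp_list.extend(vmvp_candidates[:MAX_VMVP])  # VMVP at the second index
    return mvp_list
```

Under this sketch, a decoder-signaled index into the list would then select which predictor is used for the current block.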
Description
RELATED APPLICATIONS
This application claims priority to U.S. Provisional Patent Application No. 63/548,368, entitled “Cross-View Motion Vector Prediction,” filed Nov. 13, 2023, which is hereby incorporated by reference in its entirety.
TECHNICAL FIELD
The disclosed embodiments relate generally to video coding, including but not limited to systems and methods for motion vector prediction for multiview video (MVV) coding.
BACKGROUND
Digital video is supported by a variety of electronic devices, such as digital televisions, laptop or desktop computers, tablet computers, digital cameras, digital recording devices, digital media players, video gaming consoles, smart phones, video teleconferencing devices, video streaming devices, etc. The electronic devices transmit and receive or otherwise communicate digital video data across a communication network, and/or store the digital video data on a storage device. Due to the limited bandwidth capacity of the communication network and limited memory resources of the storage device, video coding may be used to compress the video data according to one or more video coding standards before it is communicated or stored. The video coding can be performed by hardware and/or software on an electronic/client device or a server providing a cloud service. Video coding generally utilizes prediction methods (e.g., inter-prediction, intra-prediction, or the like) that take advantage of redundancy inherent in the video data. Video coding aims to compress video data into a form that uses a lower bit rate, while avoiding or minimizing degradations to video quality. Multiple video codec standards have been developed. For example, High-Efficiency Video Coding (HEVC/H.265) is a video compression standard designed as part of the MPEG-H project. ITU-T and ISO/IEC published the HEVC/H.265 standard in 2013 (version 1), 2014 (version 2), 2015 (version 3), and 2016 (version 4).
Versatile Video Coding (VVC/H.266) is a video compression standard intended as a successor to HEVC. ITU-T and ISO/IEC published the VVC/H.266 standard in 2020 (version 1) and 2022 (version 2). AOMedia Video 1 (AV1) is an open video coding format designed as an alternative to HEVC. On Jan. 8, 2019, a validated version 1.0.0 with Errata 1 of the specification was released.
SUMMARY
The present disclosure describes a set of methods for video (image) compression, specifically related to motion vector prediction when multiple views of a scene are being coded. In some embodiments, instead of coding each view and sending a bitstream for each view independently (simulcast coding), a disparity-compensated prediction approach is implemented whereby pictures of other views at the same time instance are included in the reference picture list. This approach can improve coding efficiency by reducing the statistical redundancy that exists between different views. In some instances, the approaches disclosed herein can achieve about 70% bitrate savings over simulcast coding.
In accordance with some embodiments, a method of video decoding includes (i) receiving a multi-view video bitstream comprising a plurality of blocks, including a first block in a first frame corresponding to a first view and a second block in a second frame corresponding to a second view; (ii) identifying a first set of reference frames in the first view for the first block; (iii) obtaining a set of motion vectors corresponding to a second set of reference frames in the second view for the second block; (iv) in accordance with a determination that the first set of reference frames share a display time with the second set of reference frames, deriving a motion vector predictor (MVP) for the first block corresponding to the first view using the set of motion vectors corresponding to the second view; and (v) decoding the first block using the derived MVP.
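The decoding-side decision described above (reuse the cross-view motion vector when the reference frames share a display time, otherwise scale it) can be sketched as follows. This is a simplified, hypothetical illustration — the function name, the tuple representation of motion vectors, and the use of display times as plain numbers are assumptions, and the scaling rule mirrors the ratio-of-temporal-distances factor recited in claim 3 rather than any normative derivation:

```python
def derive_cross_view_mvp(cur_time, first_ref_time,
                          second_time, second_ref_time, second_mv):
    """Derive an MVP for a first-view block from a second-view motion vector.

    Hypothetical sketch: if the reference frames of the two views share a
    display time (equal temporal distances here), the cross-view motion
    vector is reused directly; otherwise it is scaled by the ratio of the
    two temporal distances.
    """
    td_first = cur_time - first_ref_time        # temporal distance, first view
    td_second = second_time - second_ref_time   # temporal distance, second view
    if td_first == td_second:
        # Shared display time: adopt the cross-view motion vector as-is.
        return second_mv
    # Differing display times: scale by the ratio of temporal distances.
    scale = td_first / td_second
    return (round(second_mv[0] * scale), round(second_mv[1] * scale))
```

For example, with equal temporal distances the second-view vector passes through unchanged, while a first-view distance twice that of the second view doubles both components.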
In accordance with some embodiments, a method of video encoding includes (i) receiving video data comprising a plurality of blocks, wherein the plurality of blocks includes a first block in a first frame corresponding to a first view and a second block in a second frame corresponding to a second view; (ii) identifying a first set of reference frames in the first view for the first block; (iii) obtaining a set of motion vectors corresponding to a second set of reference frames in the second view for the second block; (iv) in accordance with a determination that the first set of reference frames share a display time with the second set of reference frames, selecting a motion vector for the first block from the set of motion vectors corresponding to the second view; and (v) encoding the first block using the selected motion vector.
In accordance with some embodiments, a method of bitstream conversion includes (i) obtaining a source video sequence corresponding to a set of views; and (ii) performing a conversion between the source video sequence and a multi-view video bitstream of