KR-20260065906-A - Geometric motion compensation prediction
Abstract
A video decoder for decoding video of a scene from a data stream using motion compensation prediction is configured to, for a current block of a current image, find corresponding positions in a reference image that correspond to the pixels of the current block using video geometry-related parameters describing how the scene is projected onto the images of the video, derive the interior of a prediction block by sampling the reference image at those positions, and reconstruct the current block using the interior of the prediction block.
Inventors
- Hinz, Tobias
- Helle, Philipp
- Merkle, Philipp
- Winken, Martin
- Schwarz, Heiko
- Pfaff, Jonathan
- Marpe, Detlev
- Wiegand, Thomas
Assignees
- Fraunhofer-Gesellschaft zur Förderung der angewandten Forschung e.V.
Dates
- Publication Date: 2026-05-11
- Application Date: 2024-09-03
- Priority Date: 2023-09-06
Claims (20)
- A video decoder (10) for decoding a video (12) of a scene (14) from a data stream (16) using motion compensation prediction, configured to, for a current block (18) of a current image (20a): find, using video geometry-related parameters (24) that describe how the scene (14) is projected (26) onto the images (20) of the video (12), corresponding positions (22) in a reference image (20b) that correspond to the pixels (23) of the current block (18); derive the interior of a prediction block by sampling the reference image (20b) at the corresponding positions (22); and reconstruct the current block (18) using the interior of the prediction block.
- The video decoder according to claim 1, wherein the video geometry-related parameters describe a scene-to-image projection (26) of a camera (30) that projects the scene (14) onto the images (20) of the video.
- The video decoder according to claim 1 or 2, wherein the video geometry-related parameters describe, for each image of the video, a scene-to-image projection of a camera that projects the scene onto the images of the video.
- The video decoder according to any one of claims 1 to 3, wherein the video geometry-related parameters describe a scene-to-image projection of a camera that projects the scene onto the images of the video, both for the video as a whole and individually per image.
- The video decoder according to any one of claims 1 to 4, wherein the video geometry-related parameters describe the scene-to-image projection by one or more of: one or more extrinsic camera parameters of the one camera, and one or more intrinsic camera parameters of the one camera.
- The video decoder according to claim 5, wherein the one or more extrinsic camera parameters define the position (p→i) of the one camera and/or the orientation (v→i) of the one camera.
- The video decoder according to claim 5 or 6, wherein the one or more intrinsic camera parameters define the focal length and/or the field-of-view (FOV) angle of the one camera.
- The video decoder according to any one of claims 1 to 7, wherein the video geometry-related parameters describe a homographic mapping (42) between corresponding positions (40₁, 40₂) within pairs of images (20₁, 20₂) of the video, or a homographic mapping between corresponding motion vectors associated with pairs of images of the video.
- The video decoder according to claim 8, wherein the video geometry-related parameters describe the homographic mapping by vectors (46₂) or tensors at predetermined control points (44₂) of the images of the video.
- The video decoder according to claim 9, wherein the predetermined control points are the corners of the images of the video.
- The video decoder according to any one of claims 8 to 10, wherein the video geometry-related parameters comprise exactly one vector or tensor for each of the four corners of the images of the video.
- The video decoder according to any one of claims 1 to 11, wherein the video geometry-related parameters describe a homographic mapping between corresponding positions within pairs of images of the video, and the video decoder is configured to derive a scene-to-image projection for a predetermined image based on the homographic mapping between corresponding positions within pairs of images that include the predetermined image.
- The video decoder according to any one of claims 1 to 12, configured to decode the video geometry-related parameters from the data stream.
- The video decoder according to any one of claims 1 to 13, configured to determine the video geometry-related parameters based on an already decoded portion of the video.
- The video decoder according to any one of claims 1 to 14, wherein the video geometry-related parameters describe a homographic mapping between corresponding positions within pairs of images of the video, and the video decoder is configured to derive a scene-to-image projection for a predetermined image based on a sequence of homographic mappings between corresponding positions within one or more pairs of images that include a base image and the predetermined image, and based on extrinsic and intrinsic parameters for the base image.
- The video decoder according to any one of claims 1 to 15, configured to decode a syntax element from the data stream and, if the syntax element has a first state, for the current block: find, using the video geometry-related parameters that describe how the scene is projected onto the images of the video, the corresponding positions in the reference image that correspond to the pixels of the current block; derive the interior of the prediction block by sampling the reference image at the corresponding positions; and reconstruct the current block using the interior of the prediction block.
- The video decoder according to claim 16, configured, if the syntax element has a second state, to reconstruct the current block independently of the video geometry-related parameters, or to reconstruct the current block by copying an already decoded video portion at a regular pixel pitch.
- The video decoder according to any one of claims 1 to 17, configured to: derive a list of MVP candidates for the current block, one of the MVP candidates corresponding to a specific no-motion-vector inter-coding mode; select an MVP from the list of MVP candidates; if the selected MVP corresponds to the specific no-motion-vector inter-coding mode, find, for the current block, using the video geometry-related parameters that describe how the scene is projected onto the images of the video, the corresponding positions in the reference image that correspond to the pixels of the current block, derive the interior of the prediction block by sampling the reference image at the corresponding positions, and reconstruct the current block using the interior of the prediction block; and, if the selected MVP corresponds to an MVP candidate other than the specific no-motion-vector inter-coding mode, reconstruct the current block using motion-compensated prediction with the selected MVP.
- The video decoder according to any one of claims 1 to 18, configured to determine a scene model (300) based on motion vectors (126) of the images and the video geometry-related parameters (24), and to find the corresponding positions (22) in the reference image using the scene model (300) and the video geometry-related parameters (24).
- The video decoder according to any one of claims 1 to 19, configured to determine the scene model (300) by determining scene model points (302) that form the basis of the scene model, namely, for each of one or more control points (132) of an inter-predicted block (128) of a predetermined image (20b): determine, for the individual control point, the source image position (134) in the corresponding reference image (20d) pointed to from the individual control point by a motion vector (130) coded in the data stream for the inter-predicted block; determine, using the video geometry-related parameters (24), a first scene projection line along which the scene is projected onto the individual control point (132), and a second scene projection line along which the scene is projected onto the source image position (134); determine a scene point on the shortest line segment connecting a point on the first scene projection line and a point on the second scene projection line; and either take this scene point as the scene model point, or determine the distance of the scene point from the predetermined image and place the scene model point at said distance on the first scene projection line.
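The ray-ray triangulation in the last claim (finding a scene point on the shortest segment connecting two scene projection lines) can be sketched as follows; the function name, the midpoint choice, and the use of NumPy are our own illustration, not part of the patent:

```python
import numpy as np

def closest_point_between_rays(o1, d1, o2, d2):
    """Midpoint of the shortest segment connecting two 3-D rays.

    Each ray is given by an origin o and a direction d; points on
    ray i are o_i + t_i * d_i.  This is one common way to realize
    the "scene point on the shortest connecting line" of the claim.
    """
    o1, d1 = np.asarray(o1, float), np.asarray(d1, float)
    o2, d2 = np.asarray(o2, float), np.asarray(d2, float)
    # Minimize |(o1 + t1*d1) - (o2 + t2*d2)|^2 over t1, t2.
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = o1 - o2
    denom = a * c - b * b
    if abs(denom) < 1e-12:                 # rays (nearly) parallel
        t1, t2 = 0.0, (w @ d2) / c
    else:
        t1 = (b * (w @ d2) - c * (w @ d1)) / denom
        t2 = (a * (w @ d2) - b * (w @ d1)) / denom
    p1 = o1 + t1 * d1                      # closest point on ray 1
    p2 = o2 + t2 * d2                      # closest point on ray 2
    return (p1 + p2) / 2.0
```

If the two projection lines actually intersect, the segment degenerates and the midpoint is the intersection point itself.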
Description
Geometric motion compensation prediction

Embodiments according to the present invention relate to devices, namely a video encoder and a video decoder, and to methods for encoding or decoding a video of a scene using motion compensation prediction.

Hybrid video codecs partition the input signal frame by frame into square blocks called coding tree units (CTUs). CTUs can be subdivided into smaller coding units (CUs). The reconstructed samples of a CU are obtained by superimposing the residual signal transmitted in the bitstream onto the predicted samples; they then pass through multiple post-processing filters that improve their quality by removing coding artifacts. Each image is assigned a picture order count (POC) that increases in display order.

In the prediction of a CU, two basic modes are distinguished: an intra mode, which predicts samples from already reconstructed regions within the current image, typically from adjacent neighbors; and an inter mode, which uses sample information from previously reconstructed images for temporal sample prediction; in addition, there is a combined inter-intra prediction (CIIP) mode, which combines the two. A special mode available for intra prediction is the intra block copy (IBC) mode, which copies prediction samples from the location that results from applying a displacement vector pointing into an already reconstructed region of the current image. In inter prediction, CUs are predicted by a weighted superposition of predictions from one or more reference images. Previously reconstructed images used as reference images are accessed through reference picture lists (RPLs); a specific reference image selected for prediction is addressed by a reference index (Ref-Idx) into the list. VVC uses up to two reference lists (L0 and L1).
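The RPL access described above can be illustrated with a small sketch; the function name and the exact list-construction order are a simplified assumption of ours, not taken from the patent or any codec specification:

```python
def build_rpls(current_poc, decoded_pocs):
    """Build simplified L0/L1 reference picture lists by POC.

    L0 prefers pictures preceding the current picture in display
    order (nearest first); L1 prefers pictures following it.  A
    Ref-Idx is then just an index into the chosen list.
    """
    past = sorted((p for p in decoded_pocs if p < current_poc), reverse=True)
    future = sorted(p for p in decoded_pocs if p > current_poc)
    l0 = past + future   # nearest preceding picture at Ref-Idx 0
    l1 = future + past   # nearest following picture at Ref-Idx 0
    return l0, l1

l0, l1 = build_rpls(4, [0, 2, 3, 8, 16])
# Ref-Idx 0 in L0 now addresses the nearest preceding picture (POC 3).
```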
For the current block position, the spatial offset of the position from which prediction samples are fetched in the reference image is given by a motion vector (MV), with a precision ranging from N×-sample resolution down to subsample resolution. For prediction from subsample positions, one of multiple N-tap interpolation filters is used, depending on the subsample position. To exploit redundancy in motion vector coding, each MV used to fetch prediction samples from a reference image is predicted by a motion vector predictor (MVP), derived by a motion vector prediction process. This process searches spatially and/or temporally neighboring CUs and/or history-based buffers for suitable MVP candidates. The MVP candidates are stored in a list, and the MVP used to predict the MV is selected by an MVP index transmitted in the bitstream, unless it is derived otherwise. The final MV is obtained as the superposition of the MVP and a motion vector difference (MVD) transmitted in the bitstream. For some coding modes, such as skip and some merge modes, no motion vector difference is transmitted in the bitstream, and the final motion vector is derived directly from the MVP.

A more elaborate method for temporal sample prediction is the affine mode, which computes prediction samples using a multi-parameter affine prediction model. The affine mode uses two or three motion vectors at specific control points to describe a motion vector field that varies linearly with the sample position within the current block. A further technique uses bi-prediction with CU-level weights (BCW) for bidirectional prediction, in which a per-CU index, BCW-Idx, addresses a scaling table that determines the individual weights applied to the hypotheses superimposed in bidirectional prediction.
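The MVP + MVD superposition and the linearly varying affine field can be sketched as follows; this uses the common 4-parameter affine form with two control-point motion vectors, and the function names are our own illustration:

```python
def reconstruct_mv(mvp, mvd):
    """Final MV = superposition of predictor and transmitted difference."""
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])

def affine_mv(x, y, w, cp_mv_tl, cp_mv_tr):
    """4-parameter affine motion vector at sample position (x, y).

    cp_mv_tl / cp_mv_tr are the control-point MVs at the top-left and
    top-right corners of a block of width w; the field varies
    linearly with the sample position, as in the affine mode above.
    """
    dx = cp_mv_tr[0] - cp_mv_tl[0]
    dy = cp_mv_tr[1] - cp_mv_tl[1]
    mvx = cp_mv_tl[0] + dx * x / w - dy * y / w
    mvy = cp_mv_tl[1] + dy * x / w + dx * y / w
    return (mvx, mvy)
```

At (0, 0) the field reproduces the top-left control-point MV, and at (w, 0) the top-right one; positions in between are interpolated linearly.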
Another technique is the merge mode with motion vector differences (MMVD), in which an index determining the direction and spatial distance of a motion vector, one of whose components is 0, is transmitted in the bitstream. A further technique is the symmetric motion vector difference (SMVD) mode, which is signaled in the bitstream when the current CU uses bidirectional prediction and the mode is neither merge nor skip. In this special mode, the MVP-Idx values of both RPLs are transmitted in the bitstream, but only the MVD for RPL L0 is transmitted. The MVD applied to the MV predicted from the L1 hypothesis is derived by copying the MVD transmitted for L0 and inverting the sign of each MVD component. The reference images are selected from the references stored in the respective RPLs such that the reference image from L0 immediately precedes the current image in display order and the reference image from L1 immediately follows it. Another inter-prediction mode, the geometric partitioning mode (GPM), uses two motion vectors per CU. The area covered by the CU is split into two zones along an angled straight boundary. The final prediction samples are obtained by applying a pixel-wise weighting matrix to the samples of the two prediction hypotheses, which performs blending at the zone boundary.
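The SMVD sign-mirroring step can be sketched in a few lines; the function name is illustrative:

```python
def smvd_l1_mvd(mvd_l0):
    """SMVD: derive the L1 MVD as the sign-inverted copy of the
    transmitted L0 MVD, so only one MVD needs to be coded."""
    return (-mvd_l0[0], -mvd_l0[1])

# With L0 MVD (3, -2), the derived L1 MVD is (-3, 2), mirroring the
# motion symmetrically about the current picture in display order.
mvd_l1 = smvd_l1_mvd((3, -2))
```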