EP-3942823-B1 - METHOD AND APPARATUS FOR VIDEO CODING
Inventors
- LI, GUICHUN
- LI, XIANG
- XU, XIAOZHONG
- LIU, SHAN
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2020-03-18
Claims (11)
- A method of video encoding, comprising: determining whether to apply a prediction refinement with optical flow (PROF) to an affine coded block; and responsive to a determination to apply the PROF to the affine coded block, generating a prediction sample I(i, j) at a sample location (i, j) in the affine coded block, generating spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) in the affine coded block, generating a prediction refinement ΔI(i, j) based on the spatial gradients g_x(i, j) and g_y(i, j), and adding the prediction refinement ΔI(i, j) to the prediction sample I(i, j) to generate a refined prediction sample; characterised in that the determining whether to apply the PROF to the affine coded block is based on values of affine parameters of an affine model of the affine coded block, comprising: when a minimum absolute value of affine parameters a, b, c, or d, denoted as min_parameter = min{|a|, |b|, |c|, |d|}, is below or equal to a predefined threshold value, the PROF for affine is not applied for the affine coded block, whereby a, b, c and d are the parameters used to determine the PROF adjustment motion vector Δv(x, y) from horizontal and vertical offsets x and y from a pixel location to the center of a sub-block in a current CU; and otherwise, if the min_parameter is above the threshold value, the PROF is capable of being applied to the affine coded block.
- The method of claim 1, further comprising: receiving a syntax element indicating whether the PROF is enabled for affine prediction, wherein the syntax element is signaled at a sequence level, a slice level, a tile level, a tile group level, or a picture level.
- The method of claim 1, wherein the generating the spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) includes: generating the spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) based on a first prediction sample(s) of a first sub-block including the prediction sample I(i, j) and a second prediction sample(s) of a second sub-block neighboring the first sub-block, the first sub-block and the second sub-block being partitioned from the affine coded block.
- The method of claim 1, wherein the generating the spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) includes: performing inter prediction for sub-blocks of the affine coded block; and generating spatial gradients at sample locations on the basis of prediction samples of the entire affine coded block.
- The method of claim 1, wherein the generating the spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) includes: generating the spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) using a generated gradient filter on reference samples in a reference picture of the affine coded block.
- The method of claim 5, wherein the generated gradient filter is generated by a convolution of a first gradient filter and an interpolation filter, wherein applying the interpolation filter on the reference samples in the reference picture of the affine coded block generates prediction samples of the affine coded block, and subsequently applying the first gradient filter on the generated prediction samples of the affine coded block generates the spatial gradients g_x(i, j) and g_y(i, j).
- The method of claim 5, wherein the gradient filter is a 10-tap gradient filter generated by a convolution of a 3-tap PROF gradient filter with taps of [1 0 -1] and an 8-tap interpolation filter used for inter prediction.
- The method of claim 5, wherein the gradient filter is generated by: generating a 10-tap gradient filter by a convolution of a 3-tap PROF gradient filter and an 8-tap interpolation filter used for inter prediction; and truncating the 10-tap gradient filter to an 8-tap filter by removing one coefficient from each side thereof.
- A method of video decoding by a video decoder, comprising: determining whether to apply a prediction refinement with optical flow (PROF) to an affine coded block; and responsive to a determination to apply the PROF to the affine coded block, generating a prediction sample I(i, j) at a sample location (i, j) in the affine coded block, generating spatial gradients g_x(i, j) and g_y(i, j) at the sample location (i, j) in the affine coded block, generating a prediction refinement ΔI(i, j) based on the spatial gradients g_x(i, j) and g_y(i, j), and adding the prediction refinement ΔI(i, j) to the prediction sample I(i, j) to generate a refined prediction sample; characterised in that the determining whether to apply the PROF to the affine coded block is based on values of affine parameters of an affine model of the affine coded block, comprising: when a minimum absolute value of affine parameters a, b, c, or d, denoted as min_parameter = min{|a|, |b|, |c|, |d|}, is below or equal to a predefined threshold value, the PROF for affine is not applied for the affine coded block, whereby a, b, c and d are the parameters used to determine the PROF adjustment motion vector Δv(x, y) from horizontal and vertical offsets x and y from a pixel location to the center of a sub-block in a current CU; and otherwise, if the min_parameter is above the threshold value, the PROF is capable of being applied to the affine coded block.
- An apparatus, comprising circuitry configured to perform the method according to any one of claims 1 to 9.
- A non-transitory computer readable medium having instructions stored therein, which when executed by a processor in an apparatus cause the processor to execute the method according to any one of claims 1 to 9.
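The gating condition and per-sample refinement recited in claims 1 and 9 can be sketched as follows. This is a minimal illustration, not the claimed implementation: the helper name `prof_refine` is hypothetical, and it assumes the common PROF formulation in which the adjustment motion vector at offset (x, y) from the sub-block center is Δv(x, y) = (a·x + b·y, c·x + d·y) and the refinement is ΔI(i, j) = g_x(i, j)·Δv_x + g_y(i, j)·Δv_y. Fixed-point scaling, clipping, and bit-depth handling used by a real codec are omitted.

```python
def prof_refine(pred, gx, gy, a, b, c, d, threshold, sb_w=4, sb_h=4):
    """Refine one sub-block's prediction samples with PROF, gated on the
    affine parameters a, b, c, d as in claim 1 (hypothetical helper)."""
    # Gating: when the smallest |parameter| is at or below the threshold,
    # PROF is not applied and the prediction is returned unchanged.
    if min(abs(a), abs(b), abs(c), abs(d)) <= threshold:
        return [row[:] for row in pred]

    refined = []
    for i in range(sb_h):
        row = []
        for j in range(sb_w):
            # Horizontal/vertical offsets from this sample to the
            # sub-block center.
            x = j - (sb_w - 1) / 2.0
            y = i - (sb_h - 1) / 2.0
            # Assumed PROF adjustment MV: dv(x, y) = (a*x + b*y, c*x + d*y).
            dvx = a * x + b * y
            dvy = c * x + d * y
            # dI(i, j) = gx(i, j) * dvx + gy(i, j) * dvy, added to I(i, j).
            row.append(pred[i][j] + gx[i][j] * dvx + gy[i][j] * dvy)
        refined.append(row)
    return refined
```

For example, with a = b = c = d = 1 and threshold 0, PROF is applied; with any parameter equal to 0 and threshold 0, the gating condition min{|a|, |b|, |c|, |d|} ≤ 0 holds and the block is left unrefined.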
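The combined gradient filter of claims 7 and 8 follows directly from the associativity of convolution: convolving the 3-tap PROF gradient filter [1 0 -1] with an 8-tap interpolation filter yields a single 10-tap filter, which can then be truncated to 8 taps by dropping one coefficient from each side. A small sketch, using the HEVC-style half-pel 8-tap luma taps purely as an illustrative interpolation filter (an assumption; the claims do not fix the taps):

```python
def convolve(f, g):
    """Full 1-D convolution of two FIR filters (pure Python)."""
    out = [0] * (len(f) + len(g) - 1)
    for i, fi in enumerate(f):
        for j, gj in enumerate(g):
            out[i + j] += fi * gj
    return out

prof_grad = [1, 0, -1]                      # 3-tap PROF gradient filter (claim 7)
interp8 = [-1, 4, -11, 40, 40, -11, 4, -1]  # illustrative 8-tap interpolation filter
grad10 = convolve(prof_grad, interp8)       # 10-tap combined gradient filter
grad8 = grad10[1:-1]                        # claim 8: drop one coefficient per side
```

Because [1 0 -1] sums to zero, the combined filter also sums to zero, preserving the gradient (difference) character of the filter regardless of the interpolation taps chosen.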
Description
The present application claims the benefit of U.S. Patent Application No. 16/822,075, "Method and Apparatus for Video Coding", filed on March 18, 2020, which claims the benefit of U.S. Provisional Application No. 62/820,196, "Affine Inter Prediction Refinement Methods", filed on March 18, 2019, No. 62/828,425, "LIC Signaling and Affine Refinement", filed on April 2, 2019, and No. 62/838,798, "Inter Prediction Refinement Methods", filed on April 25, 2019.
TECHNICAL FIELD
The present disclosure describes embodiments generally related to video coding.
BACKGROUND
The background description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent the work is described in this background section, as well as aspects of the description that may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present disclosure.
Video coding and decoding can be performed using inter-picture prediction with motion compensation. Uncompressed digital video can include a series of pictures, each picture having a spatial dimension of, for example, 1920 x 1080 luminance samples and associated chrominance samples. The series of pictures can have a fixed or variable picture rate (informally also known as frame rate) of, for example, 60 pictures per second or 60 Hz. Uncompressed video has significant bitrate requirements. For example, 1080p60 4:2:0 video at 8 bits per sample (1920x1080 luminance sample resolution at 60 Hz frame rate) requires close to 1.5 Gbit/s of bandwidth. An hour of such video requires more than 600 GBytes of storage space. One purpose of video coding and decoding can be the reduction of redundancy in the input video signal, through compression. Compression can help reduce the aforementioned bandwidth or storage space requirements, in some cases by two orders of magnitude or more.
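The bandwidth figures quoted above follow from straightforward arithmetic: in 4:2:0 subsampling, the two chroma planes each carry a quarter of the luma samples, so each frame holds 1.5x the luma sample count. A quick check:

```python
width, height, fps, bits_per_sample = 1920, 1080, 60, 8

# 4:2:0: two quarter-resolution chroma planes add half the luma samples.
samples_per_frame = width * height * 3 // 2

bitrate = samples_per_frame * bits_per_sample * fps  # bits per second
hour_bytes = bitrate * 3600 // 8                     # bytes per hour

print(bitrate / 1e9)     # ~1.49 Gbit/s, i.e. "close to 1.5 Gbit/s"
print(hour_bytes / 1e9)  # ~672 GB, i.e. "more than 600 GBytes"
```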
Both lossless and lossy compression, as well as a combination thereof, can be employed. Lossless compression refers to techniques where an exact copy of the original signal can be reconstructed from the compressed original signal. When using lossy compression, the reconstructed signal may not be identical to the original signal, but the distortion between the original and reconstructed signals is small enough to make the reconstructed signal useful for the intended application. In the case of video, lossy compression is widely employed. The amount of distortion tolerated depends on the application; for example, users of certain consumer streaming applications may tolerate higher distortion than users of television distribution applications. The compression ratio achievable can reflect that: higher allowable/tolerable distortion can yield higher compression ratios.
Motion compensation can be a lossy compression technique and can relate to techniques where a block of sample data from a previously reconstructed picture or part thereof (reference picture), after being spatially shifted in a direction indicated by a motion vector (MV henceforth), is used for the prediction of a newly reconstructed picture or picture part. In some cases, the reference picture can be the same as the picture currently under reconstruction. MVs can have two dimensions, X and Y, or three dimensions, the third being an indication of the reference picture in use (the latter, indirectly, can be a time dimension). In some video compression techniques, an MV applicable to a certain area of sample data can be predicted from other MVs, for example from those related to another area of sample data spatially adjacent to the area under reconstruction and preceding that MV in decoding order. Doing so can substantially reduce the amount of data required for coding the MV, thereby removing redundancy and increasing compression.
MV prediction can work effectively, for example, because when coding an input video signal derived from a camera (known as natural video) there is a statistical likelihood that areas larger than the area to which a single MV is applicable move in a similar direction and, therefore, can in some cases be predicted using a similar motion vector derived from the MVs of neighboring areas. That results in the MV found for a given area being similar to or the same as the MV predicted from the surrounding MVs, which in turn can be represented, after entropy coding, in a smaller number of bits than would be used if coding the MV directly. In some cases, MV prediction can be an example of lossless compression of a signal (namely: the MVs) derived from the original signal (namely: the sample stream). In other cases, MV prediction itself can be lossy, for example because of rounding errors when calculating a predictor from several surrounding MVs. Various MV prediction mechanisms are described in H.265/HEVC (ITU-T Rec. H.265, "High Efficiency Video Coding", December 2016). Out of the many MV prediction mechanisms that H.265