EP-4740482-A1 - LEARNING TRANSFORM COEFFICIENTS
Abstract
A method of applying filtering is provided. The method includes providing a neural network having a convolutional layer and one or more additional layers for filtering a block of image samples of spatial size XxY. The method includes providing a set of inputs, at least one input i of spatial size of XxY. The method includes, for the at least one input i in the set of inputs: (i) forming AxB sub-blocks of size MxN from the input; (ii) determining at least two coefficients for at least one of the AxB sub-blocks based on the corresponding sub-block, with at least one of the at least two coefficients based on at least two input values; and (iii) re-arranging the at least two coefficients to form a re-sized input of size of AxBxC(i), where C(i) ≤ (M*N). The method includes applying the neural network using the re-sized input to generate an output.
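The re-arrangement in steps (i) and (iii) can be illustrated for the simplest case C(i) = M*N, where each of the M*N samples of a sub-block becomes its own channel. The following is a hypothetical NumPy sketch of that re-arrangement and its inverse (the claims refer to PixelUnshuffle()/PixelShuffle() functions for this purpose); it is an illustration, not the claimed implementation:

```python
import numpy as np

def pixel_unshuffle(x, m, n):
    """Form A x B sub-blocks of size M x N from an X x Y input and move each
    sub-block's M*N samples into M*N channels (output shape (M*N, A, B))."""
    big_x, big_y = x.shape
    a, b = big_x // m, big_y // n
    blocks = x.reshape(a, m, b, n)                  # (A, M, B, N) sub-block grid
    return blocks.transpose(1, 3, 0, 2).reshape(m * n, a, b)

def pixel_shuffle(c, m, n):
    """Inverse re-arrangement: (M*N, A, B) channels back to one X x Y plane."""
    _, a, b = c.shape
    return c.reshape(m, n, a, b).transpose(2, 0, 3, 1).reshape(a * m, b * n)

# Round trip on a 4x4 block with 2x2 sub-blocks: a 2x2 grid of 4 channels.
x = np.arange(16, dtype=float).reshape(4, 4)
u = pixel_unshuffle(x, 2, 2)
assert u.shape == (4, 2, 2)
assert np.array_equal(pixel_shuffle(u, 2, 2), x)
```

Each output channel collects the samples at one fixed position within every sub-block, so the spatial resolution drops by M x N while the channel count grows by the same factor.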
Inventors
- LIU, Du
- DAMGHANIAN, Mitra
- STRÖM, Jacob
- WENNERSTEN, Per
Assignees
- Telefonaktiebolaget LM Ericsson (publ)
Dates
- Publication Date
- 20260513
- Application Date
- 20240701
Claims (20)
- 1. A method of applying filtering during video encoding and/or decoding using a neural network, the method comprising: providing (s1302) a neural network having a convolutional layer and one or more additional layers for filtering a block of image samples of spatial size XxY, for integers X and Y; providing (s1304) a set of inputs, at least one input i of spatial size of XxY; for the at least one input i in the set of inputs (s1306): (i) forming (s1308) AxB sub-blocks of size MxN from the input, for integers A, B, M, and N; (ii) determining (s1310) at least two coefficients for at least one of the AxB sub-blocks based on the corresponding sub-block, wherein at least one of the at least two coefficients is based on at least two input values in the set of inputs; and (iii) re-arranging (s1312) the at least two coefficients for at least one of the AxB sub-blocks to form a re-sized input of size of AxBxC(i), where C(i) ≤ (M*N); and applying (s1314) the neural network using the at least one re-sized input in the set of inputs to generate an output.
- 2. The method of claim 1, wherein for at least one input i in the set of inputs, C(i) = (M*N) such that all the (M*N) coefficients are used in the re-sized input.
- 3. The method of claim 1, wherein for at least one input i in the set of inputs, C(i) < (M*N).
- 4. The method of any one of claims 1-3, wherein the convolutional layer of the neural network receives the at least one re-sized input in the set of inputs, and wherein a kernel in the convolutional layer of one of the C(i) channels belonging to the input i has a kernel size that is the same size as each other kernel in the convolutional layer for each other C(i) channel.
- 5. The method of any one of claims 1-3, wherein the convolutional layer of the neural network receives the at least one re-sized input in the set of inputs, and wherein a kernel in the convolutional layer of one of the C(i) channels belonging to the input i has a kernel size that is different from at least one other kernel in the convolutional layer for at least one other C(i) channel.
- 6. The method of any one of claims 1-5, wherein the block of image samples and at least one input in the set of inputs have a third non-spatial dimension of size Z > 1, and the steps (i)-(iii) are performed for at least one input i in the set of inputs and for one or more channels c of the input i, where the channels c correspond to the third non-spatial dimension of size Z.
- 7. The method of any one of claims 1-6, further comprising re-arranging the output from a size of (X/M)x(Y/N)x(M*N) to a size of XxY.
- 8. The method of any one of claims 1-7, further comprising re-arranging the output from a size of A'xB'xC'(i), where C'(i) ≤ (M*N), to a single channel output.
- 9. The method of any one of claims 1-8, wherein for at least one input i in the set of inputs, re-arranging the at least two coefficients for at least one of the AxB sub-blocks to form a re-sized input comprises using a PixelUnshuffle() function.
- 10. The method of any one of claims 7-9, wherein re-arranging the output comprises using a PixelShuffle() function.
- 11. The method of any one of claims 1-10, wherein, for at least one input i in the set of inputs, determining the at least two coefficients for at least one of the AxB sub-blocks comprises applying a frequency-domain transformation of size MxN to the corresponding AxB sub-block, resulting in the at least two coefficients belonging to one of (M*N) frequency bands; and re-arranging the at least two coefficients for at least one of the AxB sub-blocks to form a re-sized input of size of AxBxC(i) comprises putting the at least two coefficients for at least one of the AxB sub-blocks into channels based on a frequency band of the (M*N) frequency bands that the at least two coefficients belong to.
- 12. The method of claim 11, further comprising applying an inverse transformation of size MxN to the output of the neural network, wherein the inverse transformation is an inverse operation to the frequency-domain transformation.
- 13. The method of any one of claims 1-12, wherein the set of inputs includes one or more of: (i) reconstructed samples before deblocking, (ii) prediction samples, (iii) block boundary strength information, (iv) a quantization parameter, and (v) information on whether a particular sample was intra-predicted, uni-predicted, or bi-predicted.
- 14. The method of any one of claims 11-12, wherein the frequency-domain transformation comprises one of a discrete cosine transform, a discrete sine transform, and a discrete wavelet transform.
- 15. The method of any one of claims 1-14, wherein both M and N equal one of 2, 4, and 8.
- 16. The method of any one of claims 1-15, wherein the method is applied during an in-loop filter.
- 17. The method of any one of claims 1-15, wherein the method is applied during a post-processing filter.
- 18. A method of applying filtering during video encoding and/or decoding using a neural network, the method comprising: providing (s1402) a neural network having a convolutional layer and one or more additional layers for filtering a block of image samples of spatial size XxY, for integers X and Y; providing (s1404) a set of inputs, at least one input of spatial size of XxY; applying (s1406) the neural network using the set of inputs to generate an output, wherein applying the neural network comprises, for a first layer having an input signal x and second layer having an output signal y in the one or more additional layers of the neural network: (i) forming (s1408) AxB sub-blocks of size MxN from the input signal x, for integers A, B, M, and N; (ii) determining (s1410) at least two coefficients for at least one of the AxB sub-blocks based on the corresponding sub-block, wherein at least one of the at least two coefficients is based on at least two input values from the input signal x; and (iii) re-arranging (s1412) the at least two coefficients for at least one of the AxB sub-blocks to form a re-sized input signal x of size of AxBx(M*N); and (iv) feeding (s1414) the re-sized input signal x into the first layer; and (v) re-arranging (s1416) the output signal y of the second layer from a size of A'xB'xC'(i), where C'(i) ≤ (M*N), to a single channel output.
- 19. The method of claim 18, wherein re-arranging the at least two coefficients for at least one of the AxB sub-blocks to form a re-sized input signal x comprises using a PixelUnshuffle() function.
- 20. The method of any one of claims 18-19, wherein re-arranging the output signal y of the second layer comprises using a PixelShuffle() function.
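Claims 11-12 determine the coefficients by applying a frequency-domain transform of size MxN to each sub-block and sorting the resulting coefficients into channels by frequency band. A minimal NumPy sketch, assuming an orthonormal DCT-II as the transform (one of the options in claim 14) and channel truncation to model C(i) < M*N (claim 3); the function names and the truncation rule are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)
    mat = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    mat[0] /= np.sqrt(2)
    return mat * np.sqrt(2.0 / n)

def transform_to_channels(x, m, n, keep=None):
    """Apply an M x N DCT to each M x N sub-block of x and place each of the
    M*N frequency bands into its own channel (output shape (M*N, A, B)).
    Optionally keep only the first `keep` channels (lowest frequencies first
    under this channel ordering) so that C(i) < M*N."""
    a, b = x.shape[0] // m, x.shape[1] // n
    dm, dn = dct_matrix(m), dct_matrix(n)
    blocks = x.reshape(a, m, b, n).transpose(0, 2, 1, 3)   # (A, B, M, N)
    coeffs = dm @ blocks @ dn.T                            # per-block 2-D DCT
    chans = coeffs.transpose(2, 3, 0, 1).reshape(m * n, a, b)
    return chans if keep is None else chans[:keep]

# A flat 4x4 input concentrates all energy in the DC band: channel 0 holds
# the DC coefficient of each 2x2 block (2.0 for an all-ones block under the
# orthonormal DCT) and every higher-frequency channel is zero.
flat = np.ones((4, 4))
chans = transform_to_channels(flat, 2, 2)
assert np.allclose(chans[0], 2.0)
assert np.allclose(chans[1:], 0.0)
```

Grouping coefficients by frequency band means each input channel to the convolutional layer carries one frequency component of every sub-block, which is what lets the network weight frequency bands differently (and lets low-information bands be dropped when C(i) < M*N).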
Description
LEARNING TRANSFORM COEFFICIENTS

TECHNICAL FIELD

[001] This disclosure relates to video compression, and more particularly, to learning transform coefficients for such video compression.

BACKGROUND

[002] Video compression

[003] Video is the dominant form of data traffic in today's networks, and its share is projected to increase further [1]. One way to reduce the data traffic from video is compression. Here the source video is encoded to a bitstream, which can then be stored and transmitted to end users. Using a decoder, the end user can extract the video data and display it on a screen. However, since the encoder may not know what kind of device the encoded bitstream will be sent to, it has to compress the video to a predetermined format, such as the standardized format VVC. This way, all devices that support the chosen standard can decode the video. Compression can be lossless, i.e., the decoded video is identical to the source given to the encoder, or lossy, where a certain degradation of content is accepted. Lossy compression allows for significantly lower bit rates, i.e., a much higher compression ratio. This is because reproducing image noise perfectly can make lossless compression quite expensive.

[004] A video sequence contains a sequence of pictures. A color space commonly used in video sequences is YCbCr, where Y is the luma (brightness) component and Cb and Cr are the chroma components. The Cb and Cr components are sometimes called U and V. Other color spaces are also used, such as ICtCp, IPT, constant-luminance YCbCr, RGB, YCoCg, etc., and the embodiments disclosed herein are applicable in these cases as well. The order in which the pictures are placed in the video sequence when viewed is called 'display order'. Each picture is assigned a Picture Order Count (POC) value to indicate its position in terms of display order. In this document the terms 'images', 'pictures', and 'frames' are used interchangeably.
[005] Video compression is used to compress video sequences into a sequence of coded pictures. In many existing video codecs, the picture is divided into blocks of different sizes. A block is a two-dimensional array of samples. The blocks serve as the basis for coding. A video decoder then decodes the coded pictures into pictures containing sample values.

[006] Commonly used video coding standards

[007] Video standards are usually developed by international organizations, as these represent different companies and research institutes with different areas of expertise and interests. Currently, the most widely applied video compression standard is H.264/AVC, which was jointly developed by ITU-T and ISO. The first version of H.264/AVC was finalized in 2003, with several updates in the following years. The successor of H.264/AVC, also developed by ITU-T and ISO, is known as H.265/HEVC (High Efficiency Video Coding) and was finalized in 2013. MPEG and ITU-T have created a successor to HEVC within the Joint Video Experts Team (JVET). The name of this video codec is Versatile Video Coding (VVC), and version 1 of the VVC specification has been published as Rec. ITU-T H.266 | ISO/IEC 23090-3, "Versatile Video Coding", 2020.

[008] The VVC video coding standard is a block-based video codec that utilizes both temporal and spatial prediction. Spatial prediction is achieved using intra (I) prediction from within the current picture. Temporal prediction is achieved using uni-directional (P) or bi-directional (B) inter prediction at the block level from previously decoded reference pictures. In the encoder, the difference between the original pixel data and the predicted pixel data, referred to as the residual, is transformed into the frequency domain, quantized, and then entropy coded before being transmitted together with necessary prediction parameters, such as the prediction mode and motion vectors, which are also entropy coded.
The decoder performs entropy decoding, inverse quantization, and inverse transformation to obtain the residual, and then adds the residual to the intra or inter prediction to reconstruct a picture.

[009] The VVC video coding standard uses a block structure referred to as quadtree plus binary tree plus ternary tree block structure (QTBT+TT), where each picture is first partitioned into square blocks called coding tree units (CTUs). All CTUs are of the same size, and the partitioning of the picture into CTUs is done without any syntax controlling it. Each CTU is further partitioned into coding units (CUs) that can have either square or rectangular shapes. The CTU is first partitioned by a quad tree structure, and then it may be further partitioned with equally sized partitions, either vertically or horizontally, in a binary structure to form coding units (CUs). A block can thus have either a square or rectangular shape. The depth of the quad tree and binary tree can be set by the encoder in the bitstream. The ternary tree (TT) part adds the possibility to divide a CU into three partit