EP-4736433-A1 - NEURAL NETWORK-BASED IN-LOOP FILTER ARCHITECTURES WITH LOCALIZED MULTI-SCALE FEATURE EXTRACTION FOR VIDEO CODING
Abstract
A device for decoding video data determines a block of a picture; applies a neural network (NN)-based filter process to the block to generate a filtered block, wherein to apply the NN-based filter process, the device performs a first feature extraction on pixel data of the block at a first scale to generate a first set of extracted features for the block; and performs a second feature extraction on the pixel data of the block at a second scale to generate a second set of extracted features for the block, wherein the first scale is different than the second scale; and generates the filtered block based on the first set of extracted features and the second set of extracted features.
Inventors
- RUSANOVSKYY, DMYTRO
- LI, YUN
- KARCZEWICZ, MARTA
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2024-06-28
Claims (20)
- 1. A method of decoding encoded video data, the method comprising: determining, from the encoded video data, a block of a picture; applying a neural network (NN)-based filter process to the block to generate a filtered block, wherein applying the NN-based filter process comprises: performing a first feature extraction on pixel data of the block at a first scale to generate a first set of extracted features for the block; performing a second feature extraction on the pixel data of the block at a second scale to generate a second set of extracted features for the block, wherein the first scale is different than the second scale; and generating the filtered block based on the first set of extracted features and the second set of extracted features; determining a decoded version of the block based on the filtered block; and outputting a decoded version of the picture comprising the decoded version of the block.
- 2. The method of claim 1, wherein applying the NN-based filter process comprises: performing a third feature extraction on the pixel data of the block at a third scale to generate a third set of extracted features for the block, wherein the first scale is different than the second scale and the third scale, and the second scale is different than the third scale.
- 3. The method of claim 1, wherein: performing the first feature extraction on the block at the first scale comprises applying a first convolution filter with a first support size; and performing the second feature extraction on the block at the second scale comprises applying a second convolution filter with a second support size, wherein the first support size is different than the second support size.
- 4. The method of claim 1, wherein: performing the first feature extraction on the block at the first scale comprises applying a first set of cascading convolution filters; and performing the second feature extraction on the block at the second scale comprises applying a second set of cascading convolution filters.
- 5. The method of claim 4, wherein each of the cascading convolution filters of the first set has a first support size and each of the cascading convolution filters of the second set has the first support size.
- 6. The method of claim 1, further comprising: inputting the first set of extracted features for the block into a first parametric rectified linear unit (PReLU) layer; and inputting the second set of extracted features for the block into a second PReLU layer.
- 7. The method of claim 1, wherein the block comprises a reconstructed block and determining the block of the picture comprises adding a prediction block to a residual block.
- 8. The method of claim 1, wherein the first scale is 3x3 and the second scale is 5x5.
- 9. The method of claim 1, wherein the method of decoding is performed as part of a video encoding process.
- 10. A device for decoding encoded video data, the device comprising: a memory configured to store the encoded video data; one or more processors implemented in circuitry and configured to: determine, from the encoded video data, a block of a picture; apply a neural network (NN)-based filter process to the block to generate a filtered block, wherein to apply the NN-based filter process, the one or more processors are further configured to: perform a first feature extraction on pixel data of the block at a first scale to generate a first set of extracted features for the block; perform a second feature extraction on the pixel data of the block at a second scale to generate a second set of extracted features for the block, wherein the first scale is different than the second scale; and generate the filtered block based on the first set of extracted features and the second set of extracted features; determine a decoded version of the block based on the filtered block; and output a decoded version of the picture comprising the decoded version of the block.
- 11. The device of claim 10, wherein to apply the NN-based filter process, the one or more processors are further configured to: perform a third feature extraction on the pixel data of the block at a third scale to generate a third set of extracted features for the block, wherein the first scale is different than the second scale and the third scale, and the second scale is different than the third scale.
- 12. The device of claim 10, wherein: to perform the first feature extraction on the block at the first scale, the one or more processors are further configured to apply a first convolution filter with a first support size; and to perform the second feature extraction on the block at the second scale, the one or more processors are further configured to apply a second convolution filter with a second support size, wherein the first support size is different than the second support size.
- 13. The device of claim 10, wherein: to perform the first feature extraction on the block at the first scale, the one or more processors are further configured to apply a first set of cascading convolution filters; and to perform the second feature extraction on the block at the second scale, the one or more processors are further configured to apply a second set of cascading convolution filters.
- 14. The device of claim 13, wherein each of the cascading convolution filters of the first set has a first support size and each of the cascading convolution filters of the second set has the first support size.
- 15. The device of claim 10, wherein the one or more processors are further configured to: input the first set of extracted features for the block into a first parametric rectified linear unit (PReLU) layer; and input the second set of extracted features for the block into a second PReLU layer.
- 16. The device of claim 10, wherein the block comprises a reconstructed block and to determine the block of the picture, the one or more processors are further configured to add a prediction block to a residual block.
- 17. The device of claim 10, wherein the first scale is 3x3 and the second scale is 5x5.
- 18. The device of claim 10, further comprising a display configured to display decoded video data.
- 19. The device of claim 10, wherein the device comprises one or more of a camera, a computer, a mobile device, a broadcast receiver device, or a set-top box.
- 20. A computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to: determine, from encoded video data, a block of a picture; apply a neural network (NN)-based filter process to the block to generate a filtered block, wherein to apply the NN-based filter process, the instructions cause the one or more processors to: perform a first feature extraction on pixel data of the block at a first scale to generate a first set of extracted features for the block; perform a second feature extraction on the pixel data of the block at a second scale to generate a second set of extracted features for the block, wherein the first scale is different than the second scale; and generate the filtered block based on the first set of extracted features and the second set of extracted features; determine a decoded version of the block based on the filtered block; and output a decoded version of the picture comprising the decoded version of the block.
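Claims 3 through 6 above describe concrete ways of realizing the two scales: a single convolution per scale with differing support sizes (claim 3), or a set of cascading convolutions per scale that all share one support size (claims 4 and 5), with a parametric rectified linear unit (PReLU) layer applied per feature set (claim 6). For illustration only, and not as part of the claimed subject matter, the following minimal sketch combines a single 3x3 branch with a cascaded 3x3 branch, assuming a PyTorch-style framework; the name MultiScaleHead, the channel widths, and the layer counts are hypothetical choices, not values taken from the application:

```python
import torch
import torch.nn as nn

class MultiScaleHead(nn.Module):
    """Illustrative head block with two feature-extraction branches at
    different scales (cf. claims 1 and 3-6). Channel counts are arbitrary
    example values, not values from the application."""

    def __init__(self, in_ch: int = 1, feat_ch: int = 16):
        super().__init__()
        # First scale: a single 3x3 convolution followed by its own
        # PReLU layer (claims 3 and 6).
        self.branch_a = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.PReLU(feat_ch),
        )
        # Second scale: a set of cascading convolution filters that all
        # share the first support size (claims 4-5); two cascaded 3x3
        # convolutions yield a 5x5 effective receptive field.
        self.branch_b = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1),
            nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.PReLU(feat_ch),
        )

    def forward(self, block: torch.Tensor) -> torch.Tensor:
        f1 = self.branch_a(block)  # first set of extracted features
        f2 = self.branch_b(block)  # second set of extracted features
        # The filtered block is generated downstream from both feature
        # sets; here the head simply fuses them along the channel axis.
        return torch.cat([f1, f2], dim=1)

# Example: extract multi-scale features from a 64x64 single-channel block.
features = MultiScaleHead()(torch.randn(1, 1, 64, 64))  # -> (1, 32, 64, 64)
```

Two cascaded 3x3 convolutions cover the same 5x5 support as one 5x5 convolution while using fewer weights per channel pair (2 x 9 = 18 versus 25), which is one reason cascading filters of a single support size can reduce complexity.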
Description
NEURAL NETWORK-BASED IN-LOOP FILTER ARCHITECTURES WITH LOCALIZED MULTI-SCALE FEATURE EXTRACTION FOR VIDEO CODING

[0001] This application claims priority to U.S. Patent Application No. 18/756,952, filed 27 June 2024, and U.S. Provisional Patent Application No. 63/511,546, filed 30 June 2023, the entire content of each of which is incorporated herein by reference. U.S. Patent Application No. 18/756,952, filed 27 June 2024, claims the benefit of U.S. Provisional Patent Application No. 63/511,546, filed 30 June 2023.

TECHNICAL FIELD

[0002] This disclosure relates to video encoding and video decoding.

BACKGROUND

[0003] Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called "smart phones," video teleconferencing devices, video streaming devices, and the like. Digital video devices implement video coding techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (AVC), ITU-T H.265/High Efficiency Video Coding (HEVC), ITU-T H.266/Versatile Video Coding (VVC), and extensions of such standards, as well as proprietary video codecs/formats such as AOMedia Video 1 (AV1), which was developed by the Alliance for Open Media. The video devices may transmit, receive, encode, decode, and/or store digital video information more efficiently by implementing such video coding techniques.

[0004] Video coding techniques include spatial (intra-picture) prediction and/or temporal (inter-picture) prediction to reduce or remove redundancy inherent in video sequences. For block-based video coding, a video slice (e.g., a video picture or a portion of a video picture) may be partitioned into video blocks, which may also be referred to as coding tree units (CTUs), coding units (CUs) and/or coding nodes. Video blocks in an intra-coded (I) slice of a picture are encoded using spatial prediction with respect to reference samples in neighboring blocks in the same picture. Video blocks in an inter-coded (P or B) slice of a picture may use spatial prediction with respect to reference samples in neighboring blocks in the same picture or temporal prediction with respect to reference samples in other reference pictures. Pictures may be referred to as frames, and reference pictures may be referred to as reference frames.

SUMMARY

[0005] This disclosure describes simplifications that may be applied to NN-based filtering techniques while also maintaining coding quality. For example, in accordance with the techniques of this disclosure, a video coder may be configured to perform an NN-based filter process that includes performing a first feature extraction on pixel data of the block at a first scale to generate a first set of extracted features for the block and performing a second feature extraction on the pixel data of the block at a second scale to generate a second set of extracted features for the block, with the first scale being different than the second scale. By performing multi-scale feature extraction on input pixel data, e.g., in a head block of the filter, the amount of multi-scale processing performed in other parts of the filter, e.g., in a backbone, may be reduced, thus reducing the overall complexity of the filter. In this manner, the techniques of this disclosure may improve the performance of a video coding device. Likewise, these techniques may enable many more devices to perform NN-based filtering, thereby improving the field of video coding generally.
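To make the head/backbone split of paragraph [0005] concrete, the following sketch (again illustrative only, assuming a PyTorch-style framework) uses the single-convolution alternative in which the first scale is 3x3 and the second scale is 5x5; the name NNLoopFilter, the residual connection, and all widths are assumptions rather than details taken from this disclosure:

```python
import torch
import torch.nn as nn

class NNLoopFilter(nn.Module):
    """Illustrative NN-based in-loop filter: a multi-scale head block feeds
    a single-scale backbone. All layer counts and widths are hypothetical."""

    def __init__(self, in_ch: int = 1, feat_ch: int = 16):
        super().__init__()
        # Head block: first feature extraction at a 3x3 scale, second at a
        # 5x5 scale, each followed by its own PReLU layer.
        self.scale1 = nn.Conv2d(in_ch, feat_ch, kernel_size=3, padding=1)
        self.scale2 = nn.Conv2d(in_ch, feat_ch, kernel_size=5, padding=2)
        self.act1 = nn.PReLU(feat_ch)
        self.act2 = nn.PReLU(feat_ch)
        # Backbone: uniform 3x3 convolutions only; the multi-scale work was
        # already done in the head, so no multi-scale processing is needed here.
        self.backbone = nn.Sequential(
            nn.Conv2d(2 * feat_ch, feat_ch, kernel_size=3, padding=1),
            nn.PReLU(feat_ch),
            nn.Conv2d(feat_ch, in_ch, kernel_size=3, padding=1),
        )

    def forward(self, recon_block: torch.Tensor) -> torch.Tensor:
        f1 = self.act1(self.scale1(recon_block))  # first set of features
        f2 = self.act2(self.scale2(recon_block))  # second set of features
        fused = torch.cat([f1, f2], dim=1)
        # Generate the filtered block from both feature sets; the residual
        # connection (an assumption, common in loop-filter designs) refines
        # the reconstructed block rather than replacing it.
        return recon_block + self.backbone(fused)

# Example: filter one 64x64 single-channel reconstructed block.
filtered = NNLoopFilter()(torch.randn(1, 1, 64, 64))
```

Because both scales are handled once in the head, the backbone can consist of uniform single-scale convolutions, which is the complexity reduction described in paragraph [0005].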
[0006] According to an example of this disclosure, a method of decoding encoded video data includes: determining, from the encoded video data, a block of a picture; applying a neural network (NN)-based filter process to the block to generate a filtered block, wherein applying the NN-based filter process comprises: performing a first feature extraction on pixel data of the block at a first scale to generate a first set of extracted features for the block; performing a second feature extraction on the pixel data of the block at a second scale to generate a second set of extracted features for the block, wherein the first scale is different than the second scale; and generating the filtered block based on the first set of extracted features and the second set of extracted features; determining a decoded version of the block based on the filtered block; and outputting a decoded version of the picture comprising the decoded version of the block.

[0007] According to an example of this disclosure, a device for decoding encoded video data includes: a memory configured to store the