US-12627795-B2 - Methods for complexity reduction of neural network based video coding tools

Abstract

A video coder is configured to perform a neural network (NN)-based filter process on reconstructed blocks of video data. In one example, a video coder may receive a picture of video data and reconstruct a block of the picture to generate a reconstructed block. The video coder may perform the NN-based filter process on the reconstructed block to generate a filtered block, wherein the NN-based filter process includes performing a plurality of separable convolutions to approximate a multi-dimensional convolution.

Inventors

  • Dmytro Rusanovskyy
  • Samuel James Eadie
  • Yun Li
  • Marta Karczewicz

Assignees

  • QUALCOMM INCORPORATED

Dates

Publication Date
2026-05-12
Application Date
2024-02-15

Claims (20)

  1. A method of coding video data, the method comprising: receiving a picture of video data; reconstructing a block of the picture of video data to generate a reconstructed block; and performing a neural network (NN)-based filter process on the reconstructed block to generate a filtered block, wherein the NN-based filter process includes performing a plurality of separable convolutions, in a backbone block of the NN-based filter process, to approximate a multi-dimensional convolution, and wherein performing the plurality of separable convolutions to approximate the multi-dimensional convolution comprises: receiving an input at the residual block; performing a 1×1×K×M convolution on the input; performing a PReLU layer on an output of the 1×1×K×M convolution; performing a 1×1×M×R convolution on an output of the PReLU layer; performing a 3×1×R×R separable convolution on an output of the 1×1×M×R convolution; performing a 1×3×R×R separable convolution on an output of the 3×1×R×R separable convolution; and performing a 1×1×R×K convolution on an output of the 1×3×R×R separable convolution. (An illustrative code sketch of this layer sequence follows the claims.)
  2. The method of claim 1, wherein the multi-dimensional convolution has a kernel size of n1×n2 in a spatial dimension, and a size of K in a depth dimension.
  3. The method of claim 1, wherein the backbone block is one of a residual block, a filter block, or an attention residual block.
  4. The method of claim 1, wherein the 1×1×M×R convolution is a fusion of a 1×1×M×K convolution and a 1×1×K×R convolution.
  5. The method of claim 1, wherein the NN-based filter process includes a cascaded application of the backbone block.
  6. The method of claim 1, wherein the NN-based filter process includes a cascaded application of the backbone block applied in two or more parallel processing branches.
  7. The method of claim 1, further comprising: applying an element-wise activation process as part of the multi-dimensional convolution.
  8. The method of claim 7, wherein the element-wise activation process is parametrically controlled.
  9. The method of claim 1, wherein performing the plurality of separable convolutions to approximate the multi-dimensional convolution comprises: performing the plurality of separable convolutions to approximate the multi-dimensional convolution in one or more of a feature extraction section, a fusion block, a transition block, a backbone block, or a tail section of the NN-based filter process.
  10. The method of claim 1, wherein coding comprises decoding, the method further comprising: using a decoded picture that includes the filtered block as a reference for prediction of other coded pictures.
  11. The method of claim 1, wherein coding comprises encoding, the method further comprising: capturing the picture of video data using a camera.
  12. An apparatus configured to code video data, the apparatus comprising: a memory configured to store a picture of video data; and processing circuitry in communication with the memory, the processing circuitry configured to: receive the picture of video data; reconstruct a block of the picture of video data to generate a reconstructed block; and perform a neural network (NN)-based filter process on the reconstructed block to generate a filtered block, wherein the NN-based filter process includes performing a plurality of separable convolutions, in a backbone block of the NN-based filter process, to approximate a multi-dimensional convolution, and wherein to perform the plurality of separable convolutions to approximate the multi-dimensional convolution, the processing circuitry is configured to: receive an input at the residual block; perform a 1×1×K×M convolution on the input; perform a PReLU layer on an output of the 1×1×K×M convolution; perform a 1×1×M×R convolution on an output of the PReLU layer; perform a 3×1×R×R separable convolution on an output of the 1×1×M×R convolution; perform a 1×3×R×R separable convolution on an output of the 3×1×R×R separable convolution; and perform a 1×1×R×K convolution on an output of the 1×3×R×R separable convolution.
  13. The apparatus of claim 12, wherein the multi-dimensional convolution is a 3×3 convolution.
  14. The apparatus of claim 12, wherein the backbone block is one of a residual block, a filter block, or an attention residual block.
  15. The apparatus of claim 12, wherein the 1×1×M×R convolution is a fusion of a 1×1×M×K convolution and a 1×1×K×R convolution.
  16. The apparatus of claim 12, wherein the NN-based filter process includes a cascaded application of the backbone block.
  17. The apparatus of claim 12, wherein the NN-based filter process includes a cascaded application of the backbone block applied in two or more parallel processing branches.
  18. The apparatus of claim 12, wherein the processing circuitry is further configured to: apply an element-wise activation process as part of the multi-dimensional convolution.
  19. The apparatus of claim 18, wherein the element-wise activation process is parametrically controlled.
  20. The apparatus of claim 12, wherein to perform the plurality of separable convolutions to approximate the multi-dimensional convolution, the processing circuitry is further configured to: perform the plurality of separable convolutions to approximate the multi-dimensional convolution in one or more of a feature extraction section, a fusion block, a transition block, a backbone block, or a tail section of the NN-based filter process.
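
For illustration only, the layer sequence recited in claim 1 (and mirrored in claim 12) can be sketched in code. This is a minimal, hypothetical PyTorch rendering, not the patented implementation: the module name, the channel sizes K, M, and R, the padding choices, and the residual skip connection are all assumptions, and an h×w×Cin×Cout convolution from the claims is modeled here as a 2-D convolution with an h×w kernel mapping Cin to Cout channels.

```python
# Hypothetical sketch of the claim-1 layer sequence (assumptions noted above).
import torch
import torch.nn as nn

class SeparableResidualBlock(nn.Module):
    """Approximates a 3x3 convolution with two separable 1-D convolutions."""

    def __init__(self, K: int, M: int, R: int):
        super().__init__()
        self.conv_in = nn.Conv2d(K, M, kernel_size=1)  # 1x1xKxM
        self.act = nn.PReLU(num_parameters=M)          # PReLU layer
        # 1x1xMxR: per claims 4 and 15, this may be the fusion of a 1x1xMxK
        # and a 1x1xKxR convolution (composing two pointwise convolutions is
        # itself a single pointwise convolution: W_fused = W_b @ W_a).
        self.conv_fused = nn.Conv2d(M, R, kernel_size=1)
        self.conv_3x1 = nn.Conv2d(R, R, kernel_size=(3, 1), padding=(1, 0))
        self.conv_1x3 = nn.Conv2d(R, R, kernel_size=(1, 3), padding=(0, 1))
        self.conv_out = nn.Conv2d(R, K, kernel_size=1)  # 1x1xRxK

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.conv_in(x)
        y = self.act(y)
        y = self.conv_fused(y)
        y = self.conv_3x1(y)  # vertical 1-D pass
        y = self.conv_1x3(y)  # horizontal 1-D pass; together the two passes
                              # approximate one dense 3x3 convolution
        y = self.conv_out(y)
        return x + y          # residual skip connection (an assumption; the
                              # claims do not recite it explicitly)

# Usage: filter a batch of K-channel feature maps derived from a
# reconstructed block (K, M, R are illustrative sizes, not from the patent).
block = SeparableResidualBlock(K=16, M=32, R=16)
out = block(torch.randn(1, 16, 64, 64))
```

Note how the 3×1 and 1×3 passes together cover the same 3×3 spatial support as the dense convolution they approximate, while the surrounding 1×1 convolutions shrink and restore the channel depth.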

Description

This application claims the benefit of U.S. Provisional Patent Application No. 63/485,862, filed Feb. 17, 2023, the entire content of which is incorporated by reference herein.

TECHNICAL FIELD

This disclosure relates to video encoding and video decoding.

BACKGROUND

Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called “smart phones,” video teleconferencing devices, video streaming devices, and the like. Digital video devices implement video coding techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (AVC), ITU-T H.265/High Efficiency Video Coding (HEVC), ITU-T H.266/Versatile Video Coding (VVC), and extensions of such standards, as well as proprietary video codecs/formats such as AOMedia Video 1 (AV1), developed by the Alliance for Open Media. Video devices may transmit, receive, encode, decode, and/or store digital video information more efficiently by implementing such video coding techniques.

Video coding techniques include spatial (intra-picture) prediction and/or temporal (inter-picture) prediction to reduce or remove redundancy inherent in video sequences. For block-based video coding, a video slice (e.g., a video picture or a portion of a video picture) may be partitioned into video blocks, which may also be referred to as coding tree units (CTUs), coding units (CUs), and/or coding nodes. Video blocks in an intra-coded (I) slice of a picture are encoded using spatial prediction with respect to reference samples in neighboring blocks in the same picture. Video blocks in an inter-coded (P or B) slice of a picture may use spatial prediction with respect to reference samples in neighboring blocks in the same picture or temporal prediction with respect to reference samples in other reference pictures. Pictures may be referred to as frames, and reference pictures may be referred to as reference frames.

SUMMARY

In general, this disclosure describes techniques for video coding. In particular, this disclosure describes methods, techniques, and devices that may reduce the computational complexity and memory bandwidth requirements of neural network (NN)-based video coding tools. The example techniques described herein relate to NN-based filtering; however, the techniques of this disclosure are applicable to any NN-based video coding tool that uses input data with certain statistical properties. In some examples, the NN-based coding tool may be a convolutional NN (CNN)-based video coding tool, such as a CNN-based filter. The techniques of this disclosure may be used in the context of advanced video codecs, such as extensions of VVC, the next generation of video coding standards, and/or any other video codecs. In accordance with the techniques of this disclosure, a video coder may be configured to use separable convolutions in place of a multi-dimensional convolution. For example, two separable one-dimensional convolutions may be used in place of a 3×3 convolution in any section of an NN-based filter. The use of separable convolutions may reduce computational complexity and memory bandwidth requirements.
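As a rough, back-of-the-envelope illustration of the savings (the channel count C below is an assumed value, not one taken from this disclosure): a dense 3×3 convolution from C channels to C channels costs 9·C² multiply-accumulate (MAC) operations per output pixel, while a 3×1 pass followed by a 1×3 pass costs 3·C² + 3·C² = 6·C², about one third fewer, before counting any further savings from reducing C itself with the surrounding 1×1 convolutions.

```python
# Illustrative MAC counts per output pixel for a dense 3x3 convolution
# versus a separable 3x1 + 1x3 pair; C is an assumed channel count.
def macs_dense_3x3(c_in: int, c_out: int) -> int:
    return 3 * 3 * c_in * c_out

def macs_separable(c_in: int, c_out: int) -> int:
    # 3x1 pass (c_in -> c_out) followed by a 1x3 pass (c_out -> c_out).
    return 3 * c_in * c_out + 3 * c_out * c_out

C = 64
dense = macs_dense_3x3(C, C)      # 36864
separable = macs_separable(C, C)  # 24576
print(f"dense: {dense} MACs, separable: {separable} MACs "
      f"({100 * (1 - separable / dense):.0f}% fewer)")
```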
In one example, a method of coding video data includes receiving a picture of video data, reconstructing a block of the picture of video data to generate a reconstructed block, and performing an NN-based filter process on the reconstructed block to generate a filtered block, wherein the NN-based filter process includes performing a plurality of separable convolutions to approximate a multi-dimensional convolution. In another example, an apparatus configured to code video data includes a memory configured to store a picture of video data, and processing circuitry in communication with the memory, the processing circuitry configured to receive the picture of video data, reconstruct a block of the picture of video data to generate a reconstructed block, and perform an NN-based filter process on the reconstructed block to generate a filtered block, wherein the NN-based filter process includes performing a plurality of separable convolutions to approximate a multi-dimensional convolution. In another example, an apparatus configured to code video data includes means for receiving a picture of video data, means for reconstructing a block of the picture of video data to generate a reconstructed block, and means for performing an NN-based filter process on the reconstructed block to generate a filtered block, wherein the NN-based filter process includes performing a plurality of separable convolutions to approximate a multi-dimensional convolution. In another e