US-20260127776-A1 - METHOD, APPARATUS, AND MEDIUM FOR VIDEO PROCESSING
Abstract
Embodiments of the present disclosure provide a solution for video processing. A method for video processing is proposed. In the method, a conversion between a current video unit of a video and a bitstream of the video is performed. A neural network (NN) filter is applied to the current video unit for the conversion. At least one output of at least one intermediate layer in the NN filter is normalized with layer normalization.
Inventors
- Yue Li
- Kai Zhang
- Li Zhang
Assignees
- BYTEDANCE INC.
Dates
- Publication Date
- 20260507
- Application Date
- 20251231
Claims (20)
- 1 . A method for video processing, comprising: performing a conversion between a current video unit of a video and a bitstream of the video, wherein a neural network (NN) filter is applied to the current video unit for the conversion, wherein at least one output of at least one intermediate layer in the NN filter is normalized with layer normalization, and/or wherein the NN filter comprises a vanilla convolutional layer and at least one of: a depth-wise convolutional layer, or a group convolutional layer, and/or wherein the NN filter is applied to the current video unit for the conversion based on an attention.
- 2 . The method of claim 1 , wherein the at least one output of the at least one intermediate layer in the NN filter is normalized with layer normalization, and wherein the layer normalization is conducted for each channel in the at least one output of the at least one intermediate layer.
- 3 . The method of claim 1 , wherein the NN filter comprises the vanilla convolutional layer and the depth-wise convolutional layer, and an element in an output channel is determined by involving corresponding elements in an input channel corresponding to the output channel.
- 4 . The method of claim 1 , wherein the NN filter comprises the vanilla convolutional layer and the group convolutional layer, and an element in an output channel is determined by involving corresponding elements in a plurality of input channels corresponding to the output channel.
- 5 . The method of claim 4 , wherein at least one input channel of the NN filter is excluded from the plurality of input channels.
- 6 . The method of claim 1 , wherein the NN filter is applied to the current video unit for the conversion based on the attention, wherein the attention comprises a channel attention determined from an output of an intermediate layer in the NN filter, and the channel attention is used to recalibrate the output.
- 7 . The method of claim 6 , wherein the channel attention is obtained by squeezing spatial information of the output of the intermediate layer into channels and scaling the channels with a vector.
- 8 . The method of claim 7 , wherein the channel attention is determined by A = W × pool(G), where A denotes the channel attention, W ∈ ℝ^N denotes a weighting vector, pool(·) denotes a pooling function, and G ∈ ℝ^{N×W×H} denotes the output of the intermediate layer, where N, W, and H are the number of channels, the width, and the height, respectively.
- 9 . The method of claim 8 , wherein the attention is applied by: φ_{i,j,k} = G_{i,j,k} × A_i, 1 ≤ i ≤ N, 1 ≤ j ≤ W, 1 ≤ k ≤ H, where φ denotes the recalibrated feature maps.
- 10 . The method of claim 8 , wherein the attention is applied by: φ_{i,j,k} = G_{i,j,k} × f(A_i), 1 ≤ i ≤ N, 1 ≤ j ≤ W, 1 ≤ k ≤ H, where φ denotes the recalibrated feature maps and f denotes a mapping function applied on each element of the attention, or wherein the attention is applied by: φ_{i,j,k} = G_{i,j,k} × f(A_i) + G_{i,j,k}, 1 ≤ i ≤ N, 1 ≤ j ≤ W, 1 ≤ k ≤ H, where φ denotes the recalibrated feature maps and f denotes a mapping function applied on each element of the attention.
- 11 . The method of claim 10 , wherein the mapping function comprises one of: a sigmoid function, or a hyperbolic tangent function.
- 12 . The method of claim 10 , wherein a first attention for a first channel of feature maps is different from a second attention for a second channel of the feature maps, and/or wherein a first mapping function for the first channel of the feature maps is different from a second mapping function for the second channel of the feature maps.
- 13 . The method of claim 1 , wherein the NN filter is applied to the current video unit for the conversion based on the attention, and wherein the attention is applied to at least one predetermined layer inside the NN filter.
- 14 . The method of claim 1 , wherein the NN filter is applied to the current video unit for the conversion based on the attention, and wherein a gate function is utilized by the NN filter in addition to a non-linear activation function.
- 15 . The method of claim 14 , wherein the gate function divides a feature map into a plurality of parts in a channel dimension and multiplies the plurality of parts of the feature map.
- 16 . The method of claim 1 , wherein the conversion includes encoding the current video unit into the bitstream.
- 17 . The method of claim 1 , wherein the conversion includes decoding the current video unit from the bitstream.
- 18 . An apparatus for video processing comprising a processor and a non-transitory memory with instructions thereon, wherein the instructions, upon execution by the processor, cause the processor to: perform a conversion between a current video unit of a video and a bitstream of the video, wherein a neural network (NN) filter is applied to the current video unit for the conversion, wherein at least one output of at least one intermediate layer in the NN filter is normalized with layer normalization, and/or wherein the NN filter comprises a vanilla convolutional layer and at least one of: a depth-wise convolutional layer, or a group convolutional layer, and/or wherein the NN filter is applied to the current video unit for the conversion based on an attention.
- 19 . A non-transitory computer-readable storage medium storing instructions that cause a processor to perform operations comprising: performing a conversion between a current video unit of a video and a bitstream of the video, wherein a neural network (NN) filter is applied to the current video unit for the conversion, wherein at least one output of at least one intermediate layer in the NN filter is normalized with layer normalization, and/or wherein the NN filter comprises a vanilla convolutional layer and at least one of: a depth-wise convolutional layer, or a group convolutional layer, and/or wherein the NN filter is applied to the current video unit for the conversion based on an attention.
- 20 . A non-transitory computer-readable recording medium storing a bitstream of a video which is generated by a method performed by an apparatus for video processing, wherein the method comprises: generating the bitstream of the video from a current video unit of the video, wherein a neural network (NN) filter is applied to the current video unit of the video, wherein at least one output of at least one intermediate layer in the NN filter is normalized with layer normalization, and/or wherein the NN filter comprises a vanilla convolutional layer and at least one of: a depth-wise convolutional layer, or a group convolutional layer, and/or wherein the NN filter is applied to the current video unit for the conversion based on an attention.
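The per-channel layer normalization of claim 2 can be illustrated with a minimal sketch. The function name, the use of NumPy, and the eps constant are assumptions for illustration, not part of the claimed method:

```python
import numpy as np

def layer_norm_per_channel(x, eps=1e-6):
    """Normalize each channel of a (C, H, W) feature map independently,
    i.e. layer normalization conducted per channel (cf. claim 2).
    Hypothetical sketch; eps guards against division by zero."""
    out = np.empty_like(x, dtype=np.float64)
    for c in range(x.shape[0]):
        ch = x[c].astype(np.float64)
        mu = ch.mean()          # per-channel mean over the spatial extent
        sigma = ch.std()        # per-channel standard deviation
        out[c] = (ch - mu) / (sigma + eps)
    return out
```

After normalization, each channel has approximately zero mean and unit variance, independent of the other channels.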
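The grouped convolution of claim 4, where each output channel draws only on the input channels of its group, can be sketched as follows; setting groups equal to the number of input channels degenerates to the depth-wise case of claim 3, in which each output channel depends on a single input channel. The stride-1 "same"-padding choice and the naive loop implementation are illustrative assumptions:

```python
import numpy as np

def group_conv(x, kernels, groups):
    """Grouped 2D convolution, stride 1, 'same' zero padding (sketch).
    x: (C_in, H, W); kernels: (C_out, C_in // groups, k, k).
    Each output channel sees only the C_in // groups input channels of its
    group (cf. claim 4); groups == C_in gives the depth-wise case (claim 3)."""
    C_in, H, W = x.shape
    C_out, cg, k, _ = kernels.shape
    assert C_in % groups == 0 and C_out % groups == 0 and cg == C_in // groups
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((C_out, H, W))
    for o in range(C_out):
        g = o // (C_out // groups)        # group index of this output channel
        ins = xp[g * cg:(g + 1) * cg]     # only that group's input channels
        for i in range(H):
            for j in range(W):
                out[o, i, j] = np.sum(ins[:, i:i + k, j:j + k] * kernels[o])
    return out
```

A vanilla convolutional layer corresponds to groups = 1, where every output channel involves all input channels.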
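The channel attention of claims 7 through 11 — squeeze spatial information into channels, scale by a weighting vector, then recalibrate the feature maps, optionally through a mapping function and with a residual term — can be sketched as below. Global average pooling is only one possible choice of pool(·), and all function names are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(G, w):
    """A = W × pool(G): squeeze spatial information into channels and scale
    the channels with a vector (cf. claims 7-8).
    G: (N, W, H) feature maps; w: (N,) weighting vector; returns A: (N,)."""
    pooled = G.mean(axis=(1, 2))   # global average pooling per channel
    return w * pooled              # element-wise scaling by the vector

def apply_attention(G, A, f=sigmoid, residual=False):
    """phi_{i,j,k} = G_{i,j,k} × f(A_i), optionally + G_{i,j,k}
    (cf. claims 9-11); f may be e.g. sigmoid or tanh."""
    scale = f(A)[:, None, None]    # broadcast attention over spatial dims
    phi = G * scale
    if residual:
        phi = phi + G              # residual variant of claim 10
    return phi
```

Claim 12 further allows a different attention value, or a different mapping function, per channel; the per-channel indexing of A and f(A) above accommodates the former directly.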
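The gate function of claim 15 divides a feature map into parts along the channel dimension and multiplies them. A minimal sketch, assuming a two-way split (the claim does not fix the number of parts):

```python
import numpy as np

def gate(x):
    """Split a (C, H, W) feature map into two halves along the channel
    dimension and multiply them element-wise (cf. claim 15).
    The two-way split is an assumption for illustration."""
    C = x.shape[0]
    assert C % 2 == 0, "channel count must be divisible by the number of parts"
    a, b = x[:C // 2], x[C // 2:]
    return a * b   # output has C // 2 channels
```

Per claim 14, such a gate is used in addition to, not instead of, a non-linear activation function.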
Description
CROSS REFERENCE TO RELATED APPLICATION
This application is a continuation of International Application No. PCT/US2024/036580, filed on Jul. 2, 2024, which claims the benefit of U.S. Provisional Patent Application No. 63/511,813, entitled “METHOD, APPARATUS, AND MEDIUM FOR VIDEO PROCESSING” and filed on Jul. 3, 2023. The entire contents of these applications are hereby incorporated by reference.
FIELD
Embodiments of the present disclosure relate generally to video coding techniques, and more particularly, to neural network (NN)-based filters for video coding.
BACKGROUND
Nowadays, digital video capabilities are being applied in various aspects of people's lives. Multiple types of video compression technologies, such as MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4 Part 10 Advanced Video Coding (AVC), the ITU-T H.265 High Efficiency Video Coding (HEVC) standard, and the Versatile Video Coding (VVC) standard, have been proposed for video encoding/decoding. However, there is still room to improve the coding efficiency of conventional video coding techniques.
SUMMARY
Embodiments of the present disclosure provide a solution for video processing. In a first aspect, a method for video processing is proposed. The method comprises: performing a conversion between a current video unit of a video and a bitstream of the video, wherein a neural network (NN) filter is applied to the current video unit for the conversion, wherein at least one output of at least one intermediate layer in the NN filter is normalized with layer normalization. The method in accordance with the first aspect of the present disclosure applies an NN filter with layer normalization. The coding efficiency can thus be improved. In a second aspect, another method for video processing is proposed.
The method comprises: performing a conversion between a current video unit of a video and a bitstream of the video, wherein a neural network (NN) filter is applied to the current video unit for the conversion, wherein the NN filter comprises a vanilla convolutional layer and at least one of: a depth-wise convolutional layer, or a group convolutional layer. The method in accordance with the second aspect of the present disclosure applies an NN filter with a depth-wise convolutional layer or a group convolutional layer. The coding efficiency can thus be improved. In a third aspect, another method for video processing is proposed. The method comprises: performing a conversion between a current video unit of a video and a bitstream of the video, wherein a neural network (NN) filter is applied to the current video unit for the conversion based on an attention. The method in accordance with the third aspect of the present disclosure applies the attention for an NN filter. The coding efficiency can thus be improved. In a fourth aspect, an apparatus for video processing is proposed. The apparatus comprises a processor and a non-transitory memory with instructions thereon. The instructions, upon execution by the processor, cause the processor to perform a method in accordance with the first, second, or third aspect of the present disclosure. In a fifth aspect, a non-transitory computer-readable storage medium is proposed. The non-transitory computer-readable storage medium stores instructions that cause a processor to perform a method in accordance with the first, second, or third aspect of the present disclosure. In a sixth aspect, another non-transitory computer-readable recording medium is proposed. The non-transitory computer-readable recording medium stores a bitstream of a video which is generated by a method performed by an apparatus for video processing.
The method comprises: generating the bitstream of the video from a current video unit of the video, wherein a neural network (NN) filter is applied to the current video unit of the video, wherein at least one output of at least one intermediate layer in the NN filter is normalized with layer normalization. In a seventh aspect, another non-transitory computer-readable recording medium is proposed. The non-transitory computer-readable recording medium stores a bitstream of a video which is generated by a method performed by an apparatus for video processing. The method comprises: generating the bitstream of the video from a current video unit of the video, wherein a neural network (NN) filter is applied to the current video unit of the video, wherein the NN filter comprises a vanilla convolutional layer and at least one of: a depth-wise convolutional layer, or a group convolutional layer. In an eighth aspect, another non-transitory computer-readable recording medium is proposed. The non-transitory computer-readable recording medium stores a bitstream of a video which is generated by a method performed by an apparatus for video processing. The method comprises: generating the bitstream of the video from a current video unit of the video, wherein a neural network (NN) filter is applied to the current video unit of the video for the conversion based on an attention.