US-20260129246-A1 - METHOD, APPARATUS, AND MEDIUM FOR VISUAL DATA PROCESSING
Abstract
Embodiments of the present disclosure provide a solution for visual data processing. A method for visual data processing is proposed. In the method, for a conversion between a current visual unit of visual data and a bitstream of the visual data, a probability representation of the current visual unit is determined based on a multistage context module. The conversion is performed based on the probability representation. The multistage context module at least comprises at least one prediction fusion network. A conditional context network is excluded from the multistage context module.
Inventors
- Yaojun WU
Assignees
- Douyin Vision Co., Ltd.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-12-30
- Priority Date: 2023-06-30
Claims (20)
- 1 . A method for visual data processing, comprising: determining, for a conversion between a current visual unit of visual data and a bitstream of the visual data, a probability representation of the current visual unit based on a multistage context module; and performing the conversion based on the probability representation, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module.
- 2 . The method of claim 1 , wherein the multistage context module comprises a four-stage context module, and the conditional context network and a channel-wise splitting are excluded from the four-stage context module.
- 3 . The method of claim 2 , wherein the four-stage context module applies a spatial shuffle to obtain four groups of sub tensors, wherein applying the spatial shuffle comprises: obtaining a tensor with a shape [N, C, H, W] as an input, N, C, H and W being positive integers; and outputting the four groups of sub tensors by dividing the tensor in a spatial domain by: x1 = X[:, :, 0::2, 0::2], x2 = X[:, :, 1::2, 1::2], x3 = X[:, :, 0::2, 1::2], x4 = X[:, :, 1::2, 0::2], wherein X denotes the tensor, and x1, x2, x3 and x4 denote the four groups of sub tensors, respectively (an illustrative sketch follows the claims).
- 4 . The method of claim 1 , wherein a tensor is outputted by a prediction module, the tensor having a shape [N, 4C, H/2, W/2], N, C, H and W being positive integers, and wherein at least one sub tensor is obtained by directly applying channel-wise splitting to the tensor (an illustrative sketch follows the claims).
- 5 . The method of claim 1 , wherein the multistage context module comprises a convolutional module replacing a multistage context network, wherein a kernel size of the convolutional module is 3×3, a stride of the convolutional module is 1, and a flag for padding of the convolutional module is 1.
- 6 . The method of claim 5 , wherein for context modeling of a sub tensor with an index i, pre-coded group information with an index j is used as a reference and concatenated in the channel dimension to obtain intermediate information, j being less than i, wherein the intermediate information is fed into the convolutional module to obtain a reference tensor for the at least one prediction fusion network in the multistage context module, and/or wherein for a first prediction of a first group of sub tensors with the index i, an input channel of the convolutional module is based on the index of the first group of sub tensors, and/or wherein the input channel is determined by (i−1)*chs, where i denotes the index, and chs denotes the number of channels of the input tensor, wherein if the index i is equal to zero, a four-stage context network is not applied, and the four-stage context network is replaced with a zero padding in the channel dimension (an illustrative sketch follows the claims).
- 7 . The method of claim 5 , wherein one or more convolution layers are added in the convolutional module, or wherein a grouped convolution is used by the convolutional module.
- 8 . The method of claim 1 , wherein the at least one prediction fusion network of the multistage context module takes a reference tensor and a hyper parameter as inputs, and an output of the at least one prediction fusion network comprises a prediction mean value, wherein for a plurality of sub prediction mean values, the at least one prediction fusion network applies a same network structure and a plurality of weights, and/or wherein an output of each of the at least one prediction fusion network comprises a mean value of a sub residual representation, and a reconstructed representation is obtained by adding the mean value to the sub residual representation (an illustrative sketch follows the claims).
- 9 . The method of claim 1 , wherein a channel-wise splitting is excluded from the multistage context module, and spatial shuffle is not used by the multistage context module, wherein the multistage context module comprises a plurality of convolutional layers for processing a hyper parameter, a kernel size of the plurality of convolutional layers being 3×3, and a stride of the plurality of convolutional layers being 2, wherein for context modeling of a plurality of subgroups of sub tensors, a plurality of corresponding weights of the plurality of convolutional layers are different, and/or wherein the multistage context module comprises a four-stage context network, and a first mask convolution of a first subnetwork of the four-stage context network is different from a second mask convolution of a second subnetwork of the four-stage context network, the first and second masks being used for context modeling.
- 10 . The method of claim 9 , wherein the first mask of the four-stage context network comprises a 3×3 matrix [[0, 0, 0], [0, 1, 0], [0, 0, 0]], the second mask of the four-stage context network comprises a 3×3 matrix [[1, 0, 1], [0, 1, 0], [1, 0, 1]], and a third mask of the four-stage context network comprises a 3×3 matrix [[1, 0, 1], [1, 1, 1], [1, 0, 1]].
- 11 . The method of claim 10 , wherein a first mask convolution of the first mask comprises a 3×3 convolution with stride being two and padding being 1, or wherein a first mask convolution of the first mask comprises a 1×1 convolution with stride being two and padding being 0, or wherein a second mask convolution of the second mask comprises a 3×3 convolution with stride being two and padding being 1, or wherein a third mask convolution of the third mask comprises a 3×3 convolution with stride being two and padding being 1 (an illustrative sketch follows the claims).
- 12 . The method of claim 1 , wherein the multistage context module comprises a four-stage context network, wherein an output of the at least one prediction fusion network is determined by concatenating a first tensor from a convolution and a second tensor from the four-stage context network in the channel dimension, the first tensor being based on a hyper parameter, wherein the output of the at least one prediction fusion network comprises at least one subgroup of residual predictions, and/or wherein the at least one prediction fusion network comprises a plurality of subnetworks with a same network structure, a plurality of weights of the plurality of subnetworks being different.
- 13 . The method of claim 1 , wherein the multistage context module comprises a four-stage context network, and a grouped convolution is used in the four-stage context network.
- 14 . The method of claim 1 , wherein information regarding whether to and/or how to apply the method is indicated at at least one of: a block level, a sequence level, a group of pictures level, a picture level, a slice level, or a tile group level, or wherein information regarding whether to and/or how to apply the method is included in a coding structure, the coding structure comprising at least one of: a coding tree unit (CTU), a coding unit (CU), a transform unit (TU), a prediction unit (PU), a coding tree block (CTB), a coding block (CB), a transform block (TB), a prediction block (PB), a sequence header, a picture header, a sequence parameter set (SPS), a video parameter set (VPS), a decoded parameter set (DPS), decoding capability information (DCI), a picture parameter set (PPS), an adaptation parameter set (APS), a slice header, or a tile group header, or wherein information regarding whether to and/or how to apply the method is based on coded information, the coded information comprising at least one of: a block size, a color format, a single or dual tree partitioning, a color component, a slice type or a picture type.
- 15 . The method of claim 1 , wherein the method is used in a coding tool that requires chroma fusion.
- 16 . The method of claim 1 , wherein a syntax element in the bitstream is binarized as at least one of: a flag, a fixed length code, an exponential Golomb (EG(x)) code, a unary code, a truncated unary code, or a truncated binary code, and wherein the syntax element is signed or unsigned, and/or wherein a syntax element in the bitstream is coded with at least one context model, or bypass coded, and/or wherein a syntax element is included in the bitstream based on a condition, the condition comprising at least one of: that a function associated with the syntax element is applicable, or that a dimension of the current video block satisfies a dimension condition, and/or wherein a syntax element in the bitstream is at at least one of: a block level, a sequence level, a group of pictures level, a picture level, a slice level, or a tile group level, and/or wherein a syntax element in the bitstream is included in a coding structure, the coding structure comprising at least one of: a coding tree unit (CTU), a coding unit (CU), a transform unit (TU), a prediction unit (PU), a coding tree block (CTB), a coding block (CB), a transform block (TB), a prediction block (PB), a sequence header, a picture header, a sequence parameter set (SPS), a video parameter set (VPS), a decoded parameter set (DPS), decoding capability information (DCI), a picture parameter set (PPS), an adaptation parameter set (APS), a slice header, or a tile group header.
- 17 . The method of claim 1 , wherein the conversion comprises decoding the current visual unit from the bitstream, or wherein the conversion comprises encoding the current visual unit into the bitstream.
- 18 . An apparatus for visual data processing comprising a processor and a non-transitory memory with instructions thereon, wherein the instructions, upon execution by the processor, cause the processor to: determine, for a conversion between a current visual unit of visual data and a bitstream of the visual data, a probability representation of the current visual unit based on a multistage context module; and perform the conversion based on the probability representation, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module.
- 19 . A non-transitory computer-readable storage medium storing instructions that cause a processor to perform operations comprising: determining, for a conversion between a current visual unit of visual data and a bitstream of the visual data, a probability representation of the current visual unit based on a multistage context module; and performing the conversion based on the probability representation, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module.
- 20 . A non-transitory computer-readable recording medium storing a bitstream of visual data which is generated by a method performed by an apparatus for visual data processing, wherein the method comprises: determining a probability representation of a current visual unit of the visual data based on a multistage context module; and generating the bitstream based on the probability representation, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module.
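A minimal sketch of the spatial shuffle recited in claim 3, assuming PyTorch tensors with even H and W; the function name spatial_shuffle is hypothetical:

```python
import torch

def spatial_shuffle(x: torch.Tensor):
    """Divide a [N, C, H, W] tensor into four sub tensors in the spatial
    domain, following the slicing pattern of claim 3."""
    x1 = x[:, :, 0::2, 0::2]  # even rows, even columns
    x2 = x[:, :, 1::2, 1::2]  # odd rows, odd columns
    x3 = x[:, :, 0::2, 1::2]  # even rows, odd columns
    x4 = x[:, :, 1::2, 0::2]  # odd rows, even columns
    return x1, x2, x3, x4  # each of shape [N, C, H/2, W/2]
```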
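A minimal sketch of the channel-wise splitting of claim 4, under the assumption that the [N, 4C, H/2, W/2] prediction-module output is split into four equal channel groups; channel_split is a hypothetical name:

```python
import torch

def channel_split(y: torch.Tensor, groups: int = 4):
    """Directly split a [N, 4C, H/2, W/2] tensor along the channel
    dimension; each chunk then has shape [N, C, H/2, W/2]."""
    return torch.chunk(y, groups, dim=1)
```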
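A sketch of the convolutional module of claims 5 and 6, assuming PyTorch and a 1-indexed reading of the group index, under which group i has i - 1 pre-coded references and the input channel count is (i−1)*chs; the class ContextConv and its argument names are hypothetical:

```python
import torch
import torch.nn as nn

class ContextConv(nn.Module):
    """Sketch of the convolutional module of claims 5-6: a 3x3 convolution
    with stride 1 and padding 1 applied to the channel-wise concatenation
    of the pre-coded sub tensors with index j < i."""

    def __init__(self, num_refs: int, chs: int, out_chs: int):
        super().__init__()
        self.out_chs = out_chs
        # num_refs plays the role of i - 1 in claim 6's (i - 1) * chs rule.
        self.conv = (nn.Conv2d(num_refs * chs, out_chs, kernel_size=3,
                               stride=1, padding=1)
                     if num_refs > 0 else None)

    def forward(self, precoded, shape):
        """`precoded` is a list of pre-coded sub tensors; `shape` is
        (N, H, W), used only for the no-reference fallback."""
        if self.conv is None:
            # First group: the context network is replaced with zero padding
            # in the channel dimension (claim 6).
            n, h, w = shape
            return torch.zeros(n, self.out_chs, h, w)
        return self.conv(torch.cat(precoded, dim=1))
```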
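A sketch of a prediction fusion network per claim 8; the inputs (a reference tensor and a hyper parameter) and the mean-plus-residual reconstruction follow the claim, while the two-layer convolutional body is an assumed internal structure:

```python
import torch
import torch.nn as nn

class PredictionFusion(nn.Module):
    """Sketch of a prediction fusion network (claim 8): it takes a reference
    tensor and a hyper parameter as inputs and outputs a prediction mean."""

    def __init__(self, ref_chs: int, hyper_chs: int, chs: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ref_chs + hyper_chs, chs, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(chs, chs, kernel_size=3, padding=1),
        )

    def forward(self, ref: torch.Tensor, hyper: torch.Tensor) -> torch.Tensor:
        # Concatenate the reference tensor and hyper parameter channel-wise,
        # then predict the mean of the sub residual representation.
        return self.net(torch.cat([ref, hyper], dim=1))

# Reconstruction per claim 8: the reconstructed representation is obtained
# by adding the predicted mean to the sub residual representation:
#   reconstructed = mean + sub_residual
```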
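A sketch of the 3×3 mask convolutions of claims 10 and 11 (the 1×1 variant of claim 11 is not shown). The three masks are taken verbatim from claim 10; realizing the mask by element-wise multiplication with the kernel weights is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# The three 3x3 masks recited in claim 10.
MASKS = {
    "first":  torch.tensor([[0., 0., 0.], [0., 1., 0.], [0., 0., 0.]]),
    "second": torch.tensor([[1., 0., 1.], [0., 1., 0.], [1., 0., 1.]]),
    "third":  torch.tensor([[1., 0., 1.], [1., 1., 1.], [1., 0., 1.]]),
}

class MaskConv(nn.Module):
    """3x3 mask convolution with stride 2 and padding 1 (claim 11). The
    fixed 0/1 mask multiplies the kernel weights so that masked-out taps
    never contribute to the output."""

    def __init__(self, in_chs: int, out_chs: int, mask: torch.Tensor):
        super().__init__()
        self.conv = nn.Conv2d(in_chs, out_chs, kernel_size=3, stride=2,
                              padding=1)
        self.register_buffer("mask", mask.view(1, 1, 3, 3))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.conv2d(x, self.conv.weight * self.mask, self.conv.bias,
                        stride=2, padding=1)
```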
Description
CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/CN2024/102789, filed on Jun. 30, 2024, which claims the benefit of International Application No. PCT/CN2023/105119, filed on Jun. 30, 2023. The entire contents of these applications are hereby incorporated by reference.

FIELD

Embodiments of the present disclosure relate generally to visual data processing techniques, and more particularly, to a multistage context module for visual data processing.

BACKGROUND

Image/video compression is an essential technique for reducing the costs of image/video transmission and storage in a lossless or lossy manner. Image/video compression techniques can be divided into two branches: classical video coding methods and neural-network-based video compression methods. Classical video coding schemes adopt transform-based solutions, in which researchers have exploited the statistical dependency in the latent variables (e.g., wavelet coefficients) by carefully hand-engineering entropy codes that model the dependencies in the quantized regime. Neural-network-based video compression comes in two flavors: neural-network-based coding tools and end-to-end neural-network-based video compression. The former is embedded into existing classical video codecs as coding tools and only serves as part of the framework, while the latter is a separate framework developed based on neural networks without depending on classical video codecs. The coding efficiency of image/video coding is generally expected to be further improved.

SUMMARY

Embodiments of the present disclosure provide a solution for visual data processing.

In a first aspect, a method for visual data processing is proposed. The method comprises: determining, for a conversion between a current visual unit of visual data and a bitstream of the visual data, a probability representation of the current visual unit based on a multistage context module; and performing the conversion based on the probability representation, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module. In this way, the multistage context module, such as a multistage context model, can be simplified, and the coding effectiveness and coding efficiency can thus be improved.

In a second aspect, an apparatus for visual data processing is proposed. The apparatus comprises a processor and a non-transitory memory with instructions thereon. The instructions, upon execution by the processor, cause the processor to perform a method in accordance with the first aspect of the present disclosure.

In a third aspect, a non-transitory computer-readable storage medium is proposed. The non-transitory computer-readable storage medium stores instructions that cause a processor to perform a method in accordance with the first aspect of the present disclosure.

In a fourth aspect, another non-transitory computer-readable recording medium is proposed. The non-transitory computer-readable recording medium stores a bitstream of a video which is generated by a method performed by an apparatus for visual data processing. The method comprises: determining a probability representation of a current visual unit of the visual data based on a multistage context module; and generating the bitstream based on the probability representation, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module.

In a fifth aspect, a method for storing a bitstream of a video is proposed. The method comprises: determining a probability representation of a current visual unit of the visual data based on a multistage context module; generating the bitstream based on the probability representation; and storing the bitstream in a non-transitory computer-readable recording medium, wherein the multistage context module at least comprises at least one prediction fusion network, and wherein a conditional context network is excluded from the multistage context module.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Through the following detailed description with reference to the accompanying drawings, the above and other objectives, features, and advantages of example embodiments of the present disclosure will become more apparent. In the example embodiments of the present disclosure, the same reference numerals usually refer to the same components. FIG. 1 illustrates a block diagram of an example visual data coding system, in accorda