US-20260129219-A1 - METHOD, APPARATUS, AND MEDIUM FOR VISUAL DATA PROCESSING

Abstract

Embodiments of the present disclosure provide a solution for visual data processing. A method for visual data processing is proposed. The method comprises: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure.

Inventors

  • Zhaobin Zhang
  • Semih ESENLIK
  • Kai Zhang
  • Li Zhang

Assignees

  • BYTEDANCE INC.

Dates

Publication Date
2026-05-07
Application Date
2025-12-30

Claims (20)

  1. A method for visual data processing, comprising: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure.
  2. The method of claim 1, wherein the same structure comprises a convolution with a pixel shuffle.
  3. The method of claim 2, wherein an upsampling scale factor of the pixel shuffle is larger than 1.
  4. The method of claim 2, wherein a kernel size of the convolution is one of the following: 2×2, 3×3, or 4×4.
  5. The method of claim 1, wherein the same structure comprises a transposed convolution.
  6. The method of claim 5, wherein a stride of the transposed convolution is larger than 1.
  7. The method of claim 5, wherein a kernel size of the transposed convolution is one of the following: 2×2, 3×3, or 4×4.
  8. The method of claim 1, wherein the NN-based model comprises at least one of the following: a synthesis transform sub-model, a hyper decoder sub-model, or a hyper scale decoder sub-model.
  9. The method of claim 1, wherein first information regarding at least one of the following is indicated in the bitstream: whether all of upsampling components in the NN-based model are implemented with the same structure, or how to implement all of upsampling components in the NN-based model.
  10. The method of claim 9, wherein the first information is indicated at one of the following: a block level, a sequence level, a group of pictures level, a picture level, a slice level, or a tile group level, or wherein the first information is indicated in one of the following: a coding structure of a coding tree unit (CTU), a coding structure of a coding unit (CU), a coding structure of a transform unit (TU), a coding structure of a prediction unit (PU), a coding structure of a coding tree block (CTB), a coding structure of a coding block (CB), a coding structure of a transform block (TB), a coding structure of a prediction block (PB), a sequence header, a picture header, a sequence parameter set (SPS), a video parameter set (VPS), a dependency parameter set (DPS), decoding capability information (DCI), a picture parameter set (PPS), an adaptation parameter set (APS), a slice header, or a tile group header.
  11. The method of claim 9, wherein the first information is dependent on coded information of the visual data, and the coded information comprises at least one of the following: a block size, a color format, a single tree partitioning, a dual tree partitioning, a color component, a slice type, or a picture type.
  12. The method of claim 9, wherein the first information is indicated by a syntax element.
  13. The method of claim 12, wherein the syntax element is binarized as one of the following: a flag, a fixed length code, an exponential Golomb (EG) code, a unary code, a truncated unary code, or a truncated binary code, or wherein the syntax element is coded with at least one context model, or wherein the syntax element is bypass coded.
  14. The method of claim 1, wherein a syntax element indicating at least one of the following is signaled based on a condition: whether all of upsampling components in the NN-based model are implemented with the same structure, or how to implement all of upsampling components in the NN-based model.
  15. The method of claim 1, wherein the visual data comprise a video, a picture of the video, or an image.
  16. The method of claim 1, wherein the conversion includes encoding the visual data into the bitstream.
  17. The method of claim 1, wherein the conversion includes decoding the visual data from the bitstream.
  18. An apparatus for visual data processing comprising a processor and a non-transitory memory with instructions thereon, wherein the instructions, upon execution by the processor, cause the processor to perform operations comprising: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure.
  19. A non-transitory computer-readable storage medium storing instructions that cause a processor to perform operations comprising: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure.
  20. A non-transitory computer-readable recording medium storing a bitstream of visual data which is generated by a method performed by an apparatus for visual data processing, wherein the method comprises: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure.
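The "convolution with a pixel shuffle" structure of claims 2-4 rearranges a tensor with r²·C channels into a C-channel tensor upsampled by scale factor r. A minimal NumPy sketch of the pixel-shuffle step is given below; the convolution that first produces the r²·C channels (e.g. with one of the 2×2, 3×3, or 4×4 kernels of claim 4) is omitted, and the function name and shapes are illustrative rather than taken from the application:

```python
import numpy as np

def pixel_shuffle(x: np.ndarray, r: int) -> np.ndarray:
    """Rearrange (C*r*r, H, W) -> (C, H*r, W*r), the usual sub-pixel
    convolution layout: each group of r*r channels fills one r x r cell."""
    c_r2, h, w = x.shape
    assert c_r2 % (r * r) == 0, "channel count must be divisible by r^2"
    c = c_r2 // (r * r)
    x = x.reshape(c, r, r, h, w)       # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # -> (C, H, r, W, r)
    return x.reshape(c, h * r, w * r)  # interleave: upsampling scale r > 1

# Toy tensor with C=1, r=2, H=W=2: four channels collapse into one 4x4 plane.
x = np.arange(16, dtype=np.float32).reshape(4, 2, 2)
y = pixel_shuffle(x, r=2)
print(y.shape)  # (1, 4, 4)
```

Pixel shuffle only permutes values, so the output contains exactly the input samples at new positions; all learning happens in the preceding convolution.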

Description

CROSS REFERENCE

This application is a continuation of International Application No. PCT/US2024/036171, filed on Jun. 28, 2024, which claims the benefit of U.S. Provisional Application No. 63/511,431, filed on Jun. 30, 2023. The entire contents of these applications are hereby incorporated by reference in their entireties.

FIELD

Embodiments of the present disclosure relate generally to visual data processing techniques, and more particularly, to neural network-based visual data coding.

BACKGROUND

The past decade has witnessed the rapid development of deep learning in a variety of areas, especially in computer vision and image processing. Neural networks originated in interdisciplinary research between neuroscience and mathematics, and have shown strong capabilities for non-linear transforms and classification. Neural network-based image/video compression technology has made significant progress during the past half decade. The latest neural network-based image compression algorithms are reported to achieve rate-distortion (R-D) performance comparable with Versatile Video Coding (VVC). As the performance of neural image compression continues to improve, neural network-based video compression has become an actively developing research area. However, the coding efficiency of neural network-based image/video coding is generally expected to be further improved.

SUMMARY

Embodiments of the present disclosure provide a solution for visual data processing. In a first aspect, a method for visual data processing is proposed. The method comprises: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure. Based on the method in accordance with the first aspect of the present disclosure, all of the upsampling components in the NN-based model are implemented with the same structure.
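The alternative unified structure of claims 5-7, a transposed convolution with stride larger than 1, can be sketched in the same spirit. The single-channel, no-padding NumPy implementation below is illustrative only (names and shapes are not from the application):

```python
import numpy as np

def transposed_conv2d(x: np.ndarray, kernel: np.ndarray, stride: int) -> np.ndarray:
    """Single-channel transposed convolution: each input pixel 'stamps' a
    scaled copy of the kernel into the output on a grid spaced by `stride`."""
    h, w = x.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw), dtype=x.dtype)
    for i in range(h):
        for j in range(w):
            out[i * stride:i * stride + kh,
                j * stride:j * stride + kw] += x[i, j] * kernel
    return out

# With a 2x2 kernel (claim 7) and stride 2 (claim 6), a 2x2 input maps
# exactly to a 4x4 output, i.e. upsampling by a factor of 2 per dimension.
x = np.ones((2, 2), dtype=np.float32)
k = np.ones((2, 2), dtype=np.float32)
y = transposed_conv2d(x, k, stride=2)
print(y.shape)  # (4, 4)
```

When the stride equals the kernel size, as here, the stamped copies do not overlap; smaller strides or larger kernels produce overlapping contributions that are summed.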
Compared with the conventional solution, where upsampling components in the NN-based model are implemented with multiple types of structures, the proposed method advantageously unifies the implementation of the upsampling components and thus simplifies the implementation of the NN-based model. Thereby, the coding efficiency can be improved.

In a second aspect, an apparatus for visual data processing is proposed. The apparatus comprises a processor and a non-transitory memory with instructions thereon. The instructions, upon execution by the processor, cause the processor to perform a method in accordance with the first aspect of the present disclosure.

In a third aspect, a non-transitory computer-readable storage medium is proposed. The non-transitory computer-readable storage medium stores instructions that cause a processor to perform a method in accordance with the first aspect of the present disclosure.

In a fourth aspect, another non-transitory computer-readable recording medium is proposed. The non-transitory computer-readable recording medium stores a bitstream of visual data which is generated by a method performed by an apparatus for visual data processing. The method comprises: performing a conversion between visual data and a bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure.

In a fifth aspect, a method for storing a bitstream of visual data is proposed. The method comprises: generating the bitstream of the visual data with a neural network (NN)-based model, all of upsampling components in the NN-based model being implemented with a same structure; and storing the bitstream in a non-transitory computer-readable recording medium.

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description.
This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

BRIEF DESCRIPTION OF THE DRAWINGS

Through the following detailed description with reference to the accompanying drawings, the above and other objectives, features, and advantages of example embodiments of the present disclosure will become more apparent. In the example embodiments of the present disclosure, the same reference numerals usually refer to the same components.

FIG. 1A is a block diagram illustrating an example visual data coding system, in accordance with some embodiments of the present disclosure;

FIG. 1B is a schematic diagram illustrating an example transform coding scheme;

FIG. 2 illustrates example latent representations of an image;

FIG. 3 is a schematic diagram illustrating an example autoencoder implementing a hyperprior model;

FIG. 4 is a schematic diagram illustrating an example combined model configured to jointly optimize a context model along