CN-122029808-A - Method, apparatus and medium for visual data processing

CN122029808A

Abstract

Embodiments of the present disclosure provide a solution for visual data processing. A method for visual data processing is presented. The method includes performing a conversion between visual data and a bitstream of the visual data using a Neural Network (NN) based model, wherein a first format for encoding and decoding the visual data and a second format of the output visual data from the conversion are indicated with different indications, the first format indicating a first relationship between a size of a first component of the decoded visual data and a size of a second component of the decoded visual data, and the second format indicating a second relationship between a size of the first component of the output visual data and a size of the second component of the output visual data.

Inventors

  • S. Eisenleck
  • ZHANG ZHAOBIN
  • WU YAOJUN
  • WANG MENG
  • ZHANG KAI
  • ZHANG LI

Assignees

  • Douyin Vision Co., Ltd.
  • ByteDance Ltd.

Dates

Publication Date
20260512
Application Date
20241009
Priority Date
20231010

Claims (20)

  1. A method for visual data processing, comprising: performing a conversion between visual data and a bitstream of the visual data using a Neural Network (NN) based model, wherein a first format for encoding and decoding the visual data and a second format of the output visual data from the conversion are indicated with different indications, the first format indicating a first relationship between a size of a first component of the decoded visual data and a size of a second component of the decoded visual data, and the second format indicating a second relationship between a size of the first component of the output visual data and a size of the second component of the output visual data.
  2. The method of claim 1, wherein the first relationship comprises at least one of: a ratio between a height of the first component of the decoded visual data and a height of the second component of the decoded visual data, or a ratio between a width of the first component of the decoded visual data and a width of the second component of the decoded visual data.
  3. The method of claim 2, wherein the bitstream includes a first indication and a second indication, the first indication indicating the ratio between the height of the first component of the decoded visual data and the height of the second component of the decoded visual data, and the second indication indicating the ratio between the width of the first component of the decoded visual data and the width of the second component of the decoded visual data.
  4. The method of any one of claims 1-3, wherein the second relationship comprises at least one of: a ratio between a height of the first component of the output visual data and a height of the second component of the output visual data, or a ratio between a width of the first component of the output visual data and a width of the second component of the output visual data.
  5. The method of claim 4, wherein the bitstream includes a third indication and a fourth indication, the third indication indicating the ratio between the height of the first component of the output visual data and the height of the second component of the output visual data, and the fourth indication indicating the ratio between the width of the first component of the output visual data and the width of the second component of the output visual data.
  6. The method of any of claims 1-5, wherein the first format is allowed to differ from the second format.
  7. The method of any of claims 1-6, wherein if the first format is a 4:2:0 format, the second format is allowed to be one of a 4:4:4 format, the 4:2:0 format, or a 4:2:2 format, or if the first format is the 4:2:2 format, the second format is allowed to be the 4:4:4 format or the 4:2:2 format, or if the first format is the 4:4:4 format, the second format is allowed to be the 4:4:4 format.
  8. The method of any of claims 1-7, wherein a size of one or more tensors used for coding the visual data is dependent on the first format.
  9. The method of any of claims 1-8, wherein a size of the second component of the output visual data is dependent on the second format.
  10. The method of any of claims 1-9, wherein the second component of the visual data of the first format is subjected to a resampling operation based on the first format and the second format.
  11. The method of claim 10, wherein the resampling operation is performed based on a ratio between the first format and the second format.
  12. The method of any of claims 1-11, wherein an operation for adjusting a size of a tensor is performed based on at least one of the first format or the second format.
  13. The method of claim 12, wherein the operation is implemented with a processing layer in the NN-based model.
  14. The method of claim 13, wherein the processing layer is a shuffle layer.
  15. The method of any of claims 1-14, wherein a filter in the NN-based model is applied based on at least one of the first format or the second format.
  16. The method of claim 1, wherein whether the different indications of the first format and the second format are included in the bitstream depends on a presence flag.
  17. The method of claim 12, wherein the operation for adjusting the size of the tensor comprises an upsampling operation, an interpolation operation, or a downsampling operation.
  18. The method of claim 12, wherein the operation for adjusting the size of the tensor is performed prior to at least one of: performing a transform, signaling a decoder module, or concatenating two tensors.
  19. The method of any of claims 1-18, wherein an entropy encoding process is performed based on at least one of the first format or the second format.
  20. The method of any of claims 1-19, wherein the bitstream includes an indication indicating a position of the first component corresponding to a position of the second component.
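The format relationships in claims 2, 4, and 7 can be made concrete with a small sketch. The mapping below is not part of the patent text; it simply encodes the standard chroma subsampling ratios of the named formats and the output formats claim 7 permits for each coded format (all names are illustrative):

```python
# Chroma subsampling as (width divisor, height divisor) of a chroma
# component relative to the luma component.
SUBSAMPLING = {
    "4:2:0": (2, 2),  # chroma is half luma width and half luma height
    "4:2:2": (2, 1),  # chroma is half luma width, full luma height
    "4:4:4": (1, 1),  # chroma matches the luma size
}

# Output (second) formats permitted for each coded (first) format per claim 7.
ALLOWED_OUTPUT = {
    "4:2:0": {"4:2:0", "4:2:2", "4:4:4"},
    "4:2:2": {"4:2:2", "4:4:4"},
    "4:4:4": {"4:4:4"},
}

def chroma_size(luma_w, luma_h, fmt):
    """Size of a chroma component implied by a format (claims 2 and 4)."""
    dw, dh = SUBSAMPLING[fmt]
    return luma_w // dw, luma_h // dh

def output_format_allowed(first_fmt, second_fmt):
    """Whether the output format may be used with the coded format (claim 7)."""
    return second_fmt in ALLOWED_OUTPUT[first_fmt]
```

For a 1920x1080 luma plane, `chroma_size(1920, 1080, "4:2:0")` gives a 960x540 chroma plane, while the 4:4:4 output format keeps chroma at full 1920x1080 resolution; the permitted combinations only ever increase chroma density from the coded format to the output format.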

Description

Method, apparatus and medium for visual data processing

Technical Field

Embodiments of the present disclosure relate generally to visual data processing technology, and more particularly, to neural network-based visual data coding.

Background

Deep learning has evolved rapidly in various fields over the last decade, particularly in computer vision and image processing. Neural networks were originally invented through interdisciplinary studies of neuroscience and mathematics, and they show great capability in the context of nonlinear transformation and classification. Image/video compression techniques based on neural networks have made significant progress in the last five years. It is reported that the latest neural-network-based image compression algorithms achieve rate-distortion (R-D) performance comparable to that of Versatile Video Coding (VVC). As the performance of neural image compression continues to increase, neural-network-based video compression has become an actively developing area of research. However, the coding efficiency of neural-network-based image/video coding is generally expected to be further improved.

Disclosure of Invention

Embodiments of the present disclosure provide a solution for visual data processing. In a first aspect, a method for visual data processing is presented. 
The method includes performing a conversion between visual data and a bitstream of the visual data using a Neural Network (NN) based model, wherein a first format for encoding and decoding the visual data and a second format of the output visual data from the conversion are indicated with different indications, the first format indicating a first relationship between a size of a first component of the decoded visual data and a size of a second component of the decoded visual data, and the second format indicating a second relationship between a size of the first component of the output visual data and a size of the second component of the output visual data.

Based on the method according to the first aspect of the present disclosure, a first format for encoding and decoding visual data and a second format of the output visual data from the conversion are indicated with different indications. Compared to conventional solutions where the first format and the second format are coupled together and controlled by the same syntax element(s), the proposed method can advantageously decouple signaling of the first format and the second format, so that the two formats can be controlled independently. In this way, coding flexibility and coding efficiency can be improved.

In a second aspect, an apparatus for visual data processing is presented. The apparatus includes a processor and a non-transitory memory having instructions thereon. The instructions, when executed by the processor, cause the processor to perform a method according to the first aspect of the present disclosure.

In a third aspect, a non-transitory computer readable storage medium is presented. The non-transitory computer readable storage medium stores instructions that cause a processor to perform a method according to the first aspect of the present disclosure.

In a fourth aspect, another non-transitory computer readable recording medium is presented. 
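The resampling described in claims 10 and 11 (adjusting the second, i.e. chroma, component from the coded format to the output format based on the ratio between the two formats) can be sketched as follows. This is an illustrative nearest-neighbour implementation, not the patent's method; the NN-based model could equally realize the size adjustment with an interpolation or a shuffle layer, and the format table here is an assumption encoding the standard subsampling divisors:

```python
# (width divisor, height divisor) of a chroma plane relative to luma.
SUBSAMPLING = {"4:2:0": (2, 2), "4:2:2": (2, 1), "4:4:4": (1, 1)}

def resample_chroma(plane, first_fmt, second_fmt):
    """Resample a chroma plane (a list of rows) from the coded (first)
    format to the output (second) format.

    The resampling factors are the ratio between the two formats'
    subsampling divisors; the combinations permitted by claim 7 only
    require integer upsampling factors.
    """
    (w1, h1), (w2, h2) = SUBSAMPLING[first_fmt], SUBSAMPLING[second_fmt]
    rw, rh = w1 // w2, h1 // h2  # horizontal and vertical upsampling ratios
    # Nearest-neighbour: repeat each pixel rw times and each row rh times.
    return [[px for px in row for _ in range(rw)]
            for row in plane for _ in range(rh)]
```

For example, converting a 2x2 chroma plane coded in 4:2:0 to a 4:4:4 output doubles it in both dimensions, while a 4:2:2 to 4:4:4 conversion doubles only the width; when the two formats coincide, the plane passes through unchanged.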
The non-transitory computer readable recording medium stores a bitstream of visual data generated by a method performed by an apparatus for visual data processing. The method includes performing a conversion between visual data and a bitstream using a Neural Network (NN) based model, wherein a first format for encoding and decoding the visual data and a second format of the output visual data from the conversion are indicated with different indications, the first format indicating a first relationship between a size of a first component of the decoded visual data and a size of a second component of the decoded visual data, and the second format indicating a second relationship between a size of the first component of the output visual data and a size of the second component of the output visual data.

In a fifth aspect, a method for storing a bitstream of visual data is presented. The method includes performing a conversion between visual data and a bitstream using a Neural Network (NN)-based model, and storing the bitstream in a non-transitory computer-readable recording medium, wherein a first format for encoding and decoding the visual data and a second format of the output visual data from the conversion are indicated with different indications, the first format indicating a first relationship between a size of a first component of the decoded visual data and a size of a second component of the decoded visual data, and the second format indicating a second relationship between a size of the first component of the output visual data and a size of the second component of the output visual data.

This summary is provided to introduce a selection of concepts in a simplified form that a