US-12627804-B2 - Method, apparatus, and medium for visual data processing
Abstract
Embodiments of the present disclosure provide a solution for visual data processing. A method for visual data processing is proposed. The method comprises: performing, for a conversion between visual data and a bitstream of the visual data, a quantization process on a dataset comprising at least one of: input visual data of a neural network model used for the conversion, or a parameter of the neural network model; and performing the conversion based on the quantization process.
Inventors
- Yaojun WU
- Semih ESENLIK
- Zhaobin Zhang
- Yue Li
- Kai Zhang
- Li Zhang
Assignees
- BEIJING BYTEDANCE NETWORK TECHNOLOGY CO., LTD.
- BYTEDANCE INC.
Dates
- Publication Date
- 20260512
- Application Date
- 20240909
- Priority Date
- 20230309
Claims (18)
- 1 . A method for visual data processing, comprising: performing, for a conversion between visual data and a bitstream of the visual data, a quantization process on a dataset comprising at least one of: input visual data of a neural network model used for the conversion, or a parameter of the neural network model; and performing the conversion based on the quantization process, wherein performing the quantization process comprises: determining at least one threshold for the dataset; and determining a fix-point representation of the dataset by performing the quantization process based on the at least one threshold, wherein performing the quantization process comprises: determining the fix-point representation of the dataset by clipping the dataset based on the at least one threshold.
- 2 . The method of claim 1 , wherein performing the quantization process comprises: obtaining the fix-point representation of the dataset by performing the quantization process on a floating-point representation of the dataset.
- 3 . The method of claim 1 , wherein the parameter of the neural network model comprises at least one of the following: a first parameter associated with a convolution layer or a transpose convolution layer of the neural network model, or a second parameter associated with an activation metric of the neural network model.
- 4 . The method of claim 3 , wherein the first parameter comprises a weight of the convolution layer or the transpose convolution layer of the neural network model, wherein the first parameter does not comprise a bias of the convolution layer or the transpose convolution layer in floating-point.
- 5 . The method of claim 1 , wherein the at least one threshold comprises at least one of: a first threshold of a value or absolute value of the input visual data of the neural network model, or a second threshold of a value or absolute value of a weight of the neural network model.
- 6 . The method of claim 5 , wherein determining the at least one threshold comprises: determining the at least one threshold based on a maximum number of bits used in a fix-point operation and a shape of a convolution layer or a transpose convolution layer of the neural network model.
- 7 . The method of claim 6 , wherein the at least one threshold is determined based on a first metric as follows: V×W = 2^(B-1)/(M×N×K_H×K_W), wherein V represents a first threshold of the at least one threshold for the input visual data of the neural network model, W represents a second threshold of the at least one threshold for a weight of the neural network model, B represents the maximum number of bits, M represents a number of input channels of the neural network model, N represents a number of output channels of the neural network model, K_H represents a height of a kernel of the neural network model, and K_W represents a width of the kernel.
- 8 . The method of claim 7 , wherein the first and second thresholds are determined by V = W = √(2^(B-1)/(M×N×K_H×K_W)).
- 9 . The method of claim 6 , wherein the at least one threshold is determined based on a second metric as follows: log2(V) + log2(W) = B - ceil(log2(M×N×K_H×K_W)) - 1, wherein V represents a first threshold of the at least one threshold for the input visual data of the neural network model, W represents a second threshold of the at least one threshold for a weight of the neural network model, B represents the maximum number of bits, M represents a number of input channels of the neural network model, N represents a number of output channels of the neural network model, K_H represents a height of a kernel of the neural network model, K_W represents a width of the kernel, and ceil( ) represents a ceiling metric.
- 10 . The method of claim 9 , wherein the first and second thresholds are further determined by: determining a ratio between the first threshold and the second threshold based on statistic average values of the input visual data; and determining the first and second thresholds based on the ratio and the second metric.
- 11 . The method of claim 9 , wherein the first and second thresholds are determined by using V = W = 2^((B - ceil(log2(M×N×K_H×K_W)) - 1)//2), // representing an integer division operation.
- 12 . The method of claim 1 , wherein determining the at least one threshold comprises: determining a plurality of thresholds for weights of a plurality of layers of the neural network model; determining a minimum threshold of the plurality of thresholds; and determining the minimum threshold as a threshold for weights of the plurality of layers of the neural network model.
- 13 . The method of claim 1 , wherein determining the fix-point representation of the dataset comprises: determining the fix-point representation of the input visual data of the neural network model by clipping the input visual data based on a maximum number of bits, wherein the maximum number of bits is log2(V) bits, V representing a first threshold of the at least one threshold for the input visual data, and wherein the input visual data is clipped to be in one of the following ranges: a first range of zero to V−1, or a second range of −V/2 to V/2−1.
- 14 . The method of claim 1 , wherein determining the fix-point representation of the dataset comprises: determining the fix-point representation of a weight of the neural network model by clipping the weight based on a maximum number of bits, wherein the maximum number of bits is log2(W) bits, W representing a second threshold of the at least one threshold for the weight, and wherein the weight is clipped to be in one of the following ranges: a third range of zero to W−1, or a fourth range of −W/2 to W/2−1.
- 15 . The method of claim 1 , wherein the conversion includes encoding the visual data into the bitstream, or wherein the conversion includes decoding the visual data from the bitstream.
- 16 . An apparatus for data processing comprising a processor and a non-transitory memory with instructions thereon, wherein the instructions, upon execution by the processor, cause the processor to: perform, for a conversion between visual data and a bitstream of the visual data, a quantization process on a dataset comprising at least one of: input visual data of a neural network model used for the conversion, or a parameter of the neural network model; and perform the conversion based on the quantization process, wherein performing the quantization process comprises: determining at least one threshold for the dataset; and determining a fix-point representation of the dataset by performing the quantization process based on the at least one threshold, wherein performing the quantization process comprises: determining the fix-point representation of the dataset by clipping the dataset based on the at least one threshold.
- 17 . A non-transitory computer-readable storage medium storing instructions that cause a processor to perform a method performed by a data processing apparatus, wherein the method comprises: performing, for a conversion between visual data and a bitstream of the visual data, a quantization process on a dataset comprising at least one of: input visual data of a neural network model used for the conversion, or a parameter of the neural network model; and performing the conversion based on the quantization process, wherein performing the quantization process comprises: determining at least one threshold for the dataset; and determining a fix-point representation of the dataset by performing the quantization process based on the at least one threshold, wherein performing the quantization process comprises: determining the fix-point representation of the dataset by clipping the dataset based on the at least one threshold.
- 18 . A non-transitory computer-readable recording medium storing a bitstream of data which is generated by a method performed by an apparatus for data processing, wherein the method comprises: performing a quantization process on a dataset comprising at least one of: input visual data of a neural network model used for generating the bitstream, or a parameter of the neural network model; and generating the bitstream based on the quantization process, wherein generating the bitstream based on the quantization process comprises: determining at least one threshold for the dataset; and determining a fix-point representation of the dataset by performing the quantization process based on the at least one threshold, wherein performing the quantization process comprises: determining the fix-point representation of the dataset by clipping the dataset based on the at least one threshold.
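The threshold formula of claims 7 and 8 can be evaluated directly. The sketch below is a minimal illustration, assuming Python and illustrative layer dimensions (B = 32 bits, 64 input and output channels, a 3×3 kernel); none of these values, nor the function name, come from the claims.

```python
import math

def equal_thresholds(B, M, N, KH, KW):
    # Claim 8: V = W = sqrt(2^(B-1) / (M * N * KH * KW)), so that
    # V * W = 2^(B-1) / (M * N * KH * KW) as required by claim 7.
    return math.sqrt(2 ** (B - 1) / (M * N * KH * KW))

v = equal_thresholds(B=32, M=64, N=64, KH=3, KW=3)
print(round(v, 2))  # -> 241.36
```

With a larger kernel or more channels the denominator grows, so the thresholds shrink to keep the fix-point products within the B-bit budget.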
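The second metric of claims 9 through 11 instead allocates a bit budget between the two thresholds. Below is a hedged sketch: the equal split of claim 11 uses integer division, while the ratio-based split of claim 10 assumes the ratio has already been estimated from the input statistics (how that estimate is made is not shown here).

```python
import math

def bit_budget(B, M, N, KH, KW):
    # Claim 9: log2(V) + log2(W) = B - ceil(log2(M*N*KH*KW)) - 1
    return B - math.ceil(math.log2(M * N * KH * KW)) - 1

def equal_power_of_two_thresholds(B, M, N, KH, KW):
    # Claim 11: V = W = 2^(budget // 2), with // integer division
    return 2 ** (bit_budget(B, M, N, KH, KW) // 2)

def ratio_split_thresholds(B, M, N, KH, KW, ratio):
    # Claim 10 (sketch): choose V and W so that V / W equals `ratio`
    # while still satisfying log2(V) + log2(W) = budget.
    W = math.sqrt(2 ** bit_budget(B, M, N, KH, KW) / ratio)
    return ratio * W, W

print(equal_power_of_two_thresholds(32, 64, 64, 3, 3))  # -> 128
```

Restricting the thresholds to powers of two, as in claim 11, lets later scaling steps be implemented with bit shifts rather than divisions.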
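The threshold sharing of claim 12 reduces to taking a minimum over the per-layer thresholds. A trivial sketch, assuming the per-layer weight thresholds have already been computed:

```python
def shared_weight_threshold(per_layer_thresholds):
    # Claim 12: use the minimum of the per-layer weight thresholds
    # as the common threshold for all layers of the model.
    return min(per_layer_thresholds)

print(shared_weight_threshold([128, 64, 256]))  # -> 64
```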
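The clipping ranges recited in claims 13 and 14 can be sketched as follows, assuming NumPy. The rounding step and function names are assumptions for illustration; the claims themselves only recite clipping the data to one of the two ranges per threshold.

```python
import numpy as np

def clip_to_fix_point(data, V, signed=True):
    # Sketch of claims 13-14: round floating-point data to integers
    # (rounding is an assumption, not recited), then clip to one of
    # the recited ranges for threshold V.
    q = np.rint(data).astype(np.int64)
    if signed:
        return np.clip(q, -V // 2, V // 2 - 1)  # range -V/2 .. V/2-1
    return np.clip(q, 0, V - 1)                 # range 0 .. V-1

x = np.array([-300.4, 0.6, 99.2, 300.7])
print(clip_to_fix_point(x, V=256).tolist())  # -> [-128, 1, 99, 127]
```

The same function applies to weights by substituting the weight threshold W for V, per claim 14.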
Description
CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of International Application No. PCT/CN2023/080423, filed on Mar. 9, 2023, which claims priority to Chinese Application No. PCT/CN2022/080028, filed on Mar. 9, 2022. The entire contents of these applications are hereby incorporated by reference in their entireties.

FIELD

Embodiments of the present disclosure relate generally to visual data processing techniques, and more particularly, to a quantization process, a scaling process, and a neural network model for visual data processing.

BACKGROUND

Image/video compression is an essential technique to reduce the costs of image/video transmission and storage in a lossless or lossy manner. Image/video compression techniques can be divided into two branches: classical video coding methods and neural-network-based video compression methods. Classical video coding schemes adopt transform-based solutions, in which researchers have exploited statistical dependency in the latent variables (e.g., wavelet coefficients) by carefully hand-engineering entropy codes that model the dependencies in the quantized regime. Neural-network-based video compression comes in two flavors: neural-network-based coding tools and end-to-end neural-network-based video compression. The former are embedded into existing classical video codecs as coding tools and serve only as part of the framework, while the latter is a separate framework developed based on neural networks without depending on classical video codecs. The coding efficiency of image/video coding is generally expected to be further improved.

SUMMARY

Embodiments of the present disclosure provide a solution for visual data processing. In a first aspect, a method for visual data processing is proposed.
The method comprises: performing, for a conversion between visual data and a bitstream of the visual data, a quantization process on a dataset comprising at least one of: input visual data of a neural network model used for the conversion, or a parameter of the neural network model; and performing the conversion based on the quantization process. The method in accordance with the first aspect of the present disclosure converts data, such as visual data, from a floating-point representation to a fix-point representation. In this way, the neural network model can perform fix-point calculation and thus provide consistent precision across different devices. Thus, coding efficiency and coding effectiveness can be improved.

In a second aspect, another method for visual data processing is proposed. The method comprises: performing, for a conversion between visual data and a bitstream of the visual data, a scaling process on a dataset comprising at least one of: input visual data of a neural network model used for the conversion, a parameter of the neural network model, intermediate visual data of the neural network model, or output visual data of the neural network model; and performing the conversion based on the scaling process. By performing the scaling process in accordance with the second aspect of the present disclosure, overflow issues occurring during the multiplication and accumulation of convolution may be avoided. Thus, the quantization loss can be reduced, and coding efficiency and coding effectiveness can be improved.

In a third aspect, another method for visual data processing is proposed.
The method comprises: performing a conversion between visual data and a bitstream of the visual data by using a neural network model, wherein the neural network model is characterized in at least one of the following: a number of channels of a layer of the neural network model being less than a threshold channel number, the neural network model using a rectified linear unit (ReLU), a number of layers of the neural network model being less than a threshold layer number, the neural network model using a group convolution or a group transpose convolution, or a kernel size of the neural network model being less than a threshold kernel size. According to the method in accordance with the third aspect of the present disclosure, an interoperability-friendly model can be used in visual data coding. In this way, device interoperability can be improved, and thus coding efficiency and coding effectiveness can be improved.

In a fourth aspect, an apparatus for visual data processing is proposed. The apparatus comprises a processor and a non-transitory memory with instructions thereon. The instructions, upon execution by the processor, cause the processor to perform a method in accordance with the first aspect, the second aspect, or the third aspect of the present disclosure.

In a fifth aspect, a non-transitory computer-readable storage medium is proposed. The non-transitory computer-readable storage medium stores instructions that cause a processor to perform a method in accordance with the first aspect, the second aspect, or the third aspect of the present disclosure.

In a sixth aspect, another non-transitory computer-