US-12627794-B2 - Content aware dataset generation and model selection for learned image compression
Abstract
A system may receive an input image block, and input the input image block into multiple models which may be trained using a plurality of different datasets of image blocks. Each model of the multiple models may be trained using a dataset having similar attributes. The system may determine a model having a highest compression efficiency from among the multiple models, and encode the input image block using the determined model.
Inventors
- Yan Ye
- Wei Jiang
- Wei Wang
Assignees
- ALIBABA (CHINA) CO., LTD.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-07-03
Claims (14)
- 1. A method implemented by a computing device, the method comprising: receiving an input image block; inputting the input image block into multiple models, wherein the multiple models are trained using a plurality of different datasets of image blocks, a model of the multiple models being trained using a dataset including multiple image blocks with a similar variance; and after inputting the input image block into the multiple models, determining a model having a highest compression efficiency from among the multiple models, wherein at least another model of the multiple models is trained using a dataset including a plurality of image blocks having similar attributes, and the similar attributes comprise at least one of an object included in the plurality of image blocks, or a texture associated with the plurality of image blocks.
- 2. The method of claim 1, further comprising: encoding the input image block using the determined model; and sending the encoded image block in a bitstream.
- 3. The method of claim 2, further comprising: adding a model selection flag representing the determined model for the input image block in the bitstream.
- 4. The method of claim 1, further comprising grouping different image blocks into a plurality of different datasets based at least in part on one or more selection criteria.
- 5. The method of claim 1, further comprising grouping different image blocks into a plurality of different datasets based at least in part on a grouping or clustering algorithm or a classification algorithm.
- 6. The method of claim 1, further comprising reducing a capacity of the determined model.
- 7. The method of claim 6, wherein reducing the capacity of the determined model comprises performing at least one of model sparsification, pruning, unification, quantization, or model compression.
- 8. One or more computer readable media storing executable instructions that, when executed by one or more processors, cause the one or more processors to perform acts comprising: receiving an input image block; inputting the input image block into multiple models, wherein the multiple models are trained using different datasets of image blocks, a model of the multiple models being trained using a dataset including multiple image blocks with a similar variance; and after inputting the input image block into the multiple models, determining a model having a highest compression efficiency from among the multiple models, wherein at least another model of the multiple models is trained using a dataset including a plurality of image blocks having similar attributes, and the similar attributes comprise at least one of an object included in the plurality of image blocks, or a texture associated with the plurality of image blocks.
- 9. The one or more computer readable media of claim 8, the acts further comprising: encoding the input image block using the determined model; and sending the encoded image block in a bitstream.
- 10. The one or more computer readable media of claim 9, the acts further comprising: adding a model selection flag representing the determined model for the input image block in the bitstream.
- 11. The one or more computer readable media of claim 8, the acts further comprising grouping different image blocks into a plurality of different datasets based at least in part on one or more selection criteria.
- 12. The one or more computer readable media of claim 8, the acts further comprising grouping different image blocks into a plurality of different datasets based at least in part on a grouping or clustering algorithm or a classification algorithm.
- 13. The one or more computer readable media of claim 8, the acts further comprising reducing a capacity of the determined model.
- 14. The one or more computer readable media of claim 13, wherein reducing the capacity of the determined model comprises performing at least one of model sparsification, pruning, unification, quantization, and model compression.
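The claims above describe two cooperating steps: grouping training blocks into datasets by an attribute such as variance (claims 1, 4-5), and selecting, for each input block, the model with the highest compression efficiency (claims 1-3). The following is a minimal sketch of that logic, in which uniform quantizers with different step sizes serve as hypothetical stand-ins for the learned per-dataset models; the function names, step sizes, variance thresholds, and the rate-distortion proxy are illustrative assumptions, not taken from the patent.

```python
import numpy as np

# Hypothetical stand-ins for the learned per-dataset models: uniform
# quantizers with step sizes tuned for low/mid/high-variance content.
MODELS = {"low_var": 2.0, "mid_var": 8.0, "high_var": 24.0}

def group_by_variance(blocks, thresholds=(10.0, 100.0)):
    """Claims 4-5 (illustrative): partition training blocks into
    datasets by sample variance, one dataset per model."""
    datasets = {name: [] for name in MODELS}
    names = list(MODELS)
    for block in blocks:
        idx = int(np.digitize(np.var(block), thresholds))
        datasets[names[idx]].append(block)
    return datasets

def rd_cost(block, step, lam=1.0):
    """Rate-distortion proxy: MSE plus lambda times the entropy
    (bits per sample) of the quantized symbols."""
    symbols = np.round(block / step)
    distortion = float(np.mean((block - symbols * step) ** 2))
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    rate = float(-(p * np.log2(p)).sum())
    return distortion + lam * rate

def select_model(block):
    """Claim 1 (illustrative): run the block through every model and
    keep the one with the lowest cost, i.e. the highest efficiency."""
    return min(MODELS, key=lambda name: rd_cost(block, MODELS[name]))
```

An encoder built this way would then code the block with the selected model and, per claim 3, write a model selection flag into the bitstream so the decoder can apply the same model.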
Description
CROSS REFERENCE TO RELATED PATENT APPLICATION

This application claims the benefit of U.S. Provisional Patent Application No. 63/389,780, entitled “Content Aware Dataset Generation and Model Selection for Learned Image Compression” and filed Jul. 15, 2022, which is expressly incorporated herein by reference in its entirety.

BACKGROUND

Image/video compression plays a critical role in image/video transmission and storage systems. Over the past few decades, various image/video coding standards have been developed for image/video compression, such as JPEG, JPEG2000, the H.264/MPEG-4 Part 10 AVC standard, and the H.265/HEVC standard. More recently, the Versatile Video Coding (VVC) standard was developed and finalized in 2020 to further improve video coding efficiency. All of these standards use a hybrid coding framework, which includes intra/inter prediction, transform, quantization, and entropy coding, to exploit spatial/temporal redundancy, visual redundancy, and statistical redundancy in image/video content.

In recent years, deep image/video compression methods have been developing rapidly with promising results. Compared with traditional image/video compression methods, which mainly rely on hand-crafted modules that need to be designed individually, deep image/video compression methods can optimize all the modules in an image/video compression framework in an end-to-end manner. In addition, compared with the traditional image/video compression methods, the deep image/video compression methods can easily perform optimization using different distortion metrics.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth with reference to the accompanying figures. In the figures, the left-most digit(s) of a reference number identifies the figure in which the reference number first appears. The use of the same reference numbers in different figures indicates similar or identical items.

FIG. 1 illustrates an encoder block diagram of an example block-based hybrid video coding system. FIG. 2 illustrates a decoder block diagram of an example block-based hybrid video coding system. FIG. 3 illustrates a schematic block diagram of an example joint autoregressive and hierarchical priors model for learned image compression. FIG. 4 illustrates an example system for implementing the above-described processes and methods.

DETAILED DESCRIPTION

In example implementations, the VVC may be constructed based on the same hybrid video coding system that has been used in modern video compression standards such as HEVC, H.264/AVC, MPEG2, H.263, etc. FIG. 1 shows an encoder block diagram of an example hybrid video coding system 100. In example implementations, an input video 102 may be processed block by block. The hybrid video coding system 100 may divide a picture or image of the input video 102 into macroblocks (“MBs”), each having predefined dimensions (such as N×N pixels, where N is a positive integer), and divide or partition each macroblock into a plurality of partitions. By way of example and not limitation, the hybrid video coding system 100 may divide a picture or image of the input video 102 into coding tree units (CTUs). In example implementations, a coding tree unit (CTU) in VVC may be defined as the largest block unit, and may be as large as 128×128 luma samples (plus corresponding chroma samples depending on the chroma format that is used). A CTU may be further partitioned into coding units (CUs) using a quad-tree, binary tree, or ternary tree. At the leaf nodes of such a partitioning structure, coding information such as a coding mode (e.g., intra mode or inter mode, etc.), motion information (such as a reference index, motion vectors, etc.) if inter coded, and quantized residual coefficients may be sent.
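The CTU-to-CU partitioning described above can be sketched as a simple recursion. This sketch covers only the quad-tree case (VVC additionally allows binary and ternary splits), and the `should_split` callback is an illustrative stand-in for the encoder's actual rate-distortion split decision; the function name and parameters are assumptions for illustration.

```python
def quadtree_partition(x, y, size, min_size, should_split):
    """Recursively split a square block into four quadrants until the
    split decision says stop or the minimum CU size is reached.
    Returns the leaf CUs as (x, y, size) tuples."""
    if size <= min_size or not should_split(x, y, size):
        return [(x, y, size)]
    half = size // 2
    leaves = []
    for dy in (0, half):
        for dx in (0, half):
            leaves += quadtree_partition(x + dx, y + dy, half,
                                         min_size, should_split)
    return leaves

# Example: split a 128x128 CTU one level down, yielding four 64x64 CUs.
cus = quadtree_partition(0, 0, 128, min_size=8,
                         should_split=lambda x, y, s: s > 64)
```

At each leaf returned by such a recursion, the encoder would attach the coding information mentioned above (coding mode, motion information if inter coded, and quantized residual coefficients).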
In alternative implementations, the hybrid video coding system 100 may divide a picture into units of N×N pixels, which may then be further subdivided into subunits. Each of these largest subdivided units of a picture may generally be referred to as a “block” for the purpose of the present disclosure. In example implementations, a CU is coded using one block of luma samples and two corresponding blocks of chroma samples, where pictures are not monochrome and are coded using one coding tree. In example implementations, if intra prediction 104 (also called spatial prediction) is used, spatially neighboring samples may be used to predict a current block to be coded. If inter prediction 106 (also called temporal prediction or motion compensated prediction) is used, samples from already coded pictures (i.e., reference pictures) may be used to predict the current block. In example implementations, different prediction methods may be used for inter prediction, including, but not limited to, uni-prediction, bi-prediction, etc. In example implementations, if uni-prediction