EP-4740470-A1 - CONVENTIONAL AND NEURAL NETWORK CODECS FOR RANDOM ACCESS VIDEO CODING
Abstract
An example device for decoding video data includes a processing system comprising one or more processors implemented in circuitry and configured to: determine that a first temporal layer identifier of a first picture of the video data is included in a first set of temporal layers; in response to the first temporal layer identifier being included in the first set of temporal layers, decode blocks of the first picture on a block by block basis; determine that a second temporal layer identifier of a second picture of the video data is included in a second set of temporal layers, the second set of temporal layers being higher than the first set of temporal layers; and in response to the second temporal layer identifier being included in the second set of temporal layers, execute a neural network-based video decoder to decode the second picture.
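The layer-based dispatch described in the abstract can be sketched in a few lines. This is an illustrative sketch only, not the patented implementation: the names `Picture`, `decode_block_based`, and `decode_neural` are hypothetical stand-ins for a conventional block-based decoder and a neural network-based decoder, and the set membership test stands in for the temporal-layer-identifier check.

```python
# Hypothetical sketch of the temporal-layer dispatch in the abstract:
# pictures whose temporal layer identifier falls in a first (lower) set
# are decoded block by block; pictures in a second (higher) set are
# passed to a neural network-based decoder. All names are illustrative.
from dataclasses import dataclass


@dataclass
class Picture:
    temporal_id: int
    payload: bytes


def decode_block_based(pic):
    # Stand-in for a conventional block-by-block decoder.
    return f"block-decoded(tid={pic.temporal_id})"


def decode_neural(pic, references):
    # Stand-in for a neural network-based decoder; per claim 6, the
    # lower-layer pictures may be provided to it as input.
    return f"neural-decoded(tid={pic.temporal_id}, refs={len(references)})"


def decode(pictures, low_layers, high_layers):
    decoded = []
    low_layer_refs = []
    for pic in pictures:
        if pic.temporal_id in low_layers:
            out = decode_block_based(pic)
            low_layer_refs.append(out)  # retained as potential NN inputs
        elif pic.temporal_id in high_layers:
            out = decode_neural(pic, low_layer_refs)
        else:
            raise ValueError("temporal layer not assigned to either decoder")
        decoded.append(out)
    return decoded
```

In this sketch the two decoder paths are selected purely by temporal layer identifier, mirroring the two "in response to" clauses of the abstract.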
Inventors
- RYDER, Thomas Alexander
- EADIE, Samuel James
- KARCZEWICZ, Marta
- COBAN, Muhammed Zeyd
- SEREGIN, Vadim
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date
- 20260513
- Application Date
- 20240621
Claims (20)
- 1. A method of decoding video data, the method comprising: determining that a first temporal layer identifier of a first picture of video data is included in a first set of temporal layers; in response to the first temporal layer identifier being included in the first set of temporal layers, decoding blocks of the first picture on a block by block basis; determining that a second temporal layer identifier of a second picture of the video data is included in a second set of temporal layers, the second set of temporal layers being higher than the first set of temporal layers; and in response to the second temporal layer identifier being included in the second set of temporal layers, decoding the second picture using a neural network-based video decoder.
- 2. The method of claim 1, further comprising: encoding the blocks of the first picture on a block by block basis; and encoding the second picture using a neural network-based video encoder.
- 3. The method of claim 2, wherein encoding the blocks of the first picture includes: for a current block of the first picture, forming a prediction block using inter-prediction, intra-prediction, affine prediction, or intra block copy (IBC) mode; forming a residual block representing differences between the current block and the prediction block; and encoding the residual block and prediction information used to form the prediction block.
- 4. The method of claim 1, wherein decoding the blocks of the first picture includes: for a current block of the blocks of the first picture, forming a prediction block using one of inter-prediction, intra-prediction, affine prediction, or intra block copy (IBC) mode; decoding a residual block for the current block; and combining the prediction block with the residual block to form a decoded block for the current block.
- 5. The method of claim 1, wherein at least one picture of the second set of temporal layers is predicted from a reference picture of the first set of temporal layers.
- 6. The method of claim 1, further comprising providing each of the pictures of the first set of temporal layers to the neural network-based video decoder as input for use when decoding pictures in the second set of temporal layers.
- 7. The method of claim 1, wherein decoding the blocks of the first picture and decoding the second picture comprises decoding the first picture before decoding the second picture, the method further comprising: determining that the second picture has a display order before a display order of the first picture; and outputting the second picture before outputting the first picture based on the second picture having the display order before the display order of the first picture.
- 8. A device for decoding video data, the device comprising: a memory configured to store video data; and a processing system comprising one or more processors implemented in circuitry, the processing system being configured to: determine that a first temporal layer identifier of a first picture of the video data is included in a first set of temporal layers; in response to the first temporal layer identifier being included in the first set of temporal layers, decode blocks of the first picture on a block by block basis; determine that a second temporal layer identifier of a second picture of the video data is included in a second set of temporal layers, the second set of temporal layers being higher than the first set of temporal layers; and in response to the second temporal layer identifier being included in the second set of temporal layers, execute a neural network-based video decoder to decode the second picture.
- 9. The device of claim 8, wherein the processing system is further configured to: encode the blocks of the first picture on a block by block basis; and execute a neural network-based video encoder to encode the second picture.
- 10. The device of claim 9, wherein to encode the blocks of the first picture, the processing system is configured to: for a current block of the first picture, form a prediction block using inter-prediction, intra-prediction, affine prediction, or intra block copy (IBC) mode; form a residual block representing differences between the current block and the prediction block; and encode the residual block and prediction information used to form the prediction block.
- 11. The device of claim 8, wherein to decode the blocks of the first picture, the processing system is configured to: for a current block of the blocks of the first picture, form a prediction block using one of inter-prediction, intra-prediction, affine prediction, or intra block copy (IBC) mode; decode a residual block for the current block; and combine the prediction block with the residual block to form a decoded block for the current block.
- 12. The device of claim 8, wherein at least one picture of the second set of temporal layers is predicted from a reference picture of the first set of temporal layers.
- 13. The device of claim 8, wherein the processing system is configured to provide each of the pictures of the first set of temporal layers to the neural network-based video decoder as input for use when decoding pictures in the second set of temporal layers.
- 14. The device of claim 8, wherein to decode the blocks of the first picture and to decode the second picture, the processing system is configured to decode the first picture before executing the neural network-based video decoder to decode the second picture, and wherein the processing system is further configured to: determine that the second picture has a display order before a display order of the first picture; and output the second picture before outputting the first picture based on the second picture having the display order before the display order of the first picture.
- 15. A device for decoding video data, the device comprising: means for determining that a first temporal layer identifier of a first picture of video data is included in a first set of temporal layers; means for decoding blocks of the first picture on a block by block basis in response to the first temporal layer identifier being included in the first set of temporal layers; means for determining that a second temporal layer identifier of a second picture of the video data is included in a second set of temporal layers, the second set of temporal layers being higher than the first set of temporal layers; and means for decoding the second picture using a neural network-based video decoder in response to the second temporal layer identifier being included in the second set of temporal layers.
- 16. The device of claim 15, further comprising: means for encoding the blocks of the first picture on a block by block basis; and means for encoding the second picture using a neural network-based video encoder.
- 17. The device of claim 16, wherein the means for encoding the blocks of the first picture includes: means for forming a prediction block using inter-prediction, intra-prediction, affine prediction, or intra block copy (IBC) mode for a current block of the first picture; means for forming a residual block representing differences between the current block and the prediction block; and means for encoding the residual block and prediction information used to form the prediction block.
- 18. The device of claim 15, wherein the means for decoding the blocks of the first picture includes: means for forming a prediction block using one of inter-prediction, intra-prediction, affine prediction, or intra block copy (IBC) mode for a current block of the blocks of the first picture; means for decoding a residual block for the current block; and means for combining the prediction block with the residual block to form a decoded block for the current block.
- 19. The device of claim 15, further comprising means for providing each of the pictures of the first set of temporal layers to the neural network-based video decoder as input for use when decoding pictures in the second set of temporal layers.
- 20. The device of claim 15, wherein the means for decoding the blocks of the first picture is configured to decode the blocks of the first picture before the means for decoding the second picture decodes the second picture, further comprising: means for determining that the second picture has a display order before a display order of the first picture; and means for outputting the second picture before outputting the first picture based on the second picture having the display order before the display order of the first picture.
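Claims 7, 14, and 20 recite that decoding order and output (display) order may differ: a picture decoded later may be output first if its display order precedes that of an earlier-decoded picture. The following sketch is illustrative only; `reorder_for_output` and the tuple layout are assumptions, not the claimed apparatus.

```python
# Hypothetical sketch of the output reordering in claims 7, 14, and 20:
# pictures are received in decoding order, but are output sorted by
# display order, so a later-decoded picture with an earlier display
# order is output first.
def reorder_for_output(decoded_pictures):
    """decoded_pictures: list of (name, display_order) tuples in decoding
    order. Returns the picture names in display (output) order."""
    return [name for name, disp in
            sorted(decoded_pictures, key=lambda p: p[1])]
```

For example, a random-access structure may decode an intra picture (display order 0), then a forward-predicted picture (display order 8), then a bi-predicted picture (display order 4); the bi-predicted picture is output before the picture decoded ahead of it.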
Description
CONVENTIONAL AND NEURAL NETWORK CODECS FOR RANDOM ACCESS VIDEO CODING

[0001] This application claims priority to U.S. Patent Application No. 18/744,171, filed June 14, 2024, and U.S. Provisional Application No. 63/511,836, filed July 3, 2023, the entire contents of each of which are hereby incorporated by reference. U.S. Patent Application No. 18/744,171, filed June 14, 2024, claims the benefit of U.S. Provisional Application No. 63/511,836, filed July 3, 2023.

TECHNICAL FIELD

[0002] This disclosure relates to video coding, including video encoding and video decoding.

BACKGROUND

[0003] Digital video capabilities can be incorporated into a wide range of devices, including digital televisions, digital direct broadcast systems, wireless broadcast systems, personal digital assistants (PDAs), laptop or desktop computers, tablet computers, e-book readers, digital cameras, digital recording devices, digital media players, video gaming devices, video game consoles, cellular or satellite radio telephones, so-called "smart phones," video teleconferencing devices, video streaming devices, and the like. Digital video devices implement video coding techniques, such as those described in the standards defined by MPEG-2, MPEG-4, ITU-T H.263, ITU-T H.264/MPEG-4, Part 10, Advanced Video Coding (AVC), ITU-T H.265/High Efficiency Video Coding (HEVC), ITU-T H.266/Versatile Video Coding (VVC), and extensions of such standards, as well as proprietary video codecs/formats such as AOMedia Video 1 (AV1) developed by the Alliance for Open Media. The video devices may transmit, receive, encode, decode, and/or store digital video information more efficiently by implementing such video coding techniques.

[0004] Video coding techniques include spatial (intra-picture) prediction and/or temporal (inter-picture) prediction to reduce or remove redundancy inherent in video sequences.
For block-based video coding, a video slice (e.g., a video picture or a portion of a video picture) may be partitioned into video blocks, which may also be referred to as coding tree units (CTUs), coding units (CUs), and/or coding nodes. Video blocks in an intra-coded (I) slice of a picture are encoded using spatial prediction with respect to reference samples in neighboring blocks in the same picture. Video blocks in an inter-coded (P or B) slice of a picture may use spatial prediction with respect to reference samples in neighboring blocks in the same picture or temporal prediction with respect to reference samples in other reference pictures. Pictures may be referred to as frames, and reference pictures may be referred to as reference frames.

SUMMARY

[0005] In general, this disclosure describes techniques for coding video data using a combined conventional encoder/decoder and a neural encoder/decoder. For example, a combined encoder may encode certain temporal layers of video data using a conventional encoder and other temporal layers using a neural network-based video encoder. Similarly, a combined decoder may decode certain temporal layers of video data using a conventional decoder and other temporal layers using a neural network-based video decoder.

[0006] In particular, conventional decoding techniques include prediction of blocks of video data and coding of residual blocks, where the residual blocks represent differences between prediction blocks and the original blocks of video data. Thus, a video coder may code blocks of pictures of a first set of temporal layers using prediction and residual coding. That is, for the blocks of the pictures of the first set of temporal layers, the video coder may form prediction blocks using, e.g., one of inter-prediction, intra-prediction, intra-block copy (IBC), affine prediction, or the like.
The video coder may also code (encode or decode) residual blocks for the blocks of the pictures of the first set of temporal layers. By contrast, the video coder may apply neural network-based coding techniques to pictures of a second set of temporal layers having higher temporal layer values than the first set of temporal layers (e.g., where pictures of the second set of temporal layers may be predicted from the pictures of the first set of temporal layers).

[0007] In one example, a method of decoding video data includes: determining that a first temporal layer identifier of a first picture of video data is included in a first set of temporal layers; in response to the first temporal layer identifier being included in the first set of temporal layers, decoding blocks of the first picture on a block by block basis; determining that a second temporal layer identifier of a second picture of the video data is included in a second set of temporal layers, the second set of temporal layers being higher than the first set of temporal layers; and in response to the second temporal layer identifier being included in the second set of temporal layers, decoding the second picture using a neural network-based video decoder.

[0008] In