
EP-4738827-A2 - VIDEO CODEC ASSISTED REAL-TIME VIDEO ENHANCEMENT USING DEEP LEARNING


Abstract

Techniques related to accelerated video enhancement using deep learning selectively applied based on video codec information are discussed. Such techniques include applying a deep learning video enhancement network selectively to decoded non-skip blocks that are in low quantization parameter frames, bypassing the deep learning network for decoded skip blocks in low quantization parameter frames, and applying non-deep learning video enhancement to high quantization parameter frames.
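As a concrete illustration of this selective dispatch, a minimal sketch in Python follows. The QP threshold, the function names, and the block/frame layout are assumptions for illustration only and are not taken from the patent.

```python
# Hypothetical sketch only: QP_THRESHOLD, the function names, and the
# Block/Frame layout are illustrative assumptions, not from the patent.
from dataclasses import dataclass
from typing import Callable, List

QP_THRESHOLD = 30  # assumed frame-level quantization parameter cutoff

@dataclass
class Block:
    pixels: object   # decoded pixel data for this block
    is_skip: bool    # True if the decoder marked the block as skip-coded

@dataclass
class Frame:
    qp: int          # frame-level quantization parameter from the bitstream
    blocks: List[Block]

def enhance_frame(frame: Frame,
                  dl_enhance: Callable,        # deep learning enhancement
                  classic_enhance: Callable,   # non-deep-learning enhancement
                  reuse_previous: Callable):   # transfer of prior enhanced pixels
    """Route each decoded block to the cheapest adequate enhancer."""
    if frame.qp >= QP_THRESHOLD:
        # High-QP frame: heavy quantization limits what the deep network
        # can recover, so apply cheaper non-deep-learning enhancement.
        return [classic_enhance(b.pixels) for b in frame.blocks]
    out = []
    for b in frame.blocks:
        if b.is_skip:
            # Skip block: its content is copied from a reference frame, so
            # (one plausible reading of the bypass) reuse the co-located
            # pixels that were already enhanced, skipping the deep network.
            out.append(reuse_previous(b))
        else:
            # Non-skip block in a low-QP frame: run the deep network.
            out.append(dl_enhance(b.pixels))
    return out
```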

Inventors

  • WANG, Chen
  • ZHANG, Ximin
  • DOU, Huan
  • CHIU, Yi-Jen
  • LEE, Sang-Hee

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-06
Application Date
2020-12-14

Claims (15)

  1. One or more non-transitory computer-readable media storing instructions executable to perform operations, the operations comprising: obtaining a video and metadata of the video; generating, by a deep learning network, a first output based on at least part of the video and the metadata of the video; generating, by an upsampler, a second output based on the video, wherein the second output is generated by bypassing the deep learning network; and merging the first output and the second output to generate a new video, wherein the new video has a higher resolution than the video.
  2. The one or more non-transitory computer-readable media of claim 1, wherein the video and the metadata of the video are generated by a video decoder, and wherein the metadata comprises decoding information.
  3. The one or more non-transitory computer-readable media of claim 2, wherein the metadata of the video comprises a quantization parameter or a parameter indicating an intra mode or an inter mode.
  4. The one or more non-transitory computer-readable media of claim 1, wherein the deep learning network is a super resolution network, or wherein the deep learning network comprises one or more convolutional layers, or wherein the video is generated by reducing a resolution of another video.
  5. The one or more non-transitory computer-readable media of claim 4, wherein the deep learning network further comprises a rectified linear unit layer.
  6. A method comprising: obtaining a video and metadata of the video; generating, by a deep learning network, a first output based on at least part of the video and the metadata of the video; generating, by an upsampler, a second output based on the video, wherein the second output is generated by bypassing the deep learning network; and merging the first output and the second output to generate a new video, wherein the new video has a higher resolution than the video.
  7. The method of claim 6, wherein the video and the metadata of the video are generated by a video decoder, and wherein the metadata comprises decoding information.
  8. The method of claim 7, wherein the metadata of the video comprises a quantization parameter or a parameter indicating an intra mode or an inter mode.
  9. The method of claim 6, wherein the deep learning network is a super resolution network, or wherein the deep learning network comprises one or more convolutional layers, or wherein the video is generated by reducing a resolution of another video.
  10. The method of claim 9, wherein the deep learning network further comprises a rectified linear unit layer.
  11. An apparatus, comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations comprising: obtaining a video and metadata of the video, generating, by a deep learning network, a first output based on at least part of the video and the metadata of the video, generating, by an upsampler, a second output based on the video, wherein the second output is generated by bypassing the deep learning network, and merging the first output and the second output to generate a new video, wherein the new video has a higher resolution than the video.
  12. The apparatus of claim 11, wherein the video and the metadata of the video are generated by a video decoder, and wherein the metadata comprises decoding information.
  13. The apparatus of claim 12, wherein the metadata of the video comprises a quantization parameter or a parameter indicating an intra mode or an inter mode.
  14. The apparatus of claim 11, wherein the deep learning network is a super resolution network, or wherein the deep learning network comprises one or more convolutional layers.
  15. The apparatus of claim 14, wherein the deep learning network further comprises a rectified linear unit layer.
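Independent claims 1, 6, and 11 recite the same pipeline in media, method, and apparatus form: a deep learning branch applied to at least part of the video, an upsampler branch that bypasses the network, and a merge into a higher-resolution output. A minimal, untrained sketch of that pipeline follows, assuming PyTorch; the convolution-plus-ReLU network shape (consistent with dependent claims 4 and 5), the bicubic choice for the bypass upsampler, and the binary metadata-derived mask are all illustrative assumptions, not requirements of the claims.

```python
# Minimal, untrained sketch of the pipeline in claims 1, 6, and 11.
# All implementation choices here (scale factor, layer sizes, bicubic
# bypass, binary mask) are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

SCALE = 2  # assumed upscaling factor

class TinySRNet(nn.Module):
    """Convolutional layers + ReLU + sub-pixel upsampling (cf. claims 4-5)."""
    def __init__(self, channels=3, features=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, features, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, channels * SCALE**2, 3, padding=1),
            nn.PixelShuffle(SCALE),  # rearranges channels into a 2x2 pixel grid
        )

    def forward(self, x):
        return self.body(x)

def enhance(video: torch.Tensor, dl_mask: torch.Tensor, net: nn.Module):
    """video: (N, C, H, W) decoded frames; dl_mask: (N, 1, H, W) with 1
    where metadata (e.g., QP, intra/inter mode) selects the deep branch."""
    # First output: deep network applied to (at least part of) the video.
    first = net(video)
    # Second output: plain bicubic upsampler, bypassing the network.
    second = F.interpolate(video, scale_factor=SCALE, mode="bicubic",
                           align_corners=False)
    # Merge per pixel using the upsampled selection mask.
    mask = F.interpolate(dl_mask, scale_factor=SCALE, mode="nearest")
    return mask * first + (1 - mask) * second

frames = torch.rand(1, 3, 64, 64)                 # stand-in decoded video
mask = (torch.rand(1, 1, 64, 64) > 0.5).float()   # stand-in metadata mask
hi_res = enhance(frames, mask, TinySRNet())
assert hi_res.shape == (1, 3, 128, 128)  # new video at higher resolution
```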

Description

BACKGROUND

In deep learning super-resolution, video is upscaled using deep learning networks, such as convolutional neural networks, trained on training video and ground-truth upscaled video. Current deep learning-based video super-resolution (i.e., video upscaling) requires significant computation resources and memory bandwidth to achieve real-time performance (e.g., 1080p to 4K upscaling at 60 fps). Such requirements prohibit its wide deployment on many hardware platforms that have limited resources or stringent power budgets, such as laptops and tablets that include only integrated graphics.

Techniques to accelerate deep learning-based video super-resolution include simplifying the network topology of the employed deep learning network: reducing the number of layers, the number of channels, the number of connections between consecutive layers, and the bit-precision used to represent the weights and activations of the network. Other techniques use low rank approximation to reduce the complexity of the most computation-intensive layers (e.g., convolution and fully connected layers), as sketched below. Finally, some networks reduce complexity by exploiting temporal correlations via another neural network, which predicts per-pixel motion vectors.

Complexity reduction techniques reduce the number of layers, channels, and/or bit-precision for improved speed, but the quality of video super-resolution is also sacrificed. Notably, a super-resolution network needs to be "deep" enough (i.e., maintain at least a minimum number of network layers, channels, and bits of precision) to show noticeable quality improvement over traditional methods such as bicubic or Lanczos interpolation. The requirement of a deep network for improved upsampling quality limits how much complexity can be reduced. The same issues persist for low rank approximation techniques. Finally, for temporal-based neural networks, the motion vectors between two frames are computed by another computationally and memory expensive network, so the computation saving is very limited.

It may therefore be advantageous to provide deep learning based super-resolution for video that improves super-resolution quality and/or provides acceleration with reduced computational complexity and memory cost. It is with respect to these and other considerations that the present improvements have been needed. Such improvements may become critical as the desire to upscale video becomes more widespread.
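As a side note on the low rank approximation mentioned above, the following is a minimal sketch (not from the patent) using truncated SVD on a fully connected layer's weight matrix; the dimensions, rank, and random weights are illustrative only.

```python
# Illustrative sketch of low rank approximation of a fully connected
# layer: W (m x n) is factored via truncated SVD into A (m x r) and
# B (r x n), cutting the multiply cost from m*n to r*(m + n) when r is
# small. A random matrix is used only to show the mechanics; real
# trained layer weights are typically closer to low-rank, so the
# approximation error would be smaller in practice.
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 256, 512, 32           # output dim, input dim, assumed rank
W = rng.standard_normal((m, n))  # dense layer weights (stand-in)
x = rng.standard_normal(n)       # one input activation vector

# Truncated SVD: keep the r largest singular values.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * s[:r]             # (m, r)
B = Vt[:r, :]                    # (r, n)

y_full = W @ x                   # original layer: m*n multiplies
y_low = A @ (B @ x)              # factored layer: r*(m + n) multiplies

rel_err = np.linalg.norm(y_full - y_low) / np.linalg.norm(y_full)
print(f"relative error at rank {r}: {rel_err:.3f}")
```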
BRIEF DESCRIPTION OF THE DRAWINGS

The material described herein is illustrated by way of example and not by way of limitation in the accompanying figures. For simplicity and clarity of illustration, elements illustrated in the figures are not necessarily drawn to scale. For example, the dimensions of some elements may be exaggerated relative to other elements for clarity. Further, where considered appropriate, reference labels have been repeated among the figures to indicate corresponding or analogous elements. In the figures:

FIG. 1 is an illustrative diagram of an example system for processing via selective application of a deep learning network;
FIG. 2 illustrates example block-wise selective application of a deep learning network for a low quantization parameter frame;
FIG. 3 illustrates example pixel value transfer processing in super-resolution contexts;
FIG. 4 illustrates example frame-wise selective application of a deep learning network for an I-frame;
FIG. 5 illustrates example frame-wise selective application of a deep learning network for a low quantization parameter frame;
FIG. 6 is a flow diagram illustrating an example process 600 for providing adaptive video enhancement processing based on frame level quantization parameter and block level video coding modes;
FIG. 7 is a flow diagram illustrating an example process for providing adaptive enhancement video processing;
FIG. 8 is an illustrative diagram of an example system for providing adaptive enhancement video processing;
FIG. 9 is an illustrative diagram of an example system; and
FIG. 10 illustrates an example device, all arranged in accordance with at least some implementations of the present disclosure.

DETAILED DESCRIPTION

One or more embodiments or implementations are now described with reference to the enclosed figures. While specific configurations and arrangements are discussed, it should be understood that this is done for illustrative purposes only. Persons skilled in the relevant art will recognize that other configurations and arrangements may be employed without departing from the spirit and scope of the description. It will be apparent to those skilled in the relevant art that techniques and/or arrangements described herein may also be employed in a variety of other systems and applications other than what is described herein. While the following description sets forth various implementations that may be manifested in architectures such as system-on-a-chip (SoC) architectures for example, implementation of the techniques a