EP-4738240-A1 - EFFICIENT SUPER-SAMPLING IN VIDEOS USING HISTORICAL INTERMEDIATE FEATURES

Abstract

Systems and methods for providing a high-resolution gaming experience on typical computer systems, including computer systems without high-end d-GPUs. In particular, systems and methods are provided for optimizing deep learning-based super-sampling methods. A hardware-aware optimization technique for super-sampling machine learning networks uses a subset of intermediate outputs of the machine learning model for the previous game frame for convolution operations on the current frame, thereby reducing compute usage and latency without sacrificing quality of the output. The inputs are concatenated and passed through a convolutional neural network (CNN), such as a U-net-based CNN. The output of the CNN is a high-resolution image frame that can be post-processed to generate a final output. The hardware optimization technique can be implemented in a neural network framework that divides the machine learning inference across available compute resources on the computer platform.
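The abstract describes reusing a subset of the previous frame's intermediate convolution outputs in place of recomputing them for the current frame. A minimal numpy sketch of that idea is below; the `conv2d` helper and `first_layer_features` function are hypothetical illustrations, not the patented implementation, and the naive convolution stands in for an optimized kernel.

```python
import numpy as np

def conv2d(x, kernel):
    # Naive single-channel "same"-padded 2D convolution (illustration only).
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def first_layer_features(frame, kernels, cached_prev, k):
    """Compute feature maps 1..k fresh for the current frame and reuse cached
    maps k+1..n from the previous frame (hypothetical helper, not from the
    patent text)."""
    fresh = [conv2d(frame, kernels[i]) for i in range(k)]  # computed now
    reused = cached_prev[k:]                               # carried over from t-1
    return np.stack(fresh + list(reused))                  # n maps total
```

With `n` kernels and only `k` convolutions executed per frame, the per-frame compute for this layer drops roughly by the factor `k/n`, which is the latency saving the abstract points to.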

Inventors

  • SAHA, Tanujay

Assignees

  • INTEL Corporation

Dates

Publication Date
2026-05-06
Application Date
2025-07-30

Claims (15)

  1. A method comprising: receiving, at an input channel, input video including a current image frame and a previous image frame; performing, at a first convolution layer, a first set of convolution operations on the current image frame to generate a first subset of intermediate convolution outputs; generating a first set of intermediate convolution outputs, including the first subset of intermediate convolution outputs and a previous subset of intermediate convolution outputs from the previous image frame; performing, at a second convolution layer, a second set of convolution operations on the first set of intermediate convolution outputs; and outputting a high-resolution image frame.
  2. The method of claim 1, wherein generating the first set of intermediate convolution outputs includes concatenating the previous subset of intermediate convolution outputs to the first subset of intermediate convolution outputs.
  3. The method of claim 2, wherein the first subset of intermediate convolution outputs includes outputs 1 through k for the current frame, and the previous subset of intermediate convolution outputs includes outputs k + 1 through n for the previous frame.
  4. The method of any of claims 1-3, wherein outputting the high-resolution image frame includes outputting a super-sampled image frame.
  5. The method of any of claims 1-4, further comprising accessing, at the second convolution layer, via a skip connection, information from the first convolution layer.
  6. The method of any of claims 1-5, wherein the first subset of intermediate convolution outputs is a current first subset of intermediate convolution outputs, wherein the previous subset of intermediate convolution outputs from the previous image frame is a previous first subset of intermediate convolution outputs, wherein performing the second set of convolution operations includes generating a current second subset of intermediate convolution outputs, and further comprising: generating a second set of intermediate convolution outputs, including the current second subset of intermediate convolution outputs and a previous second subset of intermediate convolution outputs from the previous image frame.
  7. The method of any of claims 1-6, further comprising dividing the input channel into a plurality of sections and stacking the sections in parallel to expand a spatial perceptual field of the first and second convolution layers.
  8. The method of any of claims 1-7, wherein performing the first set of convolution operations includes encoding the current image frame at an encoding layer.
  9. The method of any of claims 1-8, further comprising performing preprocessing on the current image frame and the previous image frame and generating preprocessed image frame data, and wherein performing the first set of convolution operations includes performing the first set of convolution operations on the preprocessed image frame data.
  10. One or more non-transitory computer-readable media storing instructions executable to perform operations, the operations comprising: receiving, at an input channel, input video including a current image frame and a previous image frame; performing, at a first convolution layer, a first set of convolution operations on the current image frame to generate a first subset of intermediate convolution outputs; generating a first set of intermediate convolution outputs, including the first subset of intermediate convolution outputs and a previous subset of intermediate convolution outputs from the previous image frame; performing, at a second convolution layer, a second set of convolution operations on the first set of intermediate convolution outputs; and outputting a high-resolution image frame.
  11. The one or more non-transitory computer-readable media of claim 10, wherein generating the first set of intermediate convolution outputs includes concatenating the previous subset of intermediate convolution outputs to the first subset of intermediate convolution outputs.
  12. The one or more non-transitory computer-readable media of claim 11, wherein the first subset of intermediate convolution outputs includes outputs 1 through k for the current frame, and the previous subset of intermediate convolution outputs includes outputs k + 1 through n for the previous frame.
  13. An apparatus, comprising: a computer processor for executing computer program instructions; and a non-transitory computer-readable memory storing computer program instructions executable by the computer processor to perform operations comprising: receiving, at an input channel, input video including a current image frame and a previous image frame; performing, at a first convolution layer, a first set of convolution operations on the current image frame to generate a first subset of intermediate convolution outputs; generating a first set of intermediate convolution outputs, including the first subset of intermediate convolution outputs and a previous subset of intermediate convolution outputs from the previous image frame; performing, at a second convolution layer, a second set of convolution operations on the first set of intermediate convolution outputs; and outputting a high-resolution image frame.
  14. The apparatus of claim 13, wherein generating the first set of intermediate convolution outputs includes concatenating the previous subset of intermediate convolution outputs to the first subset of intermediate convolution outputs.
  15. The apparatus of any of claims 13-14, wherein the first subset of intermediate convolution outputs is a current first subset of intermediate convolution outputs, wherein the previous subset of intermediate convolution outputs from the previous image frame is a previous first subset of intermediate convolution outputs, wherein performing the second set of convolution operations includes generating a current second subset of intermediate convolution outputs, and the operations further comprising: generating a second set of intermediate convolution outputs, including the current second subset of intermediate convolution outputs and a previous second subset of intermediate convolution outputs from the previous image frame.
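The method of claim 1 can be read as a four-step pipeline: receive two frames, run a partial first convolution layer, merge in cached outputs, run a second layer, and emit a high-resolution frame. The sketch below walks those steps in order; the function names, the naive convolution, and the nearest-neighbor upsampling are hypothetical stand-ins for the network's learned operations, not the claimed implementation.

```python
import numpy as np

def conv_same(x, kernel):
    # Naive single-channel "same"-padded 2D convolution (illustration only).
    kh, kw = kernel.shape
    xp = np.pad(x, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    return np.array([[np.sum(xp[i:i + kh, j:j + kw] * kernel)
                      for j in range(x.shape[1])]
                     for i in range(x.shape[0])])

def super_sample_step(current, prev_cache, layer1_kernels, layer2_kernel, scale=2):
    # Step 1 (first convolution layer): compute maps 1..k for the current
    # frame only.
    fresh = [conv_same(current, ker) for ker in layer1_kernels]
    # Step 2: form the full set of n maps by appending cached maps k+1..n
    # carried over from the previous frame.
    feature_set = np.stack(fresh + list(prev_cache))
    # Step 3 (second convolution layer): convolve over the combined set.
    fused = conv_same(feature_set.mean(axis=0), layer2_kernel)
    # Step 4: output a higher-resolution frame; nearest-neighbor upsampling
    # stands in for the network's learned super-sampling.
    return np.repeat(np.repeat(fused, scale, axis=0), scale, axis=1)
```

How the cache of maps k+1..n is refreshed from frame to frame is not specified in this excerpt, so the sketch simply accepts it as an input.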

Description

Technical Field

This disclosure relates generally to signal processing, and more specifically, to image signal processing and artificial intelligence processing.

Background

The last decade has witnessed a rapid rise in AI (artificial intelligence) based data processing. Gaming platforms use artificial intelligence to provide real-time high-definition game frame renderings. High-end discrete graphics processing units (d-GPUs) are used to provide the high-definition renderings due to the high compute power utilized by the algorithms. However, many computers do not include high-end d-GPUs, thus restricting access to high-definition renderings and the high-end gaming experience to users with selected computers.

Brief Description of the Drawings

Embodiments will be readily understood by the following detailed description in conjunction with the accompanying drawings. To facilitate this description, like reference numerals designate like structural elements. Embodiments are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings.

FIG. 1 is a block diagram of an example deep learning system, in accordance with various embodiments.
FIG. 2 illustrates an example overview of a convolution pipeline that includes replacing convolutions on a current frame with outputs from a previous frame, in accordance with various embodiments.
FIG. 3 illustrates an example of a super-sampling pipeline, in accordance with various embodiments.
FIG. 4 illustrates an example DNN, in accordance with various embodiments.
FIG. 5 is a block diagram illustrating an example of a neural network architecture that can perform efficient super-sampling methods using intermediate outputs for a previous image frame, in accordance with various embodiments.
FIG. 6 is a flowchart showing a method of super-sampling in videos using previous intermediate features, in accordance with various embodiments.
FIG. 7 is a block diagram of an example computing device, in accordance with various embodiments.

Detailed Description

Overview

Systems and methods are presented herein for providing a high-resolution gaming experience on typical computer systems, including computer systems without high-end d-GPUs. In particular, systems and methods are provided for optimizing deep learning-based super-sampling methods. A hardware-aware optimization technique for super-sampling machine learning networks uses intermediate outputs of the machine learning model for the previous game frame for convolution operations on the current frame. The intermediate outputs can be substituted for convolution operations on the current frame, thereby reducing compute usage and latency without sacrificing quality of the output. The hardware optimization technique can be implemented in a neural network framework that divides the machine learning inference across available compute resources on the computer platform, including, for example, the central processing unit (CPU), the integrated graphics processing unit (iGPU), and the integrated neural processing unit (iNPU), in a system-on-chip (SoC) platform. By spreading the machine learning inference across various compute resources, the iGPU can have bandwidth to compute other game rendering tasks.

Traditional gaming platforms include high-end d-GPUs that use artificial intelligence to provide real-time high-definition game frame renderings. High-end discrete graphics processing units (d-GPUs) are used to provide the high-definition renderings due to the high compute power utilized by the algorithms. Additionally, traditional machine-learning-based super-sampling techniques use substantial computational resources that are not suitable for real-time application on computing devices that do not include high-end d-GPUs, such as gaming laptops.
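The overview above describes dividing the machine learning inference across the CPU, iGPU, and iNPU of an SoC. The excerpt does not specify a partitioning algorithm, so the following is only a minimal illustrative sketch: a greedy least-loaded assignment of per-layer compute costs to named devices, with all names and costs hypothetical.

```python
def partition_layers(layer_costs, devices):
    """Greedy split of per-layer compute cost across devices (illustrative
    only; not the scheduling method of the patent)."""
    assignment = {d: [] for d in devices}
    load = {d: 0.0 for d in devices}
    # Place the most expensive layers first, each on the least-loaded device.
    for idx, cost in sorted(enumerate(layer_costs), key=lambda t: -t[1]):
        target = min(devices, key=lambda d: load[d])
        assignment[target].append(idx)
        load[target] += cost
    return assignment, load
```

A balanced split like this is one way the iGPU could retain bandwidth for other game rendering tasks, as the overview suggests.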
However, many computers do not include high-end d-GPUs, thus restricting access to high-definition renderings and the high-end gaming experience to users with selected computers.

According to various implementations, systems and methods are provided for decreasing compute resources used for machine-learning-based super-sampling. In general, super-sampling takes as input the current frame rendered at a low resolution by the GPU, the high-resolution output of the previous frame, and the motion vectors for the current frame. Systems and methods are provided herein to use as input the current frame rendered at a low resolution by the GPU, the previous frame rendered at a low resolution by the GPU, the motion vectors for the current frame, and intermediate convolution outputs from processing of the previous frame. The inputs are concatenated and passed through a convolutional neural network (CNN), such as a U-net-based CNN. The output of the CNN is a high-resolution image frame that can be post-processed to generate a final output.

According to various examples, high-end frame generation technologies consume most of the compute resources in SoC systems, and the systems and