US-12620215-B1 - Systems and methods for deep learning-based image and video modifications
Abstract
Systems and methods for deep learning-based image and video modifications are provided. Particularly, a combination of two different neural networks may be used to perform a modification to existing image and/or video content (for example, increasing the resolution of an older video). The original image and/or video content may be provided to the first neural network, which may extract information about the image and/or frames of the video content. This information may then be provided to a second neural network, which may use the information to produce the modified image and/or video content.
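For illustration only, the two-network arrangement described in this abstract might be sketched as follows. This is a minimal sketch under assumed details: the module names (FrameAnalyzer, FrameModifier), the layer choices, and the embedding size are hypothetical, as the patent does not prescribe specific architectures.

```python
# Minimal sketch of the two-network pipeline (hypothetical names/architectures;
# the patent does not prescribe specific layers or dimensions).
import torch
import torch.nn as nn

class FrameAnalyzer(nn.Module):
    """First network: condenses a frame into an information embedding."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, embed_dim)

    def forward(self, frame):                 # frame: (N, 3, H, W)
        x = self.features(frame).flatten(1)   # (N, 32)
        return self.head(x)                   # (N, embed_dim)

class FrameModifier(nn.Module):
    """Second network: upscales a frame, conditioned on the embedding."""
    def __init__(self, embed_dim=64, scale=2):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3 + embed_dim, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3 * scale * scale, 3, padding=1),
            nn.PixelShuffle(scale),            # sub-pixel upscaling
        )

    def forward(self, frame, embedding):
        n, _, h, w = frame.shape
        # Broadcast the per-frame embedding across spatial positions.
        cond = embedding[:, :, None, None].expand(n, -1, h, w)
        return self.body(torch.cat([frame, cond], dim=1))

analyzer, modifier = FrameAnalyzer(), FrameModifier()
frame = torch.rand(1, 3, 270, 480)            # a low-resolution frame
upscaled = modifier(frame, analyzer(frame))   # (1, 3, 540, 960)
```

Here the first network extracts an information embedding from each frame, and the second network consumes both the frame and that embedding to produce the modified output, mirroring the flow described in the abstract.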
Inventors
- Oliver Dayun LIU
- Wenbin OUYANG
Assignees
- AMAZON TECHNOLOGIES, INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-03-23
Claims (20)
- 1 . A method comprising: receiving, by a first neural network, a sequence of frames of a video, the sequence of frames including at least a first frame and a second frame, wherein the first frame and the second frame are initially rendered at a lower resolution; determining, by the first neural network, a first video information embedding for the first frame, wherein the first video information embedding includes an indication of at least one of: that a resolution modification is to be performed to the first frame or a modified resolution for the first frame; determining, by the first neural network, a second video information embedding for the second frame, wherein the second video information embedding includes an indication that the second frame includes an intentional effect including at least one of: intentional blurriness, varied level of transparency, an intended lighting artifact, or a particle effect; outputting, by the first neural network and to a second neural network, one or more vectors including the first video information embedding and the second video information embedding; receiving, by the second neural network, the one or more vectors; determining, by the second neural network and based on the first video information embedding for the first frame and the second video information embedding for the second frame, that the first frame is to be upscaled to an increased resolution and the second frame is to remain at a same resolution to prevent the intentional effect from being diminished or removed from the second frame by the upscaling; modifying, by the second neural network, the first frame to produce a modified first frame at the increased resolution; and outputting, by the second neural network, the modified first frame and the second frame.
- 2 . The method of claim 1 , further comprising: receiving, by a first loss function, the one or more vectors; determining a comparison between the first video information embedding and the second video information embedding and ground truth data; and training the first neural network based on the comparison.
- 3 . The method of claim 1 , further comprising: receiving, by a second loss function, the modified first frame; outputting, by the second loss function, an indication of a likelihood that the modified first frame was produced by the second neural network; and training the second neural network based on the indication.
- 4 . The method of claim 1 , further comprising: receiving, by a third loss function, the modified first frame; determining a second comparison between the modified first frame and ground truth image data; and training the second neural network based on the second comparison.
- 5 . A method comprising: receiving, by a first neural network, first image data and second image data, wherein the first image data is associated with a first video frame and the second image data is associated with a second video frame; determining, by the first neural network, a first classification for the first image data and a second classification for the second image data, wherein the second classification indicates that the second video frame includes an intended effect; receiving, by a second neural network, the first image data, the second image data, the first classification for the first image data, and the second classification for the second image data; determining that modifying the second video frame would remove or diminish the intended effect; determining, by the second neural network and based on the first classification for the first image data and the second classification for the second image data, a first type of modification to perform to first image data instead of the second image data based on the determination that modifying the second video frame would remove or diminish the intended effect; and outputting, by the second neural network, the second image data and third image data including the first type of modification.
- 6 . The method of claim 5 , wherein the first type of modification includes an increase in a resolution of the first image data.
- 7 . The method of claim 5 , wherein the first image data is a first frame of a video and the second image data is a second frame of a video.
- 8 . The method of claim 5 , further comprising: receiving, by the first neural network, third image data; determining, by the first neural network, a third classification for the third image data; receiving, by the second neural network, the third image data and the third classification for the third image data; determining, by the second neural network and based on the third classification for the third image data, a second type of modification to perform to third image data; and outputting, by the second neural network, fourth image data including the second type of modification, wherein the first type of modification is different than the second type of modification.
- 9 . The method of claim 5 , wherein the first classification includes an indication of at least one of: that a resolution modification is to be performed to the first image data, a modified resolution for the first image data, that the first image data includes intentional blurriness, that the first image data includes varied level of transparency, that the first image data and/or the second image data includes an intended lighting artifact, and that the first image data includes a particle effect.
- 10 . The method of claim 5 , further comprising: receiving, by a first loss function, the first classification and the second classification; determining a comparison between the first classification and the second classification and ground truth data; and training the first neural network based on the comparison.
- 11 . The method of claim 5 , further comprising: receiving, by a second loss function, the third image data; outputting, by the second loss function, an indication of a likelihood that the third image data was produced by the second neural network; and training the second neural network based on the indication.
- 12 . The method of claim 5 , further comprising: receiving, by a third loss function, the third image data; determining a second comparison between the third image data and ground truth image data; and training the second neural network based on the second comparison.
- 13 . A system comprising: memory that stores computer-executable instructions; and one or more processors configured to access the memory and execute the computer-executable instructions to: receive, by a first neural network, first image data and second image data, wherein the first image data is associated with a first video frame and the second image data is associated with a second video frame; determine, by the first neural network, a first classification for the first image data and a second classification for the second image data, wherein the second classification indicates that the second video frame includes an intended effect; determine that modifying the second video frame would remove or diminish the intended effect; receive, by a second neural network, the first image data, the second image data, the first classification for the first image data, and the second classification for the second image data; determine, by the second neural network and based on the first classification for the first image data and the second classification for the second image data, a first type of modification to perform to first image data instead of the second image data based on the determination that modifying the second video frame would remove or diminish the intended effect; and output, by the second neural network, the second image data and third image data including the first type of modification.
- 14 . The system of claim 13 , wherein the first type of modification includes an increase in a resolution of the first image data.
- 15 . The system of claim 13 , wherein the first image data is a first frame of a video and the second image data is a second frame of a video.
- 16 . The system of claim 13 , wherein the one or more processors are further configured to execute the computer-executable instructions to: receive, by the first neural network, third image data; determine, by the first neural network, a third classification for the third image data; receive, by the second neural network, the third image data and the third classification for the third image data; determine, by the second neural network and based on the third classification for the third image data, a second type of modification to perform to third image data; and output, by the second neural network, fourth image data including the second type of modification, wherein the first type of modification is different than the second type of modification.
- 17 . The system of claim 13 , wherein the first classification includes an indication of at least one of: that a resolution modification is to be performed to the first image data, a modified resolution for the first image data, that the first image data includes intentional blurriness, that the first image data includes varied level of transparency, and that the first image data includes a particle effect.
- 18 . The system of claim 13 , wherein the one or more processors are further configured to execute the computer-executable instructions to: receive, by a first loss function, the first classification and the second classification; determine a comparison between the first classification and the second classification and ground truth data; and train the first neural network based on the comparison.
- 19 . The system of claim 13 , wherein the one or more processors are further configured to execute the computer-executable instructions to: receive, by a second loss function, the third image data; output, by the second loss function, an indication of a likelihood that the third image data was produced by the second neural network; and train the second neural network based on the indication.
- 20 . The system of claim 13 , wherein the one or more processors are further configured to execute the computer-executable instructions to: receive, by a third loss function, the third image data; determine a second comparison between the third image data and ground truth image data; and train the second neural network based on the second comparison.
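Claims 2-4 above (and their counterparts, claims 10-12 and 18-20) recite three training signals: a comparison of the first network's outputs against ground-truth data, an indication of a likelihood that a frame was produced by the second network, and a comparison of the modified frame against ground-truth image data. A minimal sketch of plausible instantiations follows; the specific loss formulations (cross-entropy, a GAN-style discriminator, pixel-wise L1) are assumptions and are not recited in the claims.

```python
# Sketch of the three losses recited in the claims (hypothetical shapes and
# criteria; the patent does not name specific loss formulations).
import torch
import torch.nn as nn

# 1) First loss: compare the first network's classifications/embeddings
#    against ground-truth data (claims 2, 10, 18). Cross-entropy is one
#    common choice for a classification target.
classification_loss = nn.CrossEntropyLoss()

# 2) Second loss: a discriminator-style score of the likelihood that a frame
#    was produced by the second network (claims 3, 11, 19), as in a GAN.
class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1),
        )
    def forward(self, frame):
        return self.net(frame)   # raw logit; a sigmoid yields a likelihood

adversarial_loss = nn.BCEWithLogitsLoss()

# 3) Third loss: compare the modified frame against ground-truth image data
#    (claims 4, 12, 20). A pixel-wise L1 loss is a common choice.
reconstruction_loss = nn.L1Loss()

# Example usage with random stand-in tensors:
logits = torch.randn(4, 5)                  # first network's class scores
labels = torch.randint(0, 5, (4,))          # ground-truth classifications
loss1 = classification_loss(logits, labels)

disc = Discriminator()
fake = torch.rand(4, 3, 64, 64)             # frames from the second network
loss2 = adversarial_loss(disc(fake), torch.ones(4, 1))  # generator objective

target = torch.rand(4, 3, 64, 64)           # ground-truth higher-res render
loss3 = reconstruction_loss(fake, target)
```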
Description
BACKGROUND

In the film industry, visual effects have undergone a drastic change as a result of advancements in computer vision and rendering algorithms. Computer-generated imagery (CGI) has become increasingly realistic over the years. A major roadblock currently associated with CGI, however, is rendering time. As the visual effects used in movies have become increasingly complex, the time required to render them has also increased significantly. It is not uncommon for the rendering of visual effects shots to take days to complete. This is even more problematic for movies that are rendered at higher resolutions, such as 4k. For example, a visual effects scene at a 4k resolution may typically take 20 times as long to render as the same scene at a 2k resolution, while consuming substantial power and hardware resources in the process. One partial solution to this problem is to leverage advances in the area of super-resolution. For example, scenes may be rendered at a lower resolution (for example, 2k) and then upscaled to a greater resolution (for example, 4k). This upscaling process may save rendering time while maintaining the same level of quality as if the content had originally been rendered in 4k. A non-learned baseline for this kind of upscaling is sketched below, after the list of drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The detailed description is set forth with reference to the accompanying drawings. The drawings are provided for purposes of illustration only and merely depict example embodiments of the disclosure. The drawings are provided to facilitate understanding of the disclosure and shall not be deemed to limit the breadth, scope, or applicability of the disclosure. In the drawings, the left-most digit(s) of a reference numeral may identify the drawing in which the reference numeral first appears. The use of the same reference numerals indicates similar, but not necessarily the same or identical, components. However, different reference numerals may be used to identify similar components as well. Various embodiments may utilize elements or components other than those illustrated in the drawings, and some elements and/or components may not be present in various embodiments. The use of singular terminology to describe a component or element may, depending on the context, encompass a plural number of such components or elements and vice versa.

FIG. 1 is an example use case for deep learning-based image and video modifications in accordance with one or more example embodiments of the disclosure.

FIG. 2 is an example method for deep learning-based image and video modifications in accordance with one or more example embodiments of the disclosure.

FIG. 3 is a flow diagram illustrating example operations associated with a first neural network in accordance with one or more example embodiments of the disclosure.

FIG. 4 is a flow diagram illustrating example operations associated with a second neural network in accordance with one or more example embodiments of the disclosure.

FIGS. 5A-5B are flow diagrams illustrating example operations associated with a combination of the first neural network of FIG. 3 and the second neural network of FIG. 4 in accordance with one or more example embodiments of the disclosure.

FIG. 6 is an example system for deep learning-based image and video modifications in accordance with one or more example embodiments of the disclosure.

FIG. 7 is an example computing device in accordance with one or more example embodiments of the disclosure.
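As a point of reference for the super-resolution approach described in the Background, a basic non-learned 2k-to-4k upscale might look like the following sketch. The file paths are placeholders, and a learned super-resolution model would replace the interpolation step.

```python
# Baseline (non-learned) 2k -> 4k upscale using OpenCV's cubic interpolation.
# File paths are placeholders; a learned super-resolution model would replace
# the cv2.resize call here.
import cv2

frame_2k = cv2.imread("frame_2048x1080.png")           # a 2k DCI frame
h, w = frame_2k.shape[:2]
frame_4k = cv2.resize(frame_2k, (w * 2, h * 2),
                      interpolation=cv2.INTER_CUBIC)   # 4096x2160
cv2.imwrite("frame_4096x2160.png", frame_4k)
```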
DETAILED DESCRIPTION

This disclosure relates to, among other things, systems and methods for deep learning-based image and video modifications. Particularly, the systems and methods may involve the use of two neural networks to produce modified image and/or video content. As a non-limiting example, a sequence of frames associated with a movie that was originally rendered at a lower resolution may be provided as an input to one or more neural networks, and the output of the one or more neural networks may be an upscaled version of the movie at a higher resolution.

In addition to performing modifications to the image and/or video content, the systems and methods described herein are advantageous over conventional methods for modifying image and/or video content because they selectively perform modifications, or different types of modifications, to certain portions of the images and/or video content based on characteristics of those portions. As an example, it may not be desirable for all frames of a piece of video content to be modified to the same increased resolution. For example, some scenes in movies contain motion blur resulting from rapid character motions, and some scenes are intentionally blurry to depict the dazed vision of a character waking up from sleep. For frames with these characteristics, increasing the resolution may only cause the frames to become overly sharp and remove the intended effect of the blurring in the frames.
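The selective behavior described above (upscaling ordinary frames while leaving frames with intended effects untouched) can be sketched as a simple gate over per-frame classifications. The label set below is assembled from the effects listed in the claims; the exact labels and the classify/upscale callables are hypothetical stand-ins for the two networks.

```python
# Sketch of the selective-modification gate described above. The label set
# and the classify/upscale callables are hypothetical; the patent does not
# prescribe specific class names.
INTENTIONAL_EFFECTS = {"intentional_blur", "transparency",
                       "lighting_artifact", "particle_effect"}

def process_sequence(frames, classify, upscale):
    """Upscale only frames whose classification carries no intended effect.

    classify(frame) -> set of predicted effect labels for the frame
    upscale(frame)  -> the frame at an increased resolution
    """
    output = []
    for frame in frames:
        labels = classify(frame)
        if labels & INTENTIONAL_EFFECTS:
            output.append(frame)           # preserve the intended effect
        else:
            output.append(upscale(frame))  # safe to increase resolution
    return output
```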