US-12621502-B2 - Machine learning model-based video compression
Abstract
A system's processing hardware executes a machine learning (ML) model-based video compression encoder to receive uncompressed video content and corresponding motion compensated video content, compare the uncompressed and motion compensated video content to identify an image space residual, transform the image space residual to a latent space representation of the image space residual, and transform, using a trained image compression ML model, the motion compensated video content to a latent space representation of the motion compensated video content. The ML model-based video compression encoder further encodes the latent space representation of the image space residual to produce an encoded latent residual, encodes, using the trained image compression ML model, the latent space representation of the motion compensated video content to produce an encoded latent video content, and generates, using the encoded latent residual and the encoded latent video content, a compressed video content corresponding to the uncompressed video content.
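As a concrete illustration of the encoder pipeline summarized in the abstract, the following PyTorch-style sketch traces the same steps. It is a minimal sketch under stated assumptions, not the patented implementation: the module names (`MLVideoEncoder`, `residual_encoder`, `image_model`) are hypothetical, and rounding stands in for the quantization and entropy coding an actual codec would perform.

```python
import torch
import torch.nn as nn

class MLVideoEncoder(nn.Module):
    """Hypothetical sketch of the encoder pipeline described in the abstract."""

    def __init__(self, image_model: nn.Module, residual_encoder: nn.Module):
        super().__init__()
        self.image_model = image_model            # trained image compression ML model
        self.residual_encoder = residual_encoder  # maps image-space residual to latent space

    def forward(self, frame: torch.Tensor, motion_compensated: torch.Tensor):
        # Compare uncompressed and motion compensated content to identify
        # the image space residual.
        residual = frame - motion_compensated
        # Transform the residual to its latent space representation and encode it
        # (rounding here stands in for quantization plus entropy coding).
        encoded_latent_residual = torch.round(self.residual_encoder(residual))
        # Transform the motion compensated content to its latent representation
        # using the trained image compression model, then encode it likewise.
        encoded_latent_video = torch.round(self.image_model(motion_compensated))
        # The compressed video content is generated from both encoded latents.
        return encoded_latent_residual, encoded_latent_video
```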
Inventors
- Abdelaziz DJELOUAH
- Leonhard Markus Helminger
- Roberto Gerson de Albuquerque Azevedo
- Scott Labrozzi
- Christopher Richard Schroers
- Yuanyi XUE
Assignees
- DISNEY ENTERPRISES, INC.
- ETH Zürich (Eidgenössische Technische Hochschule Zürich)
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2024-09-13
Claims (20)
- 1 . A system comprising: a computing platform including a processing hardware and a system memory storing a knowledge distilled neural decoder including a first neural network (NN) corresponding to a trained image compression machine learning (ML) model, and a second NN functioning as a residual decoder neural network; the processing hardware configured to execute the knowledge distilled neural decoder to: receive a bitstream including (i) encoded latent video content corresponding to a motion compensated version of a frame of video, and (ii) encoded latent residual information representing a residual corresponding to the frame and the motion compensated version; obtain, using the first NN, one or more features derived from the encoded latent video content; decode, using the second NN, the encoded latent residual information to obtain one or more residual features in at least one of an image space or a latent space; merge the one or more features derived from the encoded latent video content with the one or more residual features to generate a reconstructed version of the frame; and train parameters of the second NN using a knowledge distillation process in which the parameters of the first NN are held fixed.
- 2 . The computing platform of claim 1 , wherein when training the parameters of the second NN using the knowledge distillation process, the second NN is tuned using a loss function that enforces consistency between the reconstructed version of the frame and an image produced for the frame by the trained image compression ML model.
- 3 . The computing platform of claim 2 , wherein when training the parameters of the second NN using the knowledge distillation process, the second NN is constrained to include fewer trainable parameters than a decoder of the trained image compression ML model.
- 4 . The computing platform of claim 3 , wherein training includes replacing one or more neural network blocks of the decoder of the trained image compression ML model with smaller neural network blocks having at least one of fewer channels or fewer parameters to form the second NN.
- 5 . The computing platform of claim 1 , wherein merging the features derived from the encoded latent video content with the residual features comprises applying a mask having binary values indicating whether to select the one or more features derived from the encoded latent video content or the one or more residual features.
- 6 . The computing platform of claim 1 , wherein the knowledge distilled neural decoder further comprises a temporal smoothing NN, and wherein the processing hardware is further configured to execute the knowledge distilled neural decoder to: receive a previously decoded frame of video and a motion vector field associated with the previously decoded frame; warp the previously decoded frame using the motion vector field to obtain a warped frame; and apply the temporal smoothing NN to the reconstructed version of the frame and the warped frame to provide a temporally processed version of the reconstructed frame having reduced temporal artifacts.
- 7 . The computing platform of claim 6 , wherein the processing hardware is further configured to execute the knowledge distilled neural decoder to: train the temporal smoothing NN using a loss function including a temporal distortion term defined over regions in which the motion vector field is valid.
- 8 . The computing platform of claim 7 , wherein the loss function further includes an adversarial term that penalizes temporal inconsistencies between consecutive reconstructed frames of video.
- 9 . The computing platform of claim 1 , wherein the trained image compression ML model comprises an NN trained adversarially with a discriminator neural network using a rate-distortion adversarial loss.
- 10 . The computing platform of claim 9 , wherein the rate-distortion adversarial loss includes (I) a rate term corresponding to a bitstream length required to encode a quantized latent representation of a video frame, (II) a distortion term corresponding to a distortion between a reconstructed version of the video frame and a ground-truth version of the video frame, and (III) an adversarial term based on outputs of the discriminator neural network applied to the reconstructed version of the video frame and the ground-truth version of the video frame.
- 11 . A method for use by a system having a computing platform including a processing hardware and a system memory storing a knowledge distilled neural decoder including a first neural network (NN) corresponding to a trained image compression machine learning (ML) model, and a second NN functioning as a residual decoder neural network, the method comprising: receiving, by the processing hardware executing the knowledge distilled neural decoder, a bitstream including (i) encoded latent video content corresponding to a motion compensated version of a frame of video, and (ii) encoded latent residual information representing a residual corresponding to the frame and the motion compensated version; obtaining, by the processing hardware executing the knowledge distilled neural decoder and using the first NN, one or more features derived from the encoded latent video content; decoding, by the processing hardware executing the knowledge distilled neural decoder and using the second NN, the encoded latent residual information to obtain one or more residual features in at least one of an image space or a latent space; merging, by the processing hardware executing the knowledge distilled neural decoder, the one or more features derived from the encoded latent video content with the one or more residual features to generate a reconstructed version of the frame; and training, by the processing hardware executing the knowledge distilled neural decoder, parameters of the second NN using a knowledge distillation process in which the parameters of the first NN are held fixed.
- 12 . The method of claim 11 , wherein when training the parameters of the second NN using the knowledge distillation process, the second NN is tuned using a loss function that enforces consistency between the reconstructed version of the frame and an image produced for the frame by the trained image compression ML model.
- 13 . The method of claim 12 , wherein when training the parameters of the second NN using the knowledge distillation process, the second NN is constrained to include fewer trainable parameters than a decoder of the trained image compression ML model.
- 14 . The method of claim 13 , wherein training includes replacing one or more neural network blocks of the decoder of the trained image compression ML model with smaller neural network blocks having at least one of fewer channels or fewer parameters to form the second NN.
- 15 . The method of claim 11 , wherein merging the features derived from the encoded latent video content with the residual features comprises applying a mask having binary values indicating whether to select the one or more features derived from the encoded latent video content or the one or more residual features.
- 16 . The method of claim 11 , wherein the knowledge distilled neural decoder further comprises a temporal smoothing NN, the method further comprising: receiving, by the processing hardware executing the knowledge distilled neural decoder, a previously decoded frame of video and a motion vector field associated with the previously decoded frame; warping, by the processing hardware executing the knowledge distilled neural decoder, the previously decoded frame using the motion vector field to obtain a warped frame; and applying, by the processing hardware executing the knowledge distilled neural decoder, the temporal smoothing NN to the reconstructed version of the frame and the warped frame to provide a temporally processed version of the reconstructed frame having reduced temporal artifacts.
- 17 . The method of claim 16 , further comprising: training, by the processing hardware executing the knowledge distilled neural decoder, the temporal smoothing NN using a loss function including a temporal distortion term defined over regions in which the motion vector field is valid.
- 18 . The method of claim 17 , wherein the loss function further includes an adversarial term that penalizes temporal inconsistencies between consecutive reconstructed frames of video.
- 19 . The method of claim 11 , wherein the trained image compression ML model comprises an NN trained adversarially with a discriminator neural network using a rate-distortion adversarial loss.
- 20 . The method of claim 19 , wherein the rate-distortion adversarial loss includes (I) a rate term corresponding to a bitstream length required to encode a quantized latent representation of a video frame, (II) a distortion term corresponding to a distortion between a reconstructed version of the video frame and a ground-truth version of the video frame, and (III) an adversarial term based on outputs of the discriminator neural network applied to the reconstructed version of the video frame and the ground-truth version of the video frame.
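To make the claim language above concrete, the sketches that follow illustrate several of the recited mechanisms. First, claims 1, 2, 5, 11, 12, and 15: one knowledge distillation training step in which a frozen first NN supplies video-derived features, a second NN decodes the latent residual, and a binary mask merges the two. This is a hedged sketch, not the claimed method itself: the helper name `distillation_step` is invented, both feature tensors are assumed to share a shape, `video_feats` is assumed to have been computed under `torch.no_grad()` from the frozen first NN, and mean squared error stands in for the unspecified consistency loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_step(video_feats: torch.Tensor,      # features from the frozen first NN
                      student_decoder: nn.Module,      # second NN (residual decoder) in training
                      encoded_residual: torch.Tensor,  # encoded latent residual information
                      mask: torch.Tensor,              # binary mask per claims 5 and 15
                      teacher_image: torch.Tensor,     # image produced for the frame by the
                                                       # trained image compression ML model
                      optimizer: torch.optim.Optimizer) -> float:
    # Decode the encoded latent residual information with the second NN.
    residual_feats = student_decoder(encoded_residual)
    # Merge: the binary mask selects video-derived features where it is 1
    # and residual features where it is 0.
    reconstruction = torch.where(mask.bool(), video_feats, residual_feats)
    # Consistency loss between the reconstruction and the teacher's image;
    # only the student receives gradients, so the first NN's parameters stay fixed.
    loss = F.mse_loss(reconstruction, teacher_image)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```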
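Next, claims 6 and 16 warp a previously decoded frame using a motion vector field before temporal smoothing. The sketch below implements standard backward warping with `torch.nn.functional.grid_sample`; it assumes a dense flow field in pixel units with channel order (dx, dy), which the claims do not specify.

```python
import torch
import torch.nn.functional as F

def warp(prev_frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Backward-warp a previously decoded frame (N, C, H, W) using a dense
    motion vector field (N, 2, H, W) given in pixel units."""
    _, _, h, w = prev_frame.shape
    # Build the base sampling grid, then displace it by the motion vectors.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(prev_frame.device)  # (H, W, 2), (x, y) order
    coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)               # (N, H, W, 2)
    # Normalize to the [-1, 1] coordinate range expected by grid_sample.
    coords = torch.stack((2.0 * coords[..., 0] / (w - 1) - 1.0,
                          2.0 * coords[..., 1] / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(prev_frame, coords, align_corners=True)

# A temporal smoothing NN would then consume the reconstructed frame together
# with the warped frame, e.g.:
#   smoothed = smoothing_nn(torch.cat([reconstruction, warp(prev, flow)], dim=1))
```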
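Finally, claims 10 and 20 enumerate three loss terms. The composition below is one plausible reading rather than the claimed formulation: the rate term averages an externally estimated bit count, mean squared error stands in for the distortion term, and the adversarial term uses a standard non-saturating generator loss; the weights `lambda_dist` and `lambda_adv` are illustrative.

```python
import torch
import torch.nn.functional as F

def rate_distortion_adversarial_loss(bits: torch.Tensor,        # (I) estimated bitstream length
                                                                 # for the quantized latent
                                     recon: torch.Tensor,        # reconstructed video frame
                                     target: torch.Tensor,       # ground-truth video frame
                                     disc_logits: torch.Tensor,  # discriminator output on recon
                                     lambda_dist: float = 1.0,
                                     lambda_adv: float = 0.01) -> torch.Tensor:
    rate = bits.mean()                          # (I) rate term
    distortion = F.mse_loss(recon, target)      # (II) distortion term (MSE is an assumption)
    # (III) adversarial term: a non-saturating generator loss that rewards
    # reconstructions the discriminator classifies as real.
    adversarial = F.binary_cross_entropy_with_logits(disc_logits,
                                                     torch.ones_like(disc_logits))
    return rate + lambda_dist * distortion + lambda_adv * adversarial
```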
Description
RELATED APPLICATIONS

The present application is a Continuation of U.S. patent application Ser. No. 17/704,692, filed Mar. 25, 2022, which claims the benefit of and priority to Provisional Patent Application Ser. No. 63/172,315, filed Apr. 8, 2021, and titled “Neural Network Based Video Codecs,” and Provisional Patent Application Ser. No. 63/255,280, filed Oct. 13, 2021, and titled “Microdosing For Low Bitrate Video Compression,” which are hereby incorporated fully by reference into the present application.

BACKGROUND

Video content represents the majority of total Internet traffic and is expected to increase even more as the spatial resolution, frame rate, and color depth of videos increase and more users adopt streaming services. Although existing codecs have achieved impressive performance, they have been engineered to the point where adding further small improvements is unlikely to meet future demands. Consequently, exploring fundamentally different ways to perform video coding may advantageously lead to a new class of video codecs with improved performance and flexibility. For example, one advantage of using a trained machine learning (ML) model, such as a neural network (NN) in the form of a generative adversarial network (GAN), for example, to perform video compression is that it enables the ML model to infer visual details that would otherwise be costly, in terms of data transmission, to obtain. However, training ML models such as GANs is typically challenging because the training alternates between minimization and maximization steps to converge to a saddle point of the loss function. The task becomes more challenging when considering the temporal domain and the increased complexity it introduces.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 shows a diagram of an exemplary system for performing machine learning (ML) model-based video compression, according to one implementation;
FIG. 2A shows a diagram of an exemplary ML model-based video codec architecture, according to one implementation;
FIG. 2B shows a diagram of an exemplary ML model-based video codec architecture, according to another implementation;
FIG. 2C shows a diagram of an exemplary ML model-based video codec architecture, according to yet another implementation;
FIG. 3 shows a flowchart outlining an exemplary method for performing ML model-based video compression, according to one implementation; and
FIG. 4 shows a flowchart outlining an exemplary method for performing ML model-based video compression, according to another implementation.

DETAILED DESCRIPTION

The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover, the drawings and illustrations in the present application are generally not to scale, and are not intended to correspond to actual relative dimensions.

As noted above, video content represents the majority of total Internet traffic and is expected to increase even more as the spatial resolution, frame rate, and color depth of videos increase and more users adopt streaming services.
Although existing codecs have achieved impressive performance, they have been engineered to the point where adding further small improvements is unlikely to meet future demands. Consequently, exploring fundamentally different ways to perform video coding may advantageously lead to a new class of video codecs with improved performance and flexibility. For example, and as further noted above, one advantage of using a trained machine learning (ML) model, such as a neural network (NN) in the form of a generative adversarial network (GAN), for example, to perform video compression is that it enables the ML model to infer visual details that would otherwise be costly, in terms of data transmission, to obtain. However, training ML models such as GANs is typically challenging because the training alternates between minimization and maximization steps to converge to a saddle point of the loss function. The task becomes more challenging when considering the temporal domain and the increased complexity it introduces, if only because of the increased data.

The present application discloses a framework based on knowledge distillation and latent space residuals that uses any adversarially trained image compression ML model as a basis for building a video compression codec with hallucination capacity similar to that of a trained GAN, which is particularly important when targeting low bit-rate video compression. The images resulting from the present ML model-based video compression solution are visually pleasing.
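As one illustration of how the residual decoder NN of the claims might be kept smaller than the teacher decoder (claims 3, 4, 13, and 14 require fewer channels or fewer trainable parameters), the sketch below builds a reduced residual decoder. The architecture, layer count, and channel widths are hypothetical; the description above does not disclose specific layer configurations.

```python
import torch.nn as nn

def make_student_decoder(latent_channels: int, base_channels: int = 32) -> nn.Sequential:
    """Hypothetical reduced residual decoder: it mirrors the overall layout of a
    larger teacher decoder (a stack of upsampling blocks) but uses fewer channels,
    and therefore fewer trainable parameters, than the teacher."""
    return nn.Sequential(
        nn.ConvTranspose2d(latent_channels, base_channels * 2, 4, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(base_channels * 2, base_channels, 4, stride=2, padding=1),
        nn.ReLU(inplace=True),
        nn.ConvTranspose2d(base_channels, 3, 4, stride=2, padding=1),  # RGB output
    )
```

During knowledge distillation, this smaller decoder would be trained against outputs of the frozen teacher, as in the `distillation_step` sketch following the claims above.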