EP-4315270-B1 - MACHINE LEARNING MODEL BASED EMBEDDING FOR ADAPTABLE CONTENT EVALUATION
Inventors
- DOGGETT, Erika Varis
- BEARD, Audrey Coyote Aura
- SCHROERS, Christopher Richard
- AZEVEDO, Robert Gerson de Albuquerque
- LABROZZI, Scott
- XUE, Yuanyi
- ZIMMERMAN, James
Dates
- Publication Date
- 2026-05-06
- Application Date
- 2022-03-17
Claims (10)
- A method for use with at least one machine learning, ML, model (120) trained using contrastive learning based on a similarity metric to map each of a plurality of video segments (224) to a respective embedding (240, 352) in a continuous vector space (350), wherein the similarity metric is a video encoding metric, and the ML model (120) is trained to label a pair of video segments (224) as similar based on both having performed well on a particular encoding schema and as dissimilar based on both having performed well on different encoding schemas, the method comprising: receiving an input (128) including the plurality of video segments (224); mapping, using the at least one ML model (120), each of the plurality of video segments (224) to the respective embedding (240, 352) in the continuous vector space (350) to provide a plurality of mapped embeddings (240, 352) corresponding respectively to the plurality of video segments (224); performing one of a classification or a regression of the plurality of video segments (224) using the plurality of mapped embeddings (240, 352); classifying, based on the one of the classification or the regression, the plurality of video segments (224) into a video content category among a plurality of video content categories with respect to the similarity metric; determining an encoding schema corresponding to the video content category; and encoding the plurality of video segments (224) using the encoding schema.
- The method of claim 1, wherein the classification comprises grouping each of at least one of the plurality of mapped embeddings (240, 352) into one or more clusters (354) each corresponding respectively to a distinct category of the similarity metric.
- The method of claim 2, further comprising: selecting, based on the video content category, a pre-processing algorithm for pre-processing the plurality of video segments (224); and pre-processing the plurality of video segments (224) using the selected pre-processing algorithm.
- The method of claim 1, wherein the at least one ML model (120) comprises at least one of a one-dimensional convolutional neural network, a two-dimensional convolutional neural network, or a three-dimensional convolutional neural network.
- The method of claim 1, wherein the continuous vector space (350) is multi-dimensional.
- The method of claim 1, wherein the similarity metric comprises one of a quantitative similarity metric or a perceptual similarity metric.
- The method of claim 1, wherein the one of the classification or the regression is performed using a respective one of a trained classification ML model (120) or a trained regression ML model (120), and wherein the at least one ML model (120) and the respective one of the trained classification ML model (120) or the trained regression ML model (120) are trained independently of one another.
- The method of claim 1, wherein the one of the classification or the regression is performed using a respective one of a trained classification ML model (120) or a trained regression ML model (120), and wherein the respective one of the trained classification ML model (120) or the trained regression ML model (120) comprises a trained neural network, NN.
- The method of claim 8, wherein the one of the classification or the regression is performed using a respective one of a classification block or a regression block of the at least one ML model (120), and wherein the at least one ML model (120) including the respective one of the classification block or the regression block and the trained NN is trained using end-to-end learning.
- A system (100) comprising a processing hardware (104) and a memory (106) storing a software code (110), characterized by: the processing hardware (104) adapted to execute the software code (110) to perform the method of any of claims 1 to 9.
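The pipeline recited in claim 1 — embedding each video segment with a trained ML model, classifying the mapped embeddings into a video content category, and then selecting the encoding schema for that category — can be sketched as follows. This is a minimal illustration only, not the patented implementation: the stub linear projection (standing in for a trained convolutional network), the category centroids, and the category and schema names are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stub for the trained embedding model (120): a fixed linear projection
# standing in for, e.g., a trained 3D convolutional neural network.
PROJECTION = rng.normal(size=(8, 3))  # 8-dim segment features -> 3-dim embedding

def embed(segment_features: np.ndarray) -> np.ndarray:
    """Map segment features into the continuous vector space (350)."""
    return segment_features @ PROJECTION

# Hypothetical category centroids, one per video content category
# with respect to the similarity metric.
CENTROIDS = {
    "live_action": np.array([1.0, 0.0, 0.0]),
    "3d_animation": np.array([0.0, 1.0, 0.0]),
    "2d_animation": np.array([0.0, 0.0, 1.0]),
}

# Hypothetical mapping from content category to encoding schema.
SCHEMA_FOR_CATEGORY = {
    "live_action": "schema_A",
    "3d_animation": "schema_B",
    "2d_animation": "schema_C",
}

def classify(embedding: np.ndarray) -> str:
    """Nearest-centroid classification of one mapped embedding."""
    return min(CENTROIDS, key=lambda c: np.linalg.norm(embedding - CENTROIDS[c]))

def choose_schema(segments: list) -> str:
    """Classify all segments, then pick the schema for the majority category."""
    votes = [classify(embed(s)) for s in segments]
    category = max(set(votes), key=votes.count)
    return SCHEMA_FOR_CATEGORY[category]
```

The nearest-centroid step is one simple realization of "performing one of a classification or a regression"; the claims equally cover a separately trained classification or regression model operating on the embeddings.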
Description
The present invention relates to a method for use with at least one machine learning model, and to a corresponding system, for categorizing video content for video encoding.

Due to its nearly universal popularity as a content medium, ever more visual media content is being produced and made available to consumers. As a result, the efficiency with which visual images can be analyzed, classified, and processed has become increasingly important to the producers, owners, and distributors of that visual media content. One significant challenge to the efficient classification and processing of visual media content is that entertainment and media studios produce many different types of content having differing features, such as different visual textures and movement. In the case of audio-video (AV) film and television content, for example, the content produced may include live action content with realistic computer-generated imagery (CGI) elements, high-complexity three-dimensional (3D) animation, and even two-dimensional (2D) hand-drawn animation. Moreover, each different type of content produced may require different treatment in pre-production, post-production, or both.

Consider, for example, the post-production treatment of AV or video content. Different types of AV or video content may benefit from different encoding schemes for streaming, or from different workflows for localization. In the conventional art, the classification of content as being of a particular type is typically done manually, through human inspection. In the example use case of video encoding, the most appropriate workflow may not be identifiable even after manual inspection, but may instead require trial and error to determine how to classify the content for encoding purposes.
This classification process can be particularly challenging for mixed content types, such as animation embedded in otherwise live action content, or for visually complex 3D animation which may be better suited for post-processing using live action content workflows than traditional animation workflows.

State of the art disclosing methods and functions of image discovery and of generating labels and clusters can be derived from the following:
- YEN-CHANG HSU ET AL: "Deep Image Category Discovery using a Transferred Similarity Function", arXiv.org, Cornell University Library, 5 December 2016 (2016-12-05), XP080736757;
- HAN KAI ET AL: "Learning to Discover Novel Visual Categories via Deep Transfer Clustering", 2019 IEEE/CVF International Conference on Computer Vision (ICCV), IEEE, 27 October 2019 (2019-10-27), pages 8400-8408, XP033723601, DOI: 10.1109/ICCV.2019.00849;
- SYLVESTRE-ALVISE REBUFFI ET AL: "LSD-C: Linearly Separable Deep Clusters", arXiv.org, Cornell University Library, 17 June 2020 (2020-06-17), XP081698071; and
- KAI HAN ET AL: "Automatically Discovering and Learning New Visual Categories with Ranking Statistics", arXiv.org, Cornell University Library, 13 February 2020 (2020-02-13), XP081599096.

The present invention is defined by a method with the features of claim 1 and a system with the features of claim 10. The dependent claims describe preferred embodiments.
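The contrastive training criterion recited in claim 1 — labeling a pair of video segments as similar when both perform well on the same encoding schema, and as dissimilar when their best schemas differ — might be sketched as below. This is a simplified reading under stated assumptions: the per-schema quality scores, the schema names, and the classic margin-based contrastive loss are illustrative choices, not details taken from the patent.

```python
import numpy as np

def best_schema(scores: dict) -> str:
    """Return the encoding schema with the highest quality score for one segment."""
    return max(scores, key=scores.get)

def pair_label(scores_a: dict, scores_b: dict) -> int:
    """1 = similar pair (same best schema), 0 = dissimilar pair (different best schemas)."""
    return int(best_schema(scores_a) == best_schema(scores_b))

def contrastive_loss(emb_a, emb_b, label, margin=1.0):
    """Margin-based contrastive loss: pull similar pairs together,
    push dissimilar pairs at least `margin` apart."""
    d = np.linalg.norm(np.asarray(emb_a) - np.asarray(emb_b))
    return label * d**2 + (1 - label) * max(0.0, margin - d) ** 2

# Hypothetical per-schema quality scores (e.g. a VMAF-like metric) for three segments.
seg1 = {"schema_A": 92.0, "schema_B": 85.0}
seg2 = {"schema_A": 90.5, "schema_B": 88.0}
seg3 = {"schema_A": 80.0, "schema_B": 95.0}
```

Here `seg1` and `seg2` form a positive pair (both encode best under `schema_A`), while `seg1` and `seg3` form a negative pair, so the loss drives their embeddings toward distinct clusters in the vector space of Figures 3A and 3B.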
Figure 1 shows a diagram of an exemplary system for performing machine learning (ML) model based embedding for adaptable content evaluation, according to one implementation;
Figure 2A shows a diagram illustrating an exemplary training process for an ML model suitable for use in the system of Figure 1, according to one implementation;
Figure 2B shows a diagram illustrating an exemplary training process for an ML model suitable for use in the system of Figure 1, according to another implementation;
Figure 3A shows an exemplary two-dimensional (2D) subspace of a continuous multi-dimensional vector space including embedded vector representations of content with respect to a particular similarity metric, according to one implementation;
Figure 3B shows the subspace of Figure 3A including clusters of embeddings, each cluster identifying a different category of content with respect to the similarity metric on which the mapping of Figure 3A is based, according to one implementation; and
Figure 4 shows a flowchart describing an exemplary method for performing ML model based embedding for adaptable content evaluation, according to one implementation.

The following description contains specific information pertaining to implementations in the present disclosure. One skilled in the art will recognize that the present disclosure may be implemented in a manner different from that specifically discussed herein. The drawings in the present application and their accompanying detailed description are directed to merely exemplary implementations. Unless noted otherwise, like or corresponding elements among the figures may be indicated by like or corresponding reference numerals. Moreover