US-12625902-B1 - Multimodal machine learning model for content evaluation

US 12625902 B1

Abstract

Embodiments provide for improved machine learning. A request for supplemental content to be provided in association with a media content item is received, and a set of candidate supplemental content items for the request is determined. A user embedding corresponding to a user associated with the media content item, a media embedding corresponding to the media content item, and a set of supplemental content embeddings corresponding to the set of candidate supplemental content items are accessed from one or more storage repositories. A set of interaction scores is generated based on processing the user embedding, the media embedding, and the set of supplemental content embeddings using an interaction machine learning model. A first supplemental content item of the set of candidate supplemental content items is selected for the request based on the set of interaction scores.

Inventors

  • Yupeng Gao
  • Pengfei Gao
  • Yan Zhang
  • Zhe Wang
  • Yasir Hossain
  • Xingpeng Xiao
  • Mengzhe LI
  • Gianluca MILANO

Assignees

  • DISNEY ENTERPRISES, INC.

Dates

Publication Date
2026-05-12
Application Date
2025-01-24

Claims (17)

  1. A method, comprising: receiving a request for supplemental content to be provided in association with a media content item; determining a set of candidate supplemental content items for the request; accessing, from one or more storage repositories, a user embedding corresponding to a user associated with the media content item, a media embedding corresponding to the media content item, and a set of supplemental content embeddings corresponding to the set of candidate supplemental content items; generating a set of interaction scores based on processing the user embedding, the media embedding, and the set of supplemental content embeddings using an interaction machine learning model, wherein: a set of embedding machine learning models and the interaction machine learning model were jointly trained during an offline phase, the user embedding, the media embedding, and the set of supplemental content embeddings were generated using the set of embedding machine learning models during the offline phase, the set of interaction scores are generated using the interaction machine learning model during an online phase, and the set of embedding machine learning models do not process data during the online phase; and selecting, for the request, a first supplemental content item of the set of candidate supplemental content items based on the set of interaction scores.
  2. The method of claim 1, wherein determining the set of candidate supplemental content items comprises identifying a subset of supplemental content items from a library of supplemental content items based on a set of constraints corresponding to the user.
  3. The method of claim 1, wherein the user embedding was generated offline based on processing one or more features of the user using a user embedding machine learning model, wherein the one or more features comprise one or more demographics of the user.
  4. The method of claim 1, wherein the media embedding was generated offline based on: generating a set of image features based on processing image data from the media content item using a first media embedding machine learning model; generating a set of audio features based on processing audio data from the media content item using a second media embedding machine learning model; and aggregating the set of image features and the set of audio features.
  5. The method of claim 4, wherein the media embedding was further generated offline based on processing the aggregated set of image features and set of audio features using a third media embedding machine learning model.
  6. The method of claim 1, wherein a first supplemental content embedding of the set of supplemental content embeddings was generated offline based on processing one or more features of the first supplemental content item using a supplemental content embedding machine learning model, wherein the one or more features comprise characteristics of the first supplemental content item.
  7. The method of claim 1, wherein generating the set of interaction scores comprises, for each respective candidate supplemental content item of the set of candidate supplemental content items: generating a respective aggregated input based on concatenating the user embedding, the media embedding, and a respective supplemental content embedding, of the set of supplemental content embeddings, corresponding to the respective candidate supplemental content item; and processing the respective aggregated input using the interaction machine learning model to generate a respective interaction score for the respective candidate supplemental content item.
  8. The method of claim 7, wherein generating the set of interaction scores further comprises, for each respective candidate supplemental content item of the set of candidate supplemental content items: determining a respective set of interaction features, wherein the respective set of interaction features corresponds to at least one of: (i) interactions between the user and the media content item, (ii) interactions between the user and the respective candidate supplemental content item, or (iii) interactions between the media content item and the respective supplemental content item; and generating the respective aggregated input based further on the set of interaction features.
  9. The method of claim 1, wherein the set of interaction scores corresponds to an aggregation of one or more weighted probabilities of one or more positive interactions and one or more weighted probabilities of one or more negative interactions with respect to the user, the media content item, and the set of candidate supplemental content items.
  10. One or more non-transitory computer readable media containing, in any combination, computer program code that, when executed by operation of any combination of one or more processors, performs an operation comprising: receiving a request for supplemental content to be provided in association with a media content item; determining a set of candidate supplemental content items for the request; accessing, from one or more storage repositories, a user embedding corresponding to a user associated with the media content item, a media embedding corresponding to the media content item, and a set of supplemental content embeddings corresponding to the set of candidate supplemental content items; generating a set of interaction scores based on processing the user embedding, the media embedding, and the set of supplemental content embeddings using an interaction machine learning model, wherein: a set of embedding machine learning models and the interaction machine learning model were jointly trained during an offline phase, the user embedding, the media embedding, and the set of supplemental content embeddings were generated using the set of embedding machine learning models during the offline phase, the set of interaction scores are generated using the interaction machine learning model during an online phase, and the set of embedding machine learning models do not process data during the online phase; and selecting, for the request, a first supplemental content item of the set of candidate supplemental content items based on the set of interaction scores.
  11. The one or more non-transitory computer readable media of claim 10, wherein the user embedding was generated offline based on processing one or more features of the user using a user embedding machine learning model, wherein the one or more features comprise one or more demographics of the user.
  12. The one or more non-transitory computer readable media of claim 10, wherein the media embedding was generated offline based on: generating a set of image features based on processing image data from the media content item using a first media embedding machine learning model; generating a set of audio features based on processing audio data from the media content item using a second media embedding machine learning model; and aggregating the set of image features and the set of audio features.
  13. The one or more non-transitory computer readable media of claim 10, wherein a first supplemental content embedding of the set of supplemental content embeddings was generated offline based on processing one or more features of the first supplemental content item using a supplemental content embedding machine learning model, wherein the one or more features comprise characteristics of the first supplemental content item.
  14. The one or more non-transitory computer readable media of claim 10, wherein generating the set of interaction scores comprises, for each respective candidate supplemental content item of the set of candidate supplemental content items: generating a respective aggregated input based on concatenating the user embedding, the media embedding, and a respective supplemental content embedding, of the set of supplemental content embeddings, corresponding to the respective candidate supplemental content item; and processing the respective aggregated input using the interaction machine learning model to generate a respective interaction score for the respective candidate supplemental content item.
  15. A system, comprising: one or more processors; and one or more memories storing a program, which, when executed on any combination of the one or more processors, performs operations, the operations comprising: receiving a request for supplemental content to be provided in association with a media content item; determining a set of candidate supplemental content items for the request; accessing, from one or more storage repositories, a user embedding corresponding to a user associated with the media content item, a media embedding corresponding to the media content item, and a set of supplemental content embeddings corresponding to the set of candidate supplemental content items; generating a set of interaction scores based on processing the user embedding, the media embedding, and the set of supplemental content embeddings using an interaction machine learning model, wherein: a set of embedding machine learning models and the interaction machine learning model were jointly trained during an offline phase, the user embedding, the media embedding, and the set of supplemental content embeddings were generated using the set of embedding machine learning models during the offline phase, the set of interaction scores are generated using the interaction machine learning model during an online phase, and the set of embedding machine learning models do not process data during the online phase; and selecting, for the request, a first supplemental content item of the set of candidate supplemental content items based on the set of interaction scores.
  16. The system of claim 15, wherein: the user embedding was generated offline based on processing one or more features of the user using a user embedding machine learning model, wherein the one or more features comprise one or more demographics of the user, the media embedding was generated offline based on: generating a set of image features based on processing image data from the media content item using a first media embedding machine learning model; generating a set of audio features based on processing audio data from the media content item using a second media embedding machine learning model; and aggregating the set of image features and the set of audio features, and a first supplemental content embedding of the set of supplemental content embeddings was generated offline based on processing one or more features of the first supplemental content item using a supplemental content embedding machine learning model, wherein the one or more features comprise characteristics of the first supplemental content item.
  17. The system of claim 15, wherein generating the set of interaction scores comprises, for each respective candidate supplemental content item of the set of candidate supplemental content items: generating a respective aggregated input based on concatenating the user embedding, the media embedding, and a respective supplemental content embedding, of the set of supplemental content embeddings, corresponding to the respective candidate supplemental content item; and processing the respective aggregated input using the interaction machine learning model to generate a respective interaction score for the respective candidate supplemental content item.
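For illustration only (the patent itself discloses no source code), the per-candidate scoring recited in claims 7 and 17 might be sketched as follows. All names, embedding sizes, and the toy interaction model below are hypothetical stand-ins; the actual interaction machine learning model would be a trained network, not a simple sum.

```python
import numpy as np

def score_candidates(user_emb, media_emb, candidate_embs, interaction_model):
    """Sketch of claims 7/17: for each candidate, concatenate the user,
    media, and supplemental content embeddings into an aggregated input,
    then score it with the interaction model."""
    scores = []
    for cand_emb in candidate_embs:
        aggregated = np.concatenate([user_emb, media_emb, cand_emb])
        scores.append(interaction_model(aggregated))
    return scores

# Toy stand-in for the jointly trained interaction model: any callable
# mapping the aggregated input to a scalar score.
toy_model = lambda x: float(x.sum())

user = np.ones(4)         # hypothetical user embedding
media = 2 * np.ones(4)    # hypothetical media embedding
cands = [np.zeros(4), np.ones(4)]  # two candidate supplemental embeddings

scores = score_candidates(user, media, cands, toy_model)
best = int(np.argmax(scores))  # claim 1: select the candidate by its score
```

The selection step in claim 1 then reduces to picking the candidate whose interaction score is preferred (here, the maximum).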

Description

BACKGROUND

The digital content landscape is continuously evolving. Not only is there a tremendous variety of primary content (e.g., multimedia such as a video stream, audio stream, and the like) available to users, but there is also a similarly vast assortment of supplemental content (e.g., promotional content, recommendations, live events, and the like) which can be provided along with the primary content. Though significant resources have been expended seeking to improve supplemental content selection, there remains substantial opportunity for improvement. Recently, some attempts have been made to use machine learning to improve content selection. However, such approaches have thus far been suboptimal in their selections. Further, such approaches generally incur substantial computational expense (e.g., relying on substantial compute resources such as memory) and introduce significant latency (e.g., significant time is consumed processing the various data to select content), rendering them unsuitable for many digital content environments where such delays are unacceptable.

BRIEF DESCRIPTION OF THE DRAWINGS

So that the manner in which the above recited aspects are attained and can be understood in detail, a more particular description of embodiments described herein, briefly summarized above, may be had by reference to the appended drawings. It is to be noted, however, that the appended drawings illustrate typical embodiments and are therefore not to be considered limiting; other equally effective embodiments are contemplated.

FIG. 1 depicts an example system for multimodal machine learning, according to some embodiments of the present disclosure.
FIG. 2 depicts an example model architecture for multimodal machine learning, according to some embodiments of the present disclosure.
FIG. 3 depicts an example serving architecture for multimodal machine learning, according to some embodiments of the present disclosure.
FIG. 4 is a flow diagram depicting an example method for multimodal machine learning, according to some embodiments of the present disclosure.
FIG. 5 is a flow diagram depicting an example method for training multimodal machine learning models, according to some embodiments of the present disclosure.
FIG. 6 is a flow diagram depicting an example method for feature generation for multimodal machine learning, according to some embodiments of the present disclosure.
FIG. 7 is a flow diagram depicting an example method for machine learning, according to some embodiments of the present disclosure.
FIG. 8 depicts an example computing device configured to perform various embodiments of the present disclosure.

DETAILED DESCRIPTION

Many modern content providers face challenging problems relating to providing digital content, including supplemental content (e.g., in-stream promotions, live events, advertisements, recommendations on social media feeds, interactive advertisements or other media, and the like). In some embodiments, digital content (e.g., streaming video, audio, games, and any other suitable content) may often be supported by (or include slots for insertion of) various supplemental content items. Online supplemental content serving demands low latency (e.g., to prevent delay in the providing of the digital content) as well as high prediction accuracy (e.g., to ensure the supplemental content is relevant and not disruptive). However, many modern approaches neglect a wide variety of contextual features which can hold valuable information for enhancing personalized user experiences.

In some embodiments of the present disclosure, a multimodal model architecture for online supplemental content evaluation is provided, designed to deliver swift and highly accurate predictions while harnessing the power of contextual features. Further, in some embodiments, portions of the model may be executed in an offline fashion (e.g., prior to beginning serving of any content to a user).
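The offline precomputation just described can be sketched as follows. This is an illustrative reconstruction, not code from the patent: the function names, the dictionary-based store (standing in for the "storage repositories"), and the lambda stand-ins for the jointly trained embedding and interaction models are all hypothetical.

```python
import numpy as np

# Offline phase (hypothetical sketch): the embedding models run once, and
# their outputs are persisted so they never need to run at serving time.
def offline_precompute(users, media_items, supp_items,
                       embed_user, embed_media, embed_supp):
    return {
        "user": {u: embed_user(u) for u in users},
        "media": {m: embed_media(m) for m in media_items},
        "supp": {s: embed_supp(s) for s in supp_items},
    }

# Online phase: only cheap lookups plus the small interaction model run;
# the embedding models process no data here, which is where the latency
# and compute savings come from.
def online_score(store, user_id, media_id, candidate_ids, interaction_model):
    u = store["user"][user_id]
    m = store["media"][media_id]
    return {c: interaction_model(np.concatenate([u, m, store["supp"][c]]))
            for c in candidate_ids}

# Toy stand-ins for the jointly trained models.
store = offline_precompute(
    ["alice"], ["movie"], ["ad1", "promo"],
    embed_user=lambda u: np.ones(2),
    embed_media=lambda m: 2 * np.ones(2),
    embed_supp=lambda s: len(s) * np.ones(2),
)
scores = online_score(store, "alice", "movie", ["ad1", "promo"],
                      interaction_model=lambda x: float(x.sum()))
```

In a production setting the store would be a feature store or key-value cache rather than an in-memory dictionary, but the division of labor is the same: heavy embedding computation offline, lightweight lookup and scoring online.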
In some embodiments, a relatively small portion of the model can be executed online during runtime (e.g., while users consume content) while leveraging the information gleaned during offline execution. This hybrid offline-online architecture can enable the model to generate online predictions with significantly reduced computational expense (e.g., relying on less memory and compute, as well as consuming less energy and generating less heat) as well as reduced latency (e.g., more rapid predictions). Further, in some aspects, this architecture can reduce bandwidth usage by reducing the amount of data that is loaded and/or used during the online phase.

In some embodiments, selection of appropriate supplemental content is performed based on evaluation of multiple modalities of information in order to improve the selection process. For example, while some approaches seek to identify the most relevant supplemental content for a particular user to whom the content is being delivered, these approa