US-12625905-B2 - Performing video moment retrieval utilizing deep learning

US 12625905 B2

Abstract

The present disclosure relates to systems, methods, and non-transitory computer-readable media that learn parameters for a natural language video localization model utilizing a curated dataset. In particular, in some embodiments, the disclosed systems generate a set of similarity scores between a target query and a video dataset that includes a plurality of digital videos. For instance, the disclosed systems determine a false-negative threshold by utilizing the set of similarity scores to exclude a subset of false-negative samples from the plurality of digital videos. Further, the disclosed systems determine a negative sample distribution and generate a curated dataset that includes a subset of negative samples with the subset of false-negative samples excluded.
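The curation procedure summarized in the abstract can be sketched in code. This is an illustrative reconstruction, not an implementation from the patent: the function names (`cosine_similarity`, `curate_dataset`), the mean-plus-k-standard-deviations form of the false-negative threshold, and the uniform sampling of the remaining negatives are all assumptions made for the sketch.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def curate_dataset(query_emb, video_embs, num_negatives, k=2.0, seed=0):
    """Score every candidate video against the target query, exclude
    likely false negatives (scores above mean + k * std), then sample
    negatives from the remaining candidates.

    Returns the sampled negative indices and the threshold used."""
    scores = np.array([cosine_similarity(query_emb, v) for v in video_embs])
    threshold = scores.mean() + k * scores.std()   # hypothetical false-negative threshold
    candidates = np.where(scores < threshold)[0]   # false negatives excluded
    rng = np.random.default_rng(seed)
    negatives = rng.choice(candidates,
                           size=min(num_negatives, len(candidates)),
                           replace=False)
    return sorted(int(i) for i in negatives), float(threshold)
```

A video whose caption embedding is nearly identical to the query embedding scores above the threshold and is therefore never drawn as a negative, which is the point of the exclusion step.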

Inventors

  • Seunghyun Yoon

Assignees

  • ADOBE INC.

Dates

Publication Date
2026-05-12
Application Date
2023-07-19

Claims (20)

  1. A computer-implemented method comprising: constructing a curated dataset to learn parameters for a natural language video localization model by: generating, utilizing a text embedding model, a set of similarity scores between a target query and a video dataset comprising a plurality of digital videos by: identifying a set of video captions corresponding with the plurality of digital videos; generating, utilizing the text embedding model, digital video embeddings for the set of video captions; generating, utilizing the text embedding model, a target query embedding for the target query; and generating the set of similarity scores by comparing the target query embedding with the digital video embeddings; determining a false-negative threshold by utilizing the set of similarity scores to identify a subset of false-negative samples; excluding the subset of false-negative samples from the plurality of digital videos based on the subset of false-negative samples satisfying the false-negative threshold; determining a negative sample distribution for the plurality of digital videos based on the target query with the subset of false-negative samples excluded, wherein the negative sample distribution is determined based on one or more moments in one or more digital videos of the plurality of digital videos being in a category considered not relevant to the target query; and generating the curated dataset comprising a subset of negative samples based on the negative sample distribution without the subset of false-negative samples; and learning parameters for the natural language video localization model utilizing the curated dataset.
  2. The computer-implemented method of claim 1, wherein identifying the set of video captions corresponding with the plurality of digital videos comprises: identifying a first video caption corresponding to a first frame of a first digital video; identifying a second video caption corresponding to a second frame of the first digital video; identifying a third video caption corresponding to a first frame of a second digital video; and generating the digital video embeddings from the first video caption, the second video caption, and the third video caption.
  3. The computer-implemented method of claim 1, wherein generating the set of similarity scores comprises: generating, utilizing a cosine similarity, a similarity score between the target query embedding and a subset of the digital video embeddings for a first video of the plurality of digital videos; and generating, utilizing the cosine similarity, an additional similarity score between the target query embedding and an additional subset of the digital video embeddings for a second video of the plurality of digital videos, wherein the set of similarity scores comprises the similarity score and the additional similarity score.
  4. The computer-implemented method of claim 1, further comprises generating a false-negative distribution based on a mean distribution value and standard deviation value of the set of similarity scores.
  5. The computer-implemented method of claim 4, further comprises determining the false-negative threshold for the video dataset comprising the plurality of digital videos based on the false-negative distribution by determining a threshold for negative sample candidates, wherein the false-negative threshold comprises a predetermined similarity score above the threshold for negative sample candidates.
  6. The computer-implemented method of claim 1, further comprises: identifying the subset of false-negative samples from the plurality of digital videos based on the subset of false-negative samples satisfying the false-negative threshold; wherein satisfying the false-negative threshold comprises a similarity score of the subset of false-negative samples being above a predetermined similarity score corresponding with negative sample candidates.
  7. The computer-implemented method of claim 1, wherein determining the negative sample distribution further comprises: determining a mean distribution value and a standard deviation value of the set of similarity scores with the subset of false-negative samples excluded; and identifying the subset of negative samples of the plurality of digital videos from the negative sample distribution.
  8. The computer-implemented method of claim 1, wherein generating the curated dataset further comprises constructing the curated dataset by: extracting positive samples corresponding with the target query from the video dataset to include within the curated dataset; including the subset of negative samples within the curated dataset; and excluding the subset of false-negative samples from the curated dataset.
  9. A system comprising: one or more memory devices comprising a pre-trained natural language video localization model, wherein the pre-trained natural language video localization model is pre-trained on a curated dataset generated based on a negative sample distribution that is determined from one or more moments in one or more digital videos of a plurality of digital videos being in a category considered not relevant to a target query and a subset of false-negative samples removed; and one or more processors configured to cause the system to: process a search query from a client device that indicates one or more concepts utilizing the pre-trained natural language video localization model by: generating, utilizing a text embedding model, a text embedding from the search query received from the client device; identifying a set of digital video captions from a dataset of digital videos; generating, utilizing the text embedding model, digital video embeddings for the set of digital video captions from the dataset of digital videos; generating a set of similarity scores between the target query and the plurality of digital videos by comparing the text embedding from the search query with the digital video embeddings from the dataset of digital videos; and provide, in a graphical user interface, one or more indications of video content from one or more videos responsive to the search query based on the set of similarity scores between the digital video embeddings and the text embedding from the search query, wherein the one or more indications of video content comprises an indicator that points out a specific part of the video content that corresponds to the search query.
  10. The system of claim 9, wherein the one or more processors are configured to cause the system to generate the digital video embeddings by utilizing a neural network to generate the set of digital video embeddings from digital video frames of the dataset of digital videos.
  11. The system of claim 10, wherein the neural network comprises a Siamese-alike network architecture with late modality fusion.
  12. The system of claim 9, wherein the one or more processors are configured to cause the system to: identify a first video caption corresponding to a first frame of a first digital video; identify a second video caption corresponding to a second frame of the first digital video; identify a third video caption corresponding to a first frame of a second digital video; and generate the digital video embeddings from the first video caption, the second video caption, and the third video caption.
  13. The system of claim 9, wherein the one or more processors are configured to cause the system to provide one or more indications of video content by identifying a digital video embedding from the digital video embeddings with a similarity score closest to the text embedding from the search query.
  14. The system of claim 9, wherein the one or more processors are configured to cause the system to provide one or more indications of video content by causing the client device to display one or more digital videos responsive to the search query, wherein the one or more indications comprise timestamps corresponding to the one or more digital videos responsive to the search query from the client device.
  15. The system of claim 9, wherein the one or more processors are configured to cause the system to cause the client device to display one or more digital videos responsive to the search query by ranking the one or more digital videos responsive to the search query according to a corresponding similarity score with the search query.
  16. A non-transitory computer-readable medium storing executable instructions which, when executed by a processing device, cause the processing device to perform operations comprising: generating, utilizing a text embedding model, a set of similarity scores between a target query and a video dataset comprising a plurality of digital videos by: identifying a set of video captions corresponding with the plurality of digital videos; generating, utilizing a text embedding model, digital video embeddings for the set of video captions; generating, utilizing the text embedding model, a target query embedding for the target query; and generating the set of similarity scores by comparing the target query embedding with the digital video embeddings; determining a false-negative threshold for the plurality of digital videos; identifying a subset of false-negative samples of the plurality of digital videos to exclude based on the set of similarity scores and the false-negative threshold; determining a negative sample distribution for the plurality of digital videos based on the target query with the subset of false-negative samples excluded; identifying a subset of negative samples of the plurality of digital videos based on the negative sample distribution; generating a curated dataset comprising the identified subset of negative samples without the subset of false-negative samples; and learning parameters of a natural language video localization model based on the curated dataset.
  17. The non-transitory computer-readable medium of claim 16, further comprising: identifying the subset of false-negative samples from the plurality of digital videos based on the subset of false-negative samples satisfying the false-negative threshold as indicated by a similarity score of the subset of false-negative samples being above a predetermined similarity score corresponding with negative sample candidates.
  18. The non-transitory computer-readable medium of claim 17, wherein determining the false-negative threshold further comprises generating a false-negative distribution based on the set of similarity scores.
  19. The non-transitory computer-readable medium of claim 16, wherein determining the negative sample distribution further comprises determining a mean distribution value and a standard deviation distribution value for the set of similarity scores with the subset of false-negative samples excluded.
  20. The non-transitory computer-readable medium of claim 16, further comprises determining probability scores for the plurality of digital videos, wherein the probability scores indicate a likelihood of a digital video of the plurality of digital videos being a negative sample.
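The inference path recited in the system claims (embed the search query, compare it against per-frame caption embeddings, and surface the best-matching moment with a timestamp, ranked by similarity as in claims 13-15) can be sketched as follows. This is an illustration only, not the patented implementation; the `localize_moments` function, the dictionary layout of each video record, and the argmax-over-frames choice of moment are assumptions made for the sketch.

```python
import numpy as np

def localize_moments(query_emb, videos, top_k=3):
    """videos: list of dicts with 'id', 'caption_embs' (one embedding per
    captioned frame), and 'timestamps' (seconds, one per frame).

    Returns up to top_k (video_id, timestamp, score) triples: the
    'indications of video content' pointing at the matching moment,
    ranked by similarity to the query."""
    q = np.asarray(query_emb, dtype=float)
    q = q / np.linalg.norm(q)
    results = []
    for video in videos:
        embs = np.asarray(video["caption_embs"], dtype=float)
        embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
        sims = embs @ q                      # cosine similarity per frame
        best = int(np.argmax(sims))          # moment most similar to the query
        results.append((video["id"], video["timestamps"][best], float(sims[best])))
    results.sort(key=lambda r: r[2], reverse=True)   # rank videos by score
    return results[:top_k]
```

The returned timestamp plays the role of the claimed "indicator that points out a specific part of the video content"; a real system would map frame indices to moment boundaries rather than single timestamps.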

Description

BACKGROUND

Recent years have seen significant advancement in hardware and software platforms for video sharing. For example, many video sharing platforms are more accessible due to advancements in networking and storage technology. As such, video sharing platforms receive new content daily, resulting in massive libraries of digital videos. However, despite these advancements, existing video sharing platforms continue to suffer from a variety of problems with regard to the computational accuracy of locating specific videos and the operational flexibility of implementing video sharing platforms on computing devices.

SUMMARY

One or more embodiments described herein provide benefits and/or solve one or more of the problems in the art with systems, methods, and non-transitory computer-readable media that implement a natural language video localization model to detect video moments within a database of digital videos that match a given natural language query. For example, in one or more embodiments, the disclosed systems provide the detected video moment(s) (e.g., one or more indications of video content or timestamps from videos) that correspond with a search query. In particular, in one or more implementations the disclosed systems localize video frames from a massive set of videos given a text query (e.g., a search query relating to the massive set of videos). Furthermore, in some embodiments the disclosed systems construct a dataset (e.g., curate a dataset of digital videos) to train the natural language video localization model. Moreover, as part of constructing the dataset, the disclosed systems generate a set of similarity scores between a target query and a video dataset. Further, based on the generated set of similarity scores, the disclosed systems exclude a subset of false-negative samples from the dataset.
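A rough sketch of this exclusion step follows, loosely tracking the statistics recited in claims 4, 7, and 20: fit a normal distribution (mean and standard deviation) to the query-video similarity scores, exclude scores above the threshold as false negatives, and convert the remaining candidates into a sampling distribution. The specific linear weighting (lower similarity yields a higher probability of being drawn as a negative) is an assumption for illustration, not a detail taken from the patent.

```python
import numpy as np

def negative_sample_distribution(scores, k=1.0):
    """Given similarity scores between one target query and each video,
    compute a false-negative threshold at mean + k * std, zero out the
    excluded videos, and return a probability per video of being drawn
    as a negative sample, along with the threshold."""
    scores = np.asarray(scores, dtype=float)
    threshold = scores.mean() + k * scores.std()
    is_candidate = scores < threshold                  # false negatives excluded
    weights = np.where(is_candidate, threshold - scores, 0.0)
    probs = weights / weights.sum()                    # one probability per video
    return probs, float(threshold)
```

Excluded videos receive probability zero, so they can never be sampled as negatives, while the most dissimilar candidates are sampled most often.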
Accordingly, the disclosed systems generate a curated dataset that includes a specific subset of negative samples with the subset of false-negative samples excluded. In one or more embodiments, the disclosed systems learn parameters for the natural language video localization model based on the curated dataset. Additional features and advantages of one or more embodiments of the present disclosure are outlined in the description which follows, and in part will be obvious from the description, or may be learned by the practice of such example embodiments.

BRIEF DESCRIPTION OF THE DRAWINGS

This disclosure will describe one or more embodiments of the invention with additional specificity and detail by referencing the accompanying figures. The following paragraphs briefly describe those figures, in which:

FIG. 1 illustrates an example environment in which a video dataset localization system operates in accordance with one or more embodiments;
FIG. 2 illustrates an overview of the video dataset localization system providing one or more indications of video content from a dataset of digital videos to a client device in accordance with one or more embodiments;
FIG. 3 illustrates a diagram of the video dataset localization system generating a set of similarity scores in accordance with one or more embodiments;
FIG. 4 illustrates a diagram of the video dataset localization system utilizing a text embedding model in accordance with one or more embodiments;
FIG. 5 illustrates a diagram of the video dataset localization system generating a curated dataset in accordance with one or more embodiments;
FIG. 6 illustrates a diagram of the video dataset localization system learning parameters of the natural language video localization model based on the curated dataset in accordance with one or more embodiments;
FIG. 7 illustrates an example schematic diagram of the video dataset localization system in accordance with one or more embodiments;
FIG. 8 illustrates a flowchart of a series of acts for learning parameters for a natural language video localization model in accordance with one or more embodiments;
FIG. 9 illustrates a flowchart of a series of acts for providing one or more indications of video content to a client device in accordance with one or more embodiments;
FIG. 10 illustrates a block diagram of an exemplary computing device in accordance with one or more embodiments.

DETAILED DESCRIPTION

One or more embodiments described herein include a video dataset localization system that implements a natural language video localization model to provide one or more indications of video content from one or more videos of a video dataset in response to a search query. In particular, in one or more implementations the video dataset localization system expands search coverage to a massive video set to locate a moment within one or more videos that corresponds with a search query. Moreover, in one or more embodiments the video dataset localization system constructs a massive video moment retrieval dataset (e.g., curates a video dataset) for learning parameters of the natural language video localization model. In particular, thi