US-12620225-B2 - Segment identification from long videos

Abstract

One or more aspects of the method, apparatus, and non-transitory computer readable medium include receiving a query relating to a long video. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include generating a segment of the long video corresponding to the query using a machine learning model trained to identify relevant segments from long videos. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include responding to the query based on the generated segment.

Inventors

Ioana Croitoru
TRUNG HUU BUI
Zhaowen Wang
Seunghyun Yoon
FRANCK DERNONCOURT
Hailin Jin

Assignees

ADOBE INC.

Dates

Publication Date: 2026-05-05
Application Date: 2023-04-24

Claims (17)

1. A method comprising: receiving a query relating to a long video; encoding the query to obtain an encoded query; encoding a plurality of segments of the long video to obtain a plurality of encoded segments; generating a segment of the long video corresponding to the query using a machine learning model by comparing the encoded query to each of the plurality of encoded segments, wherein the machine learning model is trained to identify relevant segments from long videos based on input comprising encoded text features and encoded video features; and responding to the query based on the generated segment.
2. The method of claim 1, wherein: the machine learning model is trained based on training data including long segments formed by combining multiple video segments.
3. The method of claim 1, wherein: the generated segment comprises a long segment.
4. The method of claim 1, wherein: the segment comprises a start time and an end time within the long video.
5. The method of claim 1, wherein: the segment comprises a center time and a segment length.
6. A system comprising: one or more processors; one or more memories including instructions executable by the one or more processors to: encode a query to obtain an encoded query, wherein the query relates to a long video; encode a plurality of segments of the long video to obtain a plurality of encoded segments; and generate a segment of the long video corresponding to the query based on a machine learning model by comparing the encoded query to each of the plurality of encoded segments, wherein the machine learning model is trained to identify relevant segments from long videos based on input comprising encoded text features and encoded video features.
7. The system of claim 6, wherein: the machine learning model is trained based on training data including long segments formed by combining multiple video segments.
8. The system of claim 7, wherein: the training data is filtered based on a relevancy filter.
9. The system of claim 6, further comprising: a multi-modal encoder configured to encode text input and video input.
10. The system of claim 6, wherein: the machine learning model includes a transformer encoder and a transformer decoder.
11. The system of claim 6, further comprising: a user interface configured to receive the query from a user and to provide a response to the user based on the generated segment.
12. A non-transitory computer readable medium storing code for image processing, the code comprising instructions that, when executed by at least one processor, cause the at least one processor to perform operations comprising: receiving a query relating to a long video; encoding the query to obtain an encoded query; encoding a plurality of segments of the long video to obtain a plurality of encoded segments; generating a segment of the long video corresponding to the query using a machine learning model by comparing the encoded query to each of the plurality of encoded segments, wherein the machine learning model is trained to identify relevant segments from long videos based on input comprising encoded text features and encoded video features; and responding to the query based on the generated segment.
13. The non-transitory computer readable medium of claim 12, wherein: the machine learning model is trained based on training data including long segments formed by combining multiple video segments.
14. The non-transitory computer readable medium of claim 13, wherein: the training data is filtered based on a relevancy filter.
15. The non-transitory computer readable medium of claim 12, wherein: the generated segment comprises a long segment.
16. The non-transitory computer readable medium of claim 12, wherein: the segment comprises a start time and an end time within the long video.
17. The non-transitory computer readable medium of claim 12, wherein: the segment comprises a center time and a segment length.
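
As an informal illustration only, and not part of the claims, the comparison step of claims 1, 6, and 12 and the two segment representations of claims 4, 5, 16, and 17 might be sketched as below. The sketch assumes the query and the segments have already been encoded into fixed-length vectors, and it substitutes plain cosine similarity for the trained machine learning model, which the claims do not reduce to any particular scoring function. The names `Span`, `cosine`, and `best_segment` are hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical container for a segment of a long video, in seconds."""
    start: float
    end: float

    @classmethod
    def from_center_length(cls, center: float, length: float) -> "Span":
        # Center time plus segment length, converted to start/end times.
        half = length / 2.0
        return cls(center - half, center + half)

    def to_center_length(self) -> Tuple[float, float]:
        # Start/end times converted to (center time, segment length).
        return ((self.start + self.end) / 2.0, self.end - self.start)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; an ASSUMED stand-in for the trained model's score."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_segment(
    encoded_query: Sequence[float],
    encoded_segments: Sequence[Sequence[float]],
    spans: Sequence[Span],
) -> Tuple[Span, float]:
    """Compare the encoded query to each encoded segment; return the top match."""
    scores = [cosine(encoded_query, seg) for seg in encoded_segments]
    best = max(range(len(scores)), key=scores.__getitem__)
    return spans[best], scores[best]


# Toy data: three segments of a video with made-up two-dimensional encodings.
spans = [Span(0.0, 60.0), Span(60.0, 150.0), Span(150.0, 200.0)]
encoded_segments = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
span, score = best_segment([0.0, 2.0], encoded_segments, spans)
print(span, span.to_center_length())
```

With this toy data the second span is selected, and the same span can be expressed either as a start and end time or as a center time and length. A real system along the lines of claims 9 and 10 would obtain the vectors from a multi-modal encoder and score them with a transformer-based model rather than with a fixed similarity measure.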

Description

BACKGROUND

The present disclosure relates generally to image processing and, in some embodiments, to identifying portions of a video relating to a user query. Videos provide a major source of information online, and the amount of video content continues to grow. Many videos are long videos that last more than an hour. In some cases, it is useful to search and find particular segments of long videos that relate to subject matter of interest.

SUMMARY

Embodiments of the present disclosure provide a machine learning model utilizing natural language processing to analyze a user query and identify a video segment relating to the user query from a long video. The long video can be segmented based on transcripts generated from an audio track of the long video through automatic speech recognition using a trained neural network.

A method, apparatus, and non-transitory computer readable medium for identifying a video segment relating to a user query are described. One or more aspects of the method, apparatus, and non-transitory computer readable medium include receiving a query relating to a long video. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include generating a segment of the long video corresponding to the query using a machine learning model trained to identify relevant segments from long videos. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include responding to the query based on the generated segment.

A method, apparatus, and non-transitory computer readable medium for identifying a video segment relating to a user query are described.
One or more aspects of the method, apparatus, and non-transitory computer readable medium include one or more processors and one or more memories including instructions executable by the one or more processors to generate a segment of a long video corresponding to a query based on a machine learning model trained to identify relevant segments from long videos.

A method, apparatus, and non-transitory computer readable medium for a method of training a machine learning model are described. One or more aspects of the method, apparatus, and non-transitory computer readable medium include obtaining a plurality of video segments. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include combining a subset of the plurality of video segments to obtain a combined video segment. One or more aspects of the method, apparatus, and non-transitory computer readable medium further include training the machine learning model to identify a segment from a long video based on a query, wherein the training is based on training data including the combined video segment.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustrative depiction of a high-level diagram of users interacting with a video segment identification system, including a neural network for generating video transcripts, and receiving user queries through remote devices, according to aspects of the present disclosure.
FIG. 2 shows a block diagram of an example of a video segment identifier according to aspects of the present disclosure.
FIG. 3 shows a flow diagram illustrating an example of video segment identification using a video segment identification system and methods, according to aspects of the present disclosure.
FIG. 4 shows a flow diagram of a method for generating annotations for a video, according to aspects of the present disclosure.
FIG. 5 shows a block/flow diagram illustrating an example of a method of video moment detection, according to aspects of the present disclosure.
FIG. 6 shows a block/flow diagram of an example of a method of training an automatic annotation component model, according to aspects of the present disclosure.
FIG. 7 shows a block/flow diagram of an example of a method of training an automatic annotation component, according to aspects of the present disclosure.
FIG. 8 shows a block/flow diagram of an example of a method of video segment identification, according to aspects of the present disclosure.
FIG. 9 shows a block/flow diagram of an example of a method of training a video segment identification model, according to aspects of the present disclosure.
FIG. 10 shows an example of a computing device for a video segment identification system according to aspects of the present disclosure.

DETAILED DESCRIPTION

The present disclosure relates generally to image processing, and in some embodiments, to identifying relevant video segments in a video (i.e., moment detection). In some cases, video segments are identified based on a query from a user. Moment detection for a video involves localizing the moment of interest described by an input query. Being able to find a video segment of interest in videos that can be hours long has a broad range of applications from security to entertainment.

A method for moment detection is described. One or more asp