US-12626729-B2 - System and method for video/audio comprehension and automated clipping
Abstract
Systems and methods for video/audio comprehension and automated clipping provide at least one media clip (MC) from an event for display or listening on a user device. The method includes receiving audio or video media data indicative of the event; transcribing the media data into timestamped text; identifying entities within the text; creating text segments, each having a begin timestamp and an end timestamp and containing at least a minimum number of entity mentions; clipping from the media data the at least one media clip, whose begin and end timestamps correspond to those of a corresponding text segment; and providing the at least one media clip to the user device for viewing or listening by a user. Feedback may also be provided to adjust the logic that identifies MCs, and MC alerts may be sent to users autonomously or based on user-set parameters.
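As an illustration of the clipping step summarized above, the sketch below turns qualifying text segments into clip-extraction commands. The function name, the dict-based segment format, and the `MIN_ENTITY_MENTIONS` threshold are all hypothetical and not taken from the patent; only the idea that a clip's begin/end timestamps mirror those of a segment with a minimum number of entity mentions comes from the abstract.

```python
# Hypothetical end-to-end sketch of the clipping step; names and
# threshold values are illustrative, not taken from the patent.
MIN_ENTITY_MENTIONS = 3  # minimum entity mentions required per text segment

def segments_to_clip_commands(segments, media_path):
    """Turn qualifying text segments into ffmpeg copy-clip commands.

    Each segment is a dict with 'begin', 'end' (seconds) and
    'mentions' (entity-mention count); a clip's begin/end timestamps
    mirror the segment's, as recited in the abstract.
    """
    cmds = []
    for i, seg in enumerate(segments):
        if seg["mentions"] < MIN_ENTITY_MENTIONS:
            continue  # segment lacks the minimum number of entity mentions
        cmds.append(
            f"ffmpeg -i {media_path} -ss {seg['begin']} -to {seg['end']} "
            f"-c copy clip_{i}.mp4"
        )
    return cmds
```

Using stream copy (`-c copy`) keeps clipping fast because no re-encoding occurs, at the cost of keyframe-aligned cut points.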
Inventors
- Andrew Hyde
- Geoffrey Booth
- Ognjen Boras
- Jonathan Flanders
- Danny Donnell
Assignees
- DISNEY ENTERPRISES, INC.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-05-31
Claims (20)
- 1 . An automated computer-based method for providing at least one media clip (MC) from an event for display or listening on a user device, comprising: receiving media data indicative of the event; transcribing the audio portion of the media data into text with timestamps; identifying entities within the text, the entities being named in the content of the text; performing phonetic correction and co-reference resolution of the entities using predetermined phonetic rules and predetermined co-reference rules, respectively; segmenting the text into a plurality of text segments based on predetermined text segment creation rules, each of the text segments having at least one of the entities and having a segment begin timestamp and a segment end timestamp; clipping from the media data the at least one media clip having a clip begin timestamp and clip end timestamp that correspond to the segment begin timestamp and the segment end timestamp of a corresponding one of the text segments; and providing the at least one media clip for viewing or listening on the user device, wherein the identifying, performing, segmenting, and clipping are performed contiguously in an automated manner without human intervention after receiving the media data using the predetermined phonetic rules, the predetermined co-reference rules, and the predetermined text segment creation rules.
- 2 . The method of claim 1 , further comprising determining an entity classification of the at least one entity comprising an amount of time that the at least one entity is mentioned during a given segment or during the entire event, and wherein the media clip includes the entity classification.
- 3 . The method of claim 1 , wherein the segmenting further comprises creating text clusters from the text based on cluster creation rules, each cluster having at least one entity and having a cluster begin time and a cluster end time.
- 4 . The method of claim 3 , wherein the cluster creation rules comprise at least one of: maximum entity gap length, minimum mention count, minimum cluster length, cluster adjustment time, and cluster exclusion rules.
- 5 . The method of claim 3 , further comprising receiving feedback from a user or an editor on the quality of the at least one media clip and adjusting the cluster creation rules or segment creation rules to improve the quality of the media clip.
- 6 . The method of claim 5 , wherein the adjusting is performed using a machine learning model that is trained using prior adjustments.
- 7 . The method of claim 1 , wherein the segment creation rules comprise at least one of maximum segment length and segment exclusion rules.
- 8 . The method of claim 1 , wherein the phonetic rules comprise a minimum possible phonetic partial match.
- 9 . The method of claim 1 , wherein the co-reference rules comprise a co-reference offset maximum.
- 10 . The method of claim 1 , further comprising aggregating a plurality of the at least one media clip from a plurality of different shows or events.
- 11 . The method of claim 10 , wherein the plurality of different shows or events corresponds to shows or events selected by the user.
- 12 . The method of claim 1 , wherein the co-reference resolution comprises associating the entities in the text with corresponding pronouns, relationship words, nicknames, and abbreviations.
- 13 . The method of claim 1 , wherein the user device comprises a graphical user interface (GUI) which, when selected, causes the media clip to play on a device display.
- 14 . The method of claim 1 , further comprising sending an MC alert message to the user device when an MC is available for viewing or predetermined MC alert criteria are satisfied.
- 15 . The method of claim 14 , wherein the predetermined MC alert criteria comprise at least one of: the MC matching user attributes, the MC matching user MC likes, and the MC matching user alert settings.
- 16 . The method of claim 1 , further comprising receiving a settings command from a user and receiving settings inputs from a user.
- 17 . The method of claim 1 , further comprising receiving user attributes data from a user.
- 18 . The method of claim 1 , further comprising determining a title for the text segment and providing the title with the media clip for display by the user device.
- 19 . The method of claim 1 , wherein the media clip is less than 5 minutes long.
- 20 . The method of claim 1 , wherein the event comprises a sports show or sporting event.
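Claims 3 and 4 recite grouping entity mentions into text clusters governed by rules such as maximum entity gap length, minimum mention count, minimum cluster length, and cluster adjustment time. A minimal sketch of how such rules might be applied follows; all threshold values, names, and data structures are hypothetical, since the patent leaves them as predetermined parameters.

```python
from dataclasses import dataclass

# Hypothetical rule values; the patent leaves these as predetermined parameters.
MAX_ENTITY_GAP_S = 30.0   # max seconds between mentions within one cluster
MIN_MENTION_COUNT = 3     # minimum entity mentions for a cluster to survive
MIN_CLUSTER_LEN_S = 20.0  # minimum cluster duration in seconds
CLUSTER_ADJUST_S = 5.0    # padding added to each cluster boundary

@dataclass
class Cluster:
    entity: str
    begin: float
    end: float
    mentions: int

def build_clusters(mentions):
    """Group timestamped (entity, time) mentions into clusters.

    A new cluster starts whenever the gap since the entity's previous
    mention exceeds MAX_ENTITY_GAP_S; clusters failing the minimum
    mention count or minimum length are discarded, and survivors are
    padded by the cluster adjustment time.
    """
    by_entity = {}
    for entity, t in sorted(mentions, key=lambda m: m[1]):
        runs = by_entity.setdefault(entity, [])
        if runs and t - runs[-1][-1] <= MAX_ENTITY_GAP_S:
            runs[-1].append(t)
        else:
            runs.append([t])
    clusters = []
    for entity, runs in by_entity.items():
        for run in runs:
            begin, end = run[0], run[-1]
            if len(run) >= MIN_MENTION_COUNT and end - begin >= MIN_CLUSTER_LEN_S:
                clusters.append(Cluster(entity,
                                        max(0.0, begin - CLUSTER_ADJUST_S),
                                        end + CLUSTER_ADJUST_S,
                                        len(run)))
    return sorted(clusters, key=lambda c: c.begin)
```

Cluster exclusion rules and the merging of clusters into segments (claim 7's maximum segment length and segment exclusion rules) would layer on top of this in the same style.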
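Claim 8's "minimum possible phonetic partial match" could be realized by comparing phonetic codes of transcribed tokens against a roster of known entity names. The sketch below uses a simplified Soundex code and a hypothetical 0.75 match threshold; neither the encoding nor the threshold value is specified by the patent.

```python
def soundex(word):
    """Four-character Soundex-style code (simplified sketch)."""
    word = word.upper()
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    out = word[0]
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        prev = code
    return (out + "000")[:4]

# Hypothetical "minimum possible phonetic partial match" threshold.
MIN_PHONETIC_MATCH = 0.75

def phonetic_correct(token, roster):
    """Return the roster entity whose phonetic code best matches the
    transcribed token, or None if the best score falls below the
    minimum phonetic partial-match threshold."""
    t_code = soundex(token)
    best, best_score = None, 0.0
    for name in roster:
        score = sum(a == b for a, b in zip(t_code, soundex(name))) / 4.0
        if score > best_score:
            best, best_score = name, score
    return best if best_score >= MIN_PHONETIC_MATCH else None
```

This corrects transcription errors such as "Smyth" for a roster entry "Smith", while rejecting tokens that match no entity closely enough.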
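Claims 9 and 12 describe co-reference resolution bounded by a co-reference offset maximum. One simple interpretation, sketched below with a hypothetical window value and pronoun set, links each pronoun to the most recently named entity only when that entity was mentioned within the offset window.

```python
COREF_OFFSET_MAX_S = 15.0  # hypothetical maximum look-back window in seconds
PRONOUNS = {"he", "she", "they", "him", "her", "them"}

def resolve_corefs(tokens):
    """Replace pronoun tokens with the most recently named entity,
    provided that entity was mentioned within COREF_OFFSET_MAX_S.

    `tokens` is a list of (word, timestamp, entity_or_None) triples,
    as an upstream entity-identification pass might produce.
    """
    last_entity, last_time = None, None
    resolved = []
    for word, t, entity in tokens:
        if entity is not None:
            last_entity, last_time = entity, t
            resolved.append((word, t, entity))
        elif (word.lower() in PRONOUNS and last_entity is not None
              and t - last_time <= COREF_OFFSET_MAX_S):
            resolved.append((word, t, last_entity))  # pronoun counts as a mention
        else:
            resolved.append((word, t, None))
    return resolved
```

Claim 12's relationship words, nicknames, and abbreviations would extend the same lookup beyond the pronoun set shown here.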
Description
BACKGROUND

The process of reviewing and analyzing long-form and live media content, such as video and audio, to create short-form video-on-demand (VOD) or audio-on-demand (AOD) clips or segments for consumption by users requires extensive time due to manual processes. Such media clips are currently created via manual inspection and detailed review of the media content to identify specific topics in the longer video, e.g., a sports show or event or other shows or events, to create the clips. Such a process is slow and results in a very limited quantity of short VOD media clips for consumption by users or consumers. Accordingly, it would be desirable to have a system and method that increases the number of such clips and decreases the time to create them, thereby providing a greater number of short VOD/AOD media content clips that are of interest to sports fans or the general public.

BRIEF DESCRIPTION OF THE DRAWINGS

- FIG. 1 is a top-level block diagram of components of a video/audio comprehension and automated clipping system, in accordance with embodiments of the present disclosure.
- FIG. 2 is a more detailed block diagram of components of FIG. 1, in accordance with embodiments of the present disclosure.
- FIG. 3 is a flow diagram of one of the components in FIG. 1, in accordance with embodiments of the present disclosure.
- FIG. 4 is a flow diagram of one of the components in FIG. 2, in accordance with embodiments of the present disclosure.
- FIG. 5A is a flow diagram of one of the components in FIG. 2, in accordance with embodiments of the present disclosure.
- FIG. 5B is a diagram of one of the components in FIG. 2, in accordance with embodiments of the present disclosure.
- FIG. 5C is a table showing a sample listing of Entities (or topics), in accordance with embodiments of the present disclosure.
- FIG. 5D is a table showing Segment and Clipping Rules/Data, in accordance with embodiments of the present disclosure.
- FIG. 6A is a flow diagram of one of the components in FIG. 2, in accordance with embodiments of the present disclosure.
- FIG. 6B is a timeline diagram showing how text clusters and text segments are created from transcript text, in accordance with embodiments of the present disclosure.
- FIG. 6C is a table showing text Clusters and Segments data, in accordance with embodiments of the present disclosure.
- FIG. 6D is a diagram showing a portion of transcript text after Entities have been identified, in accordance with embodiments of the present disclosure.
- FIG. 6E is a diagram showing a portion of transcript text after a first text Cluster has been identified, in accordance with embodiments of the present disclosure.
- FIG. 6F is a diagram showing a portion of transcript text after a second text Cluster has been identified, in accordance with embodiments of the present disclosure.
- FIG. 6G is a diagram showing a portion of transcript text after a third text Cluster has been identified, in accordance with embodiments of the present disclosure.
- FIG. 6H is a diagram showing a portion of transcript text after a text Segment has been identified, in accordance with embodiments of the present disclosure.
- FIG. 6I is a table showing a text Segment Entity Listing for the text Segment of FIG. 6H, in accordance with embodiments of the present disclosure.
- FIG. 6J is a diagram showing raw detected Entity classification and rolled-up Entity classification, in accordance with embodiments of the present disclosure.
- FIG. 6K is a diagram showing two examples of a comparison of conventional tagging and new enhanced tagging (or classification), in accordance with embodiments of the present disclosure.
- FIG. 7A is a flow diagram of one of the components in FIG. 2, in accordance with embodiments of the present disclosure.
- FIG. 7B is a diagram showing the alignment of entries in the Clusters and Segments Table of FIG. 6C with the entries of input AV Media Data for clipping of AV Media Data, in accordance with embodiments of the present disclosure.
- FIG. 7C is a table showing a sample listing of Media Clips (MCs) and certain features and attributes associated with the Media Clips, in accordance with embodiments of the present disclosure.
- FIG. 8A is a flow diagram of one of the components in FIG. 1, in accordance with embodiments of the present disclosure.
- FIG. 8B is a diagram showing the combination of MCs from different MC Listing Tables to form an MC Aggregate Listing Table, in accordance with embodiments of the present disclosure.
- FIG. 8C is a table showing show/event metadata, in accordance with embodiments of the present disclosure.
- FIG. 8D is a table showing a sample listing of user attributes, in accordance with embodiments of the present disclosure.
- FIG. 9 is a flow diagram of one of the components in FIG. 1, in accordance with embodiments of the present disclosure.
- FIG. 10A is a flow diagram of a Media Clip App (MC App) software application, in accordance with embodiments of the present disclosure.
- FIG. 10B is a scre