US-12626727-B2 - Syncing commentary with videos
Abstract
Systems, methods, and computer program products for automatically syncing commentary with videos are described herein. A method comprises reading a sequence of frames of a video; generating a set of frame documents based on the sequence of frames; reading commentary associated with the video; providing the commentary as input to a language model; reading embeddings generated by the language model based on the commentary; generating a commentary document in accordance with the embeddings; determining a semantic distance between the commentary document and each of the frame documents; selecting a subset of the set of frame documents having the lowest semantic distance to the commentary document; identifying a consecutive subsequence of the sequence of frames associated with the subset; providing at least two frames of the consecutive subsequence and the embeddings as input to a diffusion model; and reading a first frame generated by the diffusion model.
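The selection steps in the abstract — scoring every frame document against the commentary document and keeping the closest consecutive run of frames — can be sketched as follows. This is a minimal sketch, not the patented implementation: the cosine-distance metric, the toy 3-dimensional embeddings, and the function names are illustrative assumptions (the disclosure does not fix a particular distance measure or embedding size).

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; smaller means semantically closer.
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def select_consecutive_frames(commentary_doc, frame_docs, k=3):
    """Pick the k frame documents closest to the commentary document,
    then return the longest consecutive run of their frame indices."""
    distances = [(i, cosine_distance(commentary_doc, doc))
                 for i, doc in enumerate(frame_docs)]
    # Subset with the lowest semantic distance to the commentary document.
    closest = sorted(distances, key=lambda t: t[1])[:k]
    indices = sorted(i for i, _ in closest)
    # Identify a consecutive subsequence among the selected frame indices.
    best, run = [indices[0]], [indices[0]]
    for prev, cur in zip(indices, indices[1:]):
        run = run + [cur] if cur == prev + 1 else [cur]
        if len(run) > len(best):
            best = run
    return best

# Toy 3-dimensional "documents" standing in for real embeddings.
frames = [np.array(v, dtype=float) for v in
          [[1, 0, 0], [0.9, 0.1, 0], [0.8, 0.2, 0], [0, 1, 0], [0, 0, 1]]]
commentary = np.array([1.0, 0.05, 0.0])
print(select_consecutive_frames(commentary, frames, k=3))  # → [0, 1, 2]
```

Because the three closest frame documents here happen to be adjacent, the whole top-k set is returned as the consecutive subsequence; with scattered matches, only the longest run survives.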
Inventors
- Aaron Keith Baughman
- Leonid Karlinsky
- Gozde Akay
- Eduardo Morales
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2024-10-10
Claims (20)
- 1 . A computer-implemented method comprising: reading a sequence of frames of a video; generating a set of frame documents based on the sequence of frames, wherein each frame document of the set of frame documents corresponds to at least one of the frames of the sequence of frames; reading commentary associated with the video; providing the commentary as input to a language model; reading embeddings generated by the language model based on the commentary, wherein the embeddings characterize the commentary; generating a commentary document in accordance with the embeddings; determining a semantic distance between the commentary document and each of the frame documents; selecting a subset of the set of frame documents having the lowest semantic distance to the commentary document; identifying a consecutive subsequence of the sequence of frames associated with the subset; providing at least two frames of the consecutive subsequence and the embeddings as input to a diffusion model; and reading a first frame generated by the diffusion model.
- 2 . The computer-implemented method of claim 1 , wherein reading the sequence of frames comprises individually receiving each frame during a stream of the video.
- 3 . The computer-implemented method of claim 1 , wherein a first frame document corresponding to a first frame is a distribution over topics associated with the first frame.
- 4 . The computer-implemented method of claim 1 , wherein providing each frame of the sequence of frames as input to the machine learning model comprises providing the at least two consecutive frames as input to the machine learning model.
- 5 . The computer-implemented method of claim 1 , wherein generating the set of frame documents comprises: providing each frame of the sequence of frames as input to a machine learning model; reading a feature map generated by the machine learning model based on at least one frame of the sequence of frames, wherein the feature map characterizes objects depicted in the at least one frame and semantic relationships among the objects; and generating the frame document in accordance with the feature map.
- 6 . The computer-implemented method of claim 1 , the computer-implemented method further comprising: providing the first frame and an end frame as input to the diffusion model, wherein the at least two consecutive frames comprise the end frame; reading a second frame generated by the diffusion model; determining whether the first frame and the second frame are equivalent; and determining whether a number of frames greater than or equal to an insertion threshold have been generated.
- 7 . The computer-implemented method of claim 1 , the computer-implemented method further comprising: inserting the first frame into the sequence of frames such that the video is modified.
- 8 . The computer-implemented method of claim 7 , the computer-implemented method further comprising: synchronizing an audio representation of the commentary with the video in accordance with the consecutive subsequence; and transmitting the video for presentation via a client computing platform.
- 9 . The computer-implemented method of claim 1 , wherein the machine learning model is a Feature Pyramid Network.
- 10 . The computer-implemented method of claim 1 , the computer-implemented method further comprising: providing a prompt and a characterization of the consecutive subsequence as input to a generative machine learning model, wherein the prompt indicates a duration of a shortened commentary to be generated.
- 11 . The computer-implemented method of claim 1 , the computer-implemented method further comprising determining whether a length of the commentary is greater than a length of the consecutive subsequence, wherein the at least two consecutive frames of the sequence of frames and the embeddings are provided as input to the diffusion model responsive to determining the length of the commentary is greater than the length of the consecutive subsequence.
- 12 . A computer program product for syncing commentary with videos, the computer program product comprising: one or more non-transitory computer-readable storage media; program instructions stored on the one or more non-transitory computer-readable storage media to perform operations comprising: reading a sequence of frames of a video; generating a set of frame documents based on the sequence of frames, wherein each frame document of the set of frame documents corresponds to at least one of the frames of the sequence of frames; reading commentary associated with the video; providing the commentary as input to a language model; reading embeddings generated by the language model based on the commentary, wherein the embeddings characterize the commentary; generating a commentary document in accordance with the embeddings; determining a semantic distance between the commentary document and each of the frame documents; selecting a subset of the set of frame documents having the lowest semantic distance to the commentary document; identifying a consecutive subsequence of the sequence of frames associated with the subset; providing at least two frames of the consecutive subsequence and the embeddings as input to a diffusion model; and reading a first frame generated by the diffusion model.
- 13 . The computer program product of claim 12 , wherein generating the set of frame documents comprises: providing each frame of the sequence of frames as input to a machine learning model; reading a feature map generated by the machine learning model based on at least one frame of the sequence of frames, wherein the feature map characterizes objects depicted in the at least one frame and semantic relationships among the objects; and generating the frame document in accordance with the feature map.
- 14 . The computer program product of claim 12 , wherein the operations further comprise: providing the first frame and an end frame as input to the diffusion model, wherein the at least two consecutive frames comprise the end frame; reading a second frame generated by the diffusion model; determining whether the first frame and the second frame are equivalent; and determining whether a number of frames greater than or equal to an insertion threshold have been generated.
- 15 . The computer program product of claim 12 , wherein the operations further comprise: inserting the first frame into the sequence of frames such that the video is modified.
- 16 . The computer program product of claim 12 , wherein the operations further comprise: synchronizing an audio representation of the commentary with the video in accordance with the consecutive subsequence; and transmitting the video for presentation via a client computing platform.
- 17 . A computer system for syncing commentary with videos, the computer system comprising: a processor set; one or more computer-readable storage media; program instructions stored on the one or more computer-readable storage media to cause the processor set to perform operations comprising: reading a sequence of frames of a video; generating a set of frame documents based on the sequence of frames, wherein each frame document of the set of frame documents corresponds to at least one of the frames of the sequence of frames; reading commentary associated with the video; providing the commentary as input to a language model; reading embeddings generated by the language model based on the commentary, wherein the embeddings characterize the commentary; generating a commentary document in accordance with the embeddings; determining a semantic distance between the commentary document and each of the frame documents; selecting a subset of the set of frame documents having the lowest semantic distance to the commentary document; identifying a consecutive subsequence of the sequence of frames associated with the subset; providing at least two frames of the consecutive subsequence and the embeddings as input to a diffusion model; and reading a first frame generated by the diffusion model.
- 18 . The computer system of claim 17 , wherein generating the set of frame documents comprises: providing each frame of the sequence of frames as input to a machine learning model; reading a feature map generated by the machine learning model based on at least one frame of the sequence of frames, wherein the feature map characterizes objects depicted in the at least one frame and semantic relationships among the objects; and generating the frame document in accordance with the feature map.
- 19 . The computer system of claim 17 , wherein the operations further comprise: providing the first frame and an end frame as input to the diffusion model, wherein the at least two consecutive frames comprise the end frame; reading a second frame generated by the diffusion model; determining whether the first frame and the second frame are equivalent; and determining whether a number of frames greater than or equal to an insertion threshold have been generated.
- 20 . The computer system of claim 17 , wherein the operations further comprise: inserting the first frame into the sequence of frames such that the video is modified.
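Claims 6, 14, and 19 describe an iterative generation loop: feed a generated frame and the end frame back into the diffusion model, compare successive outputs for equivalence, and stop at equivalence or once an insertion threshold of frames has been produced. A minimal sketch of that control flow is below; `fake_diffusion_step` is a midpoint-interpolation stand-in, not the actual diffusion model (which, per the claims, also conditions on the commentary embeddings), and the threshold and tolerance values are illustrative assumptions.

```python
import numpy as np

def fake_diffusion_step(start, end, embeddings=None):
    # Stand-in for a diffusion model: midpoint interpolation of two frames.
    return (start + end) / 2.0

def generate_insert_frames(start, end, insertion_threshold=8, tol=1e-3):
    """Generate frames between `start` and `end` until two successive
    model outputs are equivalent or the insertion threshold is reached."""
    generated = []
    current = start
    while len(generated) < insertion_threshold:
        first = fake_diffusion_step(current, end)
        second = fake_diffusion_step(first, end)
        generated.append(first)
        # Stop once the model's successive outputs are equivalent.
        if np.allclose(first, second, atol=tol):
            break
        current = first
    return generated
```

With the midpoint stand-in, successive outputs converge toward the end frame, so the loop halts either on the equivalence check or on the insertion threshold, mirroring the two stopping conditions recited in the claims.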
Description
BACKGROUND

Embodiments of the present disclosure relate to automatically syncing commentary with videos.

SUMMARY

According to embodiments of the present disclosure, methods of, computer program products for, and computer systems for syncing commentary with videos are disclosed. A method for syncing commentary with videos may comprise reading a sequence of frames of a video. The method may comprise providing each frame of the sequence of frames as input to a machine learning model. The method may comprise reading a set of frame documents generated by the machine learning model based on the sequence of frames. Each frame document of the set of frame documents may correspond to at least one of the frames of the sequence of frames. The method may comprise reading commentary associated with the video. The method may comprise providing the commentary as input to a language model. The method may comprise reading embeddings generated by the language model based on the commentary. The embeddings may characterize the commentary. The method may comprise generating a commentary document in accordance with the embeddings. The method may comprise determining a semantic distance between the commentary document and each of the frame documents. The method may comprise selecting a subset of the set of frame documents having the lowest semantic distance to the commentary document. The method may comprise identifying a consecutive subsequence of the sequence of frames associated with the subset. The method may comprise providing at least two consecutive frames of the sequence of frames and the embeddings as input to a diffusion model. The consecutive subsequence may comprise the at least two consecutive frames. The method may comprise reading a first frame generated by the diffusion model.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram depicting an exemplary process for syncing commentary with videos, in accordance with one or more embodiments of this disclosure. FIG. 2 is a flow diagram depicting an exemplary method for syncing commentary with videos, in accordance with one or more embodiments of this disclosure. FIG. 3 is a schematic diagram of a computing node, in accordance with one or more embodiments of the present disclosure.

DETAILED DESCRIPTION

Commentary of any length about a discrete event within a live-streamed video can be generated using artificial intelligence. The commentary can be presented using a media streaming service. The commentary can also be embedded within Videos on Demand (VOD) for distribution around the world. Using current methods, when embedding audio files for the commentary into a video, graphical and audio syncing is, at most, approximate. Further, the audio files for the commentary may not be the correct duration for the video. For example, the audio may describe multiple discrete events during depiction of only one event. Likewise, if the scene length depicting a discrete event is too short for the commentary, the commentary may need to be omitted. A few manual processes have been devised to extract highlights from the commentary for graphical sidecars. The majority of such commentary is embedded into the videos based on point-in-time statistics. Frequently, the point timing is missing or inaccurate. In some events, the timing data is too broad for the accurate introduction of precise commentary for specific moments. As such, a way to automatically sync generated commentary with increased accuracy and efficiency is desired. In some implementations, the method for syncing the commentary must be completed during a live stream of a real-world or virtual event. In such implementations, the audio files for the commentary must be embedded in the video instantaneously or nearly instantaneously. The methods described herein automatically and accurately embed the audio files for commentary into videos.
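The duration mismatch described above — commentary that outlasts the scene it matches — reduces to a simple frame-budget computation. A minimal sketch, assuming a fixed frame rate; the `frames_needed` helper and the fps value are illustrative assumptions, not specified in the disclosure:

```python
import math

def frames_needed(commentary_seconds, scene_frame_count, fps=30):
    """If the commentary outlasts the matched scene, return how many
    frames the diffusion model should insert; otherwise zero."""
    scene_seconds = scene_frame_count / fps
    if commentary_seconds <= scene_seconds:
        return 0  # scene is already long enough; nothing to generate
    deficit = commentary_seconds - scene_seconds
    # Round up so the extended scene fully covers the commentary audio.
    return math.ceil(deficit * fps)

# A 5-second commentary against a 120-frame (4-second) scene at 30 fps
# leaves a 1-second deficit, i.e. 30 frames to insert.
print(frames_needed(5.0, 120, fps=30))  # → 30
```

When the result is zero, the alternative path applies instead: prompting a generative model for a shortened commentary rather than extending the scene.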
Further, the methods described herein automatically extend the scene length of the video and/or generate shorter commentary as needed to accurately synchronize the video with the commentary.

FIG. 1 is a block diagram illustrating an exemplary process 100 for syncing commentary with videos according to one or more exemplary embodiments of the present disclosure. In some implementations, process 100 is implemented in one or more processing devices (e.g., a digital processor, an analog processor, a digital circuit designed to process information, a state machine, and/or other mechanisms for electronically processing information). The one or more processing devices may include one or more devices configured through hardware, firmware, and/or software to be specifically designed for execution of one or more of the operations of process 100.

Process 100 may comprise reading video 102. Video 102 may comprise a sequence of frames, audio, and/or other information. Each frame of the sequence of frames may be