
US-12621518-B2 - Video content enhancement

US 12621518 B2

Abstract

Video content enhancement methods, systems and devices are disclosed. An input video 11 is processed to generate an augmentation descriptor 4 from it, the augmentation descriptor 4 defining augmentations that enhance the content of the input video 11. An output video 14 is generated by combining the augmentations defined by the augmentation descriptor 4 with the input video 11, so that playback of the output video 14 features said augmentations. A timecoded transcript is generated from at least an audio portion of the input video 11, and the timecoded transcript is processed to generate from it a set of transcript-derived enhancements. The set of transcript-derived enhancements are added to the augmentation descriptor 4 as augmentations for enhancing the content of the input video 11.

Inventors

  • Thomas William Richard Armitage
  • Thomas Charles Bridges
  • Tijmen Pieter Brommet
  • Alexander Edmund Hardman
  • Andrew Campbell McDonough
  • Daniel Thomas Nuttall
  • Erik Ingemar Nygren

Assignees

  • CAPTIONHUB LTD

Dates

Publication Date
2026-05-05
Application Date
2024-03-12
Priority Date
2023-03-12

Claims (18)

  1. A video content enhancement system suitable for captioning of live broadcasts from a live broadcaster, the system comprising: a server comprising an augmentation generator configured to receive and process an input video to generate an augmentation descriptor from it, the augmentation descriptor defining augmentations that enhance the content of the input video; and a client device of the live broadcaster, the client device comprising an augmentation applicator configured to generate an output video by combining the augmentations defined by the augmentation descriptor with the input video, so that playback of the output video features said augmentations; wherein the augmentation generator is configured to: transmit at least an audio portion of the input video to a transcript generator and receive in response from the transcript generator a timecoded transcript; process the timecoded transcript to generate from it a set of transcript-derived enhancements, the processing comprising applying a grouping process to assign a group of sequential transcript elements to each transcript-derived enhancement; and add the set of transcript-derived enhancements to the augmentation descriptor as augmentations for enhancing the content of the input video; and wherein each of the set of transcript-derived enhancements defines: an in-point, corresponding to a timecode when the enhancement is to be introduced during playback of the output video; and an out-point, corresponding to a timecode when the enhancement is to be removed during playback of the output video, the system being configured for just-in-time video content enhancement wherein the input video is in the form of an input video stream, and the output video is in the form of an output video stream, wherein the augmentation generator of the server is configured to receive and process the input video stream, to generate a corresponding stream of augmentation descriptors from it, the augmentation descriptors including synchronisation data, wherein the augmentation applicator of the client device is configured to receive the stream of augmentation descriptors and synchronise them, using the synchronisation data, with the output video stream, the output video stream being transmitted a time delay after the availability of the input video stream.
  2. The system of claim 1, wherein the set of transcript-derived enhancements comprises a block of text derived from elements of the timecoded transcript, a period of featuring augmentations derived from the block of text being defined by the in-point and the out-point of the respective enhancement.
  3. The system of claim 2, wherein the timecoded transcript comprises: a plurality of transcript elements, each of which comprises a set of values for a respective predetermined set of attributes, the predetermined set of attributes comprising: a name attribute for identifying an individual word of speech determined to be spoken within the audio portion of the input video; and at least one temporal attribute for specifying a timing of the individual word of speech identified by the name attribute; the at least one temporal attribute optionally comprising: a time attribute for specifying a time at which the individual word of speech identified by the name attribute was spoken relative to a common time reference, such as a start time of the input video; and a duration attribute for specifying a duration for which the individual word of speech denoted by the name attribute was spoken; the name attribute optionally further identifies a component of punctuation; wherein the transcript generator may be configured to receive and process a frames portion of the input video to determine lip movements of speakers, the determined lip movements being used to improve the accuracy of a speech-to-text transcription from the audio portion of the input video; wherein processing of the timecoded transcript to generate from it each transcript-derived enhancement may comprise at least one of: generating the block of text of each transcript-derived enhancement by combining the values of the name attribute of each transcript element belonging to the group assigned to that transcript-derived enhancement; and applying a timing optimisation process to determine the in-point and out-point of each transcript-derived enhancement; wherein applying the grouping process comprises at least one of: assigning a first transcript element of the timecoded transcript to a first group; assigning subsequent transcript elements to the same group as a previous transcript element if a size of the group to which a previous transcript element is assigned does not exceed a predetermined group size threshold; and assigning subsequent transcript elements to subsequent groups if: a time gap between subsequent and previous transcript elements, as calculated from the values of their respective temporal attributes, exceeds a predetermined time gap threshold; the value of the name attribute of the subsequent transcript element includes a predetermined word; and/or the value of the name attribute of the previous transcript element includes a predetermined punctuation character; wherein applying the timing optimisation process may comprise at least one of: deriving the in-point and/or out-point of each transcript-derived enhancement from values of temporal attributes of transcript elements belonging to the group assigned to that transcript-derived enhancement; deriving the in-point of each transcript-derived enhancement from the value of at least one temporal attribute of a first of the group of transcript elements assigned to that transcript-derived enhancement; setting the in-point of each transcript-derived enhancement to immediately follow the out-point of a previous transcript-derived enhancement; deriving the out-point of each transcript-derived enhancement from the value of at least one temporal attribute of a last of the group of transcript elements assigned to that transcript-derived enhancement; and deriving at least one of the in-point and out-point of each transcript-derived enhancement in response to determining an optimal and/or a minimum featuring duration for that transcript-derived enhancement; and wherein determining the optimal and/or minimum featuring duration may comprise at least one of: setting a predetermined upper and/or lower featuring duration threshold; setting the optimal and/or minimum featuring duration to be proportional to at least one of: a determined size of the group of transcript elements assigned to that transcript-derived enhancement; and a determined word, syllable and/or character length of the block of text of that transcript-derived enhancement.
  4. The system of claim 1, further comprising a translation engine configured to generate a translation from a source language to a user-selected output language of at least one of: the timecoded transcript; and blocks of text of each transcript-derived enhancement.
  5. The system of claim 1, wherein the transcript generator is configured to translate from a source language detected in the audio portion of the input video to a user-selected output language.
  6. The system of claim 1, wherein the set of transcript-derived enhancements comprises a plurality of caption-type enhancements, each caption-type enhancement comprising a block of text derived from elements of the time-coded transcript, a period of display of the block of text being defined by the in-point and the out-point of the respective caption-type enhancement; wherein each caption-type enhancement optionally comprises line break data for specifying if and where a line break occurs within the block of text when displayed; the augmentation generator being configured to generate the line break data so that the position of a line break within the block of text: occurs if the characters of the block of text exceed a predetermined length; occurs prior to a predetermined word in the block of text (such as a conjunction like “and”); and/or does not occur prior to a final word in the block of text; wherein each caption-type enhancement optionally comprises text positioning data for specifying the position of the block of text, when displayed during playback of the output video; wherein the augmentation generator is optionally configured to generate the text positioning data so that the block of text is positioned, by default, at a lower central region of the output video; wherein the augmentation generator further optionally comprises a text position optimisation module configured to: analyse a frames portion of the input video to recognise a predetermined set of visual entities and their screen occupancy over time; process caption-type enhancements to make a position determination about where the associated block of text of the caption-type enhancement is positioned relative to the recognised set of visual entities during the period defined by the in-point and out-point of that caption-type enhancement; and set the text positioning data of that caption-type enhancement in response to the position determination; wherein setting the text positioning data of that caption-type enhancement in response to the position determination minimises occlusion of the recognised visual entity by the block of text of that caption-type enhancement; wherein each of the predetermined set of visual entities has a predetermined visibility score that defines the importance of the non-occlusion of that visual entity by blocks of text; and setting the text positioning data prioritises non-occlusion of visual entities with a higher visibility score; wherein the predetermined visual entities comprise a plurality of speakers, and the augmentation generator is configured to: determine which block of text originates from which of the plurality of speakers; set the text positioning data of a block of text so that it is displayed at a position closer to the speaker from which that block of text originates, than the other of the plurality of speakers; wherein the caption-type enhancement comprises text formatting data for specifying the formatting of the block of text, when displayed during playback of the output video; the augmentation generator being configured to set the text formatting data in dependence on the detected visual properties of the input video; wherein the augmentation generator is configured to set the text formatting data in dependence on a frame resolution of the input video; wherein a predetermined set of attributes of transcript elements of the timecoded transcript comprise a speaker attribute for identifying the speaker of an individual word of speech determined to be spoken within the audio portion of the input video; and the augmentation generator is configured to set the text formatting data of a block of text in dependence on which speaker originated that block of text, so that blocks of text originating from different speakers are visually distinguishable from one another; wherein applying the grouping process comprises assigning transcript elements to different groups if the value of the speaker attribute is different.
  7. The system of claim 1, wherein the system comprises a scene transition detector configured to process the input video to determine at least one scene transition timecode, corresponding to when there is a scene change, the scene transition detector performing a frame comparison operation to determine the at least one scene transition timecode; wherein the frame comparison operation comprises computing a difference between visual properties of consecutive frames, and determining a scene transition when the visual properties change above a predetermined limit; and wherein applying the timing optimisation process comprises: determining if the out-point derived from the value of at least one temporal attribute of the last of the group of transcript elements assigned to a respective transcript-derived enhancement corresponds to a timecode within a predetermined threshold of one of the at least one scene transition timecodes, and if so changing the out-point of the transcript-derived enhancement to be equivalent to that scene transition timecode.
  8. The system of claim 1, further comprising an augmentation editor configured to edit the augmentation descriptor generated by the augmentation generator; wherein the augmentation editor comprises a user interface to receive user input to edit the augmentation descriptor; wherein the user interface displays, and receives user input to control: grouping parameters, such as the predetermined group size threshold and the predetermined time gap threshold; and/or line break parameters, such as the predetermined length of a block of text; wherein the augmentation editor is optionally configured to receive a user command to operate the augmentation generator so that the augmentation generator processes the input video to generate from it the augmentation descriptor; wherein the augmentation editor is optionally configured to present a timeline representative of the input video; wherein the augmentation editor is optionally configured to interface with a scene transition detector to obtain at least one scene transition timecode and display a transition marker of the scene transition at a position on the timeline corresponding to the respective timecode; wherein the augmentation editor is optionally configured to read the augmentation descriptor and, in response, display at least one enhancement element that is representative of a respective transcript-derived enhancement from the augmentation descriptor; wherein the at least one enhancement element is optionally positioned along the timeline at a position corresponding to the in-points and/or out-points of the respective transcript-derived enhancement; wherein the at least one enhancement element is optionally user-interactable to support at least one user interaction that updates the corresponding transcript-derived enhancement thereby updating the augmentation descriptor; wherein the at least one user interaction is at least one of: a movement interaction, a resizing interaction, a splitting interaction, an interconnection interaction, and a content-editing interaction; wherein the at least one user interaction is optionally a movement interaction that moves the at least one enhancement element along the timeline and, in response, updates the in-point and the out-point of the transcript-derived enhancement represented by that enhancement element, but preserves the duration of the period represented by the in-point and out-point; wherein: each enhancement element optionally occupies a space on the timeline delimited by a starting end and a finishing end that are respectively representative of the in-point and out-point of the respective transcript-derived enhancement; and the at least one user interaction is a resizing interaction that shifts either the starting or finishing end of the enhancement element along the timeline and, in response, updates either of the corresponding in-point or the out-point of the transcript-derived enhancement represented by the enhancement element, so as to alter the time period represented by the in-point and out-point; wherein the at least one user interaction is optionally a splitting interaction that splits the enhancement element into two separate enhancement elements, each representing one of two transcript-derived enhancements that contain a respective portion of the block of text of the transcript-derived enhancement originally represented by the enhancement element before the split, wherein the augmentation editor is optionally configured to automatically update the position of one or more enhancement elements that are adjacent to an enhancement element that is receiving a movement or resizing interaction from a user to: preserve the ordering of the enhancement elements; prevent timeline overlap of those one or more adjacent enhancement elements; and/or preserve the size of those one or more adjacent enhancement elements, and so the duration of the associated one or more transcript-derived enhancements, wherein the at least one user interaction is optionally an interconnection interaction that, when user-selected: resizes a specified enhancement element by placing its finishing end to the starting end of a subsequent enhancement element along the timeline; and updates the out-point of the transcript-derived enhancement represented by the specified enhancement element to be substantially equivalent to the value of the in-point of the transcript-derived enhancement represented by the subsequent enhancement element, wherein each enhancement element optionally displays text corresponding to the block of text of the transcript-derived enhancement that it represents, wherein the at least one user interaction is optionally a content-editing interaction that updates at least one of: the content of the block of text; line break data; and text positioning data; of the transcript-derived enhancement that the enhancement element represents.
  9. The system of claim 8, wherein the augmentation editor is configured to issue an indication if it is determined that properties of a transcript-derived enhancement exceed predetermined constraint parameters, the indication being associated with an enhancement element representing the transcript-derived enhancement.
  10. The system of claim 1, wherein the set of transcript-derived enhancements comprises a voice-over-type enhancement that includes a block of text to be transcoded into speech, the augmentation applicator being configured to generate speech audio from the block of text of the voice-over-type enhancement.
  11. The system of claim 10, wherein a prosody rate of the generated speech audio is controlled by the augmentation applicator so that the timing and duration of the speech audio substantially conforms with the in-point and out-point of the voice-over-type enhancement.
  12. The system of claim 11, wherein audio properties of the generated speech audio are controlled by the augmentation applicator to be similar to the audio properties of a speaker detected from the audio portion of the input video, the speaker originating the block of text of the voice-over-type enhancement.
  13. The system of claim 12, wherein the generated speech audio is pitch-shifted to have a similar vocal range to that of the original speaker within the input video.
  14. The system of claim 10, wherein the voice-over-type enhancement comprises video alteration instructions for altering the output video in dependence on the block of text to be transcoded into speech.
  15. The system of claim 14, wherein the video alteration instructions comprise lip-movement alterations to be made to the input video so that movement of the lips of a speaker within the output video appears to match with the speech audio generated from the block of text of the voice-over-type enhancement.
  16. The system according to claim 14, further comprising a voice-over-type enhancement generator that is configured to process the input video to generate voice-over-type enhancements having the video alteration instructions.
  17. A method of video content enhancement suitable for captioning of live broadcasts from a live broadcaster, the method comprising: processing, at a server, an input video to generate an augmentation descriptor from it, the augmentation descriptor defining augmentations that enhance the content of the input video; and generating, at a client device, an output video by combining the augmentations defined by the augmentation descriptor with the input video, so that playback of the output video features said augmentations; wherein the processing of the input video comprises: deriving a timecoded transcript from at least an audio portion of the input video; processing the timecoded transcript to generate from it a set of transcript-derived enhancements that define an in-point, corresponding to a timecode when the enhancement is to be introduced during playback of the output video, and an out-point, corresponding to a timecode when the enhancement is to be removed during playback of the output video; applying a grouping process to assign a group of sequential transcript elements to each transcript-derived enhancement; and adding the set of transcript-derived enhancements to the augmentation descriptor as augmentations for enhancing the content of the input video, wherein the method provides just-in-time video content enhancement wherein the input video is in the form of an input video stream, and the output video is in the form of an output video stream; wherein the processing of the input video stream generates a corresponding stream of augmentation descriptors from it, the augmentation descriptors including synchronisation data; wherein the stream of augmentation descriptors is synchronised, using the synchronisation data, with the output video stream, the output video stream being transmitted a time delay after the availability of the input video stream.
  18. A video content enhancement system comprising: an augmentation generator configured to receive and process an input video to generate an augmentation descriptor from it, the augmentation descriptor defining augmentations that enhance the content of the input video; and an augmentation applicator configured to generate an output video by combining the augmentations defined by the augmentation descriptor with the input video, so that playback of the output video features said augmentations; wherein the augmentation generator is configured to: transmit at least an audio portion of the input video to a transcript generator and receive in response from the transcript generator a timecoded transcript; process the timecoded transcript to generate from it a set of transcript-derived enhancements; and add the set of transcript-derived enhancements to the augmentation descriptor as augmentations for enhancing the content of the input video; and wherein each of the set of transcript-derived enhancements defines: an in-point, corresponding to a timecode when the enhancement is to be introduced during playback of the output video; and an out-point, corresponding to a timecode when the enhancement is to be removed during playback of the output video; wherein the set of transcript-derived enhancements comprises a block of text derived from elements of the timecoded transcript, a period of featuring augmentations derived from the block of text being defined by the in-point and the out-point of the respective enhancement; wherein the timecoded transcript comprises: a plurality of transcript elements, each of which comprises a set of values for a respective predetermined set of attributes, the predetermined set of attributes comprising: a name attribute for identifying an individual word of speech determined to be spoken within the audio portion of the input video; and at least one temporal attribute for specifying a timing of the individual word of speech identified by the name attribute; wherein processing of the timecoded transcript to generate from it each transcript-derived enhancement comprises: applying a grouping process to assign a group of sequential transcript elements to each transcript-derived enhancement; generating the block of text of each transcript-derived enhancement by combining the values of the name attribute of each transcript element belonging to the group assigned to that transcript-derived enhancement; and applying a timing optimisation process to determine the in-point and out-point of each transcript-derived enhancement.
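
The grouping process recited in claims 3, 6 and 18 can be illustrated compactly. The following is a minimal sketch, not an implementation disclosed by the patent: it assumes word-level transcript elements carrying the name, time, duration and (optional) speaker attributes described in claim 3, and the names `TranscriptElement`, `group_elements` and the threshold constants are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative thresholds; the patent leaves these as "predetermined" values.
MAX_GROUP_SIZE = 8          # maximum transcript elements per enhancement
TIME_GAP_THRESHOLD = 1.5    # seconds of silence that forces a new group
SPLIT_BEFORE_WORDS = {"and", "but", "so"}
SPLIT_AFTER_PUNCTUATION = {".", "?", "!"}

@dataclass
class TranscriptElement:
    name: str                      # individual word (may carry trailing punctuation)
    time: float                    # seconds from a common time reference (e.g. video start)
    duration: float                # seconds the word was spoken for
    speaker: Optional[str] = None  # optional speaker attribute (claim 6)

def group_elements(elements: List[TranscriptElement]) -> List[List[TranscriptElement]]:
    """Assign sequential transcript elements to groups, one group per
    transcript-derived enhancement, following the splitting rules of claim 3."""
    groups: List[List[TranscriptElement]] = []
    for element in elements:
        if not groups:
            groups.append([element])  # first element starts the first group
            continue
        previous = groups[-1][-1]
        start_new_group = (
            len(groups[-1]) >= MAX_GROUP_SIZE                                    # size threshold
            or element.time - (previous.time + previous.duration) > TIME_GAP_THRESHOLD
            or element.name.lower() in SPLIT_BEFORE_WORDS                        # predetermined word
            or previous.name[-1:] in SPLIT_AFTER_PUNCTUATION                     # punctuation boundary
            or element.speaker != previous.speaker                               # speaker change (claim 6)
        )
        if start_new_group:
            groups.append([element])
        else:
            groups[-1].append(element)
    return groups
```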
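
Claims 3 and 7 describe a timing optimisation process that derives each enhancement's in-point and out-point from its assigned group, enforces an optimal or minimum featuring duration, and aligns an out-point with a nearby scene transition. A sketch under the same assumptions as above; the reading-speed constant and the snap threshold are illustrative values, not taken from the patent:

```python
SECONDS_PER_CHARACTER = 0.06   # assumed reading-speed constant
MIN_DURATION = 1.0             # assumed lower featuring-duration threshold
SCENE_SNAP_THRESHOLD = 0.5     # assumed proximity (seconds) for out-point snapping

def derive_timing(group, scene_transitions):
    """Return (in_point, out_point) for one transcript-derived enhancement."""
    text = " ".join(e.name for e in group)
    in_point = group[0].time                          # first element's time attribute
    out_point = group[-1].time + group[-1].duration   # last element's time plus duration
    # Minimum featuring duration proportional to the character length of the text (claim 3).
    minimum = max(MIN_DURATION, len(text) * SECONDS_PER_CHARACTER)
    out_point = max(out_point, in_point + minimum)
    # Claim 7: if the out-point falls near a scene transition, align it with the cut.
    for transition in scene_transitions:
        if abs(out_point - transition) <= SCENE_SNAP_THRESHOLD:
            out_point = transition
            break
    return in_point, out_point
```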
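
The frame comparison operation of claim 7 (a difference between visual properties of consecutive frames, with a scene transition declared when the change exceeds a predetermined limit) might be realised as follows. OpenCV is assumed purely for frame access; the patent does not name any library, and mean greyscale difference is only one possible choice of "visual properties":

```python
import cv2  # assumed dependency for reading frames; not specified by the patent

def detect_scene_transitions(path: str, difference_limit: float = 30.0):
    """Return timecodes (seconds) where the mean absolute difference between
    consecutive greyscale frames exceeds a predetermined limit (claim 7)."""
    capture = cv2.VideoCapture(path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
    transitions, previous, index = [], None, 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        grey = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous is not None and cv2.absdiff(grey, previous).mean() > difference_limit:
            transitions.append(index / fps)
        previous = grey
        index += 1
    capture.release()
    return transitions
```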
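
Claim 6's line break rules (break only when a block of text exceeds a predetermined length, prefer breaking before a predetermined word such as a conjunction, and never break immediately before the final word) can be sketched as a small helper. The character limit and conjunction list below are assumptions:

```python
MAX_LINE_LENGTH = 37          # assumed character limit per caption line
BREAK_BEFORE = {"and", "but", "because"}

def place_line_break(block_of_text: str) -> str:
    """Insert at most one line break into a caption block following claim 6."""
    words = block_of_text.split()
    if len(block_of_text) <= MAX_LINE_LENGTH or len(words) < 3:
        return block_of_text          # no break needed, or too short to break safely
    # Candidate break positions exclude the position immediately before the final word.
    candidates = range(1, len(words) - 1)
    preferred = [i for i in candidates if words[i].lower() in BREAK_BEFORE]
    best = preferred[0] if preferred else len(words) // 2
    return " ".join(words[:best]) + "\n" + " ".join(words[best:])
```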

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims priority to UK Patent Application GB 2303606.4 filed on Mar. 12, 2023, the entire disclosure of which is hereby incorporated by reference and relied upon.

BACKGROUND OF THE INVENTION

Field of the Invention

The present invention relates to systems and methods for video content enhancement. In particular, the invention relates to providing video content with enhancements, such as captions, that can be presented in addition to original video content.

The accessibility of video content can be improved by adding augmentations or enhancements to it, such as captions. Captions provide a synchronised text transcription of audio such as speech, providing an audience of the video content another way of accessing important components of that video content. This improves the intelligibility of content, especially for audiences that have hearing impairments. Furthermore, captions that are in the form of subtitles can allow audiences to understand audio dialogue delivered in a different language.

Such enhancements can be added in “post-production” to an otherwise already-completed video because the enhancements concern transforming content that already exists from the video itself, or otherwise is derivable from it. This normally requires user input to add the enhancements in a way that promotes their accuracy and accessibility. For example, captions may be typed out, or even delivered via respeaking, with the appropriate punctuation, emphasis and formatting added to promote comprehension. This is labour-intensive, and often requires special training to ensure reliable output, especially when delivering a just-in-time captioning service for live broadcasts.

Whilst services such as speech-to-text voice recognition can provide a basic way of automating aspects of video content enhancement, the quality of output tends to be unsuitable for most use-cases. An improvement is therefore required in the automation of video content enhancement, and additionally in support of user-controlled video content enhancement. It is against this background that the present invention has been conceived.

BRIEF SUMMARY OF THE INVENTION

According to a first aspect of the present invention there is provided a system for video content enhancement, the system comprising: an augmentation generator configured to receive and process an input video to generate an augmentation descriptor from it, the augmentation descriptor defining augmentations that enhance the content of the input video; and an augmentation applicator configured to generate an output video by combining the augmentations defined by the augmentation descriptor with the input video, so that playback of the output video features said augmentations.

Preferably, the augmentations defined by the augmentation descriptor comprise transcript-derived enhancements.

Preferably, the augmentation generator is configured to: transmit at least an audio portion of the input video to a transcript generator and receive in response from the transcript generator a time-coded transcript; process the time-coded transcript to generate from it a set of transcript-derived enhancements; and add the set of transcript-derived enhancements to the augmentation descriptor as augmentations for enhancing the content of the input video.

Preferably, each of the set of transcript-derived enhancements defines: an in-point, corresponding to a timecode when the enhancement is to be introduced during playback of the output video; and an out-point, corresponding to a timecode when the enhancement is to be removed during playback of the output video.

Preferably, the set of transcript-derived enhancements comprises a block of text derived from elements of the time-coded transcript, a period of featuring augmentations derived from the block of text being defined by the in-point and the out-point of the respective enhancement.

Preferably, the set of transcript-derived enhancements comprises a plurality of sequential caption-type enhancements, each caption-type enhancement comprising a block of text derived from elements of the time-coded transcript, a period of display of the block of text being defined by the in-point and the out-point of the respective caption-type enhancement.

Preferably, the system comprises, or is in communication with, a transcript generator. The transcript generator may be configured to receive and process an audio portion of the input video to transcribe from it a timecoded transcript. The timecoded transcript may comprise a plurality of transcript elements, each of which comprises a set of values for a respective predetermined set of attributes. The predetermined set of attributes may comprise: a name attribute for identifying an individual word of speech determined to be spoken within the audio portion of the input video; at least one temporal attribute for specifying a timing of the individual word of speech identified by the name attribute
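
To make the augmentation descriptor of the first aspect concrete, the following is a hypothetical example of what a generated descriptor might contain for a short stretch of video. The patent does not prescribe a serialisation format or these field names; they are illustrative only.

```python
# Hypothetical augmentation descriptor; field names and values are illustrative only.
augmentation_descriptor = {
    "synchronisation": {"reference": "video_start", "offset_seconds": 0.0},
    "augmentations": [
        {
            "type": "caption",
            "in_point": 12.40,           # timecode (s) at which the caption is introduced
            "out_point": 15.10,          # timecode (s) at which the caption is removed
            "block_of_text": "Welcome back to\nthe evening news.",
            "position": "lower-centre",  # default lower central region (claim 6)
            "speaker": "S1",
        },
        {
            "type": "voice-over",
            "in_point": 16.00,
            "out_point": 19.20,
            "block_of_text": "Bienvenue dans le journal du soir.",
            "prosody_rate": 1.08,        # stretch/compress speech to fit the window (claim 11)
        },
    ],
}
```

The augmentation applicator would read such a descriptor alongside the (possibly time-delayed) output stream, using the synchronisation data to feature each augmentation between its in-point and out-point.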