US-12620397-B2 - Systems and methods for performing live transcription
Abstract
The present disclosure is generally directed to a tangible, non-transitory machine-readable medium that includes machine-readable instructions that, when executed, cause processing circuitry to receive a first indication of multimedia content and a second indication of whether the multimedia content is to be transcribed. The instructions cause the processing circuitry to send content generated from the multimedia content for transcription. The content includes an identifier associated with the multimedia content. Additionally, the instructions cause the processing circuitry to send a request for the content to be transcribed. The request includes or is indicative of the identifier. Moreover, the instructions cause the processing circuitry to receive a transcript for at least a portion of the content and generate transcript metadata that includes timing data and is indicative of text of the transcript. Lastly, the instructions cause the processing circuitry to send the transcript metadata to be combined with the multimedia content.
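As a rough illustration of the transcript metadata the abstract describes (timing data plus the text of the transcript, tied to an identifier for the multimedia content), one might model it as below. Every name here (`TranscriptMetadata`, `content_id`, `start_seconds`, and so on) is an illustrative assumption, not a term defined by the disclosure:

```python
from dataclasses import dataclass

@dataclass
class TranscriptMetadata:
    content_id: str       # identifier associated with the multimedia content
    start_seconds: float  # timing data: where in the content the text begins
    end_seconds: float    # timing data: where in the content the text ends
    text: str             # text of the transcript

def combine_with_content(content_metadata: dict, segment: TranscriptMetadata) -> dict:
    """Attach a transcript segment to the metadata kept for its content."""
    content_metadata.setdefault(segment.content_id, []).append(segment)
    return content_metadata
```

The timing fields are what would later let a player present each piece of text in step with the corresponding audio.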
Inventors
- Timothy Rolf Fassnacht
Assignees
- NBCUNIVERSAL MEDIA, LLC
Dates
- Publication Date
- 20260505
- Application Date
- 20220825
Claims (18)
- 1 . A tangible, non-transitory machine-readable medium comprising machine-readable instructions that, when executed by one or more processors, cause the one or more processors to: receive a first indication of live multimedia content and a second indication of whether the live multimedia content is to be transcribed; cause content generated from the live multimedia content to be sent for live transcription, wherein the content comprises an event identifier associated with the live multimedia content; send a request for the content to be transcribed, wherein the request comprises or is indicative of the event identifier; receive a transcript for at least a first portion of the content, wherein the first portion is less than an entirety of the content; generate, based on the transcript, transcript metadata that comprises timing data and is indicative of text of the transcript; after receiving the transcript, determine whether live transcription of the live multimedia content is complete; upon determining that the live transcription is not complete, receive a second transcript for a second portion of the content that differs from the first portion; generate, based on the second transcript, second transcript metadata; and combine the transcript metadata and the second transcript metadata with the multimedia content.
- 2 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the transcript comprises the event identifier.
- 3 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the content generated from the live multimedia content is audio content.
- 4 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the one or more processors are controlled by a first entity and live transcription of the content is performed by a second entity that is different than the first entity.
- 5 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the instructions, when executed, cause the one or more processors to: determine, based on the second indication, that the live multimedia content is to be transcribed; and receive the event identifier prior to causing the content to be sent for live transcription.
- 6 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the event identifier is indicative of the live transcription to be performed for the live multimedia content.
- 7 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the instructions, when executed, cause the one or more processors to implement a virtual machine that is configured to receive the transcript and generate the transcript metadata.
- 8 . The tangible, non-transitory machine-readable medium of claim 1 , wherein the timing data: is associated with an amount of time that passes between sending the request and receiving the transcript; and enables a computing device to associate the transcript metadata with the multimedia content in a synchronized manner.
- 9 . A machine-implemented method for transcribing multimedia content, the method comprising: receiving a first indication of live multimedia content comprising a live video feed and a second indication of whether the live multimedia content is to be transcribed; sending content generated from the live multimedia content for live transcription, wherein the content comprises an event identifier associated with the live multimedia content; sending a request for the content to be transcribed, wherein the request comprises or is indicative of the event identifier; receiving a transcript for at least a first portion of the content, wherein the first portion is less than an entirety of the content; generating, based on the transcript, transcript metadata that comprises timing data and is indicative of text of the transcript; after receiving the transcript, determining whether live transcription of the live multimedia content is complete; upon determining the live transcription is not complete, receiving a second transcript for a second portion of the content that differs from the first portion; generating, based on the second transcript, second transcript metadata; and combining the transcript metadata and the second transcript metadata with the multimedia content.
- 10 . The machine-implemented method of claim 9 , comprising determining second timing data indicative of a time delay associated with generating and receiving the second transcript, wherein the second transcript metadata comprises the second timing data.
- 11 . The machine-implemented method of claim 9 , comprising transcribing the content using a machine-learning technique.
- 12 . The machine-implemented method of claim 9 , wherein the content generated from the live multimedia content comprises encoded video data.
- 13 . The machine-implemented method of claim 12 , comprising encrypting the encoded video data prior to sending the content for live transcription.
- 14 . The machine-implemented method of claim 9 , wherein the timing data: is associated with an amount of time that passes between sending the request and receiving the transcript; and during playback of the multimedia content, enables the text of the transcript to be provided in a synchronized manner with audio data corresponding to the text.
- 15 . A transcription system, comprising: a central management module comprising one or more first processors, wherein the central management module is configured to send a request to a speech to text module for a transcription of live multimedia content to be completed, wherein the request comprises an event identifier associated with the transcription; and a plurality of transcription subsystems implemented at least partially by one or more second processors, wherein the plurality of transcription subsystems is configured to: receive the live multimedia content; generate the event identifier; generate content from the live multimedia content, wherein the content comprises the event identifier; send the content for transcription by the speech to text module; receive a transcript for at least a first portion of the content, wherein the first portion is less than an entirety of the content; generate, based on the transcript, transcript metadata that is indicative of text of the transcript and comprises timing data; after receiving the transcript, determine whether live transcription of the live multimedia content is complete; upon determining the live transcription is not complete, receive a second transcript for a second portion of the content that differs from the first portion; generate, based on the second transcript, second transcript metadata; and combine the transcript metadata and the second transcript metadata with the multimedia content.
- 16 . The transcription system of claim 15 , wherein: the central management module is controlled by a first entity; and the speech to text module is controlled by a second entity that is different than the first entity.
- 17 . The transcription system of claim 15 , comprising a content management interface implemented by one or more third processors, wherein the content management interface is configured to: receive information regarding the transcription and a status of the transcription; and generate a user interface comprising the information regarding the transcription and the status of the transcription.
- 18 . The transcription system of claim 15 , wherein the content generated from the multimedia content is encoded video data, audio data, or a combination thereof.
Description
BACKGROUND

The present disclosure relates generally to transcribing multimedia content. More particularly, the present disclosure relates to performing transcription of multimedia content, such as video content, in a live (e.g., real-time or near real-time) manner.

This section is intended to introduce the reader to various aspects of art that may be related to various aspects of the present techniques, which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present disclosure. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.

Multimedia content may be associated with text. For example, video content may include spoken words. The spoken words may be reflected in text form, such as a transcript. In some cases, transcription may be performed manually (e.g., by a human being) as content is prepared or broadcast. However, in some cases, transcription may not be performable simultaneously or in a near-simultaneous manner. Moreover, while machine-learning or artificial intelligence (AI) techniques may be used to transcribe spoken words in content, such techniques may appear unadaptable to performing real-time or near real-time transcription of live multimedia content, for example, due to being performed by systems (e.g., computing systems) that are separate from systems used to broadcast or disseminate multimedia content.

BRIEF DESCRIPTION

Certain embodiments commensurate in scope with the originally claimed subject matter are summarized below. These embodiments are not intended to limit the scope of the claimed subject matter, but rather these embodiments are intended only to provide a brief summary of possible forms of the subject matter.
Indeed, the subject matter may encompass a variety of forms that may be similar to or different from the embodiments set forth below.

The current embodiments relate to systems and methods for providing real-time or near real-time transcription of multimedia content, such as video content. Additionally, as discussed below, the techniques provided herein enable text to be temporally aligned with content so that the text (e.g., from a transcription) matches spoken words or other audio content included in the transcribed content. As also described below, the techniques described herein may be scaled to enable simultaneous transcription to occur in several different locations while a central management module may track and provide the status of the transcriptions being performed.

DRAWINGS

These and other features, aspects, and advantages of the present disclosure will become better understood when the following detailed description is read with reference to the accompanying drawings in which like characters represent like parts throughout the drawings, wherein:

FIG. 1 is a block diagram of a transcription system, in accordance with an embodiment of the present disclosure;

FIG. 2 is a block diagram of another transcription system, in accordance with an embodiment of the present disclosure;

FIG. 3 is a flow diagram of an exemplary process for transcribing multimedia content, in accordance with an embodiment of the present disclosure;

FIG. 4 is a block diagram of yet another transcription system, in accordance with an embodiment of the present disclosure; and

FIG. 5 is a user interface that may be provided by the content management interface of FIG. 4, in accordance with an embodiment of the present disclosure.

DETAILED DESCRIPTION

One or more specific embodiments of the present disclosure will be described below. In an effort to provide a concise description of these embodiments, all features of an actual implementation may not be described in the specification.
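The temporal alignment mentioned above can be sketched as follows: if the delay between sending a transcription request and receiving the transcript is measured, caption timestamps can be shifted back by that delay so the text lines up with the audio during playback. This is a hedged illustration assuming timestamps in seconds and a single measured delay; the disclosure does not prescribe this exact computation.

```python
def align_captions(captions, transcription_delay):
    """Shift caption timestamps back by the measured transcription delay
    (the time between sending the request and receiving the transcript)
    so the text plays in a synchronized manner with the audio."""
    return [
        {**c,
         "start": max(0.0, c["start"] - transcription_delay),
         "end": max(0.0, c["end"] - transcription_delay)}
        for c in captions
    ]
```

Clamping at zero keeps a caption from being scheduled before playback begins when the delay exceeds its original start time.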
It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions must be made to achieve the developers' specific goals, such as compliance with system-related and business-related constraints, which may vary from one implementation to another. Moreover, it should be appreciated that such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure.

When introducing elements of various embodiments of the present disclosure, the articles “a,” “an,” “the,” and “said” are intended to mean that there are one or more of the elements. The terms “comprising,” “including,” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.

As set forth above, multimedia content may be associated with text. For example, video content may include spoken words. The spoken words may be reflected in text form, suc