
US-20260128041-A1 - USING METADATA FOR IMPROVED TRANSCRIPTION SEARCH


Abstract

Systems and methods for using metadata for improved transcription search are disclosed. In an example method, a computing system receives an audio stream from a client device. The method further involves predicting text for the audio stream using a speech-to-text model, including determining multiple segments, each segment including one or more terms and a confidence value for each term. The method further involves, for each segment, ranking the terms according to the confidence values. The method further involves generating a transcription including a highest-ranked prediction for each segment and metadata including the remaining lower-ranked predicted text for each segment. The method further involves providing a graphical user interface to the client device including the transcription and the metadata. The method further involves receiving, from the client device, revisions to the transcription and updating the transcription. The method further involves updating the speech-to-text model using the user revisions.
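
As a rough sketch of the claimed flow, the following Python fragment illustrates ranking per-segment predictions by confidence and splitting them into a transcription plus metadata. All names and structures here are illustrative assumptions, not the patent's implementation; the speech-to-text model itself is out of scope.

```python
from dataclasses import dataclass

@dataclass
class TermPrediction:
    text: str          # a predicted syllable, word, or phrase
    confidence: float  # predicted likelihood the term was spoken, 0.0-1.0

def build_transcription(segments: list[list[TermPrediction]]):
    """Rank each segment's candidate terms by confidence; the top-ranked
    term goes into the transcription and the rest become metadata."""
    transcription, metadata = [], []
    for candidates in segments:
        ranked = sorted(candidates, key=lambda t: t.confidence, reverse=True)
        transcription.append(ranked[0].text)  # highest-ranked prediction
        metadata.append(ranked[1:])           # lower-ranked alternatives
    return transcription, metadata
```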

Inventors

  • Brandon Kevin Roper

Assignees

  • ZOOM COMMUNICATIONS, INC.

Dates

Publication Date
2026-05-07
Application Date
2025-12-18

Claims (20)

  1. A method, comprising: receiving an audio stream from a client device; predicting text for the audio stream using a speech-to-text model, comprising determining a plurality of segments, each segment comprising one or more terms and a confidence value for each term; for each segment of the plurality of segments, ranking the one or more terms according to the confidence values; generating a transcription comprising a highest-ranked prediction for each corresponding segment, the transcription comprising metadata comprising the remaining lower-ranked predicted text for each transcribed segment; providing a graphical user interface (“GUI”) to the client device comprising the transcription and a representation of the metadata; receiving, from the client device, one or more user revisions to the transcription; updating the transcription based on the one or more user revisions; and updating the speech-to-text model based on the one or more user revisions.
  2. The method of claim 1, wherein: the client device is a participant in a video conference including a plurality of participants; and the audio stream is generated by an audio input device of the client device during or following the video conference, the audio input device capturing a spoken voice.
  3. The method of claim 1, wherein the one or more terms comprise one or more of a syllable, a word, or a phrase.
  4. The method of claim 1, wherein the confidence value for each term comprises a percentage corresponding to a predicted likelihood that the term corresponds to the term spoken by a user of the client device during the segment.
  5. The method of claim 1, wherein generating the transcription comprises: for each segment of the plurality of segments, associating each term of the one or more terms and the confidence value associated with the term with an identifier of the segment.
  6. The method of claim 1, wherein the one or more user revisions to the transcription comprise at least one of an added term, a deleted term, or an edited term.
  7. The method of claim 1, wherein updating the speech-to-text model based on the one or more user revisions comprises re-training the speech-to-text model.
  8. A non-transitory computer-readable storage medium storing processor-executable instructions configured to cause one or more processors to: receive an audio stream from a client device; predict text for the audio stream using a speech-to-text model, comprising determining a plurality of segments, each segment comprising one or more terms and a confidence value for each term; for each segment of the plurality of segments, rank the one or more terms according to the confidence values; generate a transcription comprising a highest-ranked prediction for each corresponding segment, the transcription comprising metadata comprising the remaining lower-ranked predicted text for each transcribed segment; provide a graphical user interface (“GUI”) to the client device comprising the transcription and a representation of the metadata; receive, from the client device, one or more user revisions to the transcription; update the transcription based on the one or more user revisions; and update the speech-to-text model based on the one or more user revisions.
  9. The non-transitory computer-readable storage medium of claim 8, wherein: the client device is a participant in a video conference including a plurality of participants; and the audio stream is generated by an audio input device of the client device during or following the video conference, the audio input device capturing a spoken voice.
  10. The non-transitory computer-readable storage medium of claim 8, wherein the one or more terms comprise one or more of a syllable, a word, or a phrase.
  11. The non-transitory computer-readable storage medium of claim 8, wherein the confidence value for each term comprises a percentage corresponding to a predicted likelihood that the term corresponds to the term spoken by a user of the client device during the segment.
  12. The non-transitory computer-readable storage medium of claim 8, wherein the instruction to generate the transcription comprises: for each segment of the plurality of segments, associating each term of the one or more terms and the confidence value associated with the term with an identifier of the segment.
  13. The non-transitory computer-readable storage medium of claim 8, wherein the one or more user revisions to the transcription comprise at least one of an added term, a deleted term, or an edited term.
  14. The non-transitory computer-readable storage medium of claim 8, wherein the instruction to update the speech-to-text model based on the one or more user revisions comprises re-training the speech-to-text model.
  15. A system comprising: one or more non-transitory computer-readable media; and one or more processors communicatively coupled to the one or more non-transitory computer-readable media, the one or more processors configured to execute processor-executable instructions stored in the non-transitory computer-readable media to: receive an audio stream from a client device; predict text for the audio stream using a speech-to-text model, comprising determining a plurality of segments, each segment comprising one or more terms and a confidence value for each term; for each segment of the plurality of segments, rank the one or more terms according to the confidence values; generate a transcription comprising a highest-ranked prediction for each corresponding segment, the transcription comprising metadata comprising the remaining lower-ranked predicted text for each transcribed segment; provide a graphical user interface (“GUI”) to the client device comprising the transcription and a representation of the metadata; receive, from the client device, one or more user revisions to the transcription; update the transcription based on the one or more user revisions; and update the speech-to-text model based on the one or more user revisions.
  16. The system of claim 15, wherein: the client device is a participant in a video conference including a plurality of participants; and the audio stream is generated by an audio input device of the client device during or following the video conference, the audio input device capturing a spoken voice.
  17. The system of claim 15, wherein the one or more terms comprise one or more of a syllable, a word, or a phrase.
  18. The system of claim 15, wherein the confidence value for each term comprises a percentage corresponding to a predicted likelihood that the term corresponds to the term spoken by a user of the client device during the segment.
  19. The system of claim 15, wherein the instruction to generate the transcription comprises: for each segment of the plurality of segments, associating each term of the one or more terms and the confidence value associated with the term with an identifier of the segment.
  20. The system of claim 15, wherein the instruction to update the speech-to-text model based on the one or more user revisions comprises re-training the speech-to-text model.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of and claims priority to U.S. Ser. No. 18/087,158, entitled “Using Metadata for Improved Transcription Search” and filed on Dec. 22, 2022, the entire disclosure of which is incorporated herein by reference for any purpose.

FIELD

The present disclosure relates generally to improving searching within a transcription, and more particularly, to using metadata to improve identifying desired information within a transcription.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is an illustration depicting an example video conferencing system in accordance with various embodiments.

FIGS. 2A and 2B are illustrations depicting an example video conferencing system in accordance with various embodiments.

FIGS. 3A, 3B, 3C, and 3D are illustrations of example graphical user interfaces (“GUIs”) in accordance with various embodiments.

FIGS. 4A and 4B are flow charts depicting processes for identifying desired information within a transcription in accordance with various embodiments.

FIG. 5 shows an example computing device suitable for use with systems and methods in accordance with various embodiments.

DETAILED DESCRIPTION

In the following description, for the purposes of explanation, specific details are set forth in order to provide a thorough understanding of certain embodiments. However, it will be apparent that various embodiments may be practiced without these specific details. The figures and description are not intended to be restrictive. The word “exemplary” is used herein to mean “serving as an example, instance, or illustration.” Any embodiment or design described herein as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments or designs.

With the evolution of media and communications, there have been many advancements, including computer-generated speech-to-text transcription and predictive closed captioning. However, computer-generated speech-to-text is often inconsistent and prone to errors. One frequently occurring error is the inaccurate transcription of words that sound similar. Inconsistent transcription can be problematic for hearing-impaired users and other users relying on the transcript for different operations. For example, for users trying to search through a lengthy transcript for specific words and phrases, one wrong transcription can cause them to miss crucial information. The longer the transcript, the harder it is to find key words, especially if the transcription contains errors.

The present disclosure relates generally to improving searching within a transcription, and more particularly, to using metadata to improve identifying desired information within a computer-generated transcription. The present disclosure can be adapted to work with, or constructed using, any combination of speech-to-text transcription systems and methods. As part of the automated speech-to-text transcription, the present disclosure collects metadata for all potential terms or phrases considered during the speech-to-text transcription. Specifically, when a speech-to-text conversion occurs, there are confidence levels associated with each word being converted. The word with the highest confidence value is used in the transcription, while the alternative words with lower confidence values are stored as metadata associated with the selected high-confidence word. This is in contrast to traditional systems, which may discard the alternative words.
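
As a concrete, hypothetical illustration of such a record, one converted segment might associate the selected word, its retained alternatives, and their confidence values with a segment identifier. The field names below are assumptions for illustration, not terms from the disclosure:

```python
# Hypothetical per-segment record: the selected word plus the retained
# lower-confidence alternatives, keyed by a segment identifier.
segment_record = {
    "segment_id": 17,
    "selected": {"text": "to", "confidence": 0.71},
    "alternatives": [                      # kept as metadata, not discarded
        {"text": "too", "confidence": 0.22},
        {"text": "two", "confidence": 0.07},
    ],
}
```
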
Even though the selected word has the highest confidence according to the model, it may not be the correct word; one of the alternative words with a lower confidence value may instead be the correct transcription of the speech. For example, the word “their” could be transcribed with confidence levels of: there (80%), their (75%), they're (48%), they are (2%). In conventional transcripts, the word “there” would be used, so a user searching for the word “their” would not find that part of the conversation. However, because the present disclosure links metadata to the selected word “there” that includes the word “their,” a search of the metadata can return that part of the conversation. Likewise, if this metadata is included with the transcript, then when a searcher looks for “there,” the search could highlight all the other possible transcriptions from the metadata that were not selected (e.g., their, they're, they are).

There are other advantages to providing a user with access to the alternative words in addition to the selected words within the transcript. One advantage is the ability to improve the artificial intelligence (AI), machine learning (ML), natural language processing (NLP), etc. used in the computer-generated transcription. For example, when a user corrects a word within a transcript, that correction can be used to update that transcription and/or retrain the AI, ML, NLP, etc. This provides a feedback loop for improving the accuracy of future transcriptions.
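
To make the search and correction behavior above concrete, here is a minimal sketch assuming the hypothetical segment-record structure shown earlier; the function names are illustrative, not from the disclosure:

```python
def search_transcript(records, query):
    """Return segments whose selected word or any metadata alternative
    matches the query, so near-miss transcriptions remain findable."""
    query = query.lower()
    hits = []
    for rec in records:
        terms = [rec["selected"]["text"]] + [a["text"] for a in rec["alternatives"]]
        if any(query == t.lower() for t in terms):
            hits.append(rec)
    return hits

def apply_user_revision(rec, corrected_text):
    """Record a user's correction; the corrected segment can later serve
    as a training example when updating or re-training the model."""
    rec["user_correction"] = corrected_text
    return rec

# The "their"/"there" example from above: the search succeeds via metadata.
records = [{"segment_id": 3,
            "selected": {"text": "there", "confidence": 0.80},
            "alternatives": [{"text": "their", "confidence": 0.75},
                             {"text": "they're", "confidence": 0.48},
                             {"text": "they are", "confidence": 0.02}]}]
assert search_transcript(records, "their")  # found despite "there" being selected
```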