US-12620389-B2 - Predictor-corrector method for including speech hints in automatic speech recognition
Abstract
A method comprises: receiving an automatic speech recognition (ASR) text transcript generated by an ASR process that encoded input audio into audio encodings and converted the audio encodings to ASR words of the ASR text transcript that correspond to the audio encodings; receiving speech hints for non-standard words, and generating alternative words for an ASR word of the ASR words based on the speech hints; correlating an audio encoding of the audio encodings that corresponds to the ASR word against the ASR word and each of the alternative words, to produce correspondence scores; selecting an output word among the ASR word and the alternative words based on the correspondence scores; and providing the output word to a corrected transcript.
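The select-best-candidate step described in the abstract can be sketched as follows. This is a hypothetical illustration, not the patented implementation: the function names, the toy scoring function, and the candidate words are all assumptions made for demonstration.

```python
from typing import Callable, List, Sequence

def correct_word(
    audio_encoding: Sequence[float],
    asr_word: str,
    alternatives: List[str],
    score: Callable[[Sequence[float], str], float],
) -> str:
    """Correlate the audio encoding against the ASR word and each
    alternative, then return the candidate with the highest
    correspondence score (the ASR word itself stays a candidate)."""
    candidates = [asr_word] + alternatives
    scores = [score(audio_encoding, w) for w in candidates]
    return candidates[scores.index(max(scores))]

# Toy scorer (illustrative only): prefers candidates whose length
# matches the encoding length; a real system would use a trained model.
toy_score = lambda enc, w: -abs(len(enc) - len(w))

print(correct_word([0.1, 0.2, 0.3, 0.4], "k", ["kyra"], toy_score))  # → "kyra"
```

The key property shown is that the original ASR word is never discarded outright: it competes with the hint-derived alternatives on the same correspondence score, which is how the method balances recall on hints against general word error rate.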
Inventors
- Mohamed Hariri Nokob
- Kareem Aladdin Nassar
Assignees
- CISCO TECHNOLOGY, INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-11-30
Claims (20)
- 1. A method comprising: receiving an automatic speech recognition (ASR) text transcript generated by an ASR process that encoded input audio into audio encodings and converted the audio encodings to ASR words of the ASR text transcript that correspond to the audio encodings; receiving speech hints for non-standard words, and generating, for an ASR word of the ASR words that is represented as an ASR token sequence, alternative words as alternative token sequences based on the speech hints; correlating an audio encoding of the audio encodings that corresponds to the ASR word against the ASR token sequence for the ASR word and each of the alternative token sequences for the alternative words, to produce correspondence scores; selecting an output word among the ASR word and the alternative words by selecting an output token sequence among the ASR token sequence and the alternative token sequences based on the correspondence scores; and providing the output token sequence as the output word to a corrected transcript.
- 2. The method of claim 1, wherein: the speech hints include out-of-vocabulary words that are not general vocabulary words.
- 3. The method of claim 1, wherein: generating includes manipulating the ASR token sequence based on the speech hints to produce the alternative words.
- 4. The method of claim 3, wherein manipulating includes one or more of: replacing an ASR token of the ASR token sequence with a hint token from a speech hint; adding the hint token to the ASR token sequence; and deleting the ASR token from the ASR token sequence.
- 5. The method of claim 1, further comprising: prior to correlating, aligning the audio encoding corresponding to the ASR word.
- 6. The method of claim 1, wherein: correlating includes correlating using a neural model previously trained on training data to produce the correspondence scores to indicate how closely the audio encodings correspond to the ASR word and each of the alternative words.
- 7. The method of claim 6, wherein: the neural model was previously trained on correlations of training audio encodings against training words labeled to indicate correspondence or non-correspondence of the training words to the training audio encodings.
- 8. An apparatus comprising: a network input/output interface to communicate with a network; and a processor coupled to the network input/output interface and configured to perform: receiving an automatic speech recognition (ASR) text transcript generated by an ASR process that encoded input audio into audio encodings and converted the audio encodings to ASR words of the ASR text transcript that correspond to the audio encodings; receiving speech hints for non-standard words, and generating, for an ASR word of the ASR words that is represented as an ASR token sequence, alternative words as alternative token sequences based on the speech hints; correlating an audio encoding of the audio encodings that corresponds to the ASR word against the ASR token sequence for the ASR word and each of the alternative token sequences for the alternative words, to produce correspondence scores; selecting an output word among the ASR word and the alternative words by selecting an output token sequence among the ASR token sequence and the alternative token sequences based on the correspondence scores; and providing the output token sequence as the output word to a corrected transcript.
- 9. The apparatus of claim 8, wherein: the speech hints include out-of-vocabulary words that are not general vocabulary words.
- 10. The apparatus of claim 8, wherein: the processor is configured to perform generating by manipulating the ASR token sequence based on the speech hints to produce the alternative words.
- 11. The apparatus of claim 10, wherein the processor is configured to perform manipulating by one or more of: replacing an ASR token of the ASR token sequence with a hint token from a speech hint; adding the hint token to the ASR token sequence; and deleting the ASR token from the ASR token sequence.
- 12. The apparatus of claim 8, wherein the processor is further configured to perform: prior to correlating, aligning the audio encoding corresponding to the ASR word.
- 13. The apparatus of claim 8, wherein: the processor is configured to perform correlating by correlating using a neural model previously trained on training data to produce the correspondence scores to indicate how closely the audio encodings correspond to the ASR word and each of the alternative words.
- 14. The apparatus of claim 13, wherein: the neural model was previously trained on correlations of training audio encodings against training words labeled to indicate correspondence or non-correspondence of the training words to the training audio encodings.
- 15. A non-transitory computer medium encoded with instructions that, when executed by a processor, cause the processor to perform: receiving an automatic speech recognition (ASR) text transcript generated by an ASR process that encoded input audio into audio encodings and converted the audio encodings to ASR words of the ASR text transcript that correspond to the audio encodings; receiving speech hints for non-standard words, and generating, for an ASR word of the ASR words that is represented as an ASR token sequence, alternative words as alternative token sequences based on the speech hints; correlating an audio encoding of the audio encodings that corresponds to the ASR word against the ASR token sequence for the ASR word and each of the alternative token sequences for the alternative words, to produce correspondence scores; selecting an output word among the ASR word and the alternative words by selecting an output token sequence among the ASR token sequence and the alternative token sequences based on the correspondence scores; and providing the output token sequence as the output word to a corrected transcript.
- 16. The non-transitory computer medium of claim 15, wherein: the speech hints include out-of-vocabulary words that are not general vocabulary words.
- 17. The non-transitory computer medium of claim 15, wherein: the instructions to cause the processor to perform generating include instructions to cause the processor to perform manipulating the ASR token sequence based on the speech hints to produce the alternative words.
- 18. The non-transitory computer medium of claim 17, wherein the instructions to cause the processor to perform manipulating include the instructions to cause the processor to perform one or more of: replacing an ASR token of the ASR token sequence with a hint token from a speech hint; adding the hint token to the ASR token sequence; and deleting the ASR token from the ASR token sequence.
- 19. The non-transitory computer medium of claim 15, wherein: the instructions to cause the processor to perform correlating include instructions to cause the processor to perform correlating using a neural model previously trained on training data to produce the correspondence scores to indicate how closely the audio encodings correspond to the ASR word and each of the alternative words.
- 20. The non-transitory computer medium of claim 19, wherein: the neural model was previously trained on correlations of training audio encodings against training words labeled to indicate correspondence or non-correspondence of the training words to the training audio encodings.
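The token-sequence manipulations recited in claims 4, 11, and 18 (replace, add, delete) can be illustrated with a short sketch. The tokenization, function name, and example tokens below are assumptions for demonstration only; the patent does not specify this tokenizer or this enumeration order.

```python
from typing import List

def manipulate(asr_tokens: List[str], hint_tokens: List[str]) -> List[List[str]]:
    """Generate alternative token sequences from an ASR token sequence:
    replace an ASR token with a hint token, add (insert) a hint token,
    or delete an ASR token, at each position."""
    alternatives = []
    for i in range(len(asr_tokens)):
        for h in hint_tokens:
            # Replace the ASR token at position i with hint token h.
            alternatives.append(asr_tokens[:i] + [h] + asr_tokens[i + 1:])
            # Add (insert) hint token h before position i.
            alternatives.append(asr_tokens[:i] + [h] + asr_tokens[i:])
        # Delete the ASR token at position i.
        alternatives.append(asr_tokens[:i] + asr_tokens[i + 1:])
    return alternatives

# Suppose ASR heard "kira" as tokens ["ki", "ra"] and the speech hint
# "Kyra" contributes the hint token "ky" (hypothetical values).
alts = ["".join(a) for a in manipulate(["ki", "ra"], ["ky"])]
print(alts)
```

Each alternative token sequence produced this way then competes with the original ASR token sequence in the correlation step of claim 1; among the candidates above, the replacement yielding "kyra" is the one a correctly trained correlator should favor when the audio supports it.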
Description
TECHNICAL FIELD

The present disclosure relates generally to improving automatic speech recognition (ASR) transcripts using speech hints and artificial intelligence (AI) techniques.

BACKGROUND

Automatic speech recognition (ASR) systems have become more accurate in recent years due to the availability of large training datasets used to train the ASR models employed by the ASR systems. A problem facing ASR systems is their ineffectiveness when dealing with unseen words. Names, acronyms, initialisms, and the like usually do not appear in the training datasets and thus present a challenge to ASR systems. To help manage such challenges, speech hints may be passed to the ASR system as arguments prior to transcription. Speech hints are special words, such as names, acronyms, or domain-specific words, that are provided by a user, for example. Several methods have been proposed to enhance ASR processing using speech hints, but they have produced unsatisfactory results in practical cases. Typically, adding new words to the dictionary of so-called “end-to-end ASR models” (e.g., neural networks trained wholly on text and audio, as opposed to traditional models that consist of separate parts stitched together) is not natural, because training adjusts the ASR model weights to the general vocabulary without regard to additional out-of-vocabulary (OOV) words. Whatever method is used to add speech hints, it can be suboptimal and likely produces a tradeoff between recall and precision for the speech hints, as well as possible degradation in the general word error rate.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of an automatic speech recognition (ASR) and predictor-corrector system in which embodiments directed to improving ASR transcripts using speech hints and audio encodings generated by ASR may be implemented, according to an example embodiment.

FIG. 2 is a block diagram expanding on the ASR and predictor-corrector system, according to an example embodiment.

FIG. 3 is a block diagram expanding on a corrector of the ASR and predictor-corrector system, according to an example embodiment.

FIG. 4 is an illustration of example machine learning (ML) training of a neural corrector model of the corrector, according to an example embodiment.

FIG. 5 is a flowchart of a method of improving an ASR transcript previously generated by ASR using speech hints and the neural corrector model to produce a corrected transcript, according to an example embodiment.

FIG. 6 illustrates a hardware block diagram of a computing device that may perform functions associated with operations discussed herein, according to an example embodiment.

DETAILED DESCRIPTION

Overview

In an embodiment, a method comprises: receiving an automatic speech recognition (ASR) text transcript generated by an ASR process that encoded input audio into audio encodings and converted the audio encodings to ASR words of the ASR text transcript that correspond to the audio encodings; receiving speech hints for non-standard words, and generating alternative words for an ASR word of the ASR words based on the speech hints; correlating an audio encoding of the audio encodings that corresponds to the ASR word against the ASR word and each of the alternative words, to produce correspondence scores; selecting an output word among the ASR word and the alternative words based on the correspondence scores; and providing the output word to a corrected transcript.

Example Embodiments

List of Definitions and Acronyms

- Tokens: Pieces of words. For example, the word “Kyra” comprises two tokens, “ky” and “ra.”
- General vocabulary words: Words that normally appear in text, including dictionary words.
- OOV words: Out-of-vocabulary words that are not general vocabulary words and may be represented by speech hints.

FIG. 1 is a block diagram of an example automatic speech recognition (ASR) and predictor-corrector system 100 in which embodiments directed to improving ASR transcripts using speech hints (also referred to as “word hints”) and audio encodings generated by ASR may be implemented. ASR and predictor-corrector system 100 includes an ASR system 102 followed by a predictor-corrector system 104. In an example, ASR system 102 and predictor-corrector system 104 may include computer processes or applications hosted on one or more computer devices. ASR system 102 receives input audio 106 including a sequence of audio frames (e.g., a wave file), converts the sequence of audio frames to an ASR transcript 108 including sentences of general vocabulary words, and provides the ASR transcript to predictor-corrector system 104. Internally, ASR system 102 generates audio encodings 110 (also referred to as “ASR encodings”) of the audio frames as an intermediate signal, and provides the audio encodings to predictor-corrector system 104. According to embodiments presented herein, predictor-corrector system 104 receives ASR transcript 108 (after being generated by ASR system 102), audio encodings 110 used t
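The correlation step that the corrector's neural model performs (claims 6-7, FIGS. 3-4) can be approximated with a simple stand-in. The sketch below uses cosine similarity between a pooled audio encoding and a mean token embedding purely as an illustrative surrogate for the trained corrector network; the embedding table, pooling scheme, and all values are assumptions, not the patent's model.

```python
import math
from typing import Dict, List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def correspondence_score(
    audio_encoding: Sequence[float],
    token_seq: List[str],
    token_embeddings: Dict[str, Sequence[float]],
) -> float:
    """Correlate one audio encoding against one candidate token sequence
    by mean-pooling the token embeddings and comparing with cosine."""
    dims = len(audio_encoding)
    pooled = [
        sum(token_embeddings.get(t, [0.0] * dims)[d] for t in token_seq) / len(token_seq)
        for d in range(dims)
    ]
    return cosine(audio_encoding, pooled)

# Hypothetical 2-D embeddings and an audio encoding closer to "ky" than "ki".
emb = {"ky": [1.0, 0.0], "ki": [0.0, 1.0], "ra": [1.0, 1.0]}
enc = [0.9, 0.1]
print(correspondence_score(enc, ["ky", "ra"], emb) >
      correspondence_score(enc, ["ki", "ra"], emb))  # → True
```

In the patent's scheme, this scoring role is played by a neural model trained on audio encodings paired with words labeled as corresponding or non-corresponding (claims 7, 14, and 20); the candidate token sequence with the highest score becomes the output word in the corrected transcript.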