EP-4569508-B1 - SEAMLESS SPELLING FOR AUTOMATIC SPEECH RECOGNITION SYSTEMS


Inventors

SIM, Khe Chai
BEAUFAYS, Françoise
STROHMAN, Trevor
ZIVKOVIC, Dan

Dates

Publication Date: 2026-05-06
Application Date: 2022-09-07

Claims (15)

A computer-implemented method (500) that, when executed on data processing hardware (610), causes the data processing hardware (610) to perform operations comprising: receiving audio data (102) characterizing an utterance (101) spoken by a user, the utterance (101) comprising a particular phrase (182) and a sequence of individual characters (186) that provide a correct spelling of the particular phrase (182) spoken after the particular phrase (182); processing, using an automatic speech recognition (ASR) model, the audio data (102) to generate an initial transcription (107) for the utterance (101), the initial transcription (107) comprising a misrecognition of the particular phrase (182) by the ASR model (142) followed by the sequence of individual characters (186) that provide the correct spelling of the particular phrase (182); detecting a spelling structure (103) in the initial transcription (107) subsequent to the misrecognition of the particular phrase (182), wherein detecting the spelling structure (103) comprises executing a weighted finite state transducer (wFST) having a spelling component (310) configured to detect the spelling structure (103) in the initial transcription (107) by identifying one or more spell trigger words (184) in the initial transcription (107) subsequent to the misrecognition of the particular phrase (182); and in response to detecting the spelling structure (103) in the initial transcription (107): extracting, from the initial transcription (107), the sequence of individual characters (186) that provide the correct spelling of the particular phrase (182); constructing a corrected phrase for the misrecognition of the particular phrase (182) from the extracted sequence of individual characters (186); extracting, from the initial transcription (107), the misrecognition of the particular phrase (182); and normalizing the initial transcription (107) to obtain a final transcription (104) for the utterance (101) by replacing the extracted misrecognition of the particular phrase (182) with the corrected phrase constructed from the extracted sequence of individual characters (186).
The method (500) of claim 1, wherein: receiving the audio data (102) comprises receiving the audio data (102) as the user speaks the utterance (101); performing speech recognition on the audio data (102) comprises performing streaming speech recognition on the audio data (102) as the audio data (102) is received to generate, as output from the ASR model (142), streaming speech recognition results; and the operations further comprise providing each streaming speech recognition result generated as output from the ASR model (142) for display on a screen (115) in communication with the data processing hardware (610).
The method (500) of claims 1 or 2, wherein the operations further comprise displaying a partial speech recognition result including the misrecognition of the particular phrase (182) on the screen (115) before the user speaks the sequence of individual characters (186) comprising the spelling of the particular phrase (182).
The method (500) of any preceding claim, wherein detecting the spelling structure (103) comprises executing a weighted finite state transducer (wFST) having a spelling component (310) configured to detect the spelling structure (103) in the initial transcription (107) by identifying one or more initiating trigger words in the initial transcription (107) preceding the misrecognition of the particular phrase (182).
The method (500) of claim 4, wherein the one or more initiating trigger words and the one or more spell trigger words (184) form a predefined spell command.
The method (500) of claim 4 or 5, wherein detecting the spelling structure (103) in the initial transcription (107) is further based on identifying the sequence of individual characters (186) in the initial transcription (107) subsequent to identifying the one or more spell trigger words (184).
The method (500) of any of claims 1-6, wherein the operations further comprise, after constructing the corrected phrase from the extracted sequence of individual characters (186), applying a capitalization normalizer (156) to capitalize a first letter of at least one word in the corrected phrase.
The method (500) of any of claims 1-7, wherein: the particular phrase (182) spoken by the user in the utterance (101) comprises two or more particular words; the sequence of individual characters (186) that provide the correct spelling of the particular phrase (182) comprise two or more spans of consecutive individual characters (186) that each provide a correct spelling for a corresponding one of the two or more particular words; and the utterance (101) spoken by the user further comprises the user speaking a space command (185) between each adjacent pair of the two or more spans of consecutive individual characters (186).
The method (500) of claim 8, wherein: the initial transcription (107) further comprises a space token (185t) inserted between each adjacent pair of the two or more spans of consecutive individual characters (186) of the sequence of individual characters (186); and constructing the corrected phrase from the extracted sequence of individual characters (186) comprises: for each span of consecutive individual characters (186), joining the individual characters (186) to form the corresponding one of the two or more particular words; and replacing each space token (185t) in the initial transcription (107) with a blank space.
The method (500) of any of claims 1-9, wherein the misrecognition of the particular phrase (182) comprises one or more words.
The method (500) of any of claims 1-10, wherein extracting the misrecognition of the particular phrase (182) is based on an edit distance between the sequence of individual characters (186) in the initial transcription (107) and the misrecognition of the particular phrase (182).
The method (500) of any of claims 1-11, wherein extracting the misrecognition of the particular phrase (182) comprises: determining a corresponding edit distance between the sequence of individual characters (186) and each word span of multiple word spans of different lengths in the initial transcription (107), the multiple word spans preceding the sequence of individual characters (186); and identifying the word span of the multiple word spans of different lengths that has a shortest corresponding edit distance to the misrecognition of the particular phrase (182).
The method (500) of claim 12, wherein the corresponding edit distance comprises a corresponding Levenshtein distance dynamically computed for each of the multiple word spans via a single calculation.
The method (500) of any of claims 1-13, wherein detecting the spelling structure (103) in the initial transcription (107) and normalizing the initial transcription (107) to obtain the final transcription (104) occurs without receiving any user input after the user finishes speaking the utterance (101).
A system (100) comprising: data processing hardware (610); and memory hardware (620) in communication with the data processing hardware (610) and storing instructions that, when executed on the data processing hardware (610), cause the data processing hardware (610) to perform the method of any preceding claim.
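
The edit-distance selection recited in the claims above (comparing the spelled character sequence against word spans of different lengths that precede it, and keeping the closest span) can be illustrated with a short Python sketch. This is not the patented implementation. The function name `best_misrecognized_span` and the example inputs are hypothetical. The reversal trick is an assumption about one way a Levenshtein distance for every candidate span could be obtained from a single dynamic-programming pass: because all candidate spans end at the same position, reversing both strings turns each span into a prefix, and one final table row then holds the distance for every span length.

```python
from typing import List, Optional, Tuple


def best_misrecognized_span(
    preceding_words: List[str], spelled: str
) -> Optional[Tuple[int, int]]:
    """Return (start_word_index, distance) of the trailing word span that is
    closest to `spelled`, or None if there are no preceding words.

    Assumption: candidate spans are suffixes of `preceding_words`, i.e. they
    all end right before the spelled character sequence.
    """
    if not preceding_words:
        return None

    text = " ".join(preceding_words).lower()
    target = spelled.lower()
    rev_text, rev_target = text[::-1], target[::-1]

    # prev[j] = Levenshtein distance between rev_target[:i] and rev_text[:j].
    # Row i = 0: turning an empty string into rev_text[:j] costs j insertions.
    prev = list(range(len(rev_text) + 1))
    for i, tc in enumerate(rev_target, 1):
        cur = [i]
        for j, xc in enumerate(rev_text, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (tc != xc)))  # substitution or match
        prev = cur

    # prev[j] is now the distance between `target` and the last j characters
    # of `text`. Read it off at word boundaries only, shortest span first.
    best: Optional[Tuple[int, int]] = None
    suffix_len = 0
    last = len(preceding_words) - 1
    for start in range(last, -1, -1):
        suffix_len += len(preceding_words[start]) + (1 if start < last else 0)
        distance = prev[suffix_len]
        if best is None or distance < best[1]:
            best = (start, distance)
    return best


# Hypothetical example: "kay chai" is the misrecognition of "khe chai".
print(best_misrecognized_span(["my", "name", "is", "kay", "chai"], "khe chai"))
# -> (3, 2): the span starts at word index 3 ("kay chai"), distance 2
```

Ties are resolved in favor of the shorter span, which is a design choice of this sketch and not something the claims specify.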

Description

TECHNICAL FIELD

This disclosure relates to providing seamless spelling for automatic speech recognition (ASR) systems.

BACKGROUND

ASR systems provide a technology that is typically used in mobile devices and/or other devices. In general, ASR systems attempt to provide accurate transcriptions of what a user speaks to a device. However, in some instances, ASR systems may generate transcriptions that do not match what the user intended or actually spoke.

US2017263248A1 describes an electronic device which implements dictation-based editing of textual data. The device receives a natural-language user input and determines whether the natural-language user input includes a predefined editing command. If the natural-language user input includes the predefined editing command, the device modifies the textual data in accordance with the predefined editing command. If the natural-language user input does not include the predefined editing command, the device transcribes the natural-language user input and adds the transcribed text to the textual data.

SUMMARY

One aspect of the present disclosure provides a computer-implemented method that, when executed on data processing hardware, causes the data processing hardware to perform operations that include receiving audio data characterizing an utterance spoken by a user and processing, using an automatic speech recognition (ASR) model, the audio data to generate an initial transcription for the utterance. The utterance includes a particular phrase and a sequence of characters that provide a correct spelling of the particular phrase spoken after the particular phrase. The initial transcription includes a misrecognition of the particular phrase by the ASR model followed by the sequence of individual characters that provide the correct spelling of the particular phrase. The operations also include detecting a spelling structure in the initial transcription subsequent to the misrecognition of the particular phrase.
In response to detecting the spelling structure in the initial transcription, the operations also include extracting, from the initial transcription, the sequence of individual characters that provide the correct spelling of the particular phrase, constructing a corrected phrase for the misrecognition of the particular phrase from the extracted sequence of individual characters, extracting, from the initial transcription, the misrecognition of the particular phrase, and normalizing the initial transcription to obtain a final transcription for the utterance by replacing the extracted misrecognition of the particular phrase with the corrected phrase constructed from the extracted sequence of individual characters.

Implementations of the present disclosure include one or more of the following optional features. In some implementations, receiving the audio data includes receiving the audio data as the user speaks the utterance, and performing speech recognition on the audio data includes performing streaming speech recognition on the audio data as the audio data is received to generate, as output from the ASR model, streaming speech recognition results. In these implementations, the operations also include providing each streaming speech recognition result generated as output from the ASR model for display on a screen in communication with the data processing hardware. In some examples, the operations also include displaying a partial speech recognition result including the misrecognition of the particular phrase on the screen before the user speaks the sequence of individual characters comprising the spelling of the particular phrase.
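
The detect, extract, construct and replace operations summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation: plain token scanning stands in for the weighted finite state transducer, and the trigger word `spelled`, the space token `<space>`, and all function names are hypothetical. The sketch also assumes, for simplicity, that the misrecognition spans as many words as the corrected phrase, rather than selecting the span by edit distance.

```python
from typing import List

SPELL_TRIGGER = ("spelled",)  # hypothetical spell trigger word(s)
SPACE_TOKEN = "<space>"       # hypothetical token emitted for a spoken space command


def find_trigger(tokens: List[str]) -> int:
    """Return the index where the spell trigger starts, or -1 if absent."""
    n = len(SPELL_TRIGGER)
    for i in range(len(tokens) - n + 1):
        if tuple(t.lower() for t in tokens[i:i + n]) == SPELL_TRIGGER:
            return i
    return -1


def build_corrected_phrase(spelled_tokens: List[str]) -> List[str]:
    """Join single characters into words, split on space tokens, and
    capitalize the first letter of each word."""
    words, current = [], []
    for tok in spelled_tokens:
        if tok == SPACE_TOKEN:
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(tok.lower())
    if current:
        words.append("".join(current))
    return [w.capitalize() for w in words]


def normalize(tokens: List[str]) -> str:
    """Replace the misrecognized words before the trigger with the spelled phrase."""
    i = find_trigger(tokens)
    if i < 0:
        return " ".join(tokens)  # no spelling structure: leave text unchanged

    # Collect the run of single characters and space tokens after the trigger.
    j = i + len(SPELL_TRIGGER)
    k = j
    while k < len(tokens) and (tokens[k] == SPACE_TOKEN or len(tokens[k]) == 1):
        k += 1
    spelled = tokens[j:k]
    if not any(t != SPACE_TOKEN for t in spelled):
        return " ".join(tokens)  # trigger word without any spelled characters

    corrected = build_corrected_phrase(spelled)
    # Simplifying assumption: the misrecognition has as many words as the correction.
    start = max(0, i - len(corrected))
    return " ".join(tokens[:start] + corrected + tokens[k:])


initial = "my name is kay chai spelled k h e <space> c h a i".split()
print(normalize(initial))  # -> my name is Khe Chai
```

Treating every one-character token as a spelled letter would misfire on ordinary words such as "a" or "I" that follow the spelled sequence, which is one reason a grammar-based detector such as the wFST named in the disclosure would be preferable in practice.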
In all implementations, detecting the spelling structure includes executing a weighted finite state transducer (wFST) having a spelling component configured to detect the spelling structure in the initial transcription by identifying one or more spell trigger words in the initial transcription subsequent to the misrecognition of the particular phrase and/or one or more initiating trigger words preceding the misrecognition of the particular phrase. The one or more initiating trigger words and the one or more spell trigger words may form a predefined spell command. In some examples, the operations also include, after constructing the corrected phrase from the extracted sequence of individual characters, applying a capitalization normalizer to capitalize a first letter of at least one word in the corrected phrase. The misrecognition of the particular phrase may include one or more words. Optionally, extracting the misrecognition of the particular phrase may be based on an edit distance between the sequence of individual characters in the initial transcription and the misrecognition of the particular phrase. In some implementations, the particular phrase spoken by the user in the utterance includes two or more particular words, the sequence of individual characters that provide the correct spelling of the particular phrase includes two or more s