US-12620390-B2 - Flickering reduction with partial hypothesis re-ranking for streaming ASR


Abstract

A method includes processing, using a speech recognizer, a first portion of audio data to generate a first lattice, and generating a first partial transcription for an utterance based on the first lattice. The method includes processing, using the recognizer, a second portion of the data to generate, based on the first lattice, a second lattice representing a plurality of partial speech recognition hypotheses for the utterance and a plurality of corresponding speech recognition scores. For each particular partial speech recognition hypothesis, the method includes generating a corresponding re-ranked score based on the corresponding speech recognition score and whether the particular partial speech recognition hypothesis shares a prefix with the first partial transcription. The method includes generating a second partial transcription for the utterance by selecting the partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score.
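
The re-ranking step summarized above can be sketched in a few lines of Python. This is an illustrative sketch only, not the patented implementation. It rests on several assumptions: the lattice is flattened to an n-best list of (text, score) pairs, scores are log-domain values where higher is better, "shares a prefix" is read as "starts with the words of the previously displayed partial transcription", and the function names and the `penalty` value are hypothetical.

```python
from typing import List, Tuple

# (hypothesis text, speech recognition score); higher score is better (assumed).
Hypothesis = Tuple[str, float]


def shares_prefix(hypothesis: str, previous_partial: str) -> bool:
    """Assumed reading: the hypothesis begins with the previously shown words."""
    prev_words = previous_partial.split()
    return hypothesis.split()[: len(prev_words)] == prev_words


def rerank(hypotheses: List[Hypothesis], previous_partial: str,
           penalty: float = 1.0) -> List[Hypothesis]:
    """Build a new list of (text, re-ranked score); the input scores are not modified."""
    return [
        (text, score if shares_prefix(text, previous_partial) else score - penalty)
        for text, score in hypotheses
    ]


def select_partial(hypotheses: List[Hypothesis], previous_partial: str,
                   penalty: float = 1.0) -> str:
    """Pick the hypothesis with the highest re-ranked score."""
    reranked = rerank(hypotheses, previous_partial, penalty)
    return max(reranked, key=lambda pair: pair[1])[0]


# Hypothetical data: the first partial transcription already on screen,
# and the n-best list read from the second lattice.
first_partial = "play the"
second_nbest = [("pay the bill", -1.0), ("play the beatles", -1.3)]

print(select_partial(second_nbest, first_partial))               # with re-ranking
print(select_partial(second_nbest, first_partial, penalty=0.0))  # raw scores only
```

With the penalty applied, the hypothesis that extends the displayed text wins even though its raw score is slightly lower, so the words already on screen do not change. With a penalty of zero the selection falls back to the raw scores.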

Inventors

Antoine Jean Bruguier
David Qiu
Yanzhang He
Trevor Strohman

Assignees

GOOGLE LLC

Dates

Publication Date: 2026-05-05
Application Date: 2023-07-13

Claims (16)

1. A computer-implemented method executed on data processing hardware that causes the data processing hardware to perform operations comprising: receiving audio data corresponding to an utterance spoken by a user; processing, using a speech recognizer, a first portion of the audio data to generate a first lattice representing a first plurality of partial speech recognition hypotheses for the utterance and a first plurality of speech recognition scores for corresponding ones of the first plurality of partial speech recognition hypotheses; generating a first partial transcription for the utterance by selecting the partial speech recognition hypothesis of the first plurality of partial speech recognition hypotheses having the highest corresponding speech recognition score of the first plurality of speech recognition scores; displaying of the first partial transcription on a display in communication with the data processing hardware; processing, using the speech recognizer, a second portion of the audio data to generate, based on the first lattice, a second lattice representing a second plurality of partial speech recognition hypotheses for the utterance and a second plurality of speech recognition scores for corresponding ones of the second plurality of partial speech recognition hypotheses; for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses, generating a corresponding re-ranked score based on the corresponding speech recognition score of the second plurality of speech recognition scores and whether the particular partial speech recognition hypothesis shares a prefix with the first partial transcription; generating a second partial transcription for the utterance by selecting the partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score; and displaying the second partial transcription on the display, the second partial transcription displayed on the display replacing the first partial transcription displayed on the display, wherein generating the corresponding re-ranked scores does not change values of the second plurality of speech recognition scores of the second lattice.
2. The computer-implemented method of claim 1, wherein generating the corresponding re-ranked score for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses comprises: determining that the particular partial speech recognition hypothesis does not share the prefix with the first partial transcription; and in response to determining that the particular partial speech recognition hypothesis does not share the prefix with the first partial transcription, adjusting the corresponding speech recognition score of the second plurality of speech recognition scores by a pre-determined amount.
3. The computer-implemented method of claim 1, wherein generating the corresponding re-ranked score for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses comprises: determining a distance between the particular partial speech recognition hypothesis and the first partial transcription; and adjusting the corresponding speech recognition score of the second plurality of speech recognition scores based on the distance.
4. The computer-implemented method of claim 1, wherein the operations further comprise: processing, using the speech recognizer, a final portion of the audio data to generate, based on a preceding lattice, a final lattice representing a plurality of full speech recognition hypotheses for the utterance and a final plurality of speech recognition scores for corresponding ones of the plurality of full speech recognition hypotheses; and generating a final transcription for the utterance by selecting the full speech recognition hypothesis of the plurality of full speech recognition hypotheses having the highest corresponding speech recognition score of the final plurality of speech recognition scores.
5. The computer-implemented method of claim 1, wherein the operations further comprise: processing, using the speech recognizer, a third portion of the audio data to generate, based on the second lattice, a third lattice representing a third plurality of partial speech recognition hypotheses for the utterance and a third plurality of speech recognition scores for corresponding ones of the third plurality of partial speech recognition hypotheses; for each particular partial speech recognition hypothesis of the third plurality of partial speech recognition hypotheses, generating a re-ranked score based on the corresponding speech recognition score of the third plurality of speech recognition scores and whether the particular partial speech recognition hypothesis includes the second partial transcription; and generating a third partial transcription for the utterance by selecting the speech recognition hypothesis of the third plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score.
6. The computer-implemented method of claim 1, wherein: the first lattice represents a first streaming partial speech recognition result output from the speech recognizer during a first instance of a beam search; and the second lattice represents a second streaming partial speech recognition result output from the speech recognizer during a second instance of the beam search.
7. The computer-implemented method of claim 1, wherein the speech recognizer comprises an end-to-end speech recognition model.
8. The computer-implemented method of claim 1, wherein: the utterance spoken by the user is captured by a user device associated with the user; and the data processing hardware resides on the user device.
9. A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware and storing instructions that, when executed on the data processing hardware, cause the system to perform operations comprising: receiving audio data corresponding to an utterance spoken by a user; processing, using a speech recognizer, a first portion of the audio data to generate a first lattice representing a first plurality of partial speech recognition hypotheses for the utterance and a first plurality of speech recognition scores for corresponding ones of the first plurality of partial speech recognition hypotheses; generating a first partial transcription for the utterance by selecting the partial speech recognition hypothesis of the first plurality of partial speech recognition hypotheses having the highest corresponding speech recognition score of the first plurality of speech recognition scores; displaying of the first partial transcription on a display in communication with the data processing hardware; processing, using the speech recognizer, a second portion of the audio data to generate, based on the first lattice, a second lattice representing a second plurality of partial speech recognition hypotheses for the utterance and a second plurality of speech recognition scores for corresponding ones of the second plurality of partial speech recognition hypotheses; for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses, generating a corresponding re-ranked score based on the corresponding speech recognition score of the second plurality of speech recognition scores and whether the particular partial speech recognition hypothesis shares a prefix with the first partial transcription; generating a second partial transcription for the utterance by selecting the partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score; and displaying the second partial transcription on the display, the second partial transcription displayed on the display replacing the first partial transcription displayed on the display, wherein generating the corresponding re-ranked scores does not change values of the second plurality of speech recognition scores of the second lattice.
10. The system of claim 9, wherein generating the corresponding re-ranked score for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses comprises: determining that the particular partial speech recognition hypothesis does not share the prefix with the first partial transcription; and in response to determining that the particular partial speech recognition hypothesis does not share the prefix with the first partial transcription, adjusting the corresponding speech recognition score of the second plurality of speech recognition scores by a pre-determined amount.
11. The system of claim 9, wherein generating the corresponding re-ranked score for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses comprises: determining a distance between the particular partial speech recognition hypothesis and the first partial transcription; and adjusting the corresponding speech recognition score of the second plurality of speech recognition scores based on the distance.
12. The system of claim 9, wherein the operations further comprise: processing, using the speech recognizer, a final portion of the audio data to generate, based on a preceding lattice, a final lattice representing a plurality of full speech recognition hypotheses for the utterance and a final plurality of speech recognition scores for corresponding ones of the plurality of full speech recognition hypotheses; and generating a final transcription for the utterance by selecting the full speech recognition hypothesis of the plurality of full speech recognition hypotheses having the highest corresponding speech recognition score of the final plurality of speech recognition scores.
13. The system of claim 9, wherein the operations further comprise: processing, using the speech recognizer, a third portion of the audio data to generate, based on the second lattice, a third lattice representing a third plurality of partial speech recognition hypotheses for the utterance and a third plurality of speech recognition scores for corresponding ones of the third plurality of partial speech recognition hypotheses; for each particular partial speech recognition hypothesis of the third plurality of partial speech recognition hypotheses, generating a re-ranked score based on the corresponding speech recognition score of the third plurality of speech recognition scores and whether the particular partial speech recognition hypothesis includes the second partial transcription; and generating a third partial transcription for the utterance by selecting the speech recognition hypothesis of the third plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score.
14. The system of claim 9, wherein: the first lattice represents a first streaming partial speech recognition result output from the speech recognizer during a first instance of a beam search; and the second lattice represents a second streaming partial speech recognition result output from the speech recognizer during a second instance of the beam search.
15. The system of claim 9, wherein the speech recognizer comprises an end-to-end speech recognition model.
16. The system of claim 9, wherein: the utterance spoken by the user is captured by a user device associated with the user; and the data processing hardware resides on the user device.
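
The distance-based adjustment (claims 3 and 11), the chaining of partial results across successive lattices (claims 5 and 13), and the final selection on unadjusted scores (claims 4 and 12) can be illustrated together. The following Python sketch is hypothetical and makes assumptions the claims do not state: each lattice is reduced to an n-best list, the distance is a word-level edit distance between the previous partial transcription and the leading words of each hypothesis, the adjustment is a linear penalty with an arbitrary `weight`, and all names and data are invented for illustration.

```python
from typing import List, Tuple

Hypothesis = Tuple[str, float]  # (text, score); higher is better (assumed)


def word_edit_distance(a: str, b: str) -> int:
    """Levenshtein distance over whitespace-separated words."""
    x, y = a.split(), b.split()
    row = list(range(len(y) + 1))
    for i, xw in enumerate(x, start=1):
        prev_diag, row[0] = row[0], i
        for j, yw in enumerate(y, start=1):
            cur = min(row[j] + 1, row[j - 1] + 1, prev_diag + (xw != yw))
            prev_diag, row[j] = row[j], cur
    return row[len(y)]


def distance_to_previous(hypothesis: str, previous_partial: str) -> int:
    """Assumed distance: how many already-displayed words the hypothesis would change."""
    n = len(previous_partial.split())
    head = " ".join(hypothesis.split()[:n])
    return word_edit_distance(head, previous_partial)


def pick_by_raw_score(nbest: List[Hypothesis]) -> str:
    """Used for the first partial and the final transcription: no re-ranking."""
    return max(nbest, key=lambda pair: pair[1])[0]


def pick_partial(nbest: List[Hypothesis], previous_partial: str, weight: float) -> str:
    """Re-ranked scores live in a separate list; the n-best scores stay unchanged."""
    reranked = [(t, s - weight * distance_to_previous(t, previous_partial))
                for t, s in nbest]
    return max(reranked, key=lambda pair: pair[1])[0]


def stream_partials(lattices: List[List[Hypothesis]], weight: float = 0.5) -> List[str]:
    """Return the partial transcription displayed after each lattice."""
    shown: List[str] = []
    for nbest in lattices:
        if not shown:
            shown.append(pick_by_raw_score(nbest))
        else:
            shown.append(pick_partial(nbest, shown[-1], weight))
    return shown


# Hypothetical n-best lists for three partial lattices and one final lattice.
lattices = [
    [("play the", -0.5), ("pay the", -0.9)],
    [("pay the bill", -1.0), ("play the beatles", -1.3)],
    [("play the beatles song", -1.6), ("pay the bills on", -1.8)],
]
final_nbest = [("play the beatles songs", -2.0), ("pay the bills on", -2.4)]

partials = stream_partials(lattices)               # with re-ranking
baseline = stream_partials(lattices, weight=0.0)   # raw scores only
final = pick_by_raw_score(final_nbest)
print(partials, baseline, final, sep="\n")
```

In this toy data the baseline switches from "play the" to "pay the bill" and back to "play the beatles song", which is the kind of flicker the claims target, while the re-ranked sequence only ever appends words. Raising or lowering `weight` trades stability of the displayed text against fidelity to the raw recognizer scores.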

Description

CROSS REFERENCE TO RELATED APPLICATIONS

This U.S. patent application claims priority under 35 U.S.C. § 119(e) to U.S. Provisional Application 63/369,216, filed on Jul. 22, 2022. The disclosure of this prior application is considered part of the disclosure of this application and is hereby incorporated herein by reference in its entirety.

TECHNICAL FIELD

This disclosure relates to flickering reduction with partial hypothesis re-ranking for streaming automatic speech recognition (ASR).

BACKGROUND

Modern automatic speech recognition (ASR) systems focus on providing not only high quality (e.g., a low word error rate), but also low latency (e.g., a short delay between the user speaking and a transcription appearing) speech recognition for spoken utterances. For example, when using a device that implements an ASR system, there is often an expectation that the ASR system decodes utterances in a streaming fashion that corresponds to real-time or even faster than real-time.

SUMMARY

One aspect of the disclosure provides a computer-implemented method for flickering reduction with partial hypothesis re-ranking for streaming automatic speech recognition (ASR). The computer-implemented method, when executed on data processing hardware, causes the data processing hardware to perform operations including: receiving audio data corresponding to an utterance spoken by a user; processing, using a speech recognizer, a first portion of the audio data to generate a first lattice representing a first plurality of partial speech recognition hypotheses for the utterance and a first plurality of speech recognition scores for corresponding ones of the first plurality of partial speech recognition hypotheses; and generating a first partial transcription for the utterance by selecting the partial speech recognition hypothesis of the first plurality of partial speech recognition hypotheses having the highest corresponding speech recognition score of the first plurality of speech recognition scores. The operations also include: processing, using the speech recognizer, a second portion of the audio data to generate, based on the first lattice, a second lattice representing a second plurality of partial speech recognition hypotheses for the utterance and a second plurality of speech recognition scores for corresponding ones of the second plurality of partial speech recognition hypotheses; for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses, generating a corresponding re-ranked score based on the corresponding speech recognition score of the second plurality of speech recognition scores and whether the particular partial speech recognition hypothesis shares a prefix with the first partial transcription; and generating a second partial transcription for the utterance by selecting the partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses having the highest corresponding re-ranked score.

Implementations of the disclosure may include one or more of the following optional features.
In some implementations, generating the corresponding re-ranked score for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses includes determining that the particular partial speech recognition hypothesis does not share the prefix with the first partial transcription and, in response to determining that the particular partial speech recognition hypothesis does not share the prefix with the first partial transcription, adjusting the corresponding speech recognition score of the second plurality of speech recognition scores by a pre-determined amount. In some examples, the operations include selecting the pre-determined amount to adjust an amount of flicker reduction. In some implementations, generating the corresponding re-ranked score for each particular partial speech recognition hypothesis of the second plurality of partial speech recognition hypotheses includes determining a distance between the particular partial speech recognition hypothesis and the first partial transcription, and adjusting the corresponding speech recognition score of the second plurality of speech recognition scores based on the distance. In some examples, the operations also include: processing, using the speech recognizer, a final portion of the audio data to generate, based on a preceding lattice, a final lattice representing a plurality of full speech recognition hypotheses for the utterance and a final plurality of speech recognition scores for corresponding ones of the plurality of full speech recognition hypotheses; and generating a final transcription for the utterance by selecting the full speech recognition hypothesis of the plurality of full speech recognition hypotheses having the highest corresponding speech recognition score of the final plurality of speech recognition scores. In some implementat