
EP-4581616-B1 - CONTEXTUAL BIASING WITH TEXT INJECTION


Inventors

  • SAINATH, TARA N.
  • PRABHAVALKAR, ROHIT PRAKASH
  • CASEIRO, DIAMANTINO ANTONIO
  • RONDON, PATRICK MAXIM
  • ALLAUZEN, CYRIL

Dates

Publication Date
2026-05-06
Application Date
2023-10-20

Claims (13)

  1. A computer-implemented method (600) when executed on data processing hardware (710) causes the data processing hardware (710) to perform operations comprising: receiving context biasing data (510), the context biasing data (510) comprising a set of unspoken textual utterances (320) corresponding to a particular context (512), each unspoken textual utterance (320) in the set of unspoken textual utterances (320) not paired with any corresponding spoken utterance of speech; obtaining a list of carrier phrases (520) associated with the particular context (512) of the set of unspoken textual utterances (320); for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320), generating a corresponding training data pair (532) comprising the respective unspoken textual utterance (320) paired with a carrier phrase (520) from among the list of carrier phrases (520); for each respective training data pair (532): tokenizing the respective training data pair (532) into a sequence of sub-word units (402); generating, by a text encoder (202), at each of a plurality of output steps, a first higher order textual feature representation (203) for a corresponding sub-word unit (402) in the sequence of sub-word units (402) tokenized from the respective training data pair (532); receiving, as input to a first decoder (250) of a speech recognition model (200), the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps; and generating, by the first decoder (250), at each of the plurality of output steps, a first probability distribution (253) over possible text units; and training the speech recognition model (200) based on the first probability distribution (253) over possible text units generated by the first decoder (250) at each of the plurality of output steps for each respective training data pair (532).
  2. The computer-implemented method (600) of claim 1, wherein the particular context (512) comprises at least one of: a song; a contact; an application; an entity; or a geographic location.
  3. The computer-implemented method (600) of claim 1 or 2, wherein the list of carrier phrases (520) comprises at least one of: call; message; play; open; or directions to.
  4. The computer-implemented method (600) of any of claims 1-3, wherein the operations further comprise: tokenizing the respective training data pair (532) into one or more alternate sequences of sub-word units, each alternate sequence of sub-word units comprising at least one different sub-word unit in the alternate sequence of sub-word units than a corresponding sub-word unit (402) in the sequence of sub-word units (402), wherein the respective training data pair (532) comprises the sequence of sub-word units (402) and the one or more alternate sequences of sub-word units.
  5. The computer-implemented method (600) of any of claims 1-4, wherein the operations further comprise, for each unspoken textual utterance (320) in the set of unspoken textual utterances (320): receiving, as input to a shared audio-text encoder (206) of the speech recognition model (200), the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps; generating, by the shared audio-text encoder (206), at each of the plurality of output steps, a second higher order textual feature representation (207) for a corresponding first higher order textual feature representation (203) in a shared latent representation space; receiving, as input to a second decoder (260) of the speech recognition model (200), the second higher order textual feature representation (207) generated by the shared audio-text encoder (206) at each of the plurality of output steps; and generating, by the second decoder (260), at each of the plurality of output steps, a second probability distribution (263) over possible text units, wherein training the speech recognition model (200) is further based on the second probability distribution (263) over possible text units generated by the second decoder (260) at each of the plurality of output steps for each unspoken textual utterance (320) in the set of unspoken textual utterances (320).
  6. The computer-implemented method (600) of claim 5, wherein the operations further comprise: receiving a set of transcribed speech utterances (304), each transcribed speech utterance (304) in the set of transcribed speech utterances (304) paired with a corresponding transcription (302) and represented by a corresponding sequence of acoustic frames (110); and for each transcribed speech utterance (304) in the set of transcribed speech utterances (304): generating, by an audio encoder (204) of the speech recognition model (200), at each of a plurality of output steps, a first higher order audio feature representation (205) for a corresponding acoustic frame (110) in the sequence of acoustic frames (110) representing the transcribed speech utterance (304); receiving, as input to the first decoder (250) of the speech recognition model (200), the first higher order audio feature representation (205) generated by the audio encoder (204) at each of the plurality of output steps; and generating, by the first decoder (250), at each of the plurality of output steps, a first probability distribution (255) over possible speech recognition hypotheses, wherein training the speech recognition model (200) is further based on the first probability distribution (255) over possible speech recognition hypotheses generated by the first decoder (250) at each of the plurality of output steps for each transcribed speech utterance (304) in the set of transcribed speech utterances (304).
  7. The computer-implemented method (600) of claim 6, wherein the operations further comprise, for each transcribed speech utterance (304) in the set of transcribed speech utterances (304): receiving, as input to the shared audio-text encoder (206) of the speech recognition model (200), the first higher order audio feature representation (205) generated by the audio encoder (204) at each of the plurality of output steps; generating, by the shared audio-text encoder (206), at each of the plurality of output steps, a second higher order audio feature representation (208) for a corresponding first higher order audio feature representation (205) in the shared latent representation space; receiving, as input to the second decoder (260) of the speech recognition model (200), the second higher order audio feature representation (208) generated by the shared audio-text encoder (206) at each of the plurality of output steps; and generating, by the second decoder (260), at each of the plurality of output steps, a second probability distribution (265) over possible speech recognition hypotheses, wherein training the speech recognition model (200) is further based on the second probability distribution (265) over possible speech recognition hypotheses generated by the second decoder (260) at each of the plurality of output steps for each transcribed speech utterance (304) in the set of transcribed speech utterances (304).
  8. The computer-implemented method (600) of claim 7, wherein training the speech recognition model (200) comprises jointly training the speech recognition model (200) using the first and second probability distributions (253, 263) over possible text units and the first and second probability distributions (255, 265) over possible speech recognition hypotheses.
  9. The computer-implemented method (600) of claims 7 or 8, wherein the operations further comprise: receiving, at a contextual finite-state transducer (FST) (242), the second probability distribution (265) over possible speech recognition hypotheses; determining, using the contextual FST (242), context scores (244) for each possible speech recognition hypothesis of the second probability distribution (265) based on context data (510); and executing a beam search decoding process to select a respective one of the possible speech recognition hypotheses of the second probability distribution (265) based on the context scores (244) and the second probability distribution (265).
  10. The computer-implemented method (600) of any of claims 1-9, wherein the first decoder (250) comprises: a prediction network (220) configured to: receive, as input, a sequence of N previous non-blank symbols output by a final Softmax layer (240); and generate, at each of the plurality of output steps, a dense representation; and a joint network (230) configured to: receive, as input, the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps and the dense representation generated by the prediction network (220) at each of the plurality of output steps; and generate, at each of the plurality of output steps, the first probability distribution (253) over possible text units.
  11. The computer-implemented method (600) of any of claims 1-10, wherein the operations further comprise, for each respective training data pair (532): upsampling, using a parameter-free duration model (400), a distribution of the sequence of sub-word units (402) tokenized from the respective training data pair (532); and randomly masking a portion of the upsampled distribution of the sequence of sub-word units (402).
  12. The computer-implemented method (600) of any of claims 1-11, wherein: each sub-word unit in the sequence of sub-word units (402) comprises one of a phoneme or a wordpiece; and each text unit in the first probability distribution (253) over possible text units comprises a wordpiece.
  13. A system (100) comprising: data processing hardware (710); and memory hardware (720) in communication with the data processing hardware (710), the memory hardware (720) storing instructions that when executed on the data processing hardware (710) cause the data processing hardware (710) to perform operations according to the computer-implemented method of any preceding claim.
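The training-data construction recited in claim 1 — pairing each unspoken textual utterance for a particular context with a carrier phrase from a context-specific list, then tokenizing the resulting pair into a sequence of sub-word units — can be sketched as follows. This is an illustrative sketch only, not the patented implementation: the function names are hypothetical, and the toy whitespace tokenizer stands in for a real phoneme or wordpiece tokenizer.

```python
# Hypothetical sketch of the training-data construction in claim 1.
import random

def build_training_pairs(unspoken_utterances, carrier_phrases, seed=0):
    """Pair each unspoken textual utterance with a carrier phrase
    drawn from the context-specific list."""
    rng = random.Random(seed)
    return [(rng.choice(carrier_phrases), utt) for utt in unspoken_utterances]

def tokenize(pair):
    """Stand-in tokenizer: split the joined phrase into word-like units.
    A real system would emit phonemes or wordpieces instead."""
    carrier, utterance = pair
    return f"{carrier} {utterance}".lower().split()

# Example: a "contact" context with its associated carrier phrases.
contacts = ["Alice Smith", "Bob Jones"]
carriers = ["call", "message"]

pairs = build_training_pairs(contacts, carriers)
sequences = [tokenize(p) for p in pairs]
```

Each resulting sequence (e.g. a carrier phrase followed by a contact name) is what the text encoder consumes, one sub-word unit per output step.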

Description

TECHNICAL FIELD

This disclosure relates to contextual biasing with text injection.

BACKGROUND

Automatic speech recognition (ASR), the process of taking an audio input and transcribing it into text, has become an important technology used in mobile devices and other devices. In general, automatic speech recognition attempts to provide accurate transcriptions of what a person has said by taking an audio input (e.g., a speech utterance) and transcribing the audio input into text. Modern ASR models continue to improve in both accuracy (e.g., a low word error rate (WER)) and latency (e.g., the delay between the user speaking and the transcription appearing) based on the ongoing development of deep neural networks. However, one challenge in developing deep learning-based ASR models is that the parameters of the ASR models tend to overfit the training data, resulting in the ASR models having difficulty generalizing to unseen data when the training data is not extensive enough. In some instances, ASR models use biasing to increase the probability of transcribing particular words or phrases. However, conventional biasing techniques cause significant WER and latency degradation of ASR models, especially as the number of biasing phrases increases. Semi-Supervised End-to-End Speech Recognition, in Proc. Interspeech 2018, by Karita et al. discloses a semi-supervised method for end-to-end automatic speech recognition (ASR). It can exploit large unpaired speech and text datasets, which require much less human effort to create than paired speech-to-text datasets. The semi-supervised method targets the extraction of an intermediate representation between speech and text data using a shared encoder network. Autoencoding of text data with this shared encoder improves the feature extraction of text data as well as that of speech data when the intermediate representations of speech and text are similar to each other as an inter-domain feature.
SUMMARY

One aspect of the disclosure provides a computer-implemented method that when executed on data processing hardware causes the data processing hardware to perform operations for training an automatic speech recognition model using contextual biasing with text injection. The operations include receiving context biasing data that includes a set of unspoken textual utterances corresponding to a particular context. Each unspoken textual utterance in the set of unspoken textual utterances is not paired with any corresponding spoken utterance of speech. The operations also include obtaining a list of carrier phrases associated with the particular context of the set of unspoken textual utterances. For each respective unspoken textual utterance in the set of unspoken textual utterances, the operations include generating a corresponding training data pair that includes the respective unspoken textual utterance paired with a carrier phrase from among the list of carrier phrases. For each respective training data pair, the operations include: tokenizing the respective training data pair into a sequence of sub-word units; generating, by a text encoder at each of a plurality of output steps, a first higher order textual feature representation for a corresponding sub-word unit in the sequence of sub-word units tokenized from the respective training data pair; receiving the first higher order textual feature representation generated by the text encoder at each of the plurality of output steps as input to a first decoder of a speech recognition model; and generating, by the first decoder, a first probability distribution over possible text units at each of the plurality of output steps. The operations also include training the speech recognition model based on the first probability distribution over possible text units generated by the first decoder at each of the plurality of output steps for each respective training data pair.
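The text-injection step summarized above — encoding each sub-word unit into a higher order textual feature representation, decoding it into a probability distribution over possible text units, and training on that distribution — can be sketched numerically. The following is a toy NumPy sketch under stated assumptions, not the patented model: the encoder is reduced to an embedding lookup, the first decoder to a single linear projection with a softmax, and the training signal to a cross-entropy of each sub-word unit against its own distribution; all shapes and weights are illustrative.

```python
# Toy sketch (not the patented model) of the text-injection training signal.
import numpy as np

rng = np.random.default_rng(0)
VOCAB, DIM = 8, 4                      # text-unit vocabulary size and feature size

W_enc = rng.normal(size=(VOCAB, DIM))  # stand-in text encoder (embedding table)
W_dec = rng.normal(size=(DIM, VOCAB))  # stand-in first decoder (linear projection)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def text_injection_loss(subword_ids):
    """Average cross-entropy over the sequence: at each output step,
    form the higher order textual feature representation, decode it
    into a probability distribution over possible text units, and
    score the target sub-word unit."""
    loss = 0.0
    for u in subword_ids:
        h = W_enc[u]                   # first higher order textual feature representation
        p = softmax(h @ W_dec)         # first probability distribution over text units
        loss += -np.log(p[u])
    return loss / len(subword_ids)

loss = text_injection_loss([1, 3, 5])  # a tokenized training data pair
```

In a real system this loss would be combined with the paired-speech losses of the later implementations and minimized by gradient descent over the encoder and decoder parameters.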
Implementations of the disclosure may include one or more of the following optional features. In some implementations, the particular context includes at least one of a song, a contact, an application, an entity, or a geographic location. In some examples, the list of carrier phrases includes at least one of call, message, play, open, or directions to. The operations may further include tokenizing the respective training data pair into one or more alternate sequences of sub-word units each including at least one different sub-word unit in the alternate sequence of sub-word units than a corresponding sub-word unit in the sequence of sub-word units. Here, the respective training data pair includes the sequence of sub-word units and the one or more alternate sequences of sub-word units. In some examples, for each unspoken textual utterance in the set of unspoken textual utterances, the operations further include: receiving the first higher order textual feature representation generated by the text encoder at each of the plurality of output steps as input to a shared audio-text encoder