EP-4537327-B1 - JOINT SPEECH AND TEXT STREAMING MODEL FOR ASR
Inventors
- SAINATH, Tara N.
- HUO, Zhouyuan
- CHEN, Zhehuai
- ZHANG, Yu
- WANG, Weiran
- STROHMAN, Trevor
- PRABHAVALKAR, Rohit Prakash
- LI, Bo
- BAPNA, Ankur
Dates
- Publication Date: 2026-05-13
- Application Date: 2023-07-01
Claims (15)
- A computer-implemented method (500) that, when executed on data processing hardware (610), causes the data processing hardware (610) to perform operations comprising: receiving training data comprising a set of unspoken textual utterances (320), wherein each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320) is not paired with any corresponding spoken utterance of speech; for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320): tokenizing the respective unspoken textual utterance (320) into a sequence of sub-word units (402); generating, by a text encoder (202) of an encoder (210), at each of a plurality of output steps, a first higher order textual feature representation (203) for a corresponding sub-word unit (402) in the sequence of sub-word units (402) tokenized from the respective unspoken textual utterance (320); receiving, as input to a first-pass decoder (250), the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps; and generating, by the first-pass decoder (250), at each of the plurality of output steps, a first probability distribution (253) over possible text units; and training the encoder (210) based on the first probability distribution (253) over possible text units generated by the first-pass decoder (250) at each of the plurality of output steps for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320); characterized in that: the operations further comprise, for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320): receiving, as input to a non-causal audio-text encoder (206) of the encoder (210), the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps; generating, by the non-causal audio-text encoder (206), at each of the plurality of output steps, a second higher order textual feature representation (207) for a corresponding first higher order textual feature representation (203); receiving, as input to a second-pass decoder (260), the second higher order textual feature representation (207) generated by the non-causal audio-text encoder (206) at each of the plurality of output steps; and generating, by the second-pass decoder (260), at each of the plurality of output steps, a second probability distribution (263) over possible text units, wherein training the encoder (210) is further based on the second probability distribution (263) over possible text units generated by the second-pass decoder (260) at each of the plurality of output steps for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320); the training data further comprises a set of transcribed speech utterances (304), each transcribed speech utterance (304) in the set of transcribed speech utterances (304) paired with a corresponding transcription (302) and represented by a corresponding sequence of acoustic frames (110); and the operations further comprise, for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304): generating, by a causal speech encoder (204) of the encoder (210), at each of the plurality of output steps, a first higher order audio feature representation (205) for a corresponding acoustic frame (110) in the sequence of acoustic frames (110) representing the transcribed speech utterance (304); receiving, as input to the first-pass decoder (250), the first higher order audio feature representation (205) generated by the causal speech encoder (204) at each of the plurality of output steps; and generating, by the first-pass decoder (250), at each of the plurality of output steps, a first probability distribution (255) over possible speech recognition hypotheses, wherein training the encoder (210) is further based on the first probability distribution (255) over possible speech recognition hypotheses generated by the first-pass decoder (250) at each of the plurality of output steps for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304). [An illustrative sketch of this joint training step follows the claims.]
- The computer-implemented method (500) of claim 1, wherein the first-pass decoder (250) and the second-pass decoder (260) comprise a same decoder.
- The computer-implemented method (500) of claim 1 or 2, wherein the non-causal audio-text encoder (206) comprises one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers.
- The computer-implemented method (500) of claim 1, wherein the causal speech encoder (204) comprises one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers, optionally wherein: the causal speech encoder (204) comprises an initial stack of conformer layers; and the non-causal audio-text encoder (206) comprises a final stack of conformer layers overlain on the initial stack of conformer layers. [An illustrative sketch of this cascaded layout follows the claims.]
- The computer-implemented method (500) of claim 1 or claim 4, wherein the causal speech encoder (204) and the non-causal audio-text encoder (206) of the encoder (210) are trained using Hybrid Autoregressive Transducer Factorization. [An illustrative sketch of this factorization follows the claims.]
- The computer-implemented method (500) of claim 5, wherein the operations further comprise, for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304): receiving, as input to the non-causal audio-text encoder (206), the first higher order audio feature representation (205) generated by the causal speech encoder (204) at each of the plurality of output steps; generating, by the non-causal audio-text encoder (206), at each of the plurality of output steps, a second higher order audio feature representation (208) for a corresponding first higher order audio feature representation (205); receiving, as input to the second-pass decoder (260), the second higher order audio feature representation (208) generated by the non-causal audio-text encoder (206) at each of the plurality of output steps; and generating, by the second-pass decoder (260), at each of the plurality of output steps, a second probability distribution (265) over possible speech recognition hypotheses, wherein training the encoder (210) is further based on the second probability distribution (265) over possible speech recognition hypotheses generated by the second-pass decoder (260) at each of the plurality of output steps for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304).
- The computer-implemented method (500) of claim 6, wherein training the encoder (210) comprises training the encoder (210) using a minimum word error loss function. [An illustrative sketch of such a loss follows the claims.]
- The computer-implemented method (500) of any of claims 1-7, wherein: each sub-word unit (402) in the sequence of sub-word units (402) comprises one of a phoneme or a wordpiece; and each text unit in the first probability distribution (253) over possible text units comprises a wordpiece.
- The computer-implemented method (500) of any of claims 1-8, wherein the operations further comprise, for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320): upsampling, using a parameter-free duration model, a distribution of the sequence of sub-word units (402) tokenized from the respective unspoken textual utterance (320); and randomly masking a portion of the upsampled distribution of the sequence of sub-word units (402). [An illustrative sketch of this upsampling and masking follows the claims.]
- A system (118) comprising: data processing hardware (610); and memory hardware (620) in communication with the data processing hardware (610), the memory hardware (620) storing instructions that, when executed on the data processing hardware (610), cause the data processing hardware (610) to perform operations comprising: receiving training data comprising a set of unspoken textual utterances (320), wherein each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320) is not paired with any corresponding spoken utterance of speech; for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320): tokenizing the respective unspoken textual utterance (320) into a sequence of sub-word units (402); generating, by a text encoder (202) of an encoder (210), at each of a plurality of output steps, a first higher order textual feature representation (203) for a corresponding sub-word unit (402) in the sequence of sub-word units (402) tokenized from the respective unspoken textual utterance (320); receiving, as input to a first-pass decoder (250), the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps; and generating, by the first-pass decoder (250), at each of the plurality of output steps, a first probability distribution (253) over possible text units; and training the encoder (210) based on the first probability distribution (253) over possible text units generated by the first-pass decoder (250) at each of the plurality of output steps for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320), characterized in that: the operations further comprise, for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320): receiving, as input to a non-causal audio-text encoder (206) of the encoder (210), the first higher order textual feature representation (203) generated by the text encoder (202) at each of the plurality of output steps; generating, by the non-causal audio-text encoder (206), at each of the plurality of output steps, a second higher order textual feature representation (207) for a corresponding first higher order textual feature representation (203); receiving, as input to a second-pass decoder (260), the second higher order textual feature representation (207) generated by the non-causal audio-text encoder (206) at each of the plurality of output steps; and generating, by the second-pass decoder (260), at each of the plurality of output steps, a second probability distribution (263) over possible text units, wherein training the encoder (210) is further based on the second probability distribution (263) over possible text units generated by the second-pass decoder (260) at each of the plurality of output steps for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320); the training data further comprises a set of transcribed speech utterances (304), each transcribed speech utterance (304) in the set of transcribed speech utterances (304) paired with a corresponding transcription (302) and represented by a corresponding sequence of acoustic frames (110); and the operations further comprise, for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304): generating, by a causal speech encoder (204) of the encoder (210), at each of the plurality of output steps, a first higher order
audio feature representation (205) for a corresponding acoustic frame (110) in the sequence of acoustic frames (110) representing the transcribed speech utterance (304); receiving, as input to the first-pass decoder (250), the first higher order audio feature representation (205) generated by the causal speech encoder (204) at each of the plurality of output steps; and generating, by the first-pass decoder (250), at each of the plurality of output steps, a first probability distribution (255) over possible speech recognition hypotheses, wherein training the encoder (210) is further based on the first probability distribution (255) over possible speech recognition hypotheses generated by the first-pass decoder (250) at each of the plurality of output steps for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304).
- The system (118) of claim 10, wherein the first-pass decoder (250) and the second-pass decoder (260) comprise a same decoder, and/or wherein the non-causal audio-text encoder (206) comprises one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers.
- The system (118) of claim 10, wherein the causal speech encoder (204) comprises one of: a plurality of unidirectional long short-term memory (LSTM) layers; a plurality of conformer layers; or a plurality of transformer layers, and/or wherein: the causal speech encoder (204) comprises an initial stack of conformer layers; and the non-causal audio-text encoder (206) comprises a final stack of conformer layers overlain on the initial stack of conformer layers.
- The system (118) of any of claims 10-12, wherein the causal speech encoder (204) and the non-causal audio-text encoder (206) of the encoder (210) are trained using Hybrid Autoregressive Transducer Factorization, optionally wherein the operations further comprise, for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304): receiving, as input to the non-causal audio-text encoder (206), the first higher order audio feature representation (205) generated by the causal speech encoder (204) at each of the plurality of output steps; generating, by the non-causal audio-text encoder (206), at each of the plurality of output steps, a second higher order audio feature representation (208) for a corresponding first higher order audio feature representation (205); receiving, as input to the second-pass decoder (260), the second higher order audio feature representation (208) generated by the non-causal audio-text encoder (206) at each of the plurality of output steps; and generating, by the second-pass decoder (260), at each of the plurality of output steps, a second probability distribution (265) over possible speech recognition hypotheses, wherein training the encoder (210) is further based on the second probability distribution (265) over possible speech recognition hypotheses generated by the second-pass decoder (260) at each of the plurality of output steps for each respective transcribed speech utterance (304) in the set of transcribed speech utterances (304).
- The system (118) of claim 13, wherein training the encoder (210) comprises training the encoder (210) using a minimum word error loss function.
- The system (118) of any of claims 10-14, wherein: each sub-word unit (402) in the sequence of sub-word units (402) comprises one of a phoneme or a wordpiece; and each text unit in the first probability distribution (253) over possible text units comprises a wordpiece, and/or wherein the operations further comprise, for each respective unspoken textual utterance (320) in the set of unspoken textual utterances (320): upsampling, using a parameter-free duration model, a distribution of the sequence of sub-word units (402) tokenized from the respective unspoken textual utterance (320); and randomly masking a portion of the upsampled distribution of the sequence of sub-word units (402).
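Illustrative sketches (editorial, non-limiting)
The two-pass, dual-modality training step recited in claim 1 can be pictured with a minimal PyTorch sketch. This is an editorial illustration, not the patented implementation: the module choices (an embedding as the text encoder (202), an LSTM as the causal speech encoder (204), transformer layers as the shared non-causal audio-text encoder (206)), the dimensions, and the per-step cross-entropy standing in for the transducer losses are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM = 128, 64

class JointEncoder(nn.Module):
    """Encoder (210): modality-specific front ends feeding one shared stack."""
    def __init__(self):
        super().__init__()
        self.text_encoder = nn.Embedding(VOCAB, DIM)              # text encoder (202)
        self.speech_encoder = nn.LSTM(80, DIM, batch_first=True)  # causal speech encoder (204)
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.shared = nn.TransformerEncoder(layer, num_layers=2)  # non-causal audio-text encoder (206)

    def encode_text(self, tokens):
        first = self.text_encoder(tokens)       # first higher order textual features (203)
        return first, self.shared(first)        # second higher order textual features (207)

    def encode_speech(self, frames):
        first, _ = self.speech_encoder(frames)  # first higher order audio features (205)
        return first, self.shared(first)        # second higher order audio features (208)

encoder = JointEncoder()
first_pass = nn.Linear(DIM, VOCAB)   # first-pass decoder (250), reduced to a projection
second_pass = nn.Linear(DIM, VOCAB)  # second-pass decoder (260), likewise
params = (list(encoder.parameters()) + list(first_pass.parameters())
          + list(second_pass.parameters()))
opt = torch.optim.Adam(params, lr=1e-4)

tokens = torch.randint(0, VOCAB, (2, 12))         # sub-word units (402) of unspoken text (320)
frames = torch.randn(2, 50, 80)                   # acoustic frames (110) of transcribed speech (304)
frame_targets = torch.randint(0, VOCAB, (2, 50))  # toy frame-aligned targets (assumption)

f_txt, s_txt = encoder.encode_text(tokens)
f_sp, s_sp = encoder.encode_speech(frames)
# Per-step cross-entropy stands in for the RNN-T/HAT losses a real system would use;
# the encoder is trained on both first-pass and second-pass distributions, per claim 1.
loss = (F.cross_entropy(first_pass(f_txt).transpose(1, 2), tokens)           # dist. (253)
        + F.cross_entropy(second_pass(s_txt).transpose(1, 2), tokens)        # dist. (263)
        + F.cross_entropy(first_pass(f_sp).transpose(1, 2), frame_targets)   # dist. (255)
        + F.cross_entropy(second_pass(s_sp).transpose(1, 2), frame_targets)) # dist. (265)
opt.zero_grad()
loss.backward()
opt.step()
```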
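Claim 4's optional layout places an initial causal stack beneath a final non-causal stack overlain on top of it. The sketch below uses transformer layers standing in for the recited conformer layers (an assumption), with causality imposed on the initial stack by a left-context-only attention mask; the layer counts and dimensions are illustrative.

```python
import torch
import torch.nn as nn

DIM = 64

def stack(num_layers):
    layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

initial_stack = stack(4)  # stands in for the causal speech encoder (204)
final_stack = stack(2)    # stands in for the non-causal audio-text encoder (206)

frames = torch.randn(1, 50, DIM)  # already-projected acoustic frames (assumption)
T = frames.size(1)
# True above the diagonal: attention to future frames is blocked, so the
# initial stack sees no right context and remains streaming-friendly.
causal_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
first = initial_stack(frames, mask=causal_mask)  # first higher order features (205)
second = final_stack(first)                      # full-context features (208)
```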
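Claim 5 names Hybrid Autoregressive Transducer (HAT) factorization. Its core idea is to model the blank symbol with its own Bernoulli distribution and factor the label distribution out of the remaining probability mass, which makes an internal language-model score separable at inference. A minimal sketch of that factorization follows; the shapes and vocabulary size are assumptions.

```python
import torch
import torch.nn.functional as F

def hat_distribution(joint_logits):
    """joint_logits: (..., 1 + V); index 0 is blank, the rest are labels."""
    blank_logit, label_logits = joint_logits[..., :1], joint_logits[..., 1:]
    p_blank = torch.sigmoid(blank_logit)                       # Bernoulli blank
    p_labels = (1.0 - p_blank) * F.softmax(label_logits, -1)   # factored label mass
    return torch.cat([p_blank, p_labels], dim=-1)              # sums to 1

probs = hat_distribution(torch.randn(2, 10, 1 + 128))
assert torch.allclose(probs.sum(-1), torch.ones(2, 10))
```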
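The minimum word error loss of claim 7 minimizes the expected number of word errors over an N-best list of hypotheses. The sketch below uses the standard MWER formulation (scores renormalized over the list, errors mean-centered as a variance-reducing baseline); the hypothesis scores and error counts are toy inputs.

```python
import torch

def mwer_loss(log_probs, word_errors):
    """log_probs: (N,) model log-scores of N-best hypotheses;
    word_errors: (N,) word edit distances against the reference."""
    weights = torch.softmax(log_probs, dim=0)    # renormalize over the N-best list
    centered = word_errors - word_errors.mean()  # mean baseline reduces variance
    return (weights * centered).sum()            # expected (relative) word errors

scores = torch.tensor([-2.0, -2.5, -4.0], requires_grad=True)
errors = torch.tensor([1.0, 0.0, 3.0])
mwer_loss(scores, errors).backward()  # gradient favors the zero-error hypothesis
```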
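Claim 9's text pre-processing pairs a parameter-free duration model, which repeats each sub-word representation so the text sequence approaches the frame rate of speech features, with random masking of a portion of the upsampled sequence. A sketch under assumptions: the uniform repeat distribution, the mask rate, and zero-filling as the masking operation are all illustrative choices.

```python
import torch

def upsample_and_mask(subword_embeddings, mask_prob=0.15, max_repeats=3):
    """subword_embeddings: (U, D), one vector per sub-word unit (402)."""
    # Parameter-free duration model: sample a repeat count per unit (no learned weights).
    repeats = torch.randint(1, max_repeats + 1, (subword_embeddings.size(0),))
    frames = subword_embeddings.repeat_interleave(repeats, dim=0)  # (T, D), T >= U
    mask = torch.rand(frames.size(0)) < mask_prob  # random positions to mask
    return frames.masked_fill(mask.unsqueeze(-1), 0.0)  # zero out masked frames

upsampled = upsample_and_mask(torch.randn(12, 64))
```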
Description
TECHNICAL FIELD
This disclosure relates to a joint speech and text streaming model for ASR.
BACKGROUND
Automatic speech recognition (ASR), the process of taking an audio input and transcribing it into text, has become an important technology used in mobile devices and other devices. In general, ASR attempts to provide accurate transcriptions of what a person has said by taking an audio input (e.g., a speech utterance) and transcribing the audio input into text. Modern ASR models continue to improve in both accuracy (e.g., a low word error rate (WER)) and latency (e.g., the delay between the user speaking and the transcription appearing), based on the ongoing development of deep neural networks. However, one challenge in developing deep learning-based ASR models is that the parameters of the ASR models tend to overfit the training data, resulting in the ASR models having difficulty generalizing to unseen data when the training data is not extensive enough. As a result, training ASR models on larger training datasets improves the accuracy of the ASR model. Synthesized speech and/or data-augmented speech can be incorporated to increase the volume of training data used to train the ASR models. KARITA, Shigeki, et al., "Semi-supervised End-to-end Speech Recognition Using Text-to-speech and Autoencoders", ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 12 May 2019, pages 6166-6170, DOI: 10.1109/ICASSP.2019.8682890, describes speech and text autoencoders that share encoders and decoders with an automatic speech recognition (ASR) model to improve ASR performance with large speech-only and text-only training datasets.
SUMMARY
One aspect of the disclosure provides a computer-implemented method according to independent claim 1. Another aspect of the disclosure provides a system according to independent claim 10. Preferred implementations are defined by the dependent claims. The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic view of an example speech recognition system.
FIG. 2 is a schematic view of an example speech recognition model.
FIGS. 3A and 3B are schematic views of an example training process for training an encoder of the speech recognition model.
FIG. 4 is a schematic view of an example alignment model.
FIG. 5 is a flowchart of an example arrangement of operations for a computer-implemented method of training an encoder of a speech recognition model to jointly learn shared representations of speech and text.
FIG. 6 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
Automated speech recognition has made tremendous strides with the introduction of sequence-to-sequence (Seq2Seq) models that map from audio to character sequences. At the same time, text-to-speech (TTS) or speech synthesis systems have successfully applied Seq2Seq models to obtain state-of-the-art natural, realistic-sounding synthesized speech that can be indistinguishable to the human ear from human speech.
One challenge in developing deep learning-based ASR models is that the parameters of the ASR models tend to overfit the training data, resulting in the ASR models having difficulty generalizing to unseen data when the training data is not extensive enough. Thus, training ASR models on larger training datasets improves the accuracy of the ASR model. For instance, machine learning or other statistical methods can be used to train ASR models on training datasets that include upwards of 10,000 hours of transcribed speech. Yet, the performance of ASR models suffers when the domain associated with the training data is distinct from the domain in which the ASR model will be deployed during inference. For example, training an ASR model on transcribed speech in a domain associated with video meetings would be less effective for recognizing speech related to voice search queries, and vice versa. Unpaired text data has the potential to drastically reduce the amount of labeled human speech required to train ASR models, while also providing flexibility in moving the ASR model across different domains. Using text data (i.e., unpaired text data) in addition to speech data to train ASR models, however, presents the challenge of combining the speech and text modalities of the training data. One current approach uses multi-task training to train a single model with different objectives for each modality. This approach suffers from interference and capacity limitations, given the different nature and objectives of each modality of the training data. Another current approach includes TT