US-11238845-B2 - Multi-dialect and multilingual speech recognition

US11238845B2US 11238845 B2US11238845 B2US 11238845B2US-11238845-B2

Abstract

Methods, systems, and apparatus, including computer programs encoded on a computer-readable media, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different language or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features determined based on the audio data is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.

Inventors

CHEN ZHIFENG
LI BO
WEINSTEIN EUGENE
WU YONGHUI
MORENO MENGIBAR PEDRO J
WEISS RON J
SIM KHE CHAI
SAINATH TARA N
PHU NGUYEN PATRICK AN

Assignees

GOOGLE LLC

Dates

Publication Date: 20220201
Application Date: 20191114
Priority Date: 20181121

Claims (19)

1. A method of performing speech recognition using an automated speech recognition system comprising one or more computers, the method comprising: receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance; providing, by the one or more computers of the automated speech recognition system, input features determined based on the audio data to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different languages or dialects, wherein the speech recognition model has been trained using cluster adaptive training, with each of the multiple languages or dialects corresponding to a separate cluster, and wherein the speech recognition model is configured to receive different identifiers as input to the speech recognition model to specify the different clusters corresponding to the respective languages or dialects; receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output of the speech recognition model, wherein the speech recognition model has been trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.
2. The method of claim 1 , wherein: the speech recognition model comprises an encoder, a decoder, and an attention model that learns alignments between outputs of the encoder and the decoder; and the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using the using training examples representing speech in multiple languages or dialects.
3. The method of claim 1 , wherein the linguistic units are graphemes, and the speech recognition model is configured to provide output indicating a probability distribution over a predetermined set of graphemes.
4. The method of claim 1 , wherein the speech recognition model is trained to output scores indicative of labels representing different languages or dialects, and wherein the speech recognition model is trained to generate output sequences that include one of the labels representing the different languages or dialects.
5. The method of claim 4 , wherein the labels for the language or dialect are included in the output sequences.
6. The method of claim 1 , further comprising: determining a language or dialect of the utterance; and providing, as input to the speech recognition model, data indicating the language or dialect as input to one or more neural network layers of the speech recognition model, wherein the output of the speech recognition model is generated based on input features determined from the audio data for the utterance and the data indicating the language or dialect of the utterance.
7. The method of claim 6 , wherein providing data indicating the language or dialect comprises providing a 1-hot vector having a value corresponding to each of a predetermined set of languages or dialects.
8. The method of claim 6 , wherein the data comprises an embedding corresponding to the language or dialect, wherein the embedding has been learned through training.
9. The method of claim 6 , wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model.
10. The method of claim 6 , wherein the data indicating the language or dialect is provided as input to one or more neural network layers of the of a decoder of the speech recognition model.
11. The method of claim 6 , wherein the data indicating the language or dialect is provided as input to one or more neural network layers of an encoder of the speech recognition model and to one or more neural network layers of the decoder of the speech recognition model.
12. The method of claim 11 , wherein the data indicating the language or dialect is provided as input to each neural network layer of the encoder and to each neural network layer of the decoder.
13. The method of claim 12 , wherein at each neural network layer of the encoder and the decoder, a vector indicative of the language or dialect is linearly transformed by the weight matrices of the neural network layer and added to the original hidden activations before a nonlinearity is applied.
14. The method of claim 1 , wherein the speech recognition model has been trained using cluster adaptive training, with each language or dialect corresponding to a separate cluster, and wherein each language or dialect has a corresponding language or dialect identifier provided as input to the speech recognition model to specify the use of the language or dialect.
15. The method of claim 14 , wherein the language or dialect identifiers are one-hot vectors.
16. The method of claim 1 , wherein the speech recognition model has been trained using cluster adaptive training, with each language or dialect corresponding to a separate cluster, and wherein language or dialect embedding vectors learned through training are used as weights to combine clusters.
17. The method of claim 1 , wherein: the speech recognition model comprises an encoder, a decoder, and an attention model that learns alignments between outputs of the encoder and the decoder; the encoder, the decoder, and the attention model each comprise one or more neural network layers that have parameters learned through training using the training examples representing speech in multiple languages or dialects; the speech recognition model has been trained using cluster adaptive training, with each language or dialect corresponding to a separate cluster; for each cluster, a single LSTM layer is used with output projection to match the dimension of a particular layer of the speech recognition model; a weighted sum of all the cluster adaptive trained bases using dialect vectors as interpolation weights is added back to the outputs of the particular layer to generate an aggregated output vector; and the aggregated output vector is provided as input to a last layer of the encoder of the speech recognition model.
18. A system comprising: one or more computers; and one or more computer-readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the following operations: receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance; providing, by the one or more computers of the automated speech recognition system, input features determined based on the audio data to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different languages or dialects, wherein the speech recognition model has been trained using cluster adaptive training, with each of the multiple languages or dialects corresponding to a separate cluster, and wherein the speech recognition model is configured to receive different identifiers as input to the speech recognition model to specify the different clusters corresponding to the respective languages or dialects; receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output of the speech recognition model, wherein the speech recognition model has been trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.
19. One or more non-transitory computer-readable media storing instructions that, when executed by the one or more computers, cause the one or more computers to perform the following operations: receiving, by the one or more computers of the automated speech recognition system, audio data indicating audio characteristics of an utterance; providing, by the one or more computers of the automated speech recognition system, input features determined based on the audio data to a speech recognition model that has been trained to output score indicating the likelihood of linguistic units for each of multiple different languages or dialects, wherein the speech recognition model has been trained using cluster adaptive training, with each of the multiple languages or dialects corresponding to a separate cluster, and wherein the speech recognition model is configured to receive different identifiers as input to the speech recognition model to specify the different clusters corresponding to the respective languages or dialects; receiving, by the one or more computers of the automated speech recognition system, output that the speech recognition model generated in response to receiving the input features determined based on the audio data; and providing, as an output of the automated speech recognition system, a transcription of the utterance generated based on the output of the speech recognition model, wherein the speech recognition model has been trained using multi-task learning using: a first objective function corresponding to grapheme prediction; and a second objective function corresponding to a language or dialect classification cost, the first objective function and second objective function being weighted such that the speech recognition model is trained to learn hidden representations that are effective for both language and dialect classification and grapheme prediction.

Description

CROSS-REFERENCE TO RELATED APPLICATION This application claims the benefit of U.S. Provisional Application No. 62/770,534, filed on Nov. 21, 2018, the entire contents which are incorporated herein by reference. BACKGROUND The present specification relates to speech recognition. Speech recognition has made remarkable progress in the past few years. Nevertheless, building a speech recognizer that can accurately recognize speech in multiple different languages or dialects is still a challenge. SUMMARY Sequence-to-sequence models can provide a simple and elegant solution for building speech recognition systems by folding separate components of a typical speech recognition system, namely acoustic (AM), pronunciation (PM) and language (LM) models into a single neural network. In some implementations, a single sequence-to-sequence model, such as a model of the listen, attend and spell (LAS) type, can be trained to serve multiple different English dialects, which simplifies the process of training multi-dialect systems without the need for separate AM, PM and LMs for each dialect. In general, simply pooling the data from all dialects into one LAS model falls behind the performance of a model fine-tuned on each dialect. However, incorporating dialect-specific information into the model can improve performance, for example, through techniques of modifying the training targets by inserting the dialect symbol at the end of the original grapheme sequence and also feeding a 1-hot representation of the dialect information into all layers of the model. In fact, a multi-dialect model structured and trained in this way can provide greater accuracy than specialized, single-dialog models. Experimental results for seven English dialects show that a single, multi-dialect LAS model is effective in modeling dialect variations, outperforming single-dialect LAS models (each trained individually on each of the seven dialects) with 3.1˜16.5% relative reductions in word error rate (WER). Dialects are variations of the same language, specific to geographical regions or social groups. Although different dialects share many similarities, there are usually large differences at several linguistic levels; amongst others: phonological, grammatical, orthographic (e.g., “color” vs. “colour”) and very often different vocabularies. As a result, automatic speech recognition (ASR) systems trained or tuned for one specific dialect of a language perform poorly when tested on another dialect of the same language. In addition, systems simultaneously trained on many dialects fail to generalize well for each individual dialect. Inevitably, multi-dialect languages pose a challenge to ASR systems. If enough data exists for each dialect, a common practice is to treat each dialect independently. Alternatively, in cases where dialects are resource-scarce, these models are boosted with data from other dialects. In the past, there have been many attempts to build multi-dialect/language systems. The usual approach has been to define a common set of universal phone models with appropriate parameter sharing and train it on data from many languages. The data can be adapted from the language of interest developed, similar to neural network models with language independent feature extraction and language dependent phonetic classifiers. One of the challenges of building a universal multi-dialect model for conventional ASR systems is that many of these models still require a separate pronunciation model (PM) and language model (LM) per dialect, which are trained independently from the multi-dialect acoustic model (AM). Therefore, if the AM predicts an incorrect set of sub-word units from the incorrect dialect, errors are propagated to the PM and LM. Sequence-to-sequence models provide a simple and elegant technique for the ASR task by learning and optimizing a single neural network for the AM, PM and LM. This provides a significant advantage for building a single multi-dialect system. Training a multi-dialect sequence-to-sequence model is simple, as the output set of symbols to be predicted can be generated by simply pooling all the grapheme symbols together across the dialects. In addition, the AM, PM and LM variations are jointly modeled across dialects. The simplicity and joint optimization make it effective for training multi-dialect systems. In some implementations, attention-based sequence-to-sequence models are adopted, namely listen, attend and spell (LAS) for multi-dialect modeling. It has shown good performance compared to other sequence-to-sequence models for single dialect tasks. As discussed below, one example model has the goal of recognizing each of seven English dialects with a single LAS model. One approach is to simply pool all the data from the seven dialects together. For English, the grapheme set is shared across dialects, so nothing needs to be modified for the output. Although this model often gives acceptable performance for each dialect, this m