EP-4578006-B1 - UNIVERSAL MONOLINGUAL OUTPUT LAYER FOR MULTILINGUAL SPEECH RECOGNITION
Inventors
- ZHANG, CHAO
- LI, BO
- SAINATH, TARA N
- STROHMAN, TREVOR
- CHANG, SHUO-YIIN
Dates
- Publication Date: 2026-05-06
- Application Date: 2023-10-11
Claims (15)
- A computer-implemented method (500) that when executed on data processing hardware (610) causes the data processing hardware (610) to perform operations comprising: receiving, as input to a multilingual automated speech recognition (ASR) model (200) configured to recognize speech in a plurality of different supported languages, a sequence of acoustic frames (110); generating, by an audio encoder (204) of the multilingual ASR model (200), at each of a plurality of output steps, a higher order feature representation (212, 222) for a corresponding acoustic frame (110) in the sequence of acoustic frames (110); generating, by a language identification (LID) predictor (230) of the multilingual ASR model (200), at each of the plurality of output steps, a language prediction representation (232) for a corresponding higher order feature representation (212, 222) generated by the audio encoder (204); and generating, by a decoder (240) of the multilingual ASR model (200), at each of the plurality of output steps, a probability distribution (252) over possible speech recognition results, the decoder (240) comprising a monolingual output layer (400) having a plurality of output nodes (410) each sharing a plurality of language-specific wordpiece models (420), the probability distribution (252) over possible speech recognition results based on the corresponding higher order feature representation (212, 222) generated by the audio encoder (204), a sequence of non-blank symbols (121) output by the monolingual output layer (400), and a corresponding language prediction representation (232) generated by the LID predictor (230).
- The computer-implemented method (500) of claim 1, wherein: each language of the plurality of different supported languages comprises V number of wordpiece models (420); the monolingual output layer (400) comprises an input size equal to H; and the monolingual output layer (400) comprises a dimension equal to H x V.
- The computer-implemented method (500) of claim 1 or 2, wherein each language-specific wordpiece model (420) of the plurality of language-specific wordpiece models (420) shared by each corresponding output node (410) comprises a language-specific wordpiece model (420) corresponding to a respective language among the plurality of different supported languages that is different than the respective languages corresponding to the other language-specific wordpiece models (420) shared by the corresponding output node (410); and optionally, wherein each language-specific wordpiece model (420) comprises a respective wordpiece token vocabulary (422) in a writing system corresponding to the respective language.
- The computer-implemented method (500) of any preceding claim, wherein the sequence of acoustic frames (110) received as input at the audio encoder (204) characterizes an utterance spoken in at least one of the plurality of different supported languages; and optionally, wherein the utterance comprises a code-mixed utterance comprising one or more words spoken in a first language and one or more other words spoken in a second language.
- The computer-implemented method (500) of any preceding claim, wherein the plurality of output nodes (410) of the monolingual output layer (400) associate to corresponding language-specific wordpiece models (420) for each of the plurality of different supported languages alphabetically.
- The computer-implemented method (500) of any preceding claim, wherein the operations further comprise, when two or more of the plurality of different supported languages share a same corresponding language-specific wordpiece model (420), associating, by the monolingual output layer (400), the same corresponding language-specific wordpiece model (420) to share a same one of the plurality of output nodes (410). (An illustrative sketch of this node sharing follows the claims.)
- The computer-implemented method (500) of claim 6, wherein the monolingual output layer (400) associates same language-specific wordpiece models (420) shared by different languages to output nodes (410) by: identifying all language-specific wordpiece models (420) across all of the plurality of different supported languages that are shared by two or more of the plurality of different languages; and for each corresponding language-specific wordpiece model (420) identified as being shared by two or more of the plurality of different languages: indexing the corresponding language-specific wordpiece model (420) from 1 to S, wherein S denotes a number of the different languages that share the corresponding language-specific wordpiece model (420); and assigning the corresponding language-specific wordpiece model (420) to occupy a respective one of the plurality of output nodes (410) for each of the S number of the different languages that share the corresponding language-specific wordpiece model (420).
- The computer-implemented method (500) of claim 7, wherein, for the corresponding language-specific wordpiece model (420) assigned to occupy the respective one of the plurality of output nodes (410) for each of the S number of different languages, the monolingual output layer (400) merges the corresponding language-specific wordpiece model (420) indexed from 1 to S into a single language-specific wordpiece model (420) shared by each of the S number of the different languages.
- The computer-implemented method (500) of any preceding claim, wherein: the language prediction representation (232) generated by the LID predictor (230) at each of the plurality of output steps represents a probability distribution over possible languages among the plurality of different supported languages that is predicted for a corresponding acoustic frame (110) in the sequence of acoustic frames (110); and generating the probability distribution (252) over possible speech recognition results comprises generating the probability distribution (252) over possible speech recognition results only over the language-specific wordpiece models (420) that correspond to the top-K languages in the probability distribution over possible languages represented by the corresponding language prediction representation (232). (A pruning sketch follows the claims.)
- The computer-implemented method (500) of claim 9, wherein: K is less than a total number of the different supported languages; and K comprises a frame-dependent variable that adapts at each of the plurality of output steps.
- The computer-implemented method (500) of any preceding claim, wherein the operations further comprise performing, by the monolingual output layer (400), beam-searching over the top N candidate hypotheses selected from the probability distribution (252) over possible speech recognition results at each of the plurality of output steps.
- The computer-implemented method (500) of any preceding claim, wherein the operations further comprise: generating, by a prediction network (300) of the decoder (240), at each of the plurality of output steps, a dense representation (350) based on the sequence of non-blank symbols (121) output by the monolingual output layer (400) and the corresponding language prediction representation (232) generated by the LID predictor (230); and generating, by a joint network (250) of the decoder (240), at each of the plurality of output steps, the probability distribution (252) over possible speech recognition results based on a corresponding dense representation (350) generated by the prediction network (300), the corresponding higher order feature representation (212, 222) generated by the audio encoder (204), and the corresponding language prediction representation (232) generated by the LID predictor (230); and optionally, wherein the joint network (250) comprises a combination structure that stacks gating and bilinear pooling to fuse the dense representation (350) generated by the prediction network (300) and the higher order feature representation (212, 222) generated by the audio encoder (204). (A fusion sketch follows the claims.)
- The computer-implemented method (500) of any preceding claim, wherein the audio encoder (204) comprises a cascaded encoder and the operations further comprise: generating, by a first encoder (210) of the cascaded encoder, at each of the plurality of output steps, a first higher order feature representation (212) for the corresponding acoustic frame (110) in the sequence of acoustic frames (110); and generating, by a second encoder (220) of the cascaded encoder, at each of the plurality of output steps, a second higher order feature representation (222) for a corresponding first higher order feature representation (212), wherein generating the language prediction representation (232) for the corresponding higher order feature representation (212) is based on a concatenation (231) of the corresponding first higher order feature representation (212) generated by the first encoder (210) and a corresponding second higher order feature representation (222) generated by the second encoder (220).
- The computer-implemented method (500) of any preceding claim, wherein the audio encoder (204) comprises a cascaded encoder and the operations further comprise: generating, by a first encoder (210) of the cascaded encoder, at each of the plurality of output steps, a first higher order feature representation (212) for the corresponding acoustic frame (110) in the sequence of acoustic frames (110); and generating, by a second encoder (220) of the cascaded encoder, at each of the plurality of output steps, a second higher order feature representation (222) based on a concatenation of a corresponding first higher order feature representation (212) generated by the first encoder (210) and the corresponding language prediction representation (232) generated by the LID predictor (230), wherein generating the language prediction representation (232) for the corresponding higher order feature representation (212) is based on a corresponding first higher order feature representation (212) generated by the first encoder (210). (A wiring sketch of both cascaded-encoder variants follows the claims.)
- An automated speech recognition (ASR) system (118) comprising: a multilingual automated speech recognition (ASR) model (200) for recognizing speech in a plurality of different supported languages, the multilingual ASR model (200) comprising: an audio encoder (204); a language identification (LID) predictor (230); and a decoder (240) comprising a monolingual output layer (400) having a plurality of output nodes (410) each sharing a plurality of language-specific wordpiece models (420); wherein the ASR system is configured to perform the method of any preceding claim.
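The node assignment of claims 5 to 8 can be made concrete with a short, non-limiting sketch. The Python below is an illustration written for this text, not the patented implementation: the toy vocabularies, the function name assign_nodes, and the greedy fill of free nodes are assumptions. Only the invariants taken from the claims are fixed: languages are processed alphabetically, a wordpiece shared by two or more languages occupies a single common output node, and each language maps its V wordpieces one-to-one onto the V output nodes.

```python
from collections import Counter

def assign_nodes(lang_vocabs):
    """Map each (language, wordpiece) pair to one of V shared output nodes.

    Wordpieces used by two or more languages are pinned to a common node
    (claims 6-8); each language's remaining pieces then fill its still-free
    nodes, so every language maps its V pieces one-to-one onto the V nodes.
    The greedy fill order is an assumption made for this sketch.
    """
    V = len(next(iter(lang_vocabs.values())))
    counts = Counter(p for vocab in lang_vocabs.values() for p in vocab)
    shared = sorted(p for p, c in counts.items() if c > 1)
    assert len(shared) <= V, "sketch assumes shared pieces fit in V nodes"
    shared_node = {p: i for i, p in enumerate(shared)}   # common indices first
    node_of = {}
    for lang in sorted(lang_vocabs):                     # alphabetical order (claim 5)
        used = {shared_node[p] for p in lang_vocabs[lang] if p in shared_node}
        free = (i for i in range(V) if i not in used)    # nodes still open for this language
        for piece in lang_vocabs[lang]:
            node_of[(lang, piece)] = shared_node[piece] if piece in shared_node else next(free)
    return node_of

# Toy example with V = 3: "_in" appears in both vocabularies, so it occupies
# the same output node for "de" and "en"; each language still fills 3 nodes.
vocabs = {"de": ["_der", "_in", "sch"], "en": ["_the", "_in", "ing"]}
node_of = assign_nodes(vocabs)
assert node_of[("de", "_in")] == node_of[("en", "_in")]
```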
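Claims 9 and 10 restrict decoding at each output step to the wordpiece models of the top-K languages under the LID posterior, with a frame-dependent K. The sketch below is one hypothetical reading of that pruning: the product-and-renormalize combination, the 0.95 mass threshold, and all sizes are assumptions.

```python
import numpy as np

def adaptive_k(lid_probs, mass=0.95):
    """Frame-dependent K (claim 10): the smallest K whose top languages cover
    `mass` of the LID posterior at this frame; the threshold is an assumption."""
    cum = np.cumsum(np.sort(lid_probs)[::-1])
    return int(np.searchsorted(cum, mass) + 1)

def step_posterior(node_logits, lid_probs):
    """Score only the top-K languages' readings of the V output nodes
    (claim 9). Combining by product and renormalizing is an assumption."""
    k = adaptive_k(lid_probs)
    top = np.argsort(lid_probs)[::-1][:k]               # top-K language ids
    node_probs = np.exp(node_logits - node_logits.max())
    node_probs /= node_probs.sum()                      # softmax over V nodes
    joint = lid_probs[top, None] * node_probs[None, :]  # (K, V) joint scores
    return top, joint / joint.sum()

lid = np.array([0.62, 0.30, 0.05, 0.03])   # hypothetical LID posterior over 4 languages
logits = np.zeros(8)                       # V = 8 toy output-node logits
langs, posterior = step_posterior(logits, lid)
print(langs, posterior.shape)              # [0 1 2] (3, 8): K adapted to 3
```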
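The optional joint network of claim 12 fuses the prediction network's dense representation with the encoder's higher order feature representation by stacking gating and bilinear pooling. The following is a minimal numpy sketch of one plausible such structure; the exact gating form, the complementary (1 - g) weighting, and the weight shapes are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse(h_enc, h_pred, Wg_e, Wg_p, W_bi):
    """Gate the two streams, then combine them with bilinear pooling:
    out_k = e^T W_bi[k] p. The gating form is an assumption for this sketch."""
    g = sigmoid(Wg_e @ h_enc + Wg_p @ h_pred)    # per-dimension gate in (0, 1)
    e = g * h_enc                                # gated encoder stream
    p = (1.0 - g) * h_pred                       # complementary prediction stream
    return np.einsum("i,kij,j->k", e, W_bi, p)   # bilinear pooling over both

H, K = 4, 5                                      # toy sizes (assumptions)
rng = np.random.default_rng(0)
h_enc, h_pred = rng.standard_normal(H), rng.standard_normal(H)
out = fuse(h_enc, h_pred,
           rng.standard_normal((H, H)), rng.standard_normal((H, H)),
           rng.standard_normal((K, H, H)))
print(out.shape)                                 # (5,)
```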
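Claims 13 and 14 describe two alternative cascaded-encoder wirings that differ in where the LID predictor taps the encoder stack and whether its output feeds the second encoder. The sketch below shows only that connectivity; the placeholder linear "encoders", the tanh nonlinearity, and all sizes are assumptions.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

rng = np.random.default_rng(0)
H, L = 6, 4                                  # toy hidden size / language count
x  = rng.standard_normal(H)                  # features of one acoustic frame
W1 = rng.standard_normal((H, H))             # placeholder first encoder
h1 = np.tanh(W1 @ x)                         # first higher order representation (212)

# Claim 13 wiring: the LID predictor consumes the concatenation of the first
# and second encoder outputs.
W2, W_lid = rng.standard_normal((H, H)), rng.standard_normal((L, 2 * H))
h2    = np.tanh(W2 @ h1)                     # second higher order representation (222)
lid_a = softmax(W_lid @ np.concatenate([h1, h2]))

# Claim 14 wiring: the LID predictor consumes only the first encoder output,
# and its prediction is concatenated into the second encoder's input.
W_lid1, W2b = rng.standard_normal((L, H)), rng.standard_normal((H, H + L))
lid_b = softmax(W_lid1 @ h1)
h2_b  = np.tanh(W2b @ np.concatenate([h1, lid_b]))
```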
Description
TECHNICAL FIELD

This disclosure relates to using a universal monolingual output layer for multilingual speech recognition.

BACKGROUND

Automatic speech recognition (ASR), the process of taking an audio input and transcribing it into text, has become an important technology used in mobile devices and other devices. In general, automatic speech recognition attempts to provide accurate transcriptions of what a person has said by taking an audio input (e.g., a speech utterance) and transcribing the audio input into text. Modern ASR models continue to improve in both accuracy (e.g., a low word error rate (WER)) and latency (e.g., the delay between a user speaking and the transcription appearing) based on the ongoing development of deep neural networks. Despite a vast number of people being bilingual, many ASR models are only compatible with a single language. Other conventional ASR models are multilingual (i.e., compatible with multiple languages), but have significantly increased model sizes such that they are not suitable for on-device applications with certain storage and computing resource limitations.

It is known from the publication CHAO ZHANG ET AL: "Streaming End-to-End Multilingual Speech Recognition with Joint Language Identification", arXiv.org, Cornell University Library, 13 September 2022 (2022-09-13), techniques for performing streaming end-to-end multilingual speech recognition with joint language identification. It is further known from the publication JOSHI VIKAS ET AL: "Multiple Softmax Architecture for Streaming Multilingual End-to-End ASR Systems", INTERSPEECH 2021, 30 August 2021 (2021-08-30) - 3 September 2021 (2021-09-03), pages 1767-1771, ISCA, DOI: 10.21437/Interspeech.2021-1298 (retrieved from the Internet: https://www.isca-speech.org/archive/pdfs/interspeech_2021/joshi21_interspeech.pdf), techniques for multilingual speech recognition based on a multi-softmax model applicable to RNN-T transducers, having language-specific softmax, joint, and embedding layers while sharing the rest of the parameters.

SUMMARY

One aspect of the invention provides a computer-implemented method according to claim 1. Another aspect of the invention provides an automated speech recognition system according to claim 15. Further preferable aspects are defined by the dependent claims. The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.

DESCRIPTION OF DRAWINGS

FIG. 1 is a schematic view of an example speech recognition system.
FIGS. 2A and 2B are schematic views of example speech recognition models.
FIG. 3 is a schematic view of an example prediction network of the example speech recognition models of FIGS. 2A and 2B.
FIG. 4 is a schematic view of an example universal monolingual output layer of the example speech recognition models of FIGS. 2A and 2B.
FIG. 5 is a flowchart of an example arrangement of operations for a computer-implemented method of using a universal monolingual output layer for multilingual speech recognition.
FIG. 6 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.

Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION

Automatic speech recognition (ASR), the process of taking an audio input and transcribing it into text, is a feature included in various computing devices and is used by a significant number of people. Yet, only about 100 of the most commonly spoken languages have suitable ASR models for recognizing speech, even though approximately 7,000 languages are actively spoken in the world. Using a single multilingual ASR model that is compatible with a plurality of spoken languages (e.g., rather than multiple monolingual ASR models) is beneficial because of the capability of the multilingual ASR model to recognize code-switched utterances (e.g., a single utterance that includes at least two different spoken languages) and the reduced workload of maintaining a single ASR model.

Many end-to-end ASR models use wordpiece models that recognize speech in word or wordpiece segments to optimize ASR performance in monolingual speech scenarios. However, it is impractical to use a large number of multilingual wordpiece models in multilingual scenarios where multiple different writing systems (e.g., different alphabetic characters) are involved. To that end, conventional systems use a separate monolingual output layer and decoder for each of the different languages that the ASR model is configured to recognize. A major drawback of these conventional systems is that using a separate monolingual output layer for each language considerably increases the storage and computational resources consumed by the ASR model and requires management of more concurrent beam searches.
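The storage cost of per-language output layers can be made concrete with a back-of-the-envelope count; the sizes below (H = 640, V = 4096, 12 languages) are illustrative assumptions rather than figures from this disclosure.

```python
# Toy sizes are assumptions chosen only to make the contrast concrete.
H, V, L = 640, 4096, 12                   # hidden size, vocab size, languages
per_language = L * H * V                  # one H x V output layer per language
shared       = H * V                      # single universal monolingual layer
print(per_language, shared)               # 31457280 vs 2621440 weights
```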