EP-4407605-B1 - USING SPEECH RECOGNITION TO IMPROVE CROSS-LANGUAGE SPEECH SYNTHESIS
Inventors
- CHEN, Zhehuai
- RAMABHADRAN, Bhuvana
- ROSENBERG, Andrew
- ZHANG, Yu
- MENGIBAR, Pedro J. Moreno
Dates
- Publication Date
- 20260513
- Application Date
- 20211020
Claims (11)
- A computer-implemented method executed by data processing hardware that causes the data processing hardware to perform operations comprising: obtaining a multilingual text-to-speech (TTS) model; generating, using the multilingual TTS model, a native synthesized speech representation for an input text sequence in a first language that is conditioned on speaker characteristics of a native speaker of the first language; generating, using the multilingual TTS model, a cross-lingual synthesized speech representation for the input text sequence in the first language that is conditioned on speaker characteristics of a native speaker of a different second language; generating, using a Variational AutoEncoder, VAE, a native audio encoder embedding for the native synthesized speech representation; generating, using the VAE, a cross-lingual audio encoder embedding for the cross-lingual synthesized speech representation; determining, using a classifier, an adversarial loss term conditioned on the first language based on the native audio encoder embedding and the cross-lingual audio encoder embedding; and updating parameters of the multilingual TTS model based on the adversarial loss term.
- The computer-implemented method of claim 1, wherein the operations further comprise: generating, using a speech recognition model, a first speech recognition result for the native synthesized speech representation and a second speech recognition result for the cross-lingual synthesized speech representation; determining a consistent loss term based on the first speech recognition result and the second speech recognition result; and updating parameters of the speech recognition model based on the consistent loss term.
- The computer-implemented method of claim 2, wherein the operations further comprise: generating a first cross-entropy loss term based on the first speech recognition result and the input text sequence in the first language; determining a second cross-entropy loss term based on the second speech recognition result and the input text sequence in the first language; and updating parameters of the speech recognition model based on the first and second cross-entropy loss terms.
- The computer-implemented method of claim 3, wherein the operations further comprise back-propagating the first and second cross-entropy losses through the multilingual TTS model.
- The computer-implemented method of any preceding claim, wherein the operations further comprise applying data augmentation to at least one of the native synthesized speech representation or the cross-lingual synthesized speech representation.
- The computer-implemented method of any preceding claim, wherein the multilingual TTS model shares language embeddings across the first and second languages.
- The computer-implemented method of any preceding claim, wherein the operations further comprise, prior to generating the native and cross-lingual synthesized speech representations: transliterating the input text sequence in the first language into a native script; and tokenizing, using a global phoneme set shared between the first and second languages, the native script into a phoneme sequence.
- The computer-implemented method of claim 7, wherein generating the native synthesized speech representation comprises generating the native synthesized speech representation based on the phoneme sequence.
- The computer-implemented method of claim 7, wherein generating the cross-lingual synthesized speech representation comprises generating the cross-lingual synthesized speech representation based on the phoneme sequence.
- The computer-implemented method of claim 7, wherein the operations further comprise: encoding, using an encoder of the multilingual TTS model, the phoneme sequence; and decoding, using a decoder of the multilingual TTS model, the encoded phoneme sequence to generate a respective one of the native synthesized speech representation or the cross-lingual synthesized speech representation.
- A system comprising: data processing hardware; and memory hardware in communication with the data processing hardware, the memory hardware storing instructions that when executed on the data processing hardware cause the data processing hardware to perform operations comprising the method of any preceding claim.
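To make the interplay of the loss terms recited in claims 1 to 3 concrete, the sketch below shows one plausible training step in PyTorch. It is an illustration under assumptions, not the patented implementation: the module interfaces (tts_model, vae_encoder, language_classifier, asr_model), the gradient-reversal formulation of the adversarial term, the KL-divergence form of the consistent loss term, and all tensor shapes and weightings are hypothetical choices made here for readability.

```python
# Hedged sketch of a training step combining the loss terms of claims 1-3.
# All module interfaces and loss formulations are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass, negated gradient in the backward pass, so the
    language classifier's objective acts adversarially on the VAE/TTS parameters."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


def training_step(tts_model: nn.Module,          # multilingual TTS model
                  vae_encoder: nn.Module,        # audio encoder of the VAE
                  language_classifier: nn.Module,
                  asr_model: nn.Module,          # speech recognition model
                  phoneme_ids: torch.Tensor,     # [batch, seq] input text as phonemes
                  first_language_id: torch.Tensor,  # [batch] language of the input text
                  native_speaker_id: torch.Tensor,  # [batch] native speaker of 1st language
                  cross_speaker_id: torch.Tensor,   # [batch] native speaker of 2nd language
                  target_tokens: torch.Tensor):     # [batch, text_len] reference token ids
    # Native and cross-lingual synthesized speech representations for the same text.
    native_speech = tts_model(phoneme_ids, first_language_id, native_speaker_id)
    cross_speech = tts_model(phoneme_ids, first_language_id, cross_speaker_id)

    # Audio encoder embeddings from the VAE (claim 1).
    native_emb = vae_encoder(native_speech)
    cross_emb = vae_encoder(cross_speech)

    # Adversarial loss term conditioned on the first language: the classifier tries to
    # identify the language from the embeddings while gradient reversal pushes the
    # TTS/VAE toward language-independent embeddings.
    embeddings = torch.cat([native_emb, cross_emb], dim=0)
    lang_labels = first_language_id.repeat(2)
    adversarial_loss = F.cross_entropy(
        language_classifier(GradReverse.apply(embeddings)), lang_labels)

    # First and second speech recognition results as per-token distributions (claim 2).
    native_logits = asr_model(native_speech)     # [batch, text_len, vocab]
    cross_logits = asr_model(cross_speech)

    # Consistent loss term: encourage the two recognition results to agree
    # (a KL divergence is one common choice; the claims only require a term
    # based on both recognition results).
    consistency_loss = F.kl_div(F.log_softmax(cross_logits, dim=-1),
                                F.softmax(native_logits, dim=-1).detach(),
                                reduction="batchmean")

    # Cross-entropy of each recognition result against the input text sequence (claim 3).
    vocab = native_logits.size(-1)
    ce_native = F.cross_entropy(native_logits.reshape(-1, vocab),
                                target_tokens.reshape(-1))
    ce_cross = F.cross_entropy(cross_logits.reshape(-1, vocab),
                               target_tokens.reshape(-1))

    # An unweighted sum is one way to combine the terms before back-propagation.
    return adversarial_loss + consistency_loss + ce_native + ce_cross
```

Back-propagating the combined loss through both the speech recognition model and the multilingual TTS model corresponds to the parameter updates of claims 1 to 4; how the individual terms are weighted against each other is left open by the claims.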
Description
TECHNICAL FIELD
This disclosure relates to the training of a cross-lingual speech synthesizer and to using speech recognition to improve cross-language speech synthesis.
BACKGROUND
Automatic speech recognition (ASR) attempts to provide accurate transcriptions of what a person has said by taking an audio input and transcribing the audio input into text. Text-to-speech (TTS) synthesis generates speech from text. Languages that are scarcely used today, or that have only a limited amount of spoken and textual resources, present a challenge for training ASR and TTS systems because only a limited amount of labeled training data exists. Training ASR models with self-supervision may reduce the amount of labeled training data required. Oftentimes, even where ASR models have sufficient labeled training data, a unique ASR model is required for each language. In speech synthesis, voice cloning is a promising approach to circumventing data sparsity for low-resource languages. An example approach is disclosed in the conference paper by Yu Zhang et al., "Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning", Proceedings of Interspeech, Graz, Austria, 2019. Storing a separate ASR or TTS model for each language requires a significant amount of memory.
SUMMARY
According to an aspect of the disclosure, there is provided a computer-implemented method for training a multilingual TTS model as set forth by independent claim 1 and a corresponding system as set forth by independent claim 11. In some implementations, the multilingual TTS model includes an encoder portion that shares language embeddings across the first and second languages and a decoder portion that shares the language embeddings across the first and second languages and shares speaker embeddings for both native speakers of the first language and native speakers of the second language. In these implementations, the number of speaker embeddings for the native speakers of the first language may be less than the number of speaker embeddings for the native speakers of the second language. The decoder portion may be further conditioned on prosody information extracted from synthesized speech representations using a variational autoencoder. Here, the prosody information extracted from the synthesized speech representations using the variational autoencoder is disentangled from speaker information by applying an adversarial loss on speaker classification. In some examples, prior to generating the native and cross-lingual synthesized speech representations, the operations further include: transliterating the input text sequence in the first language into a native script; tokenizing the native script into a phoneme sequence; encoding, using an encoder of the multilingual TTS model, the phoneme sequence; and decoding, using a decoder of the multilingual TTS model, the encoded phoneme sequence to generate the respective one of the native synthesized speech representation or the cross-lingual synthesized speech representation.
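As a concrete, non-authoritative illustration of the front end described in the preceding paragraph, the toy PyTorch sketch below transliterates input text, tokenizes it against a global phoneme set shared across languages, and synthesizes both a native and a cross-lingual speech representation by changing only the speaker condition. The phoneme inventory, the transliterate and tokenize helpers, the ToyMultilingualTTS module, and all layer sizes are assumptions chosen for brevity, not the claimed architecture.

```python
# Toy sketch (assumptions only) of a shared multilingual TTS front end.
import torch
import torch.nn as nn

# Assumption: a tiny inventory standing in for the shared global phoneme set.
GLOBAL_PHONEMES = ["<pad>", "a", "e", "i", "o", "u", "k", "t", "n", "s"]
PHONEME_TO_ID = {p: i for i, p in enumerate(GLOBAL_PHONEMES)}


def transliterate(text: str) -> str:
    """Placeholder transliteration into a native script; a real system would use a
    language-specific transliteration model or rule set."""
    return text.lower()


def tokenize(native_script: str) -> torch.Tensor:
    """Map characters onto the shared phoneme inventory (toy grapheme-to-phoneme)."""
    ids = [PHONEME_TO_ID[c] for c in native_script if c in PHONEME_TO_ID]
    return torch.tensor(ids, dtype=torch.long).unsqueeze(0)  # [1, seq_len]


class ToyMultilingualTTS(nn.Module):
    """Encoder/decoder pair whose language and speaker embedding tables are shared
    across languages, as described above (all sizes are arbitrary)."""

    def __init__(self, n_phonemes=len(GLOBAL_PHONEMES), n_langs=2, n_speakers=4,
                 dim=64, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, dim)
        self.lang_emb = nn.Embedding(n_langs, dim)        # shared across languages
        self.speaker_emb = nn.Embedding(n_speakers, dim)  # native and cross-lingual speakers
        self.encoder = nn.GRU(dim, dim, batch_first=True)
        self.decoder = nn.GRU(dim, dim, batch_first=True)
        self.to_mel = nn.Linear(dim, n_mels)

    def forward(self, phoneme_ids, lang_id, speaker_id):
        x = self.phoneme_emb(phoneme_ids) + self.lang_emb(lang_id)[:, None, :]
        encoded, _ = self.encoder(x)
        conditioned = encoded + self.speaker_emb(speaker_id)[:, None, :]
        decoded, _ = self.decoder(conditioned)
        return self.to_mel(decoded)  # synthesized speech representation (mel frames)


# Usage: same input text, two speaker conditions -> native and cross-lingual outputs.
tts = ToyMultilingualTTS()
phonemes = tokenize(transliterate("kate tuno"))          # arbitrary toy text
lang = torch.tensor([0])                                 # first language
native = tts(phonemes, lang, torch.tensor([0]))          # native speaker of language 0
cross_lingual = tts(phonemes, lang, torch.tensor([2]))   # native speaker of language 1
```

Sharing the phoneme, language, and speaker embedding tables in this way is what allows the same encoder/decoder to be conditioned on a speaker of the second language while synthesizing text of the first language.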
In some implementations, the operations further include: generating, using a variational autoencoder, a native audio encoder embedding for the native synthesized speech representation; generating, using the variational autoencoder, a cross-lingual audio encoder embedding for the cross-lingual synthesized speech representation; determining an adversarial loss term conditioned on the first language based on the native and cross-lingual audio encoder embeddings; and updating parameters of the multilingual TTS model based on the adversarial loss term.
The details of one or more implementations of the disclosure are set forth in the accompanying drawings and the description below. Other aspects, features, and advantages will be apparent from the description and drawings, and from the claims.
DESCRIPTION OF DRAWINGS
FIG. 1 is a schematic view of an example speech recognition system including a speech recognition model.
FIG. 2 is a schematic view of a Recurrent Neural Network-Transducer (RNN-T) model architecture.
FIG. 3 is a schematic view of an example training process for training a speech recognition model and/or a multilingual text-to-speech model.
FIG. 4 is a schematic view of an example training process for training a multilingual text-to-speech model.
FIG. 5 is a schematic view of a multilingual text-to-speech model training multiple speech recognition models.
FIG. 6 is a schematic view of an example speech recognition system.
FIG. 7 is a flowchart of an example arrangement of operations for a method of training an automated speech recognition model.
FIG. 8 is a schematic view of an example computing device that may be used to implement the systems and methods described herein.
Like reference symbols in the various drawings indicate like elements.
DETAILED DESCRIPTION
Vast amounts of transcribed data are needed to train automatic speech recognition (ASR) models. That is, ASR models require training data pairs tha