US-12626686-B1 - Cross-lingual voice cloning for low-resource languages
Abstract
Techniques for cross-lingual voice cloning for low-resource languages are provided. In an example method, a computing device receives a first speech sample, the first speech sample characterized by a first voice and spoken in a first language, and a text input in a second language. The computing device generates, using a first trained ML model trained to encode the text input to an encoded representation that characterizes the text input spoken in a second voice in the second language, the encoded representation of the text input. The computing device generates, using a second trained ML model trained to generate a spectrogram based on the encoded representation of the text input spoken in the second language, the spectrogram of the text input, the spectrogram characterized by the text input spoken in the first voice in the second language. The computing device generates an audio output based on the spectrogram.
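To make the data flow described above concrete, the following is a minimal sketch of the inference path suggested by the abstract and the claims: a text encoder for the low-resource (second) language produces an encoded representation, a separate spectrogram model turns that representation into a Mel spectrogram in the first speaker's voice, and a vocoder renders audio. All class names, tensor shapes, and the use of a speaker embedding to carry the first voice are illustrative assumptions, not details taken from the patent.

```python
# Minimal, hypothetical sketch of the claimed inference pipeline.
# Every model below is a random-output stand-in so the script runs end to end;
# names and shapes are assumptions, not the patent's implementation.
import numpy as np

class TextEncoder:
    """Stand-in for the first trained ML model: encodes text in the
    low-resource (second) language into an encoded representation."""
    def encode(self, text: str) -> np.ndarray:
        return np.random.randn(len(text), 256)         # (steps, features) placeholder

class SpectrogramDecoder:
    """Stand-in for the second trained ML model: maps the encoded text,
    conditioned on a speaker embedding, to a Mel spectrogram."""
    def decode(self, encoded: np.ndarray, speaker_embedding: np.ndarray) -> np.ndarray:
        frames = encoded.shape[0] * 4                   # assumed upsampling factor
        return np.random.randn(frames, 80)              # 80-bin Mel spectrogram placeholder

class Vocoder:
    """Stand-in for a vocoder (per claim 7, possibly GAN-based): Mel to waveform."""
    def synthesize(self, mel: np.ndarray) -> np.ndarray:
        hop_length = 256                                # assumed hop length
        return np.random.randn(mel.shape[0] * hop_length)

def embed_speaker(speech_sample: np.ndarray) -> np.ndarray:
    """Stand-in for a language-independent speaker encoder (per claim 3)."""
    return np.random.randn(192)

# Usage: clone the voice of the first speech sample onto second-language text.
first_speech_sample = np.random.randn(22050 * 3)        # ~3 s of reference audio
text_in_second_language = "example text in the second language"

speaker_embedding = embed_speaker(first_speech_sample)
encoded = TextEncoder().encode(text_in_second_language)
mel = SpectrogramDecoder().decode(encoded, speaker_embedding)
audio_output = Vocoder().synthesize(mel)                # waveform passed to an audio output device
```

In this sketch the speaker embedding, rather than a preset voice label, is what would let the second model render the low-resource text in the first speaker's voice; that conditioning choice is an assumption here, consistent with the embedded representation described in claims 3, 15, and 18.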
Inventors
- Dading Chong
- Dongyang Dai
- Xiao Song
- Chao Wang
- Sheng Yuan
Assignees
- Zoom Video Communications, Inc.
Dates
- Publication Date: 2026-05-12
- Application Date: 2023-07-25
Claims (20)
- 1. A method, comprising: receiving a first speech sample, the first speech sample characterized by a first voice and spoken in a first language; receiving a text input in a second language, wherein the second language is a low-resource language; generating, by a first trained machine learning (“ML”) model trained using low-resource training data, to encode the text input to an encoded representation that characterizes the text input spoken in a second voice in the second language, a first encoded representation of the text input; generating, by a second trained ML model trained using high-resource training data, to generate a spectrogram based on the first encoded representation of the text input spoken in the second language, a first spectrogram of the text input, the first spectrogram characterized by the text input spoken in the first voice in the second language, wherein the first trained ML model is different from the second trained ML model; and generating an audio output based on the spectrogram.
- 2. The method of claim 1, wherein training the first trained ML model to encode the text input to the encoded representation that characterizes the text input spoken in the second voice in the second language comprises: accessing a third trained ML model, wherein the third trained ML model is an automatic speech recognition encoder trained to generate the encoded representation; generating, by the third trained ML model, the encoded representation of the text input spoken in the second voice in the second language; and training the first trained ML model to generate the encoded representation generated by the third trained ML model.
- 3. The method of claim 1, further comprising generating an embedded representation of the first speech sample, wherein the embedded representation is generated by a third trained ML model, wherein the third trained ML model is trained to generate the embedded representation using language-independent training data.
- 4. The method of claim 1, wherein the first trained ML model is a transformer comprising one or more transformer blocks and one or more feed-forward blocks.
- 5. The method of claim 1, wherein the second trained ML model is a transformer comprising one or more transformer blocks and one or more feed-forward blocks.
- 6. The method of claim 1, wherein: the first spectrogram is a Mel spectrogram; and generating the audio output based on the first spectrogram comprises: receiving, by a vocoder, the Mel spectrogram; generating an audio waveform based on the Mel spectrogram; and playing back the audio waveform using an audio output device.
- 7. The method of claim 6, wherein the vocoder is a trained generative adversarial neural network.
- 8. The method of claim 1, wherein the text input comprises one or more characters, wherein the one or more characters include at least one of a grapheme or a phoneme.
- 9. A system comprising: one or more processors; and one or more computer-readable storage media storing instructions which, when executed by the one or more processors, cause the one or more processors to perform operations including: receiving a first speech sample, the first speech sample characterized by a first voice and spoken in a first language; receiving a text input in a second language, wherein the second language is a low-resource language; generating, by a first trained ML model trained using low-resource training data, to encode the text input to an encoded representation that characterizes the text input spoken in a second voice in the second language, a first encoded representation of the text input; generating, by a second trained ML model trained using high-resource training data, to generate a spectrogram based on the first encoded representation of the text input spoken in the second language, a first spectrogram of the text input, the first spectrogram characterized by the text input spoken in the first voice in the second language, wherein the first trained ML model is different from the second trained ML model; and generating an audio output based on the spectrogram.
- 10. The system of claim 9, wherein the first trained ML model is an automatic speech recognition encoder.
- 11. The system of claim 9, wherein the first trained ML model is a transformer comprising one or more transformer blocks and one or more feed-forward blocks.
- 12. The system of claim 9, wherein the second trained ML model is a transformer comprising one or more transformer blocks and one or more feed-forward blocks.
- 13. The system of claim 9, wherein: the first spectrogram is a Mel spectrogram; and generating the audio output based on the first spectrogram comprises: receiving, by a vocoder, the Mel spectrogram; generating an audio waveform based on the Mel spectrogram; and playing back the audio waveform using an audio output device.
- 14. The system of claim 13, wherein the vocoder is a trained generative adversarial neural network.
- 15. The system of claim 9, wherein the instructions further comprise generating an embedded representation of the first speech sample, wherein the embedded representation is generated by a third trained ML model, wherein the third trained ML model is trained to generate the embedded representation using language-independent training data.
- 16. A non-transitory computer-readable medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations including: receiving a first speech sample, the first speech sample characterized by a first voice and spoken in a first language; receiving a text input in a second language, wherein the second language is a low-resource language; generating, by a first trained machine learning (“ML”) model trained using low-resource training data, to encode the text input to an encoded representation that characterizes the text input spoken in a second voice in the second language, a first encoded representation of the text input; generating, by a second trained ML model trained using high-resource training data, to generate a spectrogram based on the first encoded representation of the text input spoken in the second language, a first spectrogram of the text input, the first spectrogram characterized by the text input spoken in the first voice in the second language, wherein the first trained ML model is different from the second trained ML model; and generating an audio output based on the spectrogram.
- 17. The non-transitory computer-readable medium of claim 16, wherein training the first trained ML model to encode the text input to the encoded representation that characterizes the text input spoken in the second voice in the second language comprises: accessing a third trained ML model, wherein the third trained ML model is an automatic speech recognition encoder trained to generate the encoded representation; generating, by the third trained ML model, the encoded representation of the text input spoken in the second voice in the second language; and training the first trained ML model to generate the encoded representation generated by the third trained ML model.
- 18. The non-transitory computer-readable medium of claim 16, further comprising generating an embedded representation of the first speech sample, wherein the embedded representation is generated by a third trained ML model, wherein the third trained ML model is trained to generate the embedded representation using language-independent training data.
- 19. The non-transitory computer-readable medium of claim 16, wherein: the first trained ML model is a first transformer comprising one or more first transformer blocks and one or more first feed-forward blocks; and the second trained ML model is a second transformer comprising one or more second transformer blocks and one or more second feed-forward blocks.
- 20. The non-transitory computer-readable medium of claim 16, wherein: the first spectrogram is a Mel spectrogram; and generating the audio output based on the first spectrogram comprises: receiving, by a vocoder, the Mel spectrogram, wherein the vocoder is a trained generative adversarial neural network; generating an audio waveform based on the Mel spectrogram; and playing back the audio waveform using an audio output device.
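Claims 2 and 17 describe how the first model (the text encoder) can be trained: a third, already-trained automatic speech recognition encoder produces target encoded representations, and the text encoder is trained to reproduce them. The sketch below is a minimal, hypothetical PyTorch rendering of that teacher-student setup; the tiny model definitions, the mean-squared-error objective, and the assumption that text tokens are frame-aligned with the speech features are illustrative choices, not details from the patent.

```python
# Hypothetical sketch of the teacher-student training in claims 2 and 17.
# A frozen ASR encoder (teacher) encodes low-resource speech; the text encoder
# (student) is trained to produce the same encoded representation from text.
import torch
import torch.nn as nn

FEATURE_DIM = 256

class AsrEncoder(nn.Module):
    """Stand-in for the third trained ML model (frozen ASR encoder teacher)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(80, FEATURE_DIM)    # maps Mel frames to features
    def forward(self, mel):                       # mel: (batch, frames, 80)
        return self.proj(mel)                     # (batch, frames, FEATURE_DIM)

class TextEncoder(nn.Module):
    """Stand-in for the first trained ML model (student text encoder)."""
    def __init__(self, vocab_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, FEATURE_DIM)
        self.ff = nn.Linear(FEATURE_DIM, FEATURE_DIM)
    def forward(self, tokens):                    # tokens: (batch, frames) ids
        return self.ff(self.embed(tokens))        # (batch, frames, FEATURE_DIM)

teacher = AsrEncoder().eval()
for p in teacher.parameters():
    p.requires_grad_(False)                       # the teacher stays fixed

student = TextEncoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
loss_fn = nn.MSELoss()

# One illustrative training step on fake, frame-aligned speech/text pairs.
mel = torch.randn(8, 100, 80)                     # fake low-resource speech features
tokens = torch.randint(0, 64, (8, 100))           # fake frame-aligned text tokens

target = teacher(mel)                             # encoded representation from the ASR teacher
pred = student(tokens)                            # student's encoding of the text input
loss = loss_fn(pred, target)                      # train the student to match the teacher
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Because the teacher's parameters are frozen, only the student text encoder is updated, which matches the claims' framing of the ASR encoder as a separately trained model whose encoded representations the first model learns to reproduce.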
Description
FIELD
The present application generally relates to speech synthesis, and more particularly relates to techniques for cross-lingual voice cloning for low-resource languages.

BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings, which are incorporated into and constitute a part of this specification, illustrate one or more certain examples and, together with the description of the example, serve to explain the principles and implementations of the certain examples.
- FIG. 1 shows an example system that provides videoconferencing functionality to various client devices.
- FIG. 2 shows an example system in which a video conference provider provides videoconferencing functionality to various client devices.
- FIG. 3 shows an example of a system for cross-lingual voice cloning for low-resource languages, according to some aspects of the present disclosure.
- FIG. 4 shows an example of a system for cross-lingual voice cloning for low-resource languages, according to some aspects of the present disclosure.
- FIG. 5 shows an example of a system for cross-lingual voice cloning for low-resource languages, according to some aspects of the present disclosure.
- FIGS. 6A-B show illustrations of example graphical user interfaces that may be used with a system for cross-lingual voice cloning for low-resource languages, according to some aspects of the present disclosure.
- FIG. 7 shows a flowchart of an example method 700 for cross-lingual voice cloning for low-resource languages.
- FIG. 8 shows an example computing device suitable for use in example systems or methods for cross-lingual voice cloning for low-resource languages, according to some aspects of the present disclosure.

DETAILED DESCRIPTION
Examples are described herein in the context of systems and methods for cross-lingual voice cloning for low-resource languages. Those of ordinary skill in the art will realize that the following description is illustrative only and is not intended to be in any way limiting. Reference will now be made in detail to implementations of examples as illustrated in the accompanying drawings. The same reference indicators will be used throughout the drawings and the following description to refer to the same or like items.

In the interest of clarity, not all of the routine features of the examples described herein are shown and described. It will, of course, be appreciated that in the development of any such actual implementation, numerous implementation-specific decisions must be made in order to achieve the developer's specific goals, such as compliance with application- and business-related constraints, and that these specific goals will vary from one implementation to another and from one developer to another.

Video conferencing is, by now, an omnipresent backdrop for much of personal and business communications. As video conferencing has become more prevalent, particularly with regard to remote, global teams, the associated technologies have grown to meet the challenges that have accompanied connecting diverse, multi-lingual teams from varied backgrounds and cultures. One group of technologies that has an increasing role to play in the context of cross-border video conferencing relates to speech synthesis. For example, text-to-speech (TTS) synthesis may be used to generate audible speech when participants are unable to speak or when speech is unavailable for some reason. Additional TTS applications include multilingual real-time translation, virtual assistants, expanded accessibility functionality, and so on.
In some cases, it may be desirable to synthesize speech using a particular voice. The voice of synthesized speech refers to the sound produced by a particular individual when speaking, encompassing both physical and prosodic characteristics. Physical traits include aspects like pitch and timbre, while prosodic elements include rhythm, pace, and intonation.

Consider the example of a video conference that provides multilingual real-time translation. During the video conference, a participant may speak in a first language. The speech may be transcribed, translated into a second language, and then synthesized using a TTS component. The TTS component may be configured to synthesize the translated speech using a synthetic “voice” provided by the TTS component. Such TTS components may include a machine learning (ML) model that has been trained to synthesize speech using training data that includes a variety of voices. Thus, the TTS component outputs synthetic speech according to a voice derived from the training data. In general, a TTS component may generate speech using one of a set of available voices corresponding to the training data used while training the TTS component. For example, a user of a TTS component may select a voice described as “high-pitched American male” or “low-pitched Japanese female.” However, some use cases call for the use of specific voices when performing TTS. The synthesis of speech using a specific