KR-20260065813-A - Controllable Diffusion-Based Speech Generation Model

KR20260065813A

Abstract

The systems and techniques described herein relate to a diffusion-based model for generating a transformed speech from a source speech based on a target speech. For example, a device may extract first prosodic data from input data and generate content embeddings based on the input data. The device may extract second prosodic data from a target speech, generate speaker embeddings from the target speech, and generate prosodic embeddings from the second prosodic data. The device may generate transformed prosodic data based on the first prosodic data and prosodic embeddings. Subsequently, the device may generate a transformed spectrogram based on the transformed prosodic data, speaker embeddings, and content embeddings.
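The first and second prosodic data mentioned in the abstract correspond, per the claims, to features such as fundamental frequency (F0) and energy. As a rough illustration only (the patent does not specify an extraction algorithm), the sketch below computes per-frame energy and a naive autocorrelation-based F0 estimate; the function name, frame sizes, and search range are assumptions for illustration:

```python
import numpy as np

def extract_prosody(signal, sr=16000, frame_len=1024, hop=256):
    """Illustrative prosody extraction: per-frame energy and a naive F0
    estimate taken from the peak of the frame's autocorrelation."""
    f0s, energies = [], []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        energies.append(float(np.sum(frame ** 2)))
        # Autocorrelation peak within a plausible pitch range (50-400 Hz).
        ac = np.correlate(frame, frame, mode="full")[frame_len - 1:]
        lo, hi = sr // 400, sr // 50
        lag = lo + int(np.argmax(ac[lo:hi]))
        f0s.append(sr / lag)
    return np.array(f0s), np.array(energies)

# A 200 Hz sine should yield roughly 200 Hz in every frame.
t = np.arange(16000) / 16000.0
f0, energy = extract_prosody(np.sin(2 * np.pi * 200 * t))
```

A production system would use a learned or more robust pitch tracker; this only conveys what "prosodic data" contains.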

Inventors

  • Byun, Kyungguen
  • Moon, Sunkuk
  • Visser, Erik

Assignees

  • Qualcomm Incorporated

Dates

Publication Date
2026-05-11
Application Date
2024-08-29
Priority Date
2023-10-25

Claims (20)

  1. A device for generating output speech from input data, the device comprising: one or more memories configured to store the input data; and one or more processors coupled to the one or more memories, the one or more processors configured to: extract first prosodic data from the input data; generate a content embedding based on the input data; extract second prosodic data from a target speech; generate a speaker embedding from the target speech; generate a prosodic embedding from the second prosodic data; and generate transformed prosodic data based on the first prosodic data and the prosodic embedding.
  2. The device of claim 1, wherein the input data comprises one or more of speech data or text data.
  3. The device of claim 2, wherein the input data comprises one of the speech data or the text data.
  4. The device of claim 1, wherein the first prosodic data comprises one or more of a fundamental frequency, an energy value, or a speed value.
  5. The device of claim 1, wherein the second prosodic data comprises one or more of a fundamental frequency, an energy value, or a speed value.
  6. The device of claim 1, wherein the one or more processors are configured to: generate a transformed spectrogram based on the transformed prosodic data, the speaker embedding, and the content embedding; and generate the transformed spectrogram via a decoder comprising a diffusion decoder or a non-diffusion decoder based on the transformed prosodic data, the speaker embedding, and the content embedding.
  7. The device of claim 6, wherein the one or more processors are configured to: generate a predicted global speaking rate based on the transformed prosodic data and generate a speaking rate for the transformed spectrogram via a rate-control engine; and generate transformed speech based on the input data via a vocoder.
  8. The device of claim 7, wherein the vocoder comprises a neural vocoder.
  9. The device of claim 6, wherein the one or more processors are configured to: extract the first prosodic data from the input data via a first prosody-extractor engine; generate the content embedding based on the input data via a content encoder; extract the second prosodic data from the target speech via a second prosody-extractor engine; generate the speaker embedding from the target speech via a speaker encoder; generate the prosodic embedding from the second prosodic data via a prosodic encoder; generate the transformed prosodic data based on the first prosodic data and the prosodic embedding via a prosody-conversion engine; and generate the transformed spectrogram via a decoder based on the transformed prosodic data, the speaker embedding, and the content embedding.
  10. The device of claim 9, wherein the device comprises the decoder, and wherein the decoder is configured to synthesize a speech spectrum conditioned on the content embedding, the speaker embedding, and the transformed prosodic data.
  11. The device of claim 7, wherein the rate-control engine is configured to manipulate the speaking rate based on the predicted global speaking rate.
  12. The device of claim 9, wherein the prosodic encoder is configured to generate the prosodic embedding at one or more of a frame level or a sentence level.
  13. The device of claim 12, wherein the device comprises the prosodic encoder, and wherein the prosodic encoder is configured to generate the prosodic embedding at the frame level to enable frame-level intonation control.
  14. The device of claim 7, wherein the one or more processors are configured to generate, via the rate-control engine, the speaking rate for the transformed spectrogram based on the transformed prosodic data, independent of an automatic speech recognition model.
  15. The device of claim 1, wherein the input data comprises speech data, and wherein the device further comprises one or more microphones configured to capture the speech data.
  16. The device of claim 1, further comprising one or more speakers configured to output speech data comprising the transformed prosodic data.
  17. A method for generating output speech from an input, the method comprising: extracting first prosodic data from input data; generating a content embedding based on the input data; extracting second prosodic data from a target speech; generating a speaker embedding from the target speech; generating a prosodic embedding from the second prosodic data; and generating transformed prosodic data based on the first prosodic data and the prosodic embedding.
  18. The method of claim 17, wherein the input data comprises one or more of speech data or text data.
  19. The method of claim 18, wherein the input data comprises one of the speech data or the text data.
  20. The method of claim 17, wherein the first prosodic data comprises one or more of a fundamental frequency, an energy value, or a speed value.
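The claimed method wires several components together: a content encoder, a speaker encoder, a prosodic encoder, and a prosody-conversion engine. The sketch below shows only the data flow between these stages; every module here is a placeholder linear projection standing in for a trained network, and all shapes, names, and the simple mean-shift conversion rule are illustrative assumptions, not the patented implementations:

```python
import numpy as np

rng = np.random.default_rng(0)

class LinearEncoder:
    """Placeholder for a learned encoder (content, speaker, or prosodic)."""
    def __init__(self, d_in, d_out):
        self.W = rng.standard_normal((d_in, d_out)) * 0.1
    def __call__(self, x):
        return x @ self.W

def convert_prosody(source_prosody, prosodic_emb):
    """Toy prosody-conversion engine: shift the source prosody toward
    statistics carried by the target's prosodic embedding."""
    return source_prosody + prosodic_emb.mean(axis=0, keepdims=True)

# Hypothetical shapes: 50 source / 40 target frames, 80 spectral bins,
# 3 prosodic features (F0, energy, speed), 64-dim embeddings.
source_feats = rng.standard_normal((50, 80))   # input data
target_feats = rng.standard_normal((40, 80))   # target speech
src_prosody = rng.standard_normal((50, 3))     # first prosodic data
tgt_prosody = rng.standard_normal((40, 3))     # second prosodic data

content_emb = LinearEncoder(80, 64)(source_feats)             # per-frame content
speaker_emb = LinearEncoder(80, 64)(target_feats).mean(axis=0)  # utterance-level speaker
prosodic_emb = LinearEncoder(3, 3)(tgt_prosody)               # frame-level prosody
converted = convert_prosody(src_prosody, prosodic_emb)        # transformed prosodic data
```

In the claimed device, the transformed prosodic data, speaker embedding, and content embedding would then condition the (diffusion or non-diffusion) decoder that produces the transformed spectrogram.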

Description

Controllable Diffusion-Based Speech Generation Model

The present disclosure generally relates to processing speech signals. For example, aspects of the present disclosure relate to a diffusion-based model for generating a transformed speech from a source speech based on a target speech (e.g., the transformed speech has the prosodic characteristics of the target speech but retains the same content as the source speech).

Diffusion-based speech conversion is a technique comprising encoder and decoder structures, in which the source speech is fed to an average-speech encoder to generate content embeddings, and the source and target speech are fed to a speaker encoder to generate speaker embeddings. The content and speaker embeddings are then fed to a diffusion decoder that synthesizes spectrograms based on condition vectors associated with those embeddings. This approach relies on general speaker characteristics and uses a single embedding vector for speech conversion.

Described herein are systems and techniques for providing a controllable diffusion-based speech generation model that introduces a transformation process providing additional controllability over the prosodic features of speech.

According to some aspects, an apparatus for generating output speech from input data is provided. The apparatus comprises one or more memories configured to store the input data and one or more processors coupled to the one or more memories. The one or more processors are configured to: extract first prosodic data from the input data; generate content embeddings based on the input data; extract second prosodic data from a target speech; generate speaker embeddings from the target speech; generate prosodic embeddings from the second prosodic data; and generate transformed prosodic data based on the first prosodic data and the prosodic embeddings.

In some aspects, a method for generating output speech from input data is provided.
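The diffusion decoder described above synthesizes spectrograms by iteratively denoising, conditioned on the embedding-derived condition vectors. The following is a minimal sketch of one ancestral (DDPM-style) reverse-diffusion step with a toy linear noise predictor; the step count, beta schedule, dimensions, and the `eps_model` stand-in are all assumptions, not details from the patent:

```python
import numpy as np

rng = np.random.default_rng(1)

def denoise_step(x_t, t, cond, eps_model, betas):
    """One reverse-diffusion step: predict the noise in the current
    spectrogram estimate given the condition vector, then remove it."""
    beta = betas[t]
    alpha = 1.0 - beta
    alpha_bar = np.prod(1.0 - betas[: t + 1])
    eps = eps_model(x_t, t, cond)
    mean = (x_t - beta / np.sqrt(1.0 - alpha_bar) * eps) / np.sqrt(alpha)
    if t > 0:  # add sampling noise on all but the final step
        mean = mean + np.sqrt(beta) * rng.standard_normal(x_t.shape)
    return mean

# Toy noise predictor: a fixed linear map of [x_t, cond], standing in for
# a trained network conditioned on content/speaker/prosody embeddings.
D, C, T_STEPS = 80, 16, 10
W = rng.standard_normal((D + C, D)) * 0.01
def eps_model(x_t, t, cond):
    return np.concatenate([x_t, cond]) @ W

betas = np.linspace(1e-4, 0.02, T_STEPS)
x = rng.standard_normal(D)        # start from Gaussian noise
cond = rng.standard_normal(C)     # concatenated condition embeddings
for t in reversed(range(T_STEPS)):
    x = denoise_step(x, t, cond, eps_model, betas)
```

The key point conveyed by the disclosure is that, unlike prior approaches conditioned only on content and speaker embeddings, conditioning also includes the transformed prosodic data.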
The method comprises: extracting first prosodic data from input data; generating a content embedding based on the input data; extracting second prosodic data from a target speech; generating a speaker embedding from the target speech; generating a prosodic embedding from the second prosodic data; generating transformed prosodic data based on the first prosodic data and the prosodic embedding; and generating a transformed spectrogram based on the transformed prosodic data, the speaker embedding, and the content embedding.

In some aspects, a non-transitory computer-readable medium is provided having instructions stored thereon that, when executed by one or more processors, cause the one or more processors to: extract first prosodic data from input data; generate content embeddings based on the input data; extract second prosodic data from a target speech; generate speaker embeddings from the target speech; generate prosodic embeddings from the second prosodic data; and generate transformed prosodic data based on the first prosodic data and the prosodic embeddings.

In some aspects, an apparatus is provided. The apparatus may include: means for extracting first prosodic data from input data; means for generating content embeddings based on the input data; means for extracting second prosodic data from a target speech; means for generating speaker embeddings from the target speech; means for generating prosodic embeddings from the second prosodic data; means for generating transformed prosodic data based on the first prosodic data and the prosodic embeddings; and means for generating transformed spectrograms based on the transformed prosodic data, the speaker embeddings, and the content embeddings.
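The claims further describe a rate-control engine that manipulates the speaking rate of the transformed spectrogram from a predicted global speaking rate, without relying on an automatic speech recognition model. One simple way to realize such time-scale manipulation is to resample spectrogram frames along the time axis; the function below is an illustrative sketch under that assumption, not the claimed mechanism:

```python
import numpy as np

def control_rate(spectrogram, predicted_rate, target_rate):
    """Illustrative rate control: resample frames along time so the output
    speaking rate matches target_rate. A factor > 1 slows speech down
    (more frames); no ASR model is involved."""
    factor = predicted_rate / target_rate
    n_in = spectrogram.shape[0]
    n_out = max(1, int(round(n_in * factor)))
    idx = np.linspace(0, n_in - 1, n_out).round().astype(int)
    return spectrogram[idx]

# 100 frames x 4 bins; halving the rate should double the frame count.
spec = np.arange(100 * 4, dtype=float).reshape(100, 4)
slower = control_rate(spec, predicted_rate=4.0, target_rate=2.0)
```

In the claimed pipeline the rate-adjusted spectrogram would then be passed to a vocoder (e.g., a neural vocoder) to produce the output waveform.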
In some aspects, one or more of the devices described herein can be, can be part of, and/or can include an extended reality (XR) device or system (e.g., a virtual reality (VR) device, an augmented reality (AR) device, or a mixed reality (MR) device), a mobile device or wireless communication device (e.g., a mobile phone or other mobile device), a wearable device (e.g., a network-connected watch or other wearable device), a camera, a personal computer, a laptop computer, a vehicle or a computing device or component of a vehicle, a server computer or server device (e.g., an edge or cloud-based server, a personal computer acting as a server device, a mobile device such as a mobile phone acting as a server device, an XR device acting as a server device, a vehicle acting as a server device, a network router, or another device acting as a server device), another device, or a combination thereof. In some aspects, the device includes a camera or multiple cameras for capturing one or more images. In some aspects, the device further includes a display for displaying one or more images, notifications, and/or other displayable data. In some aspects, the devices described above may include one or more sensors (e.g., one or more inertial measurement units (IMUs), such as one or more gyroscopes, …