
CN-121999758-A - Controllable voice generation method, device, equipment and storage medium

CN121999758A

Abstract

The application discloses a controllable speech generation method, apparatus, device, and storage medium, relating to the technical field of speech generation. The method comprises: extracting speech-characteristic embedding vectors from reference audio of the corresponding types using at least two speech-characteristic encoders, where each encoder is obtained through adversarial training with the extraction of a single type of speech-characteristic embedding vector as its training target; concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector to obtain a spliced vector; and inputting the spliced vector into a speech generation model to obtain synthesized audio from the model's output. By decoupling the speech characteristics and conditioning generation on multiple speech-characteristic embedding vectors, different characteristic dimensions of the synthesized audio can be controlled independently, improving the control capability and flexibility of controllable speech generation as well as the quality of the generated speech.

Inventors

  • DONG WANLI
  • HE RENQIANG
  • GAN WEIHAO

Assignees

  • 马栏山音视频实验室 (Malanshan Audio and Video Laboratory)

Dates

Publication Date
2026-05-08
Application Date
2026-02-10

Claims (10)

  1. A controllable speech generation method, comprising: extracting speech-characteristic embedding vectors from reference audio of the corresponding types using at least two speech-characteristic encoders, wherein each speech-characteristic encoder is obtained through adversarial training with the extraction of a single type of speech-characteristic embedding vector as its training target; concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector to obtain a spliced vector; and inputting the spliced vector into a speech generation model and obtaining synthesized audio according to the output of the speech generation model.
  2. The method of claim 1, wherein the speech-characteristic encoders comprise a timbre encoder, an emotion encoder, a prosody encoder, and a speech-rate encoder; and extracting speech-characteristic embedding vectors from reference audio of the corresponding types using at least two speech-characteristic encoders comprises: extracting a timbre embedding vector from timbre reference audio through the timbre encoder, extracting an emotion embedding vector from emotion reference audio through the emotion encoder, extracting a prosody embedding vector from prosody reference audio through the prosody encoder, and extracting a speech-rate embedding vector from speech-rate reference audio through the speech-rate encoder.
  3. The method of claim 1, wherein each speech-characteristic encoder is pre-trained using a gradient reversal layer, and the training process comprises: constructing an adversarial network from an encoder to be trained, a gradient reversal layer, a main-task classifier, and a domain classifier, wherein the gradient reversal layer reverses the gradient of the domain classifier's loss before propagating it into the encoder to be trained; and inputting training data into the adversarial network and updating the parameters of the encoder to be trained according to the gradient of the main-task classifier and the reversed gradient of the domain classifier until a training stop condition is reached, so as to obtain the speech-characteristic encoder.
  4. The method of claim 1, wherein concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector to obtain the spliced vector comprises: performing vector concatenation along the sequence dimension on all the speech-characteristic embedding vectors and the text-encoding embedding vector to obtain the spliced vector.
  5. The method of claim 1, wherein the training process of the speech generation model comprises: constructing an initial model from a large language model, flow matching, and a vocoder; acquiring audio samples and extracting features from them with the different speech-characteristic encoders to obtain embedding-vector samples; and concatenating the embedding-vector samples, inputting the spliced samples into the initial model, and training the model with the discrete speech codes of the audio samples as targets until a training stop condition is reached, so as to obtain the speech generation model.
  6. The method according to any one of claims 1 to 5, further comprising, before concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector: performing conflict detection to determine whether the attribute information carried by the different speech-characteristic embedding vectors conflicts; and if a conflict exists between different speech-characteristic embedding vectors, generating a conflict alarm, otherwise performing the concatenation of all the speech-characteristic embedding vectors with the text-encoding embedding vector.
  7. A controllable speech generation device, comprising: an embedding-vector extraction module for extracting speech-characteristic embedding vectors from reference audio of the corresponding types using at least two speech-characteristic encoders, wherein each speech-characteristic encoder is obtained through adversarial training with the extraction of a single type of speech-characteristic embedding vector as its training target; a vector concatenation module for concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector to obtain a spliced vector; and a speech generation module for inputting the spliced vector into a speech generation model and obtaining synthesized audio according to the output of the speech generation model.
  8. A controllable speech generation system comprising a speech generation model and at least two speech-characteristic encoders, the system being configured to perform the controllable speech generation method according to any one of claims 1 to 6.
  9. An electronic device, comprising: a memory for storing a computer program; and a processor for executing the computer program to implement the controllable speech generation method according to any one of claims 1 to 6.
  10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the controllable speech generation method according to any one of claims 1 to 6.
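The conflict detection of claim 6 can be illustrated with a minimal sketch. The patent does not specify how attribute information is represented; here we assume, purely for illustration, that each embedding carries a small dictionary of predicted attribute labels, and a conflict is any shared attribute on which two embeddings disagree:

```python
# Hypothetical sketch of the conflict check in claim 6. The attribute
# dictionaries and their keys ("arousal", "valence") are illustrative
# assumptions, not part of the patent.
def detect_conflicts(attrs_a: dict, attrs_b: dict) -> list:
    """Return the names of shared attributes on which the two
    embeddings' attribute information disagrees."""
    shared = attrs_a.keys() & attrs_b.keys()
    return sorted(k for k in shared if attrs_a[k] != attrs_b[k])

# An "angry" emotion embedding vs. a slow, low-energy prosody embedding:
emotion_attrs = {"arousal": "high", "valence": "negative"}
prosody_attrs = {"arousal": "low"}

conflicts = detect_conflicts(emotion_attrs, prosody_attrs)
if conflicts:
    # Per claim 6, a conflict alarm is raised instead of splicing.
    alarm = f"conflict alarm: embeddings disagree on {conflicts}"
```

If no conflict is found, the method proceeds to concatenate the embedding vectors as in claim 1.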

Description

Controllable voice generation method, device, equipment and storage medium

Technical Field

The present invention relates to the field of speech generation technologies, and in particular to a controllable speech generation method, device, apparatus, and storage medium.

Background

Controllable speech generation (controllable speech synthesis) refers to technology that can regulate various properties of generated speech during synthesis, so that the generated speech is more natural and meets specific requirements or scenarios. It is widely applied in film and television dubbing, intelligent voice assistants, education, entertainment, and other fields. In the related art, three main approaches are used for controllable speech generation:

1. Control via discrete labels. During model training, each piece of speech data is annotated with an explicit discrete emotion-category label; at synthesis time, the label is input to the model together with the text, and the output is controlled by specifying the emotion label. The drawback is that the emotion categories are predefined and limited: mixed or fine-grained emotions outside the categories cannot be generated, which limits expressiveness and flexibility.

2. Control via text prompts. The user describes the desired sound through natural-language prompt words or specific tags, and the model generates the corresponding voice according to the prompt, which allows great creativity. However, descriptions of sound in prompt words are subjective and vague, making the model's output unstable.

3. Zero-shot control. A style embedding is extracted from reference audio and used as a conditional input to guide the model to generate speech similar in style to the reference audio.
However, the controllability of the zero-shot approach is still relatively limited, and the flexibility of controllable speech generation needs to be improved.

Disclosure of Invention

In view of the above, the present invention aims to provide a controllable speech generation method, device, equipment, and storage medium that improve the control capability and flexibility of controllable speech generation. The specific scheme is as follows.

In one aspect, the application discloses a controllable speech generation method comprising the following steps: extracting speech-characteristic embedding vectors from reference audio of the corresponding types using at least two speech-characteristic encoders, wherein each speech-characteristic encoder is obtained through adversarial training with the extraction of a single type of speech-characteristic embedding vector as its training target; concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector to obtain a spliced vector; and inputting the spliced vector into a speech generation model and obtaining synthesized audio according to the output of the speech generation model.

Optionally, the speech-characteristic encoders comprise a timbre encoder, an emotion encoder, a prosody encoder, and a speech-rate encoder; and extracting speech-characteristic embedding vectors from reference audio of the corresponding types using at least two speech-characteristic encoders comprises: extracting a timbre embedding vector from timbre reference audio through the timbre encoder, extracting an emotion embedding vector from emotion reference audio through the emotion encoder, extracting a prosody embedding vector from prosody reference audio through the prosody encoder, and extracting a speech-rate embedding vector from speech-rate reference audio through the speech-rate encoder.
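The splicing step above can be sketched concretely. Assuming, for illustration only, that each of the four encoders emits a single embedding of width d and that the text encoder emits one embedding per token of the same width, concatenation along the sequence dimension (as specified in claim 4) looks like:

```python
import numpy as np

d = 4  # illustrative embedding width; the patent does not fix a dimension

# One hypothetical embedding per speech characteristic, each shaped (1, d):
timbre  = np.full((1, d), 0.1)
emotion = np.full((1, d), 0.2)
prosody = np.full((1, d), 0.3)
rate    = np.full((1, d), 0.4)

# Hypothetical text-encoding embeddings: 6 text tokens, same width d.
text = np.ones((6, d))

# Concatenate along the sequence (axis 0) dimension, not the feature axis,
# so the spliced sequence is fed to the generation model as extra tokens.
spliced = np.concatenate([timbre, emotion, prosody, rate, text], axis=0)
```

The spliced sequence here has 4 + 6 = 10 positions of width d, which the method then feeds into the speech generation model.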
Optionally, each speech-characteristic encoder is pre-trained using a gradient reversal layer, and its training process comprises the following steps: constructing an adversarial network from an encoder to be trained, a gradient reversal layer, a main-task classifier, and a domain classifier, wherein the gradient reversal layer reverses the gradient of the domain classifier's loss before propagating it into the encoder to be trained; and inputting training data into the adversarial network and updating the parameters of the encoder to be trained according to the gradient of the main-task classifier and the reversed gradient of the domain classifier until a training stop condition is reached, so as to obtain the speech-characteristic encoder.

Optionally, concatenating all the speech-characteristic embedding vectors with the text-encoding embedding vector to obtain the spliced vector comprises: performing vector concatenation along the sequence dimension on all the speech-characteristic embedding vectors and the text-encoding embedding vector to obtain the spliced vector.
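The gradient reversal layer described above can be sketched in isolation: it is the identity in the forward pass, and in the backward pass it flips the sign of the domain classifier's gradient (optionally scaled by a factor lambda), so the encoder learns features the domain classifier cannot exploit. The class name and the scaling factor below are illustrative assumptions:

```python
import numpy as np

class GradientReversal:
    """Minimal sketch of a gradient reversal layer (GRL).

    Forward: identity on the features.
    Backward: multiply the incoming gradient by -lam, so the encoder is
    updated *against* the domain classifier's objective while the
    main-task gradient passes through other paths unchanged.
    """
    def __init__(self, lam: float = 1.0):
        self.lam = lam

    def forward(self, x: np.ndarray) -> np.ndarray:
        return x  # features pass through unchanged

    def backward(self, grad_output: np.ndarray) -> np.ndarray:
        return -self.lam * grad_output  # reversed, scaled gradient


grl = GradientReversal(lam=0.5)
features = np.array([1.0, -2.0, 3.0])          # hypothetical encoder output
out = grl.forward(features)                    # identical to the input

# Gradient arriving from the domain classifier's loss:
grad_from_domain_clf = np.array([0.2, 0.4, -0.1])
grad_to_encoder = grl.backward(grad_from_domain_clf)  # sign-flipped
```

In a full framework implementation this sign flip would live in a custom autograd operation; the numpy version above only demonstrates the arithmetic.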