CN-116110416-B - Voice conversion method based on vocoder, storage medium and electronic equipment
Abstract
The application relates to the technical field of deep learning and natural language processing, and discloses a vocoder-based voice conversion method, a storage medium and an electronic device. The voice conversion method comprises the following steps: constructing an any-to-one voice conversion model; content-encoding the source voice audio; obtaining acoustic features through an acoustic model; obtaining converted voice through a vocoder; extracting the mean and variance of the target speaker; inputting the spectrum of the converted voice obtained through the vocoder into a feature extractor composed of a convolution module, a WaveNet module and a linear affine transformation to obtain a feature Z_source; converting Z_source into Z_target using the mean and variance of the target speaker; and inputting the feature Z_target into a UnivNet structure to obtain the converted voice.
Inventors
- SHENG LEYUAN
Assignees
- 杭州小影创新科技股份有限公司 (Hangzhou Xiaoying Innovation Technology Co., Ltd.)
Dates
- Publication Date: 2026-05-08
- Application Date: 2023-01-31
Claims (9)
- 1. A vocoder-based voice conversion method, comprising: acquiring an original voice, a voice of a target speaker and a voice data set; constructing an any-to-one voice conversion model, and training the any-to-one voice conversion model; inputting the original voice into the any-to-one voice conversion model to convert it into a target intermediary voice; inputting all the voices in the voice data set into the any-to-one voice conversion model to convert them into intermediary voices so as to construct a parallel data set; constructing a speaker encoding structure, and extracting a vector of the target speaker with the speaker encoding structure, wherein the vector comprises a mean and a variance; constructing a feature extractor, wherein the feature extractor comprises a convolution module, a WaveNet module and a linear affine transformation module, and inputting the spectrum of the target intermediary voice into the feature extractor to obtain a feature Z_source; converting the feature Z_source into a feature Z_target using the vector of the target speaker; and inputting the feature Z_target into the vocoder to obtain the converted target voice.
- 2. The vocoder-based voice conversion method of claim 1, wherein the any-to-one voice conversion model includes a content encoder, an acoustic model and a vocoder, the content encoder being configured to acquire the content of an input utterance while removing the speaker's information, the acoustic model being configured to extract acoustic features, and the vocoder being configured to convert the outputs of the content encoder and the acoustic model into the voice of the designated intermediary speaker.
- 3. The vocoder-based voice conversion method of claim 2, wherein converting the original voice input into the any-to-one voice conversion model into the target intermediary voice comprises the steps of: inputting the original voice into the content encoder to content-encode it, obtaining the spoken content while removing the original speaker's information; passing the original voice through the acoustic model to obtain its acoustic features; and inputting the acoustic features of the original voice into the vocoder to obtain the target intermediary voice.
- 4. The vocoder-based voice conversion method according to claim 2, wherein inputting all the voices in the voice data set into the any-to-one voice conversion model to convert them into intermediary voices comprises the steps of: inputting the voices in the voice data set one by one into the content encoder for content encoding; passing the voices in the voice data set one by one through the acoustic model to obtain their acoustic features; and inputting the acoustic features of the voices one by one into the vocoder to obtain the intermediary voice corresponding to each voice in the voice data set.
- 5. The vocoder-based voice conversion method of claim 1, wherein the voice data set comprises voice data of a plurality of different speakers.
- 6. The vocoder-based voice conversion method of claim 1, wherein the speaker encoding structure comprises a Conformer network model and an ECAPA-TDNN model, the Conformer network model being used to extract speaker features and the ECAPA-TDNN model being used for speaker recognition.
- 7. The vocoder-based voice conversion method of claim 1, wherein converting the feature Z_source into the feature Z_target using the vector of the target speaker comprises: denoting the target intermediary voice as source and the voice parallel to it as target; inputting source and target into the vocoder structure in pairs; after the feature-extraction network, processing source and target separately and then passing them through a shared network; and, during training, using the target feature obtained after the separate processing to guide the learning of the source feature, wherein Z_target = InstanceNorm(Z_source) * Std + Mean, Std denoting the target speaker's standard deviation (the square root of the variance extracted in claim 1) and Mean denoting the target speaker's mean.
- 8. A computer readable storage medium storing program code for execution by a device, the program code comprising steps for performing the method of any one of claims 1-7.
- 9. An electronic device comprising a processor, a memory, and a program or instruction stored in the memory and executable on the processor, wherein the program or instruction, when executed by the processor, implements the method of any one of claims 1-7.
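The conversion step in claim 7, Z_target = InstanceNorm(Z_source) * Std + Mean, can be illustrated with a minimal NumPy sketch. The array shapes, the per-channel normalization axis, and the scalar target statistics here are illustrative assumptions; the patent does not specify them.

```python
import numpy as np

def instance_norm(z, eps=1e-5):
    """Normalize each feature channel to zero mean and unit variance over time."""
    mean = z.mean(axis=-1, keepdims=True)
    std = z.std(axis=-1, keepdims=True)
    return (z - mean) / (std + eps)

def convert_features(z_source, target_mean, target_std):
    """Re-stylize source features with the target speaker's statistics:
    Z_target = InstanceNorm(Z_source) * Std + Mean."""
    return instance_norm(z_source) * target_std + target_mean

# Toy example: a feature map of 4 channels x 100 frames.
rng = np.random.default_rng(0)
z_src = rng.normal(loc=-1.0, scale=3.0, size=(4, 100))
z_tgt = convert_features(z_src, target_mean=2.0, target_std=0.5)
```

After conversion, `z_tgt` carries the target speaker's first- and second-order statistics (mean ≈ 2.0, standard deviation ≈ 0.5) regardless of the source statistics, which is the style-transfer effect the claim relies on.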
Description
Voice conversion method based on vocoder, storage medium and electronic equipment

Technical Field

The application relates to the technical field of deep learning and natural language processing, and in particular to a vocoder-based voice conversion method, a storage medium and an electronic device.

Background

With the wide application of deep learning in various fields, many tasks in the speech domain have developed rapidly, such as speech synthesis and speech conversion. Voice conversion converts one person's speech into the timbre of another person while keeping the content unchanged. Depending on whether the speakers appear in the training set, speech conversion can be broadly classified into any-to-one, many-to-many, any-to-many and any-to-any. Here, any refers to an arbitrary input speaker, one refers to one particular person, and many refers to a limited set of persons. An any-to-one model can convert any speaker into one particular person; converting to a different target speaker is not possible. Many-to-many typically means that the speakers in the training data can be converted into one another, but speakers outside the training set cannot. Any-to-many places no restriction on the input speaker, but the target speaker can only come from the training set. Any-to-any is the most difficult setting and can convert any person's timbre into that of any other person. The general structure of a speech conversion model is: 1. a content encoder, which encodes the input speech to obtain the spoken content and removes the speaker's information; 2. a speaker encoder, which also encodes the speech but acquires the speaker's information and removes the content information; 3. a decoder, which decodes the outputs of the content encoder and the speaker encoder to output specific acoustic features or voice waveforms.
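The three-module structure described above (content encoder, speaker encoder, decoder) can be sketched with toy linear stand-ins. Every dimension and module internal below is an illustrative assumption; the patent's actual networks (WaveNet, Conformer, ECAPA-TDNN, UnivNet) are far larger.

```python
import numpy as np

rng = np.random.default_rng(1)

class ContentEncoder:
    """Keeps frame-level linguistic content; a real model would also strip speaker identity."""
    def __init__(self, d_in, d_content):
        self.W = rng.normal(scale=0.1, size=(d_content, d_in))
    def __call__(self, spec):
        return self.W @ spec  # (d_content, frames)

class SpeakerEncoder:
    """Summarizes a reference utterance into one fixed-size speaker vector."""
    def __init__(self, d_in, d_spk):
        self.W = rng.normal(scale=0.1, size=(d_spk, d_in))
    def __call__(self, spec):
        return (self.W @ spec).mean(axis=-1)  # average over frames -> (d_spk,)

class Decoder:
    """Combines content frames with the speaker vector into acoustic features."""
    def __init__(self, d_content, d_spk, d_out):
        self.Wc = rng.normal(scale=0.1, size=(d_out, d_content))
        self.Ws = rng.normal(scale=0.1, size=(d_out, d_spk))
    def __call__(self, content, spk):
        # Broadcast the single speaker vector across all frames.
        return self.Wc @ content + (self.Ws @ spk)[:, None]

spec = rng.normal(size=(80, 120))            # 80 mel bins x 120 frames
content = ContentEncoder(80, 32)(spec)       # content from the source utterance
spk = SpeakerEncoder(80, 16)(spec)           # speaker vector from a reference utterance
out = Decoder(32, 16, 80)(content, spk)      # acoustic features, same time length as input
```

Conversion amounts to pairing the content of one utterance with the speaker vector of another before decoding.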
Voice cloning fine-tunes a trained speech synthesis model on one sentence or a small number of sentences from the target speaker, learning the target speaker's timbre characteristics so that arbitrary content can be spoken in the target speaker's voice. Voice cloning and voice conversion are alike in that both need one sentence or a small number of sentences from the target speaker as a reference to learn the target speaker's timbre. They differ in that the input of voice cloning is arbitrary text, while the input of voice conversion is the speech of a source speaker. The existing voice conversion technology route is: 1. the existing vocoder can only convert one person's acoustic features into the voice waveform of the same person, with the speaker and the spoken content kept unchanged during conversion; 2. current speech conversion systems must encode the input speech to remove the speaker's information, and at the same time encode the speech with the target timbre to extract the speaker's information. However, speaker information has no explicit features that can be removed and extracted; only a vector of hidden variables is obtained, and speaker extraction is significantly worse outside the training set. The existing speech cloning technology route is: 1. train a good speech synthesis model; 2. fine-tune the trained speech synthesis model on the target speaker to learn the target speaker's timbre. In an actual product, such retraining requires the user to wait and also places high demands on the deployment equipment.

Disclosure of Invention

The application aims to overcome the defects of the prior art and provides a vocoder-based voice conversion method, a storage medium and an electronic device.
In a first aspect, there is provided a vocoder-based voice conversion method, comprising: acquiring an original voice, a voice of a target speaker and a voice data set; constructing an any-to-one voice conversion model, and training the any-to-one voice conversion model; inputting the original voice into the any-to-one voice conversion model to convert it into a target intermediary voice; inputting all the voices in the voice data set into the any-to-one voice conversion model to convert them into intermediary voices so as to construct a parallel data set; constructing a speaker encoding structure, and extracting a vector of the target speaker with the speaker encoding structure, wherein the vector comprises a mean and a variance; constructing a feature extractor, and inputting the spectrum of the target intermediary voice into the feature extractor to obtain a feature Z_source; converting the feature Z_source into a feature Z_target using the vector of the target speaker; and inputting the feature Z_target into the vocoder to obtain the converted target voice. Furt