EP-4742235-A1 - METHOD FOR GENERATING VOICE CLONING MODEL, AND RELATED APPARATUS
Abstract
This application provides a voice cloning model generation method and a related apparatus. The method is applied to the voice cloning field, and includes: obtaining results, input by a user via a terminal device, of scoring a plurality of pieces of reference audio; performing training based on the plurality of pieces of reference audio and the results of scoring the plurality of pieces of reference audio by the user, to obtain an acoustic feedback unit, where the acoustic feedback unit is configured to measure auditory feeling of the user for different pieces of audio; obtaining a first voice data set that is input by the user via the terminal device; and training a voice cloning model based on the first voice data set and the acoustic feedback unit, to obtain a trained voice cloning model. In the process of training the voice cloning model, in consideration of the user's preference for different pieces of audio, the results of scoring the plurality of pieces of reference audio by the user are added to the training process. The voice cloning model obtained through training better matches the user's requirements, and a voice generated using the trained voice cloning model can better match the user's auditory feeling.
Inventors
- CHEN, Feiyang
- WANG, Zhefeng
- HUAI, Baoxing
- DAI, Zonghong
Assignees
- Huawei Cloud Computing Technologies Co., Ltd.
Dates
- Publication Date
- 20260513
- Application Date
- 20240429
Claims (19)
- A voice cloning model generation method, comprising: obtaining results, input by a user via a terminal device, of scoring a plurality of pieces of reference audio; performing training based on the plurality of pieces of reference audio and the results of scoring the plurality of pieces of reference audio by the user, to obtain an acoustic feedback unit, wherein the acoustic feedback unit is configured to measure auditory feeling of the user for different pieces of audio; obtaining a first voice data set that is input by the user via the terminal device; and training a voice cloning model based on the first voice data set and the acoustic feedback unit, and obtaining a trained voice cloning model.
- The method according to claim 1, wherein before the obtaining the results, input by the user via the terminal device, of scoring the plurality of pieces of reference audio, the method further comprises: obtaining feedback information input by the user via the terminal device, wherein the feedback information comprises one or more of an application scenario of the voice cloning model, an emotion category used for the voice cloning model, and a language generated by the voice cloning model; obtaining the plurality of pieces of reference audio from an audio library through filtering based on the feedback information; and sending the plurality of pieces of reference audio to the terminal device.
- The method according to claim 1 or 2, wherein the training the voice cloning model based on the first voice data set and the acoustic feedback unit comprises a plurality of rounds of iterative training, wherein in a current round of iterative training, the voice cloning model generates an optimized voice; and the optimized voice is input into the acoustic feedback unit, wherein the acoustic feedback unit scores the optimized voice to obtain a result of scoring the optimized voice, and the result of scoring the optimized voice is used as an input of the voice cloning model in a next round of iterative training, to influence the voice cloning model in generating an optimized voice in the next round, wherein in a 1st round of iterative training, the optimized voice is generated by the voice cloning model based on the first voice data set.
- The method according to claim 3, wherein in the current round of iterative training, the result of scoring the optimized voice is used as a parameter in a loss function of the voice cloning model in the next round of iterative training, to influence the loss function of the voice cloning model.
- The method according to any one of claims 1 to 4, wherein the voice cloning model is used in any one or more of an audiobook scenario, a virtual human field, or a video creation field.
- The method according to any one of claims 1 to 5, wherein after the obtaining the trained voice cloning model, the method further comprises: receiving, by the server, target information input by the user, wherein the target information comprises text; and inputting the target information into the trained voice cloning model, to generate a second voice.
- The method according to claim 6, wherein the target information exists in the form of any one or a combination of a document, a picture, and a slide.
- The method according to any one of claims 1 to 7, wherein the results of scoring the plurality of pieces of reference audio comprise results of scoring all of the plurality of pieces of reference audio in a plurality of dimensions, and the plurality of dimensions comprise two or more dimensions of timbre, voice prosody, pronunciation, and articulation.
- A voice cloning model generation apparatus, comprising: an obtaining module, configured to obtain results, input by a user via a terminal device, of scoring a plurality of pieces of reference audio; an acoustic feedback module, configured to perform training based on the plurality of pieces of reference audio and the results of scoring the plurality of pieces of reference audio by the user, to obtain an acoustic feedback unit, wherein the acoustic feedback unit is configured to measure auditory feeling of the user for different pieces of audio, wherein the obtaining module is configured to obtain a first voice data set that is input by the user via the terminal device; and a voice cloning module, configured to: train a voice cloning model based on the first voice data set and the acoustic feedback unit, and obtain a trained voice cloning model.
- The apparatus according to claim 9, wherein the obtaining module is further configured to obtain feedback information input by the user via the terminal device, wherein the feedback information comprises one or more of an application scenario of the voice cloning model, an emotion category used for the voice cloning model, and a language generated by the voice cloning model; a filtering module is configured to obtain the plurality of pieces of reference audio from an audio library through filtering based on the feedback information; and a sending module is configured to send the plurality of pieces of reference audio to the terminal device.
- The apparatus according to claim 9 or 10, wherein the training the voice cloning model based on the first voice data set and the acoustic feedback unit comprises a plurality of rounds of iterative training, wherein in a current round of iterative training, the voice cloning module is configured to generate an optimized voice; and the acoustic feedback module is configured to input the optimized voice into the acoustic feedback unit, wherein the acoustic feedback unit scores the optimized voice to obtain a result of scoring the optimized voice, and the result of scoring the optimized voice is used as an input of the voice cloning model in a next round of iterative training, to influence the voice cloning model in generating an optimized voice in the next round, wherein in a 1st round of iterative training, the optimized voice is generated by the voice cloning model based on the first voice data set.
- The apparatus according to claim 11, wherein in the current round of iterative training, the result of scoring the optimized voice is used as a parameter in a loss function of the voice cloning model in the next round of iterative training, to influence the loss function of the voice cloning model.
- The apparatus according to any one of claims 9 to 12, wherein the voice cloning model is used in any one or more of an audiobook scenario, a virtual human field, or a video creation field.
- The apparatus according to any one of claims 9 to 13, wherein the obtaining module is further configured to receive target information input by the user, wherein the target information comprises text; and the voice cloning module is further configured to input the target information into the trained voice cloning model, to generate a second voice.
- The apparatus according to claim 14, wherein the target information exists in the form of any one or a combination of a document, a picture, and a slide.
- The apparatus according to any one of claims 9 to 15, wherein the results of scoring the plurality of pieces of reference audio comprise results of scoring all of the plurality of pieces of reference audio in a plurality of dimensions, and the plurality of dimensions comprise two or more dimensions of timbre, voice prosody, pronunciation, and articulation.
- A computing device cluster, comprising at least one computing device, wherein the at least one computing device each comprises a processor and a memory, and the processor of the at least one computing device is configured to execute instructions stored in the memory of the at least one computing device, to enable the computing device cluster to perform the method according to any one of claims 1 to 8.
- A computer storage medium, comprising program instructions, wherein when the program instructions are executed on a computing device cluster, the computing device cluster is enabled to perform the method according to any one of claims 1 to 8.
- A computer program product, comprising program instructions, wherein when the program instructions are executed on a computing device cluster, the computing device cluster is enabled to perform the method according to any one of claims 1 to 8.
Description
This application claims priority to Chinese Patent Application No. 202310934184.8, filed with the China National Intellectual Property Administration on July 27, 2023 and entitled "VOICE CLONING METHOD AND RELATED APPARATUS", and to Chinese Patent Application No. 202311278704.0, filed with the China National Intellectual Property Administration on September 28, 2023 and entitled "VOICE CLONING MODEL GENERATION METHOD AND RELATED APPARATUS", both of which are incorporated herein by reference in their entireties.
TECHNICAL FIELD
This application relates to the voice cloning field, and in particular, to a voice cloning model generation method and a related apparatus.
BACKGROUND
In recent years, with the rapid development of industries such as virtual human, audiobook, and video creation, an increasing number of repetitive dubbing tasks are performed by synthesized voice. Voice cloning, as a voice synthesis technology for cloning the timbre, prosody, and style of a target speaker, meets such requirements for voice synthesis. Currently, a voice cloning system typically requires dozens or hundreds of pieces of recording data for cloning training, and a cloning engine learns the target speaker's pronunciation style, prosody, timbre, and other characteristics from the provided recording data. Although the timbre and speaking style of the cloned voice are basically consistent with those of the target speaker in the recording, the cloned voice often fails to satisfy the user's auditory feeling.
SUMMARY
This application provides a voice cloning model generation method and a related apparatus. A voice generated by using the voice cloning model generation method in this application can better match auditory feeling of a user, thereby improving user experience.
According to a first aspect, this application provides a voice cloning model generation method, including: obtaining results, input by a user via a terminal device, of scoring a plurality of pieces of reference audio; performing training based on the plurality of pieces of reference audio and the results of scoring the plurality of pieces of reference audio by the user, to obtain an acoustic feedback unit, where the acoustic feedback unit is configured to measure auditory feeling of the user for different pieces of audio; obtaining a first voice data set that is input by the user via the terminal device; and training a voice cloning model based on the first voice data set and the acoustic feedback unit, to obtain a trained voice cloning model.

In the process of training the voice cloning model, in consideration of the user's requirements and preferences, the results of scoring the plurality of pieces of reference audio by the user are added to the training process, so that the voice cloning model obtained through training can better meet the user's requirements. When used in a voice synthesis service scenario, the trained voice cloning model can better match the user's auditory feeling.

According to the first aspect, in a possible implementation, before the obtaining the results, input by the user via the terminal device, of scoring the plurality of pieces of reference audio, the method further includes: obtaining feedback information input by the user via the terminal device, where the feedback information includes one or more of an application scenario of the voice cloning model, an emotion category used for the voice cloning model, and a language generated by the voice cloning model; obtaining the plurality of pieces of reference audio from an audio library through filtering based on the feedback information; and sending the plurality of pieces of reference audio to the terminal device.
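The application does not fix a model architecture for the acoustic feedback unit. As a minimal illustrative sketch, assuming the unit is a simple ridge regressor mapping audio feature vectors to the user's scores (the class name, feature representation, and closed-form fit are all assumptions, not the application's formulation), training it on the scored reference audio could look like:

```python
import numpy as np

class AcousticFeedbackUnit:
    """Hypothetical sketch: a ridge regressor that maps audio feature
    vectors to predicted user scores. Illustrative only; the application
    does not specify the unit's internal model."""

    def __init__(self, reg: float = 1e-3):
        self.reg = reg   # ridge regularization strength
        self.w = None    # learned weights (set by train)

    def train(self, features: np.ndarray, scores: np.ndarray) -> None:
        # Closed-form ridge regression: w = (X^T X + reg*I)^-1 X^T y,
        # with a bias column appended to the feature matrix.
        X = np.hstack([features, np.ones((features.shape[0], 1))])
        n = X.shape[1]
        self.w = np.linalg.solve(X.T @ X + self.reg * np.eye(n), X.T @ scores)

    def score(self, feature: np.ndarray) -> float:
        # Predicted auditory-preference score for one piece of audio.
        x = np.append(feature, 1.0)
        return float(x @ self.w)
```

In this sketch, `features` would hold per-recording descriptors of the reference audio and `scores` the user's ratings; after training, `score()` estimates how the user would rate unseen audio, which is the role the acoustic feedback unit plays during model training.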
Further, the plurality of pieces of reference audio are obtained through filtering based on the feedback information input by the user. Therefore, based on the feedback information input by the user and the results of scoring the plurality of pieces of reference audio by the user, the voice cloning model obtained through training can better meet the user's requirements, and a voice generated by using the trained voice cloning model can better match the user's auditory feeling.

According to the first aspect, in a possible implementation, the training the voice cloning model based on the first voice data set and the acoustic feedback unit includes a plurality of rounds of iterative training. In a current round of iterative training, the voice cloning model generates an optimized voice; and the optimized voice is input into the acoustic feedback unit, where the acoustic feedback unit scores the optimized voice to obtain a result of scoring the optimized voice, and the result of scoring the optimized voice is used as an input of the voice cloning model in a next round of iterative training, to influence the voice cloning model in generating an optimized voice in the next round. In a 1st round of iterative training, the optimized voice is generated by the voice cloning model based on the first voice data set.
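The iterative loop described above, in which each round's feedback score influences the next round's loss, can be sketched with toy stand-ins. Everything here is an illustrative assumption: the scalar "model", the scoring rule, and the score-weighted loss are not the application's formulation, only a minimal demonstration of score-driven training.

```python
class ToyCloneModel:
    """Toy stand-in for the voice cloning model: a single scalar
    parameter nudged toward the target by gradient steps."""

    def __init__(self):
        self.param = 0.0

    def generate(self) -> float:
        # "Optimized voice" produced in the current round.
        return self.param

    def update(self, target: float, weight: float, lr: float = 0.1) -> None:
        # Gradient step on the weighted loss weight * (param - target)^2.
        grad = 2.0 * weight * (self.param - target)
        self.param -= lr * grad

def feedback_score(voice: float, preferred: float) -> float:
    # Toy acoustic feedback unit: score approaches 1 as the generated
    # voice approaches the user's preferred sound (assumed rule).
    return 1.0 / (1.0 + abs(voice - preferred))

def train_with_feedback(model, target, preferred, rounds=50):
    for _ in range(rounds):
        voice = model.generate()                  # optimized voice this round
        score = feedback_score(voice, preferred)  # feedback unit scores it
        # The score enters the next update as a loss weight (assumed form):
        # low scores increase the penalty, steering training toward
        # output the user prefers.
        weight = 2.0 - score
        model.update(target, weight)
    return model
```

In the first round the model generates output from its initial state alone, mirroring the claim that the 1st-round optimized voice is based only on the first voice data set; from the second round on, the score shapes each update.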