CN-121983022-A - Text-to-speech model training method and device, computer equipment and storage medium

CN 121983022 A

Abstract

The invention discloses a training method, apparatus, computer device, and storage medium for a text-to-speech model. The method comprises: obtaining a training data set that comprises a forgetting subset and a retention subset, where the forgetting subset contains speech samples of a target speaker and the retention subset contains speech samples of non-target speakers; and performing iterative update training on a forgetting model until a preset convergence condition is met, to obtain the target text-to-speech model. In this manner, the invention performs a heuristic update based on the retention subset to perceive changes in the model's performance on the retained samples, and jointly optimizes the forgetting loss and the retention loss, thereby effectively alleviating the gradient conflict between the forgetting task and the retention task. The invention can be applied to scenarios such as voiceprint verification in the financial field to prevent voice fraud, and voice-assisted services in the medical field to protect patient privacy.
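
For illustration only, the following is a minimal PyTorch-style sketch of one training iteration as summarized above and detailed in claims 1 and 2. It is not the patented implementation: the loss callables (`retention_loss_fn`, `forgetting_loss_fn`), the inner learning rate, and the loss weights are hypothetical placeholders.

```python
import torch

def unlearning_step(model, retain_batch, forget_batch,
                    retention_loss_fn, forgetting_loss_fn,
                    optimizer, inner_lr=1e-4, w_forget=1.0, w_retain=1.0):
    """One iteration of the forgetting-model update outlined in claim 1.

    Hypothetical sketch: the concrete losses, learning rate, and weights
    are placeholders, not values disclosed in the patent.
    """
    params = [p for p in model.parameters() if p.requires_grad]

    # Step 1: first retention loss on the current parameters.
    first_retain = retention_loss_fn(model, retain_batch)
    retain_grads = torch.autograd.grad(first_retain, params)

    # Step 2: heuristic update -- one gradient-descent step with the inner
    # learning rate yields the temporary parameters (claim 2).
    backup = [p.detach().clone() for p in params]
    with torch.no_grad():
        for p, g in zip(params, retain_grads):
            p.sub_(inner_lr * g)

    # Step 3: forgetting loss evaluated with the temporary parameters.
    forget_loss = forgetting_loss_fn(model, forget_batch)
    forget_grads = torch.autograd.grad(forget_loss, params)

    # Restore the current parameters before the joint update.
    with torch.no_grad():
        for p, b in zip(params, backup):
            p.copy_(b)

    # Step 4: second retention loss on the current parameters.
    second_retain = retention_loss_fn(model, retain_batch)
    second_grads = torch.autograd.grad(second_retain, params)

    # Step 5: update the current parameters from the gradient of the
    # forgetting loss and the gradient of the second retention loss.
    optimizer.zero_grad()
    for p, gf, gr in zip(params, forget_grads, second_grads):
        p.grad = w_forget * gf + w_retain * gr
    optimizer.step()

    return first_retain.item(), forget_loss.item(), second_retain.item()
```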

Inventors

  • WU JING
  • CHEN MINCHUAN

Assignees

  • 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2026-01-09

Claims (10)

  1. A text-to-speech model training method, the method comprising: obtaining a training data set, wherein the training data set comprises a forgetting subset and a retention subset, the forgetting subset comprises speech samples of a target speaker, and the retention subset comprises speech samples of non-target speakers; taking the parameters of a pre-trained original model as the current parameters of a preset forgetting model; and performing iterative update training on the forgetting model until a preset convergence condition is met, to obtain a target text-to-speech model, wherein the iterative update training comprises the following steps in each iteration: calculating a first retention loss based on the current parameters of the forgetting model and the retention subset, and heuristically updating the current parameters using the first retention loss to obtain temporary parameters; calculating a forgetting loss based on the temporary parameters and the forgetting subset, and calculating a second retention loss based on the current parameters and the retention subset, the first retention loss and the second retention loss being used to constrain the output features of the forgetting model on the retention subset to fit the timbre features of the non-target speakers; and updating the current parameters of the forgetting model according to the gradient of the forgetting loss and the gradient of the second retention loss, wherein the forgetting loss is used to constrain the output features of the forgetting model on the forgetting subset to deviate from the timbre features of the target speaker.
  2. The method of claim 1, wherein heuristically updating the current parameters using the first retention loss comprises: calculating a gradient of the first retention loss with respect to the current parameters; and performing one gradient-descent update on the current parameters using the gradient and a preset inner learning rate to obtain the temporary parameters.
  3. The method of claim 1, wherein calculating a forgetting loss based on the temporary parameters and the forgetting subset comprises: taking a text corresponding to a speech sample in the forgetting subset as input, and predicting with the forgetting model configured with the temporary parameters to obtain predicted speech features; acquiring target features, wherein the target features are generated by the original model, serving as a teacher model, based on the timbre features of non-target speakers; and calculating a difference between the predicted speech features and the target features as the forgetting loss.
  4. The method of claim 1, wherein before performing the iterative update training on the forgetting model, the method further comprises: calculating gradient magnitudes of the forgetting loss with respect to the forgetting model parameters; determining a median of the gradient magnitudes as a threshold; marking parameters whose gradient magnitude is greater than or equal to the threshold as key parameters, and setting the key parameters to a valid state in a weight saliency mask; and marking parameters whose gradient magnitude is smaller than the threshold as non-key parameters, and setting them to an invalid state in the weight saliency mask.
  5. The method of claim 4, wherein updating the current parameters of the forgetting model according to the gradient of the forgetting loss and the gradient of the second retention loss comprises: constructing a joint loss function, wherein the joint loss function is a weighted sum of the forgetting loss and the second retention loss; calculating the gradient of the joint loss function with respect to the current parameters to obtain an original update gradient; filtering the original update gradient using the weight saliency mask, retaining gradient values corresponding to the key parameters and setting gradient values corresponding to the non-key parameters to zero, to obtain a final update gradient; and updating the current parameters of the forgetting model using the final update gradient.
  6. The method of claim 3, wherein the forgetting model and the teacher model each employ a generative network architecture based on conditional flow matching, and wherein the first retention loss, the second retention loss, and the forgetting loss are flow matching losses for measuring differences between model-predicted velocity fields and target velocity fields.
  7. The method of claim 1, wherein after obtaining the target text-to-speech model, the method further comprises: acquiring a target text to be synthesized and a reference voice prompt; inputting the target text and the reference voice prompt into the target text-to-speech model; and generating and outputting a target speech using the target text-to-speech model; wherein, when the reference voice prompt belongs to the target speaker, the output target speech does not include timbre features recognizable as those of the target speaker.
  8. A text-to-speech model training device, comprising: a data acquisition module, configured to obtain a training data set, wherein the training data set comprises a forgetting subset and a retention subset, the forgetting subset comprises speech samples of a target speaker, and the retention subset comprises speech samples of non-target speakers; an initialization module, configured to take the parameters of a pre-trained original model as the current parameters of a preset forgetting model; and a training module, configured to perform iterative update training on the forgetting model until a preset convergence condition is met, to obtain a target text-to-speech model, wherein the iterative update training comprises the following steps in each iteration: calculating a first retention loss based on the current parameters of the forgetting model and the retention subset, and heuristically updating the current parameters using the first retention loss to obtain temporary parameters; calculating a forgetting loss based on the temporary parameters and the forgetting subset, and calculating a second retention loss based on the current parameters and the retention subset, the first retention loss and the second retention loss being used to constrain the output features of the forgetting model on the retention subset to fit the timbre features of the non-target speakers; and updating the current parameters of the forgetting model according to the gradient of the forgetting loss and the gradient of the second retention loss, wherein the forgetting loss is used to constrain the output features of the forgetting model on the forgetting subset to deviate from the timbre features of the target speaker.
  9. A computer device, characterized by comprising a memory storing a computer program and a processor which, when executing the computer program, implements the method according to any one of claims 1-7.
  10. A storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, implement the method according to any one of claims 1-7.
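
Claims 4 and 5 describe a weight saliency mask built from the gradient magnitudes of the forgetting loss, with the median magnitude as the threshold, which is then used to zero out updates to non-key parameters. The following is a minimal sketch of one way to realize this, assuming PyTorch and the hypothetical `forgetting_loss_fn` from the earlier sketch; the median-over-all-parameters reading of the threshold is an assumption.

```python
import torch

def build_saliency_mask(model, forget_batch, forgetting_loss_fn):
    """Mark parameters whose forgetting-loss gradient magnitude is at or
    above the median as key parameters (one reading of claim 4)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = forgetting_loss_fn(model, forget_batch)
    grads = torch.autograd.grad(loss, params)

    # Median gradient magnitude over all trainable parameters as threshold.
    magnitudes = torch.cat([g.abs().flatten() for g in grads])
    threshold = magnitudes.median()

    # Per-parameter mask: True = key parameter, False = non-key parameter.
    return [g.abs() >= threshold for g in grads]

def apply_saliency_mask(joint_grads, mask):
    """Keep gradient values of key parameters, zero the rest (claim 5)."""
    return [g * m.to(g.dtype) for g, m in zip(joint_grads, mask)]
```

In a full loop, the joint gradient (weighted sum of the forgetting gradient and the second retention gradient) would be passed through `apply_saliency_mask` before the parameter update, so that only the key parameters are modified.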

Description

Text-to-speech model training method and device, computer equipment and storage medium

Technical Field

The present invention relates to the fields of speech synthesis and financial technology, and in particular to a method and apparatus for training a text-to-speech model, a computer device, and a storage medium.

Background

Existing zero-shot text-to-speech (ZS TTS) systems extract a speaker's timbre (prosody and style) characteristics from an input voice prompt and can generate speech with a similar timbre without ever having seen that speaker's data. This approach achieves high naturalness and flexibility in personalized speech synthesis, but it also introduces serious privacy and compliance risks. In financial technology, speech synthesis is widely applied in scenarios such as intelligent customer service, voice broadcasting, voiceprint verification assistance, and personalized financial advisory. However, when the model encounters the voice sample of a sensitive individual at inference time, it indiscriminately synthesizes highly similar timbre outputs, leading to identity traceability and privacy leakage. In financial business in particular, if an attacker uses ZS TTS technology to impersonate the voice of a customer or a financial practitioner, serious financial security risks such as telecom fraud, identity fraud, and transaction fraud can result, conflicting with the strict requirements of financial regulation on customer privacy protection and data compliance. An ideal compliant model should completely lose the ability to imitate the timbre of a specific forgotten individual while preserving overall speech generation performance, ensuring that its output contains no information that can identify that individual. However, the anonymization, feature perturbation, speaker-representation filtering, and similar means commonly used in the prior art often cannot fully meet this requirement: even if some identity-related features are removed, an attacker can recover identifiable voiceprint features from the anonymized embedding space by means of model inversion, speech re-synthesis, targeted fine-tuning, and the like, so the privacy risk persists. Machine unlearning (MU) is a privacy-preserving research direction that aims to completely and quickly remove specific samples and their effects from trained models. In the ZS TTS field, MU faces a unique challenge: when a timbre that must be forgotten exists in the pre-training dataset, the model must ensure through unlearning that this timbre is no longer reproduced, while for out-of-domain timbres the pre-trained model has never seen, it must still acquire a generalized forgetting capability. Existing MU methods applied to ZS TTS cannot effectively balance forgetting a specific timbre against retaining overall generation performance, and they struggle to meet the dual requirements of high availability and strong privacy protection in the financial technology field: the model may lose speech quality after forgetting, or fail to thoroughly eliminate the privacy risk.
Disclosure of Invention

Embodiments of the invention provide a text-to-speech model training method, apparatus, computer device, and storage medium, aiming to solve the technical problems that existing zero-shot text-to-speech systems carry a risk of privacy leakage and that existing machine unlearning methods struggle to balance forgetting a specific speaker's timbre against retaining overall speech generation performance. In a first aspect, an embodiment of the invention provides a text-to-speech model training method, the method comprising: obtaining a training data set, where the training data set includes a forgetting subset and a retention subset, the forgetting subset includes speech samples of a target speaker, and the retention subset includes speech samples of non-target speakers; taking the parameters of a pre-trained original model as the current parameters of a preset forgetting model; and performing iterative update training on the forgetting model until a preset convergence condition is satisfied, to obtain a target text-to-speech model, where the iterative update training performs the following steps in each iteration: calculating a first retention loss based on the current parameters of the forgetting model and the retention subset, and heuristically updating the current parameters using the first retention loss to obtain temporary parameters; calculating a forgetting loss based on the temporary parameters and the forgetting subset, and calculating a second retention loss based on the current parameters and the retention subset, the first retention loss and the second retention loss being used to constrain the output features of the forgetting model on the retention subset to fit the timbre features of the non-target speakers.
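
Claims 3 and 6 indicate that the forgetting model and the teacher model use conditional flow matching, and that the forgetting loss measures the difference between the student's predicted velocity field on forgetting-subset text and a target velocity field produced by the frozen teacher conditioned on a non-target speaker's timbre. The sketch below is one plausible reading under standard conditional flow matching assumptions; the conditioning arguments, tensor shapes, and function names are illustrative and are not taken from the patent.

```python
import torch
import torch.nn.functional as F

def forgetting_loss_fm(student, teacher, mel, text_cond,
                       target_prompt, nontarget_prompt):
    """Flow-matching style forgetting loss (one reading of claims 3 and 6).

    Assumptions: `mel` is a batch of speech features shaped (B, n_mels, T);
    `student(x_t, t, text, prompt)` and `teacher(x_t, t, text, prompt)`
    return velocity fields; the student sees the target speaker's prompt
    while the teacher target is conditioned on a non-target timbre, which
    pushes the student's output away from the target speaker's timbre.
    """
    x1 = mel                                  # data endpoint of the path
    x0 = torch.randn_like(x1)                 # noise endpoint of the path
    t = torch.rand(x1.size(0), device=x1.device).view(-1, 1, 1)
    x_t = (1.0 - t) * x0 + t * x1             # point on the interpolation path

    with torch.no_grad():
        # Target velocity field from the frozen teacher, conditioned on a
        # NON-target speaker's timbre.
        v_target = teacher(x_t, t, text_cond, nontarget_prompt)

    # Student (forgetting model) velocity field at the same point.
    v_pred = student(x_t, t, text_cond, target_prompt)

    return F.mse_loss(v_pred, v_target)
```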