CN-122024693-A - Tone forgetting method and device, computer equipment and storage medium

Abstract

The application relates to the technical field of artificial intelligence and discloses a tone forgetting method, a tone forgetting device, computer equipment and a storage medium. The method comprises: acquiring source voice data and target voice data; extracting source semantic features from the source voice data; performing voice synthesis on candidate tone noise features and the source semantic features based on an original voice synthesis model to obtain noisy voice data; updating the candidate tone noise features based on the noisy voice data and the target voice data to obtain target tone noise features; constructing a tone forgetting data set based on the target tone noise features and a sample voice data set; and performing model adjustment on the original voice synthesis model to obtain a target voice synthesis model that does not contain the tone parameters, the target voice synthesis model being used for synthesizing voice that does not contain the target tone information. The method can be applied to voice synthesis scenarios in financial technology and medical health, and improves the accuracy of tone forgetting.

Inventors

CHEN MINCHUAN
WAN CHENCHEN
WANG SHAOJUN

Assignees

深圳平安通信科技有限公司

Dates

Publication Date: 20260512
Application Date: 20260326

Claims (10)

1. A tone forgetting method, comprising:
acquiring a noise training data set, wherein the noise training data set comprises a source voice text, source voice data and target voice data, the source voice data is synthesized based on the source voice text and contains target tone information, and the target voice data is synthesized based on the source voice text and does not contain the target tone information;
performing feature extraction on the source voice data to obtain source semantic features;
performing voice synthesis on preset candidate tone noise features and the source semantic features based on a pre-trained original voice synthesis model to obtain noisy voice data, wherein the original voice synthesis model contains tone parameters, and the tone parameters are obtained based on the target tone information;
performing feature updating on the candidate tone noise features based on the noisy voice data and the target voice data to obtain target tone noise features;
performing data set construction based on the target tone noise features and a preset sample voice data set to obtain a tone forgetting data set;
performing model adjustment on the original voice synthesis model based on the tone forgetting data set to obtain a target voice synthesis model, wherein the target voice synthesis model does not contain the tone parameters and is used for synthesizing voice that does not contain the target tone information; and
performing voice synthesis, through the target voice synthesis model, on pre-acquired original voice data and a pre-acquired original voice synthesis text to obtain synthesized voice data, wherein the synthesized voice data does not contain the target tone information.
2. The tone forgetting method of claim 1, wherein the tone forgetting data set includes basic sample data and noise sample data, and wherein performing model adjustment on the original voice synthesis model based on the tone forgetting data set to obtain a target voice synthesis model comprises:
performing model training on the original voice synthesis model based on the basic sample data to obtain reconstruction loss data;
performing model training on the original voice synthesis model based on the noise sample data to obtain tone forgetting loss data;
performing aggregation calculation on the reconstruction loss data and the tone forgetting loss data to obtain target loss data; and
performing parameter adjustment on the original voice synthesis model based on the target loss data to obtain the target voice synthesis model.
3. The tone forgetting method of claim 2, wherein the basic sample data comprises a basic sample voice and a basic sample text, the basic sample voice not containing the target tone information, and wherein performing model training on the original voice synthesis model based on the basic sample data to obtain reconstruction loss data comprises:
performing feature extraction on the basic sample voice to obtain basic tone features;
performing feature extraction on the basic sample text to obtain basic semantic features;
performing voice synthesis on the basic semantic features and the basic tone features through the original voice synthesis model to obtain basic synthesized voice; and
performing loss calculation based on the basic synthesized voice and the basic sample voice to obtain the reconstruction loss data.
4. The tone forgetting method of claim 2, wherein the noise sample data includes a noise sample voice, a noise sample text and the target tone noise features, and wherein performing model training on the original voice synthesis model based on the noise sample data to obtain tone forgetting loss data comprises:
performing feature extraction on the noise sample text to obtain noise semantic features;
performing voice synthesis on the noise semantic features and the target tone noise features through the original voice synthesis model to obtain noise synthesized voice; and
performing loss calculation based on the noise synthesized voice and the noise sample voice to obtain the tone forgetting loss data.
5. The tone forgetting method of claim 2, wherein performing data set construction based on the target tone noise features and a preset sample voice data set to obtain a tone forgetting data set comprises:
grouping the sample voice data set to obtain a first grouping data set and a second grouping data set;
taking the first grouping data set as the basic sample data; and
combining the target tone noise features with the second grouping data set to obtain the noise sample data.
6. The tone forgetting method of any one of claims 1 to 5, wherein performing feature updating on the candidate tone noise features based on the noisy voice data and the target voice data to obtain target tone noise features comprises:
performing similarity calculation on the noisy voice data and the target voice data to obtain synthesized audio similarity loss data; and
performing feature adjustment on the candidate tone noise features based on the synthesized audio similarity loss data to obtain the target tone noise features.
7. The tone forgetting method of any one of claims 1 to 5, wherein performing voice synthesis, through the target voice synthesis model, on the pre-acquired original voice data and the original voice synthesis text to obtain synthesized voice data comprises:
performing feature extraction on the original voice data to obtain original tone information;
performing feature extraction on the original voice synthesis text to obtain original semantic data; and
performing voice synthesis on the original semantic data and the original tone information based on the target voice synthesis model to obtain the synthesized voice data, wherein the synthesized voice data contains the original tone information and does not contain the target tone information.
8. A tone forgetting device, comprising:
a noise training data set acquisition module, configured to acquire a noise training data set, wherein the noise training data set comprises a source voice text, source voice data and target voice data, the source voice data is synthesized based on the source voice text and contains target tone information, and the target voice data is synthesized based on the source voice text and does not contain the target tone information;
a feature extraction module, configured to perform feature extraction on the source voice data to obtain source semantic features;
a noise-adding voice synthesis module, configured to perform voice synthesis on preset candidate tone noise features and the source semantic features based on a pre-trained original voice synthesis model to obtain noisy voice data, wherein the original voice synthesis model contains tone parameters, and the tone parameters are obtained based on the target tone information;
a noise feature updating module, configured to perform feature updating on the candidate tone noise features based on the noisy voice data and the target voice data to obtain target tone noise features;
a data set construction module, configured to perform data set construction based on the target tone noise features and a preset sample voice data set to obtain a tone forgetting data set;
a model adjustment module, configured to perform model adjustment on the original voice synthesis model based on the tone forgetting data set to obtain a target voice synthesis model, wherein the target voice synthesis model does not contain the tone parameters and is used for synthesizing voice that does not contain the target tone information; and
a target voice synthesis module, configured to perform voice synthesis, through the target voice synthesis model, on pre-acquired original voice data and a pre-acquired original voice synthesis text to obtain synthesized voice data, wherein the synthesized voice data does not contain the target tone information.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the tone forgetting method according to any one of claims 1 to 7.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the tone forgetting method according to any one of claims 1 to 7.
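
The steps recited in claims 2, 5 and 6 can be illustrated with a small Python sketch. It is a toy under stated assumptions, not the applicant's implementation: a linear function (`synthesize`) stands in for the voice synthesis model, mean squared error stands in for the similarity and loss calculations, the grouping is a simple half split, and the aggregation is a weighted sum. All function names, parameters and sample values are hypothetical, and each sample is assumed to carry already-extracted features.

```python
# Toy sketch only: a linear function stands in for the voice synthesis model.

def synthesize(params, semantic, timbre):
    """Hypothetical stand-in for the synthesis model: mixes two feature vectors."""
    return [params["w_sem"] * s + params["w_timbre"] * t
            for s, t in zip(semantic, timbre)]


def mse(a, b):
    """Mean squared error, used here as an assumed similarity/loss measure."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def optimize_noise_feature(params, source_semantic, target_voice,
                           candidate, lr=1.0, steps=200):
    """Claim 6 sketch: adjust the candidate tone noise feature by gradient
    descent so the noisy synthesis moves toward the target voice data."""
    noise = list(candidate)
    n = len(noise)
    for _ in range(steps):
        noisy = synthesize(params, source_semantic, noise)
        # Analytic gradient of the MSE with respect to each noise element.
        grads = [2.0 * (o - t) * params["w_timbre"] / n
                 for o, t in zip(noisy, target_voice)]
        noise = [v - lr * g for v, g in zip(noise, grads)]
    return noise


def build_forgetting_dataset(samples, target_noise):
    """Claim 5 sketch: split the samples into two groups; the first group is
    the basic sample data, the second is paired with the target noise feature."""
    half = len(samples) // 2
    base = samples[:half]
    noise_samples = [dict(s, timbre=target_noise) for s in samples[half:]]
    return base, noise_samples


def target_loss(params, base, noise_samples, forget_weight=1.0):
    """Claim 2 sketch: aggregate reconstruction and forgetting losses.
    A weighted sum is assumed; the claims only recite an aggregation."""
    recon = sum(mse(synthesize(params, s["semantic"], s["timbre"]), s["voice"])
                for s in base) / len(base)
    forget = sum(mse(synthesize(params, s["semantic"], s["timbre"]), s["voice"])
                 for s in noise_samples) / len(noise_samples)
    return recon + forget_weight * forget


# Made-up example values.
params = {"w_sem": 1.0, "w_timbre": 0.5}
source_semantic = [0.2, 0.4, 0.6]
target_voice = [0.1, 0.3, 0.5]
target_noise = optimize_noise_feature(
    params, source_semantic, target_voice, candidate=[0.0, 0.0, 0.0])

samples = [{"semantic": [0.1 * k, 0.2 * k, 0.3 * k],
            "timbre": [0.0, 0.0, 0.0],
            "voice": [0.1 * k, 0.2 * k, 0.3 * k]} for k in range(1, 5)]
base, noise_samples = build_forgetting_dataset(samples, target_noise)
loss = target_loss(params, base, noise_samples)
```

In a real system the parameter adjustment of claim 2 would back-propagate this aggregated loss through the synthesis model. The sketch stops at computing the loss so that it stays self-contained.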

Description

Tone forgetting method and device, computer equipment and storage medium

Technical Field

The application relates to the technical fields of artificial intelligence and natural language processing, is applicable to financial technology and medical health scenarios, and in particular relates to a tone forgetting method, a tone forgetting device, computer equipment and a storage medium.

Background

A tone forgetting method for a voice synthesis model can be used to eliminate the tone characteristics of a specific speaker from a pre-trained voice synthesis model. Tone forgetting applies to many scenarios. In a financial technology scenario, for example, tone forgetting can be performed on a voice synthesis model that synthesizes intelligent customer service voice, so that the tone expression of the model can be adjusted quickly. In a medical health scenario, tone forgetting can be performed on a voice synthesis model that synthesizes medical robot voice, so that the medical robot can dynamically adjust its tone according to task requirements.

At present, tone forgetting of a voice synthesis model is mainly realized through model fine-tuning, that is, the pre-trained voice synthesis model is fine-tuned on other voice data sets. In practical application, however, the inventors found that, because the voice synthesis model has deeply learned the original tone characteristics in the pre-training stage, fine-tuning may result in incomplete forgetting: the voice synthesis model can still imitate the original tone to a certain extent, which affects the accuracy of tone forgetting.

Disclosure of Invention

The application provides a tone forgetting method, a tone forgetting device, computer equipment and a storage medium, which solve the technical problem that a pre-trained voice synthesis model cannot accurately forget an original tone, and improve the accuracy of tone forgetting of the voice synthesis model.

In a first aspect, a tone forgetting method is provided, including:
acquiring a noise training data set, wherein the noise training data set comprises a source voice text, source voice data and target voice data, the source voice data is synthesized based on the source voice text and contains target tone information, and the target voice data is synthesized based on the source voice text and does not contain the target tone information;
performing feature extraction on the source voice data to obtain source semantic features;
performing voice synthesis on preset candidate tone noise features and the source semantic features based on a pre-trained original voice synthesis model to obtain noisy voice data, wherein the original voice synthesis model contains tone parameters, and the tone parameters are obtained based on the target tone information;
performing feature updating on the candidate tone noise features based on the noisy voice data and the target voice data to obtain target tone noise features;
performing data set construction based on the target tone noise features and a preset sample voice data set to obtain a tone forgetting data set;
performing model adjustment on the original voice synthesis model based on the tone forgetting data set to obtain a target voice synthesis model, wherein the target voice synthesis model does not contain the tone parameters and is used for synthesizing voice that does not contain the target tone information; and
performing voice synthesis, through the target voice synthesis model, on pre-acquired original voice data and a pre-acquired original voice synthesis text to obtain synthesized voice data, wherein the synthesized voice data does not contain the target tone information.

In a second aspect, a tone forgetting device is provided, comprising:
The noise training data set acquisition module is used for acquiring a noise training data set, wherein the noise training data set comprises a source voice text, source voice data and target voice data, the source voice data is synthesized based on the source voice text and contains target tone information, and the target voice data is synthesized based on the source voice text and does not contain the target tone information;
The feature extraction module is used for performing feature extraction on the source voice data to obtain source semantic features;
The noise-adding voice synthesis module is used for performing voice synthesis on preset candidate tone noise features and the source semantic features based on a pre-trained original voice synthesis model to obtain noisy voice data, wherein the original voice synthesis model contains tone parameters, and the tone parameters are obtained based on the target tone information; T