CN-116645956-B - Speech synthesis method, speech synthesis system, electronic device, and storage medium

Abstract

Embodiments of the present application provide a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium, belonging to the technical field of financial technology. The method comprises: performing character adjustment on an acquired sample text sequence according to a character adjustment sub-model of an original speech synthesis model to obtain an initial sample variable sequence; performing speech synthesis on the initial sample variable sequence according to an initial speech prediction sub-model to determine a first predicted speech; performing parameter adjustment on the initial speech prediction sub-model according to the first predicted speech to obtain a candidate speech synthesis model; performing character screening on the initial sample variable sequence according to the first predicted speech to obtain a target sample variable sequence; inputting the target sample variable sequence into the candidate speech synthesis model to obtain a second predicted speech; determining a target speech synthesis model according to the second predicted speech and the candidate speech synthesis model; and inputting a target text sequence into the target speech synthesis model for speech synthesis to obtain a target synthesized speech. Embodiments of the application can generate synthesized speech with natural expression and fluent semantics.

Inventors

GUO YANG
WANG JIANZONG
CHENG NING

Assignees

Ping An Technology (Shenzhen) Co., Ltd. (平安科技（深圳）有限公司)

Dates

Publication Date: 20260505
Application Date: 20230616

Claims (8)

1. A speech synthesis method, the method comprising: acquiring a sample text sequence and a sample speech of the sample text sequence, wherein the sample text sequence comprises sample initial characters; inputting the sample text sequence into a preset original speech synthesis model, wherein the original speech synthesis model comprises a character adjustment sub-model and an initial speech prediction sub-model; performing text character adjustment on the sample text sequence according to the character adjustment sub-model to obtain an initial sample variable sequence, wherein the initial sample variable sequence comprises a sample initial variable character, a sample candidate character and a character combination unit, and the sample initial variable character is obtained by performing random character extraction on the sample text sequence; performing speech synthesis processing on the initial sample variable sequence according to the initial speech prediction sub-model to obtain a first predicted speech; performing spectrum loss calculation according to the first predicted speech and the sample speech to obtain prediction loss data, performing partial derivative calculation on the sample candidate characters according to the prediction loss data to obtain character variable data, performing mean calculation on the character variable data according to the number of sample text sequences to obtain character measurement data, comparing a preset character measurement threshold with the character measurement data to obtain a measurement comparison result, performing character screening on the sample candidate characters according to the measurement comparison result to obtain sample target characters, and performing parameter adjustment on the initial speech prediction sub-model according to the sample initial characters, the sample initial variable characters and the sample target characters to obtain a candidate speech synthesis model; performing character screening on the initial sample variable sequence according to the first predicted speech and the sample speech to obtain a target sample variable sequence; inputting the target sample variable sequence into the candidate speech synthesis model for speech synthesis processing to obtain a second predicted speech; performing parameter adjustment on the candidate speech synthesis model according to the second predicted speech and the sample speech to obtain a target speech synthesis model; and inputting an acquired target text sequence into the target speech synthesis model for speech synthesis processing to obtain a target synthesized speech.
2. The method of claim 1, wherein performing parameter adjustment on the initial speech prediction sub-model according to the sample initial characters, the sample initial variable characters and the sample target characters to obtain the candidate speech synthesis model comprises: performing character judgment on the sample target characters to obtain sample target variable characters; performing false discovery calculation according to the sample initial characters, the sample initial variable characters, the sample target characters and the sample target variable characters to obtain character false discovery data; comparing the character false discovery data with a preset false discovery threshold to obtain a false discovery comparison result; and performing parameter adjustment on the initial speech prediction sub-model according to the false discovery comparison result to obtain the candidate speech synthesis model.
3. The method according to claim 2, wherein performing parameter adjustment on the initial speech prediction sub-model according to the false discovery comparison result to obtain the candidate speech synthesis model comprises: if the false discovery comparison result indicates that the character false discovery data is greater than the preset false discovery threshold, performing the text character adjustment on the sample text sequence according to the character adjustment sub-model again to update the initial sample variable sequence; and performing parameter adjustment on the initial speech prediction sub-model according to the updated initial sample variable sequence to obtain the candidate speech synthesis model.
4. The method of claim 1, wherein performing character screening on the initial sample variable sequence according to the first predicted speech and the sample speech to obtain the target sample variable sequence comprises: performing character screening on the initial sample variable sequence according to the first predicted speech and the sample speech to obtain a sample screening sequence, wherein the sample screening sequence comprises the sample target characters; performing character recognition on the sample target characters to obtain a character recognition result; and if the character recognition result indicates that the currently recognized sample target character is a sample initial variable character, removing the currently recognized sample target character to obtain the target sample variable sequence.
5. The method according to any one of claims 1 to 4, wherein the initial speech prediction sub-model comprises a speech character encoding layer, an attention layer, a linear projection layer, a post-processing layer, and a prediction output layer, and performing speech synthesis processing on the initial sample variable sequence according to the initial speech prediction sub-model to obtain the first predicted speech comprises: performing speech character encoding on the initial sample variable sequence according to the speech character encoding layer to obtain speech sample encoding features; extracting context features from the speech sample encoding features according to the attention layer to obtain a first sample current-step context feature; concatenating the first sample current-step context feature with a preset Mel spectrum to obtain a context feature to be processed; inputting the context feature to be processed into a preset two-layer long short-term memory (LSTM) layer for context feature prediction to obtain a second sample current-step context feature; performing linear projection on the second sample current-step context feature according to the linear projection layer to obtain current-step projection scalar data; performing spectrum update processing on the second sample current-step context feature according to the post-processing layer to update the preset Mel spectrum; and performing speech synthesis processing on the current-step projection scalar data according to the prediction output layer to obtain the first predicted speech.
6. A speech synthesis system, the system comprising: a text acquisition module, configured to acquire a sample text sequence and a sample speech of the sample text sequence, wherein the sample text sequence comprises sample initial characters; a model input module, configured to input the sample text sequence into a preset original speech synthesis model, wherein the original speech synthesis model comprises a character adjustment sub-model and an initial speech prediction sub-model; a text character adjustment module, configured to perform text character adjustment on the sample text sequence according to the character adjustment sub-model to obtain an initial sample variable sequence, specifically by performing random character extraction on the sample text sequence to obtain sample initial variable characters; a first speech prediction module, configured to perform speech synthesis processing on the initial sample variable sequence according to the initial speech prediction sub-model to obtain a first predicted speech; a first parameter adjustment module, configured to perform spectrum loss calculation according to the first predicted speech and the sample speech to obtain prediction loss data, perform partial derivative calculation on the sample candidate characters according to the prediction loss data to obtain character variable data, perform mean calculation on the character variable data according to the number of sample text sequences to obtain character measurement data, compare a preset character measurement threshold with the character measurement data to obtain a measurement comparison result, perform character screening on the sample candidate characters according to the measurement comparison result to obtain sample target characters, and perform parameter adjustment on the initial speech prediction sub-model according to the sample initial characters, the sample initial variable characters and the sample target characters to obtain a candidate speech synthesis model; a character screening module, configured to perform character screening on the initial sample variable sequence according to the first predicted speech and the sample speech to obtain a target sample variable sequence; a second speech prediction module, configured to input the target sample variable sequence into the candidate speech synthesis model for speech synthesis processing to obtain a second predicted speech; a second parameter adjustment module, configured to perform parameter adjustment on the candidate speech synthesis model according to the second predicted speech and the sample speech to obtain a target speech synthesis model; and a target speech synthesis module, configured to input an acquired target text sequence into the target speech synthesis model for speech synthesis processing to obtain a target synthesized speech.
7. An electronic device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the method of any one of claims 1 to 5 when executing the computer program.
8. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 5.
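
The screening step recited in claims 1 and 4 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes the partial derivatives of the spectrum loss with respect to each candidate character have already been computed elsewhere (for example by an automatic differentiation framework), that the per-character score is the mean absolute gradient over the sample text sequences, and that a character is kept when its score exceeds the threshold. The function names, the use of absolute values, and the direction of the comparison are all assumptions made for illustration.

```python
import numpy as np


def screen_candidates(char_gradients, threshold):
    """Select candidate characters by their mean gradient magnitude.

    char_gradients has shape (num_sequences, num_candidates); entry [s, j] is
    the partial derivative of the spectrum loss for sample text sequence s
    with respect to candidate character j (computed elsewhere).
    """
    grads = np.asarray(char_gradients, dtype=float)
    # Average over the number of sample text sequences: one score per character.
    measure = np.abs(grads).mean(axis=0)
    # Assumption: a character is kept when its score exceeds the threshold.
    selected = np.flatnonzero(measure > threshold).tolist()
    return measure, selected


def drop_variable_characters(selected, variable_indices):
    """Remove selected characters that are randomly extracted variable characters."""
    variable = set(variable_indices)
    return [j for j in selected if j not in variable]


# Two sample text sequences, four candidate characters (hypothetical values).
grads = [[0.9, 0.1, -0.7, 0.05],
         [1.1, -0.1, -0.5, 0.15]]
measure, selected = screen_candidates(grads, threshold=0.5)
# Assume candidates 2 and 3 are the randomly extracted variable characters.
target = drop_variable_characters(selected, variable_indices=[2, 3])
print(selected, target)  # [0, 2] [0]
```

In this toy run, candidates 0 and 2 pass the threshold, and candidate 2 is then removed because it is a variable character rather than an original one.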

Description

Speech synthesis method, speech synthesis system, electronic device, and storage medium

Technical Field

The present application relates to the technical field of financial technology, and in particular to a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium.

Background

With the rapid development of financial technology and of the economy, people increasingly expect a higher level of service from banks. In scenarios such as intelligent customer service, multi-turn dialogue, and outbound robot calls, conveying relevant information to a target object through natural and semantically accurate speech is one of the most effective and direct ways to improve customer experience and service level. Currently, deep-learning-based speech synthesis systems typically synthesize speech through a vocoder. However, when synthesizing speech for text with large semantic variation and variable length, related-art speech synthesis methods cannot accurately capture the different contextual relations in the text, which reduces the accuracy of the vocoder and produces synthesized speech with unnatural expression and disfluent semantics. Therefore, how to improve the accuracy of text-to-speech prediction and generate synthesized speech with natural expression and fluent semantics has become a technical problem to be solved.

Disclosure of Invention

The main purpose of the embodiments of the present application is to provide a speech synthesis method, a speech synthesis system, an electronic device, and a storage medium that can improve the accuracy of text-to-speech prediction and generate synthesized speech with natural expression and fluent semantics.
To achieve the above object, a first aspect of the embodiments of the present application provides a speech synthesis method, including: acquiring a sample text sequence and a sample speech of the sample text sequence; inputting the sample text sequence into a preset original speech synthesis model, wherein the original speech synthesis model comprises a character adjustment sub-model and an initial speech prediction sub-model; performing text character adjustment on the sample text sequence according to the character adjustment sub-model to obtain an initial sample variable sequence; performing speech prediction processing on the initial sample variable sequence according to the initial speech prediction sub-model to obtain a first predicted speech; performing parameter adjustment on the initial speech prediction sub-model according to the first predicted speech and the sample speech to obtain a candidate speech synthesis model; performing character screening on the initial sample variable sequence according to the first predicted speech and the sample speech to obtain a target sample variable sequence; inputting the target sample variable sequence into the candidate speech synthesis model for speech synthesis processing to obtain a second predicted speech; performing parameter adjustment on the candidate speech synthesis model according to the second predicted speech and the sample speech to obtain a target speech synthesis model; and inputting an acquired target text sequence into the target speech synthesis model for speech synthesis processing to obtain a target synthesized speech.
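
The text character adjustment step is characterized in the claims as random character extraction followed by character combination. The following is a minimal sketch of one possible reading, not the character adjustment sub-model itself: it assumes that extraction means sampling character positions without replacement, that combination means appending the extracted characters to the original ones, and that each entry carries an origin tag so later screening can tell the two groups apart. The function name, the tags "initial" and "variable", and the `num_variable` and `seed` parameters are hypothetical.

```python
import random


def build_initial_variable_sequence(sample_chars, num_variable, seed=None):
    """Combine the original characters with randomly extracted variable characters.

    Returns the combined sequence as (character, origin) pairs together with
    the list of extracted variable characters.
    """
    sample_chars = list(sample_chars)
    rng = random.Random(seed)
    # Random character extraction: draw positions without replacement.
    k = min(num_variable, len(sample_chars))
    variable_chars = rng.sample(sample_chars, k)
    # Character combination (assumed here to be concatenation with origin tags).
    combined = [(ch, "initial") for ch in sample_chars]
    combined += [(ch, "variable") for ch in variable_chars]
    return combined, variable_chars


sequence, variable_chars = build_initial_variable_sequence(
    "balance", num_variable=3, seed=0
)
```

Keeping the origin tag with each character is a design choice of this sketch; it makes it straightforward to remove variable characters after screening.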
In some embodiments, the sample text sequence includes sample initial characters, and performing text character adjustment on the sample text sequence according to the character adjustment sub-model to obtain the initial sample variable sequence includes: performing random character extraction on the sample text sequence to obtain sample initial variable characters; and performing character combination on the sample initial characters and the sample initial variable characters to obtain the initial sample variable sequence.
In some embodiments, the initial sample variable sequence includes sample candidate characters, and performing parameter adjustment on the initial speech prediction sub-model according to the first predicted speech and the sample speech to obtain the candidate speech synthesis model includes: performing spectrum loss calculation according to the first predicted speech and the sample speech to obtain prediction loss data; performing partial derivative calculation on the sample candidate characters according to the prediction loss data to obtain character variable data; performing mean calculation on the character variable data according to the number of sample text sequences to obtain character measurement data; comparing a preset character measurement threshold with the character measurement data to obtain a measurement comparison result; performing character screening on the sample candidate characters according to the measurement comparison result to obtain sampl