CN-115240631-B - Speech synthesis method and device, storage medium and electronic device


Abstract

The application discloses a speech synthesis method and device, a storage medium, and an electronic device. The method comprises: obtaining text data; obtaining voice audio data synthesized from the text data through a preset NAT processing model, wherein the preset NAT processing model comprises an encoder, a Gaussian upsampling module, and a decoder, the encoder adopts a unidirectional long short-term memory (LSTM) network and a reverse delay-controllable recurrent neural network, the Gaussian upsampling module performs Gaussian upsampling according to a preset block, and the decoder comprises a neural-network-based Mel vocoder and a neural-network-based LPL vocoder; and obtaining a synthesis result for the text data from the processing results of the neural-network-based Mel vocoder and the neural-network-based LPL vocoder. The application addresses the technical problems of the delay of the overall speech synthesis system and the inability to synthesize speech.

Inventors

SI YUJING
ZHANG QIN
WANG TONG
XI WEN
SHEN BINBIN
PU YAO
LI QUANZHONG

Assignees

普强时代(珠海横琴)信息技术有限公司

Dates

Publication Date: 20260505
Application Date: 20220722

Claims (9)

1. A speech synthesis method for a client, the method comprising: acquiring text data; obtaining voice audio data synthesized from the text data through a preset NAT processing model, wherein the preset NAT processing model comprises an encoder, a Gaussian upsampling module and a decoder, the encoder adopts a unidirectional long short-term memory network and a reverse delay-controllable recurrent neural network, the Gaussian upsampling module performs Gaussian upsampling according to a preset block, and the decoder comprises a neural-network-based Mel vocoder and a neural-network-based LPL vocoder; and processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain a synthesis result of the voice audio data; wherein the decoder comprising the neural-network-based Mel vocoder and the neural-network-based LPL vocoder includes: determining, through a predicted Mel feature vector and a predicted LPL feature vector, the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain the synthesis result.
2. The method of claim 1, wherein the encoder replaces a bidirectional long short-term memory network with the unidirectional long short-term memory network and the reverse delay-controllable recurrent neural network to control the delay of the encoder.
3. The method of claim 1, wherein the Gaussian upsampling module performs Gaussian upsampling according to a preset block as an input to the encoder and is related to the block size.
4. The method of claim 1, wherein obtaining the voice audio data synthesized from the text data through the preset NAT processing model includes: controlling the delay of the whole system by controlling the delay of the encoder and the Gaussian upsampling module.
5. The method of claim 1, wherein processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain the synthesis result of the voice audio data comprises: processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain different synthesized sound quality or synthesized rhythm in the voice audio data.
6. A speech synthesis method, the method comprising: receiving text data from a client; analyzing voice audio data synthesized from the text data through a preset NAT processing model, wherein the preset NAT processing model comprises an encoder, a Gaussian upsampling module and a decoder, the encoder adopts a unidirectional long short-term memory network and a reverse delay-controllable recurrent neural network, the Gaussian upsampling module performs Gaussian upsampling according to a preset block, and the decoder comprises a neural-network-based Mel vocoder and a neural-network-based LPL vocoder; wherein the decoder comprising the neural-network-based Mel vocoder and the neural-network-based LPL vocoder includes: determining, through a predicted Mel feature vector and a predicted LPL feature vector, the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain a synthesis result; and transmitting to the client the synthesis result of the voice audio data, which is obtained by processing the voice audio data with the neural-network-based Mel vocoder and the neural-network-based LPL vocoder.
7. A speech synthesis apparatus, comprising: an acquisition module for acquiring text data; a processing module for obtaining voice audio data synthesized from the text data through a preset NAT processing model, wherein the preset NAT processing model comprises an encoder, a Gaussian upsampling module and a decoder, the encoder adopts a unidirectional long short-term memory network and a reverse delay-controllable recurrent neural network, the Gaussian upsampling module performs Gaussian upsampling according to a preset block, and the decoder comprises a neural-network-based Mel vocoder and a neural-network-based LPL vocoder; wherein the decoder comprising the neural-network-based Mel vocoder and the neural-network-based LPL vocoder includes: determining, through a predicted Mel feature vector and a predicted LPL feature vector, the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain a synthesis result; and a synthesis module for processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain a synthesis result of the voice audio data.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 6 when run.
9. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of claims 1 to 6.
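
The encoder arrangement recited in claims 1 and 2 can be illustrated with a minimal sketch. It is not the patented implementation: a toy scalar tanh cell stands in for the LSTM cell, and the names `rnn_step`, `encode`, and `lookahead` are hypothetical. The assumption made here is that the reverse branch only reads a bounded number of future tokens, so the encoder output for a token is available before the whole sentence has been seen.

```python
import math


def rnn_step(h, x, w_h=0.5, w_x=1.0):
    """Toy scalar recurrent cell (assumption: a stand-in for an LSTM cell)."""
    return math.tanh(w_h * h + w_x * x)


def encode(inputs, lookahead):
    """Return (forward_state, reverse_state) for every token.

    The forward branch is strictly causal. The reverse branch for token i
    starts at token i + lookahead (or the last token) and runs backwards
    to token i, so output i never depends on tokens beyond i + lookahead.
    """
    forward = []
    h = 0.0
    for x in inputs:
        h = rnn_step(h, x)
        forward.append(h)

    outputs = []
    n = len(inputs)
    for i in range(n):
        last = min(n - 1, i + lookahead)
        b = 0.0
        for j in range(last, i - 1, -1):
            b = rnn_step(b, inputs[j])
        outputs.append((forward[i], b))
    return outputs


# Usage: with lookahead=2, token 0 can be encoded once tokens 0..2 are known.
tokens = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
encoded = encode(tokens, lookahead=2)
```

In this sketch the encoder delay is set by `lookahead` alone, whereas a bidirectional network would need the full token sequence before emitting its first output.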

Description

Speech synthesis method and device, storage medium and electronic device

Technical Field

The present application relates to the field of speech processing of text data, and in particular to a speech synthesis method and apparatus, a storage medium, and an electronic apparatus.

Background

The Tacotron end-to-end speech synthesis technology was proposed by Google in 2017, Tacotron 2 was proposed in 2018, and Non-Attentive Tacotron (NAT) was proposed in 2020, greatly improving the sound quality and stability of end-to-end speech synthesis models. However, because the encoder contains a bidirectional LSTM model and Gaussian upsampling requires computation over all encoder outputs, the requirements of speech synthesis cannot be met on some processors with limited computing power. In addition, the acoustic parameters output by Tacotron are Mel spectra, which cannot be used in LPC vocoders.

No effective solution has yet been proposed for the problems in the related art of the delay of the overall speech synthesis system and the inability to synthesize speech.

Disclosure of Invention

The main purpose of the application is to provide a speech synthesis method and device, a storage medium, and an electronic device, so as to solve the problems of the delay of the overall speech synthesis system and the inability to synthesize speech.

In order to achieve the above object, according to one aspect of the present application, there is provided a speech synthesis method for a client.
The speech synthesis method comprises: obtaining text data; obtaining voice audio data through a preset NAT processing model, wherein the preset NAT processing model comprises an encoder, a Gaussian upsampling module and a decoder, the encoder adopts a unidirectional long short-term memory (LSTM) network and a reverse delay-controllable recurrent neural network, the Gaussian upsampling module performs Gaussian upsampling according to the preset block, and the decoder comprises a neural-network-based Mel vocoder and a neural-network-based LPL vocoder; and processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain a synthesis result of the voice audio data.
Further, the decoder comprising the neural-network-based Mel vocoder and the neural-network-based LPL vocoder includes: determining, through the predicted Mel feature vector and the predicted LPL feature vector, the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain the synthesis result.
Further, the encoder replaces a bidirectional LSTM network with the unidirectional LSTM network and the reverse delay-controllable recurrent neural network, so as to control the delay of the encoder.
Further, the Gaussian upsampling module performs Gaussian upsampling according to a preset block as an input to the encoder and is related to the block size.
Further, obtaining the voice audio data synthesized from the text data through the preset NAT processing model includes controlling the delay of the whole system by controlling the delay of the encoder and the Gaussian upsampling module.
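
Block-based Gaussian upsampling can be sketched as follows. This is an illustration under stated assumptions rather than the method of the application: it assumes that "according to a preset block" means the token sequence is split into fixed-size blocks and that Gaussian weights are normalized only within a block, so the frames of a block can be produced as soon as that block of encoder outputs is ready. The function names, the per-token `durations` (in frames) and `sigmas`, and `block_size` are hypothetical.

```python
import math


def gaussian_upsample(states, durations, sigmas):
    """Expand token-level states to frame level using Gaussian weights.

    Token i is centred at the midpoint of its duration span. Each output
    frame is a weighted average of all token states given to this call.
    """
    centers = []
    end = 0.0
    for d in durations:
        end += d
        centers.append(end - d / 2.0)

    n_frames = int(round(end))
    dim = len(states[0])
    frames = []
    for t in range(n_frames):
        pos = t + 0.5  # middle of frame t
        weights = [
            math.exp(-0.5 * ((pos - c) / s) ** 2) / s
            for c, s in zip(centers, sigmas)
        ]
        norm = sum(weights) or 1.0  # guard against numerical underflow
        frames.append([
            sum(w * h[k] for w, h in zip(weights, states)) / norm
            for k in range(dim)
        ])
    return frames


def blockwise_gaussian_upsample(states, durations, sigmas, block_size):
    """Assumed block variant: upsample each block of tokens independently."""
    frames = []
    for start in range(0, len(states), block_size):
        stop = start + block_size
        frames.extend(
            gaussian_upsample(
                states[start:stop], durations[start:stop], sigmas[start:stop]
            )
        )
    return frames


# Usage: four 2-dimensional token states, upsampled in blocks of two tokens.
token_states = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [3.0, -1.0]]
mel_frames = blockwise_gaussian_upsample(
    token_states, durations=[2, 3, 1, 2], sigmas=[1.0, 1.0, 1.0, 1.0], block_size=2
)
```

Under these assumptions the upsampling delay grows with `block_size` instead of with sentence length, which is consistent with the statement that the overall delay is controlled through the encoder and the Gaussian upsampling module.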
Further, processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain the synthesis result of the voice audio data includes processing the voice audio data according to the neural-network-based Mel vocoder and the neural-network-based LPL vocoder to obtain different synthesized sound quality or synthesized rhythm in the voice audio data.
In order to achieve the above object, according to another aspect of the present application, there is provided a speech synthesis method for a server.
The speech synthesis method comprises: receiving text data from a client; analyzing voice audio data synthesized from the text data through a preset NAT processing model, wherein the preset NAT processing model comprises an encoder, a Gaussian upsampling module and a decoder, the encoder adopts a unidirectional long short-term memory (LSTM) network and a reverse delay-controllable recurrent neural network, the Gaussian upsampling module performs Gaussian upsampling according to a preset block, and the decoder comprises a neural-network-based Mel vocoder and a neural-network-based LPL vocoder; and issuing to the client the synthesis result of the voice audio data obtained by processing the voice audio data with the neural-network-based Mel vocoder and the neural-network-based LPL vocoder.
In order to achieve the above object, according to another aspect of the present application, there is provided a speech synthesis apparatus.
The speech synthesis apparatus comprises an acquisition module, a processing module and a synthesis module, wherein the acquisition module is used for acquiring text data, the processing module is