CN-115440235-B - NPU-based streaming speech synthesis vocoder method and related products
Abstract
Embodiments of the invention provide an NPU-based streaming speech synthesis vocoder method and related products. The method comprises: obtaining input features to be processed for the streaming voice, wherein the input features to be processed are of a fixed length; processing the fixed-length input features with a vocoder model deployed on the NPU to output audio information; and determining a synthesis result of the streaming voice based on the audio information. With this scheme, the vocoder model is deployed on the NPU to synthesize the streaming voice, so that high-quality speech synthesis is ensured by the high performance of the NPU (in particular its advantages in neural network inference), while latency and the real-time rate are effectively reduced to meet actual market demand. The invention also provides an apparatus and a computer-readable storage medium.
Inventors
- GAO FEI
- ZHANG GUANGYONG
- GAO QIANG
- BU BING
- DUAN YITAO
Assignees
- 网易有道信息技术(北京)有限公司
Dates
- Publication Date
- 20260508
- Application Date
- 20220729
Claims (8)
- 1. A method for streaming speech synthesis with a vocoder based on an embedded neural network processor (NPU), comprising: acquiring input features to be processed for the streaming voice, wherein the input features to be processed are of a fixed length, and in particular preprocessing the streaming voice to split the input features to be processed into one or more input features of the fixed length, wherein preprocessing the streaming voice comprises extracting the input features of the fixed length from the streaming voice by a sliding window method, wherein the sliding window method comprises: sequentially sliding a sliding window of the fixed length over the streaming voice to extract the input features of the fixed length, wherein the sliding step of each slide is smaller than the fixed length so that the input features extracted by adjacent sliding windows overlap, the sliding window covers an effective part located in the middle of the window and overlapping parts located at both ends of the window, the size of the effective part is equal to the sliding step, and the size of the overlapping parts is determined according to the receptive field of the structure of the vocoder model on the NPU; in response to the presence of an input feature having a length less than the fixed length, padding that input feature to adjust its length to the fixed length; processing the input features of the fixed length with the vocoder model deployed on the NPU to output audio information; cropping the output of the vocoder model for each input feature to obtain audio information corresponding to the effective part of each input feature; splicing the audio information corresponding to the effective part of each input feature; and determining a synthesis result of the streaming voice from the spliced audio information.
- 2. The method as recited in claim 1, further comprising: directly using the spliced audio information as the synthesis result; or performing fade-in and fade-out post-processing on the spliced audio information using a cosine signal to obtain the synthesis result.
- 3. The method of claim 2, wherein performing the fade-in and fade-out post-processing on the spliced audio information using the cosine signal comprises: performing fade-out processing on the audio information within the window of the first half period using the first half period of the cosine signal; performing fade-in processing on the audio information within the window of the second half period using the second half period of the cosine signal; and performing noise reduction on the spliced audio information based on the fade-in processing and the fade-out processing.
- 4. The method of claim 1, wherein padding an input feature having a length less than the fixed length comprises: padding the input feature to the fixed length with a fixed value.
- 5. The method according to any one of claims 1 to 4, wherein the vocoder model deployed on the NPU is obtained by converting a trained vocoder model that supports streaming speech synthesis.
- 6. The method of claim 5, wherein the vocoder model supporting streaming speech synthesis is converted into the vocoder model deployed on the NPU via an NPU toolchain.
- 7. An apparatus, comprising: an embedded neural network processor (NPU); and a memory storing computer instructions for an NPU-based streaming speech synthesis vocoder that, when executed by the NPU, cause the apparatus to perform the method of any of claims 1-6.
- 8. A computer readable storage medium containing program instructions of an NPU-based streaming speech synthesis vocoder which, when executed by the NPU, cause the method according to any of claims 1-6 to be implemented.
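The cosine fade post-processing of claims 2 and 3 can be illustrated with a short Python sketch. The claims do not fix how the half-period windows are placed, so this is only one possible reading: a raised-cosine gain is applied in place to the samples just before each splice point (fade-out, first half period) and just after it (fade-in, second half period), which suppresses clicks at the seams. The names `cosine_fade_gains` and `fade_seams`, the seam list, and the ramp length `n` are hypothetical and do not come from the patent.

```python
import math
from typing import List, Sequence, Tuple


def cosine_fade_gains(n: int) -> Tuple[List[float], List[float]]:
    """Build fade-out and fade-in gain ramps of n samples from one cosine period.

    Assumption: the cosine is shifted and scaled to (1 + cos(phase)) / 2 so
    that the gains stay within [0, 1].
    """
    # First half period (phase 0 .. pi): gain falls from 1 towards 0.
    fade_out = [0.5 * (1.0 + math.cos(math.pi * k / n)) for k in range(n)]
    # Second half period (phase pi .. 2*pi): gain rises from near 0 to 1.
    fade_in = [
        0.5 * (1.0 + math.cos(math.pi + math.pi * (k + 1) / n)) for k in range(n)
    ]
    return fade_out, fade_in


def fade_seams(audio: Sequence[float], seams: Sequence[int], n: int) -> List[float]:
    """Attenuate the n samples on each side of every splice point.

    `seams` holds the sample index at which each later chunk begins
    (hypothetical bookkeeping kept by the splicing step).
    """
    out = list(audio)
    fade_out, fade_in = cosine_fade_gains(n)
    for seam in seams:
        for k in range(n):
            before = seam - n + k  # tail of the earlier chunk
            after = seam + k       # head of the later chunk
            if 0 <= before < len(out):
                out[before] *= fade_out[k]
            if 0 <= after < len(out):
                out[after] *= fade_in[k]
    return out
```

For example, `fade_seams(audio, seams=[4000, 8000], n=64)` would smooth two splice points in `audio`. If two seams lie closer together than `2 * n` samples, the gains multiply, so a practical implementation would keep `n` well below the effective chunk length.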
Description
NPU-based streaming speech synthesis vocoder method and related products
Technical Field
Embodiments of the present invention relate to the field of information processing technology, and more particularly to an NPU-based streaming speech synthesis vocoder method, an apparatus for performing the method, and a computer-readable storage medium.
Background
This section is intended to provide a background or context to the embodiments of the application that are recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Accordingly, unless indicated otherwise, what is described in this section is not prior art to the description and claims of the present application and is not admitted to be prior art by inclusion in this section.
With the rapid development of artificial intelligence, speech synthesis (Text-To-Speech, TTS) algorithms have matured, and on-device (end-side) TTS applications are becoming increasingly important because they offer faster response and better privacy protection. TTS is the process of synthesizing speech from text and is generally divided into three parts: the text front end, the acoustic model, and the vocoder. The vocoder plays a decisive role in the quality of the synthesized speech. Vocoders can be roughly classified into phase-reconstruction-based vocoders and neural-network-based vocoders. Phase-reconstruction-based vocoders use algorithms to derive the phase and reconstruct the speech waveform, mainly because the acoustic features used by TTS (e.g., mel features) have lost their phase information. Neural-network-based vocoders directly model the relationship between the acoustic features and the speech waveform, so the synthesized speech quality is higher. Currently, the prior art uses neural-network-based vocoders implemented on a central processing unit (CPU).
However, as CPU computing power becomes a limiting factor and demand for high-quality speech synthesis grows, on-device TTS that relies on the CPU cannot meet requirements such as low latency and high quality.
Disclosure of Invention
The speech synthesis quality of known CPU-based vocoders is unsatisfactory, which is a persistent problem. There is therefore a strong need for an improved streaming speech synthesis vocoder scheme based on an embedded neural network processor (Neural Network Processing Unit, hereinafter NPU), and for related products, that can effectively improve the quality of synthesized speech and reduce latency and the real-time rate.
In this context, embodiments of the present invention are intended to provide an NPU-based streaming speech synthesis vocoder method and related products.
In a first aspect of an embodiment of the present invention, there is provided a method for streaming speech synthesis with a vocoder based on an embedded neural network processor (NPU), comprising: obtaining input features to be processed for the streaming voice, wherein the input features to be processed are of a fixed length; processing the fixed-length input features with a vocoder model deployed on the NPU to output audio information; and determining a synthesis result of the streaming voice based on the audio information.
In one embodiment of the invention, obtaining the input features to be processed for the streaming voice includes preprocessing the streaming voice to split the input features to be processed into one or more fixed-length input features.
In another embodiment of the present invention, preprocessing the streaming voice includes extracting fixed-length input features from the streaming voice using a sliding window method.
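The fixed-length windowing, padding, cropping and splicing steps set out in claim 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `split_windows`, `synthesize`, `toy_vocoder`, the `hop` parameter (audio samples produced per feature frame) and the pad value `0.0` are all hypothetical names and choices, and the stand-in model simply repeats each frame value instead of running a neural network on an NPU. The sketch also assumes that the first window is padded on its left edge so that every window shares the same layout of overlap, effective part, overlap.

```python
from typing import Callable, List, Sequence, Tuple

Frame = List[float]  # one acoustic feature frame, e.g. a mel vector


def split_windows(
    frames: Sequence[Frame], window: int, overlap: int, pad_value: float = 0.0
) -> List[Tuple[List[Frame], int]]:
    """Cut `frames` into fixed-length windows laid out as
    [overlap | effective part | overlap].

    Returns (window_frames, valid) pairs, where `valid` counts the real
    (non-padded) frames inside the effective part.
    """
    step = window - 2 * overlap  # effective part size equals the sliding step
    if overlap < 0 or step <= 0:
        raise ValueError("window must be longer than twice the overlap")
    if not frames:
        return []
    pad_frame = [pad_value] * len(frames[0])
    chunks = []
    for start in range(0, len(frames), step):
        chunk = []
        for t in range(start - overlap, start + step + overlap):
            # Positions outside the stream are filled with a fixed value.
            in_range = 0 <= t < len(frames)
            chunk.append(list(frames[t]) if in_range else list(pad_frame))
        chunks.append((chunk, min(step, len(frames) - start)))
    return chunks


def synthesize(
    frames: Sequence[Frame],
    model: Callable[[List[Frame]], List[float]],
    window: int,
    overlap: int,
    hop: int,
) -> List[float]:
    """Run the fixed-shape model per window, keep only the audio of the
    effective part, and splice the pieces in order."""
    audio: List[float] = []
    for chunk, valid in split_windows(frames, window, overlap):
        out = model(chunk)        # expected length: window * hop samples
        begin = overlap * hop     # drop audio produced from the left overlap
        audio.extend(out[begin:begin + valid * hop])
    return audio


def toy_vocoder(chunk: Sequence[Frame], hop: int = 3) -> List[float]:
    """Stand-in for the NPU model: emits `hop` samples per frame."""
    return [frame[0] for frame in chunk for _ in range(hop)]
```

With 10 one-dimensional frames, `window=8` and `overlap=2`, the step is 4 and three windows are produced whose effective parts hold 4, 4 and 2 real frames. Because the toy model has no temporal context, the spliced result matches what it would produce on the whole sequence at once; for a real vocoder the overlap would need to cover the model's receptive field for the same to hold approximately.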
In yet another embodiment of the present invention, the sliding window method includes sequentially sliding a sliding window of the fixed length over the streaming voice to extract the fixed-length input features, the sliding step of each slide being smaller than the fixed length so that the input features extracted by adjacent sliding windows overlap.
In yet another embodiment of the present invention, the sliding window covers an effective portion located in the middle of the window and overlapping portions located at both ends of the window, the effective portion having a size equal to the sliding step and the overlapping portions having a size determined according to the receptive field of the vocoder model structure.
In one embodiment of the invention, determining the synthesis result of the streaming voice based on the audio information comprises cropping the output of the vocoder model for each input feature to obtain audio information corresponding to the effective portion of each input feature, splicing the audio information corresponding to the effective portion of each input feature, and determining the synt