CN-121999754-A - Method and device for improving the speech synthesis speed of a Cauchy denoising diffusion probability model


Abstract

The invention discloses a method and a device for improving the speech synthesis speed of a Cauchy denoising diffusion probability model. The method comprises: (1) defining a Cauchy denoising diffusion probability model for speech synthesis; (2) defining a loss function for that model; (3) constructing and optimizing a denoising neural network for speech synthesis; (4) defining a loss function for a Cauchy square scale mapping network for speech synthesis; (5) constructing and optimizing the Cauchy square scale mapping network; (6) defining and executing a Cauchy square scale optimal short table search method for speech synthesis; and (7) defining and executing a Cauchy rapid sampling method for speech synthesis. The invention can effectively improve the quality and diversity of synthesized speech while meeting the real-time requirements of speech synthesis.

Inventors

QI YU
LIAN QI

Assignees

Zhejiang University

Dates

Publication Date: 20260508
Application Date: 20260202

Claims (8)

1. A method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model, characterized by comprising the following steps: (1) defining a Cauchy denoising diffusion probability model for speech synthesis, the model comprising a Cauchy prior square scale long table, a Cauchy posterior square scale long table, a Cauchy single-step diffusion operation and a Cauchy multi-step diffusion operation; (2) constructing a denoising neural network for predicting the Cauchy noise and the Cauchy posterior square scale at every diffusion step, defining a first loss function according to the Cauchy prior square scale long table and the Cauchy posterior square scale long table, and optimizing the parameters of the denoising neural network with the first loss function; (3) defining a Cauchy square scale mapping neural network, taking the Cauchy noise predicted by the denoising neural network as the ground truth, and constructing a second loss function to optimize the parameters of the Cauchy square scale mapping neural network; (4) defining a Cauchy square scale optimal short table search method which, based on the optimized Cauchy square scale mapping neural network, uses the stochastic and deterministic iterative sampling modes of the Cauchy denoising diffusion probability model and a grid search to find the best-performing Cauchy prior square scale short table; (5) defining a Cauchy rapid sampling method, in which a Cauchy posterior square scale short table corresponding to the Cauchy prior square scale short table is calculated approximately, a single-step sampling operation is defined based on the Cauchy prior square scale short table and the Cauchy posterior square scale short table, and the single-step sampling operation is executed iteratively to realize rapid speech synthesis.
2. The method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model according to claim 1, wherein in step (1): the Cauchy prior square scale long table is defined by a formula whose terms are the prior square scale values, at a given diffusion step, of two Gaussian denoising diffusion probability models in the prior square scale long table, and the prior square scale value, at a given diffusion step, of the Cauchy denoising diffusion probability model in the Cauchy prior square scale long table; the Cauchy posterior square scale long table is defined by a formula whose terms are the posterior square scale values, at a given diffusion step, of two Gaussian denoising diffusion probability models in the posterior square scale long table, and the posterior square scale value, at a given diffusion step, of the Cauchy denoising diffusion probability model in the Cauchy posterior square scale long table; the Cauchy single-step diffusion operation is defined by a formula whose terms are the input speech signal and the speech signals at two diffusion steps of the Cauchy prior square scale long table; and the Cauchy multi-step diffusion operation is defined by formulas whose terms are the standard Cauchy noise sampled at a given diffusion step of the prior square scale long table and the square root of the cumulative residual scale at diffusion step t.
3. The method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model according to claim 1, wherein in step (2) the first loss function is defined, according to the Cauchy prior square scale long table and the Cauchy posterior square scale long table, by formulas whose terms are: the input speech signal; the first loss function; the loss function value of the Cauchy denoising diffusion probability model at a given diffusion step of the Cauchy prior square scale long table; the noise prediction loss value at that diffusion step; the posterior square scale prediction loss value at that diffusion step; a trade-off parameter weighting the noise prediction loss value against the posterior square scale prediction loss value; the Cauchy noise value predicted by the denoising neural network at that diffusion step; the true Cauchy noise value at that diffusion step; the Cauchy posterior square scale value predicted by the denoising neural network at that diffusion step; and the true posterior square scale value of the Cauchy denoising diffusion probability model at that diffusion step.
4. The method for improving speech synthesis speed according to claim 3, wherein in step (3), defining the Cauchy square scale mapping neural network comprises defining a Cauchy prior square scale mapping, a Cauchy prior square scale short table iterative operation, and a second loss function of the Cauchy square scale mapping neural network, as follows: the Cauchy prior square scale mapping is defined by a formula whose terms are the Cauchy prior square scale long table, the Cauchy prior square scale short table, the speech signal at a given diffusion step of the Cauchy prior square scale long table, and the speech signal at a given diffusion step of the Cauchy prior square scale short table; the Cauchy prior square scale short table iterative operation is defined by formulas whose terms are the prior square scale values at diffusion steps n and n+1 of the Cauchy prior square scale short table, the prior square scale iterative operation performed by the Cauchy square scale mapping neural network during training, and a deep neural network that receives the speech signal at diffusion step t as input and predicts the ratio of the prior square scale values of two adjacent diffusion steps in the Cauchy prior square scale short table; and the second loss function of the Cauchy square scale mapping neural network is defined by formulas whose terms are the Cauchy noise value predicted by the optimized denoising neural network at diffusion step t of the prior square scale long table, and the true Cauchy noise value randomly sampled at diffusion step n of the prior square scale short table.
5. The method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model according to claim 1, wherein step (4) proceeds as follows: a Cauchy square scale mapping iteration is defined by formulas whose terms are the ratio of adjacent elements of the Cauchy square scale short table predicted by the optimized Cauchy square scale mapping neural network for a given input, a formula that computes one quantity from two given quantities on the basis of the optimized Cauchy square scale mapping neural network, and a formula that computes a further quantity from two given quantities; based on the computed quantities, the stochastic and deterministic iterative sampling modes are defined by a formula whose terms are the speech signals at diffusion steps n and n+1 of the Cauchy prior square scale short table, the Cauchy noise value and the Cauchy square scale value predicted by the optimized Cauchy denoising diffusion probability model, and two conditions under which deterministic sampling and stochastic sampling are used, respectively; given two initial values, the above quantities are calculated iteratively to synthesize speech; and the two initial values are divided over a grid, speech is synthesized for each grid point, and the quality of the synthesized speech is evaluated, thereby searching for the optimal Cauchy prior square scale short table.
6. The method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model according to claim 1, wherein in step (5) the Cauchy posterior square scale short table corresponding to the Cauchy prior square scale short table is calculated approximately by a formula whose terms are two quantities given by the Cauchy square scale optimal short table search method and the approximate Cauchy posterior square scale short table.
7. The method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model according to claim 1, wherein in step (5) the single-step sampling operation is defined, based on the Cauchy prior square scale short table and the Cauchy posterior square scale short table, by a formula whose terms are the speech signals at diffusion steps n-1 and n of the Cauchy prior square scale short table and the noise value predicted by the optimized Cauchy denoising diffusion probability model; and the single-step sampling operation is executed repeatedly, with the Mel spectrogram as the conditional input, to realize rapid speech synthesis.
8. A device for improving the speech synthesis speed of a Cauchy denoising diffusion probability model, comprising a memory and one or more processors, wherein executable code is stored in the memory, and the one or more processors, when executing the executable code, implement the method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model according to any one of claims 1 to 7.
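As an illustration of the grid search described in claim 5, the following minimal Python sketch enumerates candidate pairs of initial values and keeps the pair with the best quality score. The names `grid_search_initial_values` and `toy_quality_score`, the grid values, and the scoring function are hypothetical and do not come from the claims. In the claimed method, scoring a pair would involve building the short table from the two initial values, synthesizing speech, and evaluating its quality; a toy function stands in for that here.

```python
import itertools
from typing import Callable, Sequence, Tuple


def grid_search_initial_values(
    first_grid: Sequence[float],
    second_grid: Sequence[float],
    score_fn: Callable[[float, float], float],
) -> Tuple[Tuple[float, float], float]:
    """Return the pair of initial values with the highest score, and that score."""
    best_pair = None
    best_score = float("-inf")
    # Visit every grid point, i.e. every combination of the two initial values.
    for pair in itertools.product(first_grid, second_grid):
        score = score_fn(*pair)
        if score > best_score:
            best_pair, best_score = pair, score
    return best_pair, best_score


def toy_quality_score(first: float, second: float) -> float:
    """Hypothetical stand-in for 'synthesize speech and rate its quality'.

    It simply peaks at (0.3, 0.7); a real score would come from an
    objective or subjective speech quality measure.
    """
    return -((first - 0.3) ** 2 + (second - 0.7) ** 2)


# Assumed grid: nine evenly spaced candidates per initial value.
grid = [i / 10 for i in range(1, 10)]
best_pair, best_score = grid_search_initial_values(grid, grid, toy_quality_score)
print("best initial values:", best_pair)
```

The cost of such a search grows with the product of the two grid sizes, since each grid point requires at least one full synthesis run with the short table.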

Description

Method and device for improving the speech synthesis speed of a Cauchy denoising diffusion probability model

Technical Field

The invention relates to the technical field of speech synthesis, in particular to a method and a device for improving the speech synthesis speed of a Cauchy denoising diffusion probability model.

Background

Deep generative models have achieved significant breakthroughs and excellent performance in the field of speech synthesis. Current mainstream deep generative models for speech synthesis include flow-based generative models, models based on generative adversarial networks, and models based on denoising diffusion probability models. Because denoising diffusion probability models train stably and generate diverse samples, they are increasingly leading the latest developments in speech synthesis. The Chinese patent document with publication number CN120998174A discloses a speech synthesis method, a training device and training equipment for a diffusion model, in which text information is encoded by the text encoding submodel of the diffusion model to obtain a vector sequence; acoustic features are extracted from the vector sequence by the acoustic feature extraction submodel to obtain a first Mel spectrum; the first Mel spectrum is aligned with the text by the context-aware submodel, according to the vector sequence, to obtain a second Mel spectrum; residual learning and multi-scale acoustic feature extraction are applied to the second Mel spectrum by the high-frequency compensation diffusion submodel to obtain a target Mel spectrum; and the target speech is determined by the diffusion submodel according to the target Mel spectrum, improving the speech synthesis effect.
Owing to objective constraints of diffusion process theory, research on denoising diffusion probability models for speech synthesis has focused mainly on adding and removing Gaussian noise. To address the imbalance of speech data, researchers have proposed denoising diffusion probability models based on adding and removing heavy-tailed noise, which effectively improve the quality and diversity of synthesized speech. For example, the Chinese patent document with publication number CN119049446A discloses a speech synthesis method and device based on a Cauchy denoising probability diffusion model, in which Cauchy noise is introduced into the denoising probability diffusion model to train and sample the diffusion model and finally complete speech synthesis, improving the robustness of the speech synthesis method and the quality of the synthesized speech. One limitation of denoising diffusion probability models is that sampling is too slow, so the synthesis speed struggles to meet the real-time requirements of speech synthesis. On the one hand, current denoising diffusion probability models for rapid speech synthesis focus on Gaussian noise and cope poorly with the challenges of speech data imbalance. On the other hand, designing methods and devices for rapid speech synthesis with heavy-tailed-noise denoising diffusion probability models is challenging because the underlying theory changes. For example, Chinese patent document CN120877701A discloses a system and method for improving the speech synthesis speed of a diffusion model, which generates acoustic features in fewer iterations and passes them to a trained or fine-tuned vocoder to synthesize the speech signal, improving synthesis speed while generating high-quality speech.
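To illustrate what "heavy-tailed" means for the Cauchy noise discussed above, the following Python sketch (not part of the cited documents) draws standard Cauchy samples by the inverse-CDF method and compares how often they exceed a large threshold with standard Gaussian samples. The helper names, sample size, seed and threshold are illustrative assumptions.

```python
import math
import random


def sample_standard_cauchy(rng: random.Random) -> float:
    """Draw one standard Cauchy sample via the inverse CDF tan(pi * (u - 0.5))."""
    u = rng.random()  # uniform on [0, 1)
    return math.tan(math.pi * (u - 0.5))


def tail_fraction(samples, threshold: float) -> float:
    """Fraction of samples whose magnitude exceeds the threshold."""
    return sum(1 for s in samples if abs(s) > threshold) / len(samples)


rng = random.Random(0)  # fixed seed so the run is repeatable
num_samples = 20000     # assumed sample size
threshold = 10.0        # assumed cut-off for an "extreme" value

cauchy_samples = [sample_standard_cauchy(rng) for _ in range(num_samples)]
gauss_samples = [rng.gauss(0.0, 1.0) for _ in range(num_samples)]

cauchy_tail = tail_fraction(cauchy_samples, threshold)
gauss_tail = tail_fraction(gauss_samples, threshold)
print("Cauchy tail fraction:", cauchy_tail)
print("Gaussian tail fraction:", gauss_tail)
```

For a standard Cauchy variable the probability of a magnitude above 10 is 1 - (2/π)·arctan(10) ≈ 0.063, whereas for a standard Gaussian it is negligible, which is why Cauchy noise produces far more extreme values.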
Disclosure of Invention

The invention provides a method and a device for improving the speech synthesis speed of a Cauchy denoising diffusion probability model, which can significantly improve speech synthesis speed and the quality of synthesized speech. A method for improving the speech synthesis speed of a Cauchy denoising diffusion probability model comprises the following steps: (1) defining a Cauchy denoising diffusion probability model for speech synthesis, the model comprising a Cauchy prior square scale long table, a Cauchy posterior square scale long table, a Cauchy single-step diffusion operation and a Cauchy multi-step diffusion operation; (2) constructing a denoising neural network for predicting the Cauchy noise and the Cauchy posterior square scale at every diffusion step, defining a first loss function according to the Cauchy prior square scale long table and the Cauchy posterior square scale long table, and optimizing the parameters of the denoising neural network by using the first loss functio