CN-116092469-B - Model training method and speech synthesis method based on semi-supervised knowledge distillation
Abstract
The invention provides a model training method and a speech synthesis method based on semi-supervised knowledge distillation. The method comprises: obtaining a sample phoneme sequence and a sample linear spectrum, and encoding the sample linear spectrum through an initial teacher model to obtain a sample teacher coding sequence; training the initial teacher model by monotonic alignment search according to the sample phoneme sequence and the sample teacher coding sequence to obtain a target teacher model; inputting the sample phoneme sequence into the target teacher model to obtain a teacher predicted phoneme duration and teacher predicted audio; and inputting the sample phoneme sequence into an initial student model, and training the initial student model according to the teacher predicted phoneme duration and the teacher predicted audio to obtain a target student model, wherein the target student model is used for synthesizing target speech according to an input target phoneme sequence. With the technical scheme provided by the embodiments of the invention, the target teacher model can be aligned at the frame level, so the phoneme alignment is more accurate and the training precision and effect of the student model are improved.
Inventors
- LI CHAOHUI
Assignees
- 珠海亿智电子科技有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20230118
Claims (9)
- 1. A model training method based on semi-supervised knowledge distillation, comprising: acquiring a sample phoneme sequence and a sample linear spectrum, and encoding the sample linear spectrum through an initial teacher model to obtain a sample teacher coding sequence; training the initial teacher model by monotonic alignment search according to the sample phoneme sequence and the sample teacher coding sequence to obtain a target teacher model; inputting the sample phoneme sequence into the target teacher model to obtain a teacher predicted phoneme duration and teacher predicted audio; and inputting the sample phoneme sequence into an initial student model, and training the initial student model according to the teacher predicted phoneme duration and the teacher predicted audio to obtain a target student model, wherein the target student model is used for synthesizing target speech according to an input target phoneme sequence; wherein the student model comprises a text encoder, a duration predictor, a length regulator and a spectrum decoder, and the inputting the sample phoneme sequence into the initial student model and training the initial student model according to the teacher predicted phoneme duration and the teacher predicted audio comprises: converting the sample phoneme sequence into an initial student coding sequence through the text encoder, and determining a student predicted phoneme duration of the initial student coding sequence through the duration predictor; inputting the teacher predicted phoneme duration and the initial student coding sequence into the length regulator, and expanding the initial student coding sequence in the time dimension through the length regulator according to the teacher predicted phoneme duration to obtain a sample student coding sequence; decoding the sample student coding sequence through the spectrum decoder to obtain a student predicted mel spectrum; and training the initial student model according to the teacher predicted phoneme duration, the teacher predicted audio, the student predicted phoneme duration and the student predicted mel spectrum.
- 2. The model training method based on semi-supervised knowledge distillation according to claim 1, wherein before the acquiring a sample phoneme sequence and a sample linear spectrum, the method further comprises: acquiring a sample text sequence and corresponding sample audio, and converting the sample text sequence into the sample phoneme sequence; determining a head silence duration and a tail silence duration of each sample audio; adjusting the head silence duration and the tail silence duration of each sample audio to a preset silence duration through clipping or silence compensation; and adding a silence identifier at a corresponding position in the sample phoneme sequence based on the adjusted head silence duration and tail silence duration of the sample audio.
- 3. The model training method based on semi-supervised knowledge distillation according to claim 1, wherein the text encoder and the spectrum decoder are dilated convolution residual modules, and the duration predictor is a convolution residual module.
- 4. The method of claim 1, wherein the training the initial student model according to the teacher predicted phoneme duration, the teacher predicted audio, the student predicted phoneme duration and the student predicted mel spectrum comprises: performing mel spectrum transformation on the teacher predicted audio to obtain a teacher predicted mel spectrum; calculating a duration loss value according to the teacher predicted phoneme duration, the student predicted phoneme duration and a preset duration loss function; calculating a mel spectrum loss value according to the teacher predicted mel spectrum, the student predicted mel spectrum and a preset mel spectrum loss function; performing speech recognition and speech quality evaluation on the teacher predicted audio, and determining a loss weight value of the target teacher model according to a speech recognition result and a speech quality evaluation result; and determining a total training loss value according to the duration loss value, the mel spectrum loss value and the loss weight value, and determining that the initial student model is trained into the target student model when the total training loss value meets a preset convergence condition.
- 5. The model training method based on semi-supervised knowledge distillation according to claim 4, wherein the duration loss function is a smooth absolute value loss function, and the mel spectrum loss function comprises a structural similarity loss function and an absolute value loss function.
- 6. The model training method based on semi-supervised knowledge distillation according to claim 4, wherein the performing speech recognition and speech quality evaluation on the teacher predicted audio and determining the loss weight value of the target teacher model according to the speech recognition result and the speech quality evaluation result comprises: inputting the teacher predicted audio into a preset MOSNet network to obtain a speech quality evaluation result, wherein the speech quality evaluation result is used for representing the pronunciation intelligibility of the teacher predicted audio; determining a first weight factor as 1 when the value of the speech quality evaluation result is greater than a preset intelligibility threshold, and otherwise determining the first weight factor as 0; inputting the teacher predicted audio into a preset U2 network to decode predicted-audio recognition phonemes, determining a second weight factor as 1 when the predicted-audio recognition phonemes are identical to the sample phoneme sequence, and otherwise determining the second weight factor as 0; and determining the product of the first weight factor and the second weight factor as the loss weight value.
- 7. A speech synthesis method, comprising: acquiring target text information to be synthesized, and converting the target text information into a target phoneme sequence; and inputting the target phoneme sequence into a target student model, and synthesizing target speech through the target student model, wherein the target student model is trained through the model training method based on semi-supervised knowledge distillation according to any one of claims 1 to 6.
- 8. An electronic device comprising at least one control processor and a memory communicatively coupled to the at least one control processor, the memory storing instructions executable by the at least one control processor to enable the at least one control processor to perform the semi-supervised knowledge distillation based model training method as recited in any of claims 1 to 6, or to perform the speech synthesis method as recited in claim 7.
- 9. A computer-readable storage medium storing computer-executable instructions for causing a computer to perform the semi-supervised knowledge distillation based model training method of any of claims 1 to 6, or the speech synthesis method of claim 7.
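Claims 4 to 6 combine a smooth absolute-value duration loss, an absolute-value plus structural-similarity mel spectrum loss, and a binary loss weight gated by the teacher's speech quality score and phoneme recognition result. The following is a minimal numeric sketch of that total-loss computation; all function and variable names are illustrative rather than taken from the patent, and the SSIM here is a coarse whole-spectrogram variant (production code would use a windowed SSIM and real MOSNet/U2 outputs):

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth absolute-value (Huber-style) loss for the duration term."""
    d = np.abs(pred - target)
    return np.mean(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta))

def global_ssim(x, y, c1=1e-4, c2=9e-4):
    """Coarse global SSIM between two mel spectrograms (illustrative only)."""
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def total_loss(dur_student, dur_teacher, mel_student, mel_teacher,
               mos_above_threshold, phonemes_match):
    """Gate the distillation loss by the teacher-quality weight: the weight
    is 1 only when the MOS score clears the intelligibility threshold AND
    the recognised phonemes match the input sequence (claims 4-6)."""
    w = float(mos_above_threshold) * float(phonemes_match)  # loss weight value
    l_dur = smooth_l1(dur_student, dur_teacher)             # duration loss
    l_mel = np.mean(np.abs(mel_student - mel_teacher)) \
            + (1.0 - global_ssim(mel_student, mel_teacher))  # L1 + SSIM loss
    return w * (l_dur + l_mel)
```

When the teacher's sample fails either quality check, the weight collapses to 0 and that sample contributes nothing to the student's gradient, which is the semi-supervised filtering the claims describe.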
Description
Model training method and speech synthesis method based on semi-supervised knowledge distillation

Technical Field

The invention relates to the technical field of speech processing, and in particular to a model training method and a speech synthesis method based on semi-supervised knowledge distillation.

Background

At present, because of inference speed constraints in the field of speech synthesis, most acoustic models based on knowledge distillation adopt non-autoregressive models, and most data sets available for non-autoregressive models are not labelled with phoneme duration information. During training, phoneme duration information therefore needs to be obtained from a teacher model or with an auxiliary tool, and the prediction of the mel spectrum is then trained on that duration information. However, the target mel spectrum distilled by the teacher model loses considerable information: the teacher model cannot use the real labels, so the predicted duration of the target mel spectrum is inconsistent with the duration of the real mel spectrum, and training the student model with such a target mel spectrum reduces the training effect. Meanwhile, the phoneme duration information acquired by an auxiliary tool is time-domain information while the mel spectrum is frequency-domain information, so phonemes cannot be exactly aligned when the audio is framed; the phoneme duration information is therefore inaccurate, which also degrades the training of the student model. As a result, student models trained by the related art perform poorly and synthesize speech with low accuracy.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems existing in the prior art. The invention therefore provides a model training method and a speech synthesis method based on semi-supervised knowledge distillation, which can improve the training effect of a student model and the quality of the mel spectrum synthesized by the student model.
In a first aspect, an embodiment of the present invention provides a model training method based on semi-supervised knowledge distillation, including: acquiring a sample phoneme sequence and a sample linear spectrum, and encoding the sample linear spectrum through an initial teacher model to obtain a sample teacher coding sequence; training the initial teacher model by monotonic alignment search according to the sample phoneme sequence and the sample teacher coding sequence to obtain a target teacher model; inputting the sample phoneme sequence into the target teacher model to obtain a teacher predicted phoneme duration and teacher predicted audio; and inputting the sample phoneme sequence into an initial student model, and training the initial student model according to the teacher predicted phoneme duration and the teacher predicted audio to obtain a target student model, wherein the target student model is used for synthesizing target speech according to an input target phoneme sequence. According to some embodiments of the invention, before the obtaining of the sample phoneme sequence and the sample linear spectrum, the method further comprises: acquiring a sample text sequence and corresponding sample audio, and converting the sample text sequence into the sample phoneme sequence; determining a head silence duration and a tail silence duration of each sample audio; adjusting the head silence duration and the tail silence duration of each sample audio to a preset silence duration through clipping or silence compensation; and adding a silence identifier at a corresponding position in the sample phoneme sequence based on the adjusted head silence duration and tail silence duration of the sample audio.
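The frame-level teacher training above relies on monotonic alignment search, which finds the most likely monotonic mapping between phoneme encodings and spectrogram frames by dynamic programming and yields a duration (frame count) per phoneme. Below is a simplified sketch assuming a precomputed phoneme-by-frame log-likelihood matrix; in practice that matrix comes from the teacher's latent distributions (as in Glow-TTS-style models), and the function name is illustrative:

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """log_p: [T_text, T_mel] pairwise log-likelihoods between phoneme
    encodings and spectrogram frames. Returns per-phoneme frame counts
    (durations) along the best monotonic, non-skipping alignment path."""
    T_text, T_mel = log_p.shape
    Q = np.full((T_text, T_mel), -np.inf)  # best path score ending at (j, i)
    Q[0, 0] = log_p[0, 0]
    for i in range(1, T_mel):
        for j in range(min(i + 1, T_text)):
            stay = Q[j, i - 1]                              # same phoneme
            move = Q[j - 1, i - 1] if j > 0 else -np.inf    # next phoneme
            Q[j, i] = log_p[j, i] + max(stay, move)
    # Backtrack from the last phoneme/frame, counting frames per phoneme.
    durations = np.zeros(T_text, dtype=int)
    j = T_text - 1
    for i in range(T_mel - 1, -1, -1):
        durations[j] += 1
        if i > 0 and j > 0 and Q[j - 1, i - 1] >= Q[j, i - 1]:
            j -= 1
    return durations

# Toy example: 2 phonemes, 4 frames; each phoneme matches two frames.
log_p = np.array([[0.0, 0.0, -10.0, -10.0],
                  [-10.0, -10.0, 0.0, 0.0]])
print(monotonic_alignment_search(log_p))  # -> [2 2]
```

Because the search operates on frames of the spectrogram itself, the resulting durations are aligned at the frame level, avoiding the time-domain/frequency-domain mismatch described in the Background.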
According to some embodiments of the invention, the student model includes a text encoder, a duration predictor, a length regulator and a spectrum decoder, and the inputting the sample phoneme sequence into an initial student model and training the initial student model according to the teacher predicted phoneme duration and the teacher predicted audio includes: converting the sample phoneme sequence into an initial student coding sequence through the text encoder, and determining a student predicted phoneme duration of the initial student coding sequence through the duration predictor; inputting the teacher predicted phoneme duration and the initial student coding sequence into the length regulator, and expanding the initial student coding sequence in the time dimension through the length regulator according to the teacher predicted phoneme duration to obtain a sample student coding sequence; decoding the sample student coding sequence through the spectrum decoder to obtain a student predicted mel spectrum; and training the initial student model according to the teacher predicted phoneme duration, the teacher predicted audio, the student predicted phoneme duration and the student predicted mel spectrum. According to some embodiments of the invention, the text encoder and
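The length regulator described above expands the student coding sequence so that each phoneme encoding covers as many frames as its predicted duration, matching the sequence length to the target mel spectrum. A minimal sketch of that expansion, as in FastSpeech-style models (the function name is illustrative, not from the patent):

```python
import numpy as np

def length_regulate(encodings, durations):
    """encodings: [T_text, D] phoneme encoding vectors.
    durations: [T_text] integer frame counts per phoneme.
    Repeats each encoding vector durations[i] times along the time axis,
    producing a [sum(durations), D] frame-level sequence."""
    return np.repeat(encodings, durations, axis=0)

# Toy example: 3 phoneme encodings expanded to 6 frame-level vectors.
enc = np.arange(6).reshape(3, 2)         # [[0,1],[2,3],[4,5]]
out = length_regulate(enc, np.array([2, 1, 3]))
print(out.shape)  # -> (6, 2)
```

At training time the teacher predicted durations drive the expansion, while at inference the student's own duration predictor supplies them.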