CN-122024698-A - Voice duration prediction method, device, equipment and storage medium
Abstract
The application relates to the technical fields of artificial intelligence and financial technology, and provides a speech duration prediction method, apparatus, device, and computer-readable storage medium. The method can be applied to intelligent interaction scenarios in finance and medicine. The method comprises: obtaining a text sequence of a text to be synthesized and extracting text coding features; obtaining target emotion information and extracting emotion features; obtaining the text length and extracting length features; and inputting the text coding features, the emotion features, and the length features into a pre-constructed duration predictor, which outputs a predicted duration for each phoneme unit. By introducing emotion features and text length features, the method addresses the technical problem that durations predicted by conventional methods are biased toward the average.
Inventors
- SHI YAN
- CHEN MINCHUAN
Assignees
- 平安科技(深圳)有限公司
Dates
- Publication Date
- 20260512
- Application Date
- 20260114
Claims (10)
- 1. A method for predicting a duration of speech, the method comprising: acquiring a text sequence of a text to be synthesized, and extracting text coding features through a text encoder based on the text sequence; acquiring target emotion information of the text to be synthesized, and extracting emotion features through an emotion encoder based on the target emotion information; acquiring the text length of the text to be synthesized, and extracting length features through a length feature extraction module based on the text length; and inputting the text coding features, the emotion features, and the length features into a pre-constructed duration predictor, and outputting, by the duration predictor, a predicted duration corresponding to each phoneme unit in the text sequence.
- 2. The method of claim 1, wherein the obtaining the text length of the text to be synthesized comprises: counting the number of phonemes or the number of characters contained in the text to be synthesized; and determining the number of phonemes or the number of characters as the text length of the text to be synthesized.
- 3. The method according to claim 1, wherein the obtaining the target emotion information of the text to be synthesized comprises: acquiring an emotion label or emotion embedding vector preset for the text to be synthesized; and constructing the target emotion information through the emotion encoder based on the emotion label or the emotion embedding vector.
- 4. The method of claim 1, wherein the duration predictor is trained by: acquiring a training sample set, wherein the training sample set comprises a sample text sequence, a sample length, sample emotion features, and a corresponding real duration; inputting the sample coding features, the sample emotion features, and the sample length corresponding to the sample text sequence into a duration prediction model to be trained to obtain a sample predicted duration; calculating a loss value between the sample predicted duration and the real duration; and updating parameters of the duration prediction model to be trained based on the loss value until a preset training stop condition is met.
- 5. The speech duration prediction method according to claim 1, wherein after obtaining the predicted duration corresponding to each phoneme unit, the method further comprises: inputting the predicted durations and the text coding features into an acoustic model, and generating a corresponding mel spectrum by the acoustic model; and inputting the mel spectrum into a vocoder for synthesis processing, and outputting synthesized speech.
- 6. The method of claim 5, further comprising: extracting complexity features of the text to be synthesized; and dynamically determining a number of sampling steps used by the acoustic model in generating the mel spectrum based on the length features and the complexity features; wherein the number of sampling steps is positively correlated with the text length characterized by the length features and with the text complexity characterized by the complexity features.
- 7. The speech duration prediction method according to any one of claims 1-6, wherein the inputting the text coding features, the emotion features, and the length features into a pre-constructed duration predictor comprises: adjusting, by the duration predictor, an overall speech-rate prosody of the text sequence based on the length features; when the length feature is greater than a first threshold, shortening the predicted duration of each phoneme unit output by the duration predictor so as to compact the prosody of the generated speech; and when the length feature is less than a second threshold, lengthening the predicted duration of each phoneme unit output by the duration predictor so as to relax the prosody of the generated speech.
- 8. A speech duration prediction apparatus, the apparatus comprising: a text feature extraction module, configured to obtain a text sequence of a text to be synthesized and to extract text coding features through a text encoder based on the text sequence; an emotion feature extraction module, configured to acquire target emotion information of the text to be synthesized and to extract emotion features through an emotion encoder based on the target emotion information; a length feature extraction module, configured to acquire the text length of the text to be synthesized and to extract length features based on the text length; and a duration prediction module, configured to input the text coding features, the emotion features, and the length features into a pre-constructed duration predictor, and to output, by the duration predictor, a predicted duration corresponding to each phoneme unit in the text sequence.
- 9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the speech duration prediction method of any one of claims 1 to 7.
- 10. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
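As an illustrative sketch only (not the patented implementation), the feature fusion of claim 1 can be approximated by a toy linear duration predictor: each phoneme's text coding features are concatenated with the broadcast emotion features and the scalar length feature, passed through one linear layer, and squashed through softplus so every predicted duration is positive. All dimensions, weights, and inputs below are hypothetical.

```python
import math

def predict_durations(text_feats, emotion_feat, length_feat, weights, bias):
    # Toy duration predictor: for each phoneme, concatenate its text coding
    # features with the (broadcast) emotion features and the scalar length
    # feature, apply a linear layer, and use softplus to keep durations > 0.
    durations = []
    for phon_feat in text_feats:
        x = phon_feat + emotion_feat + [length_feat]       # feature fusion
        z = sum(w * xi for w, xi in zip(weights, x)) + bias
        durations.append(math.log1p(math.exp(z)))          # softplus(z) > 0
    return durations

# Hypothetical inputs: 3 phonemes with 2-dim text features, a 2-dim emotion
# embedding, and a log-scaled length feature.
text_feats = [[0.5, -0.2], [1.0, 0.3], [-0.4, 0.8]]
emotion_feat = [0.1, 0.9]
length_feat = math.log(3)
weights = [0.4, -0.1, 0.2, 0.3, -0.05]
durs = predict_durations(text_feats, emotion_feat, length_feat, weights, bias=0.1)
print(durs)  # one positive predicted duration per phoneme unit
```

A real system would use a learned neural predictor; the point here is only the shape of the interface: per-phoneme features in, one positive duration per phoneme out.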
Description
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for predicting speech duration.
Background
A Text-to-Speech (TTS) system serves as an important interface for human-machine interaction, and its naturalness depends to a large extent on the accuracy of prosody generation. The duration prediction module (Duration Predictor) plays a core role: it predicts the duration of each phoneme or pronunciation unit in the text on the time axis, providing an alignment reference for the generation of subsequent acoustic features. However, most existing duration prediction methods infer durations only from the local context or the statistical regularities of the phoneme sequence itself, so the predicted durations tend toward the average. This conventional approach has a significant technical disadvantage, because speech rate is not constant in natural human expression. First, emotion changes pronunciation duration significantly; for example, anger generally corresponds to shorter phoneme durations (faster speech), while sadness corresponds to longer phoneme durations (slower speech). Second, text length has a macroscopic impact on prosody: to maintain breathing rhythm, the average phoneme duration in long sentences is usually compressed, while that in short sentences is stretched. As a result, the predicted durations output by prior-art models tend to average out, causing deviations in the duration predictions.
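The long/short-sentence compensation described above (which claim 7 makes explicit via two thresholds) can be sketched as a uniform scaling rule on the per-phoneme durations. The thresholds and scale factors below are illustrative assumptions, not values from the patent.

```python
def adjust_durations(durations, length_feature,
                     long_threshold=20.0, short_threshold=5.0,
                     compress=0.85, stretch=1.15):
    # Compact the prosody of long texts and relax that of short texts by
    # uniformly scaling the per-phoneme durations.
    if length_feature > long_threshold:
        scale = compress       # long sentence: shorten phoneme durations
    elif length_feature < short_threshold:
        scale = stretch        # short sentence: lengthen phoneme durations
    else:
        scale = 1.0            # mid-length sentence: leave durations as-is
    return [d * scale for d in durations]

base = [4.0, 6.0, 5.0]         # hypothetical base durations (in frames)
long_out = adjust_durations(base, length_feature=30.0)   # compacted
short_out = adjust_durations(base, length_feature=3.0)   # relaxed
print(long_out, short_out)
```

In the patented method this adjustment is learned inside the duration predictor rather than applied as a fixed post-hoc rule; the sketch only illustrates the intended direction of the correlation.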
Disclosure of Invention
The embodiments of the application provide a speech duration prediction method, a speech duration prediction apparatus, computer equipment, and a computer-readable storage medium, aiming to solve the problem of duration prediction deviation in conventional schemes. A speech duration prediction method comprises: acquiring a text sequence of a text to be synthesized, and extracting text coding features through a text encoder based on the text sequence; acquiring target emotion information of the text to be synthesized, and extracting emotion features through an emotion encoder based on the target emotion information; acquiring the text length of the text to be synthesized, and extracting length features through a length feature extraction module based on the text length; and inputting the text coding features, the emotion features, and the length features into a pre-constructed duration predictor, and outputting, by the duration predictor, a predicted duration corresponding to each phoneme unit in the text sequence.
In one implementation, the obtaining the text length of the text to be synthesized includes: counting the number of phonemes or the number of characters contained in the text to be synthesized; and determining the number of phonemes or the number of characters as the text length of the text to be synthesized.
In one implementation, the obtaining the target emotion information of the text to be synthesized includes: acquiring an emotion label or emotion embedding vector preset for the text to be synthesized; and constructing the target emotion information through the emotion encoder based on the emotion label or the emotion embedding vector.
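The text-length step above can be sketched as counting phonemes (or falling back to characters) and mapping the raw count to a scalar length feature; the log scaling is an illustrative choice, not something the patent specifies.

```python
import math

def text_length(text, phonemes=None):
    # Prefer the phoneme count when a phoneme sequence is available,
    # otherwise fall back to the character count (either is allowed).
    return len(phonemes) if phonemes is not None else len(text)

def length_feature(n):
    # Map the raw count to a smoother scalar feature; log scaling here
    # is an illustrative assumption.
    return math.log1p(n)

# Hypothetical ARPAbet-style phoneme sequence for "hello world".
n = text_length("hello world",
                phonemes=["HH", "AH", "L", "OW", "W", "ER", "L", "D"])
print(n, length_feature(n))
```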
In one implementation, the duration predictor is trained by: acquiring a training sample set, wherein the training sample set comprises a sample text sequence, a sample length, sample emotion features, and a corresponding real duration; inputting the sample coding features, the sample emotion features, and the sample length corresponding to the sample text sequence into a duration prediction model to be trained to obtain a sample predicted duration; calculating a loss value between the sample predicted duration and the real duration; and updating parameters of the duration prediction model to be trained based on the loss value until a preset training stop condition is met.
In one implementation, the calculating the loss value between the sample predicted duration and the real duration includes: calculating the difference between the sample predicted duration and the real duration through an L1 loss function or an L2 loss function to obtain the loss value.
In one implementation, after obtaining the predicted duration corresponding to each phoneme unit, the method further includes: inputting the predicted durations and the text coding features into an acoustic model, and generating a corresponding mel spectrum by the acoustic model; and inputting the mel spectrum into a vocoder for synthesis processing, and outputting synthesized speech.
In one implementation, the method further comprises: extracting complexity features of the text to be synthesized; and dynamically determining a number of sampling steps used by the acoustic model in generating the mel spectrum based on the length features and the complexity features.
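The training procedure above, with the L1 objective, can be sketched for a deliberately tiny one-parameter model; the data, learning rate, and step count are made up for illustration and stand in for the real duration prediction model.

```python
def l1_loss(pred, target):
    # Mean absolute error between predicted and real durations.
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)

def train_step(w, feats, real_durs, lr=0.05):
    # One subgradient-descent update of the single weight w under L1 loss.
    pred = [w * f for f in feats]
    grad = sum((1.0 if p > t else -1.0) * f
               for p, t, f in zip(pred, real_durs, feats)) / len(feats)
    return w - lr * grad

# Hypothetical samples: one scalar feature per phoneme and its real duration,
# consistent with an ideal weight of 2.0.
feats = [1.0, 2.0, 3.0]
real_durs = [2.0, 4.0, 6.0]
w = 0.0
first_loss = l1_loss([w * f for f in feats], real_durs)
for _ in range(100):                       # stop condition: fixed step budget
    w = train_step(w, feats, real_durs)
last_loss = l1_loss([w * f for f in feats], real_durs)
print(w, first_loss, last_loss)            # loss shrinks as w approaches 2.0
```

The real model updates many neural-network parameters via backpropagation, and the stop condition may be a loss or epoch threshold; the sketch only shows the predict / compute-loss / update cycle the description names.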