CN-115470350-B - Method and device for constructing prosody model, prosody labeling method and electronic equipment


Abstract

The application provides a method and a device for constructing a prosody model, a prosody labeling method, electronic equipment and a readable storage medium. The method for constructing the prosody model comprises: obtaining a prediction prompt of input data, wherein the prediction prompt indicates the prosody level at which prosody labeling is performed on the input data; and, according to the labeling error between the prosody labeling text and the expected labeling text of the input data, adjusting the parameters of a classifier to obtain a prosody model whose labeling error falls within an expected error range. The application removes the dependence on complete expected labeling text in the input data and reduces the difficulty of acquiring the input data.

Inventors

FENG XIAOQIN
YANG XIPENG
CHEN YUNLIN
YE SHUNPING

Assignees

出门问问信息科技有限公司

Dates

Publication Date: 20260512
Application Date: 20220914

Claims (8)

1. A method for constructing a prosody model, characterized by comprising the following steps: obtaining a prediction prompt of input data, the prediction prompt being used to indicate a prosody level at which prosody annotation is performed on the input data; performing, based on the prediction prompt, the prosody annotation on the input data to obtain prosody annotation text having the prosody level in the prediction prompt; and performing parameter adjustment on a classifier according to an annotation error between the prosody annotation text and an expected annotation text of the input data, to obtain a prosody model for which the annotation error is within an expected error range; wherein performing the prosody annotation on the input data based on the prediction prompt to obtain the prosody annotation text having the prosody level in the prediction prompt comprises: extracting the prosody level for the input data from the prediction prompt, wherein the prosody level includes prosodic words, prosodic phrases, and intonation phrases; starting an annotation channel corresponding to the prosody level; and performing prosody annotation on the training text of the input data by using the annotation channel to obtain the prosody annotation text having the prosody level; and wherein performing parameter adjustment on the classifier according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain the prosody model for which the annotation error is within the expected error range, comprises: starting at least one annotation channel in the classifier corresponding to an input mode of the input data, and performing parameter adjustment on the at least one annotation channel according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain the prosody model for which the annotation error is within the expected error range.
2. The method for constructing a prosody model according to claim 1, further comprising, before obtaining the prediction prompt of the input data: screening the input data from a sample library, which comprises: extracting sample library features of the sample library; determining the input mode of each sample data in the sample library according to the sample library features; and determining the input data from a plurality of the sample data based on the input mode.
3. The method for constructing a prosody model according to claim 2, wherein determining the input mode of each sample data in the sample library according to the sample library features comprises: when the sample library features indicate that the numbers of sample data corresponding to each prosody level are similar, taking one-by-one input of sample data having the same prosody level as the input mode; or when the sample library features indicate that the difference in the numbers of sample data corresponding to each prosody level exceeds a preset threshold, taking cross input of sample data corresponding to different prosody levels as the input mode.
4. The method for constructing a prosody model according to claim 1, wherein starting at least one annotation channel in the classifier corresponding to the input mode, and performing parameter adjustment on the at least one annotation channel according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain the prosody model for which the annotation error is within the expected error range, comprises: in response to the input mode being one-by-one input of sample data having the same prosody level, starting the annotation channel in the classifier corresponding to the input data, and performing parameter adjustment on that annotation channel according to the annotation error, to obtain the prosody model for which the annotation error is within the expected error range; or in response to the input mode being cross input of sample data corresponding to different prosody levels, starting all annotation channels in the classifier, and performing parameter adjustment on each annotation channel according to the annotation error, to obtain the prosody model for which the annotation error is within the expected error range.
5. A device for constructing a prosody model, comprising: an acquisition module, configured to obtain a prediction prompt of input data, wherein the prediction prompt is used to indicate a prosody level at which prosody annotation is performed on the input data; an annotation module, configured to perform the prosody annotation on the input data based on the prediction prompt to obtain prosody annotation text having the prosody level in the prediction prompt; and an adjustment module, configured to perform parameter adjustment on a classifier according to an annotation error between the prosody annotation text and an expected annotation text of the input data, to obtain a prosody model for which the annotation error is within an expected error range; wherein performing the prosody annotation on the input data based on the prediction prompt to obtain the prosody annotation text having the prosody level in the prediction prompt comprises: extracting the prosody level for the input data from the prediction prompt, wherein the prosody level includes prosodic words, prosodic phrases, and intonation phrases; starting an annotation channel corresponding to the prosody level; and performing prosody annotation on the training text of the input data by using the annotation channel to obtain the prosody annotation text having the prosody level; and wherein performing parameter adjustment on the classifier according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain the prosody model for which the annotation error is within the expected error range, comprises: starting at least one annotation channel in the classifier corresponding to an input mode of the input data, and performing parameter adjustment on the at least one annotation channel according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain the prosody model for which the annotation error is within the expected error range.
6. A prosody annotation method, comprising: acquiring a target text; performing prosody annotation on the target text using a prosody model constructed by the method for constructing a prosody model according to any one of claims 1 to 4; and generating prosody annotation text of the target text, wherein the prosody annotation text carries annotation information corresponding to at least one prosody level.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method for constructing a prosody model according to any one of claims 1 to 4.
8. A readable storage medium, characterized in that it stores a computer program adapted to be loaded by a processor to perform the method of constructing a prosody model according to any one of claims 1 to 4.
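
The input-mode selection of claim 3 and the channel activation of claim 4 can be illustrated with a minimal Python sketch. The claims do not say how "similar in number" or the count difference is measured, so the gap between the largest and smallest per-level count used here is an assumption, as are the names `choose_input_mode`, `channels_to_start`, `"sequential"` and `"interleaved"`.

```python
from collections import Counter

# The three prosody levels named in claim 1.
LEVELS = ("prosodic_word", "prosodic_phrase", "intonation_phrase")


def choose_input_mode(samples, threshold):
    """Pick an input mode from per-level sample counts (claim 3).

    samples: list of (text, level) pairs drawn from the sample library.
    Assumption: the "difference in number" is max count minus min count.
    """
    counts = Counter(level for _, level in samples)
    per_level = [counts.get(level, 0) for level in LEVELS]
    spread = max(per_level) - min(per_level)
    # Large imbalance: cross (interleaved) input of different levels.
    # Otherwise: input samples of the same level one by one.
    return "interleaved" if spread > threshold else "sequential"


def channels_to_start(mode, batch_levels):
    """Decide which annotation channels to start (claim 4)."""
    if mode == "sequential":
        # Only the channel(s) matching the current input data.
        return set(batch_levels)
    # Cross input: start every annotation channel.
    return set(LEVELS)
```

With roughly balanced counts the sketch selects one-by-one input and starts a single channel; with a count gap above the threshold it selects cross input and starts all three channels.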

Description

Method and device for constructing prosody model, prosody labeling method and electronic equipment

Technical Field

The application relates to the technical field of intelligent voice, in particular to a method for constructing a prosody model, a device for constructing the prosody model, a prosody labeling method, electronic equipment and a readable storage medium.

Background

Precise control of the prosody levels of a text is key to improving the naturalness of speech synthesis. During speech synthesis, a prosody model is therefore usually used to analyze the prosody levels of the input text and determine its prosodic structure, which makes the prosody model critical to obtaining the prosody levels.
In the related art, complete joint prosody data is generally used as the training sample for the prosody model. However, joint prosody data requires every prosody level of each sample to be fully labeled, so it depends on rigorous expert knowledge and on the quality of text processing, and is difficult to acquire. Decoupled prosody data, which covers a single prosody level, is relatively easy to obtain. However, if a separate prosody model is constructed from each set of decoupled prosody data, the parameters of the models cannot be shared, and flexible training across prosody levels cannot be supported.

Disclosure of Invention

In order to solve at least one of the above technical problems, the present application provides a prosody model construction method, a prosody model construction device, a prosody labeling method, an electronic device, and a readable storage medium.
One aspect of the present application provides a method of constructing a prosody model, which may include: obtaining a prediction prompt of input data, wherein the prediction prompt is used to indicate a prosody level at which prosody annotation is performed on the input data; performing prosody annotation on the input data based on the prediction prompt to obtain prosody annotation text having the prosody level in the prediction prompt; and performing parameter adjustment on a classifier according to an annotation error between the prosody annotation text and an expected annotation text of the input data, to obtain a prosody model for which the annotation error is within an expected error range.
In some embodiments, performing prosody annotation on the input data based on the prediction prompt to obtain prosody annotation text having the prosody level in the prediction prompt may include: extracting the prosody level for the input data from the prediction prompt, wherein the prosody level includes prosodic words, prosodic phrases, and intonation phrases; starting an annotation channel corresponding to the prosody level; and performing prosody annotation on the training text of the input data with the annotation channel to obtain prosody annotation text having the prosody level.
In some embodiments, before the prediction prompt of the input data is obtained, the method may include screening the input data from a sample library, which includes: extracting sample library features of the sample library; determining the input mode of each sample data in the sample library according to the sample library features; and determining the input data from a plurality of sample data based on the input mode.
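
The prompt-driven annotation and error-based adjustment described above can be illustrated with a toy Python sketch. It is not the patented classifier: each annotation channel here is a simple per-token perceptron with no parameters shared between channels, the "annotation error" is taken to be the fraction of mislabeled tokens, and the names `ToyProsodyClassifier`, `annotate`, `adjust`, `train` and the `{"level": ...}` prompt format are all assumptions made for illustration.

```python
LEVELS = ("prosodic_word", "prosodic_phrase", "intonation_phrase")


class ToyProsodyClassifier:
    """One weight table per prosody level (a stand-in for annotation channels)."""

    def __init__(self):
        self.channels = {level: {} for level in LEVELS}

    def annotate(self, tokens, level):
        # Start only the channel named by the prompt's prosody level.
        # 1 means "a boundary of this level follows the token", 0 means none.
        weights = self.channels[level]
        return [1 if weights.get(tok, 0.0) > 0 else 0 for tok in tokens]

    def adjust(self, tokens, expected, level, lr=1.0):
        # Compare predicted and expected annotation, then update only the
        # started channel. Returns the error measured before the update.
        predicted = self.annotate(tokens, level)
        weights = self.channels[level]
        errors = 0
        for tok, p, e in zip(tokens, predicted, expected):
            if p != e:
                weights[tok] = weights.get(tok, 0.0) + lr * (e - p)
                errors += 1
        return errors / max(len(tokens), 1)


def train(classifier, samples, max_error=0.0, max_epochs=20):
    """Repeat adjustment until every sample's error is within max_error.

    samples: list of (prompt, tokens, expected), where prompt["level"]
    names the prosody level that the expected annotation covers.
    """
    for _ in range(max_epochs):
        worst = 0.0
        for prompt, tokens, expected in samples:
            level = prompt["level"]  # extract the level from the prompt
            worst = max(worst, classifier.adjust(tokens, expected, level))
        if worst <= max_error:
            break
    return classifier
```

Because each sample carries its own prediction prompt, a sample only needs an expected annotation for the single level named in that prompt; channels for the other levels are left untouched by that sample.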
In some embodiments, determining the input mode of each sample data in the sample library according to the sample library features may include: when the sample library features indicate that the numbers of sample data corresponding to each prosody level are similar, taking one-by-one input of sample data having the same prosody level as the input mode; or when the sample library features indicate that the difference in the numbers of sample data corresponding to each prosody level exceeds a preset threshold, taking cross input of sample data corresponding to different prosody levels as the input mode.
In some embodiments, performing parameter adjustment on the classifier according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain a prosody model for which the annotation error is within the expected error range, may include: starting at least one annotation channel in the classifier corresponding to the input mode, and performing parameter adjustment on the at least one annotation channel according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain a prosody model for which the annotation error is within the expected error range.
In some embodiments, starting at least one annotation channel in the classifier corresponding to the input mode, and performing parameter adjustment on the at least one annotation channel according to the annotation error between the prosody annotation text and the expected annotation text of the input data, to obtain the prosody model for which the annotation error is within the expected error range, may include starting the annotation