CN-122024703-A - Tibetan prosody structure prediction method based on grammar information


Abstract

The invention discloses a Tibetan prosody structure prediction method based on grammar information. It relates to the technical field of speech synthesis and is applied to prosody structure prediction in Tibetan speech synthesis. The method comprises the following steps: S1, completing Tibetan parsing and quantification of virtual word (function word) continuation, extracting grammar probability, hierarchical focus and pause difference, and preprocessing and storing the results to construct a Tibetan prosody prediction database; S2, performing boundary discrimination based on the virtual word continuation relation strength, the influence range hit marks and the word sequence distances; S3, performing entropy feature analysis on the grammar role probability data and the boundary discrimination data; S4, generating a three-layer prosodic boundary type sequence by boundary shaping while suppressing excessive pauses; and S5, evaluating rhythm strength from the pause duration difference and the grammar information entropy data. The method addresses the pause confusion and increased comprehension burden caused by misalignment between prosody and semantic hierarchy in Tibetan long-sentence speech synthesis.

Inventors

Labaton Ju
Bian Bazhuoma
Zhuo Mazhaxi
WANG HONGFENG
ZHU JIE
Deqing Zhuoma
Si Getinka
Pu Bu Dan Zeng
SUN YADONG
Zha Xiduoji
Pu Pianduo
Renqing Nobu

Assignees

Tibet University (西藏大学)

Dates

Publication Date: 2026-05-12
Application Date: 2026-02-05

Claims (10)

1. A Tibetan prosody structure prediction method based on grammar information, characterized by comprising the following steps: S1, completing Tibetan parsing and quantification of virtual word continuation, extracting grammar probability, hierarchical focus and pause difference, and preprocessing and storing the results to construct a Tibetan prosody prediction database; S2, performing boundary discrimination based on the virtual word continuation relation strength, the influence range hit marks and the word sequence distance, thereby realizing candidate boundary screening, merging of serial segments, pause label configuration and updating of grammar function labels; S3, performing entropy feature analysis on the grammar role probability data and the boundary discrimination data, and performing virtual word anchor point calibration, clause skeleton adjustment and pause compression according to the entropy feature analysis result; S4, reading the grammar annotation and hierarchy information of the candidate boundaries, generating a three-layer prosodic boundary type sequence by boundary shaping, and suppressing excessive pauses; S5, evaluating rhythm strength from the pause duration difference and the grammar information entropy data, and performing phrase-level refinement and pause mark adjustment according to the rhythm evaluation result.
2. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for completing Tibetan parsing and quantification of virtual word continuation and for extracting grammar probability, hierarchical focus and pause difference are as follows: collecting Tibetan text data, namely receiving a text to be processed through an input interface, performing word segmentation to obtain a word sequence and a part-of-speech sequence, and outputting a dependency arc set, a central word index set and a modification range set by combining syntactic dependency analysis; identifying all the virtual words in the word sequence, dividing them into three classes, namely case auxiliary words, general virtual words and class virtual words, and marking and numbering the position of each virtual word; collecting virtual word continuation rules, namely extracting valid connected virtual word combinations, synonymous variant replacement forms and prefix and suffix attachment information from a virtual word continuation constraint table compiled from corpus statistics and grammar rules, and generating a virtual word continuation relation strength matrix and a word sequence distance matrix in combination with the word sequence to quantify the virtual words, the matrices being encoded synchronously with the text sequence; constructing an indicator function matrix according to the boundary forming rules and the candidate boundary setting conditions, and judging whether each pair of virtual words lies within the same candidate boundary influence range to obtain a boundary influence range indication value; collecting grammar probability data, namely outputting the probability distribution over the grammatical roles that each virtual word may assume, recorded as a grammar role probability sequence; extracting virtual word hierarchy structure and semantic focus information, namely determining the clause level index, sentence-end label and central word chain of each virtual word according to the dependency relations, modification ranges and clause division, and extracting topic labels and punctuation labels; recording the start and stop times of prosodic phrases and pause periods against the annotated word list and marking the pause period corresponding to each text node; and calculating the duration difference between the prosodic phrase and the pause period to obtain the pause duration difference.
3. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for preprocessing, storing and constructing the Tibetan prosody prediction database are as follows: marking virtual words whose pause duration difference is smaller than a pause threshold with short pause labels, and marking virtual words whose pause duration difference is larger than the pause threshold with long pause labels; applying min-max normalization to the collected continuation relation strength, word sequence distance and grammar role probability values so that all dimensions share the same range; based on the normalized feature data, splicing the structured fields of virtual word category, continuation strength, grammar role probability, dependency index, distance, hierarchical label and focus mark in text order to generate a same-frequency feature vector; adding the virtual word sequence number, the data batch label and the main control field interval; and synchronously storing the feature vector and the virtual word grammar label to construct the Tibetan prosody prediction database.
4. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific steps of boundary discrimination based on the virtual word continuation relation strength, the influence range hit marks and the word sequence distance are as follows: obtaining the continuation relation strength, the boundary influence range indication value and the word sequence distance between position i and every virtual word position j; for each pair of position i and position j, multiplying the continuation relation strength by the boundary influence range indication value to obtain a boundary contribution term; adding a small constant to the word sequence distance and squaring the sum to obtain a distance penalty term; dividing the boundary contribution term by the distance penalty term to obtain the boundary contribution value of position i and position j; and summing the boundary contribution values over all positions j to obtain the boundary candidate value at position i.
5. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for realizing candidate boundary screening, serial segment merging, pause label configuration and grammar function label updating are as follows: comparing the boundary candidate value with a boundary threshold in real time; when the boundary candidate value is smaller than the boundary threshold, checking in sequence whether the adjacent virtual words form a legal connection combination registered in the virtual word continuation constraint table, merging the prosodic phrase boundaries within each detected virtual word serial segment, cancelling the temporary pause labels between all virtual words in that segment, and returning the grammar function label triggered by the virtual word serial segment; when the boundary candidate value is greater than or equal to the boundary threshold, determining the current position to be a valid prosodic boundary candidate, keeping the position in the candidate boundary list and marking it as a prosodic boundary, while checking the validity of the virtual word combinations on the left and right sides of the position against the virtual word continuation constraint table, assigning short pause labels to the virtual word serial segments that conform to the constraint rules, and synchronously marking their grammar functions.
6. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for entropy feature analysis on the grammar role probability data and the boundary discrimination data are as follows: obtaining the probability values of the kth virtual word over each grammar role and the boundary candidate value at the corresponding position; multiplying the probability of each grammar role by the logarithm of that probability to obtain the information term of a single grammar role; summing the information terms of all grammar roles and negating the sum to obtain the basic grammar entropy value of the kth virtual word; multiplying the boundary candidate value at the corresponding position by a boundary adjustment factor to obtain a boundary correction term, and adding the boundary correction term to obtain an entropy correction term; and multiplying the basic grammar entropy value by the entropy correction term to obtain the grammar information entropy value of the kth virtual word.
7. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for performing virtual word anchor point calibration, clause skeleton adjustment and pause compression according to the entropy feature analysis result are as follows: comparing the grammar information entropy value with an entropy threshold in real time; when the grammar information entropy value is smaller than the entropy threshold, marking the corresponding virtual word as a stable grammar function anchor point, fixing its case relation mark, clause connection mark and mood mark, and exempting it from downgrading and merging in the subsequent boundary optimization stage; when the grammar information entropy value is greater than or equal to the entropy threshold, marking the corresponding virtual word as a unit with undetermined grammar function and triggering a grammar role slow-release strategy, in which the combinations before and after the virtual word are checked against the virtual word continuation constraint table for legal order, grammar function labels that do not conform are rolled back to the three categories of case auxiliary word, general virtual word and class virtual word, the virtual words before and after the position are rechecked for legal connection combinations in the virtual word continuation constraint table, and for each hit combination the grammar role probability distribution is updated and the grammar information entropy value is recalculated; if the recalculated grammar information entropy value is still greater than or equal to the entropy threshold, the position is barred from selection as a rhythm focus anchor point, the nearest virtual words to its left and right whose grammar information entropy values are smaller than the entropy threshold are used instead as focus anchor points for long pauses, and the pause label at the position is marked as the short pause grade, thereby avoiding the long-sentence level misalignment caused by binding pauses and accent placement to positions where the grammar function fluctuates.
8. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for reading the grammar annotation and hierarchy information of the candidate boundaries, generating a three-layer prosodic boundary type sequence by boundary shaping and suppressing excessive pauses are as follows: in the candidate boundary list, reading the grammar function label of the virtual word at each candidate position together with the clause level index, the semantic focus anchor points and the grammar information entropy values, and outputting a three-layer boundary type sequence according to the boundary forming rules, wherein the boundary forming rules comprise: when the grammar function of the virtual word at the candidate position is marked as a case auxiliary word, the candidate position is determined to be a prosodic word boundary and the adjacent syntactic components on its left side are merged into the same prosodic word; when the grammar function of the virtual word at the candidate position is marked as a temporal auxiliary word or a noun suffix, the candidate position is determined to be a prosodic word boundary and the prosodic word is closed; when the grammar function of the virtual word at the candidate position is marked as a prosodic word, the candidate position is determined to be an intonation phrase boundary and sentence-end type control information is written; when the sentence-end virtual word at the candidate position is marked as a connecting type word with a complex-sentence associated component mark, the candidate position is marked as a prosodic word boundary and determined as a phrase, and the prosodic word is kept at the same prosodic word boundary value; and when the value is greater than the threshold, the prosodic word boundary is kept, with the prosodic word boundary value held within the range set by the prosodic word type constraint value.
9. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific measures for evaluating rhythm strength from the pause duration difference and the grammar information entropy data are as follows: obtaining the duration difference between the mth prosodic phrase and each pause period, the time node of each pause period, the start and stop time nodes of the mth prosodic phrase boundary, and the grammar information entropy value; for each pause period, dividing the duration difference by the sum of one and the product of an adjustment weight factor and the absolute value of the time node difference to obtain a single pause contribution term; integrating all pause contribution terms over a given time range to obtain a basic rhythm intensity term; multiplying the grammar information entropy value by an information weight factor and adding the grammar information entropy value to obtain a rhythm correction factor; and multiplying the basic rhythm intensity term by the rhythm correction factor to obtain the layered rhythm intensity value of the mth prosodic phrase.
10. The Tibetan prosody structure prediction method based on grammar information of claim 1, wherein the specific steps of performing phrase-level refinement and pause mark adjustment according to the rhythm evaluation result are as follows: comparing the layered rhythm intensity value with an intensity threshold in real time; when the layered rhythm intensity value is smaller than the intensity threshold, executing a conservative subdivision strategy, in which the boundary type of each already determined prosodic phrase boundary is kept unchanged, no new segmentation points are added, only convergence processing is applied to the pause durations around the same prosodic phrase boundary, pause labels are uniformly compressed to the short pause grade, and consecutive adjacent short pauses are suppressed, thereby avoiding the rhythm fragmentation caused by introducing redundant levels into segments whose pause durations do not differ significantly; when the layered rhythm intensity value is greater than or equal to the intensity threshold, executing an enhanced subdivision strategy, in which the candidate positions that are prosodic phrase boundaries are sorted by layered rhythm intensity value, which serves as the subdivision priority, higher-ranked positions are preferentially selected as large prosodic phrase boundaries, lower-ranked boundaries of the same kind are downgraded to small prosodic phrase boundaries, and the pause duration difference drives the level separation of short and long pauses; when the same position satisfies the conditions for both a large prosodic phrase and a small prosodic phrase, determining the final level and locking the boundary type according to the priority list on the side whose layered rhythm intensity value is closer to the intensity threshold, so as to avoid repeated switching; for over-long segments formed by continuous modifier chains at the sentence end, calculating the layered rhythm intensity value around the large prosodic phrase boundary and selecting the cut point with the larger layered rhythm intensity value, so that the large prosodic phrase boundary is moved forward and pulled apart from the adjacent large prosodic phrase boundary to tighten the sentence-final rhythm; when dynamic programming segmentation is performed on the candidate prosodic phrase boundary sequence, including the layered rhythm intensity value as a cost term in the path score, so that clause boundary coverage, completeness of the central word chain and the prosodic phrase length distribution constraint are satisfied simultaneously; outputting the prosodic word boundaries, the small prosodic phrase boundaries, the large prosodic phrase boundaries, the boundary codes and the semantic focus information; and, at the speech synthesis end, inserting pause symbols, mapping the small and large prosodic phrase boundaries to pause symbols, using the semantic focus to control sentence-boundary lengthening, and using the phrase boundaries for the joint constraint of fundamental frequency and prosodic phrases, so that the pause level subdivision is aligned with the semantic hierarchy.
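The boundary candidate computation of claim 4 and the threshold comparison of claim 5 can be illustrated with a minimal Python sketch. It assumes the three inputs are square nested lists indexed by virtual word position, reads the distance penalty as (distance + eps) squared, and uses hypothetical names (`boundary_candidates`, `screen_candidates`, `eps`) that do not appear in the patent text. The merging and label updates that claim 5 attaches to each branch are left out.

```python
def boundary_candidates(strength, indicator, distance, eps=1e-6):
    """Return one boundary candidate value per position i.

    strength[i][j]  : continuation relation strength between positions i and j
    indicator[i][j] : 1 if i and j share a candidate boundary influence range, else 0
    distance[i][j]  : word sequence distance between positions i and j
    eps             : the "small constant" of claim 4 (its value is an assumption)
    """
    values = []
    for i, row in enumerate(strength):
        total = 0.0
        for j, s_ij in enumerate(row):
            contribution = s_ij * indicator[i][j]      # boundary contribution term
            penalty = (distance[i][j] + eps) ** 2      # distance penalty term
            total += contribution / penalty            # contribution value of (i, j)
        values.append(total)
    return values


def screen_candidates(values, threshold):
    """Split positions into kept boundaries (>= threshold) and merge checks (< threshold)."""
    kept = [i for i, v in enumerate(values) if v >= threshold]
    to_check_for_merge = [i for i, v in enumerate(values) if v < threshold]
    return kept, to_check_for_merge


if __name__ == "__main__":
    # Illustrative toy data only; the diagonal strength is set to 0 so that a
    # position does not contribute to itself through the near-zero penalty.
    strength = [[0.0, 0.8, 0.2], [0.8, 0.0, 0.5], [0.2, 0.5, 0.0]]
    indicator = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    distance = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
    values = boundary_candidates(strength, indicator, distance)
    print(values, screen_candidates(values, threshold=0.6))
```

Because the penalty grows with the square of the distance, nearby virtual words dominate the sum, which matches the claim's intent that close, strongly connected virtual words drive the boundary decision.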
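The grammar information entropy of claim 6 and the threshold split of claim 7 can be sketched as follows. The claim wording does not say what the boundary correction term is added to, nor which logarithm base is used. This sketch assumes the entropy correction term is 1 + factor × boundary candidate value and uses the natural logarithm. The function names and default values are hypothetical.

```python
import math


def base_entropy(probs):
    """Basic grammar entropy: negated sum of p * log(p) over grammar roles."""
    # Roles with zero probability contribute nothing (the limit of p*log p is 0).
    return -sum(p * math.log(p) for p in probs if p > 0)


def grammar_information_entropy(probs, boundary_value, boundary_factor=0.5):
    """Basic entropy scaled by an entropy correction term.

    Assumption: correction = 1 + boundary_factor * boundary_value.
    """
    correction = 1.0 + boundary_factor * boundary_value
    return base_entropy(probs) * correction


def classify_virtual_word(entropy, entropy_threshold):
    """Claim 7 split: below the threshold is a stable anchor, otherwise undetermined."""
    return "stable_anchor" if entropy < entropy_threshold else "undetermined"


if __name__ == "__main__":
    # Illustrative values: a virtual word equally likely to fill two grammar roles.
    e = grammar_information_entropy([0.5, 0.5], boundary_value=1.0)
    print(e, classify_virtual_word(e, entropy_threshold=0.9))
```

A virtual word with a peaked role distribution yields low entropy and becomes a stable anchor, while one with a flat distribution is routed to the rechecking path that claim 7 describes.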
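One possible reading of the layered rhythm intensity in claim 9 is sketched below in Python. Several points are assumptions rather than statements of the patent. The integral over the time range is replaced by a sum over discrete pause periods. The time node difference is measured against a single boundary time. The rhythm correction factor is taken as 1 + gamma × entropy, although the claim wording is ambiguous there. The names `alpha`, `gamma` and `window` are hypothetical.

```python
def layered_rhythm_intensity(pauses, boundary_time, entropy,
                             alpha=1.0, gamma=0.5, window=None):
    """Return a layered rhythm intensity value for one prosodic phrase.

    pauses        : list of (duration_difference, pause_time_node) pairs
    boundary_time : time node of the prosodic phrase boundary
    entropy       : grammar information entropy value for the phrase
    alpha         : adjustment weight factor (assumed value)
    gamma         : information weight factor (assumed value)
    window        : optional (start, end) time range limiting which pauses count
    """
    base = 0.0
    for duration_diff, pause_time in pauses:
        if window is not None and not (window[0] <= pause_time <= window[1]):
            continue  # pause lies outside the given time range
        # Single pause contribution: difference damped by distance to the boundary.
        base += duration_diff / (1.0 + alpha * abs(pause_time - boundary_time))
    correction = 1.0 + gamma * entropy  # assumed form of the rhythm correction factor
    return base * correction


if __name__ == "__main__":
    # Illustrative values only.
    pauses = [(0.2, 1.0), (0.4, 2.0)]
    print(layered_rhythm_intensity(pauses, boundary_time=2.0, entropy=1.0))
```

Under this reading, pauses far from the phrase boundary are discounted, and phrases with higher grammar information entropy receive a proportionally larger intensity, which is then compared with the intensity threshold of claim 10.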
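The preprocessing of claim 3 (pause labelling and min-max normalization) can be shown with a short Python sketch. The claim does not state how a pause duration difference exactly equal to the threshold is labelled, so this sketch assumes it is treated as long. The function names and the constant-column handling are also assumptions.

```python
def pause_label(duration_diff, pause_threshold):
    """Short label below the threshold, long label otherwise (equality is an assumption)."""
    return "short" if duration_diff < pause_threshold else "long"


def min_max(values):
    """Min-max normalization to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; mapping it to 0.0 is an assumption.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_features(columns):
    """Normalize each named feature column independently.

    columns: dict mapping a field name (for example continuation strength,
    word sequence distance, grammar role probability) to a list of raw values.
    """
    return {name: min_max(vals) for name, vals in columns.items()}


if __name__ == "__main__":
    # Illustrative values only.
    raw = {"strength": [2.0, 4.0, 6.0], "distance": [1.0, 3.0, 5.0]}
    print(normalize_features(raw))
    print([pause_label(d, pause_threshold=0.3) for d in (0.1, 0.5)])
```

Normalizing each field separately puts strength, distance and probability on a common scale before they are spliced into the feature vector stored in the database.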

Description

Tibetan prosody structure prediction method based on grammar information

Technical Field

The invention relates to the technical field of speech synthesis, and in particular to a Tibetan prosody structure prediction method based on grammar information.

Background

In the field of Tibetan prosody structure prediction, accurately identifying and adjusting pause positions and accent placement is key to improving how synthesized text reads aloud. In the prior art, most research has focused on prosody prediction from grammar information and speech data, usually using simple pause rules or timing models based on short-term observations.

The invention patent with announcement number CN106294310B discloses a Tibetan tone prediction method and system. The method comprises: receiving a Tibetan text to be processed; performing word segmentation on the text to obtain word units; determining the part of speech of each word unit according to its context in the text; predicting the prosodic boundaries of the text; adjusting the word unit boundaries at the prosodic boundaries according to the part of speech of the word units located there; and, after the word unit boundaries are adjusted, performing tone prediction on the syllable units of the text according to the part of speech of the word units to obtain the tone information of the text. That invention addresses the tone variation of words with multiple tone patterns at different prosodic boundaries and effectively improves the application effect of Tibetan speech systems.
However, existing methods treat prosody structure mainly through static rules and do not adequately model the dynamic relationship between grammar information and prosody. No effective technique has yet been provided for extracting the grammar functions, clause levels and pause differences of virtual words from grammar information and optimizing the prosody structure by combining these features. Therefore, in view of the above problems, a Tibetan prosody structure prediction method based on grammar information is needed.

Disclosure of Invention

Technical problem to be solved

Aiming at the defects of the prior art, the invention provides a Tibetan prosody structure prediction method based on grammar information, which solves the problems of pause confusion and increased comprehension burden caused by misalignment between prosodic grouping and semantic hierarchy in Tibetan long-sentence speech synthesis.

Technical solution

The Tibetan prosody structure prediction method based on grammar information comprises the following steps: S1, completing Tibetan parsing and quantification of virtual word continuation, extracting grammar probability, hierarchical focus and pause difference, and preprocessing and storing the results to construct a Tibetan prosody prediction database; S2, performing boundary discrimination based on the virtual word continuation relation strength, the influence range hit marks and the word sequence distance, thereby realizing candidate boundary screening, serial segment merging, pause label configuration and grammar function label updating; S3, performing entropy feature analysis on the grammar role probability data and the boundary discrimination data, and performing virtual word anchor point calibration, clause skeleton adjustment and pause compression according to the entropy feature analysis result; S4, reading the grammar annotation and hierarchy information of the candidate boundaries, generating a three-layer prosodic boundary type sequence by boundary shaping, and suppressing excessive pauses; S5, evaluating rhythm strength from the pause duration difference and the grammar information entropy data, and performing phrase-level refinement and pause mark adjustment according to the rhythm evaluation result.

The method comprises the steps of collecting Tibetan text data, receiving a text to be processed through an input interface, performing word segmentation to obtain a word sequence and a part-of-speech sequence, and outputting a dependency arc set, a central word index set and a modification range set by combining syntactic dependency analysis; identifying all the virtual words in the word sequence, dividing them according to their grammatical characteristics into three classes, namely case auxiliary words, general virtual words and class virtual words, and marking and numbering the position of each virtual word; collecting the virtual word continuation rules, extracting valid connected virtual word combinations, synonymous variant replacement forms and prefix and suffix attachment information from a virtual word continuation constraint table formed by sorting based on corp