KR-20260066936-A - METHOD FOR GENERATING PRONUNCIATION SEQUENCE AND APPARATUS THEREOF


Abstract

The present invention relates to a deep learning-based method for generating a pronunciation sequence, comprising: receiving a string to be subject to speech synthesis; dividing the string into phoneme units; predicting a phoneme-unit pronunciation sequence based on the divided phoneme-unit string using a pre-trained pronunciation sequence prediction model; and combining the predicted phoneme-unit pronunciation sequences to generate a syllable-unit pronunciation sequence.
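
The flow summarized in the abstract (divide into phoneme units, predict, recombine into syllables) can be illustrated with a minimal Python sketch. It assumes that "phoneme units" correspond to Unicode conjoining jamo obtained through NFD normalization, which is one possible reading and not necessarily the decomposition the applicant uses. The function names and the one-entry lookup table `TOY_TABLE` are hypothetical stand-ins; in particular, `toy_predict` only marks the place where the pre-trained prediction model would be called.

```python
import unicodedata


def split_into_phonemes(text: str) -> list[str]:
    """Decompose each Hangul syllable into conjoining jamo (Unicode NFD)."""
    return list(unicodedata.normalize("NFD", text))


def combine_into_syllables(phonemes: list[str]) -> str:
    """Recombine a jamo sequence into precomposed Hangul syllables (Unicode NFC)."""
    return unicodedata.normalize("NFC", "".join(phonemes))


# Hypothetical one-entry table standing in for the trained model:
# the spelling "같이" is pronounced [가치] in standard Korean.
TOY_TABLE = {
    unicodedata.normalize("NFD", "같이"): unicodedata.normalize("NFD", "가치"),
}


def toy_predict(phonemes: list[str]) -> list[str]:
    """Placeholder for the prediction model: look up the jamo string, else echo it."""
    key = "".join(phonemes)
    return list(TOY_TABLE.get(key, key))


def generate_pronunciation_sequence(text: str, predict=toy_predict) -> str:
    """Receive a string, split it into jamo, predict, and recombine into syllables."""
    return combine_into_syllables(predict(split_into_phonemes(text)))


if __name__ == "__main__":
    print(split_into_phonemes("한국"))            # six jamo, three per syllable
    print(generate_pronunciation_sequence("같이"))
```

Working at the jamo level keeps the prediction vocabulary small, since every Hangul syllable is built from a fixed inventory of initial, medial, and final jamo.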

Inventors

김미연

Assignees

주식회사 케이티

Dates

Publication Date: 20260512
Application Date: 20241105

Claims (20)

1. A method for generating a pronunciation sequence, performed by a computing device, the method comprising: receiving a string that is a target of speech synthesis; dividing the string into phoneme units; predicting a phoneme-unit pronunciation sequence from the divided phoneme-unit string using a pre-trained pronunciation sequence prediction model; and combining the predicted phoneme-unit pronunciation sequence to generate a syllable-unit pronunciation sequence.
2. The method of claim 1, further comprising inserting predetermined boundary symbols in syllable/word/sentence units of the string.
3. The method of claim 2, further comprising verifying the predicted phoneme-unit pronunciation sequence using the phoneme-unit string into which the syllable/word/sentence-unit boundary symbols have been inserted.
4. The method of claim 3, wherein the verifying comprises comparing the phoneme-unit string and the phoneme-unit pronunciation sequence on the basis of a sentence boundary symbol to check for a sentence repetition error in the pronunciation sequence.
5. The method of claim 3, wherein the verifying comprises comparing the phoneme-unit string and the phoneme-unit pronunciation sequence on the basis of a word boundary symbol to check for a word repetition error in the pronunciation sequence.
6. The method of claim 3, wherein the verifying comprises comparing the phoneme-unit string and the phoneme-unit pronunciation sequence on the basis of a syllable boundary symbol to check for a syllable prediction error in the pronunciation sequence.
7. The method of claim 6, wherein the verifying comprises dividing the phoneme-unit pronunciation sequence into syllable units using the syllable boundary symbol and verifying whether the initial, medial, and final sounds of each divided syllable conform to standard pronunciation rules.
8. The method of claim 1, further comprising providing the generated syllable-unit pronunciation sequence to a speech synthesis model.
9. The method of claim 1, further comprising generating the pronunciation sequence prediction model.
10. The method of claim 9, wherein generating the pronunciation sequence prediction model comprises: obtaining a plurality of string data and a plurality of pronunciation sequence data; preprocessing the string data and the pronunciation sequence data; dividing the preprocessed string data and pronunciation sequence data into phoneme units; constructing training data pairs from the divided phoneme-unit string data and pronunciation sequence data; and generating the pronunciation sequence prediction model by fine-tuning a pre-trained language model (PLM) on the training data pairs.
11. The method of claim 10, wherein generating the pronunciation sequence prediction model further comprises inserting predetermined boundary symbols in syllable/word/sentence units of the preprocessed string data and pronunciation sequence data.
12. The method of claim 10, wherein the preprocessing comprises extracting non-Hangul characters from the string data and the pronunciation sequence data and converting the extracted non-Hangul characters into Hangul.
13. The method of claim 10, wherein the PLM is a multilingual language model.
14. A computer program stored on a computer-readable recording medium to execute, on a computer, the method according to any one of claims 1 to 13.
15. A pronunciation sequence generation device comprising one or more processors, wherein the one or more processors perform: an operation of receiving a string that is a target of speech synthesis; an operation of dividing the string into phoneme units; an operation of predicting a phoneme-unit pronunciation sequence from the divided phoneme-unit string using a pre-trained pronunciation sequence prediction model; and an operation of combining the predicted phoneme-unit pronunciation sequence to generate a syllable-unit pronunciation sequence.
16. The device of claim 15, wherein the one or more processors further perform an operation of inserting predetermined boundary symbols in syllable/word/sentence units of the string.
17. The device of claim 16, wherein the one or more processors further perform an operation of verifying the predicted phoneme-unit pronunciation sequence using the phoneme-unit string into which the syllable/word/sentence-unit boundary symbols have been inserted.
18. The device of claim 17, wherein the one or more processors check for a sentence repetition error in the pronunciation sequence by comparing the phoneme-unit string and the phoneme-unit pronunciation sequence on the basis of a sentence boundary symbol.
19. The device of claim 17, wherein the one or more processors check for a word repetition error in the pronunciation sequence by comparing the phoneme-unit string and the phoneme-unit pronunciation sequence on the basis of a word boundary symbol.
20. The device of claim 17, wherein the one or more processors check for a syllable prediction error in the pronunciation sequence by comparing the phoneme-unit string and the phoneme-unit pronunciation sequence on the basis of a syllable boundary symbol.
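
The boundary-symbol insertion and verification recited in the claims can be sketched as follows. This is an illustration under stated assumptions, not the applicant's implementation: the symbols `|`, `_`, and `#` are hypothetical (the text here does not say which characters are used), phoneme units are taken to be Unicode conjoining jamo, and the comparison is reduced to counting boundary symbols in the input phoneme string and in the predicted sequence. The actual comparison procedure may differ.

```python
import unicodedata

# Hypothetical boundary symbols for syllable, word, and sentence units.
SYL, WORD, SENT = "|", "_", "#"


def to_marked_phonemes(sentences: list[str]) -> list[str]:
    """Split each sentence into jamo and append a boundary symbol after
    every syllable, every word, and every sentence."""
    out: list[str] = []
    for sentence in sentences:
        for word in sentence.split():
            for syllable in word:
                out.extend(unicodedata.normalize("NFD", syllable))
                out.append(SYL)
            out.append(WORD)
        out.append(SENT)
    return out


def verify(source: list[str], predicted: list[str]) -> list[str]:
    """Compare boundary-symbol counts and return the kinds of error found.

    An empty list means the counts agree. This simplified check assumes a
    repetition shows up as extra boundary symbols in the prediction.
    """
    errors: list[str] = []
    if predicted.count(SENT) > source.count(SENT):
        errors.append("sentence repetition")
    if predicted.count(WORD) > source.count(WORD):
        errors.append("word repetition")
    if predicted.count(SYL) != source.count(SYL):
        errors.append("syllable prediction")
    return errors


if __name__ == "__main__":
    src = to_marked_phonemes(["같이 가요"])
    print(verify(src, list(src)))   # counts agree
    print(verify(src, src + src))   # whole sentence emitted twice
    print(verify(src, src[4:]))     # first syllable dropped
```

Because the boundary symbols pass through the model together with the phonemes, a count mismatch at a given level points to the kind of error involved without requiring a reference pronunciation.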

Description

Method for generating a pronunciation sequence and apparatus therefor

The present invention relates to speech synthesis technology, and more specifically to a method and apparatus for generating a phoneme-based pronunciation sequence using deep learning.

Text-to-Speech (TTS) refers to a technology by which a computer or device delivers information to a user as audible speech, or to the software and hardware that implement such technology. Because the spelling of a written sentence does not always match the way it is actually pronounced, speech synthesized directly from the written form suffers in quality. To address this problem, techniques have been proposed that generate, from an input string, a sequence of pronunciations as a human would actually speak them.

Conventional methods for generating pronunciation sequences are either rule-based or deep learning-based. The rule-based method generates a pronunciation sequence by applying Korean pronunciation conversion rules using part-of-speech information of the morphemes that make up the input string. However, exceptions not covered by the Korean pronunciation conversion rules occur frequently with this method. The deep learning-based method, meanwhile, builds a syllable-unit pronunciation sequence prediction model from a Korean language model and generates a pronunciation sequence with that model. However, this method suffers from incorrect mapping between strings and pronunciation sequences, omission of some pronunciations, and repetition of some pronunciations. A new pronunciation sequence generation method is therefore needed to improve the quality of speech synthesis.

FIG. 1 is a block diagram of a pronunciation sequence generating device according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method for generating a pronunciation sequence prediction model according to an embodiment of the present invention.
FIG. 3 is a drawing referenced to explain the method of generating a pronunciation sequence prediction model of FIG. 2.
FIG. 4 is a flowchart illustrating a method for generating a pronunciation sequence according to an embodiment of the present invention.
FIG. 5 is a drawing referenced to explain the method of generating a pronunciation sequence of FIG. 4.
FIG. 6 is a flowchart illustrating a pronunciation sequence verification method according to an embodiment of the present invention.
FIG. 7 is a drawing referenced to explain the pronunciation sequence verification method of FIG. 6.
FIG. 8 is a block diagram of a computing device according to an embodiment of the present invention.

Hereinafter, embodiments disclosed in this specification are described in detail with reference to the attached drawings. Identical or similar components are assigned the same reference numerals regardless of the drawing in which they appear, and redundant descriptions of them are omitted. The suffixes "module" and "part" used for components in the following description are assigned or used interchangeably solely for ease of drafting the specification and do not in themselves carry distinct meanings or roles. That is, the term "part" as used in the present invention refers to software or to a hardware component such as an FPGA or ASIC, and a "part" performs certain roles. However, the meaning of "part" is not limited to software or hardware. A "part" may be configured to reside in an addressable storage medium or may be configured to execute on one or more processors. Accordingly, as an example, a "part" includes components such as software components, object-oriented software components, class components, and task components, as well as processes, functions, attributes, procedures, subroutines, segments of program code, drivers, firmware, microcode, circuits, data, databases, data structures, tables, arrays, and variables. The functionality provided within the components and "parts" may be combined into a smaller number of components and "parts" or further separated into additional components and "parts".

In addition, in describing the embodiments disclosed in this specification, a detailed description of related prior art is omitted where it is determined that such a description would obscure the essence of the embodiments. Furthermore, the attached drawings are intended only to facilitate understanding of the embodiments disclosed in this specification; the technical concept disclosed in this specification is not limited by the attached drawings and should be understood to cover all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.

The present invention proposes a method and apparatus for g