
CN-121983029-A - Harmonic sound generation method and terminal based on deep learning

CN121983029A

Abstract

The application discloses a deep-learning-based harmonic sound generation method and terminal. The method acquires a source language text in the language to be converted and extracts a phoneme stream from it; segments the phoneme stream according to the language rules of the target language to obtain a plurality of first syllables; acquires a pronunciation library corresponding to the target language, matches each first syllable to a corresponding second syllable in the pronunciation library, and acquires the character form corresponding to each second syllable; establishes an association between the source language text and all of the character forms, and trains an initial model on the association to obtain a target model; and, when text to be converted in the source language is received, obtains the harmonic character forms corresponding to that text through the target model. Because the model learns the nonlinear mapping between phoneme combinations of the source language text and target-language pronunciations, the system can still infer accurate and reasonable harmonic phonetic annotations from the learned general pronunciation logic even for rare words and new words that never appeared in the training set.

Inventors

  • QIU ZHONGHAO
  • ZHENG YUAN
  • LIU LINGHUI

Assignees

  • 福建星网视易信息系统有限公司 (Fujian Star-net eVideo Information System Co., Ltd.)

Dates

Publication Date
2026-05-05
Application Date
2025-12-29

Claims (10)

  1. A deep-learning-based harmonic sound generation method, comprising: acquiring a source language text in the language to be converted, and extracting a phoneme stream of the source language text; segmenting the phoneme stream according to the language rules of the target language to obtain a plurality of first syllables; acquiring a pronunciation library corresponding to the target language, matching each first syllable to a corresponding second syllable in the pronunciation library, and acquiring the character form corresponding to the second syllable; establishing an association between the source language text and all of the character forms, and training an initial model on the association to obtain a target model; and, upon receiving text to be converted in the language to be converted, obtaining the harmonic character forms corresponding to the text to be converted through the target model.
  2. The deep-learning-based harmonic sound generation method of claim 1, wherein extracting the phoneme stream of the source language text comprises: acquiring an original phoneme stream corresponding to the source language text through a grapheme-to-phoneme conversion model; and converting the original phoneme stream into International Phonetic Alphabet form to obtain the phoneme stream corresponding to the source language text.
  3. The deep-learning-based harmonic sound generation method of claim 1, wherein segmenting the phoneme stream according to the language rules of the target language to obtain a plurality of first syllables comprises: traversing the phoneme stream; when traversing to a current phoneme, judging whether the current phoneme together with all pending phonemes not yet segmented before it satisfies a preset syllable structure; if so, checking the syllable positions of the current phoneme and the pending phonemes; if the check passes, continuing the traversal; if the check fails, segmenting between the current phoneme and the pending phonemes, taking all pending phonemes before the current phoneme as one first syllable, and restarting the traversal from the current phoneme.
  4. The deep-learning-based harmonic sound generation method of claim 3, wherein traversing the phoneme stream further comprises: when traversing to the current phoneme, if there are no unsegmented pending phonemes before the current phoneme, continuing the traversal.
  5. The deep-learning-based harmonic sound generation method of claim 3, wherein judging whether the current phoneme and all pending phonemes not yet segmented before it satisfy the preset syllable structure further comprises: if the current phoneme and the pending phonemes do not satisfy the preset syllable structure, segmenting the pending phonemes into one first syllable and restarting the traversal from the current phoneme.
  6. The deep-learning-based harmonic sound generation method of claim 1, wherein matching each first syllable to a corresponding second syllable in the pronunciation library comprises: acquiring candidate syllables corresponding to the first syllable in the pronunciation library; calculating an overall phonetic distance between the first syllable and each candidate syllable; and ranking all candidate syllables based on the overall phonetic distance, and determining the second syllable from the candidate syllables according to the ranking result.
  7. The deep-learning-based harmonic sound generation method of claim 6, wherein ranking all candidate syllables based on the overall phonetic distance and determining the second syllable according to the ranking result comprises: calculating a final weighted score for each candidate syllable as FinalScore = Dist(S1, S2) - w × log(Frequency(S2)), wherein FinalScore denotes the final weighted score, Dist(S1, S2) denotes the overall phonetic distance between the first syllable S1 and the candidate syllable S2, w denotes a preset weight parameter, and Frequency(S2) denotes the occurrence frequency of the character form corresponding to the candidate syllable; and ranking all candidate syllables in ascending order of final weighted score and taking the first-ranked candidate syllable as the second syllable.
  8. The deep-learning-based harmonic sound generation method of claim 1, wherein training the initial model through the association to obtain the target model comprises: inputting the phoneme stream into an initial encoder of the initial model to obtain a predicted character-form sequence output by the initial decoder corresponding to the initial encoder; and adjusting parameters of the initial encoder and the initial decoder, with the objective of minimizing the deviation between the predicted character-form sequence and the target sequence formed by all character forms corresponding to the phoneme stream, to obtain the target model.
  9. The deep-learning-based harmonic sound generation method of claim 8, further comprising: if the deviation is smaller than a deviation threshold or the number of training iterations reaches a preset number, ending model training to obtain the target model; otherwise, taking the parameter-adjusted initial encoder and initial decoder as a new initial model and returning to the step of inputting the phoneme stream into the initial encoder of the initial model to obtain the predicted character-form sequence output by the corresponding initial decoder.
  10. A deep-learning-based harmonic sound generation terminal, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the deep-learning-based harmonic sound generation method according to any one of claims 1 to 9.
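The ascending-score ranking in claim 7 can be sketched in a few lines of Python. The candidate syllables, their distance and frequency values, and the weight w = 0.3 below are all hypothetical stand-ins for the patent's pronunciation-library data:

```python
import math

def rank_candidates(dist, freq, w=0.3):
    """Rank candidate syllables by the claim-7 score
    FinalScore = Dist(S1, S2) - w * log(Frequency(S2)).

    dist: candidate syllable -> overall phonetic distance to the first syllable
    freq: candidate syllable -> occurrence frequency of its character form
    w:    preset weight parameter (hypothetical value)

    Lower scores rank first: candidates that sound closer to the source
    syllable and whose character forms are more frequent win.
    """
    scores = {s2: dist[s2] - w * math.log(freq[s2]) for s2 in dist}
    return sorted(scores, key=scores.get)  # ascending: best candidate first

# Hypothetical candidates for one first syllable:
dist = {"ni": 0.10, "li": 0.12, "mi": 0.30}  # Dist(S1, S2)
freq = {"ni": 1000, "li": 5000, "mi": 50}    # Frequency(S2)
ranked = rank_candidates(dist, freq)
second_syllable = ranked[0]
```

Note how the frequency term lets a very common character form ("li") outrank a marginally closer but rarer one ("ni"); that trade-off is what the preset weight parameter w controls.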

Description

Harmonic sound generation method and terminal based on deep learning

Technical Field

The present invention relates to the field of natural language processing, and in particular to a harmonic sound generation method and terminal.

Background

When learning a foreign language whose pronunciation system differs greatly from the learner's native language, pronunciation is a major challenge. Current approaches to foreign-language pronunciation learning mainly fall into three categories:

1) The professional phonetic-symbol method, in which the learner must master an entirely new notation system such as the International Phonetic Alphabet (IPA), Japanese romanization (Romaji), or Korean romanization. This method is accurate and standardized, but it has a high learning threshold, a heavy memorization burden, and a relatively tedious process; it is inefficient for beginners who want to quickly learn songs or imitate everyday spoken language.

2) "Empty-ear" harmonic annotation, in which a learner or sharer labels a foreign pronunciation with native-language characters of similar sound, for example labeling a Japanese word's pronunciation with similar-sounding Chinese characters. This method is intuitive, easy to pick up, and lowers the psychological threshold for beginners, but its defects are obvious: 1. different annotators choose different characters, so annotation results are inconsistent and lack a standard; 2. the annotations are usually scattered and ad hoc, so no systematic learning method can be formed to comprehensively improve learners' pronunciation; 3. manual annotation is inefficient and cannot keep pace with a continually growing vocabulary, and in theory it can never completely cover all word pronunciations.
3) The frequency-statistics table-lookup mapping method, an upgrade of method 2). A phoneme-mapping dictionary must be constructed manually, which involves a large workload and low efficiency; only inputs within the dictionary's coverage can normally produce "empty-ear" harmonic annotations, so the method's universality and generalization ability are weak overall.

Therefore, the related art lacks a pronunciation aid that combines intuitive usability, efficiency and accuracy, and universality with generalization ability, and thus cannot efficiently help native-language users with foreign-language pronunciation learning.

Disclosure of the Invention

The technical problem the invention aims to solve is to provide a harmonic sound generation method and terminal that improve the accuracy of generating native-language harmonic annotations for foreign pronunciation.

To solve this technical problem, the invention adopts the following technical scheme: a harmonic sound generation method, comprising: acquiring a source language text in the language to be converted, and extracting a phoneme stream of the source language text; segmenting the phoneme stream according to the language rules of the target language to obtain a plurality of first syllables; acquiring a pronunciation library corresponding to the target language, matching each first syllable to a corresponding second syllable in the pronunciation library, and acquiring the character form corresponding to the second syllable; establishing an association between the source language text and all of the character forms, and training an initial model on the association to obtain a target model; and, upon receiving text to be converted in the language to be converted, obtaining the harmonic character forms corresponding to the text to be converted through the target model.
To solve the same technical problem, the invention adopts another technical scheme: a harmonic sound generation terminal, comprising a memory, a processor, and a computer program stored on the memory and runnable on the processor, wherein the processor, when executing the computer program, implements the steps of the harmonic sound generation method described above.

The invention has the advantage that the language to be converted is represented as a phoneme stream, the phoneme stream is segmented into a plurality of first syllables according to the language rules of the target language, and each first syllable is then matched to a second syllable of the target language through the target language's pronunciation library, so that the pronunciation of each first syllable is approximated by the pronunciation of a second syllable and the pronunciation of the source-language text is approximated by target-language pronunciation,