KR-102961624-B1 - Lip sync model generation device and lip sync model generation method using the same

KR 102961624 B1

Abstract

The present invention relates to a lip-sync model generating device and a method for generating a lip-sync model using the same, and more specifically, to a lip-sync model generating device that changes the shape of the lips according to a received voice and a method for generating a lip-sync model using the same. To this end, the lip-sync generating device of the present invention comprises: a voice data receiving unit that receives voice data from the outside; a control module that divides the voice data received from the voice data receiving unit into phoneme units, calculates the ratio of consonants and vowels constituting the divided phoneme units, and calculates the time allocated to maintain the lip shape for each phoneme according to the calculated ratio of consonants and vowels; a storage module that stores the lip shape corresponding to the phoneme unit; and an output module that outputs the lip shape corresponding to the allocated time calculated by the control module.
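The abstract above describes allocating lip-shape display time per phoneme from the consonant/vowel ratio, with vowels receiving relatively more time. As a minimal illustrative sketch (not the patent's actual implementation; the function name, the simplified vowel set, and the 1.5 vowel weight are all assumptions), this allocation could look like:

```python
# Illustrative sketch only: distribute a total duration over phonemes,
# weighting vowels more heavily than consonants, as the abstract describes.
VOWELS = set("aeiou")  # simplified stand-in for the actual vowel inventory

def allocate_times(phonemes, total_ms, vowel_weight=1.5):
    """Return a per-phoneme duration list summing to total_ms.

    Each consonant gets weight 1.0 and each vowel gets vowel_weight,
    so vowels receive proportionally more pronunciation time.
    """
    weights = [vowel_weight if p in VOWELS else 1.0 for p in phonemes]
    total_w = sum(weights)
    return [total_ms * w / total_w for w in weights]
```

For example, `allocate_times(["k", "a"], 300.0)` splits 300 ms into 120 ms for the consonant and 180 ms for the vowel.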

Inventors

  • 박성호
  • 서종국

Assignees

  • (주)모션어드바이저

Dates

Publication Date
2026-05-07
Application Date
2024-11-14

Claims (5)

  1. A lip-sync generating device comprising: a voice data receiving unit that receives voice data from the outside; a control module that separates the voice data received from the voice data receiving unit into phoneme units, calculates the ratio of consonants and vowels constituting the separated phonemes, and calculates the time allocated to maintaining the lip shape for each phoneme according to the calculated ratio of consonants and vowels; a storage module that stores lip shapes corresponding to the phoneme units; and an output module that outputs the lip shape corresponding to the allocated time calculated by the control module,
wherein the control module allocates relatively more pronunciation time to vowels than to consonants, and calculates the pronunciation time of each consonant and vowel using the number of consonants and vowels constituting the received voice data and the pronunciation weight of the vowels;
wherein, to distinguish consonants from vowels in the phonemes, a first weight is assigned to a first model that calculates the probability of a consonant or a vowel appearing in a separated phoneme, and a second weight is assigned to a second model that calculates the start and end times of the phonemes constituting the voice data, the first weight being greater than the second weight;
wherein the first model receives the phoneme distribution in each time frame and calculates the probability that consonants and vowels appear in the phonemes, including in the phoneme transition intervals;
wherein the second model distinguishes the boundaries between phonemes, determines the specific time lengths of consonants and vowels, and calculates the start and end times of the phonemes based on the actual duration of each phoneme in continuous pronunciation;
wherein the results of the first model and the second model are combined at the point of consonant-vowel transition, the first weight being set higher than the second weight in speech recognition and the second weight being set higher than the first weight in pronunciation correction;
wherein the first model and the second model adjust the relative ratio of consonants and vowels using a consonant-vowel ratio adjustment variable, the first model adjusts the sensitivity of phoneme judgment using a probability scaling variable, and the second model adjusts the window size around phoneme start and end points using a time window adjustment variable and mitigates the influence between neighboring phonemes using a boundary smoothing variable; and
wherein a virtual object includes facial image data including a lip shape, the lip-sync generating device changing the lip shape depending on whether the face of the virtual object looks in the frontal direction or in a direction other than the frontal direction.
  2. The lip-sync generating device according to claim 1, wherein the control module converts the received voice data into text data and separates the converted text data into phoneme units.
  3. The lip-sync generating device according to claim 1, wherein the control module distinguishes, in the voice data received from the voice data receiving unit, voice segments containing the speaker's voice from non-voice segments not containing the speaker's voice, and amplifies the voice data of the voice segments, the voice segments being distinguished from the non-voice segments by calculating a log-amplitude graph corresponding to the received voice data and using the log-amplitude value at the point connecting the region having relatively high peaks and the region having relatively low peaks in the calculated graph.
  4. The lip-sync generating device according to claim 1, wherein the control module generates a face shape including lip shapes using at least two learning algorithms.
  5. The lip-sync generating device according to claim 1, further comprising a communication module that receives, from an external server, an algorithm required to drive the lip-sync generating device.
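Claim 1 describes fusing two models at phoneme transitions: a first model giving the probability that a phoneme is a consonant or vowel, and a second model giving boundary timing, with the first weight set higher in the speech-recognition setting. A hedged sketch of that weighted fusion (the function name, the 0.7/0.3 weights, and the 0.5 decision threshold are illustrative assumptions, not values from the patent):

```python
# Illustrative sketch of the claim-1 fusion: combine the first model's
# vowel-appearance probability with the second model's boundary-timing
# estimate using a weighted average, with the first weight dominating.
def fuse_consonant_vowel(p_vowel_first, p_vowel_second,
                         w_first=0.7, w_second=0.3):
    """Classify a phoneme at a transition point as 'vowel' or 'consonant'.

    w_first > w_second mirrors the speech-recognition setting of claim 1;
    swapping them would mirror the pronunciation-correction setting.
    """
    fused = (w_first * p_vowel_first + w_second * p_vowel_second) / (w_first + w_second)
    return "vowel" if fused >= 0.5 else "consonant"
```

With these weights the first model's estimate dominates: `fuse_consonant_vowel(0.9, 0.2)` yields `"vowel"`, while `fuse_consonant_vowel(0.2, 0.9)` yields `"consonant"` despite the second model's high vowel probability.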

Description

Lip sync model generation device and lip sync model generation method using the same

The present invention relates to a lip-sync model generating device and a method for generating a lip-sync model using the same, and more specifically, to a lip-sync model generating device that changes the shape of the lips according to a received voice and a method for generating a lip-sync model using the same.

The technology that converts sentences and text into speech is called TTS (Text To Speech), and with the development of various AI-based learning technologies it is now possible to achieve remarkably natural speech conversion. More recently, Speech To Face (STF) technology has been developed, which, based on the speech, transforms the mouth shape of a person or character on a screen in the manner of an avatar. To create a face that matches the voice, a face generation process is added; this process reflects the speaker's speaking habits, intonation, speed, and other characteristics, as well as the speaker's natural facial features and lip shape, to provide a video with the effect of the speaker actually speaking. Both TTS and STF are making significant advances in their respective fields by utilizing methods based on artificial neural networks. Services that produce speech videos with avatars using these neural network technologies have recently been proposed; however, current technology only allows the time-consuming production of test videos that merely speak simple text. Owing to various commercialization issues, such as the unnaturalness of mechanically constructed videos, poor image quality, unnatural lip movements, and difficulties in real-time processing, the technology remains at the research and experimental stage.

FIG. 1 is a block diagram illustrating the configuration of a lip-sync generating device according to an embodiment of the present invention. FIG. 2 is a diagram illustrating operations performed in the control module according to an embodiment of the present invention. FIG. 3 illustrates a lip-sync generating device according to an embodiment of the present invention. FIG. 4 is a diagram illustrating the concept of removing noise during preprocessing according to an embodiment of the present invention. FIG. 5 illustrates the operation of a lip-sync generating device according to an embodiment of the present invention.

The foregoing and additional aspects of the present invention will become more apparent through the preferred embodiments described with reference to the accompanying drawings. Hereinafter, the present invention is described in detail so that those skilled in the art can easily understand and reproduce it. The configuration of the lip-sync generating device according to an embodiment of the present invention is examined in detail using FIG. 1.

According to FIG. 1, the lip-sync generating device (100) includes a voice data receiving module (110), a storage module (120), a communication module (130), an output module (140), and a control module (150). Of course, configurations other than those described above may also be included in the lip-sync generating device (100) proposed in the present invention. The voice data receiving module (110) is composed of a microphone or the like and receives voice data from the outside; it may also receive voice data through a communication means other than a microphone.
The storage module (120) stores the programs or applications necessary to operate the lip-sync generating device (100) proposed in the present invention, and stores lip shapes corresponding to phoneme units. The communication module (130) communicates with an external server: it receives the programs or algorithms necessary to operate the lip-sync generating device (100) from the external server, and may also transmit the received voice to the external server and receive a lip shape corresponding to the transmitted voice. The output module (140) outputs the lip shape corresponding to the received voice, continuously outputting lip shapes according to the received voice data. The control module (150) controls the operation and function of each component constituting the lip-sync generating device proposed in the present invention. The control module (150) does not immediately generate a lip shape corresponding to the voice data received from the microphone; it first converts the received voice data into text corresponding to the voice data, and then generates a lip shape from the converted text.
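Claim 3 describes preprocessing that separates voice from non-voice segments using a log-amplitude graph, with the threshold placed at the point connecting the high-peak and low-peak regions, and amplifies the voice segments. A minimal sketch of that idea, under stated assumptions (frame-level processing, the midpoint of the log-amplitude range as the connecting point, and a fixed gain of 2.0 are all illustrative choices, not taken from the patent):

```python
import math

# Illustrative sketch of the claim-3 preprocessing: per-frame log amplitude,
# a threshold between the high-peak and low-peak regions, and amplification
# of the frames classified as voiced.
def segment_and_amplify(frames, gain=2.0):
    """Return frames with voiced frames amplified by `gain`.

    A frame's log amplitude is 20*log10 of its peak absolute sample; the
    threshold is taken as the midpoint of the log-amplitude range (an
    assumed stand-in for the "connecting point" of the claim).
    """
    log_amps = [20 * math.log10(max(max(abs(x) for x in f), 1e-12))
                for f in frames]
    threshold = (max(log_amps) + min(log_amps)) / 2.0
    out = []
    for f, la in zip(frames, log_amps):
        if la >= threshold:
            out.append([x * gain for x in f])  # voiced: amplify
        else:
            out.append(list(f))               # non-voiced: pass through
    return out
```

For instance, a loud frame `[0.5, 0.4]` is doubled to `[1.0, 0.8]` while a near-silent frame `[0.001, 0.002]` is left unchanged.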