KR-20260065580-A - AI-BASED VOICE GENERATION METHOD, NEURAL NETWORK TRAINING METHOD, AND ELECTRONIC DEVICE FOR PERFORMING THE SAME

KR 20260065580 A

Abstract

An AI-based voice generation method performed by an electronic device is disclosed. The AI-based voice generation method comprises the steps of receiving artist information corresponding to user input, receiving an alarm time and alarm text for outputting an alarm, generating voice data by converting the alarm text into an artist's voice based on the artist information using an artificial neural network module, and outputting an alarm based on the voice data when the alarm time arrives. The artificial neural network module is a module trained to output voice data converted into the artist's voice corresponding to the user input based on voice feature data and speech pattern data for the artist.
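
The abstract describes a four-step flow: receive artist information, receive the alarm time and alarm text, convert the text to the artist's voice with a trained neural network module, and play the result when the alarm fires. The following Python sketch shows how such a flow could be wired together; the class and function names (ArtistVoiceTTS, run_alarm, play) are hypothetical illustrations rather than identifiers from the patent, and the synthesis step is left as a placeholder.

    import datetime
    import time

    class ArtistVoiceTTS:
        # Hypothetical stand-in for the trained artificial neural network module,
        # assumed to have been trained on the artist's voice feature data and
        # speech pattern data as described in the claims.
        def __init__(self, artist_info: str):
            self.artist_info = artist_info  # selects whose voice model to use

        def synthesize(self, text: str) -> bytes:
            # Placeholder: a real module would return waveform audio for `text`
            # rendered in the selected artist's voice.
            raise NotImplementedError

    def run_alarm(artist_info: str, alarm_time: datetime.datetime,
                  alarm_text: str, play) -> None:
        # Generate the alarm voice ahead of time, then play it when the alarm fires.
        tts = ArtistVoiceTTS(artist_info)
        voice_data = tts.synthesize(alarm_text)       # alarm text -> artist's voice
        while datetime.datetime.now() < alarm_time:   # wait for the alarm time
            time.sleep(1)
        play(voice_data)                              # output the alarm audio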

Inventors

  • 김미희
  • 김정우

Assignees

  • 주식회사 빅크

Dates

Publication Date
2026-05-08
Application Date
2025-10-31
Priority Date
2024-11-01

Claims (13)

  1. An AI-based voice generation method performed by an electronic device, the method comprising: receiving artist information corresponding to a user input; receiving an alarm time and an alarm text for outputting an alarm; generating voice data by converting the alarm text into a voice of an artist based on the artist information using an artificial neural network module; and outputting the alarm based on the voice data when the alarm time arrives, wherein the artificial neural network module is a module trained to output voice data in which the alarm text is converted into the voice of the artist corresponding to the user input, based on voice feature data and speech pattern data for the artist.
  2. The method of claim 1, further comprising: obtaining a training voice of the artist corresponding to the user input from the artist information; and obtaining the voice feature data and the speech pattern data from the training voice of the artist, wherein obtaining the voice feature data comprises: obtaining timbre information, frequency information, and voice amplitude information from the training voice of the artist; and obtaining the voice feature data by combining feature vectors corresponding respectively to the timbre information, the frequency information, and the voice amplitude information, and wherein obtaining the speech pattern data comprises: obtaining speech intonation information, speech rate information, and speech pause information from the training voice of the artist; and obtaining the speech pattern data by combining feature vectors corresponding respectively to the speech intonation information, the speech rate information, and the speech pause information (a minimal feature-extraction sketch follows the claims).
  3. The method of claim 2, wherein the artificial neural network module is a module trained to: divide the alarm text into phoneme units to generate an embedding vector corresponding to the alarm text; convert each of the voice feature data and the speech pattern data into an embedding space; merge the voice feature data and the speech pattern data converted into the embedding space to generate an acoustic feature vector for the artist; and generate the voice data by decoding the acoustic feature vector.
  4. The method of claim 3, wherein the artificial neural network module is a module trained to: perform, in a first layer, an operation between the embedding vector and first mapping data corresponding to the voice feature data; perform, in a second layer, an operation between an output of the first layer and second mapping data corresponding to the speech pattern data; and generate the acoustic feature vector based on an output of the second layer.
  5. The method of claim 4, wherein generating the voice data comprises: identifying an emotion type corresponding to the alarm text; determining a first weight corresponding to the identified emotion type based on an emotion-type-specific weight table including weight information corresponding to each emotion type; and generating the acoustic feature vector by applying the first weight to the second mapping data using the artificial neural network module, wherein the emotion type includes a joy emotion type, a sadness emotion type, and an encouragement emotion type.
  6. The method of claim 5, wherein generating the voice data comprises: identifying an alarm text type by analyzing a sentence-ending form of the alarm text; determining a second weight corresponding to the alarm text type; and generating the acoustic feature vector by additionally applying, using the artificial neural network module, the second weight to the second mapping data to which the first weight has been applied, wherein the alarm text type is either a declarative type or an interrogative type (an illustrative sketch of this weighted two-layer mapping follows the claims).
  7. The method of claim 6, wherein the first layer is a layer set by fixed parameters based on unique vocal characteristics of the artist, and the second layer is a layer set by variable parameters based on at least one of the emotion type and the alarm text type.
  8. The method of claim 5, further comprising: acquiring a plurality of sound sources for the artist based on the artist information; determining, among the plurality of sound sources, a sound source corresponding to the emotion type; and outputting the alarm and the determined sound source simultaneously.
  9. The method of claim 8, wherein determining the sound source corresponding to the emotion type among the plurality of sound sources comprises: obtaining emotion tag information for each of the plurality of sound sources; calculating a degree of agreement between the emotion tag information and the emotion type; and determining the sound source having the greatest degree of agreement as the sound source corresponding to the emotion type (an illustrative tag-matching sketch follows the claims).
  10. The method of claim 1, wherein outputting the alarm comprises: receiving an acoustic setting value for the alarm; and correcting the generated voice data based on the acoustic setting value, wherein the acoustic setting value includes a tone, a volume level, and an output speed of the alarm.
  11. An artificial neural network training method performed by an electronic device, the method comprising: obtaining input data and ground-truth data corresponding to the input data; obtaining output data from the input data using an artificial neural network module; calculating a loss based on the output data and the ground-truth data; and training the artificial neural network module based on the loss, wherein the input data includes voice feature data of an artist corresponding to a user input, speech pattern data of the artist, and an alarm text for outputting an alarm, the ground-truth data is data in which a training voice of the artist is applied to the alarm text, and the output data is voice data in which the alarm text is converted into the voice of the artist.
  12. The method of claim 11, wherein calculating the loss based on the output data and the ground-truth data comprises: generating an acoustic feature vector for the artist by mapping the voice feature data and the speech pattern data to an embedding vector corresponding to the alarm text; obtaining the output data by decoding the acoustic feature vector; calculating a first loss between voice feature data corresponding to the output data and voice feature data corresponding to the ground-truth data; calculating a second loss between speech pattern data corresponding to the output data and speech pattern data corresponding to the ground-truth data; and calculating the loss by applying different weights to the first loss and the second loss (an illustrative weighted-loss sketch follows the claims).
  13. An electronic device comprising: a processor; and a memory connected to the processor, wherein the memory is configured to store a program, the processor is configured to execute the program, and the method of any one of claims 1 to 12 is performed when the program is executed.
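
Claim 2 forms the module's two conditioning inputs by concatenating per-aspect feature vectors: timbre, frequency, and amplitude for the voice feature data, and intonation, speech rate, and pauses for the speech pattern data. The sketch below, assuming NumPy and plain signal statistics, only illustrates that structure; the per-aspect quantities (bucketed spectrum means, dominant frequency, RMS level, energy-contour statistics) are toy proxies chosen for this sketch, not the extractors defined in the patent.

    import numpy as np

    def _bucket_means(x: np.ndarray, dim: int = 8) -> np.ndarray:
        # Toy stand-in for a learned encoder: mean of |x| over `dim` buckets.
        return np.array([chunk.mean() for chunk in np.array_split(np.abs(x), dim)])

    def extract_voice_feature_data(voice: np.ndarray, sr: int) -> np.ndarray:
        # Claim 2: combine feature vectors for timbre, frequency, and amplitude.
        spectrum = np.abs(np.fft.rfft(voice))
        timbre_vec = _bucket_means(spectrum)                             # rough spectral shape
        freqs = np.fft.rfftfreq(len(voice), d=1.0 / sr)
        frequency_vec = np.array([freqs[int(np.argmax(spectrum))]])      # dominant frequency
        amplitude_vec = np.array([float(np.sqrt(np.mean(voice ** 2)))])  # RMS level
        return np.concatenate([timbre_vec, frequency_vec, amplitude_vec])

    def extract_speech_pattern_data(voice: np.ndarray, sr: int) -> np.ndarray:
        # Claim 2: combine feature vectors for intonation, speech rate, and pauses.
        frame = max(sr // 100, 1)                                        # ~10 ms frames
        energy = np.array([float(np.mean(voice[i:i + frame] ** 2))
                           for i in range(0, len(voice) - frame + 1, frame)])
        intonation_vec = _bucket_means(energy)          # contour shape (toy proxy for intonation)
        voiced = energy > 0.1 * energy.max()
        rate_vec = np.array([float(voiced.mean())])     # fraction of voiced frames
        pause_vec = np.array([1.0 - float(voiced.mean())])  # fraction of pause frames
        return np.concatenate([intonation_vec, rate_vec, pause_vec])

    # Example with one second of synthetic audio at 16 kHz:
    # voice, sr = np.random.randn(16000), 16000
    # vf = extract_voice_feature_data(voice, sr)   # 10-dimensional in this toy setup
    # sp = extract_speech_pattern_data(voice, sr)  # 10-dimensional in this toy setup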
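
Claims 4 through 7 describe a first layer with fixed parameters tied to the artist's voice feature data and a second layer with variable parameters tied to the speech pattern data, where an emotion-type weight (claim 5) and a text-type weight (claim 6) are applied to the second mapping data. The NumPy sketch below illustrates that shape; all dimensions, weight values, and the tanh nonlinearity are assumptions made for the example, not values from the patent.

    import numpy as np

    rng = np.random.default_rng(0)
    EMB, FEAT, PAT, OUT = 32, 16, 16, 32          # hypothetical dimensions

    # First mapping data (claim 4); claim 7 treats these parameters as fixed.
    first_mapping = rng.standard_normal((EMB + FEAT, OUT))
    # Second mapping data (claim 4); claim 7 treats these parameters as variable.
    second_mapping = rng.standard_normal((OUT + PAT, OUT))

    # Claim 5: weight per emotion type (values illustrative only).
    EMOTION_WEIGHTS = {"joy": 1.2, "sadness": 0.8, "encouragement": 1.1}
    # Claim 6: weight per alarm text type, taken from the sentence-ending form.
    TEXT_TYPE_WEIGHTS = {"declarative": 1.0, "interrogative": 1.15}

    def acoustic_feature_vector(embedding, voice_features, speech_pattern,
                                emotion_type, text_type):
        # Claims 4-6: two-layer mapping, with the first and second weights applied
        # to the second mapping data.
        w1 = EMOTION_WEIGHTS[emotion_type]        # first weight (claim 5)
        w2 = TEXT_TYPE_WEIGHTS[text_type]         # second weight (claim 6)
        # First layer: operation between the embedding vector and the first mapping data.
        h1 = np.tanh(np.concatenate([embedding, voice_features]) @ first_mapping)
        # Second layer: operation between the first layer's output and the weighted
        # second mapping data.
        h2 = np.tanh(np.concatenate([h1, speech_pattern]) @ (second_mapping * w1 * w2))
        return h2                                 # acoustic feature vector

    # Example call with random stand-ins for the real embeddings:
    vec = acoustic_feature_vector(rng.standard_normal(EMB), rng.standard_normal(FEAT),
                                  rng.standard_normal(PAT), "joy", "interrogative")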
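
Claim 9 selects, among the artist's sound sources, the one whose emotion tags agree most with the identified emotion type. The sketch below uses a simple count of matching tags as the degree of agreement; that metric and the example data are assumptions, since the patent does not specify how agreement is computed.

    def pick_sound_source(sound_sources, emotion_type: str) -> str:
        # Claim 9: choose the sound source whose emotion tags best agree with the
        # identified emotion type. `sound_sources` is a list of (source_id, tags)
        # pairs; counting matching tags is an assumed agreement metric.
        def agreement(tags):
            return sum(1 for tag in tags if tag == emotion_type)
        return max(sound_sources, key=lambda item: agreement(item[1]))[0]

    # Example (tags are illustrative): "track_a" wins because two of its tags match.
    sources = [("track_a", ["joy", "joy", "encouragement"]),
               ("track_b", ["sadness"])]
    assert pick_sound_source(sources, "joy") == "track_a"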
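
Claim 12 computes a first loss on voice feature data and a second loss on speech pattern data, then combines them with different weights. The sketch below shows that combination; the weight values (alpha, beta) and the mean-squared-error form are assumptions for illustration, and feature_fn / pattern_fn stand in for extractors such as those in the feature-extraction sketch above.

    import numpy as np

    def weighted_loss(output_voice: np.ndarray, target_voice: np.ndarray, sr: int,
                      feature_fn, pattern_fn,
                      alpha: float = 0.7, beta: float = 0.3) -> float:
        # Claim 12: a first loss on voice feature data and a second loss on speech
        # pattern data, combined with different weights (alpha, beta are illustrative).
        first_loss = float(np.mean((feature_fn(output_voice, sr)
                                    - feature_fn(target_voice, sr)) ** 2))
        second_loss = float(np.mean((pattern_fn(output_voice, sr)
                                     - pattern_fn(target_voice, sr)) ** 2))
        return alpha * first_loss + beta * second_loss

    # Usage (reusing the extractors from the feature-extraction sketch above):
    # loss = weighted_loss(generated_audio, ground_truth_audio, 16000,
    #                      extract_voice_feature_data, extract_speech_pattern_data)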

Description

AI-based voice generation method, artificial neural network training method, and electronic device for performing the same

The present disclosure relates to an artificial intelligence (AI)-based voice generation method, an artificial neural network training method, and an electronic device for performing the same.

Recently, speech synthesis technologies capable of rendering input text like real human speech have been developed. Conventionally, speech data was output through rule-based speech synthesis; however, because this relied on linguistic rules and pronunciation dictionaries, it was difficult to achieve natural intonation or emotional expression comparable to real human speech, and the output had a mechanical tone. Since then, with the development of deep-learning-based neural speech synthesis, it has become possible to produce more natural pronunciation and intonation for input text. However, there are still limitations in generating personalized voices that reflect the unique timbre and speech style of specific artists or celebrities preferred by the user. In particular, in situations that influence a user's emotional state or the start of the day, such as alarm functions, an emotionally expressive voice can provide a more effective user experience than a mechanically synthesized voice.

FIG. 1 is a schematic block diagram of a computing system according to one or more embodiments.
FIG. 2 is a flowchart illustrating an AI-based voice generation method according to one or more embodiments.
FIG. 3 is a drawing for explaining an artificial neural network module according to one or more embodiments.
FIG. 4 is a flowchart illustrating the process of acquiring voice feature data and speech pattern data according to one or more embodiments.
FIG. 5 is a flowchart illustrating the process of generating an acoustic feature vector according to one or more embodiments.
FIG. 6 is a drawing for explaining an artificial neural network module including a plurality of layers according to one or more embodiments.
FIG. 7 is a flowchart illustrating the process of generating an acoustic feature vector based on an emotion type according to one or more embodiments.
FIG. 8 is a flowchart illustrating the process of generating an acoustic feature vector based on an alarm text type according to one or more embodiments.
FIG. 9 is a flowchart illustrating the process of simultaneously outputting an alarm and an artist sound source according to one or more embodiments.
FIG. 10 is a flowchart illustrating the training process of an artificial neural network module according to one or more embodiments.
FIG. 11 is a flowchart illustrating the loss calculation process of an artificial neural network module according to one or more embodiments.
FIG. 12 is a block diagram illustrating the configuration of a playback device according to one or more embodiments.

The various embodiments described in this specification are illustrative, intended to clearly explain the technical concept of this disclosure, and are not intended to limit it to specific embodiments. The technical concept of this disclosure includes various modifications, equivalents, alternatives, and embodiments formed by selectively combining all or part of each embodiment described in this specification. Furthermore, the scope of the technical concept of this disclosure is not limited to the various embodiments presented below or to the specific descriptions thereof.
Terms used in this specification, including technical or scientific terms, may have the meanings generally understood by those skilled in the art to which this disclosure pertains, unless otherwise defined. Expressions used herein such as "comprise," "may comprise," "possess," "may possess," "have," and "may have" imply the existence of the stated feature (e.g., a function, operation, or component) and do not exclude the existence of other additional features. That is, such expressions should be understood as open-ended terms that leave open the possibility of including other embodiments.

In this specification, singular expressions include plural expressions unless the context clearly indicates otherwise, and plural expressions include singular expressions unless the context clearly indicates otherwise. Throughout the specification, when a part is described as including a certain component, this means that, unless specifically stated otherwise, it does not exclude other components but may further include them.

Additionally, the terms 'module' or 'part' as used in this specification refer to software or hardware components, and a 'module' or 'part' performs certain roles. However, the meaning of 'module' or 'part' is not limited to software or hardware. A 'module' or 'part' may be configured to reside in an addressable storage medium or configured to run on one or more processors. Thus, as an example, a 'module' or 'part' may include components such as software components, ob