
KR-20260063927-A - METHOD FOR TRAINING MODEL FOR SPEECH TOKENIZATION AND METHOD FOR OBTAINING SPEECH TOKEN FROM SUCH TRAINED MODEL

KR 20260063927 A

Abstract

A method for training a speech tokenization model according to one embodiment is performed by a computer device and comprises: acquiring a training speech composed of a plurality of frames; classifying the plurality of frames into a same-phoneme frame group, in which adjacent frames begin with the same phoneme, and a different-phoneme frame group, in which adjacent frames begin with different phonemes; training a same-phoneme tokenization model so that each frame belonging to the same-phoneme frame group is tokenized using a base token for the shared phoneme and a residual token reflecting the acoustic residual that remains after the base token is removed; and training a different-phoneme tokenization model so that each frame belonging to the different-phoneme frame group is tokenized with a predetermined weight applied to the phoneme difference between adjacent frames.
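The frame-classification step in the abstract can be sketched with simple run detection over the starting phonemes of consecutive frames. This is a minimal illustration, not the patented implementation: the function name, the per-frame phoneme labels, and the representation of frames are all assumptions for the sake of the example.

```python
# Hypothetical sketch of the frame-grouping step: consecutive frames are
# grouped by whether each frame's starting phoneme matches that of the
# previous frame. Same-phoneme runs go to one group; adjacent frames with
# differing starting phonemes form boundary pairs for the other group.

def group_frames(frames, start_phonemes):
    """Split frames into same-phoneme runs and different-phoneme boundary pairs."""
    same_groups = []       # runs of frame indices sharing a starting phoneme
    boundary_pairs = []    # (i-1, i) pairs whose starting phonemes differ
    run = [0]
    for i in range(1, len(frames)):
        if start_phonemes[i] == start_phonemes[i - 1]:
            run.append(i)
        else:
            if len(run) > 1:
                same_groups.append(run)
            boundary_pairs.append((i - 1, i))
            run = [i]
    if len(run) > 1:
        same_groups.append(run)
    return same_groups, boundary_pairs
```

For example, six frames with starting phonemes `["a","a","a","b","b","c"]` yield the same-phoneme runs `[0, 1, 2]` and `[3, 4]` and the boundary pairs `(2, 3)` and `(4, 5)`.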

Inventors

  • 정원진

Assignees

  • 에스케이텔레콤 주식회사 (SK Telecom Co., Ltd.)

Dates

Publication Date
2026-05-07
Application Date
2024-10-31

Claims (10)

  1. A method for training a speech tokenization model, the method being performed by a computer device and comprising: acquiring a training speech composed of a plurality of frames; classifying the plurality of frames into a same-phoneme frame group, in which adjacent frames begin with the same phoneme, and a different-phoneme frame group, in which adjacent frames begin with different phonemes; training a predetermined same-phoneme tokenization model so that each frame belonging to the same-phoneme frame group is tokenized using a base token for the shared phoneme and a residual token reflecting the acoustic residual that remains after the base token is removed; and training a predetermined different-phoneme tokenization model so that each frame belonging to the different-phoneme frame group is tokenized with a predetermined weight applied to the phoneme difference between adjacent frames.
  2. The method of claim 1, wherein the base token is a result of applying vector quantization to the temporally first frame within the same-phoneme frame group.
  3. The method of claim 2, further comprising obtaining a similarity loss that each of the remaining frames, excluding the first frame, has with respect to the first frame within the same-phoneme frame group, wherein the vector quantization is applied after the similarity loss is reflected in the first frame.
  4. The method of claim 1, wherein the weight is determined based on a contrastive loss that the frames belonging to the different-phoneme frame group have with respect to one another.
  5. The method of claim 1, wherein the training of each of the same-phoneme tokenization model and the different-phoneme tokenization model uses a contrastive loss that draws the tokens of adjacent frames closer together when the starting phonemes of the adjacent frames are acoustically similar and pushes them farther apart when the starting phonemes are acoustically different.
  6. The method of claim 1, wherein the residual token is a result of applying vector quantization to the acoustic residual.
  7. The method of claim 1, wherein, for a language model for understanding a given speech, one base token is provided as input instead of N tokens for N frames (where N is a natural number greater than or equal to 2) within the same-phoneme frame group, and, for a given generative language model, a combination of the base token and the residual token is provided as input instead of each of the N tokens.
  8. A computer-readable recording medium storing a computer program, the computer program comprising instructions for causing a processor to perform the method according to any one of claims 1 to 7.
  9. A method for obtaining tokens for speech, the method being performed by a computer device and comprising: acquiring a speech composed of a plurality of frames; classifying the plurality of frames into a same-phoneme frame group, in which adjacent frames begin with the same phoneme, and a different-phoneme frame group, in which adjacent frames begin with different phonemes; obtaining, for each frame belonging to the same-phoneme frame group, a tokenization result that uses a base token for the shared phoneme and a residual token reflecting the acoustic residual that remains after the base token is removed; and obtaining, for each frame belonging to the different-phoneme frame group, a tokenization result in which a predetermined weight is applied to the phoneme difference between adjacent frames.
  10. A computer-readable recording medium storing a computer program, the computer program comprising instructions for causing a processor to perform the method according to claim 9.
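Claims 2 and 6 describe the base token as the vector-quantization result of a group's first frame and the residual token as the vector-quantization result of the remaining acoustic residual. The sketch below illustrates that structure with a toy nearest-neighbor quantizer; the codebooks, feature dimensions, and function names are illustrative assumptions, not the patented models.

```python
import numpy as np

# Toy sketch of the claim-2 / claim-6 structure: quantize the first frame of
# a same-phoneme run to get the base token, then quantize each frame's
# residual (frame minus the base code vector) to get per-frame residual tokens.

def vq(vector, codebook):
    """Nearest-neighbor vector quantization: return (index, code vector)."""
    dists = np.linalg.norm(codebook - vector, axis=1)
    idx = int(np.argmin(dists))
    return idx, codebook[idx]

def tokenize_same_phoneme_group(frames, base_codebook, residual_codebook):
    """Tokenize a run of same-phoneme frames as one base token plus one
    residual token per frame."""
    base_idx, base_code = vq(frames[0], base_codebook)
    residual_tokens = []
    for frame in frames:
        res_idx, _ = vq(frame - base_code, residual_codebook)
        residual_tokens.append(res_idx)
    return base_idx, residual_tokens
```

With two-dimensional toy features, a group whose first frame sits on base code 1 produces a single base token for the whole run plus one small residual token per frame, matching the base-plus-residual decomposition of the claims.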

Description

Method for training a speech tokenization model and method for obtaining speech tokens from such a trained model {METHOD FOR TRAINING MODEL FOR SPEECH TOKENIZATION AND METHOD FOR OBTAINING SPEECH TOKEN FROM SUCH TRAINED MODEL}

The present invention relates to a method for training a speech tokenization model and a method for obtaining tokens for speech from the trained model.

With the successful development of large language models (LLMs), multimodal LLMs (MLLMs) are also frequently discussed. Because MLLMs are 'multimodal,' they can receive and process various forms of data in addition to text, such as speech. In neural speech codecs used with such MLLMs, speech tokens are obtained by applying vector quantization to speech, and these speech tokens can be provided as input to the MLLM along with text tokens.

Discussions are underway regarding how much data to provide to an MLLM and in what form. In particular, the discussion proceeds differently for understanding models, which analyze or interpret a given input, and for generative models, which have recently gained popularity. For example, understanding models require computation over the relationships or interactions between the given tokens, and as the number of tokens increases, the amount of computation required for these interactions grows steeply. Therefore, to obtain higher-quality results from an understanding model, it is desirable to minimize the number of input tokens. In contrast, for generative models, as the number of tokens increases, the amount of reference information available for generation also increases, raising the likelihood of producing content that aligns with the intent; conversely, fewer tokens lower that likelihood.
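The token-budget tradeoff described above can be made concrete with simple counting. Under one reading of this scheme (an assumption for illustration: the understanding model receives one base token per same-phoneme run, while the generative model additionally receives one residual token per frame), the input sizes diverge as follows; the group sizes are hypothetical.

```python
# Hypothetical counting sketch: same-phoneme runs collapse to one base token
# for an understanding model, while a generative model receives the base
# tokens plus one residual token per frame.

def token_counts(group_sizes):
    """Return (total frames, understanding-model tokens, generative-model
    tokens) for a speech split into same-phoneme groups of the given sizes."""
    total_frames = sum(group_sizes)
    understanding = len(group_sizes)              # one base token per group
    generative = len(group_sizes) + total_frames  # base tokens + per-frame residuals
    return total_frames, understanding, generative
```

For instance, three same-phoneme runs of 3, 2, and 4 frames span 9 frames but yield only 3 understanding-model tokens, while the generative model receives 12 tokens, reflecting its need for richer acoustic reference information.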
In particular, when only a portion of the speech tokens is used instead of the entire set, acoustic information such as nuance or volume may be omitted, making it difficult to obtain the intended content. At present, demand for generative models is growing explosively, rivaling that for understanding models, and the two kinds of model are frequently used in combination. Consequently, there is a need for careful consideration of how to effectively provide speech as input to each type of model.

FIG. 1 illustrates, according to one embodiment, a concept in which, when tokens for speech are generated by a speech tokenizer, an understanding token among them is provided to an understanding language model and a generative token is provided to a generative language model. FIG. 2 is an exemplary configuration diagram of a computer device according to one embodiment. FIG. 3 is an exemplary flowchart of a method for training a speech tokenization model according to one embodiment. FIG. 4 illustrates an example concept for training a same-phoneme tokenization model according to one embodiment. FIG. 5 illustrates an example concept for training a different-phoneme tokenization model according to one embodiment. FIG. 6 is an exemplary flowchart of a method for obtaining tokens for speech according to one embodiment.

The advantages and features of the present invention, and the methods for achieving them, will become clear by referring to the embodiments described below in conjunction with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms.
These embodiments are provided merely so that the disclosure of the present invention is complete and fully informs those skilled in the art of the scope of the invention; the present invention is defined only by the scope of the claims.

The terms used in this specification will be briefly explained, and the invention will then be described in detail. The terms used in this invention have been selected from currently widely used general terms in consideration of their functions within the invention; however, these terms may vary depending on the intent of those skilled in the art, precedents, the emergence of new technologies, and so on. In specific cases, terms have been arbitrarily selected by the applicant, and in such cases their meanings are described in detail in the relevant part of the description. Therefore, the terms used in this invention should be defined not merely by their names but based on their meanings and the overall content of the invention.

When a part of the specification is described as 'comprising' a certain component, this means that, unless specifically stated otherwise, other components are not excluded and may additionally be included. The term 'part' as used in the specification refers to a software or hardware component, such as an FPGA or an ASIC, and