KR-20260063658-A - DEVICE AND METHOD FOR STANDARD LANGUAGE CONVERSION OF KOREAN DIALECTS USING A SUPER LARGE LANGUAGE MODEL

Abstract

The present invention relates to a device for converting Korean dialects into standard language using a super-large language model. The device comprises: a data preprocessing unit that collects dialect speech data, converts it into text consisting of at least one sentence, and identifies dialect words; a sentence difficulty classification unit that classifies sentence difficulty based on the number of dialect words in each sentence and constructs a dialect dataset for each sentence difficulty; an LLM (Large Language Model) curriculum learning unit that constructs the dialect dataset according to sentence difficulty and builds a QLoRA-based dialect learning model, which performs dialect learning on the corresponding dialect dataset while gradually increasing the sentence difficulty; and a language translation unit that converts a given sentence into standard language through the dialect learning model.
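
The difficulty classification and curriculum ordering described in the abstract can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the patented implementation: the lexicon `DIALECT_WORDS` and its placeholder tokens, whitespace tokenization, the `thresholds` values, and all function names are hypothetical and introduced here only for illustration.

```python
from collections import defaultdict

# Hypothetical lexicon. "dA".."dD" are placeholder tokens standing in for
# dialect words; a real system would obtain them from the preprocessing step.
DIALECT_WORDS = {"dA", "dB", "dC", "dD"}


def count_dialect_words(sentence, lexicon=DIALECT_WORDS):
    """Count whitespace-separated tokens that appear in the dialect lexicon."""
    return sum(1 for token in sentence.split() if token in lexicon)


def difficulty_level(n_dialect_words, thresholds=(1, 3)):
    """Map a dialect-word count to a level: 0 = easy, 1 = medium, 2 = hard.

    With the default (assumed) thresholds, counts 0-1 are easy,
    2-3 are medium, and 4 or more are hard.
    """
    return sum(n_dialect_words > t for t in thresholds)


def build_curriculum(sentences):
    """Group sentences by difficulty and return the groups from easy to hard."""
    buckets = defaultdict(list)
    for sentence in sentences:
        level = difficulty_level(count_dialect_words(sentence))
        buckets[level].append(sentence)
    return [buckets[level] for level in sorted(buckets)]


if __name__ == "__main__":
    # "s1", "s2" are placeholder tokens standing in for standard-language words.
    corpus = ["s1 dA s2", "dA dB s1 dC dD", "dA s1 dB"]
    for stage, batch in enumerate(build_curriculum(corpus)):
        print(stage, batch)
```

A training loop would walk through the returned stages in order, so that the model sees sentences with few dialect words before sentences with many.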

Inventors

  • 한요섭
  • 임수한
  • 한중혁

Assignees

  • 연세대학교 산학협력단

Dates

Publication Date
20260507
Application Date
20241030

Claims (9)

  1. A device for converting Korean dialects into standard language using a super-large language model, comprising: a data preprocessing unit that collects dialect speech data, converts it into text consisting of at least one sentence, and identifies dialect words; a sentence difficulty classification unit that classifies sentence difficulty based on the number of dialect words in each of the at least one sentence and constructs a dialect dataset for each sentence difficulty; an LLM (Large Language Model) curriculum learning unit that constructs the dialect dataset according to the sentence difficulty and builds a QLoRA-based dialect learning model, which performs dialect learning on the corresponding dialect dataset while gradually increasing the sentence difficulty; and a language translation unit that converts a given sentence into standard language through the dialect learning model.
  2. The device of claim 1, wherein the data preprocessing unit classifies the dialect words by performing non-dialect filtering on the text.
  3. The device of claim 1, wherein the sentence difficulty classification unit first classifies the sentence difficulty by determining the sentence complexity of each of the at least one sentence, then classifies the sentence difficulty a second time based on the number of dialect words included in each of the at least one sentence, and thereby finally constructs the dialect dataset for each sentence difficulty.
  4. The device of claim 1, wherein the LLM curriculum learning unit, before QLoRA is applied, adds speaker attributes consisting of gender and age group to the dialect dataset for each sentence difficulty, so that differences in dialect usage among various speakers can be learned.
  5. The device of claim 4, wherein the LLM curriculum learning unit trains the dialect learning model by inputting the dialect datasets for each sentence difficulty into the dialect learning model in sequential stages.
  6. The device of claim 5, wherein the LLM curriculum learning unit dynamically applies quantization to the parameters of the dialect learning model according to the sentence difficulty, converting the parameters to low precision.
  7. The device of claim 6, wherein the LLM curriculum learning unit selectively fine-tunes, through low-rank adaptation, the parameters that play an important role in the dialect characteristics of the dialect learning model.
  8. The device of claim 1, wherein the language translation unit converts the given sentence into the standard language through lexical conversion, sentence structure conversion, and reflection of speaker attributes.
  9. A method for converting Korean dialects into standard language using a super-large language model, performed in a device for converting Korean dialects into standard language using a super-large language model, the method comprising: a data preprocessing step of collecting dialect speech data, converting it into text consisting of at least one sentence, and identifying dialect words; a sentence difficulty classification step of classifying sentence difficulty based on the number of dialect words in each of the at least one sentence and constructing a dialect dataset for each sentence difficulty; an LLM (Large Language Model) curriculum learning step of constructing the dialect dataset according to the sentence difficulty and building a QLoRA-based dialect learning model, which performs dialect learning on the corresponding dialect dataset while gradually increasing the sentence difficulty; and a language translation step of converting a given sentence into standard language through the dialect learning model.

Description

Device and Method for Standard Language Conversion of Korean Dialects Using a Super Large Language Model

The present invention relates to a technology for converting Korean dialects into standard language using a super-large language model. More specifically, it relates to an apparatus and method that convert a given sentence into standard language through a QLoRA-based dialect learning model, which performs dialect learning on the corresponding dialect dataset while gradually increasing the sentence difficulty.

Technology for converting dialects into standard language uses language models to automatically convert various dialects, which carry regional and cultural characteristics, into standard language. It primarily employs the following methods.

Parallel data pairing dialects with standard language is gathered through data collection and preprocessing, so that the model can learn various dialects and their corresponding standard language forms. A richer dataset increases the accuracy of the conversion.

Language models trained on dialect-standard parallel data can take the context between sentences into account when performing the conversion. Transformer-based models such as BERT and GPT are recent examples that support this.

When real-time conversion is required, lightweight techniques such as quantization or knowledge distillation can be applied to speed up the model and reduce its memory usage, so that dialects can be converted rapidly even on smartphones and IoT devices.

Lexical and grammatical differences of dialects can be processed as separate modules to enhance the accuracy and flexibility of the conversion. After the characteristics of a specific dialect region have been learned, the conversion can then be performed while maintaining consistency with the standard language.
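
The combination of quantization and QLoRA-style low-rank adaptation mentioned above can be illustrated with a small pure-Python sketch. It is a simplified illustration under stated assumptions, not the method of the present invention: the frozen base weights are stored in low precision using plain absmax rounding (a stand-in chosen here for simplicity, whereas QLoRA itself uses a 4-bit NormalFloat data type), and only the small matrices `A` and `B` would be trained. All function names and the example values are hypothetical.

```python
def quantize_absmax(weights, bits=4):
    """Quantize a row of floats to signed integers sharing one scale factor."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit signed values
    scale = max(abs(w) for w in weights) / qmax or 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_row, scale):
    """Recover approximate float weights from the quantized integers."""
    return [q * scale for q in q_row]


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(w_q, scales, A, B, x, alpha=1.0):
    """Compute y = dequant(W_q) x + (alpha / r) * B (A x).

    W_q stays frozen in low precision; A (r x d_in) and B (d_out x r)
    are the only parameters a fine-tuning step would update.
    """
    r = len(A)
    base = matvec([dequantize(row, s) for row, s in zip(w_q, scales)], x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


if __name__ == "__main__":
    W = [[3.5, -1.0], [0.0, 7.0]]                   # toy base weights
    quantized = [quantize_absmax(row) for row in W]
    w_q = [q for q, _ in quantized]
    scales = [s for _, s in quantized]
    A = [[1.0, 1.0]]                                # rank r = 1
    B = [[0.0], [0.0]]                              # zero init: no change at start
    print(lora_forward(w_q, scales, A, B, [1.0, 2.0]))
```

Because `B` starts at zero, the adapted model initially reproduces the quantized base model, and fine-tuning only has to learn the low-rank correction.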
Applying dialect-to-standard-language conversion to real-time interpretation, educational platforms, and chatbots can improve the understanding of dialects and facilitate communication across regions. The technology is also attracting attention as a useful tool for precisely recognizing various linguistic differences and converting multiple dialects into standard language.

Korean Published Patent No. 10-2021-0162529 (November 23, 2021) discloses, in one embodiment, a dialect language model and a summary language model: when a text sequence containing a dialect is input, a dialect language model is selected to convert the input dialect text sequence into a standard language text sequence, and the summary language model summarizes the converted standard language text sequence to generate a summary text sequence.

FIG. 1 is a diagram illustrating a device for converting Korean dialects into standard language using a super-large language model according to one embodiment of the present invention.
FIG. 2 is a diagram illustrating the functional configuration of the device of FIG. 1 for converting Korean dialects into standard language using a super-large language model.
FIG. 3 is a diagram illustrating the system configuration of the device of FIG. 1 for converting Korean dialects into standard language using a super-large language model.
FIG. 4 is a flowchart illustrating a method for converting Korean dialects into standard language using a super-large language model according to the present invention.

The description of the present invention is merely an embodiment for structural or functional explanation, and the scope of the present invention should therefore not be interpreted as being limited by the embodiments described in the text. That is, since the embodiments can be modified in various ways and may take various forms, the scope of the present invention should be understood to include equivalents capable of realizing the technical concept. Furthermore, the objectives or effects presented in the present invention do not mean that a specific embodiment must include all of them or only those effects, so the scope of the present invention should not be understood as being limited by them.

Meanwhile, the meaning of the terms described in this application should be understood as follows.

Terms such as "first" and "second" are intended to distinguish one component from another, and the scope of rights shall not be limited by these terms. For example, a first component may be named a second component, and similarly, a second component may be named a first component.

When it is stated that one component is "connected" to another component, it should be understood that it may be directly connected to that other component, or that there may be other components in between. Conversely, when it is stated that one component is "directly connected" to another component, it should be understood that there are no other components in between. Meanwhile, other expressions describing the relationships between components, such as "betwee