KR-20260067213-A - Method for fine-tuning an artificial intelligence model for multilingual speech recognition and a system for executing the same

Abstract

The present invention discloses a method for fine-tuning an artificial intelligence model for multilingual speech recognition and a system for executing the same. According to one disclosed embodiment, the method is performed by a system comprising at least one processor and comprises: a model acquisition process of acquiring first weights of a pre-trained artificial intelligence model and second weights of an artificial intelligence model fine-tuned for a target language; a first vector calculation process of generating at least one task vector by capturing the element-wise difference between the first weights and the second weights as a vector; a second vector calculation process of generating a target task vector through arithmetic operations among the at least one task vector; and a fine-tuning process of obtaining an artificial intelligence model fine-tuned for the target language by multiplying the target task vector by a scaling factor and adding the result to the pre-trained artificial intelligence model.
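
As a rough illustration of the procedure summarized above, the following is a minimal sketch, not the patented implementation: weights are modeled as PyTorch-style state dictionaries, and every identifier (`task_vector`, `apply_task_vector`, `scaling_factor`, the toy shapes) is an illustrative assumption.

```python
# Minimal sketch of the abstract's procedure; weights are modeled as
# PyTorch-style state dicts, and all identifiers are illustrative.
import torch

def task_vector(pretrained: dict, finetuned: dict) -> dict:
    """First/second weights -> task vector: tau = theta_ft - theta_pre (element-wise)."""
    return {k: finetuned[k] - pretrained[k] for k in pretrained}

def apply_task_vector(pretrained: dict, tau: dict, scaling_factor: float) -> dict:
    """Fine-tuned model: theta_new = theta_pre + lambda * tau."""
    return {k: pretrained[k] + scaling_factor * tau[k] for k in pretrained}

# Toy weights standing in for a multilingual speech recognition model:
theta_pre = {"w": torch.zeros(2, 2)}
theta_ft = {"w": torch.ones(2, 2)}   # hypothetical target-language fine-tune
tau = task_vector(theta_pre, theta_ft)
theta_new = apply_task_vector(theta_pre, tau, scaling_factor=0.5)
```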

Inventors

  • 장준혁
  • 이재홍
  • 이문학
  • 강지훈

Assignees

  • 한양대학교 산학협력단 (Industry-University Cooperation Foundation of Hanyang University)

Dates

Publication Date
2026-05-12
Application Date
2024-11-05

Claims (11)

  1. A method for fine-tuning an artificial intelligence model for multilingual speech recognition, performed by a system comprising at least one processor, the method comprising: a model acquisition process of acquiring first weights of a pre-trained artificial intelligence model and second weights of an artificial intelligence model fine-tuned for a target language; a first vector calculation process of calculating at least one task vector by capturing the element-wise difference between the first weights and the second weights as a vector; a second vector calculation process of calculating a target task vector through arithmetic operations among the at least one task vector; and a fine-tuning process of obtaining an artificial intelligence model fine-tuned for the target language by multiplying the target task vector by a scaling factor and adding the result to the pre-trained artificial intelligence model.
  2. The method of claim 1, wherein the model acquisition process further includes an optional learning step of freezing the weights of the pre-trained artificial intelligence model and obtaining the second weights of the fine-tuned artificial intelligence model by inserting a rank decomposition matrix into each layer of a Transformer.
  3. The method of claim 2, wherein, in the optional learning step, a Low-Rank Adaptation (LoRA) adapter is formed by fine-tuning a first low-rank matrix and a second low-rank matrix that approximate the weight matrix of the pre-trained artificial intelligence model, and the weights of the LoRA adapter are provided as the second weights.
  4. The method of claim 1, wherein, in the fine-tuning process, the task vector for the target language is set as a negative task vector (−τ), and the negative task vector is added to the pre-trained artificial intelligence model to degrade the recognition performance for the target language.
  5. The method of claim 1, wherein the second vector calculation process includes a task analogy inference step of producing the task vector (τ_D) for the target language through arithmetic operations among the task vectors (τ_A, τ_B, τ_C) for the remaining languages excluding the target language, using an analogy relationship among a plurality of languages (A, B, C, D) in which the similarity between A and B corresponds to the similarity between C and D, as expressed by the following mathematical formula: τ_D = τ_C + (τ_B − τ_A).
  6. The method of claim 1, wherein, in the second vector calculation process, the target task vector is acquired as the vector sum of one or more task vectors.
  7. A system for fine-tuning an artificial intelligence model for multilingual speech recognition, comprising: a memory for storing a plurality of artificial intelligence models, including a multilingual speech recognition model; and a processor that trains the artificial intelligence models and performs a speech recognition function on input data through a trained artificial intelligence model, wherein the processor acquires first weights of a pre-trained artificial intelligence model and second weights of an artificial intelligence model fine-tuned for a target language, calculates at least one task vector by capturing the element-wise difference between the first weights and the second weights as a vector, calculates a target task vector through arithmetic operations among the at least one task vector, and obtains an artificial intelligence model fine-tuned for the target language by multiplying the target task vector by a scaling factor and adding the result to the pre-trained artificial intelligence model.
  8. The system of claim 7, wherein the processor forms a Low-Rank Adaptation (LoRA) adapter by fine-tuning a first low-rank matrix and a second low-rank matrix that approximate the weight matrix of the pre-trained artificial intelligence model, and provides the weights of the LoRA adapter as the second weights.
  9. The system of claim 7, wherein the processor sets the task vector for the target language as a negative task vector (−τ) and adds the negative task vector to the pre-trained artificial intelligence model to degrade the recognition performance for the target language.
  10. The system of claim 7, wherein the processor produces the task vector (τ_D) for the target language through arithmetic operations among the task vectors (τ_A, τ_B, τ_C) for the remaining languages excluding the target language, using an analogy relationship among a plurality of languages (A, B, C, D) in which the similarity between A and B corresponds to the similarity between C and D, as expressed by the following mathematical formula: τ_D = τ_C + (τ_B − τ_A).
  11. A computer program stored on a computer-readable storage medium, wherein the computer program, when executed on one or more processors, performs operations for fine-tuning an artificial intelligence model for multilingual speech recognition, the operations comprising: an operation of acquiring first weights of a pre-trained artificial intelligence model and second weights of an artificial intelligence model fine-tuned for a target language; an operation of calculating at least one task vector by capturing the element-wise difference between the first weights and the second weights as a vector; an operation of calculating a target task vector through arithmetic operations among the at least one task vector; and an operation of obtaining an artificial intelligence model fine-tuned for the target language by multiplying the target task vector by a scaling factor and adding the result to the pre-trained artificial intelligence model.
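
Claims 2, 3, and 8 obtain the second weights through Low-Rank Adaptation rather than full fine-tuning. The following is a minimal, illustrative sketch of that mechanism in PyTorch; the rank r, the scaling alpha, the layer sizes, and all identifiers are assumptions for illustration, not taken from the patent.

```python
# Sketch of one LoRA layer (claims 2, 3, and 8): the pre-trained weight W
# is frozen, and a trainable rank-r decomposition B @ A is added to it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze the pre-trained weights
            p.requires_grad_(False)
        # First and second low-rank matrices of claim 3.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the low-rank update path.
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merged_weight(self) -> torch.Tensor:
        """The 'second weights': frozen base weight plus the learned update."""
        return self.base.weight + self.scale * (self.B @ self.A)

# Toy usage standing in for one Transformer sublayer:
layer = LoRALinear(nn.Linear(16, 16), r=4)
_ = layer(torch.randn(2, 16))            # only A and B receive gradients
second_weights = layer.merged_weight()   # usable as the second weights
```

Note that the element-wise difference between `merged_weight()` and the frozen base weight is exactly the learned low-rank update, so the task vector of claim 1 can be read directly off the adapter.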
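
Claims 4 to 6, 9, and 10 manipulate task vectors purely by arithmetic: negating a vector to degrade (forget) a language, summing vectors to merge languages, and inferring a vector for a target language from an analogy among related languages. A minimal sketch under the same illustrative conventions follows (task vectors as dictionaries of tensors; all names are assumptions):

```python
# Arithmetic on task vectors, modeled as dictionaries mapping parameter
# names to tensors; all identifiers here are illustrative assumptions.
import torch

def negate(tau: dict) -> dict:
    """Claims 4/9: -tau, added to the pre-trained weights, degrades
    recognition of the corresponding language."""
    return {k: -v for k, v in tau.items()}

def vector_sum(*taus: dict) -> dict:
    """Claim 6: target task vector as the vector sum of several
    language-specific task vectors."""
    return {k: sum(t[k] for t in taus) for k in taus[0]}

def analogy(tau_a: dict, tau_b: dict, tau_c: dict) -> dict:
    """Claims 5/10: if language A is to B as C is to the target D,
    infer tau_D = tau_C + (tau_B - tau_A)."""
    return {k: tau_c[k] + (tau_b[k] - tau_a[k]) for k in tau_a}

# Toy usage: infer a task vector for an unseen language D from A, B, C.
tau_a = {"w": torch.ones(2)}
tau_b = {"w": 2 * torch.ones(2)}
tau_c = {"w": 3 * torch.ones(2)}
tau_d = analogy(tau_a, tau_b, tau_c)   # {"w": tensor([4., 4.])}
```

Because every operation is an element-wise combination of weight dictionaries, these compositions require no additional training; only the single scaling factor of claim 1 remains to be chosen, for example on held-out data.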

Description

Method for fine-tuning an artificial intelligence model for multilingual speech recognition and a system for executing the same

The disclosed invention relates to a method for fine-tuning an artificial intelligence model for multilingual speech recognition that can efficiently fine-tune a multilingual speech recognition model using a LoRA adapter and language-specific task vectors, and a system for executing the same.

In recent natural language processing research, Large Language Models (LLMs) have attracted significant attention for demonstrating performance far superior to that of existing small-scale models in various fields, including translation and summarization. These large-scale models can be applied to diverse areas such as information retrieval, question answering, automatic document classification, newspaper article clustering, and AI speakers. To maximize the performance of a large language model, it can be pre-trained on a vast amount of pre-collected training data and subsequently transferred to the field of application; for example, the model can be fine-tuned using training data collected locally within that field. However, fine-tuning performance deteriorates when there is a significant difference between the training data of the relevant field and the training data used for pre-training.

Artificial intelligence models based on pre-trained language models, such as ChatGPT and Whisper, leverage massive datasets to achieve groundbreaking performance in various downstream tasks. By being fine-tuned for specific tasks, these models achieve remarkable performance in application fields ranging from natural language processing to speech recognition. However, existing fine-tuning methods require retraining the entire model for each new task or training the model with additional task-specific data, which is inefficient because the computational cost of tuning grows with the size of the model. Furthermore, when learning or fine-tuning for multiple tasks is performed sequentially, existing methods suffer from catastrophic forgetting, in which previously learned information is rapidly lost.

FIG. 1 is a control block diagram of a system executing the disclosed method for fine-tuning an artificial intelligence model for multilingual speech recognition.
FIG. 2 is a flowchart illustrating the disclosed method for fine-tuning an artificial intelligence model for multilingual speech recognition.
FIG. 3 is a diagram illustrating a LoRA adapter of the disclosed artificial intelligence model for multilingual speech recognition.
FIG. 4 is a flowchart illustrating in detail the second vector calculation process according to the disclosed embodiment.
FIG. 5 is a flowchart illustrating in detail the fine-tuning process according to the disclosed embodiment.
FIG. 6 is a diagram illustrating the results of a performance comparison of a multilingual speech recognition model fine-tuned using a task vector according to a disclosed embodiment.
FIG. 7 is a diagram illustrating the results of a performance comparison of a multilingual speech recognition model generated by adding a plurality of task vectors according to a disclosed embodiment.
FIG. 8 is a diagram illustrating the results of a performance comparison of a multilingual speech recognition model generated by adding a negative task vector according to a disclosed embodiment.
FIG. 9 is a diagram illustrating the results of a performance comparison of a multilingual speech recognition model generated by adding a task vector for a specific language inferred through an analogy relationship using arithmetic operations according to a disclosed embodiment.

Throughout the specification, the same reference numerals refer to the same components. This specification does not describe every element of the embodiments, and content that is general in the art to which the invention pertains or that overlaps between embodiments is omitted. Throughout the specification, when a part is described as being "connected" to another part, this includes not only cases where they are directly connected but also cases where they are indirectly connected, and indirect connections include connections made via a wireless communication network. Furthermore, when a part is said to "include" a certain component, this means that, unless specifically stated otherwise, other components are not excluded and may additionally be included. Singular expressions include plural expressions unless the context clearly indicates otherwise. In addition, terms such as "~part," "~unit," "~block," and "~module" may refer to a unit that processes at least one function or operation. For example, the above terms may refer to at least one piece of hardware such as an FPGA (