KR-20260064425-A - METHOD FOR KNOWLEDGE DISTILLATION-BASED LEARNING AND COMPUTING DEVICE FOR PERFORMING THE SAME
Abstract
A learning method according to one disclosed embodiment is performed in a computing device comprising a processor and a memory storing one or more programs executed by the processor. The method comprises the steps of: obtaining a refined teacher model having the same neural network structure as an original teacher model; smoothing each of the original teacher model and the refined teacher model to generate a smoothed map of the original teacher model and a smoothed map of the refined teacher model; training the refined teacher model based on the smoothed map of the original teacher model and the smoothed map of the refined teacher model; smoothing a student model including a lightweight neural network to generate a smoothed map of the student model; and training the student model based on the smoothed map of the teacher model and the smoothed map of the student model.
Inventors
- Eui-Nam Huh (허의남)
- Md Imtiaz Hossain (엠디 임티아즈 호씬)
- Ji-Seung Bae (배지승)
Assignees
- Kyung Hee University Industry-Academic Cooperation Foundation (경희대학교 산학협력단)
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-12-27
- Priority Date
- 2024-10-31
Claims (19)
- A learning method performed in a computing device comprising a processor and a memory storing one or more programs executed by the processor, the method comprising: obtaining a refined teacher model having the same neural network structure as an original teacher model; smoothing each of the original teacher model and the refined teacher model to generate a smoothed map of the original teacher model and a smoothed map of the refined teacher model; training the refined teacher model based on the smoothed map of the original teacher model and the smoothed map of the refined teacher model; smoothing a student model including a lightweight neural network to generate a smoothed map of the student model; and training the student model based on the smoothed map of the teacher model and the smoothed map of the student model.
- The learning method of claim 1, wherein generating the smoothed map of the original teacher model and the smoothed map of the refined teacher model comprises: extracting features of the original teacher model to generate a feature map of the original teacher model; extracting features of the refined teacher model to generate a feature map of the refined teacher model; inputting the feature map of the original teacher model into a preset mapping function to generate the smoothed map of the original teacher model in which a first feature value is relaxed; and inputting the feature map of the refined teacher model into the mapping function to generate the smoothed map of the refined teacher model in which a second feature value is relaxed.
- The learning method of claim 2, wherein generating the smoothed map of the original teacher model comprises: dividing the feature map of the original teacher model into a plurality of first regions; calculating an average of the first feature values for each of the plurality of first regions; and replacing one or more first feature values included in each of the plurality of first regions with the average of that first region; and wherein generating the smoothed map of the refined teacher model comprises: dividing the feature map of the refined teacher model into a plurality of second regions; calculating an average of the second feature values for each of the plurality of second regions; and replacing one or more second feature values included in each of the plurality of second regions with the average of that second region.
- The learning method of claim 1, wherein generating the smoothed map of the student model comprises: extracting features of the student model to generate a feature map of the student model; and inputting the feature map of the student model into a preset mapping function to generate the smoothed map of the student model in which a third feature value is relaxed.
- The learning method of claim 4, wherein generating the smoothed map of the student model comprises: dividing the feature map of the student model into a plurality of third regions; calculating an average of the third feature values for each of the plurality of third regions; and replacing one or more third feature values included in each of the plurality of third regions with the average of that third region.
- The learning method of claim 1, wherein training the refined teacher model comprises: calculating a first difference between the smoothed map of the original teacher model and the smoothed map of the refined teacher model; and training the refined teacher model so as to minimize the first difference.
- The learning method of claim 6, wherein training the refined teacher model so as to minimize the first difference comprises adjusting parameters of the refined teacher model so that a first loss function including the first difference as a factor is minimized, the first loss function being given by Equation 1 below.
  (Equation 1) Lp = Σi αi · Lpurf(Rp^i, Rt^i) + β · Lcel
  where Lp is the first loss function; i is the layer index; Rp^i is the smoothed map of the i-th layer of the refined teacher model; Rt^i is the smoothed map of the i-th layer of the original teacher model; Lpurf(·,·) is the distance between two smoothed maps; αi is the weight for the i-th layer reflecting the distance between the two smoothed maps; Lcel is the cross-entropy loss between the output value and the label value of the refined teacher model; and β is a weight reflecting the accuracy of the refined teacher model.
- The learning method of claim 1, wherein training the student model comprises: calculating a second difference between the smoothed map of the refined teacher model and the smoothed map of the student model; and training the student model so as to minimize the second difference.
- The learning method of claim 8, wherein training the student model so as to minimize the second difference comprises adjusting parameters of the student model so that a second loss function including the second difference as a factor is minimized, the second loss function being given by Equation 3 below.
  (Equation 3) Ls = α · Σi Lsmooth(Rs^i, Rp^i) + Lce
  where Ls is the second loss function; i is the layer index; Rs^i is the smoothed map of the i-th layer of the student model; Rp^i is the smoothed map of the i-th layer of the refined teacher model; Lsmooth(·,·) is the distance between two smoothed maps; α is a weight reflecting the distance between the smoothed maps; and Lce is the cross-entropy loss between the output value and the label value of the student model.
- A computing device comprising: a processor; and a memory storing one or more programs executed by the processor, wherein the one or more programs comprise: instructions to obtain a refined teacher model having the same neural network structure as an original teacher model; instructions to smooth each of the original teacher model and the refined teacher model to generate a smoothed map of the original teacher model and a smoothed map of the refined teacher model; instructions to train the refined teacher model based on the smoothed map of the original teacher model and the smoothed map of the refined teacher model; instructions to smooth a student model including a lightweight neural network to generate a smoothed map of the student model; and instructions to train the student model based on the smoothed map of the teacher model and the smoothed map of the student model.
- The computing device of claim 10, wherein the instructions to generate the smoothed map of the original teacher model and the smoothed map of the refined teacher model comprise: instructions to extract features of the original teacher model and generate a feature map of the original teacher model; instructions to extract features of the refined teacher model and generate a feature map of the refined teacher model; instructions to input the feature map of the original teacher model into a preset mapping function to generate the smoothed map of the original teacher model in which a first feature value is relaxed; and instructions to input the feature map of the refined teacher model into the mapping function to generate the smoothed map of the refined teacher model in which a second feature value is relaxed.
- The computing device of claim 11, wherein the instructions to generate the smoothed map of the original teacher model comprise: instructions to divide the feature map of the original teacher model into a plurality of first regions; instructions to calculate an average of the first feature values for each of the plurality of first regions; and instructions to replace one or more first feature values included in each of the plurality of first regions with the average of that first region; and wherein the instructions to generate the smoothed map of the refined teacher model comprise: instructions to divide the feature map of the refined teacher model into a plurality of second regions; instructions to calculate an average of the second feature values for each of the plurality of second regions; and instructions to replace one or more second feature values included in each of the plurality of second regions with the average of that second region.
- The computing device of claim 10, wherein the instructions to generate the smoothed map of the student model comprise: instructions to extract features of the student model and generate a feature map of the student model; and instructions to input the feature map of the student model into a preset mapping function to generate the smoothed map of the student model in which a third feature value is relaxed.
- The computing device of claim 13, wherein the instructions to generate the smoothed map of the student model comprise: instructions to divide the feature map of the student model into a plurality of third regions; instructions to calculate an average of the third feature values for each of the plurality of third regions; and instructions to replace one or more third feature values included in each of the plurality of third regions with the average of that third region.
- The computing device of claim 10, wherein the instructions to train the refined teacher model comprise: instructions to calculate a first difference between the smoothed map of the original teacher model and the smoothed map of the refined teacher model; and instructions to train the refined teacher model so as to minimize the first difference.
- The computing device of claim 15, wherein the instructions to train the refined teacher model so as to minimize the first difference comprise instructions to adjust parameters of the refined teacher model so that a first loss function including the first difference as a factor is minimized, the first loss function being given by Equation 1 below.
  (Equation 1) Lp = Σi αi · Lpurf(Rp^i, Rt^i) + β · Lcel
  where Lp is the first loss function; i is the layer index; Rp^i is the smoothed map of the i-th layer of the refined teacher model; Rt^i is the smoothed map of the i-th layer of the original teacher model; Lpurf(·,·) is the distance between two smoothed maps; αi is the weight for the i-th layer reflecting the distance between the two smoothed maps; Lcel is the cross-entropy loss between the output value and the label value of the refined teacher model; and β is a weight reflecting the accuracy of the refined teacher model.
- The computing device of claim 10, wherein the instructions to train the student model comprise: instructions to calculate a second difference between the smoothed map of the refined teacher model and the smoothed map of the student model; and instructions to train the student model so as to minimize the second difference.
- The computing device of claim 17, wherein the instructions to train the student model so as to minimize the second difference comprise instructions to adjust parameters of the student model so that a second loss function including the second difference as a factor is minimized, the second loss function being given by Equation 3 below.
  (Equation 3) Ls = α · Σi Lsmooth(Rs^i, Rp^i) + Lce
  where Ls is the second loss function; i is the layer index; Rs^i is the smoothed map of the i-th layer of the student model; Rp^i is the smoothed map of the i-th layer of the refined teacher model; Lsmooth(·,·) is the distance between two smoothed maps; α is a weight reflecting the distance between the smoothed maps; and Lce is the cross-entropy loss between the output value and the label value of the student model.
- A computer program stored on a non-transitory computer-readable storage medium, the computer program comprising one or more instructions that, when executed by a computing device having one or more processors, cause the computing device to perform: obtaining a refined teacher model having the same neural network structure as an original teacher model; smoothing each of the original teacher model and the refined teacher model to generate a smoothed map of the original teacher model and a smoothed map of the refined teacher model; training the refined teacher model based on the smoothed map of the original teacher model and the smoothed map of the refined teacher model; smoothing a student model including a lightweight neural network to generate a smoothed map of the student model; and training the student model based on the smoothed map of the teacher model and the smoothed map of the student model.
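The region-average smoothing recited in claims 3, 5, 12, and 14 (divide a feature map into regions, then replace each value with its region's average) can be sketched as follows. This is a minimal illustration, not the patented implementation; the function name, the 2×2 region size, and the single-channel map are assumptions for demonstration.

```python
import numpy as np

def smooth_map(feature_map, region):
    """Replace each (region x region) block of a 2-D feature map with that
    block's mean, yielding a 'smoothed map' in which feature values are
    relaxed toward their regional average. Assumes the spatial dimensions
    are divisible by `region`."""
    h, w = feature_map.shape
    # Split the map into non-overlapping region x region blocks.
    blocks = feature_map.reshape(h // region, region, w // region, region)
    # Per-region averages, kept with singleton dims for broadcasting.
    means = blocks.mean(axis=(1, 3), keepdims=True)
    # Broadcast each region's average back over its block.
    return np.broadcast_to(means, blocks.shape).reshape(h, w)

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(smooth_map(fmap, 2))  # each 2x2 block becomes its mean
```

Note that the operation preserves the global mean of the feature map while removing high-frequency variation within each region, which is consistent with the goal of transferring a less noisy teacher representation.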
Description
An embodiment of the present invention relates to a knowledge distillation-based learning method.

In the field of deep learning, teacher-student learning optimization techniques have become increasingly important, and the efficient training of student models has emerged as an essential task. In a teacher-student framework, knowledge learned by a high-performance teacher model is transferred to a small-scale student model, enabling the student model to possess both efficiency and performance. The need for such technology is growing, particularly in environments with limited computational resources and in application fields that require lightweight models, such as mobile, IoT (Internet of Things), and edge computing.

However, if the representation of the teacher model is excessively complex or incomplete, the student model fails to accurately learn the teacher's representation or to properly utilize the received information, which limits performance or consumes excessive computational resources during training. In particular, if the feature map generated by the teacher model is noisy, or if the student model cannot convert it into a generalized representation, the student model's performance degrades. Technology is therefore required that efficiently transfers a refined representation of the teacher model to the student model and enables the student model to learn from it in an optimized manner.

The present disclosure can be readily understood from the following detailed description taken together with the accompanying drawings, in which reference numerals denote structural elements. FIG. 1 is a block diagram showing the configuration of a computing device that executes a knowledge distillation-based method according to one embodiment. FIG. 2 is a schematic diagram illustrating a knowledge distillation-based learning method according to one embodiment. FIG. 3 is a flowchart illustrating the steps of a knowledge distillation-based learning method according to one embodiment. FIG. 4 is a schematic diagram illustrating generating a smoothed map from a feature map according to one embodiment. FIG. 5 is a schematic diagram illustrating training a student model based on a refined teacher model according to one embodiment.

Hereinafter, specific embodiments of the present invention will be described with reference to the drawings. The following detailed description is provided to facilitate a comprehensive understanding of the methods, apparatuses, and/or systems described herein; however, it is merely illustrative, and the present invention is not limited thereto. In describing the embodiments, detailed descriptions of known technologies are omitted where they would unnecessarily obscure the essence of the invention. The terms used below are defined in consideration of their functions within the present invention and may vary depending on the intentions or practices of users or operators; their definitions should therefore be based on the content of this specification as a whole. These terms are intended merely to describe the embodiments and are not limiting. Unless explicitly stated otherwise, singular expressions include the plural.
In this description, expressions such as "include" or "comprise" refer to certain characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof, and should not be interpreted as excluding the existence or possibility of one or more other characteristics, numbers, steps, actions, elements, parts thereof, or combinations thereof. Terms including ordinal numbers, such as "first" or "second," may be used to describe various components, but the components are not limited by those terms; such terms serve only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second component, and similarly a second component may be named a first component. When one component is said to be "connected" to another, this includes not only cases where they are directly connected but also cases where they are connected with another component in between. In this specification, the communication network (50) may include the Internet, one or more local area networks, wide area networks, cellular networks, mobile networks, other types of networks, or a combination of such networks.
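The student-training objective of Equation 3 (a weighted distance between smoothed teacher and student maps plus a cross-entropy term) can be sketched numerically as follows. This is a hedged reconstruction, not the patented implementation: mean-squared error is assumed for the distance Lsmooth, softmax cross-entropy for Lce, and all identifiers (`smooth`, `student_loss`, `alpha`) are illustrative.

```python
import numpy as np

def smooth(fmap, region=2):
    # Region-average smoothing: each region x region block becomes its mean.
    h, w = fmap.shape
    b = fmap.reshape(h // region, region, w // region, region)
    m = b.mean(axis=(1, 3), keepdims=True)
    return np.broadcast_to(m, b.shape).reshape(h, w)

def cross_entropy(logits, label):
    # Softmax cross-entropy between the student's output and the label (Lce).
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return -np.log(p[label])

def student_loss(student_maps, teacher_maps, logits, label, alpha=0.5):
    """Ls = alpha * sum_i Lsmooth(Rs^i, Rp^i) + Lce, with Lsmooth taken as
    mean-squared error between per-layer smoothed maps (an assumption)."""
    dist = sum(np.mean((smooth(s) - smooth(t)) ** 2)
               for s, t in zip(student_maps, teacher_maps))
    return alpha * dist + cross_entropy(logits, label)

# Toy example: one layer, identical maps, so only the cross-entropy remains.
maps = [np.ones((4, 4))]
loss = student_loss(maps, maps, np.array([0.0, 0.0]), label=0)
print(loss)
```

In an actual training loop, the gradient of this loss with respect to the student's parameters would be used for the minimization recited in claims 8 and 9; the teacher's smoothed maps act as fixed targets.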