JP-2026074750-A - Method for training a custom model based on a pre-trained base model and a learning device using the same.
Abstract
[Problem] To provide a method for training a custom model based on a pre-trained base model. [Solution] The method includes the steps of: a learning device inputting training data into a custom model that performs a specific task to generate first to n-th intermediate features, and generating first to k-th transformation features by converting, in first to k-th specific intermediate features, a source domain corresponding to the base model into a target domain corresponding to the specific task; the learning device generating a task output by performing a learning operation, through the task head block, on the k-th fusion feature output from the k-th residual unit; and the learning device generating a task loss by referring to the task output and an original correct answer corresponding to the training data, and backpropagating the task loss to train at least a portion of the task head block, the first to k-th residual units, and the first to k-th adaptation blocks. [Selection Diagram] Figure 5
Inventors
- Kim, Kye-Hyeon (金 桂賢)
Assignees
- Superb AI Co., Ltd. (スパーブエーアイ カンパニー リミテッド)
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2024-10-21
Claims (20)
- A method for training a custom model based on a pre-trained base model, comprising the steps of: (a) a learning device inputting training data into a custom model that performs a specific task, the custom model comprising a base model including pre-trained first to n-th feature extraction blocks (where n is an integer of 2 or more), first to k-th adaptation blocks (where k is an integer of 2 or more and less than or equal to n), first to k-th residual units, and a task head block; generating first to n-th intermediate features through the first to n-th feature extraction blocks of the base model, respectively; and generating, through the first to k-th adaptation blocks, first to k-th transformation features by converting, in each of first to k-th specific intermediate features selected from the first to n-th intermediate features, a source domain corresponding to the base model into a target domain corresponding to the specific task; (b) the learning device generating a first fusion feature by fusing a first residual feature, extracted for the specific task through the first residual unit, with the first transformation feature; then generating an i-th provisional feature by fusing the (i-1)-th fusion feature with the i-th transformation feature (where i is an integer of 2 or more and less than or equal to k); generating an i-th fusion feature by fusing an i-th residual feature, extracted for the specific task through the i-th residual unit, with the i-th provisional feature; and generating a task output by performing a learning operation, through the task head block, on the k-th fusion feature output from the k-th residual unit; and (c) the learning device generating a task loss by referring to the task output and an original correct answer corresponding to the training data, and backpropagating the task loss to train at least a portion of the task head block, the first to k-th residual units, and the first to k-th adaptation blocks.
- The method according to claim 1, wherein in step (a), the learning device generates one transformation feature by selecting the n-th or (n-1)-th intermediate feature from the first to n-th intermediate features through any one of the first to k-th adaptation blocks and converting the source domain into the target domain; and generates another transformation feature by selecting one of the second to (n-2)-th intermediate features through another one of the first to k-th adaptation blocks and converting the source domain into the target domain.
- The method according to claim 1, wherein in step (a), each of the first to k-th residual units includes a plurality of convolution layers, and a first filter included in at least some of the plurality of convolution layers is decomposed into a plurality of second filters having a lower rank than the first filter.
- The method according to claim 1, wherein in step (a), each of the first to n-th feature extraction blocks has its parameters fixed through freezing, and in step (c), when the learning device backpropagates the task loss, it does not update the parameters of any of the first to n-th feature extraction blocks.
- The method according to claim 1, wherein in step (b), the learning device generates the first fusion feature by performing an add operation on the first residual feature and the first transformation feature through a first element-wise add layer; generates the i-th provisional feature by performing an add operation on the i-th transformation feature and the (i-1)-th fusion feature through an (i-1)-th element-wise add layer; and generates the i-th fusion feature by performing an add operation on the i-th residual feature and the i-th provisional feature through an i-th element-wise add layer.
- A method for training a custom model based on a pre-trained base model, comprising the steps of: (a) a learning device inputting training data into a custom model that performs a specific task, the custom model comprising a base model including pre-trained first to n-th feature extraction blocks (where n is an integer of 1 or more), an adaptation block, a residual unit, and a task head block; generating first to n-th intermediate features through the first to n-th feature extraction blocks of the base model, respectively; and generating, through the adaptation block, a transformation feature by converting, in a specific intermediate feature selected from the first to n-th intermediate features, a source domain corresponding to the base model into a target domain corresponding to the specific task; (b) the learning device generating a fusion feature by fusing a residual feature, extracted for the specific task through the residual unit, with the transformation feature, and generating a task output by performing a learning operation on the fusion feature through the task head block; and (c) the learning device generating a task loss by referring to the task output and an original correct answer corresponding to the training data, and backpropagating the task loss to train at least a portion of the task head block, the residual unit, and the adaptation block.
- The method according to claim 6, wherein in step (a), the learning device generates the transformation feature by selecting one of the second to (n-1)-th intermediate features as the specific intermediate feature through the adaptation block and converting the source domain into the target domain.
- The method according to claim 6, wherein in step (a), the residual unit includes a plurality of convolution layers, and a first filter included in at least some of the plurality of convolution layers is decomposed into a plurality of second filters having a lower rank than the first filter.
- The method according to claim 6, wherein in step (a), each of the first to n-th feature extraction blocks has its parameters fixed through freezing, and in step (c), when the learning device backpropagates the task loss, it does not update the parameters of any of the first to n-th feature extraction blocks.
- The method according to claim 6, wherein in step (b), the learning device generates the fusion feature by performing an add operation on the residual feature and the transformation feature through an element-wise add layer.
- A learning device for training a custom model based on a pre-trained base model, comprising: at least one memory storing instructions; and at least one processor configured to execute the instructions, wherein the processor performs: (I) a process of inputting training data into a custom model that performs a specific task, the custom model comprising a base model including pre-trained first to n-th feature extraction blocks (where n is an integer of 2 or more), first to k-th adaptation blocks (where k is an integer of 2 or more and less than or equal to n), first to k-th residual units, and a task head block; generating first to n-th intermediate features through the first to n-th feature extraction blocks of the base model, respectively; and generating, through the first to k-th adaptation blocks, first to k-th transformation features by converting, in each of first to k-th specific intermediate features selected from the first to n-th intermediate features, a source domain corresponding to the base model into a target domain corresponding to the specific task; (II) a process of generating a first fusion feature by fusing a first residual feature, extracted for the specific task through the first residual unit, with the first transformation feature; then generating an i-th provisional feature by fusing the (i-1)-th fusion feature with the i-th transformation feature (where i is an integer of 2 or more and less than or equal to k); generating an i-th fusion feature by fusing an i-th residual feature, extracted for the specific task through the i-th residual unit, with the i-th provisional feature; and generating a task output by performing a learning operation, through the task head block, on the k-th fusion feature output from the k-th residual unit; and (III) a process of generating a task loss by referring to the task output and an original correct answer corresponding to the training data, and backpropagating the task loss to train at least a portion of the task head block, the first to k-th residual units, and the first to k-th adaptation blocks.
- The learning device according to claim 11, wherein in process (I), the processor generates one transformation feature by selecting the n-th or (n-1)-th intermediate feature from the first to n-th intermediate features through any one of the first to k-th adaptation blocks and converting the source domain into the target domain; and generates another transformation feature by selecting one of the second to (n-2)-th intermediate features through another one of the first to k-th adaptation blocks and converting the source domain into the target domain.
- The learning device according to claim 11, wherein in process (I), each of the first to k-th residual units includes a plurality of convolution layers, and a first filter included in at least some of the plurality of convolution layers is decomposed into a plurality of second filters having a lower rank than the first filter.
- The learning device according to claim 11, wherein in process (I), each of the first to n-th feature extraction blocks has its parameters fixed through freezing, and in process (III), when the processor backpropagates the task loss, it does not update the parameters of any of the first to n-th feature extraction blocks.
- The learning device according to claim 11, wherein in process (II), the processor generates the first fusion feature by performing an add operation on the first residual feature and the first transformation feature through a first element-wise add layer; generates the i-th provisional feature by performing an add operation on the i-th transformation feature and the (i-1)-th fusion feature through an (i-1)-th element-wise add layer; and generates the i-th fusion feature by performing an add operation on the i-th residual feature and the i-th provisional feature through an i-th element-wise add layer.
- A learning device for training a custom model based on a pre-trained base model, comprising: at least one memory storing instructions; and at least one processor configured to execute the instructions, wherein the processor performs: (I) a process of inputting training data into a custom model that performs a specific task, the custom model comprising a base model including pre-trained first to n-th feature extraction blocks (where n is an integer of 1 or more), an adaptation block, a residual unit, and a task head block; generating first to n-th intermediate features through the first to n-th feature extraction blocks of the base model, respectively; and generating, through the adaptation block, a transformation feature by converting, in a specific intermediate feature selected from the first to n-th intermediate features, a source domain corresponding to the base model into a target domain corresponding to the specific task; (II) a process of generating a fusion feature by fusing a residual feature, extracted for the specific task through the residual unit, with the transformation feature, and generating a task output by performing a learning operation on the fusion feature through the task head block; and (III) a process of generating a task loss by referring to the task output and an original correct answer corresponding to the training data, and backpropagating the task loss to train at least a portion of the task head block, the residual unit, and the adaptation block.
- The learning device according to claim 16, wherein in process (I), the processor generates the transformation feature by selecting one of the second to (n-1)-th intermediate features as the specific intermediate feature through the adaptation block and converting the source domain into the target domain.
- The learning device according to claim 16, wherein in process (I), the residual unit includes a plurality of convolution layers, and a first filter included in at least some of the plurality of convolution layers is decomposed into a plurality of second filters having a lower rank than the first filter.
- The learning device according to claim 16, wherein in process (I), each of the first to n-th feature extraction blocks has its parameters fixed through freezing, and in process (III), when the processor backpropagates the task loss, it does not update the parameters of any of the first to n-th feature extraction blocks.
- The learning device according to claim 16, wherein in process (II), the processor generates the fusion feature by performing an add operation on the residual feature and the transformation feature through an element-wise add layer.
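The add-based fusion chain of claims 5 and 15 follows a simple recursion: the first fusion feature adds the first residual and transformation features; thereafter, each i-th provisional feature adds the (i-1)-th fusion feature to the i-th transformation feature, and each i-th fusion feature adds the i-th residual feature to that provisional feature. As a purely illustrative sketch (NumPy vectors stand in for features; none of these names come from the patent), the recursion can be written as:

```python
import numpy as np

def fuse_chain(residual_feats, transform_feats):
    """Element-wise-add fusion chain (illustrative sketch of claims 5/15).

    residual_feats[i]  : i-th residual feature, extracted for the specific task
    transform_feats[i] : i-th transformation feature, from the i-th adaptation block
    """
    k = len(transform_feats)
    # First fusion feature: first residual + first transformation feature.
    fused = residual_feats[0] + transform_feats[0]
    for i in range(1, k):
        provisional = fused + transform_feats[i]  # (i-1)-th fusion + i-th transformation
        fused = residual_feats[i] + provisional   # i-th residual + i-th provisional
    return fused  # k-th fusion feature, fed to the task head block

# Toy features with k = 3, so the arithmetic is easy to follow by hand.
r = [np.full(4, 1.0), np.full(4, 2.0), np.full(4, 3.0)]
t = [np.full(4, 0.5), np.full(4, 0.5), np.full(4, 0.5)]
out = fuse_chain(r, t)
print(out)  # every element is (1+0.5) + 0.5+2 + 0.5+3 = 7.5
```

Note that each transformation feature enters the chain exactly once, so every adaptation block's contribution reaches the task head along a purely additive path.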
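Claims 3, 8, 13, and 18 decompose a first filter into second filters of lower rank. One standard way to realize such a decomposition, shown here only as a hypothetical sketch (truncated SVD of a 2-D weight matrix; the patent does not specify the factorization method or these shapes), trades a large filter for two small factors:

```python
import numpy as np

def low_rank_decompose(W, rank):
    """Factor W (m x n) into A (m x rank) @ B (rank x n) via truncated SVD."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]  # absorb the singular values into A
    B = Vt[:rank, :]
    return A, B

rng = np.random.default_rng(0)
# A 64x64 "first filter" whose true rank is 4, mimicking a redundant layer.
W = rng.standard_normal((64, 4)) @ rng.standard_normal((4, 64))
A, B = low_rank_decompose(W, rank=4)

print(W.size)                 # 4096 parameters in the original filter
print(A.size + B.size)        # 512 parameters after decomposition
print(np.allclose(A @ B, W))  # True: exact here, since rank(W) <= 4
```

When the original filter is not exactly low-rank, the truncation instead yields the best rank-r approximation, trading a small accuracy loss for the same parameter and compute savings.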
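The single-adaptation-block variant (claims 6 to 10 and 16 to 20) can be sketched end to end. The following is a minimal, hypothetical NumPy stand-in, not the patent's implementation: linear maps replace the feature extraction blocks, adaptation block, residual unit, and task head, and for brevity only the task head is updated during backpropagation (the claims also train the adaptation block and the residual unit):

```python
import numpy as np

rng = np.random.default_rng(1)
D = 8  # hypothetical feature width

# Frozen base model: n = 3 feature extraction blocks (linear stand-ins).
base = [rng.standard_normal((D, D)) * 0.3 for _ in range(3)]

# Trainable parts: adaptation block, residual unit, task head block.
adapt = rng.standard_normal((D, D)) * 0.1
resid = rng.standard_normal((D, D)) * 0.1
head = rng.standard_normal((D, 1)) * 0.1

def forward(x):
    h = x
    feats = []
    for W in base:                # frozen blocks yield intermediate features
        h = np.tanh(h @ W)
        feats.append(h)
    transform = feats[1] @ adapt  # 2nd intermediate feature -> target domain
    residual = np.maximum(transform @ resid, 0.0)  # residual unit (ReLU stand-in)
    fusion = transform + residual                  # element-wise add fusion
    return fusion, fusion @ head                   # fusion feature, task output

x = rng.standard_normal((5, D))   # training data
gt = rng.standard_normal((5, 1))  # original correct answers (GT)

fusion, out = forward(x)
loss_before = float(np.mean((out - gt) ** 2))

lr = 0.05
for _ in range(300):
    fusion, out = forward(x)
    grad_out = 2.0 * (out - gt) / len(x)  # d(MSE task loss)/d(out)
    head -= lr * fusion.T @ grad_out      # frozen base weights are never touched

_, out = forward(x)
loss_after = float(np.mean((out - gt) ** 2))
print(loss_after < loss_before)  # True: the adapter-side parameters learned
```

The point of the sketch is the gradient flow: the loss reaches only the lightweight blocks attached outside the base model, so no gradients or activations need to be stored for the frozen feature extraction blocks.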
Description
This invention relates to a method for training a custom model based on a pre-trained base model, and a learning device using the same. Base models such as Grounding DINO, trained on a wide range of large training datasets such as MSCOCO, Objects365, and OpenImages, possess broad and universal knowledge. However, such base models lack, or have insufficient, detailed knowledge necessary to perform specific tasks, such as detecting defects in circuit boards. Therefore, research is being conducted on training base models to acquire the knowledge necessary for specific tasks using relatively small amounts of training data, building on the universal knowledge they already possess. Related conventional training models are shown in Figure 1. As an example, Figure 1(a) shows a learning model trained using the Full Fine-Tuning method. For example, when training data is input to a pre-trained base model 100 containing five layers, the training data is sequentially processed through the first layer 110 to the fifth layer 150, and task data is output. Learning can then be performed by generating a loss with reference to the task data and its corresponding GT (Ground Truth), and sequentially updating the weight parameters of each layer from the fifth layer 150 down to the first layer 110 through backpropagation, a gradient operation performed in the reverse direction as shown by the dotted arrow. Therefore, while the Full Fine-Tuning method offers high performance, since all the weight parameters of the base model 100 can be learned, it has the problem that, because a base model 100 generally has a very large number of weight parameters, learning it for a specific task requires astronomical costs. As another example, Figures 1(b) and 1(c) show learning models in which some layers of the base model 100 are frozen to solve the cost problem of the Full Fine-Tuning method.
The cost required for learning is reduced by not learning the frozen weight parameters. For example, in (b), the weight parameters of the first layer 110 to the fourth layer 140 of the base model 100 are frozen, and in (c), the weight parameters of the first layer 110, the third layer 130, and the fourth layer 140 are frozen. In this case, the loss is generated in the same way as in (a), but in (b), only the weight parameters of the fifth layer 150 are updated via backpropagation, and in (c), the weight parameters of both the fifth layer 150 and the second layer 120 are updated via backpropagation. Because only the information for some layers needs to be stored in memory, the cost can be reduced compared to the Full Fine-Tuning method. However, the base model 100 is trained to perform advanced inference by extracting primitive features in the first half of its layers (for example, the first layer 110, which extracts primitive features from the training data) and combining these primitive features in the second half of its layers (for example, the fifth layer 150). If training is not performed on the first half of the layers, as in (b), primitive features for the specific task cannot be generated, and inference must rely only on primitive features for universal tasks, resulting in significant performance degradation. Conversely, if training is performed on part of the first half of the layers to minimize this performance degradation, as in (c), it is necessary not only to perform gradient operations down to the first half of the layers via backpropagation, but also to store information for those layers in memory, resulting in a significant loss of the cost benefits. As another example, Figure 1(d) shows a learning model in which all layers of the base model 100 are frozen and separate adapter layers are added.
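The three regimes of Figure 1 differ only in which layers' weight parameters receive updates during backpropagation. A small bookkeeping sketch (hypothetical layer names, not from the patent) makes the trade-off concrete:

```python
# Layers of the base model in Figure 1, from front (primitive features)
# to back (features combined for advanced inference).
layers = ["layer1", "layer2", "layer3", "layer4", "layer5"]

def trainable(frozen):
    """Layers whose weight parameters are updated during backpropagation."""
    return [name for name in layers if name not in frozen]

full_ft = trainable(frozen=set())                                       # Figure 1(a)
head_only = trainable(frozen={"layer1", "layer2", "layer3", "layer4"})  # Figure 1(b)
mixed = trainable(frozen={"layer1", "layer3", "layer4"})                # Figure 1(c)

print(full_ft)    # all five layers: maximal cost, maximal performance
print(head_only)  # ['layer5']: cheap, but no task-specific primitive features
print(mixed)      # ['layer2', 'layer5']: backprop must still reach layer2
```

In case (c), even though only two layers are trained, the gradient for layer2 must flow backward through layer3 and layer4, which is why the memory and compute savings are smaller than the trainable-parameter count alone suggests.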
In (d), two adapter layers, a first adapter layer 210 and a second adapter layer 220, are connected, by way of example, to the outputs of the second layer 120 and the fourth layer 140 of the base model 100, respectively. The outputs of the first adapter layer 210 and the second adapter layer 220 are fused through a predetermined fusion layer 300, and task data is then output through an output layer 400. Subsequently, a loss generated with reference to the task data and its corresponding GT is backpropagated, and a gradient operation is performed on the output layer 400, the first adapter layer 210, and the second adapter layer 220 to update their weight parameters. Since the first layer 110 through the fifth layer 150 of the base model 100 are all frozen and their weight parameters are fixed, the base model 100 itself is not trained. This allows only the output layer 400, the first adapter layer 210, and the second adapter layer 220, which have relatively few weight parameters compared to the base model 100, to be trained, effectively reducing the training costs. However, a problem exists: simply