KR-20260066417-A - METHOD AND SYSTEM FOR AUTOMATIC LABELING ROBUST TO NOISY LABELS

Abstract

An automatic labeling method and system robust to label noise are disclosed. According to one embodiment, an automatic labeling method performed by an automatic labeling system may include: assigning pseudo-labels to unlabeled data through an initial model trained using labeled training data; identifying noisy labels through an identification model using an estimated transition matrix modeled from a confusion matrix predicted for labeled validation data by the trained initial model; and obtaining a final model through training on a dataset obtained by refining the identified noisy labels.

Inventors

허영범
이원희

Assignees

인하대학교 산학협력단

Dates

Publication Date: 20260512
Application Date: 20241104

Claims (8)

1. An automatic labeling method performed by an automatic labeling system, the method comprising: assigning pseudo-labels to unlabeled data through an initial model trained using labeled training data; identifying noisy labels through an identification model using an estimated transition matrix modeled based on a confusion matrix predicted for labeled validation data through the trained initial model; and obtaining a final model through training using a dataset obtained by refining the identified noisy labels.
2. The automatic labeling method of claim 1, wherein the assigning comprises: training the initial model using the labeled training data; and inputting the unlabeled data into the trained initial model.
3. The automatic labeling method of claim 1, wherein the assigning comprises: predicting the confusion matrix for the labeled validation data through the trained initial model, and modeling the estimated transition matrix by normalizing the predicted confusion matrix to values between 0 and 1.
4. The automatic labeling method of claim 1, wherein the identifying comprises: calculating a loss using, as soft labels, the rows of the estimated transition matrix corresponding to the actual labels, and training the identification model based on the calculated loss.
5. The automatic labeling method of claim 4, wherein the identifying comprises: calculating a softmax output for each data item through the trained identification model.
6. The automatic labeling method of claim 5, wherein the identifying comprises: evaluating a degree of noise by calculating, using Kullback-Leibler divergence (KL-divergence), the difference between the softmax output of the trained identification model and the transition vector corresponding to the label in the estimated transition matrix.
7. The automatic labeling method of claim 6, wherein the identifying comprises: determining a sample to be a clean sample if the KL-divergence is smaller than a preset reference value, determining the sample to be a noisy sample if the KL-divergence is larger than the preset reference value, and removing the determined noisy sample.
8. An automatic labeling system comprising: a pseudo-label assignment unit that assigns pseudo-labels to unlabeled data through an initial model trained using labeled training data; a noisy label identification unit that identifies noisy labels through an identification model using an estimated transition matrix modeled based on a confusion matrix predicted for labeled validation data through the trained initial model; and a final model acquisition unit that obtains a final model through training using a dataset obtained by refining the identified noisy labels.
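
As an illustration only, the identification steps recited above (softmax output, KL-divergence against the transition vector, comparison with a reference value) could be sketched as follows. The function names, the direction of the divergence (transition vector relative to softmax output), the handling of equality, and the reference value 0.5 are all assumptions made for this sketch; the claims do not specify them.

```python
import math
from typing import List, Sequence, Tuple


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q); terms with p_i == 0 contribute nothing, q is clamped at eps."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def split_clean_noisy(
    softmax_outputs: Sequence[Sequence[float]],
    labels: Sequence[int],
    transition: Sequence[Sequence[float]],
    threshold: float,
) -> Tuple[List[int], List[int]]:
    """Return indices of samples judged clean and noisy."""
    clean, noisy = [], []
    for idx, (probs, label) in enumerate(zip(softmax_outputs, labels)):
        # Compare the model output with the transition vector of the sample's label.
        score = kl_divergence(transition[label], probs)
        # Assumption: a score exactly equal to the threshold counts as noisy.
        (clean if score < threshold else noisy).append(idx)
    return clean, noisy


# Hypothetical estimated transition matrix and identification-model outputs.
transition = [[0.75, 0.25, 0.0], [0.25, 0.75, 0.0], [0.0, 0.0, 1.0]]
outputs = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8], [0.05, 0.05, 0.9]]
labels = [0, 0, 2]

clean_idx, noisy_idx = split_clean_noisy(outputs, labels, transition, threshold=0.5)
print(clean_idx, noisy_idx)  # [0, 2] [1]
```

In this toy run the second sample carries label 0 but the model output concentrates on class 2, so its divergence from the transition vector of label 0 is large and it is set aside as noisy.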

Description

Method and System for Automatic Labeling Robust to Noisy Labels

The following description relates to labeling technology.

Recently, as the demand for training AI models has increased across various fields, related technologies are being actively researched and developed. To improve the accuracy of a model, it must be trained in advance on a large amount of accurately labeled data. Conventional methods of generating data through manual labeling consume significant time and human resources. Furthermore, verifying incorrect labels requires substantial additional resources, and failing to correct them degrades data quality.

Moreover, data often contains noisy labels due to annotator errors, crowdsourcing, or web scraping. Because generating high-quality data requires a long and complex process, the occurrence of noisy data is inevitable. Consequently, there is a demand for technology that automatically labels data while removing noisy labels.

Reference: Republic of Korea Published Patent No. 10-2021-0006247 (published Jan. 18, 2021)

FIG. 1 is a diagram illustrating the operation of processing noisy labels using a transition matrix in one embodiment.
FIG. 2 is an example for explaining the structure of a model in one embodiment.
FIG. 3 is an example for explaining an estimated transition vector in one embodiment.
FIG. 4 is an example showing a clean sample and a noisy sample in one embodiment.
FIG. 5 is a block diagram illustrating an automatic labeling system in one embodiment.
FIG. 6 is a flowchart illustrating an automatic labeling method in one embodiment.

Hereinafter, embodiments will be described in detail with reference to the attached drawings.

FIG. 1 is a diagram illustrating the operation of processing noisy labels using a transition matrix in one embodiment. The automatic labeling system can process noisy labels using a transition matrix.
To efficiently process large amounts of unlabeled data, the automatic labeling system can be designed so that data is added sequentially.

FIG. 2 is an example for explaining the structure of the model. The automatic labeling system can configure an initial model, an identification model, and a final model. The initial model, the identification model, and the final model can each use a ResNet with the same structure. ResNet is used for data classification and is described in non-patent literature 1, "He et al. 2015. Deep Residual Learning for Image Recognition. CVPR".

The automatic labeling system trains the initial model using labeled data and can assign pseudo-labels to unlabeled data through the trained initial model. The system can train the identification model using merged data that combines the existing small amount of labeled data with the data to which pseudo-labels have been assigned. The system can then obtain the final model through training on a dataset secured by refining the noisy labels. Because the system is designed to handle sequentially added data, it assigns labels to new data using the models obtained in previous stages, which enables it to process large volumes of unlabeled data efficiently and to acquire progressively better pseudo-labels.

Referring again to FIG. 1, the automatic labeling system can first train the initial model using the given labeled training data. The system can then input unlabeled data into the trained initial model and assign pseudo-labels to the unlabeled data through that model. Here, pseudo-labeling is a technique mainly used in semi-supervised learning, in which the label predicted by the model for unlabeled data is used for training as if it were the actual label.
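
The pseudo-label assignment described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `assign_pseudo_labels` and `toy_model` are hypothetical names, and `toy_model` is a stand-in lookup table used in place of a trained ResNet classifier.

```python
from typing import Callable, List, Sequence, Tuple


def assign_pseudo_labels(
    predict_proba: Callable[[str], Sequence[float]],
    unlabeled: Sequence[str],
) -> List[Tuple[str, int]]:
    """Pair each unlabeled sample with the index of its most probable class."""
    labeled = []
    for sample in unlabeled:
        probs = predict_proba(sample)
        # The pseudo-label is the class with the highest predicted probability.
        pseudo_label = max(range(len(probs)), key=lambda k: probs[k])
        labeled.append((sample, pseudo_label))
    return labeled


def toy_model(sample: str) -> Sequence[float]:
    """Hypothetical stand-in for the trained initial model (fixed outputs)."""
    return {"img_a": [0.7, 0.2, 0.1], "img_b": [0.1, 0.3, 0.6]}[sample]


pseudo = assign_pseudo_labels(toy_model, ["img_a", "img_b"])
print(pseudo)  # [('img_a', 0), ('img_b', 2)]
```

The resulting pairs can be merged with the original labeled data, which is where incorrect pseudo-labels enter the dataset as label noise.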
For example, in a three-class classification problem over cats, dogs, and lions, if the trained classifier outputs [0.7, 0.2, 0.1] for an unlabeled data point, the probability for cats, 0.7, is the highest, so the pseudo-label is assigned as cat and the data point can then be used as labeled data.

The automatic labeling system uses a transition matrix to estimate the noisy labels generated during the pseudo-labeling process. By using the transition matrix to identify and remove noisy data, a refined dataset can be obtained. Here, a transition matrix is a matrix that mathematically represents the relationship between clean labels and observed labels. Clean labels are the true labels of the data, while observed labels are the labels actually recorded, which may be incorrect due to noise.

More specifically, FIG. 3 is an example illustrating the estimated transition vector. To estimate the noise arising from the performance of the initial model p1, the automatic labeling system can predict a confusion matrix through inference on the labeled validation data using the trained initial model.
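
A minimal sketch of turning such a confusion matrix into an estimated transition matrix is shown below. The text only states that the confusion matrix is normalized to values between 0 and 1; dividing each row by its sum (so that each row becomes a probability vector indexed by the true label) is an assumption of this sketch, as are the function names, the uniform fallback for empty rows, and the toy validation data.

```python
from typing import List, Sequence


def confusion_matrix(
    true_labels: Sequence[int], predicted_labels: Sequence[int], num_classes: int
) -> List[List[int]]:
    """Count predictions: rows are true labels, columns are predicted labels."""
    counts = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_labels, predicted_labels):
        counts[t][p] += 1
    return counts


def estimate_transition_matrix(counts: Sequence[Sequence[int]]) -> List[List[float]]:
    """Row-normalize the counts so every entry lies between 0 and 1."""
    matrix = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption: a class with no validation samples gets a uniform row.
            matrix.append([1.0 / len(row)] * len(row))
        else:
            matrix.append([c / total for c in row])
    return matrix


# Hypothetical validation labels and initial-model predictions.
valid_true = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
valid_pred = [0, 0, 0, 1, 1, 1, 1, 0, 2, 2]

counts = confusion_matrix(valid_true, valid_pred, num_classes=3)
T = estimate_transition_matrix(counts)
print(T)  # [[0.75, 0.25, 0.0], [0.25, 0.75, 0.0], [0.0, 0.0, 1.0]]
```

Each row of `T` then plays the role of a transition vector: for a given label, it describes how often the initial model confuses that class with every other class on the validation data.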