US-12625924-B2 - Method and apparatus for data augmentation

Abstract

An apparatus for data augmentation includes: a mode separating unit that generates minor class fake data from a latent vector; an embedding vector generating unit that generates embedding vectors for major class original data, minor class original data, and the minor class fake data through a metric network; an auxiliary classifying unit that classifies a class of the embedding vectors from the embedding vector generating unit and feeds the classification result back to the mode separating unit; and a classifying unit that determines whether input data is authentic by receiving the embedding vector of the minor class original data and the embedding vector of the minor class fake data from the embedding vector generating unit, and feeds the determination result back to the mode separating unit.

Inventors

Min Jung Kim
Young Seon Lee
No Seong Park
Ji Hyeon Hyeong
Ja Young Kim

Assignees

SAMSUNG SDS CO., LTD.
INDUSTRY-ACADEMIC COOPERATION FOUNDATION, YONSEI UNIVERSITY

Dates

Publication Date: 20260512
Application Date: 20221017
Priority Date: 20211018

Claims (19)

1. An apparatus for data augmentation, comprising: one or more processors; and a memory storing one or more programs, wherein the one or more programs are configured to be executed by the one or more processors, wherein the one or more programs include instructions for a mode separating unit, an embedding vector generating unit, an auxiliary classifying unit and a classifying unit; wherein the mode separating unit is configured to receive input latent vectors and generate one or more minor class candidate data samples, each candidate data sample corresponding to a different feature mode of a minor class, and to select one of the candidate data samples as a minor class fake data sample based on similarity feedback; the embedding vector generating unit is configured to generate embedding vectors through a metric learning network, the embedding vectors comprising an embedding vector of a major class original data sample, an embedding vector of a minor class original data sample, and an embedding vector of the minor class fake data sample, wherein the metric learning network is trained to minimize a distance between embedding vectors belonging to a same class and to maximize a distance between embedding vectors belonging to different classes; the auxiliary classifying unit is configured to classify a class of the embedding vectors by receiving the embedding vector of the major class original data sample, the embedding vector of the minor class original data sample, and the embedding vector of the minor class fake data sample from the embedding vector generating unit, and generate a similarity feedback signal indicating whether the minor class fake data sample is misclassified as the major class; and the classifying unit is configured to determine whether input data is authentic by computing a similarity or distance metric between the embedding vector of the minor class original data sample and the embedding vector of the minor class fake data sample, and generate an authenticity feedback signal for updating generation parameters of the mode separating unit, wherein the mode separating unit is trained using both the similarity feedback signal and the authenticity feedback signal to adjust the generation parameters of the minor class fake data sample.
2. The apparatus of claim 1, wherein the mode separating unit comprises a plurality of generating sub-units configured to generate a plurality of minor class candidate data samples different from each other from a latent vector using a generative adversarial network; and a gating network configured to select one of the candidate data samples among the plurality of minor class candidate data samples as the minor class fake data sample.
3. The apparatus of claim 2, wherein, among the plurality of generating sub-units, a generating sub-unit that generates the minor class candidate data sample selected by the gating network is trained so that the minor class candidate data sample is different from the major class original data sample based on a feedback signal from the auxiliary classifying unit.
4. The apparatus of claim 2, wherein, among the plurality of generating sub-units, a generating sub-unit that generates the minor class candidate data sample selected by the gating network is trained so that the minor class candidate data sample is similar to the minor class original data sample based on an authenticity feedback from the classifying unit.
5. The apparatus of claim 2, wherein the gating network is trained to select the minor class candidate data sample having a highest similarity score to the minor class original data sample based on the similarity feedback signal received from the auxiliary classifying unit.
6. The apparatus of claim 2, wherein the plurality of generating sub-units are configured to generate the minor class candidate data sample corresponding to one or more sub-classes of the minor class from the latent vector.
7. The apparatus of claim 1, wherein the embedding vectors are generated so that data included in the same category are close to each other and data included in different categories are distant from each other through metric learning.
8. The apparatus of claim 1, wherein the auxiliary classifying unit is configured to transmit a negative similarity feedback signal to the mode separating unit, when the embedding vector of the minor class candidate data sample is classified into the embedding vector of the major class original data sample or the embedding vector of the minor class fake data sample.
9. The apparatus of claim 1, wherein the classifying unit is configured to transmit a negative authenticity feedback signal to the mode separating unit, when the embedding vector of the minor class fake data sample is determined to be fake based on the similarity or distance metric.
10. A method for data augmentation, the method performed on a computing device including one or more processors and a memory storing one or more programs executed by the one or more processors, the method comprising: performing a mode separation operation of generating one or more minor class candidate data samples from a latent vector and selecting one of the candidate data samples as a minor class fake data sample based on similarity feedback; performing an embedding vector generating operation of generating embedding vectors through a metric learning network, the embedding vectors comprising an embedding vector of a major class original data sample, an embedding vector of a minor class original data sample, and an embedding vector of a minor class fake data sample, wherein the metric learning network is trained to minimize a distance between embedding vectors belonging to a same class and to maximize a distance between embedding vectors belonging to different classes; performing an auxiliary classification operation of classifying a class of the embedding vectors by receiving the embedding vector of the major class original data sample, the embedding vector of the minor class original data sample, and the embedding vector of the minor class fake data sample from the embedding vector generating operation, and generating a similarity feedback signal indicating whether the minor class fake data sample is misclassified as the major class; performing a classification operation of determining whether input data is authentic by computing a similarity or distance metric between the embedding vector of the minor class original data sample and the embedding vector of the minor class fake data sample and generating an authenticity feedback signal for updating parameters of the mode separation operation, wherein the mode separation operation is trained using both the similarity feedback signal and the authenticity feedback signal to adjust generation parameters for the minor class fake data sample.
11. The method of claim 10, wherein, in the mode separation operation, a plurality of minor class candidate data samples different from each other are generated from the latent vector using a plurality of generative adversarial networks; and one of the plurality of the candidate data samples is selected as the minor class fake data sample.
12. The method of claim 11, wherein the generative adversarial neural network generating the selected one of the candidate data samples is trained so that the minor class candidate data sample is different from the major class original data sample based on a feedback signal received from the auxiliary classification operation.
13. The method of claim 11, wherein the generative adversarial neural network generating the selected one of the candidate data samples is trained so that the minor class candidate data sample is similar to the minor class original data sample based on an authenticity feedback received from the classification operation.
14. The method of claim 11, wherein, in the mode separation operation, a gating network is trained to select the minor class candidate data sample having a highest similarity score to the minor class original data sample based on the similarity feedback signal received from the auxiliary classification operation.
15. The method of claim 11, wherein the plurality of generative adversarial neural networks generate the minor class candidate data sample corresponding to one or more sub-classes of the minor class from the latent vector.
16. The method of claim 10, wherein the embedding vectors are generated so that data included in the same category are close to each other and data included in different categories are distant from each other through metric learning.
17. The method of claim 10, wherein, in the auxiliary classification operation, when the embedding vector of the minor class candidate data sample is classified into the embedding vector of the major class original data sample or the embedding vector of the minor class fake data sample, a negative similarity feedback signal is transmitted to the mode separation operation.
18. The method of claim 10, wherein, in the classification operation, when the embedding vector of the minor class candidate data sample is determined to be fake, a negative authenticity feedback signal is transmitted to the mode separation operation.
19. An apparatus for data augmentation, comprising: one or more processors; and a memory storing one or more programs, wherein the one or more programs are configured to be executed by the one or more processors, wherein the one or more programs include instructions for: a mode separating unit configured to generate one or more minor class candidate data samples from a latent vector and select one of the candidate data samples as a minor class fake data sample based on similarity feedback; an embedding vector generating unit configured to generate embedding vectors through a metric learning network, the embedding vectors comprising an embedding vector of a major class original data sample, an embedding vector of a minor class original data sample, and an embedding vector of a minor class fake data sample, wherein the metric learning network is trained to minimize a distance between embedding vectors belonging to a same class and to maximize a distance between embedding vectors belonging to different classes; an auxiliary classifying unit configured to classify a class of an embedding vector by receiving the embedding vector of the major class original data sample, the embedding vector of the minor class original data sample, and the embedding vector of the minor class fake data sample from the embedding vector generating unit, and generate a similarity feedback signal indicating whether the minor class fake data sample is misclassified as the major class; and a classifying unit configured to determine whether input data is authentic by computing a similarity or distance metric between the embedding vector of the minor class original data sample and the embedding vector of the minor class fake data sample, and generate an authenticity feedback signal for updating generation parameters of the mode separating unit, wherein the mode separating unit is trained using both the similarity feedback signal and the authenticity feedback signal to adjust the generation parameters of the minor class fake data sample.
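
As an illustration only, the metric-learning objective and the distance-based authenticity determination recited in the independent claims can be sketched as follows. The claims do not name a specific loss or decision rule. The pairwise contrastive loss, the Euclidean distance, the `margin` and `threshold` values, and all function and variable names below are assumptions chosen for this sketch, not the patented implementation.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def contrastive_loss(emb_a, emb_b, same_class, margin=1.0):
    """Pairwise contrastive loss (one common metric-learning objective).

    Same-class pairs are penalised by their squared distance, which pulls
    them together. Different-class pairs are penalised only while they are
    closer than `margin`, which pushes them apart.
    """
    d = euclidean(emb_a, emb_b)
    if same_class:
        return d ** 2
    return max(0.0, margin - d) ** 2

def authenticity_feedback(real_emb, fake_emb, threshold=1.0):
    """Return +1 (looks authentic) or -1 (negative feedback).

    Assumption: a fake sample counts as authentic when its embedding lies
    within `threshold` of a minor class original embedding.
    """
    return 1 if euclidean(real_emb, fake_emb) <= threshold else -1

# Hypothetical 2-D embeddings, for illustration only.
minor_original = [0.0, 0.0]
minor_fake = [0.3, 0.4]       # distance 0.5 from minor_original
major_original = [3.0, 4.0]   # distance 5.0 from minor_original

loss_same = contrastive_loss(minor_original, minor_fake, same_class=True)
loss_diff = contrastive_loss(minor_original, major_original, same_class=False)
fb_near = authenticity_feedback(minor_original, minor_fake)
fb_far = authenticity_feedback(minor_original, major_original)
```

In this toy setting the different-class pair is already farther apart than the margin, so it contributes no loss, while the same-class pair still contributes a small loss that would pull the two embeddings together.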

Description

CROSS-REFERENCE TO RELATED APPLICATION AND CLAIM OF PRIORITY

This application claims the benefit under 35 USC § 119 of Korean Patent Application No. 10-2021-0138538, filed on Oct. 18, 2021 in the Korean Intellectual Property Office, the entire disclosure of which is incorporated herein by reference for all purposes.

BACKGROUND

1. Field

Disclosed embodiments relate to techniques for data augmentation.

2. Description of Related Art

Class imbalance refers to a phenomenon in which the amount of data belonging to one class is insufficient compared to other classes in a classification problem. Class imbalance causes a problem in generating a classifying unit for classifying classes using machine learning models and deep learning-based models. In order to solve this problem, various oversampling techniques for augmenting minor class data have been proposed. One of the most well-known and widely used oversampling techniques is the synthetic minority over-sampling technique (SMOTE), proposed in 2002, and various modifications of it have been proposed since. However, while SMOTE is effective in augmenting minor classes in a data set composed of only continuous variables, it is limited on a mixed data set in which continuous variables and nominal variables are mixed.

SUMMARY

Disclosed embodiments are intended to provide a method and apparatus for data augmentation.

According to an aspect of the present disclosure, an apparatus for data augmentation may include: one or more processors; and a memory storing one or more programs which are configured to be executed by the one or more processors and include instructions for a mode separating unit, an embedding vector generating unit, an auxiliary classifying unit and a classifying unit.
The mode separating unit is configured to generate minor class fake data from a latent vector; the embedding vector generating unit is configured to generate an embedding vector for each of major class original data, minor class original data, and minor class fake data, through a metric network; the auxiliary classifying unit is configured to classify a class of an embedding vector by receiving the embedding vector of the major class original data, the embedding vector of the minor class original data, and the embedding vector of the minor class fake data, and to feed the classification result back to the mode separating unit; and the classifying unit is configured to determine whether the input data is authentic by receiving the embedding vector of the minor class original data and the embedding vector of the minor class fake data from the embedding vector generating unit, and to feed the determination result back to the mode separating unit.

The mode separating unit may include: a plurality of generating units generating different fake data from a latent vector using a generative adversarial network; and a gating network selecting one of the plurality of fake data generated by each of the plurality of generating units as minor class fake data.

Among the plurality of generating units, a generating unit generating the fake data selected by the gating network may be trained so that the fake data is different from the major class original data, based on feedback, received from the auxiliary classifying unit, on the minor class fake data corresponding to the selected fake data.

Among the plurality of generating units, a generating unit generating the fake data selected by the gating network may be trained so that the fake data is similar to the minor class original data, based on feedback, received from the classifying unit, on the minor class fake data corresponding to the selected fake data.
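
The arrangement described above, several generating units feeding a gating network that picks one output as the minor class fake data, can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the generating units are stand-in affine maps rather than trained generative adversarial networks, the gate logits are hard-coded rather than learned, and selecting the highest softmax weight is an assumed selection rule. All names (`make_generator`, `mode_separating_unit`, `gate_logits`) are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def make_generator(scale, shift):
    """Stand-in for one generating unit: an affine map of the latent vector.

    A real generating unit would be a trained neural network.
    """
    def generate(latent):
        return [scale * z + shift for z in latent]
    return generate

def mode_separating_unit(latent, generators, gate_logits):
    """Generate one candidate per generating unit and let the gate pick one.

    Returns the index of the chosen unit, the chosen fake sample, and all
    candidates. Only the chosen unit would receive feedback during training.
    """
    candidates = [g(latent) for g in generators]
    weights = softmax(gate_logits)
    chosen = max(range(len(weights)), key=weights.__getitem__)
    return chosen, candidates[chosen], candidates

# Hypothetical setup: three generating units and fixed gate outputs.
generators = [
    make_generator(1.0, 0.0),
    make_generator(0.5, 2.0),
    make_generator(-1.0, 1.0),
]
gate_logits = [0.1, 1.5, -0.3]
latent = [0.2, -0.4]

chosen, fake, candidates = mode_separating_unit(latent, generators, gate_logits)
```

Returning the index of the chosen unit alongside the sample makes it straightforward to route the feedback from the auxiliary classifying unit and the classifying unit to the generating unit that actually produced the selected fake data.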
The gating network may be trained to select fake data similar to the minor class original data, based on the feedback received from the auxiliary classifying unit.

The plurality of generating units may generate fake data corresponding to one or more sub-classes of a minor class from a latent vector.

The embedding vector may be generated so that data included in the same category are close to each other and data included in different categories are distant from each other through metric learning.

When the embedding vector of the minor class fake data is classified into the embedding vector of the major class original data or the embedding vector of the minor class fake data, the auxiliary classifying unit may transmit negative feedback to the mode separating unit.

When the embedding vector of the minor class fake data is determined to be fake, the classifying unit may transmit negative feedback to the mode separating unit.

According to an aspect of the present disclosure, a method for data augmentation, performed on a computing device including one or more processors and a memory storing one or more programs executed by the one or more processors, may include: a mode separation operation generating minor class fake data from a latent vector; an embedding vector generating operation generating an embedding vecto