KR-20260067299-A - METHOD AND SYSTEM FOR DATA AUGMENTATION FOR LEARNING FROM TABULAR DATA

Abstract

The present invention relates to a data augmentation method and system for learning tabular data. More specifically, the present invention relates to a self-attention mechanism-based data augmentation method and system for contrastive learning of tabular data.

Inventors

어문정
이경은
서민국
조혜승
심예슬
임우형

Assignees

주식회사 LG 경영개발원

Dates

Publication Date: 20260512
Application Date: 20250731
Priority Date: 20241105

Claims (20)

A computer-implemented data augmentation method for learning tabular data, the method comprising: specifying table data that includes a plurality of different columns and at least one record having a value corresponding to each of the plurality of columns; calculating an importance of each of the plurality of columns based on associations between the plurality of columns included in the table data; generating augmented table data by augmenting the table data based on the calculated importance; inputting the table data and the augmented table data into an encoder, respectively; obtaining, from the encoder, a first embedding vector corresponding to the table data; obtaining, from the encoder, a second embedding vector corresponding to the augmented table data; and training a target model using the first embedding vector and the second embedding vector.
The method of claim 1, further comprising inputting the table data into a pre-configured specific module in order to calculate the importance of each of the plurality of columns included in the table data, wherein, in the generating of the augmented table data, the importance of each of the plurality of columns is calculated using attention scores for each of the plurality of columns obtained from the specific module, and the table data is augmented to generate the augmented table data based on the importance of each of the plurality of columns calculated from the attention scores.
The method of claim 2, wherein the specific module calculates, based on the associations between the plurality of columns included in the table data, an attention score for each of the plurality of columns that serves as a criterion for calculating the importance of each of the plurality of columns, and generates the augmented table data by selectively performing augmentation on at least some of the plurality of columns based on the importance of each of the plurality of columns calculated from the attention scores.
The method of claim 3, wherein the importance of each of the plurality of columns is calculated by averaging the attention scores calculated by the specific module.
The method of claim 4, further comprising selecting, based on the importance of each of the plurality of columns, at least some of the plurality of columns as specific columns to be augmented, wherein the specific module randomly selects one of a plurality of preset augmentation techniques and generates the augmented table data by performing augmentation on the specific columns using the selected augmentation technique.
The method of claim 5, wherein, in the selecting of the specific columns, first columns satisfying a first criterion and second columns satisfying a second criterion are each identified among the plurality of columns based on the importance of each of the plurality of columns, and at least some of the first columns and the second columns are selected as the specific columns.
The method of claim 6, wherein the specific columns include at least some of the second columns satisfying the second criterion, and the specific module generates the augmented table data by performing augmentation on the at least some of the second columns using the augmentation technique selected from among the plurality of augmentation techniques.
The method of claim 7, wherein, in the generating of the augmented table data, at least some of the second columns satisfying the second criterion are selected as the specific columns to be augmented based on a preset selection ratio criterion, and the augmented table data is generated by performing augmentation on the specific columns selected according to the preset selection ratio criterion.
The method of claim 5, wherein the generating of the augmented table data comprises performing augmentation on the specific columns selected based on the attention scores to generate the augmented table data.
The method of claim 6, wherein the generating of the augmented table data comprises performing augmentation on at least some of the second columns satisfying the second criterion so that the structure of the first columns satisfying the first criterion is maintained.
The method of claim 6, wherein, for each training epoch of the target model, one of the plurality of augmentation techniques is randomly selected and applied to the specific columns to generate the augmented table data.
The method of claim 11, wherein the specific module randomly selects one of the plurality of augmentation techniques for each training epoch of the target model and generates the augmented table data by performing augmentation on the specific columns using the selected augmentation technique.
The method of claim 1, further comprising defining a loss function for training the target model using at least one of the first embedding vector corresponding to the table data and the second embedding vector corresponding to the augmented table data.
The method of claim 13, further comprising: inputting the first embedding vector and the second embedding vector, respectively, into a projection head; and obtaining, from the projection head, a first projection vector corresponding to the first embedding vector and a second projection vector corresponding to the second embedding vector, wherein the loss function is defined using the first projection vector and the second projection vector.
The method of claim 13, wherein the loss function is defined so that training proceeds in a direction that maximizes the similarity between the first embedding vector corresponding to the table data and the second embedding vector corresponding to the augmented table data, and, in the training of the target model, contrastive learning is performed on the target model using the loss function.
The method of claim 15, further comprising: obtaining, based on the contrastive learning, a model trained contrastively with the loss function; and performing fine-tuning on the contrastively trained model.
The method of claim 1, further comprising: performing binning on the table data; obtaining a plurality of segmented table data as a result of the binning; and training the target model using the plurality of segmented table data.
The method of claim 17, wherein: the importance of each of the plurality of columns is calculated based on the associations between the plurality of columns included in the table data; the table data is augmented based on the calculated importance to generate the augmented table data; the plurality of segmented table data and the augmented table data are each input into the encoder; a plurality of embedding vectors respectively corresponding to the plurality of segmented table data are obtained from the encoder; an embedding vector corresponding to the augmented table data is obtained from the encoder; and the target model is trained using the plurality of embedding vectors corresponding to the plurality of segmented table data and the embedding vector corresponding to the augmented table data.
A data augmentation system for learning tabular data, the system comprising a memory configured to store executable instructions and one or more processors configured to perform operations by executing the instructions, wherein the system: specifies table data that includes a plurality of different columns and at least one record having a value corresponding to each of the plurality of columns; calculates an importance of each of the plurality of columns based on associations between the plurality of columns included in the table data; generates augmented table data by augmenting the table data based on the calculated importance; inputs the table data and the augmented table data into an encoder, respectively; obtains, from the encoder, a first embedding vector corresponding to the table data; obtains, from the encoder, a second embedding vector corresponding to the augmented table data; and trains a target model using the first embedding vector and the second embedding vector.
A program stored on a computer-readable recording medium and executed by one or more processors in an electronic device, the program comprising instructions for performing: specifying table data that includes a plurality of different columns and at least one record having a value corresponding to each of the plurality of columns; calculating an importance of each of the plurality of columns based on associations between the plurality of columns included in the table data; generating augmented table data by augmenting the table data based on the calculated importance; inputting the table data and the augmented table data into an encoder, respectively; obtaining, from the encoder, a first embedding vector corresponding to the table data; obtaining, from the encoder, a second embedding vector corresponding to the augmented table data; and training a target model using the first embedding vector and the second embedding vector.
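
As a rough illustration of the column selection recited in the dependent claims (attention scores averaged into an importance value, columns split by importance, a selection ratio, and one randomly chosen augmentation technique), the following Python sketch may help. The median threshold used to separate the "second columns", the two techniques `shuffle_column` and `mean_mask_column`, and all function names are assumptions introduced here for illustration only; the claims do not specify the criteria or the techniques.

```python
import random
from statistics import mean, median


def column_importance(attention):
    """Average the attention each column receives.

    attention[i][j] is the attention score from column i to column j.
    """
    n_cols = len(attention[0])
    return [mean(row[j] for row in attention) for j in range(n_cols)]


def select_columns(importance, ratio, rng):
    """Pick a fraction `ratio` of the low-importance ("second") columns."""
    threshold = median(importance)  # assumed criterion, not from the source
    second = [j for j, score in enumerate(importance) if score < threshold]
    k = max(1, round(ratio * len(second))) if second else 0
    return sorted(rng.sample(second, k))


def shuffle_column(records, col, rng):
    """Hypothetical technique: permute one column's values across records."""
    values = [r[col] for r in records]
    rng.shuffle(values)
    return [r[:col] + [v] + r[col + 1:] for r, v in zip(records, values)]


def mean_mask_column(records, col, rng):
    """Hypothetical technique: replace one column with its mean value.

    `rng` is unused; it is kept so both techniques share one signature.
    """
    m = mean(r[col] for r in records)
    return [r[:col] + [m] + r[col + 1:] for r in records]


TECHNIQUES = [shuffle_column, mean_mask_column]


def augment(records, attention, ratio=0.5, seed=0):
    """Return (augmented records, indices of the augmented columns)."""
    rng = random.Random(seed)
    importance = column_importance(attention)
    targets = select_columns(importance, ratio, rng)
    technique = rng.choice(TECHNIQUES)  # one technique per call (e.g. per epoch)
    out = [list(r) for r in records]    # copy so the input is not mutated
    for col in targets:
        out = technique(out, col, rng)
    return out, targets
```

Calling `augment` once per training epoch with a different `seed` would mimic the per-epoch random choice of technique; high-importance columns are never touched, which is one way to keep their structure intact.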

Description

Method and System for Data Augmentation for Learning from Tabular Data
The present invention relates to a data augmentation method and system for learning tabular data. More specifically, the present invention relates to a self-attention mechanism-based data augmentation method and system for contrastive learning of tabular data.
Tabular data consists of rows and columns and is used in various fields such as finance, medicine, manufacturing, healthcare, marketing, and research. Despite this ubiquity, deep learning research has paid relatively little attention to tabular data compared with fields such as computer vision or natural language processing.
Recently, self-supervised learning (SSL) has attracted attention as a promising pre-training method for tabular data. Through self-supervised learning, models can extract meaningful features and patterns from unlabeled data and apply them to various downstream tasks. Such self-supervised learning generally relies on contrastive learning, a method that exposes the model to various variations of the input data through data augmentation.
Contrastive learning is emerging as a powerful self-supervised learning framework and has been successful in various fields. One of its key elements is the generation of positive samples through data augmentation, which aims to introduce changes while preserving the intrinsic characteristics of the original data.
However, despite this potential, applying contrastive learning to tabular data remains relatively under-explored and challenging. Specifically, tabular data can contain hundreds of features with complex interactions, and failure to process them effectively can lead to errors in critical decision-making processes.
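The contrastive objective outlined above, in which a record and its augmented view form a positive pair whose similarity is maximized, can be illustrated with a minimal sketch. The formulas of the invention (FIGS. 5 to 7) are not reproduced in this text, so the sketch uses a generic InfoNCE-style loss as a stand-in; the function names, the cosine similarity, and the `temperature` parameter are assumptions, and the input vectors stand for encoder or projection head outputs.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(originals, augmented, temperature=0.5):
    """InfoNCE-style loss (assumed form, not taken from the source).

    originals[i] and augmented[i] are the vectors of a record and of its
    augmented view; every other augmented vector acts as a negative.
    """
    total = 0.0
    for i, z in enumerate(originals):
        logits = [cosine(z, a) / temperature for a in augmented]
        log_denominator = math.log(sum(math.exp(x) for x in logits))
        total += log_denominator - logits[i]  # -log softmax of the positive
    return total / len(originals)


# Usage: matched pairs should yield a lower loss than mismatched pairs.
first = [[1.0, 0.0], [0.0, 1.0]]
second = [[0.9, 0.1], [0.1, 0.9]]
matched = contrastive_loss(first, second)
mismatched = contrastive_loss(first, list(reversed(second)))
```

Minimizing this quantity pushes each original vector toward its own augmented view and away from the views of other records, which is the sense in which augmentation quality matters: a view that destroys important column relationships makes a poor positive sample.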
In particular, because tabular data, unlike images or text, lacks a spatial or sequential structure, applying conventional data augmentation techniques without modification risks distorting important relationships or compromising their meaning. For instance, conventional augmentation methods based on randomness may overlook interactions between important features in tabular data, potentially degrading model performance. Therefore, given the structured and heterogeneous nature of tabular data, a new augmentation technique is required that allows contrastive learning to be applied effectively, producing meaningful results while preserving the core structure of the data.
FIG. 1 is a conceptual diagram illustrating a data augmentation system for learning tabular data according to the present invention.
FIG. 2 is a flowchart illustrating a data augmentation method for learning tabular data according to the present invention.
FIGS. 3a, 3b, 4a, and 4b are conceptual diagrams illustrating a data augmentation method for learning tabular data according to the present invention.
FIGS. 5, 6, and 7 are formulas related to a data augmentation method for learning tabular data according to the present invention.
FIGS. 8 and 9 are tables showing an example of the performance of the data augmentation method according to the present invention and the training results of an artificial intelligence model trained using the learning method according to the present invention.
Hereinafter, embodiments disclosed in this specification will be described in detail with reference to the attached drawings. Identical or similar components are assigned the same reference numerals regardless of the figure number, and redundant descriptions thereof will be omitted.
The suffixes "module" and "part" used for components in the following description are assigned or used interchangeably solely for ease of drafting the specification and do not in themselves have distinct meanings or roles. Furthermore, in describing the embodiments disclosed in this specification, if it is determined that a detailed description of related prior art could obscure the essence of the embodiments, such detailed description will be omitted. Additionally, the attached drawings are intended only to facilitate understanding of the embodiments disclosed in this specification; the technical concept disclosed in this specification is not limited by the attached drawings and should be understood to include all modifications, equivalents, and substitutes that fall within the spirit and technical scope of the present invention.
Terms including ordinal numbers, such as first and second, may be used to describe various components, but the components are not limited by these terms. These terms are used solely to distinguish one component from another.
When it is stated that one component is "connected" or "coupled" to another component, it should be understood that it may be directly connected or coupled to that other component, but that other components may also be present in between. On the other hand, when it is stated t