US-12619783-B2 - Automatic pseudonymization technique recommendation method using artificial intelligence

Abstract

Provided is an automatic pseudonymization technique recommendation method using artificial intelligence, including: receiving a training dataset including a plurality of pieces of data, each including a column name, a data type, and a pseudonymization technique recommendation; adding a data type vector obtained from the data type to a word vector obtained corresponding to the column name to obtain a numeric vector, and labeling the numeric vector with the pseudonymization technique recommendation to generate a plurality of pieces of training data; training a first learning model using the plurality of pieces of training data so that the first learning model outputs a pseudonymization technique recommendation in response to an input of the numeric vector; obtaining a numeric vector corresponding to each column of the dataset to be pseudonymized; and inputting the obtained numeric vector into the trained first learning model to obtain a pseudonymization technique recommendation for each column of the dataset.
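
The construction of the labeled training data described above can be illustrated with a minimal sketch. It assumes that "adding" the data type vector to the word vector means appending (concatenating) it, which matches the "connecting" wording of claim 1. The hashed character trigram function is only a toy stand-in for a trained word embedding model, and the names `word_vector`, `data_type_vector`, `numeric_vector`, `build_training_data`, `DATA_TYPES`, and `DIM`, as well as the sample rows, are hypothetical.

```python
import hashlib

DATA_TYPES = ["numeric", "character"]  # assumed type vocabulary for this sketch
DIM = 8  # hypothetical word-vector size


def word_vector(column_name, dim=DIM):
    """Toy stand-in for a word embedding: normalized counts of hashed character trigrams."""
    padded = f"<{column_name.lower()}>"
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        gram = padded[i:i + 3]
        bucket = int(hashlib.md5(gram.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    total = sum(vec) or 1.0  # avoid division by zero for an empty name
    return [v / total for v in vec]


def data_type_vector(data_type):
    """One-hot encoding of the data type (an assumed encoding)."""
    return [1.0 if data_type == t else 0.0 for t in DATA_TYPES]


def numeric_vector(column_name, data_type):
    """Append the data type vector to the word vector."""
    return word_vector(column_name) + data_type_vector(data_type)


def build_training_data(rows):
    """Label each numeric vector with its pseudonymization technique recommendation."""
    return [(numeric_vector(name, dtype), technique) for name, dtype, technique in rows]


# Illustrative rows only: (column name, data type, recommended technique).
rows = [
    ("name", "character", "masking"),
    ("phone", "character", "bidirectional encryption"),
    ("age", "numeric", "heuristic pseudonymization"),
]
training_data = build_training_data(rows)
```

Each element of `training_data` is a pair of a 10-dimensional numeric vector (8 word-vector components followed by 2 data-type components) and its label, which is the form a classifier would be trained on.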

Inventors

Gi Chang SHIM
Yong Kyu Park

Assignees

EASYCERTI INC.

Dates

Publication Date: 2026-05-05
Application Date: 2024-08-22
Priority Date: 2023-08-25

Claims (6)

1. An automatic pseudonymization technique recommendation method using artificial intelligence, wherein the method is implemented on a computing device comprising a processor, and a memory storing instructions or programs executable by the processor, and comprises:
receiving, at the computer device, a training dataset including a plurality of pieces of data including a column name, a data type, and a pseudonymization technique recommendation;
connecting, at the computer device, for each of the plurality of pieces of data, a data type vector obtained from the data type with a word vector obtained corresponding to the column name to obtain a numeric vector, and labeling the numeric vector with the pseudonymization technique recommendation to generate a plurality of pieces of training data;
training, at the computer device, a first learning model using the plurality of pieces of training data so that the first learning model outputs a pseudonymization technique recommendation in response to an input of the numeric vector;
obtaining, at the computer device, a numeric vector corresponding to each column of the dataset to be pseudonymized;
inputting, at the computer device, the numeric vector obtained corresponding to each column of the dataset to be pseudonymized into the trained first learning model to obtain a pseudonymization technique recommendation for each column of the dataset to be pseudonymized; and
performing pseudonymization on the dataset to be pseudonymized based on the pseudonymization technique recommendation obtained for each column of the dataset to be pseudonymized,
wherein the plurality of pieces of data included in the training dataset further includes industry classification information;
the numeric vector obtained when generating the training data further includes an industry classification vector obtained from industry classification information; and
the obtaining the numeric vector corresponding to each column of the dataset to be pseudonymized includes:
inputting a column name corresponding to each column of the dataset to be pseudonymized into a second learning model to obtain a word vector of each column of the dataset to be pseudonymized;
obtaining a data type vector corresponding to the data type of each column of the dataset to be pseudonymized;
obtaining an industry classification vector corresponding to industry classification information previously set for the dataset to be pseudonymized; and
connecting the data type vector and the industry classification vector with the word vector obtained for each column of the dataset to be pseudonymized to obtain a numeric vector corresponding to each column of the dataset to be pseudonymized,
wherein the data type includes at least a numeric type and a character type.
2. The method of claim 1, further comprising training or retraining the second learning model using a plurality of column names obtained from the training dataset.
3. The method of claim 2, wherein the first learning model is a decision tree model, and the second learning model is a word embedding model.
4. The method of claim 3, wherein the word embedding model is a FastText model, and the decision tree model is a Gradient Boosting Trees model.
5. A non-transitory computer-readable recording medium recording a program for executing the automatic pseudonymization technique recommendation method using artificial intelligence as set forth in claim 1.
6. A computing device, comprising: a processor; and a memory storing instructions or programs executable by the processor, wherein, when the instructions or programs are executed by the processor, the automatic pseudonymization technique recommendation method using artificial intelligence as set forth in claim 1 is executed.
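
The flow recited in claim 1 (build numeric vectors, train the first learning model, obtain a recommendation per column, then pseudonymize) can be sketched end to end as follows. This is a dependency-free illustration under loud assumptions, not the claimed implementation: the hashed trigram `word_vector` stands in for the second learning model (a FastText model in claim 4), and `NearestNeighbourModel` is a deliberately simple substitute for the first learning model, not a Gradient Boosting Trees model. The industry labels, the technique names "hashing" and "keep", and all function and variable names are hypothetical.

```python
import hashlib

DATA_TYPES = ["numeric", "character"]
INDUSTRIES = ["finance", "healthcare"]  # hypothetical industry classes


def one_hot(value, vocabulary):
    return [1.0 if value == v else 0.0 for v in vocabulary]


def word_vector(column_name, dim=8):
    """Stand-in for the second learning model: hashed character trigrams."""
    padded = f"<{column_name.lower()}>"
    vec = [0.0] * dim
    for i in range(len(padded) - 2):
        digest = hashlib.md5(padded[i:i + 3].encode("utf-8")).hexdigest()
        vec[int(digest, 16) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def numeric_vector(column_name, data_type, industry):
    """Connect the data type and industry vectors with the word vector."""
    return (word_vector(column_name)
            + one_hot(data_type, DATA_TYPES)
            + one_hot(industry, INDUSTRIES))


class NearestNeighbourModel:
    """Simple substitute for the first learning model (NOT gradient boosting)."""

    def fit(self, vectors, labels):
        self.examples = list(zip(vectors, labels))
        return self

    def predict(self, vector):
        def sq_dist(example):
            return sum((a - b) ** 2 for a, b in zip(example[0], vector))
        return min(self.examples, key=sq_dist)[1]


def mask(value):
    """Keep the first character and replace the rest with asterisks."""
    text = str(value)
    return text[:1] + "*" * (len(text) - 1)


def one_way_hash(value):
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


# Hypothetical mapping from recommendation label to a transformation.
TECHNIQUES = {"masking": mask, "hashing": one_way_hash, "keep": lambda v: v}

# Illustrative training rows: (column name, data type, industry, recommendation).
training_rows = [
    ("name", "character", "finance", "masking"),
    ("customer_name", "character", "finance", "masking"),
    ("account_no", "character", "finance", "hashing"),
    ("balance", "numeric", "finance", "keep"),
]
model = NearestNeighbourModel().fit(
    [numeric_vector(n, t, i) for n, t, i, _ in training_rows],
    [label for *_, label in training_rows],
)

# Dataset to be pseudonymized, with its industry classification set in advance.
dataset = {"name": ["Alice", "Bob"], "balance": [120, 75]}
column_types = {"name": "character", "balance": "numeric"}
recommendations = {
    col: model.predict(numeric_vector(col, column_types[col], "finance"))
    for col in dataset
}
pseudonymized = {
    col: [TECHNIQUES[recommendations[col]](v) for v in values]
    for col, values in dataset.items()
}
```

With these toy rows, the `name` column is recommended "masking" (so "Alice" becomes "A****") and the `balance` column is recommended "keep". In practice the two stand-ins would be replaced by a trained word embedding model and a trained tree-based classifier.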

Description

CROSS-REFERENCE TO RELATED APPLICATION

This application claims priority to Korean Patent Application No. 10-2023-0111703, filed in the Korean Intellectual Property Office on Aug. 25, 2023, the entire contents of which are hereby incorporated by reference.

BACKGROUND

Field

The present disclosure relates to a pseudonymization technique recommendation method, and more particularly to a pseudonymization technique recommendation method using artificial intelligence.

Description of Related Art

In the related art, the selection of a pseudonymization technique has been determined by the common sense, learning, or preferences of an individual user. Different pseudonymization techniques are selected and used based on users' experience and preferences; for example, for a column “name”, some users choose masking, some choose bidirectional encryption, some choose a heuristic pseudonymization technique, and so on.

The problem with the related method of selecting pseudonymization techniques is that the selection is heavily biased toward the experience and taste of individual users, so that every change in the personnel in charge leads to a burdensome job handover process that consumes human resources and time. In addition, it is cumbersome to look up the entire processing history to find the techniques previously used for a particular column.

SUMMARY

In order to solve one or more problems (e.g., the problems described above and/or other problems not explicitly described herein), the present disclosure provides an automatic pseudonymization technique recommendation method using artificial intelligence.
According to some aspects of the disclosure, an automatic pseudonymization technique recommendation method using artificial intelligence implemented on a computing device may include receiving a training dataset including a plurality of pieces of data including a column name, a data type, and a pseudonymization technique recommendation, adding, for each of the plurality of pieces of data, a data type vector obtained from the data type to a word vector obtained corresponding to the column name to obtain a numeric vector, and labeling the numeric vector with the pseudonymization technique recommendation to generate a plurality of pieces of training data, training a first learning model using the plurality of pieces of training data so that the first learning model outputs a pseudonymization technique recommendation in response to an input of the numeric vector, obtaining a numeric vector corresponding to each column of the dataset to be pseudonymized, and inputting the numeric vector obtained corresponding to each column of the dataset to be pseudonymized into the trained first learning model to obtain a pseudonymization technique recommendation for each column of the dataset to be pseudonymized.

The obtaining the numeric vector corresponding to each column of the dataset to be pseudonymized may include inputting a column name corresponding to each column of the dataset to be pseudonymized into a second learning model to obtain a word vector of each column of the dataset to be pseudonymized, obtaining a data type vector corresponding to the data type of each column of the dataset to be pseudonymized, and adding the data type vector to the word vector obtained for each column of the dataset to be pseudonymized to obtain a numeric vector corresponding to each column of the dataset to be pseudonymized.

The plurality of pieces of data included in the training dataset may further include industry classification information.
The numeric vector obtained when generating the training data may further include an industry classification vector obtained from industry classification information.

The obtaining the numeric vector corresponding to each column of the dataset to be pseudonymized may include inputting a column name corresponding to each column of the dataset to be pseudonymized into a second learning model to obtain a word vector of each column of the dataset to be pseudonymized, obtaining a data type vector corresponding to the data type of each column of the dataset to be pseudonymized, obtaining an industry classification vector corresponding to industry classification information previously set for the dataset to be pseudonymized, and adding the data type vector and the industry classification vector to the word vector obtained for each column of the dataset to be pseudonymized to obtain a numeric vector corresponding to each column of the dataset to be pseudonymized.

The method may further include training or retraining the second learning model using a plurality of column names obtained from the training dataset.

The first learning model may be a decision tree model. The second learning model may be a word embedding model. The word embedding model may be a FastText model. The decision tree model may be a Gradient Boosting Trees model.

The method may further include performing pseudonymization on the dataset to be pseudonymized