US-12619755-B2 - Artificial data generation for differential privacy
Abstract
An embodiment configures a plurality of parameters, the parameters being usable to generate artificial data from original data, the configuring adjusting a level of privacy in the artificial data. An embodiment fits a distribution type to a variable of the original data. An embodiment adjusts, using a desired level of privacy and the distribution type, a level of noise, wherein the level of noise corresponds to the desired level of privacy. An embodiment generates, using the distribution type and the level of noise, the artificial data, the artificial data achieving the desired level of privacy by including noise data corresponding to the level of noise.
Inventors
- Si Er Han
- Jing Xu
- Xiao Ming Ma
- Jing James Xu
- Jiang Bo Kang
- Xue Ying Zhang
- Jun Wang
- Ji Hui Yang
Assignees
- INTERNATIONAL BUSINESS MACHINES CORPORATION
Dates
- Publication Date
- 2026-05-05
- Application Date
- 2023-10-20
Claims (20)
- 1. A computer-implemented method comprising: generating artificial data for differential privacy datasets and having a configurable degree of privacy protection, the generating comprising: configuring a plurality of parameters, the parameters being usable to generate artificial data from original data, the configuring adjusting a level of privacy in the artificial data; fitting a distribution type to a variable of the original data; executing, as a part of an artificial data generation application, an adjustment code configured to perform adjusting, using a desired level of privacy and the distribution type, a level of noise, wherein the level of noise corresponds to the desired level of privacy, wherein executing the adjustment code sets the level of noise based on a local sensitivity, the local sensitivity being the largest difference between two analysis results of two corresponding datasets wherein the two corresponding datasets are the same except for one record; generating from an execution of the artificial data generation application, using the distribution type and the level of noise, the artificial data, the artificial data achieving the desired level of privacy by including noise data corresponding to the level of noise; and regenerating, by changing at least one of the plurality of parameters, and responsive to determining that a similarity between an original value of the variable and a generated value of the variable is less than a threshold value, new artificial data.
- 2. The computer-implemented method of claim 1, wherein configuring the plurality of parameters comprises setting an upper bound parameter of a continuous variable comprising the original data to a first value according to a statistical characteristic of the continuous variable.
- 3. The computer-implemented method of claim 1, wherein configuring the plurality of parameters comprises setting a lower bound parameter of a continuous variable comprising the original data to a second value according to a statistical characteristic of the continuous variable.
- 4. The computer-implemented method of claim 1, wherein the variable contributes to a privacy aspect of the original data.
- 5. The computer-implemented method of claim 1, wherein fitting a distribution type to the variable of the original data further comprises: selecting, from a plurality of distribution type fittings according to a goodness of fit statistic computed on each distribution type fitting, the distribution type.
- 6. The computer-implemented method of claim 1, wherein the desired level of privacy is higher than a level of privacy in the original data.
- 7. A computer program product comprising one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions executable by a processor to cause the processor to perform operations comprising: generating artificial data for differential privacy datasets and having a configurable degree of privacy protection, the generating comprising: configuring a plurality of parameters, the parameters being usable to generate artificial data from original data, the configuring adjusting a level of privacy in the artificial data; fitting a distribution type to a variable of the original data; executing, as a part of an artificial data generation application, an adjustment code configured to perform adjusting, using a desired level of privacy and the distribution type, a level of noise, wherein the level of noise corresponds to the desired level of privacy, wherein executing the adjustment code sets the level of noise based on a local sensitivity, the local sensitivity being the largest difference between two analysis results of two corresponding datasets wherein the two corresponding datasets are the same except for one record; generating from an execution of the artificial data generation application, using the distribution type and the level of noise, the artificial data, the artificial data achieving the desired level of privacy by including noise data corresponding to the level of noise; and regenerating, by changing at least one of the plurality of parameters, and responsive to determining that a similarity between an original value of the variable and a generated value of the variable is less than a threshold value, new artificial data.
- 8. The computer program product of claim 7, wherein the stored program instructions are stored in a computer readable storage device in a data processing system, and wherein the stored program instructions are transferred over a network from a remote data processing system.
- 9. The computer program product of claim 7, wherein the stored program instructions are stored in a computer readable storage device in a server data processing system, and wherein the stored program instructions are downloaded in response to a request over a network to a remote data processing system for use in a computer readable storage device associated with the remote data processing system, further comprising: program instructions to meter use of the program instructions associated with the request; and program instructions to generate an invoice based on the metered use.
- 10. The computer program product of claim 7, wherein configuring the plurality of parameters comprises setting an upper bound parameter of a continuous variable comprising the original data to a first value according to a statistical characteristic of the continuous variable.
- 11. The computer program product of claim 7, wherein configuring the plurality of parameters comprises setting a lower bound parameter of a continuous variable comprising the original data to a second value according to a statistical characteristic of the continuous variable.
- 12. The computer program product of claim 7, wherein the variable contributes to a privacy aspect of the original data.
- 13. The computer program product of claim 7, wherein fitting a distribution type to the variable of the original data further comprises: selecting, from a plurality of distribution type fittings according to a goodness of fit statistic computed on each distribution type fitting, the distribution type.
- 14. The computer program product of claim 7, wherein the desired level of privacy is higher than a level of privacy in the original data.
- 15. A computer system comprising a processor and one or more computer readable storage media, and program instructions collectively stored on the one or more computer readable storage media, the program instructions executable by the processor to cause the processor to perform operations comprising: generating artificial data for differential privacy datasets and having a configurable degree of privacy protection, the generating comprising: configuring a plurality of parameters, the parameters being usable to generate artificial data from original data, the configuring adjusting a level of privacy in the artificial data; fitting a distribution type to a variable of the original data; executing, as a part of an artificial data generation application, an adjustment code configured to perform adjusting, using a desired level of privacy and the distribution type, a level of noise, wherein the level of noise corresponds to the desired level of privacy, wherein executing the adjustment code sets the level of noise based on a local sensitivity, the local sensitivity being the largest difference between two analysis results of two corresponding datasets wherein the two corresponding datasets are the same except for one record; generating from an execution of the artificial data generation application, using the distribution type and the level of noise, the artificial data, the artificial data achieving the desired level of privacy by including noise data corresponding to the level of noise; and regenerating, by changing at least one of the plurality of parameters, and responsive to determining that a similarity between an original value of the variable and a generated value of the variable is less than a threshold value, new artificial data.
- 16. The computer system of claim 15, wherein configuring the plurality of parameters comprises setting an upper bound parameter of a continuous variable comprising the original data to a first value according to a statistical characteristic of the continuous variable.
- 17. The computer system of claim 15, wherein configuring the plurality of parameters comprises setting a lower bound parameter of a continuous variable comprising the original data to a second value according to a statistical characteristic of the continuous variable.
- 18. The computer system of claim 15, wherein the variable contributes to a privacy aspect of the original data.
- 19. The computer system of claim 15, wherein fitting a distribution type to the variable of the original data further comprises: selecting, from a plurality of distribution type fittings according to a goodness of fit statistic computed on each distribution type fitting, the distribution type.
- 20. The computer system of claim 15, wherein the desired level of privacy is higher than a level of privacy in the original data.
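The local sensitivity recited in the independent claims above, the largest difference between two analysis results on datasets that are identical except for one record, can be illustrated with a short sketch. This is only an illustrative computation for the mean, assuming a "remove one record" definition of neighboring datasets; the function name and neighbor definition are hypothetical, not taken from the claims.

```python
def local_sensitivity_mean(dataset):
    """Local sensitivity of the mean at this particular dataset: the largest
    change in the mean when exactly one record is removed (an assumed,
    illustrative neighbor definition)."""
    n = len(dataset)
    total = sum(dataset)
    full_mean = total / n
    # Try removing each record in turn and track the largest change in the mean.
    return max(abs(full_mean - (total - x) / (n - 1)) for x in dataset)

# A single outlier (100) dominates the local sensitivity of the mean.
sensitivity = local_sensitivity_mean([1, 2, 3, 100])
```

Note how an outlier inflates the local sensitivity: removing the value 100 changes the mean far more than removing any other record, which is why outliers affect the amount of noise required.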
Description
BACKGROUND
The present invention relates generally to data privacy. More particularly, the present invention relates to a method, system, and computer program for artificial data generation for differential privacy.
A dataset or database is a logical container used to organize and control access to resources such as stored data. A dataset typically includes one or more tables. A table stores data values using a model of labelled columns (also referred to as variables or fields) and rows (also referred to as records). A cell of the table is an intersection of a row and a column. Typically, column labels designate a particular type of data (for example, a table might have columns labelled “Customer ID”, “Name”, “Address”, and “Telephone Number”), and rows hold data for particular individuals (e.g., data for Customer A might be stored in row 1 and data for Customer B might be stored in row 2).
Data simulation, or artificial data generation, is a process of generating artificial data that mimics the characteristics and patterns of real-world data. Data simulation is often used to generate training and testing data for use in developing machine learning models and in other situations where insufficient real-world data is available for use. Data simulation is typically performed by fitting a parametric statistical distribution to the observed data, and generating new data points from the fitted distribution. However, statistical analyses of data in a dataset can reveal information about a single individual in the dataset, particularly if an adversary knows information about other individuals in the dataset. Thus, privacy preserving data analysis and data simulation techniques, which attempt to make a dataset usable for analysis or generate artificial data using statistical information about a dataset, without compromising the privacy of any individuals with records in the dataset, have been developed.
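The fit-and-sample approach to data simulation described above can be made concrete with a minimal sketch. This is a generic illustration, not the patented method: it assumes a single continuous, approximately normal variable and fits the distribution by simple parameter estimation.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one continuous column of real-world data.
observed = rng.normal(loc=50.0, scale=5.0, size=1000)

# "Fit" a normal distribution type by estimating its parameters from the data.
mu, sigma = observed.mean(), observed.std(ddof=1)

# Generate artificial records by sampling from the fitted distribution.
simulated = rng.normal(loc=mu, scale=sigma, size=1000)
```

Note that without added noise, the fitted parameters mu and sigma expose exact statistics of the original data, which is precisely the leakage that privacy-preserving techniques address.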
One method of implementing privacy preserving data analysis is differential privacy, which hides the presence of an individual in a dataset from a user of the dataset by making two output distributions, one computed with the individual's record and one computed without it, computationally indistinguishable for every individual in the dataset. To achieve this, differential privacy adds random noise to an output. In data simulation, the noise is added to one or more parameters of the fitted distribution. The amount of noise added to a parameter of a fitted distribution is influenced by a privacy budget parameter named epsilon (ε). A smaller value of epsilon corresponds to stronger privacy preservation. However, reducing epsilon also results in decreased accuracy.
The illustrative embodiments recognize that selecting an appropriate value for epsilon requires knowledge of the dataset being protected, but it is difficult for an inexperienced data analyst to relate a value of epsilon to a particular protection need. In addition, statistical characteristics of the dataset, such as outliers in the data, are both easier for an adversary to obtain by applying statistical analysis techniques to the dataset and can alter the value of epsilon needed to protect privacy. In addition, particularly for a dataset with multiple variables, it is important to determine that the quality of simulated data is sufficient for a user's needs. Thus, the illustrative embodiments recognize that there is a need to generate parameters for artificial data generation for differential privacy, as well as measure the quality of the resulting data, in an automated and dataset-dependent manner.
SUMMARY
The illustrative embodiments provide for artificial data generation for differential privacy. An embodiment includes configuring a plurality of parameters, the parameters being usable to generate artificial data from original data, the configuring adjusting a level of privacy in the artificial data.
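The epsilon/noise trade-off described above can be sketched with the standard Laplace mechanism, a textbook differential-privacy construction used here purely for illustration (the embodiments are not limited to it). A fitted parameter is perturbed with Laplace noise whose scale is sensitivity divided by epsilon, so shrinking epsilon directly enlarges the noise.

```python
import numpy as np

def laplace_scale(sensitivity: float, epsilon: float) -> float:
    # Laplace mechanism: noise scale b = sensitivity / epsilon. A smaller
    # epsilon (stronger privacy) yields a larger scale and lower accuracy.
    return sensitivity / epsilon

def privatize(value: float, sensitivity: float, epsilon: float,
              rng: np.random.Generator) -> float:
    # Perturb a fitted-distribution parameter with zero-centered Laplace noise.
    return value + rng.laplace(loc=0.0, scale=laplace_scale(sensitivity, epsilon))

rng = np.random.default_rng(1)
fitted_mean = 48.7  # a parameter of a fitted distribution
strong = privatize(fitted_mean, sensitivity=0.1, epsilon=0.1, rng=rng)   # heavy noise
weak = privatize(fitted_mean, sensitivity=0.1, epsilon=10.0, rng=rng)    # light noise
```

At epsilon = 0.1 the noise scale is 100 times larger than at epsilon = 10.0, making the privacy/accuracy trade-off explicit.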
An embodiment includes fitting a distribution type to a variable of the original data. An embodiment includes adjusting, using a desired level of privacy and the distribution type, a level of noise, wherein the level of noise corresponds to the desired level of privacy. An embodiment includes generating, using the distribution type and the level of noise, the artificial data, the artificial data achieving the desired level of privacy by including noise data corresponding to the level of noise.
Other embodiments of this aspect include corresponding computer systems, apparatus, and computer programs recorded on one or more computer storage devices, each configured to perform the actions of the embodiment. Thus, an embodiment provides artificial data generation for differential privacy.
In a further embodiment, configuring the plurality of parameters comprises setting an upper bound parameter of a continuous variable comprising the original data to a first value according to a statistical characteristic of the continuous variable. Thus, an embodiment provides additional detail of a parameter used in art
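The configure / fit / adjust-noise / generate / regenerate flow described in this summary could be sketched end to end as follows. This is a hedged, hypothetical sketch, not the claimed implementation: the function name, the sensitivity formula for the mean of bounded data, and the similarity metric are all assumptions chosen for illustration.

```python
import numpy as np

def generate_artificial(original: np.ndarray, epsilon: float,
                        rng: np.random.Generator,
                        threshold: float = 0.9, max_tries: int = 5) -> np.ndarray:
    # Configure parameters: bounds taken from statistical characteristics
    # of the variable (cf. upper/lower bound parameters in the embodiments).
    lower, upper = original.min(), original.max()
    # Fit a distribution type (normal assumed here) to the variable.
    mu, sigma = original.mean(), original.std(ddof=1)
    # Assumed sensitivity of the mean for bounded data of size n.
    sensitivity = (upper - lower) / len(original)
    artificial = np.empty(0)
    for _ in range(max_tries):
        # Adjust the level of noise from epsilon and perturb the fitted parameter.
        noisy_mu = mu + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
        # Generate artificial data from the noisy fitted distribution.
        artificial = rng.normal(noisy_mu, sigma, size=len(original)).clip(lower, upper)
        # Hypothetical similarity metric between original and generated values;
        # regenerate when similarity falls below the threshold.
        similarity = 1.0 - min(1.0, abs(artificial.mean() - mu) / (sigma + 1e-12))
        if similarity >= threshold:
            return artificial
    return artificial  # fall back to the last candidate after max_tries
```

The regeneration loop mirrors the quality check described above: when the generated values are not similar enough to the original variable, the parameters are perturbed again and new artificial data is produced.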