EP-3973429-B1 - COMPATIBLE ANONYMIZATION OF DATA SETS OF DIFFERENT SOURCES


Inventors

  • MIETTINEN, Timo A.
  • SAARELA, Janna
  • PERHEENTUPA, Teemu J.
  • MILLS, Robert
  • ALI, Mehreen
  • PENTIKÄINEN, Tuomo

Dates

Publication Date: 2026-05-13
Application Date: 2020-05-20

Claims (15)

  1. A method for creating compatible anonymized data sets, performed automatically on machine learning equipment that operates a machine learning model, the method comprising: defining (210) data types of individual variables of a first data set; identifying (220) quasi-identifiers for the first data set; defining (230) reidentification sensitivity of all or any targeted subset of the individual variables and quasi-identifiers; defining (240) missing data handling rules for the individual variables; defining (250) allowed data transformations including generalization and use of synthesized data; optimizing (260) quasi-identifier selection, use of synthesized data and a choice of data transformations to minimize information loss and maximize privacy metrics based on at least all of: the first data set; the allowed data transformations; and the missing data handling rules; the method further comprising: training (270) the machine learning model using: the first data set according to the defined data types; the optimized quasi-identifier selection; the optimized use of synthesized data; and the choice of data transformations; and anonymizing (280) the first data set using the trained machine learning model.
  2. The method of claim 1, wherein the method uses one or more further data sets in conjunction with the first data set in the actions defined for the first data set.
  3. The method of claim 1 or 2, wherein the defining of the data types of the individual variables of the first data set is based on a dictionary formed by the machine learning model.
  4. The method of any one of the preceding claims, wherein the defining of the missing data handling rules for the individual variables comprises user defined variable-specific statistical imputation strategies.
  5. The method of any one of the preceding claims, wherein the defining of the missing data handling rules for the individual variables comprises defining combinations of quasi-identifier dependent learned rules which can be adjusted during run-time.
  6. The method of any one of the preceding claims, wherein the defining of the reidentification sensitivity of all or any targeted subset of the individual variables and quasi-identifiers comprises determining combined identification capability of the individual variables and the quasi-identifiers and also of the individual quasi-identifiers.
  7. The method of any one of the preceding claims, wherein the training of the machine learning model using the first data set according to the defined data types uses a first portion of the first data set for the training.
  8. The method of claim 7, further comprising using a second portion of the first data set for validating the machine learning model.
  9. The method of any one of the preceding claims, further comprising anonymizing a second data set using the trained machine learning model.
  10. The method of any one of the preceding claims, further comprising using different reference architectures for the different data types.
  11. The method of any one of the preceding claims, further comprising configuring different levels of data protection for different types of data.
  12. The method of any one of the preceding claims, wherein the first data set is a dynamic data set.
  13. The method of any one of the preceding claims, wherein the overall performance of the system is adjusted to the workload by adding more parallel or independent processing units and/or by adding virtual processing resources.
  14. A computer program comprising computer executable program code which when executed by at least one processor causes an apparatus to perform at least the method of any one of the preceding claims.
  15. Machine learning equipment comprising: a communication interface for receiving a first data set; and a processing function configured to perform the method of any one of claims 1 to 13.
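The pipeline of claim 1 can be made concrete with a minimal sketch of one allowed transformation it names (generalization of a quasi-identifier) and one typical privacy metric to maximize (k-anonymity). The function names, sample records, and bin width below are illustrative assumptions, not the patented method:

```python
from collections import Counter

def generalize_ages(records, quasi_key, bin_width=10):
    """Generalize a numeric quasi-identifier into coarser bins
    (a common anonymizing data transformation)."""
    out = []
    for rec in records:
        rec = dict(rec)
        lo = (rec[quasi_key] // bin_width) * bin_width
        rec[quasi_key] = f"{lo}-{lo + bin_width - 1}"
        out.append(rec)
    return out

def k_anonymity(records, quasi_keys):
    """Smallest equivalence-class size over the quasi-identifier
    columns: the 'k' in k-anonymity."""
    groups = Counter(tuple(r[k] for k in quasi_keys) for r in records)
    return min(groups.values())

rows = [{"age": 34, "zip": "00100"}, {"age": 37, "zip": "00100"},
        {"age": 52, "zip": "00200"}, {"age": 58, "zip": "00200"}]
anon = generalize_ages(rows, "age", bin_width=10)
print(k_anonymity(rows, ["age", "zip"]))   # → 1 (every record unique)
print(k_anonymity(anon, ["age", "zip"]))   # → 2 after generalization
```

An optimizer as in step (260) would search over such transformations (bin widths, suppression, synthesized records) to raise k while keeping information loss low.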

Description

TECHNICAL FIELD

The present invention generally relates to compatible anonymization of data sets of different sources.

BACKGROUND

This section illustrates useful background information without admission that any technique described herein is representative of the state of the art. Creating large data sets for artificial intelligence or machine learning applications often requires combining data sets from multiple data sources within or between organizations. When the data is not public, and especially when the data is sensitive, e.g., personal data or commercially sensitive data, access to the different data sets may be particularly restricted. The development of an artificial intelligence or machine learning model also requires source data that is sufficiently anonymized to avoid the risk of the learning becoming distorted by particular irrelevant details that are not statistically relevant but would easily be picked up by a machine learning process. It is thus necessary for technical applications of machine learning to appropriately anonymize data sets. Moreover, when there are numerous data sets from different sources, the combining of the data sets is difficult to arrange so that the sensitive data is kept out while the structure of each data set is still maintained statistically representative and correct. Several methods exist for protecting sensitive data and personal information by anonymizing the data, by replacing identifiers with pseudonyms, or by using synthetic data instead of actual data. However, these methods scale poorly if the data is continuously accumulating or arriving from several different organizations and the sensitive raw data cannot be shared. Also, current anonymization methods cannot guarantee compatibility of the data sets if the anonymization is performed prior to combining data from different sources. US2018012039A1 discloses a device for anonymizing input data.
EP3451209A1 discloses an apparatus for anonymizing image content. It is an object of the present invention to solve or mitigate the problems related to the prior art and/or to provide new technical alternative(s).

SUMMARY

According to a first example aspect of the invention there is provided a method for creating compatible anonymized data sets, performed automatically with machine learning equipment that operates a machine learning model: defining data types of individual variables of a first data set; identifying quasi-identifiers for the first data set; defining reidentification sensitivity of all or any targeted subset of the individual variables and quasi-identifiers; defining missing data handling rules for the individual variables; defining allowed data transformations including generalization and use of synthesized data; optimizing quasi-identifier selection, use of synthesized data and a choice of data transformations to minimize information loss and maximize privacy metrics based on at least all of: the first data set; the allowed data transformations; and the missing data handling rules; the method further comprising training the machine learning model using: the first data set according to the defined data types; the optimized quasi-identifier selection; the optimized use of synthesized data; and the choice of data transformations; and anonymizing the first data set using the trained machine learning model. In an embodiment, the method uses one or more further data sets in conjunction with the first data set in the acts defined for the first data set. Advantageously, a plurality of data sets may be used in training the machine learning model. The defining of the data types of the individual variables of the first data set may be based on a dictionary formed by the machine learning model. The machine learning model may refer to a state of the machine learning equipment in which the machine learning equipment has learned associations of different items.
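The variable-to-data-type dictionary mentioned above can be pictured with a simple heuristic sketch; the function name, type labels, and sample records are illustrative assumptions standing in for whatever dictionary the machine learning equipment actually learns:

```python
def infer_data_types(rows):
    """Build a variable -> data-type dictionary from sample records.
    The bool check runs first because bool is a subclass of int."""
    types = {}
    for name in rows[0]:
        values = [r[name] for r in rows if r[name] is not None]
        if all(isinstance(v, bool) for v in values):
            types[name] = "boolean"
        elif all(isinstance(v, (int, float)) for v in values):
            types[name] = "numeric"
        else:
            types[name] = "categorical"
    return types

sample = [{"age": 34, "smoker": True, "city": "Espoo"},
          {"age": 52, "smoker": False, "city": "Oulu"}]
print(infer_data_types(sample))
# → {'age': 'numeric', 'smoker': 'boolean', 'city': 'categorical'}
```

Downstream steps (generalization, imputation, synthesis) can then branch on the inferred type per variable.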
The defining of the missing data handling rules for the individual variables may comprise defining user-defined variable-specific statistical imputation strategies and/or defining combinations of quasi-identifier dependent learned rules which can be adjusted during run-time. The first data set may comprise high-dimensional multivariate data with identifiable or sensitive information. The first data set may comprise numeric data. The first data set may alternatively or additionally comprise textual data. The machine learning model may quantify textual data. The quantifying of textual data may comprise counting instances of each substring, or sets of allowed textual values. The quantifying may produce numeric data. The defining of the reidentification sensitivity of all or any targeted subset of the individual variables and quasi-identifiers may comprise determining combined identification capability of the individual variables and the quasi-identifiers and also of the individual quasi-identifiers. The training of the machine learning model using the first data set according to the defined data types may use a first portion of the first data set for the training.
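The quantification of textual data by counting instances of allowed substrings can be sketched as follows; the term list and the sample note are illustrative assumptions, not a specification from the patent:

```python
def quantify_text(text, allowed_terms):
    """Turn free text into numeric data by counting occurrences of
    each allowed term, one reading of the 'counting instances of each
    substring, or sets of allowed textual values' step above."""
    lowered = text.lower()
    return {term: lowered.count(term.lower()) for term in allowed_terms}

note = "Patient reports headache; headache worsens at night, no fever."
print(quantify_text(note, ["headache", "fever", "nausea"]))
# → {'headache': 2, 'fever': 1, 'nausea': 0}
```

The resulting numeric vector can then be treated like any other numeric variable when optimizing transformations and privacy metrics.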