CN-116186757-B - Method for publishing condition feature selection differential privacy data with enhanced utility
Abstract
The invention discloses a utility-enhanced conditional feature selection differential privacy data publishing method. The method preprocesses the original data set; performs associated feature selection on the preprocessed data using conditional mutual information; normalizes the data and applies micro-aggregation to the values of the features that meet a threshold condition, obtaining a number of clusters of size k; computes a locally optimal k value using the silhouette coefficient; adds suitably calibrated noise to each cluster according to a redefined feature-dependency sensitivity; and reallocates the privacy budget to realize differential privacy before publishing the perturbed data. The published data supports analysis tasks such as counting queries and classification. The invention can resist individual-data privacy attacks by adversaries with strong background knowledge, and improves the availability of the released data on the premise that individual sensitive data remains private.
Inventors
- YE XINXIN
- DENG HAI
- ZHU YOUWEN
Assignees
- Nanjing University of Aeronautics and Astronautics
Dates
- Publication Date
- 20260508
- Application Date
- 20221221
Claims (3)
- 1. A utility-enhanced conditional feature selection differential privacy data publishing method, characterized by comprising the following steps: S1, preprocessing the original data set, including deleting abnormal values and filling in missing values; S2, processing numerical and categorical data separately: discretizing continuous numerical data, encoding categorical data via word-vector mappings, and extracting the feature domains of the categorical data from WordNet 2.1; S3, dividing the data set into a sensitive feature set S and a quasi-identifier feature set Q according to feature sensitivity, computing the conditional mutual information between Q and S, and computing a mutual information threshold; S4, selecting the feature set that meets the threshold, normalizing its data values, and applying micro-aggregation to the corresponding data, choosing a different distance metric for each data type, specifically comprising the following steps: S41, comparing each computed conditional mutual information value against the threshold (the larger the conditional mutual information, the stronger the dependency between features) and selecting all features whose conditional mutual information exceeds the threshold, the number of selected features giving the dimension of the feature set; S42, normalizing the value of each feature by the formula x' = (x - min)/(max - min), where x is a value of the feature, [min, max] is the domain of that feature, and x' is the normalized value, so that all feature values are mapped into [0, 1]; S43, performing micro-aggregation by iteratively creating clusters of at least k elements, such that the elements within each cluster are as similar as possible and the elements of different clusters are as different as possible; each cluster selects a representative record as its centroid and uses the centroid to replace the other values within the cluster; S5, after micro-aggregation obtaining n/k clusters of size k, and measuring the locally optimal k value with the silhouette coefficient, comprising the following steps: given a data set D with n records micro-aggregated into n/k clusters, the silhouette coefficient of each instance i in a cluster is S(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the average distance from i to the other instances in its cluster, and b(i) is the minimum average distance from i to the instances of the clusters not containing i; S(i) lies in [-1, 1], and when b(i) is much larger than a(i), S(i) approaches +1, indicating high intra-cluster cohesion and low inter-cluster coupling; the k value that maximizes the silhouette coefficient S(i) is selected; S6, adding perturbation to the n/k clusters and reallocating the privacy budget to realize differential privacy, obtaining the data set to be published, which can be used for query and classification tasks.
- 2. The utility-enhanced conditional feature selection differential privacy data publishing method according to claim 1, wherein in step S2, after the feature domains of the categorical data are extracted, the feature values are mapped into the WordNet 2.1 ontology knowledge base, and a minimal hierarchical structure is obtained from the mapping; extracting the feature domains comprises capturing and modeling the domain of each categorical feature through an ontology, wherein the ontology of each feature domain is extracted from an existing knowledge source and is created by generalizing and classifying the concepts in the feature domain.
- 3. The utility-enhanced conditional feature selection differential privacy data publishing method according to claim 1, characterized in that in step S6, perturbation is added to the n/k clusters and the privacy budget is reallocated to realize differential privacy, obtaining a data set to be published that can be used for query and classification tasks, comprising the steps of: S61, for numerical data, using the Euclidean distance as the distance metric during micro-aggregation, and perturbing the data with the Laplace mechanism to achieve differential privacy; S62, for categorical data, mapping the feature values into the WordNet 2.1 ontology knowledge base to obtain a minimal hierarchical structure, using a semantic distance metric when micro-aggregating the data values, and perturbing the data with the exponential mechanism to achieve differential privacy; S63, readjusting the reallocation of the privacy budget, replacing the traditional uniform allocation of the total privacy budget over the multidimensional features with a weighted allocation, specifically: the original data set D has d-dimensional features and a total privacy budget epsilon; after the conditional feature selection micro-aggregation differential privacy algorithm, the weight occupied by each feature is calculated, and the weighted privacy budget is reallocated to the conditionally selected features according to the weight values, thereby protecting the private data.
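As an illustration of steps S61 and S63, the sketch below draws Laplace noise by inverse-CDF sampling and splits a total budget epsilon over features in proportion to a weight score. This is a minimal sketch, not the patented implementation: weighting by each feature's conditional-mutual-information score, and all names and values, are assumptions for illustration; the claim only states that the uniform epsilon/d split is replaced by a weighted one.

```python
import math
import random

def laplace_noise(scale):
    """Draw one sample from Laplace(0, scale) by inverse-CDF sampling."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def weighted_budgets(weights, total_epsilon):
    """Step S63 (sketch): split the total budget over features in
    proportion to their weights, instead of the uniform epsilon/d split."""
    total = sum(weights.values())
    return {f: total_epsilon * w / total for f, w in weights.items()}

def perturb_centroid(centroid, sensitivity, epsilon):
    """Step S61 (sketch): release a numeric cluster centroid under
    epsilon-DP by adding Laplace noise of scale sensitivity/epsilon."""
    return centroid + laplace_noise(sensitivity / epsilon)

# Hypothetical conditional-mutual-information weights for three features.
budgets = weighted_budgets({"age": 0.6, "income": 0.3, "zip": 0.1},
                           total_epsilon=1.0)
print(round(budgets["age"], 6))  # the strongest feature gets 0.6 of the budget
noisy = perturb_centroid(42.0, sensitivity=1.0, epsilon=budgets["age"])
```

A larger budget share for a strongly dependent feature means less noise on it, which is the utility gain the weighted allocation targets.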
Description
Method for publishing condition feature selection differential privacy data with enhanced utility
Technical Field
The invention relates to the technical field of information security and privacy, in particular to a utility-enhanced conditional feature selection differential privacy data publishing method.
Background
The rapid development of information sharing and knowledge exchange has led to explosive growth of generated data, and this data (personal salary, medical records, consumption habits, preferences, etc.) often contains a large amount of sensitive information. Service providers eagerly collect and analyze individual data in order to provide more accurate services, thereby posing a threat to the privacy of individuals and organizations. In fact, the object protected in privacy-preserving data publishing is the correspondence between a user's sensitive data and the individual's identity: only the link between an individual's sensitive information and that identity needs to be blocked in the tasks of querying and analyzing the published data. To protect user privacy, traditional privacy models such as k-anonymity and its extensions have been used in succession. However, new attacks keep emerging that make the traditional privacy protection models vulnerable: because the background knowledge an attacker has obtained cannot be determined, these models have repeatedly been shown unable to protect sensitive information. Moreover, various relations exist among data or features, and these relations can cause serious privacy leakage through re-identification means such as multi-table joins. In view of the limitless variety of attack methods and the defects of existing privacy protection mechanisms, the Dwork team at Microsoft Research proposed the differential privacy model.
Differential privacy is a privacy-preserving model built on a solid mathematical foundation, which strictly defines privacy-preserving utility and provides a method for quantitative evaluation. By adding perturbation to the data, the potentially sensitive user information in the published data is protected: even if an attacker has mastered all records in the data set except one, the attacker still cannot infer the original data of that record. Mathematically, a random algorithm A satisfies epsilon-differential privacy if and only if, for all neighboring databases D1 and D2 and every possible output set S ⊆ Range(A), the inequality Pr(A(D1) ∈ S) ≤ exp(ε) × Pr(A(D2) ∈ S) holds. D1 and D2 differ in only one record; ε ≥ 0 is the privacy budget, which measures the degree of privacy protection of the data: the closer ε is to 0, the higher the privacy protection, but also the stronger the data perturbation and the larger the error. In addition, the degree of privacy protection is related to the sensitivity of the query or classification algorithm: the higher the sensitivity, the larger the noise scale required for differential privacy and the larger the error. In practice, strictly realized differential privacy distorts the data heavily and lowers its usability. Thus, balancing the privacy and availability of data in differential privacy data publishing and data mining scenarios is an important challenge. To cope with the twin challenges of stronger privacy protection and low data utility in differential privacy data publishing, the differential privacy technique is combined with an existing machine learning model into a new model, so that the advantages of both are combined and act cooperatively.
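The ε-differential-privacy inequality above can be checked numerically for the standard Laplace mechanism on a counting query (sensitivity 1): the ratio of output densities on two neighboring databases never exceeds exp(ε). This is a generic textbook check, not code from the patent; the counts and sweep range are arbitrary.

```python
import math

def laplace_pdf(x, mean, scale):
    """Density of the Laplace distribution with the given mean and scale."""
    return math.exp(-abs(x - mean) / scale) / (2 * scale)

eps = 0.5
scale = 1.0 / eps               # counting query => sensitivity 1, scale 1/eps
count_d1, count_d2 = 100, 101   # neighboring databases differ in one record

# Worst-case density ratio over a sweep of possible outputs 90.0 .. 110.9.
worst_ratio = max(
    laplace_pdf(x / 10.0, count_d1, scale) / laplace_pdf(x / 10.0, count_d2, scale)
    for x in range(900, 1110)
)
print(worst_ratio <= math.exp(eps) + 1e-12)  # True: the exp(eps) bound holds
```

The ratio equals exp((|x − 101| − |x − 100|)/scale), and since the counts differ by at most 1, the exponent never exceeds ε, which is exactly the guarantee in the inequality.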
On the one hand, the model can reduce the query sensitivity of the privacy algorithm by exploiting the characteristics of the machine learning model, thereby improving data availability; on the other hand, the algorithm satisfies the basic definition of differential privacy and blocks the identification between an individual's sensitive information and identity. The model is applied to real data sets to generate perturbed data for publishing, and the published data improves data utility on the premise of protecting individual sensitive information.
Disclosure of Invention
The invention aims to provide a utility-enhanced conditional feature selection differential privacy data publishing method, which on the one hand strongly guarantees the privacy of user data, and on the other hand improves data usability by reducing the sensitivity of the privacy algorithm, thereby improving the usability of tasks such as statistical queries and classification analysis. The utility-enhanced conditional feature selection differential privacy data publishing method comprises the following step
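Steps S42 and S5 of claim 1 (min-max normalization and the silhouette-based choice of k) can be sketched in pure Python as follows. This is a minimal illustration only: the toy data, the fixed partitions standing in for micro-aggregation, and the 1-D absolute-value distance are all assumptions, not the patented procedure.

```python
def normalize(values):
    """Step S42 (sketch): scale one feature's values into [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def silhouette(clusters):
    """Mean of S(i) = (b(i) - a(i)) / max(a(i), b(i)) over all points,
    with a(i) the average intra-cluster distance and b(i) the smallest
    average distance to any other cluster (step S5)."""
    scores = []
    for ci, cluster in enumerate(clusters):
        for x in cluster:
            others = [y for y in cluster if y is not x]
            a = sum(abs(x - y) for y in others) / len(others) if others else 0.0
            b = min(sum(abs(x - y) for y in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

data = normalize([3, 4, 5, 40, 42, 44, 80, 81, 85])
# Two candidate partitions of the same data; keep the k with the higher score.
k3 = [data[0:3], data[3:6], data[6:9]]
k2 = [data[0:3], data[3:9]]
print(silhouette(k3) > silhouette(k2))  # True: the 3-cluster split fits better
```

Choosing the k that maximizes the mean silhouette trades off cluster cohesion against separation, which is why claim 1 uses it to pick a locally optimal micro-aggregation scale.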