CN-116167078-B - Differential privacy synthetic data publishing method based on maximum weight matching
Abstract
The invention discloses a differential privacy synthetic data publishing method based on maximum weight matching, which comprises the steps of preprocessing collected user data through a reliable third-party server, then carrying out information representation on processed data attributes by using a graph model method to obtain an attribute association graph, selecting a group of proper low-dimensional marginal sets according to a maximum weight matching algorithm, then adding noise to the group of low-dimensional marginal sets respectively, enabling the noise to meet differential privacy definition, carrying out post-processing on the noisy low-dimensional marginal sets to obtain a group of standardized low-dimensional marginal sets, carrying out data synthesis according to the group of low-dimensional marginal sets to enable the synthetic data set to be similar to an original data set as much as possible in statistical information, and finally carrying out data publishing on the synthetic data set. By adopting the technical method, the calculation complexity can be reduced while the utility of the synthesized data set is ensured, and the method has better utility for high-dimensional data.
Inventors
- ZHANG MIAO
- DENG HAI
- YE XINXIN
Assignees
- 南京航空航天大学
Dates
- Publication Date
- 20260508
- Application Date
- 20230115
Claims (4)
- 1. The differential privacy synthetic data release method based on maximum weight matching is characterized in that the method obtains a low-dimensional marginal set by applying a maximum weight matching algorithm to a probability map model constructed by an original data set, performs data synthesis after adding proper noise to the low-dimensional marginal set based on a differential privacy method, and releases a synthetic data set after post-processing, thereby realizing privacy protection of user information and improving the utility of the synthetic data set; the method comprises the following processing steps: s1, a server aggregates the collected user data to obtain an initial data set, and a weighted probability graph model is built according to the data set; s2, the server applies a maximum weight matching algorithm according to the generated probability graph model to obtain a group of high-correlation low-dimensional marginal sets; S3, adding proper Gaussian noise to the low-dimensional marginal set according to the definition of the differential privacy model and a reasonable distribution method of privacy budget; S31, distributing privacy budgets, namely obtaining different privacy duty ratios according to expected square deviations of optimized noise scales, and adding proper noise to different low-dimensional marginal sets, wherein the optimization problem is expressed as: Wherein, the Representation of The domain sizes corresponding to the low dimensional marginal sets, Representing privacy budget duty cycles of the corresponding views; s32, differential privacy definition, namely a random mechanism is arranged , Is that The set of all possible outputs, in a given two adjacent data sets And On the other hand, for The random mechanism is said to be if the following inequality is satisfied Is satisfied with -Differential privacy; S33, zero-concentration differential privacy Definition of a random mechanism Given two adjacent data sets And On all that is The random mechanism is called if the following inequality is satisfied Is zero-centralized differential privacy In the following expression exists: Wherein, the Is that And Between two distributions -Renyi divergence, representing the privacy loss random variable; S34, theorem, if random mechanism Satisfying zero-concentration differential privacy Then for any of 0, The random mechanism A is satisfied -Differential privacy; S35, definition of Gaussian mechanism, given Is a sensitive query for the input data set Gaussian mechanism The following equation is satisfied: Wherein, the Is the noise scale, and can be obtained according to definition , Representing sensitive queries Is a sensitivity of (2); S36, according to the definition and theorem, the calculation mode of the added noise scale is as follows: s4, processing the problem of negative occurrence number and inconsistent data of probability data after noise is added; S5, synthesizing the high-dimensional data set by using the low-dimensional marginal set subjected to noise adding processing and post-processing so as to approximate the original data set on statistical information, and finally, releasing the synthesized data set by the server.
- 2. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein the aggregation of data in step S1 includes calculating a correlation coefficient according to the association between attributes, and the calculation process is as follows: given data set Generating an attribute map , Representing attribute nodes, in common Attributes, attribute map The weight of the edge connecting the two nodes is the correlation coefficient of the attribute represented by the two nodes, and the calculation expression of the correlation coefficient is as follows: Wherein, the Respectively represent the first And (d) The nodes of the attributes are selected to be the same, Representing attributes And Is used to determine the joint probability of (1), Respectively represent attributes And attributes Is a probability of (2).
- 3. The method for publishing differential privacy synthetic data based on maximum weight matching according to claim 1, wherein the step S2 of solving the low-dimensional marginal set comprises the following procedures: S21, initializing and selecting a low-dimensional marginal set M as an empty set; S22, selecting an attribute map The edge with the largest weight value in the set represents that two connected attribute nodes have higher correlation, and the attribute node pair is taken as a selected low-dimensional marginal set Added to In the collection, and delete the attribute map All the edges that are connected to these attribute nodes; s23, repeating the step S22 until the attribute map And if no attribute nodes exist, obtaining a low-dimensional marginal set with the maximum weight.
- 4. The differential privacy synthesis data distribution method based on maximum weight matching according to claim 1, wherein step S5 comprises: s51, initializing a synthetic data set Synthesizing according to the target distribution, namely synthesizing by noise distribution; s52, adding for the initial count value of the unit is smaller than the target distribution count value A secondary record, wherein, Represents the target count value of the target, Representing an initial count value of the number of bits, Represents the attenuation factor by the following calculation method , Representing an initial value, k representing an attenuation rate, t representing the number of iterations, and s representing a step size; s53, for the initial count value of the unit is larger than the target distribution count value, reducing Secondary record, again, c t represents the target count value, c s represents the initial count value, The calculation mode of (a) is that 。
Description
Differential privacy synthetic data publishing method based on maximum weight matching Technical Field The invention belongs to the field of information security, and particularly relates to a differential privacy synthetic data publishing method based on maximum weight matching. Background With the rapid development of information technology, online registration, online shopping, online trip and the like are gradually integrated into the life of people, various organizations (such as hospitals, bus buses and the like) can easily obtain detailed information of users, a large amount of user data is accumulated by a plurality of organizations, and effective data support can be provided for subsequent tasks such as prediction analysis and the like by carrying out statistical analysis on the data, so that great research value is brought to the data. To meet the needs of research and innovation, related organizations may release the obtained data information, which often includes private information of individuals, and if the private data is not properly protected, the private data of users is easily revealed, which causes immeasurable loss. Therefore, research into a data distribution method based on privacy protection is indispensable. Methods for protecting data privacy are called disclosure restrictions, and these technologies aim to provide protection for sensitive information, and issue data information to the public so that researchers can perform statistical analysis on the data, and one common method for protecting private data in the issued data is anonymization method, namely deleting sensitive information in the data, but many researches have proved that deleting sensitive information alone cannot effectively realize privacy protection, and an attacker can still obtain sensitive information through identification and analysis of other attributes, so that strong privacy guarantee cannot be provided for the data issuing process. One reliability scheme of current privacy protection is a differential privacy model, which does not care how much background knowledge an attacker has, and achieves privacy protection effect by adding proper noise to query or analysis results, thus providing reliable privacy guarantee for data distribution. The differential privacy model provides clear definition and effective proof in mathematics, and the model can be simply understood as adding or deleting a certain record in the adjacent data set, and is insensitive to the calculation processing result of the data set, that is, the probability of personal information being identified in a small range of the data processed by the differential privacy model. In recent years, many research schemes for releasing private data based on a differential privacy model, such as a haar wavelet transformation mode, a histogram mode, a division mode and the like, are available, however, these methods are designed for specific tasks, require a certain expertise in processing by a third party central server, and have a certain difficulty in fully utilizing data. Therefore, a synthetic data release scheme based on differential privacy is provided, the synthetic data set can approximate to the original data, replace the original data to perform analysis tasks, achieve the effect of protecting privacy, and can ensure the accuracy of the synthetic data set. However, the computational complexity of the currently proposed synthetic data distribution scheme based on the differential privacy method tends to be high, and a large amount of noise is added to the high-dimensional data set, resulting in the synthetic data set not being usable. Disclosure of Invention The invention provides a method for data release based on differential privacy synthetic data with maximum weight matching, which is characterized in that a probability map model constructed by an original data set is subjected to maximum weight matching algorithm to obtain low-dimensional distribution, the low-dimensional distribution is subjected to post-processing to synthesize data after proper noise is added based on the differential privacy method, and then a synthetic data set is released, so that privacy protection of user information can be achieved, and the utility of the synthetic data set is improved. In order to achieve the above object, the present invention adopts the following technical scheme. A differential privacy synthetic data release method based on maximum weight matching is characterized in that a probability map model constructed by an original data set is subjected to maximum weight matching algorithm to obtain a low-dimensional marginal set, appropriate noise is added to low-dimensional distribution based on the differential privacy method, data synthesis is performed through post-processing, and then a synthetic data set is released, so that privacy protection of user information can be achieved, and the utility of the synthetic data set is improved. The method