CN-121997364-A - Dynamic desensitization method for data sharing scene
Abstract
The application provides a dynamic desensitization method for a data sharing scenario, comprising: receiving an original data set and a set target data availability; converting each type of data in the original data set into structured data; calculating the standard deviation of the kth class of data based on its structured data; determining an initial privacy parameter for the kth class of data based on its standard deviation and a preset privacy parameter coefficient; performing desensitization processing on the kth class of data based on the initial privacy parameter; determining the information entropy and theoretical maximum entropy of the kth class of data, and from these calculating its entropy retention rate; and dynamically adjusting the privacy parameter coefficient of the kth class of data with a PID (proportional-integral-derivative) control algorithm, based on the difference between the entropy retention rate and the target data availability, until that difference is smaller than a first preset threshold, thereby determining the final privacy parameter coefficient of the kth class of data.
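The entropy quantities that drive the control loop can be written out explicitly. The notation below is ours, not the patent's: with n_k observed states of the desensitized kth class of data and empirical probabilities p_s, the loop computes

```latex
H_k = -\sum_{s=1}^{n_k} p_s \log_2 p_s
% information entropy of the true (post-desensitization) distribution

H_k^{\max} = \log_2 n_k
% theoretical maximum entropy, assuming the n_k states are uniformly distributed

R_k = \frac{H_k}{H_k^{\max}}
% entropy retention rate, taken as the measure of data availability

e_k = R_k - A_{\mathrm{target}}
% error signal fed to the PID controller

\Delta\lambda_k = K_p\, e_k + K_i \sum e_k + K_d\,(e_k - e_k^{\mathrm{prev}})
% PID update of the privacy parameter coefficient \lambda_k
```

The iteration stops once |e_k| falls below the first preset threshold.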
Inventors
- DU HAOFENG
- MENG QINGLEI
- LIU JIANXIONG
- WANG NAN
- WANG CHONG
- XU WEIZE
Assignees
- 航天科工网络信息发展有限公司
Dates
- Publication Date: 2026-05-08
- Application Date: 2025-12-23
Claims (10)
- 1. A dynamic desensitization method for a data sharing scenario, comprising: S101, receiving an original data set and a target data availability set by a user, and converting each type of data in the original data set into structured data; S102, calculating the standard deviation of the kth class of data based on the structured data of the kth class of data; S103, determining an initial privacy parameter of the kth class of data based on its standard deviation and a preset privacy parameter coefficient, and then performing desensitization processing on the kth class of data based on the initial privacy parameter; S104, calculating the true probability distribution of each of a plurality of states of the desensitized kth class of data, and determining the information entropy of the kth class of data based on these true probability distributions; S105, assuming the plurality of states of the kth class of data are uniformly distributed, calculating the ideal probability distribution of each state, and determining the theoretical maximum entropy of the kth class of data based on these ideal probability distributions; S106, calculating the entropy retention rate of the kth class of data based on its information entropy and theoretical maximum entropy, wherein the entropy retention rate represents data availability; S107, dynamically adjusting the privacy parameter coefficient of the kth class of data using a PID (proportional-integral-derivative) control algorithm, based on the difference between the entropy retention rate of the kth class of data and the target data availability; S108, repeating steps S103 to S107 until the difference between the entropy retention rate of the kth class of data and the target data availability is smaller than a first preset threshold, determining the final privacy parameter coefficient of the kth class of data, and then performing desensitization processing on the kth class of data; wherein k is a positive integer greater than 1.
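The calibration loop of claim 1 can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation: the Laplace noise mechanism, the histogram-based definition of "states", the PID gains, and the sign convention of the PID update are all assumptions.

```python
import numpy as np

def entropy_retention(values, bins=10):
    """S104-S106: information entropy over the observed states divided by the
    theoretical maximum entropy (uniform distribution over the same states)."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]
    h = -(p * np.log2(p)).sum()   # information entropy of the true distribution
    h_max = np.log2(bins)         # maximum entropy if all states were equally likely
    return h / h_max

def calibrate_privacy_coefficient(data, target_availability,
                                  kp=0.5, ki=0.1, kd=0.05,
                                  tol=1e-3, max_iter=100, seed=0):
    """S103, S107, S108: add Laplace noise scaled by (coefficient * std), then
    PID-adjust the coefficient until the entropy retention rate is within tol
    of the target data availability (or max_iter is reached)."""
    rng = np.random.default_rng(seed)
    sigma = data.std()
    coeff = 1.0 if target_availability < 0.7 else 0.5  # preset rule of claim 5
    integral = prev_err = 0.0
    noisy = data
    for _ in range(max_iter):
        noisy = data + rng.laplace(scale=coeff * sigma, size=data.shape)
        err = entropy_retention(noisy) - target_availability
        if abs(err) < tol:
            break
        integral += err
        # Sign convention is an assumption: a positive error raises the coefficient.
        coeff = max(1e-6, coeff + kp * err + ki * integral + kd * (err - prev_err))
        prev_err = err
    return coeff, noisy
```

The returned coefficient plays the role of the "final privacy parameter coefficient" of S108; in practice the gains and the state discretization would need tuning per data class.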
- 2. The dynamic desensitization method for a data sharing scenario according to claim 1, wherein after said determining the final privacy parameter coefficient of the kth class of data in step S108, the method further comprises: S109, the kth class of data comprising a plurality of privacy fields, calculating the true probability distribution of the ith privacy field, and determining the information entropy of the ith privacy field based on that true probability distribution; S110, the kth class of data comprising a plurality of non-privacy fields, calculating the conditional entropy of the ith privacy field given that the jth non-privacy field is known; S111, calculating the entropy attenuation ratio of the jth non-privacy field to the ith privacy field based on the information entropy of the ith privacy field and its conditional entropy given the jth non-privacy field; S112, repeating steps S109 to S111 until the entropy attenuation ratios of the jth non-privacy field to all privacy fields are determined, and then determining the privacy contribution degree of the jth non-privacy field; S113, repeating steps S109 to S112 until the privacy contribution degrees of all non-privacy fields are determined, and then determining the average of the privacy contribution degrees of all non-privacy fields; S114, presetting an adjustment factor, and calculating an adjustment coefficient for the jth non-privacy field based on its privacy contribution degree and the average of the privacy contribution degrees of all non-privacy fields; S115, calculating the data standard deviation of the jth non-privacy field, and multiplying the final privacy parameter coefficient of the kth class of data by that standard deviation to obtain the basic privacy parameter of the jth non-privacy field; S116, determining the final privacy parameter of the jth non-privacy field based on its adjustment coefficient and basic privacy parameter, and performing desensitization processing on the jth non-privacy field based on the final privacy parameter; wherein i and j are positive integers greater than 1.
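The field-level quantities of claim 2 (S109 to S112) can be sketched for categorical fields as follows. The exact formula for the entropy attenuation ratio is not given in the claim; here it is assumed to be (H(P) - H(P|Q)) / H(P), where H(P|Q) is the standard conditional entropy of a privacy field P given a non-privacy field Q, and the privacy contribution degree is assumed to be the average of these ratios.

```python
import math
from collections import Counter

def entropy(xs):
    """S109: information entropy of a field from its empirical distribution."""
    n = len(xs)
    return -sum(c / n * math.log2(c / n) for c in Counter(xs).values())

def conditional_entropy(priv, nonpriv):
    """S110: H(priv | nonpriv), weighting each group of rows sharing a
    non-privacy value by its frequency."""
    n = len(priv)
    groups = {}
    for p, q in zip(priv, nonpriv):
        groups.setdefault(q, []).append(p)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def privacy_contribution(priv_fields, nonpriv):
    """S111-S112: average entropy attenuation ratio of one non-privacy field
    over all privacy fields (formula assumed, see lead-in)."""
    ratios = []
    for priv in priv_fields:
        h = entropy(priv)
        ratios.append((h - conditional_entropy(priv, nonpriv)) / h if h > 0 else 0.0)
    return sum(ratios) / len(ratios)
```

A non-privacy field perfectly correlated with a privacy field yields a contribution of 1 (it fully reveals the field); an independent one yields 0, which is why S114 scales each field's noise by its contribution relative to the average.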
- 3. The dynamic desensitization method for a data sharing scenario according to claim 2, wherein after said step S116, the method further comprises: S117, after desensitizing all the non-privacy fields, evaluating the data availability of the kth class of data to obtain an availability evaluation result of the kth class of data; S118, repeating steps S101 to S117 until the difference between the availability evaluation result of the kth class of data and the target data availability is smaller than a second preset threshold.
- 4. The dynamic desensitization method for a data sharing scenario according to claim 1, wherein said converting each type of data in the original data set into structured data comprises: for numerical data, preserving the original numerical format; for categorical data, performing label encoding to convert it into discrete values; for time data, splitting it into a plurality of independent numerical features; for text data, performing word segmentation and converting it into a numerical distribution based on word-frequency statistics; for image data, performing grayscale conversion and then computing the distribution of pixel values; and for audio data, performing frequency-domain conversion and feature extraction and then computing the distribution of spectral features.
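The per-type conversion of claim 4 can be sketched with pandas for the tabular cases. This is a minimal illustration covering only the numerical, time, and categorical branches; the text, image, and audio branches are omitted, and the dtype-based dispatch is an assumption about how types would be detected.

```python
import pandas as pd

def to_structured(df: pd.DataFrame) -> pd.DataFrame:
    """Sketch of claim 4: map each column to a numeric representation."""
    out = {}
    for col in df.columns:
        s = df[col]
        if pd.api.types.is_datetime64_any_dtype(s):
            out[f"{col}_year"] = s.dt.year    # time data: split into
            out[f"{col}_month"] = s.dt.month  # independent numerical features
            out[f"{col}_day"] = s.dt.day
        elif pd.api.types.is_numeric_dtype(s):
            out[col] = s                      # numerical data: keep original format
        else:
            # categorical data: label encoding to discrete values
            out[col] = s.astype("category").cat.codes
    return pd.DataFrame(out)
```

On a frame with an integer column, a city-name column, and a datetime column, this yields an all-numeric frame on which the standard deviation of S102 is well defined per column.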
- 5. The dynamic desensitization method for a data sharing scenario according to claim 1, wherein the preset rule for the preset privacy parameter coefficient is: when the target data availability is smaller than 0.7, the preset privacy parameter coefficient is set to 1.0; when the target data availability is greater than or equal to 0.7, the preset privacy parameter coefficient is set to 0.5.
- 6. The dynamic desensitization method for a data sharing scenario according to claim 3, wherein the first preset threshold and the second preset threshold are each one thousandth of the target data availability.
- 7. The dynamic desensitization method according to claim 1, wherein the average of the adjustment coefficients of the non-privacy fields in the kth class of data is 1.
- 8. A dynamic desensitization system for a data sharing scenario, the system comprising: a data receiving and processing module for receiving the original data set and the target data availability set by a user, and converting each type of data in the original data set into structured data; a standard deviation calculation module for calculating the standard deviation of the kth class of data based on its structured data; an initial desensitization module for determining an initial privacy parameter of the kth class of data based on its standard deviation and a preset privacy parameter coefficient, and then performing desensitization processing on the kth class of data based on the initial privacy parameter; an information entropy calculation module for calculating the true probability distribution of each of a plurality of states of the desensitized kth class of data, and determining the information entropy of the kth class of data based on these true probability distributions; a maximum entropy calculation module for calculating the ideal probability distribution of each state on the assumption that the plurality of states of the kth class of data are uniformly distributed, and determining the theoretical maximum entropy of the kth class of data based on these ideal probability distributions; an entropy retention rate calculation module for calculating the entropy retention rate of the kth class of data based on its information entropy and theoretical maximum entropy, wherein the entropy retention rate represents data availability; a PID control adjustment module for dynamically adjusting the privacy parameter coefficient of the kth class of data using a PID control algorithm, based on the difference between the entropy retention rate of the kth class of data and the target data availability; and an iteration convergence judging module for repeating steps S103 to S107 until the difference between the entropy retention rate of the kth class of data and the target data availability is smaller than a first preset threshold, determining the final privacy parameter coefficient of the kth class of data, and then performing desensitization processing on the kth class of data; wherein k is a positive integer greater than 1.
- 9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the data sharing scenario oriented dynamic desensitization method according to any one of claims 1-7 when executing the program.
- 10. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program or instructions which, when executed, cause the method of any one of claims 1-7 to be performed.
Description
Dynamic desensitization method for data sharing scene
Technical Field
The application relates to the technical field of data security, and in particular to a dynamic desensitization method, system, electronic device, and storage medium for a data sharing scenario.
Background
Current desensitization methods that aim to guarantee data availability while improving data privacy are mainly based on natural language processing, for example BERT (Bidirectional Encoder Representations from Transformers) and LSTM (Long Short-Term Memory) networks. NLP-based desensitization can accurately identify sensitive information in data and desensitize it, preserving data availability to the greatest extent; however, it does not consider the correlation between non-sensitive and sensitive information, and is therefore vulnerable to inference attacks that restore the private information. To resist inference attacks, desensitization methods based on generative adversarial networks (GANs) have gradually emerged in recent years. These methods use adversarial training between a generator and a discriminator to generate synthetic data whose distribution is as close as possible to that of the original data. Although GAN-based desensitization can preserve some of the important features of the original data, it is difficult to measure accurately whether the desensitized data meets availability requirements. To address this measurement problem, data desensitization methods based on availability evaluation have attracted attention: researchers perform automated checks through predefined data quality rules, evaluating data availability according to rules on integrity, accuracy, consistency, and the like. However, rule-based desensitization requires rules to be defined per scenario, is strongly constrained by the use scenario and data type, and lacks generality. In view of the above problems, the present application proposes a dynamic desensitization method for a data sharing scenario.
Disclosure of Invention
The embodiments of the application provide a dynamic desensitization method for a data sharing scenario, which achieves an accurate, self-adaptive balance between privacy protection strength and data availability through information-theoretic quantitative evaluation and closed-loop dynamic adjustment. To achieve this purpose, the application adopts the following technical scheme. In a first aspect, the application provides a dynamic desensitization method for a data sharing scenario, the method comprising: S101, receiving an original data set and a target data availability set by a user, and converting each type of data in the original data set into structured data; S102, calculating the standard deviation of the kth class of data based on its structured data; S103, determining an initial privacy parameter of the kth class of data based on its standard deviation and a preset privacy parameter coefficient, and then performing desensitization processing on the kth class of data based on the initial privacy parameter; S104, calculating the true probability distribution of each of a plurality of states of the desensitized kth class of data, and determining the information entropy of the kth class of data based on these true probability distributions; S105, assuming the plurality of states of the kth class of data are uniformly distributed, calculating the ideal probability distribution of each state, and determining the theoretical maximum entropy of the kth class of data based on these ideal probability distributions; S106, calculating the entropy retention rate of the kth class of data based on its information entropy and theoretical maximum entropy, wherein the entropy retention rate represents data availability; S107, dynamically adjusting the privacy parameter coefficient of the kth class of data using a PID control algorithm, based on the difference between the entropy retention rate of the kth class of data and the target data availability; S108, repeating steps S103 to S107 until the difference between the entropy retention rate of the kth class of data and the target data availability is smaller than a first preset threshold, determining the final privacy parameter coefficient of the kth class of data, and then performing desensitization processing on the kth class of data; wherein k is a positive integer greater than 1. In a second