CN-121980617-A - Method, device and system for automatically identifying sensitive data to perform de-identification processing
Abstract
The invention discloses a method, a device and a system for automatically identifying sensitive data and performing de-identification processing. It relates to the field of data security and addresses the insufficient protection of sensitive data in existing schemes. The method comprises the following steps: S1, acquiring multi-source heterogeneous data sources and generating data units to be processed and metadata; S2, constructing a dynamic sensitive data ontology library, performing sensitive data identification on the data units to be processed to obtain sensitive entities, recording the identification confidence, and judging the data units to be processed according to the identification confidence; S3, performing de-identification processing on the sensitive entities and performing consistency verification against the sensitive entities; S4, performing quantitative re-identification risk evaluation on the processing completion units and implementing a closed-loop optimization mechanism; and S5, outputting the data that pass compliance verification to a target platform and performing full life cycle management and control on the output data.
Inventors
- ZHANG YADONG
- DAI MIN
- HUANG YANGCHENG
- ZHU JIANUO
Assignees
- 北京腾云天下科技有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260407
Claims (10)
- 1. A method for automatically identifying sensitive data for de-identification processing, comprising: Step S1, acquiring a multi-source heterogeneous data source, and performing standardized preprocessing to generate a data unit to be processed; Step S2, acquiring a dynamic sensitive data ontology library, identifying sensitive data of the data unit to be processed according to the dynamic sensitive data ontology library to obtain sensitive entities, recording the identification confidence coefficient of each sensitive entity, judging the data unit to be processed based on the identification confidence coefficient, grading based on the judgment result, and generating a full-dataset sensitive data distribution statistical report from the grading result; Step S3, performing de-identification processing on the sensitive entity based on the metadata to obtain a processing completion unit, and performing data consistency verification between the processing completion unit and the sensitive entity; Step S4, performing re-identification risk quantitative evaluation on the processing completion unit, and implementing a closed-loop optimization mechanism according to the evaluation result; and Step S5, outputting the data that pass compliance verification to a target platform, synchronously outputting a full-link processing log, a re-identification risk assessment report, a compliance verification report and the full-dataset sensitive data distribution statistical report to the target platform, and performing full life cycle management and control on the output data.
- 2. The method for automatically identifying sensitive data for de-identification according to claim 1, wherein the specific steps of step S1 are as follows: Step S11, counting the structure types of the multi-source heterogeneous data sources, and classifying the multi-source heterogeneous data sources by structure type to obtain single-structure data sources; the standardized model sequentially receives the single-structure data sources, performs standardized processing on them to generate single-structure standard data, and inputs the single-structure standard data into the identification classification model, which generates a plurality of data fragments and data summaries corresponding to the data fragments; Step S12, generating the data fragments and data summaries of all single-structure data sources, collecting the data fragments to obtain data units to be processed, integrating the data summaries through an aggregation container to obtain metadata of the multi-source heterogeneous data sources, and storing and recording the data units to be processed and the metadata.
- 3. The method for automatically identifying sensitive data for de-identification according to claim 1, wherein the specific steps of step S2 are as follows: Step S21, acquiring national laws and regulations and industry regulations related to sensitive data to obtain sensitive data regulations, importing the sensitive data regulations as a knowledge base into a large model for training to obtain the dynamic sensitive data ontology library, the large model acquiring updated laws, regulations and industry regulations in real time and updating the dynamic sensitive data ontology library in real time; Step S22, performing sensitive data identification on the data units to be processed through the dynamic sensitive data ontology library to obtain sensitive entities, recording the identification confidence coefficient of each sensitive entity, performing manual verification according to the identification confidence coefficient, iteratively optimizing the large model in the dynamic sensitive data ontology library, classifying the sensitive entities to generate entity tags, and compiling statistics on the sensitive entities and the entity tags to construct a full-dataset sensitive data distribution statistical report.
- 4. The method for automatically identifying sensitive data for de-identification according to claim 3, wherein the specific steps of step S22 are as follows: Step S221, based on the dynamic sensitive data ontology library, acquiring sensitive data that have a fixed format, generating a retrieval regular expression according to the fixed format, performing preliminary identification on the data unit to be processed through the retrieval regular expression to obtain fixed-structure sensitive entities, and setting the confidence coefficient of each fixed-structure sensitive entity to 100%; Step S222, invoking a dedicated large model that has been pre-trained and fine-tuned for sensitive information identification, receiving the data unit to be processed, comparing the data unit to be processed with the dynamic sensitive data ontology library, performing deep identification of sensitive data without a fixed format in the data unit to be processed to obtain sensitive entities without a fixed structure, and recording the similarity obtained by the dedicated large model in this comparison as the confidence coefficient of each sensitive entity without a fixed structure.
- 5. The method for automatically identifying sensitive data for de-identification according to claim 4, wherein step S222 is followed by the steps of: Step S223, collecting the fixed-structure sensitive entities and the sensitive entities without a fixed structure to obtain the sensitive entities, recording the confidence coefficient of each sensitive entity, setting a confidence coefficient threshold, comparing the confidence coefficient of each sensitive entity with the confidence coefficient threshold, recording the sensitive entities whose confidence coefficient is higher than the threshold, transmitting the sensitive entities whose confidence coefficient is lower than the threshold to a worker for manual rechecking, and iteratively optimizing the large model according to the rechecking result; Step S224, performing hierarchical matching between the checked sensitive entities and the dynamic sensitive data ontology library to obtain a sensitivity level, generating an entity label by combining the confidence coefficient of the sensitive entity and the analysis state of the sensitive entity, and compiling statistics on the sensitive entities and the entity labels to construct a full-dataset sensitive data distribution statistical report.
- 6. The method for automatically identifying sensitive data for de-identification according to claim 1, wherein the specific steps of step S4 are as follows: Step S41, performing re-identification risk quantitative evaluation on the processing completion unit, generating a re-identification risk quantitative evaluation report, rolling back the processing completion unit according to the evaluation result and performing de-identification processing again, recording the number of rollbacks of the processing completion unit, and issuing an early warning to staff based on the number of rollbacks; Step S42, acquiring the dynamic sensitive data ontology library, checking the processing completion unit for sensitive data, analyzing whether any sensitive data were missed in processing, tracing the processing procedure of the processing completion unit, analyzing the traceability of that processing, rolling back the processing completion unit based on the omission and traceability findings, and generating a compliance check report.
- 7. The method for automatically identifying sensitive data for de-identification according to claim 6, wherein the specific steps of step S41 are as follows: Step S411, collecting anonymization indexes of the processing completion unit, setting a threshold for the anonymization indexes to obtain an anonymization index threshold, checking the anonymization indexes of the processing completion unit against the anonymization index threshold, and, if the anonymization indexes do not meet the threshold, returning the processing completion unit for anonymization again; Step S412, transmitting the processing completion units to an attack model, simulating a linkage attack in combination with a public auxiliary data set to obtain a re-identification probability value for each processing completion unit, collecting the re-identification probability values of all processing completion units, setting a risk division threshold according to the distribution of the re-identification probability values, dividing the processing completion units by risk, and recording the risk level of each processing completion unit; Step S413, according to the risk level, rolling back the processing completion unit, executing closed-loop optimization, performing de-identification processing on the processing completion unit again, counting the number of rollbacks of the processing completion unit, issuing an early warning to staff according to the number of rollbacks, and having the staff perform manual auxiliary processing on the processing completion unit.
- 8. The method for automatically identifying sensitive data for de-identification according to claim 7, wherein the specific steps of step S412 are as follows: collecting the re-identification probability values of the processing completion units, denoting each re-identification probability value as zbs, traversing all the re-identification probability values, and recording the occurrence frequency of each re-identification probability value as cpl(zbs); compiling the occurrence frequencies by re-identification probability value to obtain a frequency distribution list of the processing completion units; setting two pointers whose initial positions are the first element and the last element of the frequency distribution list respectively, and denoting them as a left pointer and a right pointer; dividing the frequency distribution list into three sub-lists based on the left pointer and the right pointer, denoting the sub-lists as a low-order list, a middle-order list and a high-order list, computing the variance of cpl(zbs) in each of the low-order list, the middle-order list and the high-order list to obtain a low-order variance, a middle-order variance and a high-order variance, and obtaining a distribution judgment value from the low-order variance, the middle-order variance and the high-order variance; and extracting the left pointer and the right pointer corresponding to the largest distribution judgment value, and taking the re-identification probability values corresponding to the left pointer and the right pointer in the frequency distribution list as the risk division thresholds.
- 9. An apparatus for automatically identifying sensitive data for de-identification processing, the apparatus comprising a memory and a processor coupled to the memory, wherein the memory is configured to store a set of program code, and wherein the processor is configured to invoke the stored program code to perform the method of any of claims 1-8.
- 10. A system for automatically identifying sensitive data for de-identification, adapted to the method for automatically identifying sensitive data for de-identification according to any one of claims 1-8, the system comprising: a data acquisition and processing module, used for acquiring multi-source heterogeneous data sources, performing standardized preprocessing on the data to generate data units to be processed, and extracting and storing metadata of the multi-source heterogeneous data sources; a sensitive data identification module, used for constructing a dynamic sensitive data ontology library, performing sensitive data identification on the data units to be processed according to the dynamic sensitive data ontology library to obtain sensitive entities, recording the identification confidence coefficient of each sensitive entity, judging the data units to be processed according to the identification confidence coefficient, grading the data units based on the judgment result, and generating a full-dataset sensitive data distribution statistical report according to the grading result; a de-identification processing module, used for performing de-identification processing on the sensitive entity based on the metadata to obtain a processing completion unit, and performing a data consistency check between the processing completion unit and the sensitive entity; an evaluation and verification module, used for performing re-identification risk quantitative evaluation on the processing completion unit, generating a re-identification risk quantitative evaluation report, and implementing a closed-loop optimization mechanism according to the evaluation result; and a full life cycle management and control module, used for outputting the data that pass compliance verification to a target platform, synchronously outputting a full-link processing log, a re-identification risk assessment report, a compliance verification report and the full-dataset sensitive data distribution statistical report to the target platform, and performing full life cycle management and control on the output data.
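The two-stage identification of claims 4 and 5 can be illustrated with a minimal Python sketch. It is not the patented implementation. The regular expressions, the `SensitiveEntity` type, the function names and the 0.8 threshold are all hypothetical, and the large-model stage of step S222 is replaced by a hand-written stand-in result.

```python
import re
from dataclasses import dataclass

# ASSUMPTION: illustrative patterns only; the claims do not list concrete formats.
FIXED_PATTERNS = {
    "mobile_number": re.compile(r"(?<!\d)1[3-9]\d{9}(?!\d)"),
    "id_card_number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}


@dataclass
class SensitiveEntity:
    label: str
    text: str
    confidence: float  # 1.0 stands for the 100% assigned in step S221


def identify_fixed(unit):
    """Step S221 sketch: regex hits become entities with confidence 1.0."""
    return [
        SensitiveEntity(label, match.group(), 1.0)
        for label, pattern in FIXED_PATTERNS.items()
        for match in pattern.finditer(unit)
    ]


def route(entities, threshold):
    """Step S223 sketch: split entities into (recorded, manual_recheck).

    ASSUMPTION: an entity exactly at the threshold is recorded; the claim
    only covers the strictly higher and strictly lower cases.
    """
    recorded = [e for e in entities if e.confidence >= threshold]
    manual_recheck = [e for e in entities if e.confidence < threshold]
    return recorded, manual_recheck


unit = "Call 13812345678, ID 11010119900101123X."
fixed = identify_fixed(unit)
# Stand-in for the output of the dedicated large model of step S222.
model_hits = [SensitiveEntity("diagnosis", "chronic hepatitis", 0.62)]
recorded, manual_recheck = route(fixed + model_hits, threshold=0.8)
```

In this sketch the two regex hits are recorded directly, while the lower-confidence model hit is queued for manual rechecking, which is the point at which claim 5 feeds the rechecking result back into model optimization.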
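The threshold search of claim 8 can likewise be sketched in Python under several stated assumptions. The claim does not say how the pointers move, how the three sub-lists are cut, or how the distribution judgment value is computed from the three variances. The sketch below therefore tries every pointer pair, places the left pointer's element in the low-order list and the right pointer's element in the high-order list, and uses the negated sum of the three variances as a hypothetical judgment value. The function name `risk_thresholds` and the `judgment` parameter are illustrative only.

```python
from collections import Counter
from statistics import pvariance


def risk_thresholds(probabilities, judgment=None):
    """Return the two probability values selected as risk division thresholds."""
    if judgment is None:
        # ASSUMPTION: the claim does not define the judgment value; here a
        # partition scores higher when each sub-list has more uniform frequencies.
        judgment = lambda low, mid, high: -(low + mid + high)

    # Frequency distribution list: (zbs, cpl(zbs)) pairs ordered by zbs.
    freq = sorted(Counter(probabilities).items())
    if len(freq) < 3:
        raise ValueError("need at least three distinct probability values")
    counts = [count for _, count in freq]

    best = None
    # ASSUMPTION: exhaustive search over pointer pairs with a non-empty middle list.
    for left in range(len(freq) - 2):
        for right in range(left + 2, len(freq)):
            score = judgment(
                pvariance(counts[: left + 1]),       # low-order list
                pvariance(counts[left + 1 : right]),  # middle-order list
                pvariance(counts[right:]),            # high-order list
            )
            if best is None or score > best[0]:
                best = (score, left, right)

    _, left, right = best
    return freq[left][0], freq[right][0]


probs = [0.1] * 5 + [0.2] * 5 + [0.5] * 2 + [0.6] * 2 + [0.8] * 9 + [0.9] * 9
low_cut, high_cut = risk_thresholds(probs)
```

For this sample the frequencies form three uniform runs, so the search selects 0.2 and 0.8 as the thresholds. A different judgment function can be passed in through the `judgment` parameter once the intended formula is known.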
Description
Method, device and system for automatically identifying sensitive data to perform de-identification processing
Technical Field
The invention belongs to the field of data security, relates to sensitive data identification and processing technology, and in particular relates to a method, a device and a system for automatically identifying sensitive data for de-identification processing.
Background
The prior art has the following defects when identifying and processing sensitive data:
1. The sensitive data multilayer identification method with publication number CN113642030A is mainly used for identifying and processing sensitive data. It extracts specific classifications of the sensitive data and relies mainly on a single identification mode of regular-expression rules and keyword matching, so its identification range is limited: it can only identify structured sensitive data with fixed formats, such as identity card numbers and mobile phone numbers, and cannot effectively identify sensitive information without a fixed format in unstructured data (such as descriptions of medical conditions in medical records, business clauses in contracts, and whereabouts in chat records). Some schemes instead rely on a single deep learning model, lack fusion verification between rules and models, have a high false alarm rate, and cannot meet compliance requirements.
2. Existing sensitive data identification and processing has a low degree of automation. Sensitive data identification, de-identification processing and compliance verification are mostly independent tools; data import, rule configuration and result verification must be completed through manual intervention, so processing efficiency is low and labor cost is high. A full-process audit trail mechanism is lacking, so compliance traceability requirements cannot be met.
3. Existing schemes only complete the de-identification operation and do not perform quantitative re-identification risk evaluation on the processing result, which very easily leads to false de-identification and exposes enterprises to compliance penalties. They also cannot automatically optimize the processing strategy according to the risk result, and manual adjustment is inefficient.
Therefore, we propose a method, device and system for automatically identifying sensitive data to perform de-identification processing.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a method, a device and a system for automatically identifying sensitive data to perform de-identification processing. To achieve this aim, the invention adopts the following technical scheme: a method for automatically identifying sensitive data to perform de-identification processing, comprising the following specific working process:
Step S1, acquiring a multi-source heterogeneous data source, and performing standardized preprocessing to generate a data unit to be processed;
Step S2, constructing a dynamic sensitive data ontology library, identifying sensitive data of the data units to be processed according to the dynamic sensitive data ontology library to obtain sensitive entities, recording the identification confidence coefficient of each sensitive entity, judging the data units to be processed according to the identification confidence coefficient, grading the data units based on the judgment result, and generating a full-dataset sensitive data distribution statistical report according to the grading result;
Step S3, performing de-identification processing on the sensitive entity based on the metadata to obtain a processing completion unit, and performing data consistency verification between the processing completion unit and the sensitive entity;
Step S4, performing re-identification risk quantitative evaluation on the processing completion unit, and implementing a closed-loop optimization mechanism according to the evaluation result;
Step S5, outputting the data that pass compliance verification to a target platform, synchronously outputting a full-link processing log, a re-identification risk assessment report, a compliance verification report and the full-dataset sensitive data distribution statistical report to the target platform, and performing full life cycle management and control on the output data.
Further, the specific steps of step S1 are as follows:
Step S11, counting the structure types of the multi-source heterogeneous data sources, and classifying the multi-source heterogeneous data sources by structure type to obtain single-structure data sources; the standardized model sequentially receives the single-structure data source, performs standardized processing on the single-structure data source to generate single-structure standard data, inputs the single-structure standard data into the identification classification model, and generates a plurality of data fragments and data summaries corresponding to the da