CN-121980611-A - AI-based data desensitization processing method

Abstract

The invention discloses an AI-based data desensitization processing method comprising three steps: first, preprocessing multi-modal data and intelligently identifying sensitive fields; second, intelligently matching desensitization strategies; and third, performing multidimensional verification and adaptive optimization of the desensitization effect. Sensitive fields are identified with high precision and efficiency by an identification framework that fuses a rule engine with a fine-tuned AI model. The rule engine quickly identifies high-confidence sensitive fields, which improves identification efficiency. The fine-tuned AI model is trained on an industry-specific sample set, which gives it good industry adaptability and improves identification precision in complex scenarios. Identification conflicts are resolved using D-S evidence theory, and the overall identification accuracy is not lower than 96%, markedly better than existing single identification methods.

Inventors

ZHU JINZHOU
WANG KAI
GU LI

Assignees

独角鲸(湖州南浔)数据科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260123

Claims (10)

1. An AI-based data desensitization processing method, comprising the following steps: a first step of multi-modal data preprocessing and intelligent sensitive-field identification, in which standardized preprocessing is performed on the input multi-modal data, and sensitive-field identification and sensitivity classification are completed using an identification framework that fuses a rule engine with a fine-tuned AI model; a second step of intelligent desensitization strategy matching, in which a desensitization strategy knowledge base is called based on the attributes of the sensitive fields obtained in the first step and the preset usage scenario of the data, the optimal desensitization strategy is dynamically matched through a reinforcement learning algorithm, and the desensitization is executed; and a third step of multidimensional verification and adaptive optimization of the desensitization effect, in which compliance verification, data utility evaluation and attack testing are performed in sequence on the desensitized data, the desensitized data is output if verification passes, and, if verification fails, feedback is sent to the second step to adjust the desensitization strategy and the desensitization is repeated until verification passes.
2. The AI-based data desensitization processing method of claim 1, wherein the multi-modal data comprises structured data, semi-structured data and unstructured data, and the standardized preprocessing comprises: unifying field types and filling null values for the structured data; parsing formats and standardizing labels for the semi-structured data; and, for the unstructured data, extracting text information through OCR technology and then performing word segmentation and vector conversion, wherein the vector conversion uses a Word2Vec model with an output dimension of 256.
3. The AI-based data desensitization processing method, wherein the identification framework fusing a rule engine and a fine-tuned AI model in the first step operates as follows: the rule engine is pre-configured with 35 types of high-confidence regular expressions and sensitive keyword dictionaries, performs initial identification on the preprocessed data, and outputs high-confidence sensitive fields; the remaining data is input into the fine-tuned AI model for deep identification; the fine-tuned AI model is obtained by optimizing a BERT model and is fine-tuned on an industry-specific sensitive data sample set, with 100 training iterations and a learning rate of 2e-5; the identification results are fused, identification conflicts are resolved using D-S evidence theory, and a list of sensitive fields with their corresponding sensitivity levels is finally output, the sensitivity levels being divided into four levels, including confidential, internal and public.
4. The AI-based data desensitization processing method as set forth in claim 3, wherein the industry-specific sensitive data sample set comprises labeled samples from three core industries, namely finance, government affairs and medical care; the sample size for each industry is not less than 100,000; the labels of each sample comprise the sensitive field type, position and sensitivity level; the conflict coefficient threshold of the D-S evidence theory is set to 0.7; when the conflict coefficient is less than or equal to 0.7, the identification results are fused by a weighted average method; and when the conflict coefficient is greater than 0.7, the identification result of the fine-tuned AI model is taken as correct.
5. The AI-based data desensitization processing method as set forth in claim 1, wherein the desensitization strategy knowledge base in the second step comprises five core desensitization strategies, namely masking, replacement, encryption, generalization and differential privacy noise injection, each strategy corresponding to adaptation rules for different sensitivity levels and data types; the state space of the reinforcement learning algorithm is the combination of sensitive-field level, data type and usage scenario; the action space is the five desensitization strategies; and the reward function is defined as a weighted sum of a compliance score and a data utility score, with weights of 0.6 and 0.4 respectively.
6. The AI-based data desensitization processing method as set forth in claim 5, wherein the differential privacy noise injection strategy allocates a privacy budget based on the sensitivity level determined in the first step, the privacy budget taking the values ε=0.1, ε=0.3 and ε=0.5; the noise is generated using a Laplace mechanism, with the noise intensity inversely proportional to the privacy budget; and the noise is injected into non-critical feature bits of the data, so that the statistical distribution characteristics of the data remain unchanged.
7. The AI-based data desensitization processing method according to claim 1, wherein the compliance verification in the third step calls a preset compliance rule base covering the relevant requirements of the "Network Security Level Protection System 2.0", GB/T 35273 "Personal Information Security Specification" and the GDPR; a semantic matching algorithm compares the desensitized data with the compliance rules; if sensitive information violating the rules is present, the verification is judged as failed, otherwise it is judged as passed; and the compliance pass-rate threshold is set to 100%.
8. The AI-based data desensitization processing method of claim 1, wherein the data utility evaluation in the third step selects utility evaluation indexes according to the preset usage scenario of the data: for a statistical analysis scenario, the indexes are the mean deviation rate, variance deviation rate and quantile deviation rate, each with a deviation rate threshold of 5%; for a machine learning modeling scenario, the index is the reduction in model accuracy, with a reduction threshold of 10%; and the utility evaluation passes if all evaluation indexes meet their threshold requirements, and fails otherwise.
9. The AI-based data desensitization processing method according to claim 1, wherein the attack test in the third step uses two modes, a simulated inference attack and an association attack; the simulated inference attack constructs an attack model that attempts to infer the original sensitive information from the desensitized data, with an inference success rate threshold of 3%; the association attack links external public data in an attempt to match the original sensitive information, with a matching success rate threshold of 5%; and the attack test passes if the success rates of both attack modes are lower than their corresponding thresholds, and fails otherwise.
10. The AI-based data desensitization processing method as set forth in claim 1, further comprising a step of preserving evidence of the desensitization process, wherein fingerprint information of the data before and after desensitization is calculated using the SHA-256 hash algorithm, a data fingerprint is generated by combining it with a processing timestamp and an operator ID, the data fingerprint is digitally signed using the RSA algorithm, and the signed fingerprint is then written to a consortium blockchain for evidence preservation, so that the desensitization process is traceable and tamper-resistant.
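
Purely as an illustration, the numerical rules recited in claims 4, 5, 6 and 8 can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the patented implementation. All function names are hypothetical. The conflict coefficient assumes that each recognizer places its belief mass on single labels only. The fusion weight `w_rule=0.5` is assumed, since the claims do not state the weights. The Laplace scale assumes a query sensitivity of 1. Noise is added directly to values rather than to "non-critical feature bits". A deviation rate equal to the threshold is assumed to pass.

```python
import random


def conflict_coefficient(m_rule, m_model):
    """Dempster conflict K, assuming masses sit on single labels only."""
    return sum(
        p * q
        for a, p in m_rule.items()
        for b, q in m_model.items()
        if a != b
    )


def fuse(m_rule, m_model, w_rule=0.5, threshold=0.7):
    """Claim 4 rule: weighted average if K <= 0.7, else keep the model result.

    w_rule is a hypothetical weight; the claims do not state the weights.
    """
    k = conflict_coefficient(m_rule, m_model)
    if k > threshold:
        return dict(m_model), k
    labels = set(m_rule) | set(m_model)
    fused = {
        label: w_rule * m_rule.get(label, 0.0)
        + (1.0 - w_rule) * m_model.get(label, 0.0)
        for label in labels
    }
    return fused, k


def reward(compliance_score, utility_score):
    """Claim 5 reward: weighted sum with weights 0.6 and 0.4."""
    return 0.6 * compliance_score + 0.4 * utility_score


def laplace_scale(epsilon, sensitivity=1.0):
    """Laplace scale b = sensitivity / epsilon: smaller budget, more noise."""
    return sensitivity / epsilon


def laplace_noise(epsilon, rng, sensitivity=1.0):
    """One Laplace(0, b) draw: b times the difference of two Exp(1) draws."""
    b = laplace_scale(epsilon, sensitivity)
    return b * (rng.expovariate(1.0) - rng.expovariate(1.0))


def mean_deviation_rate(original, desensitized):
    """Relative change of the mean (assumes a nonzero original mean)."""
    mean_o = sum(original) / len(original)
    mean_d = sum(desensitized) / len(desensitized)
    return abs(mean_d - mean_o) / abs(mean_o)


if __name__ == "__main__":
    rule = {"id_number": 0.8, "phone": 0.2}
    model = {"id_number": 0.6, "phone": 0.4}
    fused, k = fuse(rule, model)
    print("conflict:", round(k, 2), "fused:", fused)

    rng = random.Random(0)  # fixed seed only to make the sketch repeatable
    values = [100.0, 102.0, 98.0, 101.0]
    noisy = [v + laplace_noise(0.5, rng) for v in values]
    print("mean deviation within 5%:", mean_deviation_rate(values, noisy) <= 0.05)
    print("reward:", reward(1.0, 0.5))
```

In this sketch a smaller ε yields a larger Laplace scale, matching the inverse relationship between noise intensity and privacy budget stated in claim 6.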

Description

AI-based data desensitization processing method

Technical Field

The invention belongs to the intersection of data security and artificial intelligence technology, and in particular relates to an AI-based data desensitization processing method.

Background

With the rapid development of the digital economy, data has become a core factor of production and plays an important role in government administration, enterprise operation, scientific research and innovation, and other fields. However, data faces a serious risk of privacy leakage during sharing, circulation and use. The leakage of sensitive data such as personal identity information, financial information and medical records not only harms the legal rights of individuals but may also violate relevant laws and regulations. Data desensitization is a key means of protecting data privacy: by processing sensitive information, it balances privacy protection against data use.

Existing data desensitization methods fall into two main types. The first is the traditional rule-based method, which identifies sensitive fields through preset rules such as regular expressions and keyword matching and then applies a fixed desensitization strategy. It is fast, but it has obvious defects: it identifies unstructured data poorly, cannot cope with sensitive content whose wording varies semantically, and has high miss and false-detection rates. The second is the AI-based method, which uses machine learning and deep learning models to improve the precision of sensitive-field identification. Most such methods, however, generalize poorly, do not take industry scenario characteristics into account, and match desensitization strategies rigidly, making it difficult to balance data compliance with use value.
In addition, existing desensitization methods have a single effect-verification mechanism: they verify desensitization only by simple data comparison, do not consider the utility of the data in its actual usage scenario, and do not test against potential attack risks. As a result, desensitized data may retain hidden privacy-leakage risks, or may lose its analytical value through excessive desensitization. For example, Chinese patent application publication No. CN117520774A discloses a method for detecting the effectiveness of data desensitization, which judges effectiveness by comparing the desensitized data table with the original data table, verifying the integrity of the desensitized fields and identifying sensitive data. That method, however, focuses only on the integrity of desensitization; it does not evaluate the use value of the data, nor does it consider the differing requirements that different industries and usage scenarios place on the desensitization strategy, so its applicability is limited. Another disclosed data desensitization method, based on a differential privacy algorithm, dynamically allocates the privacy budget, but it does not exploit the advantages of a rule engine in sensitive-field identification, so its identification efficiency is low, and it establishes no thorough attack-testing mechanism, so it cannot fully ensure the security of the desensitized data.

Therefore, developing an AI-based data desensitization processing method that accurately identifies sensitive fields, intelligently matches desensitization strategies, and balances compliance with data use value has become an urgent technical problem in the data security field. The present AI-based data desensitization processing method is designed to that end.
Disclosure of Invention

In view of the above, and to overcome the defects of the prior art, the invention provides an AI-based data desensitization processing method that effectively solves the problems raised in the Background. To achieve this purpose, the invention provides an AI-based data desensitization processing method comprising the following steps: step one, multi-modal data preprocessing and intelligent sensitive-field identification, performing standardized preprocessing on the input multi-modal data, and completing sensitive-field identification and sensitivity classification using an identification framework that fuses a rule engine with a fine-tuned AI model; step two, intelligent desensitization strategy matching, calling a desensitization strategy knowledge base based on the attributes of the sensitive fields obtained in step one and the preset usage scenario of the data, dynamically matching the optimal desensitization strategy through a reinforcement learning algorithm, and executing the desensitization; and step three, performing multidimensional ver