CN-122019855-A - Sensitive data identification and classification method for enterprise data management


Abstract

The invention provides a sensitive data identification and classification method for enterprise data management and relates to the technical field of data processing. The method acquires and preprocesses enterprise multi-type source data comprising structured, semi-structured and unstructured data; performs distributed training on the preprocessed data using a plurality of algorithm models selected for the different data types, obtaining a plurality of candidate identification models; comprehensively evaluates the candidate models with a multi-dimensional evaluation system covering classification performance, efficiency and resource consumption, model robustness and service applicability; selects a target identification model according to the evaluation result; and applies the target identification model to the identification and classification of sensitive information in enterprise data. The method is thereby intended to be targeted at, and practically applicable to, complex enterprise data management scenarios.

Inventors

WU YANAN
ZHOU YING
LI HAOYU
LI YAN
LI QIANG

Assignees

China United Network Communications Co., Ltd., Software Research Institute (中国联合网络通信有限公司软件研究院)

Dates

Publication Date: 20260512
Application Date: 20251209

Claims (9)

1. A method for identifying and classifying sensitive data for enterprise data governance, the method comprising: step S100, acquiring enterprise multi-type source data and preprocessing the enterprise multi-type source data to generate standardized training data, wherein the enterprise multi-type source data comprises structured data, semi-structured data and unstructured data; step S200, based on the standardized training data, selecting N algorithm models from a preset algorithm model library, different algorithm models being selected for different types of data, and performing distributed training to generate N candidate recognition models; step S300, performing multi-dimensional evaluation on the N candidate recognition models to generate evaluation results, wherein the multi-dimensional evaluation comprises a classification performance evaluation dimension, an efficiency and resource consumption evaluation dimension and a model robustness evaluation dimension; and step S400, determining a target recognition model from the N candidate recognition models based on the evaluation results, and identifying and grading sensitive data based on the target recognition model to generate a recognition grading result.
2. The method for identifying and classifying sensitive data for enterprise data governance according to claim 1, wherein acquiring the enterprise multi-type source data and preprocessing the enterprise multi-type source data to generate standardized training data comprises: defining and constructing a label system for machine learning model recognition according to industry data classification and grading standard documents, wherein the label system comprises classification labels and grading labels; acquiring the enterprise multi-type source data and annotating the enterprise multi-type source data based on the label system, wherein the classification labels comprise sensitive data labels and non-sensitive data labels, and the grading labels comprise public labels, internal labels and confidential labels; and preprocessing the annotated enterprise multi-type source data to generate the standardized training data.
3. The method for identifying and classifying sensitive data for enterprise data governance according to claim 1, wherein the data types of the standardized training data include text data, image data and audio data; for the text data, a Bidirectional Encoder Representations from Transformers (BERT) model or a long short-term memory (LSTM) network model is selected and loaded as a base model; for the image data, a convolutional neural network model is selected and loaded as a base model; and for the audio data, a recurrent neural network model or a spectrogram-based convolutional neural network model is selected and loaded as a base model.
4. The method for identifying and classifying sensitive data for enterprise data governance according to claim 3, wherein the models are trained in parallel on the standardized training data using a distributed training framework, mixed-precision training is enabled and gradient handling is configured during the training process, and training hyperparameters are configured for each training process.
5. The method for identifying and classifying sensitive data for enterprise data governance according to claim 1, wherein the indicators adopted by the classification performance evaluation dimension comprise accuracy, precision, recall, F1 score and area under the ROC curve (AUC-ROC); the indicators adopted by the efficiency and resource consumption evaluation dimension comprise training time, prediction time, memory usage and GPU utilization; and the model robustness evaluation dimension is obtained through K-fold cross-validation or bias-variance analysis.
6. The method for identifying and classifying sensitive data for enterprise data governance according to claim 1, wherein the multi-dimensional evaluation further comprises a business availability evaluation dimension, the business availability evaluation dimension being obtained by means of model interpretability analysis, business indicator comparison and A/B testing.
7. The method for identifying and classifying sensitive data for enterprise data governance according to claim 3, wherein the text data is divided into structured text data and unstructured text data, and training for the unstructured text data is performed by fine-tuning a pre-trained model, the pre-trained model being LLaMA-2 or DeepSeek-R1.
8. The method for identifying and classifying sensitive data for enterprise data governance according to claim 7, wherein determining a target recognition model from the N candidate recognition models based on the evaluation results, and identifying and grading sensitive data in enterprise data to be processed based on the target recognition model to generate a recognition grading result, comprises: acquiring the evaluation result of each evaluation dimension, calculating a multi-dimensional score for each candidate recognition model based on a predefined weighted scoring strategy, and selecting the candidate recognition model with the highest multi-dimensional score as the target recognition model; and scanning the enterprise data to be processed and extracting its features based on the target recognition model to generate the recognition grading result.
9. The method for identifying and classifying sensitive data for enterprise data governance according to claim 8, wherein the recognition grading result includes a data identifier, a sensitivity type, a sensitivity level and a prediction confidence; and the recognition grading result is stored in a preset database and an index is established.
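
Illustrative sketch (not part of the claims): the weighted selection recited in claim 8 could look roughly like the following. The claims state only that the weighted scoring strategy is "predefined", so the weight values, the assumption that each dimension is first normalized to the range 0 to 1 with higher being better, and the names `Evaluation`, `WEIGHTS`, `weighted_score` and `select_target_model` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Evaluation:
    """Per-dimension results for one candidate model.

    Assumption: every value is already normalized to [0, 1], higher is better.
    """
    classification: float  # e.g. derived from F1 score or AUC-ROC
    efficiency: float      # e.g. derived from prediction time and memory usage
    robustness: float      # e.g. derived from K-fold cross-validation stability
    applicability: float   # e.g. derived from business indicator comparison


# Hypothetical weights; the source does not disclose concrete values.
WEIGHTS: Mapping[str, float] = {
    "classification": 0.4,
    "efficiency": 0.2,
    "robustness": 0.2,
    "applicability": 0.2,
}


def weighted_score(evaluation: Evaluation,
                   weights: Mapping[str, float] = WEIGHTS) -> float:
    """Return the weighted average of the evaluation dimensions."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(getattr(evaluation, dim) * w for dim, w in weights.items())
    return weighted_sum / total_weight


def select_target_model(candidates: Mapping[str, Evaluation],
                        weights: Mapping[str, float] = WEIGHTS) -> str:
    """Return the name of the candidate with the highest weighted score."""
    if not candidates:
        raise ValueError("no candidate models to choose from")
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


if __name__ == "__main__":
    # Made-up evaluation values for two hypothetical candidates.
    candidates = {
        "bert": Evaluation(0.92, 0.50, 0.85, 0.70),
        "lstm": Evaluation(0.80, 0.90, 0.70, 0.60),
    }
    print(select_target_model(candidates))
```

Dividing by the total weight keeps the score on the same 0 to 1 scale even if the configured weights do not sum exactly to one; how raw indicators such as training time are mapped onto that scale is left open here, as it is in the claims.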

Description

Sensitive data identification and classification method for enterprise data management

Technical Field

The invention relates to the technical field of data processing, and in particular to a sensitive data identification and classification method for enterprise data management.

Background

In the field of enterprise data management, accurate identification and classification of sensitive data is a core precondition for guaranteeing data security and meeting compliance requirements. The prior art usually adopts rule matching, or trains a dedicated identification model for a specific business scenario (for example, processing only database tables or documents in a specific format). Enterprise data, however, takes increasingly complex forms, including structured data in databases, semi-structured data in logs and interfaces, and unstructured data such as documents and pictures. On the one hand, rule-based methods struggle to adapt to changeable data content and formats, are costly to maintain and have limited coverage. On the other hand, models built for a single data type or scenario generalize poorly and cannot process an enterprise's heterogeneous data assets under a unified framework. In recent years, advanced artificial intelligence technology represented by large language models has shown strong representation capability, but applying it directly to sensitive information identification over massive, multi-modal enterprise data generally suffers from heavy consumption of computing power, high processing latency and a lack of systematic guidance for model optimization.
Meanwhile, existing solutions often focus on basic indicators such as algorithm accuracy and lack a comprehensive evaluation system covering model processing efficiency, resource consumption, robustness and applicability in real business scenarios. As a result, the effect of a deployed model is uncertain, and the model is difficult to integrate closely with an enterprise's actual data management processes and security control requirements. The prior art therefore has the following technical problems:

First, existing sensitive data identification methods mostly depend on rules or on isolated models designed for specific data types or business scenarios, and lack a comprehensive governance framework that can uniformly and efficiently handle structured, semi-structured and unstructured data within the enterprise. Such methods therefore adapt poorly, are complex to maintain and are difficult to extend.

Second, methods based on adaptive learning or large models consume substantial computing resources and process data slowly when handling massive, multi-modal enterprise data. In addition, evaluation of model performance is often limited to basic classification indicators, and there is no systematic, multi-dimensional mechanism for comprehensive evaluation and optimal selection that covers algorithm efficiency, resource efficiency, robustness and service applicability, so the reliability, economy and interpretability of the model in actual business deployment cannot be ensured.
Disclosure of Invention

To achieve the above purpose, the invention is realized by the following technical solution. A sensitive data identification and classification method for enterprise data management comprises the following steps:

Step S100, acquiring enterprise multi-type source data and preprocessing the enterprise multi-type source data to generate standardized training data, wherein the enterprise multi-type source data comprises structured data, semi-structured data and unstructured data;

Step S200, based on the standardized training data, selecting N algorithm models from a preset algorithm model library, different algorithm models being selected for different types of data, and performing distributed training to generate N candidate recognition models;

Step S300, performing multi-dimensional evaluation on the N candidate recognition models to generate evaluation results, wherein the multi-dimensional evaluation comprises a classification performance evaluation dimension, an efficiency and resource consumption evaluation dimension and a model robustness evaluation dimension;

Step S400, determining a target recognition model from the N candidate recognition models based on the evaluation results, and identifying and grading sensitive data based on the target recognition model to generate a recognition grading result.

Further, acquiring the enterprise multi-type source data and preprocessing the enterprise multi-type source data to generate standardized training data comprises: defining and constructing a label system for machine learning model recognition according to industry data classification and grading standard documents,