CN-122020711-A - Mass data desensitization method and system for distributed storage

CN 122020711 A

Abstract

The invention relates to the technical field of data processing, and in particular to a mass data desensitization method and system for distributed storage. The method comprises: extracting the numerical extrema (maximum and minimum values) of a target field over the file range; constructing a statistical vector from the numerical extrema of the target field and the total number of rows; constructing a row group value interval for each row group corresponding to the target field; extracting prefix data of the numerical extrema in each row group value interval; determining an interval overlap index from the start event points and end event points of the row group value intervals; performing span correction according to the difference between the numerical extrema in the statistical vector to determine a sparseness index; determining prefix uniqueness in the prefix data corresponding to all row group value intervals; combining the interval overlap index, the sparseness index and the prefix uniqueness to generate a semantic label for the target field; and dynamically injecting a desensitization operator at query time according to the semantic label of the target field. The invention improves the precision with which sensitive data is desensitized.

Inventors

LI ZHUOHAO
WU HUIHAO

Assignees

深圳市青云端信息科技有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-01-19

Claims (10)

1. A mass data desensitization method for distributed storage, characterized by comprising the following steps: reading the file metadata block at the tail of a columnar storage file, traversing the column fields in the file metadata block, taking any column field as a target field, extracting the numerical extrema (maximum and minimum values) of the target field over the file range, constructing a statistical vector from the numerical extrema of the target field and the total number of rows, constructing a row group value interval for each row group corresponding to the target field, and extracting prefix data of the numerical extrema in each row group value interval; determining an interval overlap index according to the start event points and end event points of the row group value intervals, performing span correction according to the difference between the numerical extrema in the statistical vector to determine a sparseness index, and determining prefix uniqueness in the prefix data corresponding to all row group value intervals; combining the interval overlap index, the sparseness index and the prefix uniqueness to generate a semantic label corresponding to the target field; and dynamically injecting a desensitization operator at query time according to the semantic label of the target field.
2. The mass data desensitization method for distributed storage according to claim 1, wherein extracting the numerical extrema of the target field over the file range further comprises mapping a non-numeric field into a numerical interval.
3. The mass data desensitization method for distributed storage according to claim 1, wherein determining the interval overlap index according to the start event points and end event points of the row group value intervals comprises: acquiring the start event point and the end event point of each row group value interval; sorting the event points of all row group value intervals by their corresponding values to obtain an event point sequence; and determining the current interval overlap index according to the overlap region between adjacent event points in the event point sequence.
4. The mass data desensitization method for distributed storage according to claim 1, wherein performing span correction according to the difference between the numerical extrema in the statistical vector to determine the sparseness index comprises: when the numerical extrema in the same statistical vector differ, comparing the difference between the numerical extrema with the total number of rows of the corresponding field to obtain the sparseness index.
5. The mass data desensitization method for distributed storage according to claim 4, wherein comparing the difference between the numerical extrema with the total number of rows of the corresponding field to obtain the sparseness index comprises: taking the difference between the maximum value and the minimum value as a globally corrected value span; and obtaining the sparseness index from the difference between the globally corrected value span and the total number of rows of the field.
6. The mass data desensitization method for distributed storage according to claim 1, wherein determining prefix uniqueness in the prefix data corresponding to all row group value intervals comprises: constructing a prefix set from the prefix data corresponding to all row group value intervals; and calculating the proportion of non-repeated prefix data in the prefix set as the prefix uniqueness.
7. The mass data desensitization method for distributed storage according to claim 1, wherein combining the interval overlap index, the sparseness index and the prefix uniqueness to generate the semantic label corresponding to the target field comprises: judging whether a first exemption condition is met according to the interval overlap index; when the first exemption condition is met, generating a sequence label for the target field; when the first exemption condition is not met, judging whether a second exemption condition is met according to the sparseness index; when the second exemption condition is met, generating a sparse label for the target field; when the second exemption condition is not met, judging whether a third desensitization condition is met according to the prefix uniqueness; when the third desensitization condition is met, generating a sensitive label for the target field; and when the third desensitization condition is not met, generating an unknown label for the target field.
8. The mass data desensitization method for distributed storage according to claim 7, wherein dynamically injecting the desensitization operator at query time according to the semantic label of the target field comprises: generating an access control policy attribute according to the semantic label of the target field; and when the access control policy attribute is forced masking or default protection, the compute engine rewrites the execution plan, inserting a dynamic masking operator after the read operator and before the result output operator, wherein the dynamic masking operator comprises a desensitization function.
9. The mass data desensitization method for distributed storage according to claim 8, wherein generating the access control policy attribute according to the semantic label of the target field comprises: when the semantic label is a sequence label or a sparse label, determining that the corresponding access control policy attribute is an exemption policy; when the semantic label is a sensitive label, determining that the corresponding access control policy attribute is forced masking; and when the semantic label is an unknown label, determining that the corresponding access control policy attribute is default protection.
10. A mass data desensitization system for distributed storage, characterized in that the system comprises the following modules: a preprocessing module, configured to read the file metadata block at the tail of a columnar storage file, traverse the column fields in the file metadata block, take any column field as a target field, extract the numerical extrema of the target field over the file range, construct a statistical vector from the numerical extrema of the target field and the total number of rows, construct a row group value interval for each row group corresponding to the target field, and extract prefix data of the numerical extrema in each row group value interval; a feature analysis module, configured to determine an interval overlap index according to the start event points and end event points of the row group value intervals, perform span correction according to the difference between the numerical extrema in the statistical vector to determine a sparseness index, and determine prefix uniqueness in the prefix data corresponding to all row group value intervals; and a data desensitization module, configured to combine the interval overlap index, the sparseness index and the prefix uniqueness to generate a semantic label corresponding to the target field, and to dynamically inject a desensitization operator at query time according to the semantic label of the target field.
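The labelling and policy steps of claims 6 to 9 can be sketched as follows. This is a minimal illustration, not the patented implementation: the claims name the three conditions but this section does not state their thresholds or comparison directions, so the constants `MAX_OVERLAP_FOR_SEQUENCE`, `MIN_SPARSENESS_FOR_SPARSE` and `MIN_UNIQUENESS_FOR_SENSITIVE`, the direction of each comparison, the `mask_value` function and the list-of-operator-names plan model are all assumptions introduced for the example.

```python
from enum import Enum
from typing import List


class Label(Enum):
    SEQUENCE = "sequence"
    SPARSE = "sparse"
    SENSITIVE = "sensitive"
    UNKNOWN = "unknown"


class Policy(Enum):
    EXEMPT = "exempt"
    FORCED_MASKING = "forced_masking"
    DEFAULT_PROTECTION = "default_protection"


# Hypothetical thresholds; the claims do not give concrete values.
MAX_OVERLAP_FOR_SEQUENCE = 0.05
MIN_SPARSENESS_FOR_SPARSE = 1000.0
MIN_UNIQUENESS_FOR_SENSITIVE = 0.8

# Label-to-policy mapping as stated in claim 9.
POLICY_BY_LABEL = {
    Label.SEQUENCE: Policy.EXEMPT,
    Label.SPARSE: Policy.EXEMPT,
    Label.SENSITIVE: Policy.FORCED_MASKING,
    Label.UNKNOWN: Policy.DEFAULT_PROTECTION,
}


def prefix_uniqueness(prefixes: List[str]) -> float:
    """Proportion of distinct prefixes among all collected prefixes (claim 6)."""
    return len(set(prefixes)) / len(prefixes) if prefixes else 0.0


def classify(overlap: float, sparseness: float, uniqueness: float) -> Label:
    """Three-stage cascade of claim 7; comparison directions are assumed."""
    if overlap <= MAX_OVERLAP_FOR_SEQUENCE:         # first exemption condition
        return Label.SEQUENCE
    if sparseness >= MIN_SPARSENESS_FOR_SPARSE:     # second exemption condition
        return Label.SPARSE
    if uniqueness >= MIN_UNIQUENESS_FOR_SENSITIVE:  # third desensitization condition
        return Label.SENSITIVE
    return Label.UNKNOWN


def mask_value(value: str, keep: int = 4) -> str:
    """Hypothetical desensitization function: keep only the last `keep` characters."""
    if len(value) <= keep:
        return "*" * len(value)
    return "*" * (len(value) - keep) + value[-keep:]


def rewrite_plan(plan: List[str], policy: Policy) -> List[str]:
    """Insert a 'mask' operator before the output operator unless exempt (claim 8).

    The plan is modelled as a list of operator names whose first element is
    the read operator and whose last element is the result output operator.
    """
    if policy is Policy.EXEMPT:
        return list(plan)
    return [*plan[:-1], "mask", plan[-1]]
```

Under these assumed thresholds, a field with a high overlap index, low sparseness and high prefix uniqueness is labelled sensitive, mapped to forced masking, and a plan such as `["scan", "filter", "output"]` is rewritten so that the masking operator sits between the filter and the output.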

Description

Mass data desensitization method and system for distributed storage

Technical Field

The invention relates to the technical field of data processing, and in particular to a mass data desensitization method and system for distributed storage.

Background

In the big data infrastructure of the finance, telecom and internet industries, enterprises commonly build PB-scale data lakes on distributed file systems that use a columnar storage format. Columnar storage organizes mass data into blocks by row group and maintains column-level statistical metadata at the tail of each file. As data privacy compliance requirements grow, administrators must apply desensitization protection to the sensitive personal information stored there.

During desensitization, machine-generated primary keys are difficult to distinguish from sensitive data. Machine-generated business primary keys are widespread in data warehouses, and they resemble the codes used for sensitive social attributes in character set, length and format. Conventional content-based regular expression matching cannot reliably tell the two apart: a business ID that serves as a data warehouse join key is easily misjudged as sensitive data and desensitized, which invalidates join analysis across tables and destroys the core usability of the data warehouse.
Disclosure of Invention

In order to solve the technical problem of low precision when desensitizing sensitive data, the invention aims to provide a mass data desensitization method and system for distributed storage. The adopted technical solution is as follows:

In a first aspect, an embodiment of the present invention provides a mass data desensitization method for distributed storage, the method comprising: reading the file metadata block at the tail of a columnar storage file, traversing the column fields in the file metadata block, taking any column field as a target field, extracting the numerical extrema (maximum and minimum values) of the target field over the file range, constructing a statistical vector from the numerical extrema of the target field and the total number of rows, constructing a row group value interval for each row group corresponding to the target field, and extracting prefix data of the numerical extrema in each row group value interval; determining an interval overlap index according to the start event points and end event points of the row group value intervals, performing span correction according to the difference between the numerical extrema in the statistical vector to determine a sparseness index, and determining prefix uniqueness in the prefix data corresponding to all row group value intervals; combining the interval overlap index, the sparseness index and the prefix uniqueness to generate a semantic label corresponding to the target field; and dynamically injecting a desensitization operator at query time according to the semantic label of the target field.

Preferably, extracting the numerical extrema of the target field over the file range further comprises mapping a non-numeric field into a numerical interval.
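The metadata-only statistics described above can be sketched in a few lines. This is an illustrative reading under stated assumptions, not the patent's exact formulas: `RowGroupStats` stands in for the per-row-group minimum, maximum and row count that a columnar file footer exposes; `to_numeric` is one hypothetical order-preserving mapping for non-numeric fields; the overlap index is assumed to be the share of the covered value range where two or more row group intervals overlap; and the sparseness index is assumed to be the ratio of the global span to the total row count.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Value = Union[int, str]


@dataclass(frozen=True)
class RowGroupStats:
    """Extrema and row count of the target field in one row group."""
    min_value: Value
    max_value: Value
    num_rows: int


def to_numeric(value: Value, width: int = 8) -> int:
    """Map a value to an integer; strings use their first `width` UTF-8 bytes,
    which preserves byte-wise lexicographic order (an assumed mapping)."""
    if isinstance(value, int):
        return value
    raw = value.encode("utf-8")[:width].ljust(width, b"\0")
    return int.from_bytes(raw, "big")


def statistical_vector(groups: List[RowGroupStats]) -> Tuple[int, int, int]:
    """File-level (minimum, maximum, total rows) for the target field."""
    file_min = min(to_numeric(g.min_value) for g in groups)
    file_max = max(to_numeric(g.max_value) for g in groups)
    return file_min, file_max, sum(g.num_rows for g in groups)


def extrema_prefixes(groups: List[RowGroupStats], length: int = 4) -> List[str]:
    """Leading characters of each row group's minimum and maximum."""
    prefixes = []
    for g in groups:
        prefixes.append(str(g.min_value)[:length])
        prefixes.append(str(g.max_value)[:length])
    return prefixes


def overlap_index(groups: List[RowGroupStats]) -> float:
    """Sweep over sorted event points and measure the doubly covered share."""
    events = []
    for g in groups:
        events.append((to_numeric(g.min_value), +1))  # start event point
        events.append((to_numeric(g.max_value), -1))  # end event point
    # At equal values, end events sort first so touching intervals do not overlap.
    events.sort()
    covered = overlapped = depth = 0
    prev = None
    for value, delta in events:
        if prev is not None and depth >= 1:
            covered += value - prev
            if depth >= 2:
                overlapped += value - prev
        depth += delta
        prev = value
    return overlapped / covered if covered else 0.0


def sparseness_index(vector: Tuple[int, int, int]) -> float:
    """Global span divided by total rows; 0.0 when the extrema do not differ."""
    file_min, file_max, total_rows = vector
    span = file_max - file_min
    if span == 0 or total_rows == 0:
        return 0.0
    return span / total_rows
```

With this definition, row groups covering 1 to 100, 101 to 200 and 201 to 300 give an overlap index of 0.0, as would be expected of an auto-incrementing key written in order, while two row groups covering 0 to 100 and 50 to 150 give 1/3.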
Preferably, determining the interval overlap index according to the start event points and end event points of the row group value intervals comprises: acquiring the start event point and the end event point of each row group value interval; sorting the event points of all row group value intervals by their corresponding values to obtain an event point sequence; and determining the current interval overlap index according to the overlap region between adjacent event points in the event point sequence.

Preferably, performing span correction according to the difference between the numerical extrema in the statistical vector to determine the sparseness index comprises: when the numerical extrema in the same statistical vector differ, comparing the difference between the numerical extrema with the total number of rows of the corresponding field to obtain the sparseness index.

Preferably, comparing the difference between the numerical extrema with the total number of rows of the corresponding field to obtain the sparseness index comprises: taking the difference between the maximum value and the minimum value as a globally corrected value span; and obtaining the sparseness index from the difference between the globally corrected value span and the total number of rows of the field.

Preferably, determining prefix uniqueness in the prefix data corresponding to all row group value intervals comprises: constructing a prefix set from the prefix data corresponding to all row group value intervals; and calculating the proportion of non-repeated prefix data in the prefix set as the prefix uniqueness.

Preferably, combining the interval overlap index, the sparseness index and the prefix uniqueness to generate the semantic label corresponding to the target field comprises: judging whether a first