CN-121980597-A - Unified data lake metadata management method and system

Abstract

The invention discloses a unified data lake metadata management method and system. The method comprises: acquiring and screening multidimensional information of metadata in a data lake to determine sensitive information; evaluating the sensitivity of the metadata based on the sensitive information to obtain sensitivity evaluation values; classifying the metadata based on the sensitivity evaluation values to obtain a plurality of metadata sets; determining the security level of each metadata set based on the sensitivity evaluation values of the metadata in the set; determining the security management policy of each metadata set based on its security level; performing simulation management on the metadata in the data lake based on the security management policy; determining a simulation evaluation value of the security management policy based on the simulation results; and judging, according to the simulation evaluation value, whether to adjust the security management policy, so as to uniformly manage the metadata in the data lake. The invention constructs an intelligent, accurate, adaptive and efficient security management ecosystem for data lake metadata, and fundamentally improves the overall security protection capability and management level of the data lake.

Inventors

  • HAN SHUO
  • ZHANG QIANG
  • YE CHENG
  • GUO HAO
  • CHEN LEI
  • WANG LIJUAN
  • WU FANG
  • ZHI BINBIN
  • SUN JIYAO
  • LIU YUJIE

Assignees

  • 华能信息技术有限公司

Dates

Publication Date
2026-05-05
Application Date
2025-11-28

Claims (10)

  1. A unified data lake metadata management method, comprising: acquiring multidimensional information of metadata in a data lake, screening the multidimensional information, and determining sensitive information in the multidimensional information; evaluating the sensitivity of the metadata based on the sensitive information to obtain a sensitivity evaluation value, and classifying the metadata based on the sensitivity evaluation value to obtain a plurality of metadata sets; determining a security level of each metadata set based on the sensitivity evaluation values of the metadata in the metadata set, and determining a security management policy of the metadata set based on the security level; performing simulation management on the metadata in the data lake based on the security management policy, and determining a simulation evaluation value of the security management policy based on a simulation result; and judging whether to adjust the security management policy according to the simulation evaluation value, and uniformly managing the metadata in the data lake based on the security management policy before or after adjustment.
  2. The unified data lake metadata management method according to claim 1, wherein acquiring the multidimensional information of the metadata in the data lake, screening the multidimensional information, and determining the sensitive information in the multidimensional information comprises: acquiring the multidimensional information of the metadata in the data lake, and parsing the multidimensional information to obtain a plurality of pieces of information; determining the information content of each piece of information, and calculating the content similarity between the information content and each piece of sensitive content in a preset sensitive content database; and screening out, from the pieces of information, those whose content similarity is greater than a preset threshold, and determining the screened information as the sensitive information in the multidimensional information.
  3. The unified data lake metadata management method according to claim 2, wherein evaluating the sensitivity of the metadata based on the sensitive information to obtain the sensitivity evaluation value comprises: assigning an evaluation value to each piece of sensitive information contained in the metadata to obtain a sub-sensitive value of each piece of sensitive information; determining the content similarity corresponding to each piece of sensitive information, and normalizing the content similarities to obtain a weight for each piece of sensitive information; and calculating the sensitivity evaluation value of the metadata based on the sub-sensitive value of each piece of sensitive information and the corresponding weight.
  4. The unified data lake metadata management method according to claim 3, wherein the sensitivity evaluation value of the metadata is calculated as: $S = \sum_{i} \alpha_i P_i$, wherein $S$ is the sensitivity evaluation value of the metadata, $\alpha_i$ is the weight of the i-th piece of sensitive information, and $P_i$ is the sub-sensitive value of the i-th piece of sensitive information.
  5. The unified data lake metadata management method according to claim 3, wherein classifying the metadata based on the sensitivity evaluation value to obtain a plurality of metadata sets comprises: establishing a data set from the sensitivity evaluation values of the metadata, and randomly selecting k initial cluster centers from the data set; calculating the Euclidean distance from each sensitivity evaluation value in the data set to the initial cluster centers, and assigning each piece of metadata to the corresponding cluster according to that distance; calculating the mean sensitivity evaluation value of each cluster, and re-determining the cluster center from that mean; repeating the above steps until the cluster centers no longer change or the number of iterations reaches a preset iteration threshold, to obtain k clusters; and placing the metadata belonging to the same cluster in the same metadata set to obtain the plurality of metadata sets.
  6. The unified data lake metadata management method according to claim 5, wherein determining the security level of the metadata set based on the sensitivity evaluation values of the metadata in the metadata set comprises: adding up the sensitivity evaluation values of all metadata in the metadata set to obtain a sensitivity evaluation sum, and determining the security level of the metadata set based on the sensitivity evaluation sum; presetting a correspondence between preset security levels and sensitivity evaluation sum intervals, wherein each sensitivity evaluation sum interval is associated with a corresponding preset security level; and acquiring the sensitivity evaluation sum of the metadata set, identifying the sensitivity evaluation sum interval to which the sum belongs, and selecting the preset security level that the correspondence associates with that interval as the security level of the metadata set.
  7. The unified data lake metadata management method according to claim 6, wherein determining the security management policy of the metadata set based on the security level comprises: if the security level is the first security level, the security management policy of the metadata set is that access is open to all users, no encryption is required for storage or transmission, and no data desensitization is required; if the security level is the second security level, the security management policy of the metadata set is that access is restricted to internal members of the organization, storage and transmission use ordinary encryption, and data must be desensitized for unauthorized users; and if the security level is the third security level, the security management policy of the metadata set is that access is granted only upon approval, storage and transmission use strong encryption, and data must be desensitized in any unauthorized environment.
  8. The unified data lake metadata management method according to claim 7, wherein performing simulation management on the metadata in the data lake based on the security management policy, and determining the simulation evaluation value of the security management policy based on the simulation result, comprises: determining a preset simulation environment and a simulated attack, placing the metadata managed under the security management policy into the simulation environment for simulation management, and carrying out the simulated attack on the metadata in the simulation environment; determining the simulation result, wherein the simulation result comprises an attack success rate, a metadata leakage amount and a breakthrough time, and assigning an evaluation value to each of the attack success rate, the metadata leakage amount and the breakthrough time; and determining preset weights for the attack success rate, the metadata leakage amount and the breakthrough time, and computing the weighted sum of their evaluation values with the corresponding preset weights to obtain the simulation evaluation value of the security management policy.
  9. The unified data lake metadata management method according to claim 8, wherein judging whether to adjust the security management policy according to the simulation evaluation value, and uniformly managing the metadata in the data lake based on the security management policy before or after adjustment, comprises: determining a preset simulation evaluation threshold, and judging whether to adjust the security management policy based on the relation between the simulation evaluation value and the simulation evaluation threshold; if the simulation evaluation value is less than the simulation evaluation threshold, adjusting the security management policy, and uniformly managing the metadata in the data lake based on the adjusted security management policy; and if the simulation evaluation value is greater than or equal to the simulation evaluation threshold, leaving the security management policy unadjusted, and uniformly managing the metadata in the data lake directly based on the unadjusted security management policy.
  10. A unified data lake metadata management system, comprising: an acquisition module, configured to acquire multidimensional information of metadata in a data lake, screen the multidimensional information, and determine sensitive information in the multidimensional information; a classification module, configured to evaluate the sensitivity of the metadata based on the sensitive information to obtain a sensitivity evaluation value, and to classify the metadata based on the sensitivity evaluation value to obtain a plurality of metadata sets; a determining module, configured to determine the security level of each metadata set based on the sensitivity evaluation values of the metadata in the metadata set, and to determine the security management policy of the metadata set based on the security level; a simulation module, configured to perform simulation management on the metadata in the data lake based on the security management policy, and to determine a simulation evaluation value of the security management policy based on a simulation result; and an adjustment module, configured to judge whether to adjust the security management policy according to the simulation evaluation value, and to uniformly manage the metadata in the data lake based on the security management policy before or after adjustment.
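
The clustering and level-assignment steps of claims 5 and 6 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `kmeans_1d` and `security_level`, the fixed random seed, the half-open interval convention, and the interval bounds in the usage note are all hypothetical.

```python
import random


def kmeans_1d(values, k, max_iter=100, seed=0):
    """Cluster scalar sensitivity evaluation values; returns (centers, labels)."""
    rng = random.Random(seed)            # fixed seed is an assumption, for repeatability
    centers = rng.sample(values, k)      # k randomly selected initial cluster centers
    labels = [0] * len(values)
    for _ in range(max_iter):            # max_iter plays the role of the iteration threshold
        # In one dimension the Euclidean distance is the absolute difference.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            # Keep the previous center if a cluster ends up empty (assumption).
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:       # centers no longer change
            break
        centers = new_centers
    return centers, labels


def security_level(scores, intervals):
    """Map the sum of a set's scores to the level whose [low, high) interval holds it."""
    total = sum(scores)
    for low, high, level in intervals:
        if low <= total < high:
            return level
    raise ValueError(f"no interval covers sum {total}")
```

For example, with hypothetical intervals `[(0, 5, "first"), (5, 20, "second"), (20, float("inf"), "third")]`, a metadata set whose sensitivity evaluation values are `[9.0, 9.5, 10.0]` has a sum of 28.5 and is assigned the third level.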

Description

Unified data lake metadata management method and system
Technical Field
The invention relates to the field of data management, and in particular to a unified data lake metadata management method and system.
Background
As data-driven decision-making takes hold in enterprises, the data lake, with its large storage capacity and flexible architecture, has become core infrastructure for holding massive, heterogeneous data. However, the open "store first, govern later" model of the data lake brings serious security challenges along with its convenience. Data assets are numerous and varied, and the large amount of metadata they contain may itself expose sensitive information. Traditional data security management relies mainly on static policies and manual classification. Faced with the rapidly changing, very large-scale metadata in a data lake, it suffers from delayed discovery of sensitive information, coarse security level division, experience-dependent policy deployment, and the lack of an effective verification mechanism. As a result, enterprise data security protection has many blind spots and lags behind changes in the data, data security and usability are difficult to balance accurately, and improperly configured access rights can even create a risk of internal data leakage.
Disclosure of Invention
In order to solve the above technical problems, the invention provides a unified data lake metadata management method and system. The method comprises: acquiring multidimensional information of metadata in a data lake, screening the multidimensional information, and determining sensitive information in the multidimensional information; evaluating the sensitivity of the metadata based on the sensitive information to obtain a sensitivity evaluation value, and classifying the metadata based on the sensitivity evaluation value to obtain a plurality of metadata sets; determining a security level of each metadata set based on the sensitivity evaluation values of the metadata in the metadata set, and determining a security management policy of the metadata set based on the security level; performing simulation management on the metadata in the data lake based on the security management policy, and determining a simulation evaluation value of the security management policy based on a simulation result; and judging whether to adjust the security management policy according to the simulation evaluation value, and uniformly managing the metadata in the data lake based on the security management policy before or after adjustment.
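The last two steps, which claims 8 and 9 spell out as a weighted sum of evaluation values followed by a threshold comparison, might look like the sketch below. The source does not say how the attack success rate, metadata leakage amount and breakthrough time are converted to evaluation values, so the conversions, the default weights, the one-hour cap and every name here (`SimulationResult`, `simulation_evaluation`, `needs_adjustment`) are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class SimulationResult:
    attack_success_rate: float  # fraction of simulated attacks that succeeded, 0..1
    leaked_fraction: float      # fraction of metadata leaked, 0..1 (assumed unit)
    breakthrough_time: float    # seconds until the first breach (assumed unit)


def simulation_evaluation(result, weights=(0.4, 0.4, 0.2), time_cap=3600.0):
    """Weighted sum of per-metric evaluation values; higher means a stronger policy."""
    # Assumed conversions: each metric becomes a value in [0, 1], higher is safer.
    v_attack = 1.0 - result.attack_success_rate
    v_leak = 1.0 - result.leaked_fraction
    v_time = min(result.breakthrough_time, time_cap) / time_cap
    w_attack, w_leak, w_time = weights  # hypothetical preset weights
    return w_attack * v_attack + w_leak * v_leak + w_time * v_time


def needs_adjustment(value, threshold):
    """Adjust the policy only when the value is below the threshold."""
    return value < threshold
```

Scoring every metric so that higher is safer keeps the sketch consistent with the rule that a value below the threshold triggers adjustment, while a value equal to or above it leaves the policy unchanged.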
Further, acquiring the multidimensional information of the metadata in the data lake, screening the multidimensional information, and determining the sensitive information in the multidimensional information includes: acquiring the multidimensional information of the metadata in the data lake, and parsing the multidimensional information to obtain a plurality of pieces of information; determining the information content of each piece of information, and calculating the content similarity between the information content and each piece of sensitive content in a preset sensitive content database; and screening out, from the pieces of information, those whose content similarity is greater than a preset threshold, and determining the screened information as the sensitive information in the multidimensional information.
Further, evaluating the sensitivity of the metadata based on the sensitive information to obtain the sensitivity evaluation value includes: assigning an evaluation value to each piece of sensitive information contained in the metadata to obtain a sub-sensitive value of each piece of sensitive information; determining the content similarity corresponding to each piece of sensitive information, and normalizing the content similarities to obtain a weight for each piece of sensitive information; and calculating the sensitivity evaluation value of the metadata based on the sub-sensitive value of each piece of sensitive information and the corresponding weight.
Further, the sensitivity evaluation value of the metadata is calculated as: $S = \sum_{i} \alpha_i P_i$, wherein $S$ is the sensitivity evaluation value of the metadata, $\alpha_i$ is the weight of the i-th piece of sensitive information, and $P_i$ is the sub-sensitive value of the i-th piece of sensitive information.
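A minimal sketch of the screening and scoring steps described above is given here. The source names neither a similarity measure nor a normalization, so this illustration assumes a character-level ratio from Python's `difflib`, takes the best match against the sensitive content database as the similarity of a piece of information, and normalizes by dividing each similarity by the sum of similarities. The function names and the sample strings in the usage note are hypothetical.

```python
from difflib import SequenceMatcher


def content_similarity(a, b):
    """Stand-in similarity in [0, 1]; the source does not specify the measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def screen_sensitive(pieces, sensitive_db, threshold):
    """Return (piece, similarity) pairs whose best similarity exceeds the threshold."""
    hits = []
    for piece in pieces:
        # Assumption: use the highest similarity over all sensitive contents.
        best = max(content_similarity(piece, s) for s in sensitive_db)
        if best > threshold:
            hits.append((piece, best))
    return hits


def sensitivity_score(hits, sub_values):
    """Weighted sum of sub-sensitive values, with weights from normalized similarities."""
    total = sum(sim for _, sim in hits)
    if total == 0:
        return 0.0  # no sensitive information found
    # Assumed normalization: weight_i = similarity_i / sum of similarities.
    return sum((sim / total) * sub_values[piece] for piece, sim in hits)
```

As a usage example, screening the pieces `["id_card_number", "created_at"]` against a database `["id card number", "phone number"]` with a threshold of 0.8 keeps only `id_card_number`. With similarities 0.9 and 0.6 and sub-sensitive values 10 and 5, the weights are 0.6 and 0.4 and the weighted sum is 8.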
Further, classifying the metadata based on the sensitivity evaluation value to obtain a plurality of metadata sets includes: establishing a data set from the sensitivity evaluation values of the metadata, and randomly selecting k initial cluster centers from the data set; calculating the Euclidean distance from each sensitivity evaluation value in the data set to the initial cluster centers, and assigning each piece of metadata to the corresponding cluster according to the Euclidean distance from the sensitivity evaluation valu