US-12626000-B2 - Advanced policy attribute derivation for data management using content-based datasets

Abstract

Embodiments apply data protection and control policies using content-based datasets by scanning data objects stored in the system to determine grouped data that is processed similarly with respect to data protection and access control operations defined by a policy. A dataset is produced comprising metadata of the scanned data objects of the grouped data. Actions performed on the dataset affect only the corresponding data objects referenced by the metadata. A policy attribute derivation (PAD) process determines a change in the policy that affects a subset of data objects of the dataset and dictates changed data protection and access control operations for this subset. The process then tags the change in the policy as a PAD tag on the dataset, so that the changed data protection and access control operations apply only to the subset and not to any remaining data objects of the dataset.
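
As an illustration of the dataset concept summarized above, the following minimal Python sketch builds a dataset by matching metadata selectors against a small in-memory scanning database. The record schema, the names `ObjectMetadata`, `Dataset`, and `build_dataset`, and the sample locations are all assumptions made for this example; they are not taken from the patent.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class ObjectMetadata:
    """Metadata record for one scanned data object (hypothetical schema)."""
    object_id: str
    location: str
    tags: frozenset[str]


@dataclass
class Dataset:
    """Logical collection of metadata; it references objects and does not copy them."""
    name: str
    members: list[ObjectMetadata] = field(default_factory=list)

    def object_ids(self) -> list[str]:
        return [m.object_id for m in self.members]


def build_dataset(name, scanning_db, selectors):
    """Return a dataset holding every record that carries all selector tags."""
    wanted = frozenset(selectors)
    return Dataset(name, [m for m in scanning_db if wanted <= m.tags])


# Hypothetical scanning database spanning two storage environments.
scanning_db = [
    ObjectMetadata("obj-1", "nas:/finance/q1.xlsx", frozenset({"finance", "xlsx"})),
    ObjectMetadata("obj-2", "s3://bucket/finance/q2.xlsx", frozenset({"finance", "xlsx"})),
    ObjectMetadata("obj-3", "nas:/eng/design.pdf", frozenset({"engineering", "pdf"})),
]

finance = build_dataset("finance-sheets", scanning_db, ["finance", "xlsx"])
print(finance.object_ids())  # ['obj-1', 'obj-2']
```

Because the dataset holds only metadata references, an operation issued against `finance` would reach `obj-1` and `obj-2` regardless of where they are stored, and would leave `obj-3` untouched.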

Inventors

Adam Brenner
Jehuda Shemer
Steven Sadhwani
Valerie Lotosh
Erez Sharvit

Assignees

DELL PRODUCTS L.P.

Dates

Publication Date: 20260512
Application Date: 20221028

Claims (6)

1. A method of applying data protection and control policies in a data processing system using defined content-based datasets, comprising: storing metadata of content data stored in disparate storage environments in a scanning database; creating a protection policy comprising a query representing data to be protected by the protection policy; executing the query comprising metadata selectors as dataset tags for matching against the scanning database to generate a dataset comprising a logical collection of metadata for unstructured data objects grouped together by one or more filters, and that represents data categorized by the user for a specific need and that is to be processed similarly with respect to data protection and access control operations defined by the protection policy, and further wherein actions performed on the dataset will affect only the corresponding data objects referenced by the metadata; first determining a change of the dataset as reflected in a change in the protection policy as applied to a subset of data objects referenced by the dataset and due to an organizational factor and an external factor; second determining the change in the protection policy affecting the subset of data objects of the dataset and dictating changed data protection and access control operations applied to this subset; and tagging the change in the policy as a policy attribute derivation (PAD) tag to the dataset to affect the application of the changed data protection and access control operations only to the subset and not any remaining data objects of the dataset, wherein the PAD tag comprises a compound PAD tag combining at least one external PAD tag and a hierarchical PAD tag, wherein for the hierarchical PAD tag, the change in the policy results from a hierarchical directory relationship of a lower level data object in relation to a higher level data object, and for the external PAD tag, the change in the policy results from a change in circumstances of the subset of data objects, and includes one of a change in data location or an evolution of data over time.
2. The method of claim 1 wherein the dataset defines a single data access unit for the referenced data objects, and further wherein the protection policy comprises a defined protection policy that controls processing of the referenced data objects as a single unit based on data content rather than data location in a file directory of the system, and comprises at least one of data backup operations, data restore operations, data move operations, and data tiering operations.
3. The method of claim 2 wherein the protection policy further comprises control operations comprising at least one of: defining access permissions to the data objects by users of the system, or enforcing security measures on the data objects through encryption.
4. The method of claim 1 wherein the policy attribute derivation PAD tag comprises an alphanumeric label appended as a classifier tag to the dataset associated with the dataset, and that may be appended to one or more classifier tags to modify protection or control operations dictated by the other classifier tags.
5. The method of claim 1 wherein the metadata selectors comprise tags consisting of alphanumeric strings applied to respective data objects based on user-defined rules, and wherein the tags define at least one of a file type, name, location, creation time, or characteristic.
6. A system for applying data protection and control policies in a data processing environment using defined content-based datasets, the system comprising: one or more processors; and a memory storing instructions which, when executed by the one or more processors, cause the one or more processors to: store metadata of content data stored in disparate storage environments in a scanning database; create a protection policy comprising a query representing data to be protected by the protection policy, and execute the query comprising metadata selectors as dataset tags for matching against the scanning database to generate a dataset comprising a logical collection of metadata for unstructured data objects grouped together by one or more filters, and that represents data categorized by the user for a specific need and that is to be processed similarly with respect to data protection and access control operations defined by the protection policy, and further wherein actions performed on the dataset will affect only the corresponding data objects referenced by the metadata; first determine a change of the dataset as reflected in a change in the protection policy as applied to a subset of data objects referenced by the dataset and due to an organizational factor and an external factor; second determine the change in the protection policy affecting the subset of data objects of the dataset and dictating changed data protection and access control operations applied to this subset; and tag the change in the policy as a policy attribute derivation (PAD) tag to the dataset to affect the application of the changed data protection and access control operations only to the subset and not any remaining data objects of the dataset, wherein the PAD tag comprises a compound PAD tag combining at least one external PAD tag and a hierarchical PAD tag, wherein for the hierarchical PAD tag, the change in the policy results from a hierarchical directory relationship of a lower level data object in relation to a higher level data object, and for the external PAD tag, the change in the policy results from a change in circumstances of the subset of data objects, and includes one of a change in data location or an evolution of data over time.
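
The compound PAD tag recited in the claims can be pictured with a short Python sketch. Everything here is an assumption made for illustration: the `PadTag` structure, the helper names, the sample paths and policy fields, and in particular the rule that component tags are applied in order with later tags overriding earlier ones, which the claims do not specify.

```python
from dataclasses import dataclass


@dataclass
class PadTag:
    """One PAD tag (hypothetical structure, not taken from the claims)."""
    label: str
    kind: str          # "hierarchical" or "external"
    subset: frozenset  # ids of the data objects the change applies to
    overrides: dict    # operation name -> changed setting


def objects_under(paths, directory):
    """Ids of objects stored below `directory` in the file hierarchy."""
    prefix = directory.rstrip("/") + "/"
    return frozenset(oid for oid, path in paths.items() if path.startswith(prefix))


def effective_operations(base_policy, object_id, compound_tag):
    """Apply each component tag whose subset contains the object; later tags win."""
    operations = dict(base_policy)
    for tag in compound_tag:
        if object_id in tag.subset:
            operations.update(tag.overrides)
    return operations


# Hypothetical dataset members and the policy applied to the whole dataset.
paths = {
    "obj-1": "/finance/2022/q1.xlsx",
    "obj-2": "/finance/2022/audit/q2.xlsx",
    "obj-3": "/finance/2021/q4.xlsx",
}
base_policy = {"backup": "daily", "retention_days": 30, "encrypt": False}

# Hierarchical tag: the change follows from a directory relationship.
hierarchical = PadTag("PAD-H-AUDIT", "hierarchical",
                      objects_under(paths, "/finance/2022/audit"),
                      {"retention_days": 365})
# External tag: the change follows from a change in circumstances, e.g. a move.
external = PadTag("PAD-X-MOVED", "external", frozenset({"obj-2"}), {"encrypt": True})
compound_tag = [hierarchical, external]

for oid in sorted(paths):
    print(oid, effective_operations(base_policy, oid, compound_tag))
```

In this sketch only `obj-2` falls in the tagged subset, so it alone receives the longer retention and encryption settings; `obj-1` and `obj-3` keep the base policy, mirroring the requirement that changed operations apply only to the subset and not to the remaining data objects.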

Description

TECHNICAL FIELD

Embodiments are generally directed to large-scale data storage systems and more specifically to classifying evolving data using content-based datasets.

BACKGROUND

Enterprise data is scaling to extreme sizes in present business ecosystems. Users have traditionally relied on a single person or a small team of people to understand and manage all the data for a company. In the context of data protection, this would be the backup administrator or system admin team. Backup administrators would work with data owners who produce and consume the data, and would create lifecycle policies on the data so that data would be backed up, restored, moved, or deleted according to known rules. These rules or policies could cover anything from when to tier, archive, back up, and delete the data, in accordance with appropriate company and legal requirements.

As the sheer amount of data has grown, however, such users have had to change their operating models. A single person or team simply cannot scale to handle these increases. They thus must choose among a few options to keep up with the increase in data, such as growing the team, investing in automation, and/or moving the responsibilities of data management to the creators of the data, while overseeing compliance.

While the operating model has changed, one element has not: lifecycle rules are very data specific. This means that the person creating the lifecycle rules has to know where the data exists, who created the data, and for how long the data needs to be saved. Present methods of handling the management of data lifecycles in the context of very large and dynamic datasets are simply unable to keep up with ever increasing management demands, such as when the incoming rate of data exceeds the capacity to manage the data lifecycles. For example, it is forecasted that volumes of unstructured data in enterprise environments will grow to exabyte scales in the future.
This explosive growth in data will not come from a single source or process, but will instead come from many areas within a user environment, such as core networks, edge devices, public/cloud networks, and so on. Moreover, data will be generated by automated processes and consumed by other processes due to the size, volume, and variety of data.

As datasets grow, there is usually a need for more granular control over protection and control policies, or over external events and parameters that may affect the policy. In addition, targeted changes affecting a dataset may need to be applied in an efficient manner, rather than in a way that affects all defined datasets.

What is needed, therefore, is a dataset processing system providing central management of data based on its content rather than physical location or directory location, and that provides granular and targeted control over processing different data objects belonging to a dataset.

The subject matter discussed in the background section should not be assumed to be prior art merely as a result of its mention in the background section. Similarly, a problem mentioned in the background section or associated with the subject matter of the background section should not be assumed to have been previously recognized in the prior art. The subject matter in the background section merely represents different approaches, which in and of themselves may also be inventions. EMC, Networker, Data Domain, and Data Domain Restorer are trademarks of DellEMC Corporation.

BRIEF DESCRIPTION OF THE DRAWINGS

In the following drawings like reference numerals designate like structural elements. Although the figures depict various examples, the one or more embodiments and implementations described herein are not limited to the examples depicted in the figures.

FIG. 1 is a diagram of a large-scale network implementing a large-scale dataset management process for content-based data protection, under some embodiments.
FIG. 2 illustrates creating datasets from metadata for unstructured files and objects, under some embodiments.
FIG. 3 illustrates data residing among different operating environments processed as a single dataset, under some embodiments.
FIG. 4 is a diagram illustrating components of the dataset management processing component, under some embodiments.
FIG. 5 illustrates protection policies composed of one or more data queries that find data in a data catalog based on file metadata, under some embodiments.
FIG. 6 illustrates an example of datasets and data catalogs used in data protection software, under some embodiments.
FIG. 7 is a flowchart that illustrates a method of tracking changes in a change file list, under some embodiments.
FIG. 8A illustrates the constitution of a dataset, under some embodiments.
FIG. 8B illustrates a catalog storing information making up a dataset, under some embodiments.
FIG. 9 is a flowchart illustrating a method of managing datasets lifecycles, under some embodiments.
FIG. 10 illustrates an example of semi structure-aw