EP-4738139-A1 - CHANGE EVENTS STREAM VIA UNIFIED DIFFERENCE DATA ACCESS LAYER FOR DATA PROTECTION PLATFORMS
Abstract
A computing system implementing a data management platform comprises processing means. The processing means may expose, via a single application programming interface executed by the data management platform, a unified difference data access layer that provides an abstraction layer by which to obtain difference data between two or more events corresponding to protected data, and interface, via the single application programming interface, with the unified difference data access layer to obtain the difference data. The processing means may also publish the difference data to a change event stream, receive, from an application, a request to access at least a portion of the difference data published to the change event stream, and output, responsive to the request and to the application, at least the portion of the difference data published to the change event stream.
Inventors
- GUPTA, APURV
- BAJAJ, RUPESH
- ARON, MOHIT
- AGARWAL, Akshat
- GUTURI, VENKATA RANGA RADHANIKANTH
- DUTTAGUPTA, ANIRVAN
- KEDAR, IDAN
Assignees
- Cohesity, Inc.
Dates
- Publication Date
- 20260506
- Application Date
- 20250730
Claims (15)
- A method for a data management platform, the method comprising: exposing, via a single application programming interface executed by the data management platform, a unified difference data access layer that provides an abstraction layer by which to obtain difference data between two or more events corresponding to protected data; interfacing, via the single application programming interface, with the unified difference data access layer to obtain the difference data; publishing the difference data to a change event stream; receiving, from an application, a request to access at least a portion of the difference data published to the change event stream; and responsive to the request, outputting, to the application, at least the portion of the difference data published to the change event stream.
- The method of claim 1, wherein publishing the difference data comprises incrementally publishing the difference data to the change event stream as the difference data is obtained from the unified difference data access layer via the single application programming interface.
- The method of claim 1 or claim 2, wherein the unified difference data access layer executes within a data plane computing cluster of a computing system located in at least one of: a same region as a primary source in which the two or more events occur, a computing cluster on which the two or more events occur, or a computing cluster having a lowest computational cost to download data that is subjected to the two or more events.
- The method of any preceding claim, wherein the two or more events include two or more backups, two or more snapshots, or two or more archives.
- The method of any preceding claim, wherein receiving the request comprises receiving, from the application, a subscription request identifying at least the portion of the difference data published to the change event stream that is to be output to the application.
- The method of claim 5, wherein the subscription request identifies one or more filters to be applied to the difference data published to the change event stream in order to identify at least the portion of the difference data.
- The method of any preceding claim, wherein publishing the difference data comprises publishing the difference data according to an extensible schema, and wherein the method further comprises publishing the extensible schema to enable the application to parse at least the portion of the difference data output to the application.
- The method of any preceding claim, further comprising adaptively scheduling ingestion of one or more of metadata and content data based on one or more of a service level agreement for a primary source on which the two or more events are performed, a load on the primary source, and a change rate on the primary source.
- The method of any preceding claim, further comprising executing one or more tools for connecting to one or more primary sources on which the two or more events are performed.
- The method of any preceding claim, further comprising: publishing occurrence of the two or more events to an event message queue; and responsive to receiving a notification that at least one of the two or more events was published to the event message queue, interfacing with the unified difference data access layer.
- The method of any preceding claim, wherein the difference data includes one or more of metadata descriptive of a data item within an object to which the two or more events are performed and content data of the data item within the object to which the two or more events are performed.
- A computing system implementing a data management platform, the computing system comprising: processing means configured to: expose, via a single application programming interface executed by the data management platform, a unified difference data access layer that provides an abstraction layer by which to obtain difference data between two or more events corresponding to protected data; interface, via the single application programming interface, with the unified difference data access layer to obtain the difference data; publish the difference data to a change event stream; receive, from an application, a request to access at least a portion of the difference data published to the change event stream; and responsive to the request, output, to the application, at least the portion of the difference data published to the change event stream.
- The computing system of claim 12, wherein the processing means is configured to incrementally publish the difference data to the change event stream as the difference data is obtained from the unified difference data access layer via the single application programming interface.
- The computing system of claim 12 or claim 13, wherein the unified difference data access layer executes within a data plane computing cluster of a computing system located in at least one of: a same region as a primary source in which the two or more events occur, a computing cluster on which the two or more events occur, or a computing cluster having a lowest computational cost to download data that is subjected to the two or more events.
- One or more computer-readable storage media storing instructions that, when executed, cause processing means to perform the method of any of claims 1 to 11.
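As an illustration only, the steps recited in claims 1, 2, 5, and 6 can be sketched in a few lines of Python. The sketch assumes a snapshot can be modeled as a dictionary mapping an item path to a content fingerprint and uses an in-memory list as the change event stream. All names (`ChangeEvent`, `obtain_diff`, `ChangeEventStream`, `publish_diff`) are hypothetical and do not come from the disclosure, which does not prescribe any particular implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterator, List

# Assumption: a snapshot maps an item path to a content fingerprint.
Snapshot = Dict[str, str]


@dataclass(frozen=True)
class ChangeEvent:
    path: str
    change: str  # "created", "modified", or "deleted"


def obtain_diff(older: Snapshot, newer: Snapshot) -> Iterator[ChangeEvent]:
    """Hypothetical stand-in for the single API: yield difference data lazily."""
    for path in sorted(older.keys() | newer.keys()):
        if path not in older:
            yield ChangeEvent(path, "created")
        elif path not in newer:
            yield ChangeEvent(path, "deleted")
        elif older[path] != newer[path]:
            yield ChangeEvent(path, "modified")


class ChangeEventStream:
    """In-memory stand-in for the change event stream."""

    def __init__(self) -> None:
        self._events: List[ChangeEvent] = []

    def publish(self, event: ChangeEvent) -> None:
        self._events.append(event)

    def read(self, *filters: Callable[[ChangeEvent], bool]) -> List[ChangeEvent]:
        """Return the published events that satisfy every supplied filter."""
        return [e for e in self._events if all(f(e) for f in filters)]


def publish_diff(older: Snapshot, newer: Snapshot, stream: ChangeEventStream) -> int:
    """Publish each event as soon as it is obtained; return how many were published."""
    count = 0
    for event in obtain_diff(older, newer):
        stream.publish(event)
        count += 1
    return count


backup_1 = {"/docs/a.txt": "h1", "/docs/b.txt": "h2", "/logs/x.log": "h3"}
backup_2 = {"/docs/a.txt": "h1", "/docs/b.txt": "h9", "/docs/c.txt": "h4"}

stream = ChangeEventStream()
published = publish_diff(backup_1, backup_2, stream)

# An application's request with one filter: only changes under /docs/.
docs_only = stream.read(lambda e: e.path.startswith("/docs/"))
```

Here the generator models incremental publication (events reach the stream as they are produced), and the filter callables passed to `read` play the role of the filters identified in a subscription request.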
Description
TECHNICAL FIELD
This disclosure relates to data management in computing systems.
BACKGROUND
Data is commonly queried to retrieve specific information or datasets from storage systems, enabling data analysis, data recovery, data mining, forensic analysis, and compliance with regulatory requirements. Data may include metadata defining characteristics of the data, including file system metadata concerning file creation, file edit, file deletion, file structure, creator, owner, modification timestamps, etc.
A document may be a file created and digitally stored. Documents can include PDFs, spreadsheets, emails, text files, word processor files, HTML, XML, transcripts, and presentations, for example. In some cases, text of the documents can be transcribed from media (e.g., speech transcription), encoded in the documents or visible in media (e.g., text displayed in a video, such as closed captioning), or otherwise represented in media.
In some instances, various applications executed by a data management platform may compare snapshots of the data (where snapshots may refer to incremental or full backups) to determine differences in metadata or other data between the snapshots. Each application may compute a distinct and separate difference (which may be referred to as a "diff") between two snapshots for purposes of further analysis, such as data analysis, data recovery, data mining, forensic analysis, and/or compliance with regulatory requirements. Working from a diff may, for example, reduce computing resource consumption by considering only changes to a subset of the data rather than the full set of data.
SUMMARY
According to various aspects of the techniques described in this disclosure, a data management platform may expose a unified difference data access layer (via a single application programming interface, or API) by which to access differences (which may, again, be referred to as "diffs") in data between two snapshots (each of which, again, may refer to an incremental backup or a full backup). Rather than compute various diffs differently to achieve different forms of analysis (which may result in a fragmented code base that is difficult to support), the data management platform may expose a unified diff data access layer (UDDAL) by which to request diffs in a uniform and extensible manner. The single API may be invoked to publish a change event stream (which may also be referred to as a "delta stream") that may be referenced by a number of different applications requesting diffs between two different snapshots. Such a request may be limited to diffs in data (which may also be referred to as "content data"), metadata, or both. Metadata may define characteristics of the content data, including file system metadata concerning file creation, file edit, file deletion, file structure, creator, owner, modification timestamps, etc.
The techniques may provide one or more technical advantages that facilitate one or more practical applications. Existing data management platforms for interacting with diffs may include a number of different applications (which may be referred to as "apps") that generate separate differences to achieve different levels of analysis. Each of the apps may generate the diffs between two snapshots differently or in a proprietary manner. This makes the code base difficult to manage: a change made to one app for a particular diff may not carry over to a different app, so each app must be maintained separately.
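The idea that a single entry point can serve diffs limited to content data, metadata, or both can be illustrated with a minimal Python sketch. It assumes each snapshot is a dictionary from item path to an `Item` holding a metadata dictionary and raw content bytes. The `Item` type, the `diff_snapshots` function, and the `scope` parameter are hypothetical names chosen for illustration, not an interface defined by this disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Item:
    metadata: Dict[str, str]  # e.g. owner or modification timestamp
    content: bytes


# Assumption: a snapshot maps an item path to its metadata and content.
Snapshot = Dict[str, Item]


def diff_snapshots(older: Snapshot, newer: Snapshot, scope: str = "both") -> List[dict]:
    """Return one record per changed item, limited to the requested scope."""
    if scope not in ("metadata", "content", "both"):
        raise ValueError("scope must be 'metadata', 'content', or 'both'")
    records = []
    for path in sorted(older.keys() | newer.keys()):
        before, after = older.get(path), newer.get(path)
        record = {"path": path}
        if scope in ("metadata", "both"):
            old_meta = before.metadata if before else None
            new_meta = after.metadata if after else None
            if old_meta != new_meta:
                record["metadata"] = {"before": old_meta, "after": new_meta}
        if scope in ("content", "both"):
            old_data = before.content if before else None
            new_data = after.content if after else None
            if old_data != new_data:
                record["content"] = {"before": old_data, "after": new_data}
        if len(record) > 1:  # keep only items that changed within the scope
            records.append(record)
    return records


snap_a = {
    "report.txt": Item({"owner": "alice"}, b"v1"),
    "notes.txt": Item({"owner": "bob"}, b"n"),
}
snap_b = {
    "report.txt": Item({"owner": "carol"}, b"v1"),  # metadata changed only
    "notes.txt": Item({"owner": "bob"}, b"n2"),     # content changed only
}

metadata_only = diff_snapshots(snap_a, snap_b, scope="metadata")
content_only = diff_snapshots(snap_a, snap_b, scope="content")
```

Because every caller goes through the same function, a change to how differences are computed is made once rather than in each app, which is the maintenance benefit the summary describes.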
The techniques may provide a unified diff data access layer (UDDAL) exposed via the single API that is invoked to produce a uniform change event stream that each of the apps may reference to retrieve one or more diffs. This UDDAL exposed via the single API may allow for a more uniform code base, where updates to the UDDAL are available to all apps by way of the change event stream with few, if any, edits to the apps. The techniques may provide advantages over conventional data management platforms in terms of unifying dataset analysis via the unified difference data access layer accessible via the single API. Rather than individually update the diff generation performed by each individual app (which may result in diffs having different characteristics), the UDDAL may provide the single API by which diffs can be generated in the form of the change event stream and filtered to expose only the changes that each of the various apps requires to perform further analysis. By limiting the number of updates required, apps may be developed and deployed more quickly (considering that testing of the tools and/or agent diff generation is reduced to a single instance rather than being performed individually for each app). Further, the single API allows for better extensibility in that only a single API needs to be updated to extend the functionality (in terms of generating diffs). In addition, the single API may produce a change event stream (whi