US-20260127147-A1 - SYSTEM AND METHOD FOR ROW-GROUP DEDUPLICATION USING CHANGE DATA CAPTURE IN DATABASE BACKUPS

Abstract

A method and system for updating backup data for a database is presented. The method includes initiating a merging operation for a defined time interval; obtaining change data capture (CDC) data generated during the defined time interval; aggregating and ordering the CDC data by a key value; identifying backup data associated with the ordered CDC data; generating a candidate data object that represents data modified during the defined time interval; deduplicating the candidate data object by comparing a hash identifier of the candidate data object with a plurality of hash identifiers stored in a repository; storing the candidate data object when the comparison determines that no matching hash identifier exists in the repository; and updating backup metadata to associate the key value with the candidate data object.
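The steps in the abstract can be illustrated with a minimal sketch. It is not the applicant's implementation: the `Repository` class, the `merge_interval` function, the `(key, op, row)` event shape, the use of SHA-256 as the hash identifier, and the JSON serialization of the candidate data object are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import dataclass, field


@dataclass
class Repository:
    """Hypothetical in-memory object store keyed by hash identifier."""
    objects: dict = field(default_factory=dict)

    def put_if_absent(self, payload):
        """Store payload only if its hash is new; return (hash, stored_flag)."""
        digest = hashlib.sha256(payload).hexdigest()  # assumed hash function
        if digest in self.objects:
            return digest, False
        self.objects[digest] = payload
        return digest, True


def merge_interval(cdc_events, backup_rows, repo, metadata):
    """Merge one interval of CDC events into the affected backup rows.

    cdc_events: iterable of (key, op, row) in log order; op is "upsert" or "delete".
    backup_rows: dict of key -> row for the backup data the events touch.
    metadata: dict mapping the first key of a data object to its hash identifier.
    """
    # Aggregate: a later event for a key overwrites any earlier one.
    latest = {}
    for key, op, row in cdc_events:
        latest[key] = (op, row)

    # Order by key value and apply the changes to the identified backup data.
    merged = dict(backup_rows)
    for key in sorted(latest):
        op, row = latest[key]
        if op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = row

    # Candidate data object: deterministic serialization of the merged rows.
    ordered = sorted(merged.items())
    payload = json.dumps(ordered, sort_keys=True).encode("utf-8")

    # Deduplicate by hash identifier, then associate the key value with the object.
    digest, stored = repo.put_if_absent(payload)
    if ordered:
        metadata[ordered[0][0]] = digest
    return digest, stored
```

Running the same interval twice against the same repository stores the object once; the second call only returns the existing hash identifier, which is the deduplication outcome the abstract describes.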

Inventors

Assaf Natanzon
Yaniv Ptashnik
Dmitry Kuznetsov
Ofir Ehrlich
Peleg Kazaz
Ron Kimchi
Ran Mizrachi
Sigal Weiner

Assignees

Eon IO, Ltd.

Dates

Publication Date: 2026-05-07
Application Date: 2025-04-30

Claims (20)

1. A method for updating backup data for a database, the method comprising: initiating a merging operation for a defined time interval; obtaining change data capture (CDC) data generated during the defined time interval; aggregating and ordering the CDC data by a key value; identifying backup data associated with the ordered CDC data; generating a candidate data object that represents data modified during the defined time interval; deduplicating the candidate data object by comparing a hash identifier of the candidate data object with a plurality of hash identifiers stored in a repository; storing the candidate data object when the comparison determines that no matching hash identifier exists in the repository; and updating backup metadata to associate the key value with the candidate data object.
2. The method of claim 1, wherein initiating the merging operation begins in response to at least one of expiration of a timer or a volume of CDC data exceeding a predefined threshold.
3. The method of claim 1, wherein ordering the CDC data further comprises: sorting a plurality of aggregated records by the key value, wherein the key value is used to segment data objects referenced by the backup data.
4. The method of claim 1, wherein updating the backup metadata further comprises: storing a new version of a manifest while retaining at least one earlier version of the manifest such that both versions remain available for recovery.
5. The method of claim 1, wherein aggregating the CDC data further comprises: retaining only a last recorded change that occurred within the defined time interval and discarding earlier changes of the key value.
6. The method of claim 1, wherein identifying the backup data comprises evaluating a manifest that maps one or more ranges corresponding to the key value to data objects stored in a repository.
7. The method of claim 1, further comprising: obtaining the CDC data in successive files captured at the defined time interval, wherein the time interval is fifteen minutes or less.
8. The method of claim 1, further comprising: integrating the updated backup data with previously stored backup data to generate an updated full backup of a database.
9. The method of claim 8, further comprising: generating rollback data that enables reconstruction of the backup data that existed prior to the updated full backup of the database.
10. The method of claim 1, further comprising: accessing a baseline backup and verifying its integrity by recomputing at least a strong hash identifier for at least one data object prior to generating the updated backup data.
11. A non-transitory computer-readable medium storing a set of instructions for updating backup data for a database, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the device to: initiate a merging operation for a defined time interval; obtain change data capture (CDC) data generated during the defined time interval; aggregate and order the CDC data by a key value; identify backup data associated with the ordered CDC data; generate a candidate data object that represents data modified during the defined time interval; deduplicate the candidate data object by comparing a hash identifier of the candidate data object with a plurality of hash identifiers stored in a repository; store the candidate data object when the comparison determines that no matching hash identifier exists in the repository; and update backup metadata to associate the key value with the candidate data object.
12. A system for updating backup data for a database comprising: a processing circuitry; a memory, the memory containing instructions that, when executed by the processing circuitry, configure the system to: initiate a merging operation for a defined time interval; obtain change data capture (CDC) data generated during the defined time interval; aggregate and order the CDC data by a key value; identify backup data associated with the ordered CDC data; generate a candidate data object that represents data modified during the defined time interval; deduplicate the candidate data object by comparing a hash identifier of the candidate data object with a plurality of hash identifiers stored in a repository; store the candidate data object when the comparison determines that no matching hash identifier exists in the repository; and update backup metadata to associate the key value with the candidate data object.
13. The system of claim 12, wherein initiating the merging operation begins in response to at least one of expiration of a timer or a volume of CDC data exceeding a predefined threshold.
14. The system of claim 12, wherein the processing circuitry, when ordering the CDC data, is further configured to sort a plurality of aggregated records by the key value, wherein the key value is used to segment data objects referenced by the backup data.
15. The system of claim 12, wherein the processing circuitry, when updating the backup metadata, is further configured to store a new version of a manifest while retaining at least one earlier version of the manifest such that both versions remain available for recovery.
16. The system of claim 12, wherein the processing circuitry, when aggregating the CDC data, is further configured to retain only a last recorded change that occurred within the defined time interval and discard earlier changes of the key value.
17. The system of claim 12, wherein the processing circuitry, when identifying the backup data, is configured to evaluate a manifest that maps one or more ranges corresponding to the key value to data objects stored in a repository.
18. The system of claim 12, wherein the CDC data is obtained in successive files captured at the defined time interval, and wherein the time interval is fifteen minutes or less.
19. The system of claim 12, wherein the memory contains further instructions which, when executed by the processing circuitry, further configure the system to: integrate the updated backup data with previously stored backup data to generate an updated full backup of a database.
20. The system of claim 19, wherein the memory contains further instructions which, when executed by the processing circuitry, further configure the system to: generate rollback data that enables reconstruction of the backup data that existed prior to the updated full backup of the database.
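Several of the dependent claims can be made concrete with a short sketch: the merge trigger (claims 2 and 13), the manifest that maps key ranges to data objects (claims 6 and 17), and the retention of earlier manifest versions (claims 4 and 15). This is an illustration under stated assumptions rather than the claimed implementation: `should_merge`, `VersionedManifest`, the inclusive `(start_key, end_key, object_hash)` range tuples, and the 64 MiB volume threshold are hypothetical. The 900-second default reflects the fifteen-minute bound in claims 7 and 18.

```python
import bisect
import time


def should_merge(last_merge_ts, pending_bytes, now=None,
                 max_age_s=900, max_bytes=64 * 1024 * 1024):
    """Return True when the timer expired or pending CDC volume exceeds the threshold."""
    now = time.time() if now is None else now
    return (now - last_merge_ts) >= max_age_s or pending_bytes > max_bytes


class VersionedManifest:
    """Hypothetical manifest: inclusive key ranges -> object hash, all versions kept."""

    def __init__(self):
        # Each version is a sorted list of (start_key, end_key, object_hash).
        self.versions = [[]]

    def current(self):
        return self.versions[-1]

    def lookup(self, key, version=-1):
        """Return the hash of the object whose range covers key, or None."""
        entries = self.versions[version]
        starts = [entry[0] for entry in entries]
        i = bisect.bisect_right(starts, key) - 1
        if i >= 0 and entries[i][0] <= key <= entries[i][1]:
            return entries[i][2]
        return None

    def commit(self, replacements):
        """Append a new version in which replacements supersede overlapping ranges."""
        kept = [
            entry for entry in self.current()
            if not any(r[0] <= entry[1] and entry[0] <= r[1] for r in replacements)
        ]
        self.versions.append(sorted(kept + list(replacements)))
        return len(self.versions) - 1
```

Because `commit` appends rather than overwrites, a lookup against an earlier version index still resolves to the object that was current at that point, which is one simple way to keep both versions available for recovery.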

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation-in-part of U.S. Non-Provisional application Ser. No. 18/940,450, filed on Nov. 7, 2024, now pending, the content of which is hereby incorporated by reference.

TECHNICAL FIELD

The present disclosure relates generally to database backup and restoration, and more particularly to systems and methods that perform row-group deduplication using Change Data Capture (CDC).

BACKGROUND

Enterprises rely on database backups to protect data against loss, corruption, or service outages. Common strategies include full, incremental, and differential copies, each striking a different balance between storage overhead and recovery time. Large production systems may combine these strategies with the “3-2-1” rule, maintaining multiple copies with at least one off-site replica, to satisfy stringent recovery point objective (RPO) and recovery time objective (RTO) targets.

Cloud deployments introduce additional constraints. Transferring multi-terabyte databases to object storage can take hours, and restoring an entire image back into a live environment may expose end users to prolonged downtime. Latency is compounded when a chain of incremental backups must be replayed before the database reaches a usable state.

One existing approach reduces network traffic and storage consumption by dividing exported table data into content-defined “row groups.” Each group is assigned a cryptographic hash; if that hash already appears in an earlier backup, the group is skipped, eliminating duplicate storage. While effective for periodic full table scans, this technique still requires reading every row of the database at the start of each backup cycle.

Modern databases often provide Change Data Capture (CDC) streams in the form of ordered logs that record every insert, update, and delete operation. Integrating these fine-grained CDC streams with row-group deduplication presents several obstacles. First, deduplicating at single-row granularity would require maintaining a strong hash entry for every changed row, dramatically increasing metadata volume and processing overhead. Second, keeping long sequences of raw CDC files can prolong a restore, because bringing a database to a recent point in time may require replaying thousands of small log segments and consume significant input/output (I/O) bandwidth. In addition, CDC logs reference individual rows, whereas deduplication operates on multi-row groups, making it difficult to correlate fine-grained changes with the existing hash catalogue without rescanning large portions of table data.

It would therefore be advantageous to provide a solution that would overcome the challenges noted above.

SUMMARY

A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later.
For convenience, the term “some embodiments” or “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure. A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation causes or cause the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by data processing apparatus, cause the apparat