US-20260128901-A1 - TECHNIQUES FOR SEMANTIC VARIABLE-LENGTH DEDUPLICATION IN DATABASES

Abstract

A system and method for reducing duplication in database backup generation. The method may include reading a plurality of rows of a database. In addition, the method may include generating a secondary hash value for each row of the plurality of rows. The method may include generating a plurality of row groups, each including a group of unique rows of the plurality of rows, based at least on the secondary hash value. Moreover, the method may include generating a primary hash value for each row group of the plurality of row groups. Also, the method may include updating a backup of the database based on a first-row group of the plurality of row groups, in response to determining that a primary hash value of the first-row group does not match any primary hash value associated with the backup of the database.
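The flow summarized above can be illustrated with a minimal Python sketch. It is not the disclosed implementation. The function names, the use of SHA-256, the field separator, and the rule that a group closes after a row whose secondary hash is divisible by `divisor` are all assumptions made for illustration, and the backup is modeled as a plain dictionary keyed by primary hash.

```python
import hashlib


def secondary_hash(row):
    """Per-row hash of the row's content (SHA-256 and separator are assumptions)."""
    encoded = "\x1f".join(str(value) for value in row).encode("utf-8")
    return hashlib.sha256(encoded).digest()


def group_rows(rows, divisor=4):
    """Close a group after any row whose secondary hash is divisible by `divisor`.

    Boundaries depend only on row content, not on row position (assumed rule).
    """
    groups, current = [], []
    for row in rows:
        current.append(row)
        if int.from_bytes(secondary_hash(row)[:4], "big") % divisor == 0:
            groups.append(current)
            current = []
    if current:  # trailing rows that never hit a boundary
        groups.append(current)
    return groups


def primary_hash(group):
    """Hash of a whole row group, built here from its rows' secondary hashes."""
    return hashlib.sha256(b"".join(secondary_hash(r) for r in group)).hexdigest()


def update_backup(backup, rows, divisor=4):
    """Store only groups whose primary hash is absent; return how many were stored."""
    stored = 0
    for group in group_rows(rows, divisor):
        key = primary_hash(group)
        if key not in backup:  # no match in the backup: keep this group
            backup[key] = group
            stored += 1
    return stored
```

Under these assumptions, running `update_backup` twice on the same rows stores nothing the second time, and changing a single row causes only the group or groups around that row to be stored again.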

Inventors

Assaf Natanzon
Yaniv Ptashnik
Dmitry Kuznetsov
Ofir Ehrlich
Ron KIMCHI
Haim Ben-Shimol

Assignees

Eon IO, Ltd.

Dates

Publication Date: 20260507
Application Date: 20241107

Claims (13)

1. A method of reducing duplication in database backup generation, comprising: reading a plurality of rows of a database; generating a secondary hash value for each row of the plurality of rows; generating a plurality of row groups, each including a group of unique rows of the plurality of rows, based at least on the secondary hash value; generating a primary hash value for each row group of the plurality of row groups; updating a backup of the database based on a first-row group of the plurality of row groups, in response to determining that a primary hash value of the first-row group does not match any primary hash value associated with the backup of the database.
2. The method of claim 1, further comprising: generating the backup of the database at a first time, the first backup including a plurality of primary hash values, each primary hash value generated based on content of at least a row of the database.
3. The method of claim 2, further comprising: storing a pointer to a location of the primary hash value and a primary key of the database; and associating the stored pointer and primary key with the backup.
4. The method of claim 2, further comprising: generating a manifest of the backup, the manifest including the plurality of primary hash values arranged in sequential order.
5. The method of claim 4, further comprising: generating a plurality of manifests, each manifest of the plurality of manifests corresponding to a unique backup of the database.
6. The method of claim 1, further comprising: discarding the first-row group of the plurality of row groups, in response to determining that the primary hash value of the first-row group matches a primary hash value associated with the backup.
6. The method of claim 1, further comprising: generating a rolling hash value based on a plurality of secondary hash values; and generating the plurality of row groups based on the rolling hash value.
7. A non-transitory computer-readable medium storing a set of instructions for reducing duplication in database backup generation, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a device, cause the device to: read a plurality of rows of a database; generate a secondary hash value for each row of the plurality of rows; generate a plurality of row groups, each including a group of unique rows of the plurality of rows, based at least on the secondary hash value; generate a primary hash value for each row group of the plurality of row groups; update a backup of the database based on a first-row group of the plurality of row groups, in response to determining that a primary hash value of the first-row group does not match any primary hash value associated with the backup of the database.
8. A system for reducing duplication in database backup generation comprising: one or more processors configured to: read a plurality of rows of a database; generate a secondary hash value for each row of the plurality of rows; generate a plurality of row groups, each including a group of unique rows of the plurality of rows, based at least on the secondary hash value; generate a primary hash value for each row group of the plurality of row groups; update a backup of the database based on a first-row group of the plurality of row groups, in response to determining that a primary hash value of the first-row group does not match any primary hash value associated with the backup of the database.
9. The system of claim 8, wherein the one or more processors are further configured to: generate the backup of the database at a first time, the first backup including a plurality of primary hash values, each primary hash value generated based on content of at least a row of the database.
10. The system of claim 9, wherein the one or more processors are further configured to: store a pointer to a location of the primary hash value and a primary key of the database; and associate the stored pointer and primary key with the backup.
11. The system of claim 9, wherein the one or more processors are further configured to: generate a manifest of the backup, the manifest including the plurality of primary hash values arranged in sequential order.
12. The system of claim 11, wherein the one or more processors are further configured to: generate a plurality of manifests, each manifest of the plurality of manifests corresponding to a unique backup of the database.
6. The method further comprising: discard the first-row group of the plurality of row groups, in response to determining that the primary hash value of the first-row group matches a primary hash value associated with the backup.
13. The system of claim 8, wherein the one or more processors are further configured to: generate a rolling hash value based on a plurality of secondary hash values; and generate the plurality of row groups based on the rolling hash value.
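
The rolling-hash grouping and the manifest recited in the claims above can be sketched as follows. This is a hedged illustration, not the claimed implementation. The additive rolling hash, the `window` and `mask` parameters, the function names, and the dictionary-backed `store` are assumptions chosen to keep the example short.

```python
import hashlib
from collections import deque


def secondary_hash(row):
    """Per-row hash, truncated to 64 bits (truncation is an assumption)."""
    data = "\x1f".join(map(str, row)).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def rolling_groups(rows, window=4, mask=0x7):
    """Split rows into variable-length groups using a rolling hash.

    The rolling value is the sum of the last `window` secondary hashes.
    A group closes after a row when the low bits selected by `mask` are zero.
    """
    groups, current = [], []
    recent, rolling = deque(), 0
    for row in rows:
        h = secondary_hash(row)
        recent.append(h)
        rolling += h
        if len(recent) > window:
            rolling -= recent.popleft()  # slide the window forward
        current.append(row)
        if len(recent) == window and (rolling & mask) == 0:
            groups.append(current)
            current = []
    if current:
        groups.append(current)
    return groups


def primary_hash(group):
    """Hash of a row group, derived from the secondary hashes of its rows."""
    payload = b"".join(secondary_hash(r).to_bytes(8, "big") for r in group)
    return hashlib.sha256(payload).hexdigest()


def take_backup(rows, store):
    """Store unseen groups in `store`; return the manifest for this backup.

    The manifest lists primary hashes in row order, one entry per group.
    """
    manifest = []
    for group in rolling_groups(rows):
        key = primary_hash(group)
        if key not in store:  # matching groups are discarded, not stored again
            store[key] = group
        manifest.append(key)
    return manifest


def restore(manifest, store):
    """Rebuild the row sequence of one backup from its manifest."""
    return [row for key in manifest for row in store[key]]
```

Because a boundary depends only on the last `window` rows, inserting one row can change the boundary decision only at that row and at the `window - 1` rows after it. Groups outside that span keep their primary hashes and are shared between backups, while each backup keeps its own manifest.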

Description

TECHNICAL FIELD

The present disclosure relates generally to databases, and specifically to deduplication in databases for backup purposes.

BACKGROUND

Database backup is the process of creating copies of data to protect against data loss, corruption, or hardware failure. Backups ensure that information can be restored if something goes wrong, maintaining data availability and minimizing downtime.

There are several types of backups used to meet different recovery needs. A full backup captures the entire database, offering a complete snapshot at a specific point in time. Incremental backups, on the other hand, store only the changes made since the last backup, making them more space-efficient but requiring all previous backups for a full restore. Differential backups store changes made since the last full backup, striking a balance between efficiency and ease of recovery.

A backup strategy determines how often backups are taken and where they are stored. A common approach is the 3-2-1 strategy, which involves keeping three copies of data: the original plus two backups, with one stored offsite. In production environments, backups may occur at varying intervals, such as daily or weekly, depending on the organization's tolerance for data loss and downtime, often expressed as the Recovery Point Objective (RPO) and Recovery Time Objective (RTO), respectively. For high-demand systems, continuous or near-real-time backups, such as transaction log backups, are used to ensure minimal data loss. Additionally, automated backups in the cloud have become increasingly popular, offering scalability and offsite storage by default, which simplifies disaster recovery processes.

However, there are challenges specific to cloud-based backups. One significant issue is latency: the time taken to transfer large amounts of data to and from the cloud can hinder backup and restoration speed. This can be particularly problematic for large databases that need quick recovery.
To overcome this, some solutions allow fast restoration of a database by performing an instance mount of the database and then querying the mounted database. While such a solution allows a user to access some content of the database, this still typically takes a significant amount of time. Further complicating this, if an incorrect version of the database is restored, correcting the mistake can be a long and error-prone process.

In addition, cloud-based databases can be implemented as managed databases, such as Amazon® RDS, or by deploying a virtual machine, such as an Amazon® EC2 instance with a database application installed thereon. Such a machine can include many temporary files which occupy a large amount of storage space. Additionally, an older database backup may rely on a previous version of the database application, so that restoring it may introduce a cybersecurity risk from running an outdated application.

Differential database backups present their own challenges: determining what constitutes a differential change, how such changes are detected, and how to update a backup based on a detected change.

It would therefore be advantageous to provide a solution that would overcome the challenges noted above.

SUMMARY

A summary of several example embodiments of the disclosure follows. This summary is provided for the convenience of the reader to provide a basic understanding of such embodiments and does not wholly define the breadth of the disclosure. This summary is not an extensive overview of all contemplated embodiments, and is intended to neither identify key or critical elements of all embodiments nor to delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more embodiments in a simplified form as a prelude to the more detailed description that is presented later.
For convenience, the term “some embodiments” or “certain embodiments” may be used herein to refer to a single embodiment or multiple embodiments of the disclosure.

A system of one or more computers can be configured to perform particular operations or actions by virtue of having software, firmware, hardware, or a combination of them installed on the system that in operation cause(s) the system to perform the actions. One or more computer programs can be configured to perform particular operations or actions by virtue of including instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions.

In one general aspect, the method may include reading a plurality of rows of a database. The method may also include generating a secondary hash value for each row of the plurality of rows. The method may furthermore include generating a plurality of row groups, each including a group of unique rows of the plurality of rows, based at least on the secondary hash value. The method may in addition include generating a primary hash value for each row group of the plurality of row groups. The method may moreover include updating