US-12619503-B2 - End-to-end data validation for backup systems


Abstract

Methods, systems, and devices for data management are described. The method may include obtaining a first fingerprint of one or more first data blocks obtained from a source data system, where the one or more first data blocks correspond to one or more data objects and are in accordance with a first storage format, obtaining a second fingerprint of one or more second data blocks as stored in the backup system, where the one or more second data blocks correspond to the one or more data objects and are stored in the backup system in accordance with a second storage format, and comparing, in accordance with a mapping between data objects of the source data system and data blocks in the backup system, the first fingerprint and the second fingerprint to validate whether the one or more data objects are stored correctly at the backup system.

Inventors

Prasanta Ranjan Dash
Abdullah Reza
Arjun Sinha
Vinita Sharma
Saurabh Vashisth

Assignees

RUBRIK, INC.

Dates

Publication Date: 2026-05-05
Application Date: 2024-06-27

Claims (18)

1. A method, comprising: writing, to a backup system, a first threshold portion of data of a backup snapshot obtained from a source data system; obtaining a first fingerprint of one or more first data blocks obtained from the source data system, wherein the one or more first data blocks correspond to one or more data objects and are in accordance with a first storage format associated with the source data system; obtaining a second fingerprint of one or more second data blocks as stored in the backup system, wherein the one or more second data blocks correspond to the one or more data objects and are stored in the backup system in accordance with a second storage format; comparing, in accordance with a mapping between data objects of the source data system and data blocks in the backup system and in response to writing the first threshold portion of data, the first fingerprint and the second fingerprint to validate whether the one or more data objects are stored correctly at the backup system; determining, based at least in part on the comparing, that a first data object of the one or more data objects is not stored correctly at the backup system; obtaining, from the source data system, a first set of data blocks corresponding to the first data object; writing, to the backup system, data of the first set of data blocks as a second set of data blocks at the backup system; and comparing, to validate whether the first data object is stored correctly at the backup system, a third fingerprint of the first set of data blocks to a fourth fingerprint of the second set of data blocks.
2. The method of claim 1, further comprising: using a first looping construct and a second looping construct to identify the first fingerprint and the second fingerprint for comparison, wherein the first looping construct is configured to iterate through first fingerprints corresponding to first data blocks in the source data system that correspond to a set of data objects, and wherein the second looping construct is configured to iterate through second fingerprints corresponding to second data blocks corresponding to the set of data objects in the backup system in accordance with the mapping.
3. The method of claim 2, wherein: the first fingerprints corresponding to the first data blocks of the source data system are contiguous; the second fingerprints corresponding to the second data blocks in the backup system are non-contiguous; and the second looping construct is configured to seek to one of the second fingerprints to read a contiguous subset of the second fingerprints corresponding to the second data blocks.
4. The method of claim 3, wherein the second fingerprints corresponding to the second data blocks are non-contiguous based at least in part on the backup system supporting deduplication, sharding, or compression of data blocks obtained from the source data system.
5. The method of claim 1, wherein each threshold portion of the backup snapshot that is written to the backup system triggers a respective set of comparisons between first fingerprints obtained for the source data system and second fingerprints obtained for the backup system.
6. The method of claim 1, further comprising: refraining, until the first data object is validated as stored correctly at the backup system, from writing a second threshold portion of the backup snapshot that is subsequent to the first threshold portion.
7. The method of claim 1, further comprising: outputting an error message indicating that the first data object is not stored correctly at the backup system, wherein the error message is indicative of a first location of the first data object in the source data system, a second location of data corresponding to the first data object in the backup system, or both.
8. The method of claim 1, wherein: the first storage format of the source data system is a local file storage format, a network file storage format, or an object storage format, wherein each of the one or more data objects is a respective file; and the backup system is a block device storing data of each respective file to one or more blocks.
9. The method of claim 1, wherein: the first storage format of the source data system is a local file storage format, wherein each of the one or more data objects is a respective file; and the second storage format of the backup system is a local file storage format, a network file storage format, or an object storage format, wherein the backup system comprises a storage cluster comprising a plurality of storage devices that are accessed using the network file storage format.
10. The method of claim 1, wherein the mapping between data objects of the source data system and data blocks in the backup system comprises, for each data object in the source data system, a respective mapping between a first starting offset and a first length for the data object at the source data system to a device in the backup system, a second starting offset, and a second length of data of the data object as stored at the backup system.
11. The method of claim 1, wherein: the mapping between data objects of the source data system and data blocks in the backup system comprises, for each data object in the source data system, one or more respective mappings of the data object across one or more storage components at the backup system, and each of the one or more respective mappings may correspond to a respective mapping function, a respective block size, or both corresponding to a respective storage component of the one or more storage components.
12. The method of claim 1, wherein obtaining the second fingerprint comprises: computing the second fingerprint prior to encrypting or compressing data of the one or more second data blocks for storage at the backup system; or decompressing the data of the one or more second data blocks, decrypting the one or more second data blocks, or both prior to computing the second fingerprint.
13. The method of claim 1, further comprising: performing data validation for data objects in the source data system in accordance with a multi-pass procedure that utilizes a fixed block size for fingerprints and prioritizes processing of data objects based on a size of the data objects of the source data system such that one or more large data objects are processed prior to one or more small data objects; and reserving, in the backup system, storage resources for storage of data blocks for the data objects, wherein the multi-pass procedure and the storage resources reservation results in data alignment for the one or more data objects of the source data system.
14. An apparatus, comprising: one or more memories storing processor-executable code; and one or more processors coupled with the one or more memories and individually or collectively operable to execute the code to cause the apparatus to: write, to a backup system, a first threshold portion of data of a backup snapshot obtained from a source data system; obtain a first fingerprint of one or more first data blocks obtained from the source data system, wherein the one or more first data blocks correspond to one or more data objects and are in accordance with a first storage format associated with the source data system; obtain a second fingerprint of one or more second data blocks as stored in the backup system, wherein the one or more second data blocks correspond to the one or more data objects and are stored in the backup system in accordance with a second storage format; compare, in accordance with a mapping between data objects of the source data system and data blocks in the backup system and in response to writing the first threshold portion of data, the first fingerprint and the second fingerprint to validate whether the one or more data objects are stored correctly at the backup system; determine, based at least in part on the comparing, that a first data object of the one or more data objects is not stored correctly at the backup system; obtain, from the source data system, a first set of data blocks corresponding to the first data object; write, to the backup system, data of the first set of data blocks as a second set of data blocks at the backup system; and compare, to validate whether the first data object is stored correctly at the backup system, a third fingerprint of the first set of data blocks to a fourth fingerprint of the second set of data blocks.
15. The apparatus of claim 14, wherein the one or more processors are individually or collectively further operable to execute the code to cause the apparatus to: use a first looping construct and a second looping construct to identify the first fingerprint and the second fingerprint for comparison, wherein the first looping construct is configured to iterate through first fingerprints corresponding to first data blocks in the source data system that correspond to a set of data objects, and wherein the second looping construct is configured to iterate through second fingerprints corresponding to second data blocks corresponding to the set of data objects in the backup system in accordance with the mapping.
16. The apparatus of claim 15, wherein: the first fingerprints corresponding to the first data blocks of the source data system are contiguous; the second fingerprints corresponding to the second data blocks in the backup system are non-contiguous; and the second looping construct is configured to seek to one of the second fingerprints to read a contiguous subset of the second fingerprints corresponding to the second data blocks.
17. A non-transitory computer-readable medium storing code, the code comprising instructions executable by one or more processors to: write, to a backup system, a first threshold portion of data of a backup snapshot obtained from a source data system; obtain a first fingerprint of one or more first data blocks obtained from the source data system, wherein the one or more first data blocks correspond to one or more data objects and are in accordance with a first storage format associated with the source data system; obtain a second fingerprint of one or more second data blocks as stored in the backup system, wherein the one or more second data blocks correspond to the one or more data objects and are stored in the backup system in accordance with a second storage format; compare, in accordance with a mapping between data objects of the source data system and data blocks in the backup system and in response to writing the first threshold portion of data, the first fingerprint and the second fingerprint to validate whether the one or more data objects are stored correctly at the backup system; determine, based at least in part on the comparing, that a first data object of the one or more data objects is not stored correctly at the backup system; obtain, from the source data system, a first set of data blocks corresponding to the first data object; write, to the backup system, data of the first set of data blocks as a second set of data blocks at the backup system; and compare, to validate whether the first data object is stored correctly at the backup system, a third fingerprint of the first set of data blocks to a fourth fingerprint of the second set of data blocks.
18. The non-transitory computer-readable medium of claim 17, wherein the instructions are further executable by the one or more processors to: use a first looping construct and a second looping construct to identify the first fingerprint and the second fingerprint for comparison, wherein the first looping construct is configured to iterate through first fingerprints corresponding to first data blocks in the source data system that correspond to a set of data objects, and wherein the second looping construct is configured to iterate through second fingerprints corresponding to second data blocks corresponding to the set of data objects in the backup system in accordance with the mapping.
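The two looping constructs recited in claims 2 through 4 may be illustrated with a minimal Python sketch. It is not taken from the disclosure: the function names, the SHA-256 fingerprint, the 4-byte block size, and the representation of the mapping as a list of (first backup block index, block count) extents are all assumptions made for illustration. The first generator walks source fingerprints contiguously, while the second seeks to each mapped run in a backup fingerprint table and reads that run contiguously.

```python
import hashlib
from itertools import zip_longest
from typing import Iterable, Iterator, List, Sequence, Tuple

BLOCK_SIZE = 4  # tiny fixed block size, chosen only to keep the example readable


def fingerprint(block: bytes) -> str:
    """One fingerprint function shared by both sides (SHA-256 is an assumption)."""
    return hashlib.sha256(block).hexdigest()


def source_fingerprints(data: bytes) -> Iterator[str]:
    """First looping construct: visit source blocks in contiguous order."""
    for offset in range(0, len(data), BLOCK_SIZE):
        yield fingerprint(data[offset:offset + BLOCK_SIZE])


def backup_fingerprints(fp_table: Sequence[str],
                        extents: Iterable[Tuple[int, int]]) -> Iterator[str]:
    """Second looping construct: seek to each mapped run, then read it contiguously.

    Each extent is (first backup block index, number of blocks), and the extents
    are listed in source order; this is a hypothetical model of the mapping.
    """
    for start, count in extents:
        yield from fp_table[start:start + count]


def find_mismatches(src: Iterable[str], dst: Iterable[str]) -> List[int]:
    """Return source block indices whose fingerprints differ or are missing."""
    return [i for i, (a, b) in enumerate(zip_longest(src, dst)) if a != b]


source_data = b"AAAABBBBCCCCDDDD"
# Backup blocks landed out of order, with an unrelated block in between.
backup_blocks = [b"CCCC", b"DDDD", b"zzzz", b"AAAA", b"BBBB"]
fp_table = [fingerprint(b) for b in backup_blocks]
extents = [(3, 2), (0, 2)]  # source blocks 0-1 -> backup 3-4, blocks 2-3 -> backup 0-1

clean = find_mismatches(source_fingerprints(source_data),
                        backup_fingerprints(fp_table, extents))

fp_table[0] = fingerprint(b"CCCX")  # simulate a corrupted backup block
corrupted = find_mismatches(source_fingerprints(source_data),
                            backup_fingerprints(fp_table, extents))
print(clean, corrupted)  # prints: [] [2]
```

Because both generators are lazy, the comparison may proceed block by block without materializing either fingerprint sequence, which is one plausible reading of why separate iterators are useful when the backup layout is non-contiguous.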

Description

FIELD OF TECHNOLOGY

The present disclosure relates generally to data management, including techniques for end-to-end data validation for backup systems.

BACKGROUND

A data management system (DMS) may be employed to manage data associated with one or more computing systems. The data may be generated, stored, or otherwise used by the one or more computing systems, examples of which may include servers, databases, virtual machines, cloud computing systems, file systems (e.g., network-attached storage (NAS) systems), or other data storage or processing systems. The DMS may provide data backup, data recovery, data classification, or other types of data management services for data of the one or more computing systems. Improved data management may offer improved performance with respect to reliability, speed, efficiency, scalability, security, or ease-of-use, among other possible aspects of performance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a computing environment that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

FIG. 2 shows an example of a computing environment that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

FIG. 3 shows an example of a process flow that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

FIG. 4 shows a block diagram of an apparatus that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

FIG. 5 shows a block diagram of a data validation manager that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

FIG. 6 shows a diagram of a system including a device that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

FIGS. 7 through 9 show flowcharts illustrating methods that support end-to-end data validation for backup systems in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

Backup systems may protect (e.g., back up) various types of computing objects by reading data from the source system and writing the data to a backup store. The source system may store data in a first format (e.g., a Windows file system) while the backup system may store the data in accordance with a different format (e.g., a data cluster with multiple devices exposed as a fourth extended (EXT4) file system). Additionally, the data may be read from the source in a contiguous manner (e.g., from contiguous storage locations), while the data may be stored in the backup system in a non-contiguous manner due to the use of incremental snapshots, deduplication, sharding/partitioning, etc. As such, performing data validations between data obtained from the source and data written to the target is complex, and some related techniques may be computationally inefficient. For example, some techniques may read a snapshot from the source, compute a checksum, write the snapshot to the target, and after the snapshot is completely written, read/mount the snapshot, and compute a checksum. However, computing checksums for large chunks of data after the data has been completely written may be computationally inefficient.

Techniques described herein support a more granular and scalable approach for data validation by a data protection system. The backup system (or an associated component) may maintain a mapping of aspects of data objects as obtained from the source to corresponding data blocks in the backup system, such that fingerprints of smaller chunks/blocks of data may be computed and compared. The validation techniques may be performed as part of the backup procedure and/or a subsequent recovery procedure.
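As a rough illustration of mapping-driven validation during a backup, the following Python sketch writes a snapshot in small portions, validates each portion by comparing per-extent fingerprints, and rewrites any extent that fails before moving on. Everything named here (`Extent`, `BackupStore`, `backup_with_validation`, the `portion` parameter counted in extents, and the use of SHA-256) is a hypothetical stand-in rather than the disclosed implementation, and the in-memory byte arrays merely simulate backup devices.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Extent:
    """One mapping entry: a source byte range and where it lives in the backup."""
    src_offset: int
    length: int
    device: str
    dst_offset: int


class BackupStore:
    """Toy backup target: named devices backed by in-memory byte arrays."""

    def __init__(self, device_sizes: Dict[str, int]) -> None:
        self.devices = {name: bytearray(size) for name, size in device_sizes.items()}

    def write(self, device: str, offset: int, data: bytes) -> None:
        self.devices[device][offset:offset + len(data)] = data

    def read(self, device: str, offset: int, length: int) -> bytes:
        return bytes(self.devices[device][offset:offset + length])


def fingerprint(data: bytes) -> str:
    """Same function on both sides; SHA-256 is an assumption."""
    return hashlib.sha256(data).hexdigest()


def source_bytes(source: bytes, e: Extent) -> bytes:
    return source[e.src_offset:e.src_offset + e.length]


def invalid_extents(source: bytes, store: BackupStore,
                    extents: List[Extent]) -> List[Extent]:
    """Return the extents whose source and backup fingerprints differ."""
    return [e for e in extents
            if fingerprint(source_bytes(source, e))
            != fingerprint(store.read(e.device, e.dst_offset, e.length))]


def backup_with_validation(
        source: bytes, store: BackupStore, extents: List[Extent], portion: int = 2,
        write: Optional[Callable[[str, int, bytes], None]] = None) -> List[Extent]:
    """Write `portion` extents at a time and validate each portion before the next."""
    write = write or store.write
    repaired = []
    for i in range(0, len(extents), portion):
        batch = extents[i:i + portion]
        for e in batch:
            write(e.device, e.dst_offset, source_bytes(source, e))
        for e in invalid_extents(source, store, batch):
            # Re-obtain the data from the source, rewrite it, and validate again.
            store.write(e.device, e.dst_offset, source_bytes(source, e))
            if invalid_extents(source, store, [e]):
                raise IOError(f"extent at source offset {e.src_offset} failed revalidation")
            repaired.append(e)
    return repaired


source = b"0123456789abcdef"
mapping = [Extent(0, 4, "dev0", 8), Extent(4, 4, "dev1", 0),
           Extent(8, 4, "dev0", 0), Extent(12, 4, "dev1", 4)]
store = BackupStore({"dev0": 12, "dev1": 8})
calls = {"n": 0}


def flaky_write(device: str, offset: int, data: bytes) -> None:
    """Simulated write path that silently zeroes the second write."""
    calls["n"] += 1
    store.write(device, offset, bytes(len(data)) if calls["n"] == 2 else data)


repaired = backup_with_validation(source, store, mapping, write=flaky_write)
```

In this run, `repaired` holds only the second mapping entry, and a final `invalid_extents(source, store, mapping)` call returns an empty list. Validating per portion keeps a fault local to a few extents, in contrast to recomputing a checksum over the whole snapshot after it has been fully written.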
Additionally, a first loop construct (e.g., a first iterator function) may be used to track consecutive data blocks/objects read from the source, and a second loop construct (e.g., a second iterator function) may be used to track where corresponding data blocks are written to the target, even if data is not written contiguously to the target. The validation may be performed at various layers in the input/output (I/O) path based on how the data is translated, written, encrypted, deduplicated, sharded, etc. Additionally, the same fingerprinting algorithm is implemented for the source data and the target data. Data integrity is checked by comparing the data signatures computed in the source and target components. These and other techniques are described in further detail with respect to the figures.

FIG. 1 illustrates an example of a computing environment 100 that supports end-to-end data validation for backup systems in accordance with aspects of the present disclosure. The computing environment 100 may include a computing system 105, a data management system (DMS) 110, and one or more computing devices 115, which may be in communication wi