US-12619500-B2 - Data management system for detecting changes in a data management system
Abstract
An incremental backup agent performs backup operations that synchronize a database on the client side to a server database. In one embodiment, such backup operations are incremental backups, where the agent may identify differences between the current directory and the latest backed-up version. The agent may issue a direct RPC using SMB protocols or NFS protocols to fetch all entries of directories with metadata in a single RPC call, instead of issuing one call to fetch metadata for each directory entry. The agent may efficiently identify changes by performing checksum comparisons in a depth-first search (DFS) manner. Starting from a root directory, the agent may generate a checksum for each directory and compare the checksums on the client side with the retrieved fingerprints, and if the backup agent identifies that the fingerprints match, the backup agent may then go to a deeper level and compare the fingerprints for child directories.
Inventors
- Sudhakar Paulzagade
Assignees
- DRUVA INC.
Dates
- Publication Date: 2026-05-05
- Application Date: 2023-07-06
- Priority Date: 2022-08-03
Claims (20)
- 1. A system comprising: a processor; and a non-transitory computer-readable storage medium storing instructions for conducting an incremental backup of a data source, the instructions when executed by the processor cause the processor to: receive a request to identify one or more changes in a file system; retrieve, in a single call, a collection of directory-level metadata for a first file directory, the directory-level metadata being metadata for a directory instead of a file, wherein the retrieval of the collection of the directory-level metadata is performed without retrieving individual file content and prior to comparing any specific file content; determine a first directory checksum derived from the directory-level metadata that is the metadata for the first file directory, wherein the first file directory comprises at least a file and a child file directory; compare the first directory checksum to a second directory checksum of a second file directory; compare, responsive to determining that the first and the second directory checksums match, a first child directory checksum associated with the child file directory and a second child directory checksum associated with a respective child file directory in the second file directory; conduct, responsive to determining that the first and the second directory checksums match, an iterative search traversing through the first file directory; and identify a file that is different in the first file directory and the second file directory.
- 2. The system of claim 1, wherein the second file directory resides in a Network Attached Storage (NAS) system.
- 3. The system of claim 1, wherein the collection of metadata is gathered by transmitting a directory-metadata call to the data source, the directory-metadata call requesting for metadata of the first file directory.
- 4. The system of claim 3, wherein the directory-metadata call is a single system call that fetches metadata of a directory.
- 5. The system of claim 1, wherein the first file directory and the second file directory are the same file directory, the second file directory being a previous version of the first file directory that was backed up in a previous point in time.
- 6. The system of claim 1, wherein the instructions when executed by the processor cause the processor to further perform steps including: comparing, responsive to the first directory checksum being different from the second directory checksum, each entry under the first directory with each entry under the second directory, the comparing including content for each entry and modification time for each entry.
- 7. The system of claim 1, wherein the instructions when executed by the processor cause the processor to further perform steps including: requesting, as part of the incremental backup and from the data source, one or more files in one of the child file directories whose child directory checksum is different from the child directory checksum of the second file directory.
- 8. A system comprising: a processor; and a non-transitory computer-readable storage medium storing instructions for conducting an incremental backup of a source data base, the instructions when executed by the processor cause the processor to: receive a request to identify whether content of a first file directory in a NAS (network attached storage) system is different from content of a second file directory, the first file directory comprising at least one or more files and at least one or more file directories; issue a direct call to the NAS system wherein the direct call does not trigger a call through a kernel, the direct call fetching a set of metadata associated with the first file directory; and determine a file in the first directory that comprises a change from the file in the second directory, wherein determining the file in the first directory that comprises a change from the file in the second directory comprises: retrieve, in a single call of the direct call to the NAS system, a collection of directory-level metadata for a first file directory, the directory-level metadata being metadata for a directory instead of a file, wherein the retrieval of the collection of the directory-level metadata is performed without retrieving individual file content and prior to comparing any specific file content; determine a first directory checksum derived from the directory-level metadata that is the metadata for the first file directory, wherein the first file directory comprises at least a file and a child file directory, compare the first directory checksum to a second directory checksum of the second file directory, and conduct, responsive to determining that the first and the second directory checksums match, an iterative search traversing through the first file directory.
- 9. The system of claim 8, wherein the direct call is a single call that fetches multiple directories under the first file directory.
- 10. The system of claim 9, wherein the direct call is a single call that fetches multiple sets of metadata associated with the multiple directories.
- 11. The system of claim 8, wherein the first file directory and the second file directory are a same directory with different contents, the content in the second file directory being a previous version of the content in a previous time point.
- 12. The system of claim 8, wherein the first file directory is managed through a Server Message Block (SMB).
- 13. The system of claim 12, wherein the request is implemented using a SMB interface QUERY_DIRECTORY RPC.
- 14. The system of claim 8, wherein the first file directory is managed through a Network File System (NFS).
- 15. The system of claim 14, wherein the request is implemented using a NFS interface NFSPROC3_READDIRPLUS.
- 16. A non-transitory computer-readable storage medium for storing executable computer instructions for conducting an incremental backup of a data source, wherein the computer instructions, when executed by one or more processors, cause the one or more processors to perform operations, the instructions comprising instructions to: receive a request to identify one or more changes in a file system; retrieve, in a single call, a collection of directory-level metadata for a first file directory, the directory-level metadata being metadata for a directory instead of a file, wherein the retrieval of the collection of the directory-level metadata is performed without retrieving individual file content and prior to comparing any specific file content; determine a first directory checksum derived from the directory-level metadata that is the metadata for the first file directory, wherein the first file directory comprises at least a file and a child file directory; compare the first directory checksum to a second directory checksum of a second file directory; compare, responsive to determining that the first and the second directory checksums match, a first child directory checksum associated with the child file directory and a second child directory checksum associated with a respective child file directory in the second file directory; conduct, responsive to determining that the first and the second directory checksums match, an iterative search traversing through the first file directory; and identify a file that is different in the first file directory and the second file directory.
- 17. The non-transitory computer-readable storage medium of claim 16, wherein the second file directory resides in a Network Attached Storage (NAS) system.
- 18. The non-transitory computer-readable storage medium of claim 16, wherein the collection of metadata is gathered by transmitting a directory-metadata call to the data source, the directory-metadata call requesting for metadata of the first file directory.
- 19. The non-transitory computer-readable storage medium of claim 16, wherein the instructions further comprise instructions to: transmit, responsive to the first directory checksum being different from the second directory checksum, one or more additional directory-metadata calls to the data source for requesting metadata associated with one or more child file directories.
- 20. The non-transitory computer-readable storage medium of claim 16, wherein the first file directory and the second file directory are the same file directory, the second file directory being a previous version of the first file directory that was backed up in a previous point in time.
Description
CROSS-REFERENCE TO RELATED APPLICATION
This application claims the benefit of Indian Provisional Application No. 202241044456, filed Aug. 3, 2022, which is herein incorporated by reference for all purposes.
TECHNICAL FIELD
The disclosed embodiments are related to data management systems, and, more specifically, to a data management system that may efficiently detect changes in databases.
BACKGROUND
To protect against data loss, organizations may periodically back up data to a backup system and restore data from the backup system. In some cases, the backup data may comprise large files, such as large data files or a snapshot of virtual disks within a virtual machine. Conventionally, NAS (Network Attached Storage) systems are used for maintaining a large amount of unstructured data. A NAS storage system is usually a storage device connected to a network that allows storage and retrieval of data from a centralized location for authorized network users and clients. However, existing systems pose a challenge for incremental backups, as many NAS service providers do not provide functionality for identifying differences between two snapshots of a file system. Therefore, a more efficient implementation for detecting changes between different snapshots of a NAS system is needed.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a block diagram illustrating a system environment of an example data management system with an incremental backup agent, in accordance with an embodiment.
FIG. 2 is a block diagram illustrating an architecture of an example backup agent with an incremental backup agent, in accordance with an embodiment.
FIG. 3 is a block diagram illustrating an exemplary structure between NAS agents and NAS service providers for consolidated data fetching, in accordance with an embodiment.
FIG. 4 is an exemplary tree structure for illustrating a depth-first search algorithm, in accordance with an embodiment.
FIG. 5 is an exemplary root directory with child directories, in accordance with an embodiment.
FIG. 6 is a flowchart depicting an example process of identifying a change between databases, in accordance with an embodiment.
FIG. 7 is a flowchart depicting an example process of consolidating data fetching in a change detection process, in accordance with an embodiment.
FIG. 8 is a block diagram illustrating components of an example computing machine, in accordance with an embodiment.
The figures depict various embodiments for purposes of illustration only. One skilled in the art will readily recognize from the following discussion that alternative embodiments of the structures and methods illustrated herein may be employed without departing from the principles described herein.
DETAILED DESCRIPTION
Configuration Overview
A data management system is disclosed with example embodiments related to systems and processes of change detection associated with files in a NAS system that improve the efficiency and cost of performing incremental backups. The data management system may include a backup agent with an incremental backup agent for performing such efficient incremental backups. The incremental backup agent may perform backup operations that synchronize a database on the client side to a server database. In one embodiment, such backup operations are incremental backups, where the incremental backup agent may identify differences between the current directory and the latest backed-up version. To perform such a scan, the incremental backup agent may first issue a direct RPC (remote procedure call) using SMB (server message block) protocols or NFS (network file system) protocols to fetch all entries of directories with metadata in a single RPC call, instead of issuing one call to fetch metadata for each directory entry, as in traditional implementations.
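The difference between per-entry metadata fetching and a consolidated fetch can be illustrated with a small request-counting sketch. Everything here is an assumption made for illustration: `FakeNasServer` and its methods are a hypothetical in-memory stand-in, not an SMB or NFS client API, and the sketch ignores the paging that real directory-listing calls may need for very large directories.

```python
class FakeNasServer:
    """Hypothetical in-memory stand-in for a NAS share that counts requests."""

    def __init__(self, entries):
        self._entries = dict(entries)  # entry name -> metadata dict
        self.calls = 0                 # number of simulated round trips

    def list_names(self):
        """One round trip: names only, no metadata."""
        self.calls += 1
        return list(self._entries)

    def get_metadata(self, name):
        """One round trip per entry."""
        self.calls += 1
        return self._entries[name]

    def list_with_metadata(self):
        """One round trip: every entry together with its metadata."""
        self.calls += 1
        return dict(self._entries)


def fetch_per_entry(server):
    """Traditional pattern: list the directory, then query each entry."""
    return {name: server.get_metadata(name) for name in server.list_names()}


def fetch_consolidated(server):
    """Consolidated pattern: a single call returns entries plus metadata."""
    return server.list_with_metadata()


if __name__ == "__main__":
    listing = {
        "a.txt": {"size": 10, "mtime": 100},
        "b.txt": {"size": 20, "mtime": 110},
        "sub": {"size": 0, "mtime": 120},
    }
    slow, fast = FakeNasServer(listing), FakeNasServer(listing)
    fetch_per_entry(slow)
    fetch_consolidated(fast)
    print("per-entry round trips:", slow.calls)
    print("consolidated round trips:", fast.calls)
```

For a directory with N entries, the per-entry pattern in this sketch costs N + 1 simulated round trips, while the consolidated pattern costs one, which is the saving the single-RPC approach is aiming for.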
The incremental backup agent, after retrieving metadata for all entries of directories, may efficiently identify changes by performing checksum comparisons in a DFS (depth-first search) manner. Starting from a root directory, the incremental backup agent may generate a checksum for each directory, with the checksum containing condensed information for files and directories under the directory. The incremental backup agent may then compare the checksums (which may also be referred to as fingerprints) on the client side with the retrieved fingerprints, and if the backup agent identifies that the fingerprints match, the incremental backup agent may then go to a deeper level and compare the fingerprints for child directories under the directory. The backup agent may iteratively perform such an operation until a difference in a file is identified. The difference may then be used for incremental backup without reconstructing backup data from scratch.
The disclosed systems and methods provide multiple advantageous technical features. For example, the disclosed incremental backup agent may improve time and network efficiency by reducing the number of network round trips.
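The checksum comparison described above can be sketched as follows. This is one possible reading under stated assumptions, not the patented implementation: the `Node` type, the choice of SHA-256, and the fields hashed (entry type, name, size, modification time) are all hypothetical. The sketch also recomputes the previous version's checksums from an in-memory tree, whereas the description compares against retrieved fingerprints. Because each checksum covers only a directory's immediate entries, a match at one level says nothing about deeper levels, so the traversal descends either way and does an entry-by-entry comparison only where the checksums differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Node:
    """Hypothetical metadata record; children is None for a regular file."""
    size: int = 0
    mtime: int = 0
    children: Optional[Dict[str, "Node"]] = None


def dir_checksum(directory: Node) -> str:
    """Condense the metadata of a directory's immediate entries into a digest."""
    h = hashlib.sha256()
    for name in sorted(directory.children):
        child = directory.children[name]
        kind = "d" if child.children is not None else "f"
        h.update(f"{kind}\0{name}\0{child.size}\0{child.mtime}\n".encode())
    return h.hexdigest()


def changed_paths(current: Node, previous: Node, prefix: str = "") -> List[str]:
    """Depth-first walk returning paths that differ between two versions."""
    changed: List[str] = []
    if dir_checksum(current) != dir_checksum(previous):
        # Checksums differ: compare the entries at this level one by one.
        for name in sorted(set(current.children) | set(previous.children)):
            cur = current.children.get(name)
            old = previous.children.get(name)
            if cur is None or old is None:
                changed.append(prefix + name)  # added or removed entry
            elif cur.children is None or old.children is None:
                type_changed = (cur.children is None) != (old.children is None)
                if type_changed or (cur.size, cur.mtime) != (old.size, old.mtime):
                    changed.append(prefix + name)
    # Whether or not this level matched, go one level deeper.
    for name in sorted(current.children):
        cur = current.children[name]
        old = previous.children.get(name)
        if cur.children is not None and old is not None and old.children is not None:
            changed.extend(changed_paths(cur, old, prefix + name + "/"))
    return changed
```

In this sketch a file edited in place deep in the tree leaves the checksums of its ancestor directories unchanged, so only the checksum of the directory that directly contains the file triggers the entry-level comparison.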