US-12625771-B2 - Parallelizing restoration of database files

Abstract

Methods, systems, and devices for data management are described. Point-in-time data that includes one or more files may be generated from data management information stored at a data management system having multiple nodes. One or more sets of virtual partitions may be created for at least one file of the one or more files. One or more external file descriptors associated with respective locations of the at least one file and one or more sets of internal file descriptors associated with respective external file descriptors and locations of corresponding sets of virtual partitions may be generated in response to a request. One or more subsequent requests to read the at least one file may be routed to the nodes based on the one or more sets of internal file descriptors. Based on the routing, the respective portions of the one or more files may be output in parallel.
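The relationship between one external file descriptor and its set of internal file descriptors can be pictured with a small sketch. This is illustrative only and is not taken from the disclosure: the names `InternalFD` and `DescriptorTable`, the round-robin placement of virtual partitions on nodes, and the integer descriptor numbering are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class InternalFD:
    """Hypothetical handle for one virtual partition held at one node."""
    fd: int
    node: str
    partition: int


@dataclass
class DescriptorTable:
    """Maps each external file descriptor to its internal file descriptors."""
    nodes: list[str]
    _next_fd: count = field(default_factory=lambda: count(3))
    _open: dict[int, list[InternalFD]] = field(default_factory=dict)
    closed_log: list[int] = field(default_factory=list)

    def open(self, num_partitions: int) -> int:
        """Create one external descriptor plus one internal descriptor per partition."""
        external_fd = next(self._next_fd)
        # Assumption: virtual partitions are spread round-robin across the nodes.
        self._open[external_fd] = [
            InternalFD(next(self._next_fd), self.nodes[p % len(self.nodes)], p)
            for p in range(num_partitions)
        ]
        return external_fd  # only this value is returned to the requester

    def internal_fds(self, external_fd: int) -> list[InternalFD]:
        """Return the internal descriptors stored for an external descriptor."""
        return list(self._open[external_fd])

    def close(self, external_fd: int) -> None:
        """Close the internal descriptors one after another, in partition order."""
        for internal in self._open.pop(external_fd):
            self.closed_log.append(internal.fd)
```

In this sketch the requester only ever sees the external descriptor, while the table keeps the partition-to-descriptor mapping needed to route later reads.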

Inventors

Ganesh Karuppur Rajagopalan
Prasenjit Sarkar
Prabhu Mohan

Assignees

RUBRIK, INC.

Dates

Publication Date: 2026-05-12
Application Date: 2023-10-02

Claims (19)

1. A method, comprising: generating, from data management information stored at a data management system for a data object at a computing system, point-in-time data that comprises a plurality of files; creating, at a plurality of nodes of the data management system and based at least in part on generating the point-in-time data, a plurality of virtual partitions of a file included in the plurality of files; generating, in response to receiving a first request associated with restoring the point-in-time data from the data management system to the computing system or a second computing system, an external file descriptor associated with a location of the file at the data management system and a plurality of internal file descriptors associated with locations of the plurality of virtual partitions of the file at the plurality of nodes; routing, to the plurality of nodes, one or more second requests associated with reading the file included in the plurality of files, wherein the one or more second requests are routed to the plurality of nodes based at least in part on the plurality of internal file descriptors generated in response to the first request; outputting, in parallel and in response to the one or more second requests to read the file included in the plurality of files, respective portions of the file from the plurality of virtual partitions to the computing system; receiving, after outputting the respective portions of the file included in the plurality of files from the plurality of virtual partitions, a request to close the file; and closing the plurality of internal file descriptors in a sequential manner in response to the request to close the file.
2. The method of claim 1, further comprising: distributing the plurality of virtual partitions of the file across the plurality of nodes.
3. The method of claim 2, wherein each node of the plurality of nodes stores a respective virtual partition of the plurality of virtual partitions of the file.
4. The method of claim 1, further comprising: generating, based at least in part on generating the plurality of internal file descriptors, a mapping from the plurality of virtual partitions of the file to respective internal file descriptors of the plurality of internal file descriptors.
5. The method of claim 1, further comprising: performing, in response to the first request, a restoration procedure for restoring the point-in-time data; and receiving, during the restoration procedure, a request to open the file included in the plurality of files, wherein the external file descriptor and the plurality of internal file descriptors are generated in response to the request to open the file.
6. The method of claim 5, further comprising: responding, to the request to open the file included in the plurality of files, with the external file descriptor; and storing, in response to the request to open the file included in the plurality of files, the plurality of internal file descriptors, a mapping from the plurality of virtual partitions of the file to respective internal file descriptors of the plurality of internal file descriptors, or both.
7. The method of claim 1, further comprising: receiving, after receiving a request to open the file included in the plurality of files, a request to read the file; and identifying, in response to the request to read the file included in the plurality of files, a set of virtual partitions from among the plurality of virtual partitions of the file based at least in part on a size of the plurality of virtual partitions and a quantity of the plurality of nodes.
8. The method of claim 7, further comprising: identifying an offset in the request to read the file included in the plurality of files, wherein the set of virtual partitions is further identified from among the plurality of virtual partitions of the file based at least in part on the offset.
9. The method of claim 7, further comprising: identifying, from among the plurality of internal file descriptors and based at least in part on a mapping from the plurality of virtual partitions of the file to respective internal file descriptors of the plurality of internal file descriptors, a set of internal file descriptors corresponding to the set of virtual partitions; and identifying, from among the plurality of nodes and based at least in part on the set of internal file descriptors, respective nodes corresponding to the respective internal file descriptors included in the set of internal file descriptors.
10. The method of claim 9, wherein routing the one or more second requests associated with reading the file included in the plurality of files comprises: routing one or more instances of the request to read the file to the respective nodes in accordance with the respective internal file descriptors included in the set of internal file descriptors, wherein the respective portions of the file are output from the plurality of virtual partitions based at least in part on routing the one or more instances of the request to read the file to the respective nodes.
11. The method of claim 1, further comprising: aggregating the respective portions of the file into a single file at the computing system.
12. The method of claim 1, further comprising: triggering, as part of a procedure for capturing the point-in-time data at the data management system, the data object to perform a backup procedure that is native to the data object and is configured to cause the data object to transfer the point-in-time data to the data management system; and storing, as a result of the backup procedure, the data management information comprising the point-in-time data at the data management system.
13. The method of claim 1, further comprising: triggering, as part of a procedure for restoring the point-in-time data to the computing system, the data object to perform a restoration procedure that is native to the data object and is configured to cause the data object to retrieve the point-in-time data from the data management system, wherein the restoration procedure supports file-wise data transfer, and wherein the external file descriptor and the plurality of internal file descriptors are generated based at least in part on triggering the data object to perform the restoration procedure.
14. The method of claim 1, further comprising: triggering, as part of a procedure for duplicating the point-in-time data at the computing system, the data object to perform a duplication procedure that is native to the data object and is configured to cause the data object to transfer the point-in-time data to the second computing system, wherein the duplication procedure supports file-wise data transfer, and wherein the external file descriptor and the plurality of internal file descriptors are generated based at least in part on triggering the data object to perform the duplication procedure.
15. The method of claim 1, wherein the computing system, the second computing system, or both are separate from the data management system, and wherein the external file descriptor is used to address the file by the computing system, the second computing system, or both, the method further comprising: receiving, from either the computing system or the second computing system, a request to read the file, the request to read the file comprising the external file descriptor.
16. An apparatus, comprising: one or more processors; and one or more memories storing instructions executable, individually or collectively, by the one or more processors to cause the apparatus to: generate, from data management information stored at a data management system for a data object at a computing system, point-in-time data that comprises a plurality of files; create, at a plurality of nodes of the data management system and based at least in part on generating the point-in-time data, a plurality of virtual partitions of a file included in the plurality of files; generate, in response to receiving a first request associated with restoring the point-in-time data from the data management system to the computing system or a second computing system, an external file descriptor associated with a location of the file at the data management system and a plurality of internal file descriptors associated with locations of the plurality of virtual partitions of the file at the plurality of nodes; route, to the plurality of nodes, one or more second requests associated with reading the file included in the plurality of files, wherein the one or more second requests are routed to the plurality of nodes based at least in part on the plurality of internal file descriptors generated in response to the first request; output, in parallel and in response to the one or more second requests to read the file included in the plurality of files, respective portions of the file from the plurality of virtual partitions to the computing system; receive, after outputting the respective portions of the file included in the plurality of files from the plurality of virtual partitions, a request to close the file; and close the plurality of internal file descriptors in a sequential manner in response to the request to close the file.
17. The apparatus of claim 16, wherein the instructions are further executable, individually or collectively, by the one or more processors to cause the apparatus to: distribute the plurality of virtual partitions of the file across the plurality of nodes.
18. The apparatus of claim 16, wherein the instructions are further executable, individually or collectively, by the one or more processors to cause the apparatus to: generate, based at least in part on generating the plurality of internal file descriptors, a mapping from the plurality of virtual partitions of the file to respective internal file descriptors of the plurality of internal file descriptors.
19. A non-transitory, computer-readable medium storing code that comprises instructions executable, individually or collectively, by one or more processors of an electronic device to cause the electronic device to: generate, from data management information stored at a data management system for a data object at a computing system, point-in-time data that comprises a plurality of files; create, at a plurality of nodes of the data management system and based at least in part on generating the point-in-time data, a plurality of virtual partitions of a file included in the plurality of files; generate, in response to receiving a first request associated with restoring the point-in-time data from the data management system to the computing system or a second computing system, an external file descriptor associated with a location of the file at the data management system and a plurality of internal file descriptors associated with locations of the plurality of virtual partitions of the file at the plurality of nodes; route, to the plurality of nodes, one or more second requests associated with reading the file included in the plurality of files, wherein the one or more second requests are routed to the plurality of nodes based at least in part on the plurality of internal file descriptors generated in response to the first request; output, in parallel and in response to the one or more second requests to read the file included in the plurality of files, respective portions of the file from the plurality of virtual partitions to the computing system; receive, after outputting the respective portions of the file included in the plurality of files from the plurality of virtual partitions, a request to close the file; and close the plurality of internal file descriptors in a sequential manner in response to the request to close the file.
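The read path recited in the claims (identify the virtual partitions covered by a read from its offset and the partition size, route one read per partition, output the portions in parallel, and aggregate them) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: fixed-size partitions, in-memory `bytes` objects standing in for partitions held at nodes, a thread pool standing in for the nodes, and the hypothetical function names `partitions_for_read` and `parallel_read`.

```python
from concurrent.futures import ThreadPoolExecutor


def partitions_for_read(offset, length, partition_size, file_size):
    """Yield (partition_index, offset_in_partition, byte_count) for a read.

    Assumes every virtual partition except possibly the last has
    exactly partition_size bytes.
    """
    end = min(offset + length, file_size)
    pos = offset
    while pos < end:
        index = pos // partition_size
        start = pos % partition_size
        n = min(partition_size - start, end - pos)
        yield index, start, n
        pos += n


def parallel_read(partitions, offset, length, partition_size):
    """Read a byte range that may span several virtual partitions.

    partitions is a list of bytes objects; in a real system each entry
    would be reached through an internal file descriptor on some node.
    """
    file_size = sum(len(p) for p in partitions)
    plan = list(partitions_for_read(offset, length, partition_size, file_size))
    if not plan:
        return b""

    def read_one(item):
        index, start, n = item
        return partitions[index][start:start + n]

    # One worker per sub-read; Executor.map returns results in plan order,
    # so joining the chunks reassembles the requested range.
    with ThreadPoolExecutor(max_workers=len(plan)) as pool:
        chunks = list(pool.map(read_one, plan))
    return b"".join(chunks)
```

For a 100-byte file split into 32-byte partitions, a 50-byte read at offset 20 touches partitions 0, 1, and 2, and the joined result equals the same byte range of the original file.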

Description

FIELD OF TECHNOLOGY

The present disclosure relates generally to data management, including techniques for parallelizing restoration of database files.

BACKGROUND

A data management system (DMS) may be employed to manage data associated with one or more computing systems. The data may be generated, stored, or otherwise used by the one or more computing systems, examples of which may include servers, databases, virtual machines, cloud computing systems, file systems (e.g., network-attached storage (NAS) systems), or other data storage or processing systems. The DMS may provide data backup, data recovery, data classification, or other types of data management services for data of the one or more computing systems. Improved data management may offer improved performance with respect to reliability, speed, efficiency, scalability, security, or ease-of-use, among other possible aspects of performance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 illustrates an example of a computing environment that supports parallelizing restoration of database files in accordance with aspects of the present disclosure.
FIG. 2 shows an example of a subsystem that supports parallelizing restoration of database files in accordance with aspects of the present disclosure.
FIG. 3 shows an example of a set of operations for parallelizing restoration of database files in accordance with aspects of the present disclosure.
FIG. 4 shows a block diagram of an apparatus that supports parallelizing restoration of database files in accordance with aspects of the present disclosure.
FIG. 5 shows a block diagram of a data management component that supports parallelizing restoration of database files in accordance with aspects of the present disclosure.
FIG. 6 shows a diagram of a system including a device that supports parallelizing restoration of database files in accordance with aspects of the present disclosure.
FIG. 7 shows a flowchart illustrating methods that support parallelizing restoration of database files in accordance with aspects of the present disclosure.

DETAILED DESCRIPTION

A data management system (DMS) may provide data management services (e.g., backup, restore, duplication, failover, data analysis, threat detection) for data objects (e.g., data, file systems, applications, databases) implemented at a computing system. The DMS may coordinate with an agent installed at (e.g., on) the computing system to perform or otherwise support one or more of the data management services for the data objects. In some examples, the agent may further coordinate with the data objects to effect a data management operation. In some cases, to perform a data management service, the agent may execute data management operations that are native to the data object. For example, the agent may execute a native backup operation, a native restore operation, a native duplication operation, or the like, to support a corresponding data management service provided by the DMS.

Some data objects implemented at a computing system may support the generation of files that are larger than a threshold size (e.g., larger than 500 gigabytes). Such files may be referred to as “very large files.” For data objects that support very large files and for which native operations are used to support data management services of the DMS, the execution of the data management services may experience significant latency if the native operations fail to support parallelized processing of sections of individual files. In some examples, a data object may support parallelized processing of sections of individual files for some data management operations (e.g., backup operations) but not others (e.g., restore operations, duplication operations). Accordingly, for a data object, execution of some data management services may experience significant latency.
Thus, mechanisms (e.g., techniques, components, configurations) that enable all data management functions (e.g., backup, restoration, duplication) to support parallelized processing of sections of individual files (e.g., on a per-section basis) for data objects (e.g., that support very large files, that do not have native functions that support per-section processing for all data management functions) may be desired. To support parallelized processing of sections of individual files for all data management functions, virtual partitions of one or more files (e.g., files that exceed a size threshold) materialized for a data management process (e.g., a restoration procedure, a duplication procedure, etc.) may be distributed across the nodes of a data management system and processed in parallel by the nodes.

FIG. 1 illustrates an example of a computing environment 100 that supports parallelizing restoration of database files in accordance with aspects of the present disclosure. The computing environment 100 may include a computing system 105, a data management system (DMS) 110, and one or more computing devices 115, which may be in communicatio