US-12625778-B1 - Data management across cloud environments

Abstract

Techniques are described for data management across cloud environments. An example method comprises restoring, by the data platform, a selected copy of the one or more copies from the one or more second storage systems by, for each chunk of the selected copy: identifying, by the data platform, the chunk in the chunk metadata, determining, by the data platform, whether a matching chunk is stored on the first storage system based on the chunk metadata, responsive to determining the matching chunk is stored on the first storage system, retrieving, by the data platform, the matching chunk from the first storage system, wherein the matching chunk is included in the selected copy, and responsive to determining the matching chunk is not stored on the first storage system, retrieving, by the data platform, the chunk from the one or more second storage systems, wherein the chunk is included in the selected copy.
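The per-chunk restore loop summarized above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the dict-backed stores, the `local_index` set standing in for chunk metadata, and the use of SHA-256 fingerprints to decide that two chunks match are all hypothetical names and choices introduced for this example.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Hypothetical chunk identifier: SHA-256 of the chunk contents."""
    return hashlib.sha256(data).hexdigest()


def restore_copy(copy_chunk_ids, local_index, local_store, remote_store):
    """Rebuild a copy chunk by chunk, preferring the first (local) storage system.

    copy_chunk_ids: ordered chunk fingerprints that make up the selected copy
    local_index:    fingerprints the chunk metadata records as stored locally
    local_store, remote_store: dicts mapping fingerprint -> bytes (stand-ins
                    for the first and second storage systems)
    Returns the restored bytes and the number of chunks read remotely.
    """
    out = bytearray()
    remote_reads = 0
    for cid in copy_chunk_ids:
        if cid in local_index:
            # A matching chunk exists on the first storage system: no egress.
            out += local_store[cid]
        else:
            # No local match: fall back to the second storage system.
            out += remote_store[cid]
            remote_reads += 1
    return bytes(out), remote_reads


# Usage: the secondary holds the whole copy; the primary still has two chunks.
chunks = [b"alpha", b"beta", b"gamma"]
ids = [fingerprint(c) for c in chunks]
remote_store = dict(zip(ids, chunks))
local_store = {ids[0]: chunks[0], ids[2]: chunks[2]}
local_index = set(local_store)
restored, remote_reads = restore_copy(ids, local_index, local_store, remote_store)
```

In this toy run only the middle chunk is read from the second storage system; the other two come from the first.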

Inventors

Anirudh Kumar

Assignees

Cohesity, Inc.

Dates

Publication Date: 2026-05-12
Application Date: 2024-01-30

Claims (18)

1. A method comprising: storing, by a data platform implemented by a computing system, a first plurality of chunks storing data for an object of a file system and a second plurality of chunks storing data for a copy of the object, wherein the first plurality of chunks are stored on a first storage system and the second plurality of chunks are stored on one or more second storage systems; storing, by the data platform, chunk metadata for both the first plurality of chunks and the second plurality of chunks at the first storage system that is deployed local to the computing system that implements the data platform; and restoring, by the data platform, a selected copy of the one or more copies from the one or more second storage systems by, for each chunk of the selected copy: identifying, by the data platform and from the chunk metadata, a data access cost for restoring a matching chunk, that matches the chunk, from the first storage system, wherein the data access cost differs based on whether restoring the matching chunk from the first storage system includes a data egress from a different computing system to the computing system that implements the data platform; determining, by the data platform, whether the data access cost for restoring the matching chunk from the first storage system is lower than that of restoring the chunk from the one or more second storage systems; responsive to determining that the data access cost for restoring the matching chunk from the first storage system is lower than that of restoring the chunk from the second storage system, retrieving, by the data platform, the matching chunk from the first storage system, wherein the matching chunk is included in the selected copy; and responsive to determining that the data access cost for restoring the matching chunk from the first storage system is not lower than that of restoring the chunk from the second storage system, retrieving, by the data platform, the chunk from the one or more second storage systems, wherein the chunk is included in the selected copy.
2. The method of claim 1, further comprising, after restoring the selected copy, storing, by the data platform, the selected copy on a storage system selected from one or more of the first storage system and the one or more second storage systems.
3. The method of claim 1, wherein each of at least a subset of the one or more copies represents a backup of the object for a particular time.
4. The method of claim 1, wherein the one or more second storage systems are deployed to the different computing system.
5. The method of claim 1, wherein each of the first storage system and the one or more second storage systems are provided by distinct cloud service providers.
6. The method of claim 5, wherein retrieving the chunk from the one or more second storage systems is associated with a higher cost than retrieving the matching chunk from the first storage system.
7. The method of claim 5, further comprising locating the chunk in the one or more second storage systems is associated with a higher cost than locating the matching chunk from the first storage system.
8. The method of claim 1, further comprising receiving, by the data platform, an indication of the selected copy via an input device.
9. A computing system comprising: a memory storing instructions; and processing circuitry that executes the instructions to: store a first plurality of chunks storing data for an object of a file system and a second plurality of chunks storing data for a copy of the object, wherein the first plurality of chunks are stored on a first storage system and the second plurality of chunks are stored on one or more second storage systems; store chunk metadata for the first plurality of chunks and the second plurality of chunks at the first storage system that is deployed local to the computing system that implements a data platform; and restore a selected copy of the one or more copies from the one or more second storage systems by, for each chunk of the selected copy: identify, from the chunk metadata, a data access cost for restoring a matching chunk, that matches the chunk, from the first storage system, wherein the data access cost differs based on whether restoring the matching chunk from the first storage system includes a data egress from a different computing system to the computing system; determine whether the data access cost for restoring the matching chunk from the first storage system is lower than that of restoring the chunk from the one or more second storage system; responsive to determining that the data access cost for restoring the matching chunk from the first storage system is lower than that of restoring the chunk from the second storage system, retrieve the matching chunk from the first storage system, wherein the matching chunk is included in the selected copy; and responsive to determining that the data access cost for restoring the matching chunk from the first storage system is not lower than that of restoring the chunk from the second storage system, retrieve the chunk from the one or more second storage systems, wherein the chunk is included in the selected copy.
10. The computing system of claim 9, wherein the processing circuitry further executes the instructions to, after restoring the selected copy, store the selected copy on a storage system selected from one or more of the first storage system and the one or more second storage systems.
11. The computing system of claim 9, wherein each of at least a subset of the one or more copies represents a backup of the object for a particular time.
12. The computing system of claim 9, wherein the one or more second storage systems are deployed to the different computing system.
13. The computing system of claim 9, wherein each of the first storage system and the one or more second storage systems are provided by distinct cloud service providers.
14. The computing system of claim 13, wherein retrieving the chunk from the one or more second storage systems is associated with a higher cost than retrieving the matching chunk from the first storage system.
15. The computing system of claim 13, further comprising locating the chunk in the one or more second storage systems is associated with a higher cost than locating the matching chunk from the first storage system.
16. The computing system of claim 9, wherein the processing circuitry further executes the instructions to receive an indication of the selected copy via an input device.
17. A non-transitory computer-readable storage medium comprising instructions that, when executed, cause processing circuitry of a computing system to: store a first plurality of chunks storing data for an object of a file system and a second plurality of chunks storing data for a copy of the object, wherein the first plurality of chunks are stored on a first storage system and the second plurality of chunks are stored on one or more second storage systems; store chunk metadata for the first plurality of chunks and the second plurality of chunks at the first storage system that is deployed local to the computing system that implements a data platform; and restore a selected copy of the one or more copies from the one or more second storage systems by, for each chunk of the selected copy: identify, from the chunk metadata, a data access cost for restoring a matching chunk, that matches the chunk, from the first storage system, wherein the data access cost differs based on whether restoring the matching chunk from the first storage system includes a data egress from a different computing system to the computing system; determine whether the data access cost for restoring the matching chunk from the first storage system is lower than that of restoring the chunk from the one or more second storage system; responsive to determining that the data access cost for restoring the matching chunk from the first storage system is lower than that of restoring the chunk from the second storage system, retrieve the matching chunk from the first storage system, wherein the matching chunk is included in the selected copy; and responsive to determining that the data access cost for restoring the matching chunk from the first storage system is not lower than that of restoring the chunk from the second storage system, retrieve the chunk from the one or more second storage systems, wherein the chunk is included in the selected copy.
18. The computer-readable storage medium of claim 17, wherein, when further executed, the instructions cause the processing circuitry of the computing system to, after restoring the selected copy, store the selected copy on a storage system selected from one or more of the first storage system and the one or more second storage systems.
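
The cost comparison recited in the independent claims can be illustrated with a short sketch. It is one possible reading, offered under assumptions: the `ChunkLocation` record, its field names, the additive cost model (a per-request charge plus a per-byte egress charge), and the handling of a missing matching chunk are hypothetical and do not come from the claims.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChunkLocation:
    """Hypothetical chunk-metadata entry for one stored chunk."""
    system: str              # storage system holding the chunk
    size_bytes: int
    requires_egress: bool    # True if reading it moves data between computing systems
    egress_cost_per_byte: float = 0.0
    request_cost: float = 0.0


def access_cost(loc: ChunkLocation) -> float:
    """Cost of one read: request charge, plus egress charge only when data egresses."""
    egress = loc.size_bytes * loc.egress_cost_per_byte if loc.requires_egress else 0.0
    return loc.request_cost + egress


def choose_source(matching_first: Optional[ChunkLocation],
                  chunk_second: ChunkLocation) -> ChunkLocation:
    """Use the first storage system only when its cost is strictly lower.

    A cost that is "not lower" (including a tie) selects the second storage
    system. Treating a missing matching chunk as "use the second storage
    system" is an assumption of this sketch.
    """
    if matching_first is not None and access_cost(matching_first) < access_cost(chunk_second):
        return matching_first
    return chunk_second


# Usage with made-up rates: a local read without egress versus a remote read with egress.
local = ChunkLocation("first", 32_000, requires_egress=False, request_cost=0.000002)
remote = ChunkLocation("second", 32_000, requires_egress=True,
                       egress_cost_per_byte=1e-6, request_cost=0.000002)
source = choose_source(local, remote)
```

Because `requires_egress` is a per-location flag, the same function also covers the case where reading from the first storage system itself involves egress and is therefore not the cheaper option.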

Description

TECHNICAL FIELD

This disclosure relates to data platforms for computing systems.

BACKGROUND

Data platforms that support computing applications rely on primary storage systems to support latency sensitive applications. However, because primary storage is often more difficult or expensive to scale, a secondary storage system is often relied upon to support secondary use cases such as backup and archive.

SUMMARY

Aspects of this disclosure describe techniques for data management across cloud environments, such as may be provided by public cloud service providers. Some data platforms exist in a hybrid cloud arrangement where data of a file system (e.g., a distributed file system) may be stored across various cloud environments. For example, a data platform may store data in a storage service within the cloud environment where the data platform resides (e.g., a primary copy, backup, or archive), while storing a copy (e.g., a secondary copy, backup, or archive) of the data in a storage service of one or more other cloud environments.

When a data platform in a first cloud environment reads data from a second cloud environment, data must egress the second cloud environment. For example, a data platform may store a selected copy of the data (e.g., a primary copy) in a first cloud environment (e.g., a primary cloud environment) and store one or more secondary copies of the data (e.g., secondary copies) in one or more distinct second cloud environments (e.g., a secondary cloud environment). In this example, the data platform may be deployed to or otherwise reside in the primary cloud environment and therefore no data egress occurs when the data platform accesses the primary copy. Data egress may occur when a data platform accesses secondary copies, such as during regular operation or during restoration of a secondary copy of the primary copy.

A primary copy may be substantial in size (e.g., hundreds of gigabytes (GB) or more), thereby requiring an equally substantial amount of data for secondary copies. As such, when a data platform accesses data from another cloud environment (e.g., a secondary copy), an equal amount of data egress occurs. For example, to restore 500 GB of data from a secondary copy, a data platform may retrieve 500 GB of data from a secondary cloud environment, thereby causing 500 GB of data to egress the secondary cloud environment.

Data egress may incur various data access costs. For example, data egress may have costs related to latency and bandwidth as data is transmitted between cloud environments. Data egress may also be subject to monetary data access costs assessed by cloud environments, such as public cloud services. For example, some public cloud services may assess charges for API calls (e.g., $2.00 per 1 million API calls) and data egress (e.g., $1.00 per megabyte (MB)). A data platform at a primary cloud service may therefore incur various data access costs when reading secondary copies at one or more secondary cloud services. For example, to create a secondary copy, a data platform may read data from the primary cloud environment and store the data in the secondary cloud environment as the secondary copy. To restore the secondary copy, some data platforms may read the secondary copy entirely from the secondary cloud environment, which is subject to data access costs.

As will be described further herein, a data platform may store data of a file system in one or more chunks, where each chunk may represent a portion of the data. For example, a file system may comprise one or more files or other objects. The data platform may split the objects into one or more fixed or variable size chunks (e.g., 16-48 kilobytes (kB)) and store the objects as chunks in multiple cloud environments (e.g., in a hybrid cloud environment).

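
To make the example figures above concrete, the following sketch estimates the monetary cost of restoring 500 GB at the example rates of $2.00 per 1 million API calls and $1.00 per megabyte of egress. Several inputs are assumptions made only for this illustration: decimal units (1 GB = 1,000,000 kB), a fixed 32 kB chunk size taken from within the 16-48 kB range, one API call per remotely read chunk, and no charge for chunks read locally. The function name `restore_cost` and its parameters are hypothetical.

```python
from fractions import Fraction

# Example rates quoted in the text; exact fractions avoid float rounding.
API_COST_PER_CALL = Fraction(2, 1_000_000)  # $2.00 per 1 million API calls
EGRESS_COST_PER_MB = Fraction(1)            # $1.00 per megabyte of egress


def restore_cost(total_gb, chunk_kb=32, local_fraction=Fraction(0)):
    """Estimated dollars to restore total_gb of data.

    local_fraction is the share of chunks that can be read locally and so
    cause neither egress nor a remote API call (an assumption of this sketch).
    """
    total_kb = total_gb * 1_000_000
    n_chunks = -(-total_kb // chunk_kb)                 # ceiling division
    remote_chunks = n_chunks - int(n_chunks * local_fraction)
    remote_mb = Fraction(remote_chunks * chunk_kb, 1000)
    return remote_chunks * API_COST_PER_CALL + remote_mb * EGRESS_COST_PER_MB


full_remote = restore_cost(500)                                  # everything egresses
mostly_local = restore_cost(500, local_fraction=Fraction(4, 5))  # 80% read locally

print(f"all remote: ${float(full_remote):,.2f}")   # all remote: $500,031.25
print(f"80% local:  ${float(mostly_local):,.2f}")  # 80% local:  $100,006.25
```

Under these assumptions the egress charge dominates the API charge by several orders of magnitude, so the estimated cost falls almost in proportion to the share of chunks that can be served locally.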
The techniques described herein provide data management across cloud environments to reduce or eliminate data access costs when utilizing multiple cloud environments for storage of data and one or more backups, archives, or other copies thereof. For example, rather than reading a secondary copy entirely from a secondary cloud service, in accordance with the disclosed techniques, a data platform at a primary cloud service may instead determine whether at least a portion of the secondary copy is available within the primary cloud service. Responsive to the determination, the data platform may retrieve only the data that is unavailable within the primary cloud service from the secondary cloud services, thereby reducing or eliminating data egress and data access costs relative to the secondary cloud services.

Although the techniques described in this disclosure are primarily described with respect to a backup function of a data platform (e.g., restoring backups), similar techniques may be applied for an archive function (e.g., restoring archives) or other similar function of the data platform.

In one example, this disclosure describes a method comprising storing, by a data platform implemented