US-12625879-B2 - System for optimizing storage replication in a distributed data analysis system using historical data access patterns

US12625879B2US 12625879 B2US12625879 B2US 12625879B2US-12625879-B2

Abstract

Historical analysis of query patterns is used to discover relationships between data sets. These relationships are used to make optimal decisions about where to place data in a globally distributed environment in which locality of data is an important factor in providing good query performance. A mixed integer programming model is used to solve a constraint based system which balances the need to have data kept local with other data and the cost of replicating data across a low-bandwidth network.

Inventors

Alejandra Estanislao
Purujit Saha
Alan Pearson
Andrew Hitchcock

Assignees

GOOGLE LLC

Dates

Publication Date: 20260512
Application Date: 20180821

Claims (17)

1 . A method for optimizing replicated data storage, the method comprising: analyzing, with the one or more processors, query logs for a distributed computing system; identifying, with the one or more processors, projects with linked data sets in the query logs; storing, with the one or more processors, storage data sets for the projects with the linked data sets on a same computing device in the distributed computing system to reduce a number of remote read transactions and associated latency to access data in the linked data sets, wherein storing storage data sets for the projects with the linked data sets on the same computing device includes storing, on the same computing device in the distributed computing system, two or more of the storage data sets that the query logs indicate have been joined by a single user query; generating, with the one or more processors, a graph, wherein each of the identified projects is represented by a node in the graph; and identifying, with the one or more processors, clusters of projects connected by edges having weights greater than a predetermined threshold, and wherein a number of times two of the identified projects were linked in the query logs is represented by a weighted edge between the nodes presenting the two projects.
2 . The method of claim 1 , wherein storing the projects with the linked data sets comprises storing the identified clusters.
3 . The method of claim 2 , further comprising determining, with the one or more processors, data transfer operations required to implement an assignment map, wherein determining the data transfer operations comprises ensuring that an amount of available storage is not exceeded when data is copied from a source server to a destination server.
4 . The method of claim 1 , further comprising using a mixed integer programming model to identify clusters of projects based on linear relationships between projects.
5 . The method of claim 4 , wherein constraints for the mixed integer programming model include storage and compute capacities for each server, storage and compute requirements of each cluster, cumulative storage and compute requirements for each cluster must not exceed the storage and compute capacities for the server assigned to the cluster, each cluster has multiple replicas which must be stored on separate servers, data should remain on a server where it is already located where possible.
6 . The method of claim 4 , further comprising generating, with the one or more processors, an assignment map based on an output of the mixed integer programming model, the assignment map indicating where to store each cluster.
7 . The method of claim 6 , further comprising determining, with the one or more processors, data transfer operations required to implement the assignment map.
8 . The method of claim 7 , wherein determining the data transfer operations comprises ensuring that an amount of available storage is not exceeded when data is copied from a source server to a destination server.
9 . A system for optimizing replicated data storage, comprising: one or more memories storing a log of transactions between data sets in a distributed computing system over a period of time; one or more processors in communication with the one or more memories, the one or more processors configured to: analyze the log for a distributed computing system; identify, based on the analysis of the log, projects with linked data sets; store storage data sets for the projects with the linked data sets on a same computing device in the distributed computing system to reduce a number of remote read transactions and associated latency to access data in the linked data sets, wherein the one or more processors are configured to store storage data sets for the projects with the linked data sets on the same computing device including storing, on the same computing device in the distributed computing system, two or more of the storage data sets that the query logs indicate have been joined by a single user query; generate a graph, wherein each of the identified projects is represented by a node in the graph; and identify clusters of projects connected by edges having weights greater than a predetermined threshold, and wherein a number of times two of the identified projects were linked in the query logs is represented by a weighted edge between the nodes presenting the two projects.
10 . The system of claim 9 , wherein the one or more processors are further configured to identify clusters of projects connected by edges having weights greater than a predetermined threshold.
11 . The system of claim 10 , wherein storing the projects with the most linked data sets comprises storing the identified clusters.
12 . The system of claim 9 , wherein the one or more processors are further configured to use a mixed integer programming model to identify clusters of projects based on linear relationships between projects.
13 . The system of claim 12 , wherein constraints for the mixed integer programming model include storage and compute capacities for each server, storage and compute requirements of each cluster, cumulative storage and compute requirements for each cluster must not exceed the storage and compute capacities for the server assigned to the cluster, each cluster has multiple replicas which must be stored on separate servers, data should remain on a server where it is already located where possible.
14 . The system of claim 13 , wherein the one or more processors are further configured to generate an assignment map based on an output of the mixed integer programming model, the assignment map indicating where to store each cluster.
15 . The system of claim 14 , wherein the one or more processors are further configured to determine data transfer operations required to implement the assignment map.
16 . The system of claim 15 , wherein determining the data transfer operations comprises ensuring that an amount of available storage in the servers is not exceeded when data is copied from a source server to a destination server.
17 . A non-transitory computer readable medium storing instructions executable by a processor for performing a method of optimizing replicated data storage, the method comprising: analyzing query logs for a distributed computing system; identifying projects with linked data sets in the query logs; storing the projects with the linked data sets on a same computing device in the distributed computing system to reduce a number of remote read transactions and associated latency to access data in the linked data sets, wherein storing storage data sets for the projects with the linked data sets on the same computing device includes storing, on the same computing device in the distributed computing system, two or more of the storage data sets that the query logs indicate were joined by a single user query; generating, with the one or more processors, a graph, wherein each of the identified projects is represented by a node in the graph; and identifying, with the one or more processors, clusters of projects connected by edges having weights greater than a predetermined threshold, and wherein a number of times two of the identified projects were linked in the query logs is represented by a weighted edge between the nodes presenting the two projects.

Description

BACKGROUND Cloud computing systems sometimes include a distributed data analysis engine, which operates in multiple data centers distributed globally. Each data center contains one or more servers. Users of such cloud computing systems may create organizations and projects. Within a project, the distributed data analysis engine allows users to create data sets and tables. Internally, tables are partitioned into units of data replication, called storage sets. Each storage set corresponds to one or more files stored on a server. While users typically query their own data sets, it is also possible for one user to share data sets with another user or make them publicly available to many users. Multiple data sets may be joined together at query time, which potentially requires the system to read data from a large number of distinct data sets, possibly belonging to arbitrary users. When evaluating a query, the distributed data analysis engine executes a set of processes within a specific server. These processes read the storage set files described above and perform most efficiently when the files being read are stored on the same server that is running the analysis processes. Reading data from a remote server is inherently more expensive and involves higher latency. The amount of available bandwidth for cross-server data transfer is also limited and is a scarce resource. Due to these limitations, in some systems cross-server reads exceeding a small limit are disabled and all of the data being processed by the analysis processes must be present in the local server. In order to provide fault tolerance, redundancy and high availability, some systems replicate all storage sets to every server in which the analysis processes may be run. However, replicating data to every server in use is problematic because it is costly in terms of the amount of data transfer needed as the number of servers in use grows. Each additional server would add linear growth in the volume of data transfer and so would not be scalable. Moreover, replicating data to every server is further problematic because it imposes a ceiling on growth of system's storage capabilities, since the system would be capped by the available storage size of the smallest server in use. BRIEF SUMMARY The present disclosure provides a method for optimizing replicated data storage. The method includes identifying, with one or more processors, data sets commonly owned by an organization, automatically storing the commonly owned data sets on a same computing device in a distributed computing system, analyzing, with the one or more processors, query logs for the distributed computing system, identifying, with the one or more processors, projects with linked data sets in the query logs, and storing, with the one or more processors, the projects with the most frequently linked data sets on the same computing device in the distributed computing system. According to some examples, the method may further include generating, with the one or more processors, a graph, wherein each of the identified projects is represented by a node in the graph, and wherein a number of times two of the identified projects were linked in the query logs is represented by a weighted edge between the nodes presenting the two projects. Clusters of projects connected by edges having weights greater than a predetermined threshold may be identified, wherein storing the projects with the most linked data sets comprises storing the identified clusters. According to other examples, the method may include using a mixed integer programming model to identify clusters of projects based on linear relationships between projects. Constraints for the mixed integer programming model may include storage and compute capacities for each server, storage and compute requirements of each cluster, cumulative storage and compute requirements for each cluster must not exceed the storage and compute capacities for the server assigned to the cluster, each cluster has multiple replicas which must be stored on separate servers, data should remain on a server where it is already located where possible. An assignment map may be generated based on an output of the mixed integer programming model, the assignment map indicating where to store each cluster. Data transfer operations required to implement the assignment map may be determined, ensuring that an amount of available storage in the servers is not exceeded when data is copied from a source server to a destination server. Another aspect of the disclosure provides a system for optimizing replicated data storage. The system includes one or more memories storing a log of transactions between data sets in a distributed computing system over a period of time; and one or more processors in communication with the one or more memories. The one or more processors are configured to identify data sets commonly owned by an organization, automatically store the commonly owned data sets on