
US-12619633-B2 - Fast database scaling utilizing a decoupled storage and compute architecture

US 12619633 B2

Abstract

Techniques for fast online scaling of a database via a split architecture including decoupled storage and compute tiers in a database service are described. A cluster of database (DB) nodes is scaled to add a new DB node. The scaling includes determining a split for data of a first volume managed by an existing DB node. A second DB node is obtained, and the first volume is cloned according to a lightweight copy technique to yield a second volume for use by the second DB node. After the cloning, a set of database modifications are applied to the second volume based on modifications caused by database traffic received by the first DB node, involving the volume, during the cloning of the first volume. Each DB node may drop the portion of the volume that it does not need according to the split.
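
To make the sequence of steps in the abstract concrete, the following is a minimal sketch of the scale-out flow written as a single control-plane routine. All names used here (scale_out, plan_split, clone_cow, and so on) are hypothetical stand-ins for the operations the abstract describes, not an API disclosed by the patent.

```python
# Hypothetical sketch of the scale-out flow described in the abstract.
# Every class and method here is an illustrative stand-in, not a real API.

def scale_out(control_plane, cluster, first_node):
    """Add a new DB node by splitting the volume of an existing node."""
    # 1. Decide how to split the existing node's data (e.g., by key range).
    split = control_plane.plan_split(first_node.volume)

    # 2. Obtain a second compute-tier node and register it in global metadata.
    second_node = control_plane.provision_db_node()
    control_plane.metadata.register(second_node)

    # 3. Clone the volume inside the storage tier using a lightweight
    #    (copy-on-write) technique; no full physical copy is made.
    second_volume = first_node.volume.clone_cow()
    second_node.attach(second_volume)

    # 4. Replay modifications that arrived at the first node while the
    #    clone was in progress, so the second volume catches up.
    backlog = first_node.modifications_since(split.start_time)
    second_volume.apply(backlog)

    # 5. Flip routing: the new node now owns the second portion of the data.
    control_plane.metadata.mark_deleting(first_node, split.second_portion)
    control_plane.metadata.mark_active(second_node, split.second_portion)

    # 6. Each node drops the portion of the data it no longer needs
    #    under the split.
    first_node.volume.delete(split.second_portion)
    second_volume.delete(split.first_portion)
```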

Inventors

  • Andrew James WHITAKER
  • Gajanan Sharadchandra CHINCHWADKAR
  • Jan ENGELSBERG
  • Meet Kiritkumar BHAGDEV
  • Sean Oczkowski
  • Saleem Mohideen
  • Alexandre Olegovich Verbitski
  • Murali Brahmadesam
  • Navaneetha Krishnan Thanka Nadar
  • Li Che David HSIAO
  • Apoorv Birthare

Assignees

  • AMAZON TECHNOLOGIES, INC.

Dates

Publication Date
2026-05-05
Application Date
2021-12-10

Claims (20)

  1. A computer-implemented method, comprising: determining to add a new database (DB) node to a cluster of one or more DB nodes implementing a database in a database service, wherein the cluster of DB nodes includes a first DB node operating in a compute tier of the database service that is implemented by a first one or more electronic devices and that manages query execution, wherein the database service further includes a storage tier implemented by a second one or more electronic devices that includes storage nodes that manage the storage and durability of data of databases operated upon by the compute tier; scaling the cluster of DB nodes to add at least the new DB node, comprising: registering the new DB node in global metadata managed by a control plane of the database service; determining a logical split for data of a first volume in the storage tier, wherein the first volume is managed by the first DB node in the compute tier, the logical split including a first portion of the data to remain managed by the first DB node and a second portion of the data to be managed by the new DB node; obtaining a second DB node, in the compute tier, to serve as the new DB node; cloning the first volume, within the storage tier, to yield a second volume in the storage tier for use by the second DB node in the compute tier, wherein the cloning copies less than all of the first volume within the storage tier; configuring the second DB node in the compute tier to utilize the second volume in the storage tier; applying, after the cloning, a set of database modifications to the second volume in the storage tier, where the set of database modifications resulted from traffic received by the first DB node in the compute tier during the cloning of the first volume; updating the global metadata to indicate a deleting state for the second portion of the data on the first DB node; updating the global metadata to indicate the second DB node is active and to begin sending database traffic involving the second portion of the data to the second DB node in the compute tier; causing the first DB node in the compute tier to delete the second portion of the data that is in the deleting state from the first volume in the storage tier; and causing the second DB node in the compute tier to delete the first portion of the data from the second volume in the storage tier.
  2. The computer-implemented method of claim 1, further comprising, prior to the cloning of the first volume, waiting up to a period of time for existing transactions involving the first DB node to complete before proceeding with the cloning of the first volume.
  3. The computer-implemented method of claim 1, wherein during or after the cloning of the first volume, but prior to the updating of the global metadata to indicate the second DB node is active, the method further comprises: receiving a request at the first DB node involving the second portion of the data that is to be managed by the new DB node; and sending a response message to a proxy layer of the database service indicating that the request is denied, wherein the response message at least one of: indicates that the request can be retried; or includes a hint value allowing the proxy layer to update its routing cache to send additional requests involving the second portion of the data to the second DB node.
  4. A computer-implemented method, comprising: determining to add a new database (DB) node to a cluster of one or more DB nodes implementing a database in a database service, wherein the cluster includes a first DB node of a compute tier of the database service; and scaling the cluster of one or more DB nodes to add at least the new DB node, comprising: registering the new DB node in global metadata managed by a control plane of the database service; determining a logical split for data of a first volume in a storage tier that is distinct from the compute tier, wherein the first volume is managed by the first DB node in the compute tier, the logical split including a first portion of the data to remain managed by the first DB node and a second portion of the data to be managed by the new DB node; obtaining a second DB node, in the compute tier, to serve as the new DB node; cloning the first volume, within the storage tier, to yield a second volume in the storage tier for use by the second DB node in the compute tier, wherein the cloning copies less than all of the data of the first volume; applying, after the cloning, a set of database modifications to the second volume in the storage tier, where the set of database modifications resulted from database traffic received by the first DB node in the compute tier, involving the second portion of the data, during the cloning of the first volume; updating the global metadata to indicate a deleting state for the second portion of the data on the first DB node; updating the global metadata to indicate the second DB node is active and to begin sending database traffic involving the second portion of the data to the second DB node in the compute tier; and expunging the second portion of the data on the first DB node that is in the deleting state.
  5. The computer-implemented method of claim 4, further comprising causing the second DB node to delete the first portion of the data from the second volume.
  6. The computer-implemented method of claim 5, wherein after the scaling of the cluster the second DB node deletes at least some of the first portion of the data from the second volume.
  7. The computer-implemented method of claim 4, wherein scaling the cluster further comprises: configuring the first DB node to insert, into a modification log, data corresponding to the set of database modifications; and transmitting the data, from the modification log, to the second DB node, wherein the applying of the set of database modifications occurs at least in part via the second DB node of the compute tier and includes obtaining, by the second DB node, the data corresponding to the set of database modifications from the first DB node.
  8. The computer-implemented method of claim 4, wherein the applying of the set of database modifications occurs within the storage tier without the involvement of the compute tier and includes: identifying, by the storage tier, modifications to the first volume; and causing, by the storage tier, the modifications to be applied to the second volume.
  9. The computer-implemented method of claim 4, wherein cloning the first volume to yield the second volume comprises creating, as the second volume, a copy-on-write volume based on creating one or more metadata elements for the second volume that identify the data of the first volume, wherein the copy-on-write volume does not include a distinct copy of the data of the first volume.
  10. The computer-implemented method of claim 4, wherein the determining to add the new DB node to the cluster comprises: receiving a command originated on behalf of a user of the database service to scale the database; or determining that an autoscaling condition of a scaling policy is satisfied, wherein the autoscaling condition is based at least in part on a metric associated with the cluster or with individual DB nodes of the cluster.
  11. The computer-implemented method of claim 4, wherein the database service is implemented within a multi-tenant cloud provider network, and wherein the storage tier provides the first volume and the second volume to the first DB node and the second DB node via use of a plurality of storage nodes implemented by a plurality of computing devices, wherein the plurality of computing devices are multi-tenant and provide storage tier services for the database service to multiple different databases of multiple different users.
  12. The computer-implemented method of claim 4, wherein updating the proxy layer includes at least transmitting a message, to the proxy layer, including data identifying a set of one or more partition key values and identifying the second DB node.
  13. The computer-implemented method of claim 4, further comprising, prior to the cloning of the first volume, waiting up to a period of time for existing transactions involving the first DB node to complete before proceeding with the cloning of the first volume.
  14. The computer-implemented method of claim 4, wherein during or after the cloning of the first volume, but prior to the updating of the proxy layer, the method further comprises: receiving a request at the first DB node involving the second portion of the data that is to be managed by the new DB node; and sending a response message to the proxy layer indicating that the request is denied.
  15. The computer-implemented method of claim 14, wherein the response message indicates that the request can be retried.
  16. The computer-implemented method of claim 15, wherein the response message further includes a hint value allowing the proxy layer to update its routing cache to send additional requests involving the second portion of the data to the second DB node.
  17. A system comprising: a first one or more computing devices to implement a compute tier of a database service in a multi-tenant provider network to manage query execution against databases; a second one or more computing devices to implement a storage tier of the database service to manage storage and durability of data of the databases operated upon by the compute tier; and a third one or more computing devices to implement a control plane of the database service in the multi-tenant provider network, the control plane including instructions that upon execution cause the control plane to: determine to add a new database (DB) node to a cluster of one or more DB nodes implementing a database of the databases, wherein the cluster includes a first DB node implemented in the compute tier; and scale the cluster of one or more DB nodes to add at least the new DB node, comprising: registering the new DB node in global metadata managed by the control plane of the database service; determining a logical split for data of a first volume in the storage tier, wherein the first volume is managed by the first DB node in the compute tier, the logical split including a first portion of the data to remain managed by the first DB node and a second portion of the data to be managed by the new DB node; obtaining a second DB node, in the compute tier, to serve as the new DB node; causing the storage tier to clone the first volume to yield a second volume in the storage tier for use by the second DB node in the compute tier, wherein the cloning copies less than all of the data of the first volume; causing an application, after the cloning, of a set of database modifications to the second volume in the storage tier, where the set of database modifications resulted from database traffic received by the first DB node in the compute tier, involving the second portion of the data, during the cloning of the first volume; updating the global metadata to indicate a deleting state for the second portion of the data on the first DB node; updating the global metadata to indicate the second DB node is active and to begin sending database traffic involving the second portion of the data to the second DB node in the compute tier; and expunging the second portion of the data on the first DB node that is in the deleting state.
  18. The system of claim 17, wherein the control plane further includes instructions that upon execution further cause the control plane to cause the second DB node to delete the first portion of the data from the second volume.
  19. The system of claim 17, wherein causing the second DB node to delete the first portion of the data from the second volume occurs after the scaling of the cluster.
  20. The system of claim 17, wherein to scale the cluster, the control plane is further to configure the first DB node to insert, into a data stream, data corresponding to the set of database modifications, wherein the application of the set of database modifications occurs at least in part via the compute tier and includes obtaining, by the second DB node, the data corresponding to the set of database modifications from the data stream.
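
The copy-on-write volume of claim 9 is what lets the cloning step copy less than all of the data: a clone initially consists only of metadata referencing the parent volume's blocks, and a block is physically copied only when first written. The sketch below illustrates the generic copy-on-write technique under assumed names (CowVolume, clone, and so on); it is not the storage tier's actual design.

```python
# Generic copy-on-write volume sketch, loosely in the spirit of claim 9.
# A clone holds only a metadata reference to its parent; block data is
# materialized in the clone only on first write. Names are hypothetical.

class CowVolume:
    def __init__(self, blocks=None, parent=None):
        self.own_blocks = blocks if blocks is not None else {}
        self.parent = parent  # metadata reference, not a data copy

    def clone(self):
        # Creating the clone is O(1): no block data is copied.
        return CowVolume(parent=self)

    def read(self, block_id):
        if block_id in self.own_blocks:
            return self.own_blocks[block_id]
        if self.parent is not None:
            return self.parent.read(block_id)  # fall through to parent
        raise KeyError(block_id)

    def write(self, block_id, data):
        # First write to a block materializes it in the clone.
        self.own_blocks[block_id] = data


# Usage: cloning is instant; later writes diverge without touching the parent.
first = CowVolume(blocks={0: b"records A-L", 1: b"records M-Z"})
second = first.clone()
second.write(0, b"")                      # clone drops the first portion
assert first.read(0) == b"records A-L"    # parent volume is unaffected
assert second.read(1) == b"records M-Z"   # unwritten blocks read through
```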

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Provisional Application No. 63/283,364, filed Nov. 26, 2021, which is hereby incorporated by reference.

BACKGROUND

Traditionally, databases are able to “scale” to accommodate increased demand using a technique called sharding (which is sometimes called partitioning). The key idea of sharding is to subdivide a user's data across multiple backing nodes—or “shards”—so that the load is distributed. Typically, a proxy fleet is responsible for mapping a user's query to one or more of these shards, e.g., based on a user-provided shard key. A key challenge for such sharding implementations, however, is how to implement dynamic scaling—i.e., the ability to incorporate new shards into the system as it remains operational—in an efficient manner. Traditionally, this process involves rebalancing (i.e., copying) data from an existing shard to a newly created shard.

BRIEF DESCRIPTION OF DRAWINGS

Various embodiments in accordance with the present disclosure will be described with reference to the drawings, in which:

FIG. 1 is a diagram illustrating problematic aspects of scaling a database cluster in an online manner.

FIG. 2 is a diagram illustrating fast online scaling of a database via a split architecture including decoupled storage and compute tiers in a database service according to some embodiments.

FIG. 3 is a diagram illustrating exemplary components of a database system compute tier utilized with a distinct storage tier according to some embodiments.

FIG. 4 is a diagram illustrating exemplary components of a distributed storage tier of a database system utilized with a distinct compute tier according to some embodiments.

FIG. 5 is a diagram illustrating exemplary operational aspects of a split architecture database including decoupled storage and compute tiers in a database service according to some embodiments.

FIG. 6 is a flow diagram illustrating operations of a method for scaling a database cluster using a decoupled storage and compute architecture according to some embodiments.

FIG. 7 illustrates an example provider network environment according to some embodiments.

FIG. 8 is a block diagram of an example provider network that provides a storage service and a hardware virtualization service to customers according to some embodiments.

FIG. 9 is a block diagram illustrating an example computer system that can be used in some embodiments.

DETAILED DESCRIPTION

The present disclosure relates to methods, apparatus, systems, and non-transitory computer-readable storage media for implementing fast database scaling utilizing a decoupled storage and compute-based architecture. As indicated herein, dynamically scaling a database to incorporate new shards into a cluster is a key challenge, and typically involves rebalancing efforts by copying data from one or more existing shards to a new shard or set of new shards. For example, FIG. 1 is a diagram illustrating problematic aspects of scaling a database cluster 106A. In this example, a database may be implemented by a database service 110 (e.g., of a cloud provider network 100) via a database cluster 106A, where the database service 110 may implement a large number of different database clusters 106A-106M at any point in time for one or more (and typically many) different users.
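
The proxy fleet described in the background, together with the denied-request “hint value” recited in claims 3 and 14-16, suggests a routing loop along the following lines. This is a speculative sketch with invented names (RoutingCache, route, apply_hint); the patent does not specify the proxy fleet at this level of detail.

```python
# Minimal sketch of shard-key routing with a hint-based cache repair,
# in the spirit of claims 3 and 14-16. All names are hypothetical.

class RoutingCache:
    def __init__(self):
        self.owner_by_key_range = {}  # (lo, hi) key range -> DB node id

    def lookup(self, shard_key):
        for (lo, hi), node in self.owner_by_key_range.items():
            if lo <= shard_key < hi:
                return node
        raise KeyError(shard_key)

    def apply_hint(self, key_range, node):
        # A denied response may carry a hint naming the new owner, letting
        # the proxy repair its cache without asking the control plane.
        self.owner_by_key_range[key_range] = node


def route(cache, nodes, shard_key, request):
    """Send a request to the cached owner; on denial, repair and retry."""
    node = cache.lookup(shard_key)
    response = nodes[node].execute(request)
    if response.get("denied"):
        hint = response.get("hint")  # e.g. {"range": ("m", "{"), "node": "db-2"}
        if hint:
            cache.apply_hint(hint["range"], hint["node"])
            node = cache.lookup(shard_key)           # now the new owner
            response = nodes[node].execute(request)  # retry once
    return response
```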
Generally speaking, complex databases (e.g., relational databases, document databases, and the like) providing complex analysis/querying functionalities can be architected using a database cluster 106A of one or more database nodes 114. A database node may include an instantiation of a database, e.g., in the form of a database engine 116 (including a query engine) with associated data (shown here as a collection of “records” 118) belonging to the database that is “managed” by the node. In some embodiments, a database node may be implemented by launching a virtual machine or other type of compute instance having database software preinstalled or installed after the launch of the instance. In this example, for increased processing capability, a database can be “sharded” across multiple different database nodes, here meaning that a first database node is responsible for a first portion of the data of the database (e.g., records A-L 118A of a particular table or collection of records, or datasets 1-10 of a collection of datasets, or the like) while a second database node may be responsible for a second portion of the database (e.g., records M-Z 118B).

In some cases, a control plane 104 set of components of the database service may determine that a rebalancing is to occur, e.g., because a particular database node 114B is exhausting its available resources (e.g., storage, memory, compute, bandwidth), because a user desires to rebalance the database (e.g., via receipt of a command from the user to do so), etc. Thus, the control plane may wish to expand the cluster 106A of nodes, e.g., by adding a new database node 1