US-12625883-B2 - Quorum-based scalable database system

US12625883B2US 12625883 B2US12625883 B2US 12625883B2US-12625883-B2

Abstract

Techniques are disclosed relating to a database system. The database system includes multiple coordinator nodes storing replicas of a partition. Each partition describes the state of locks and transactions for keys covered by that partition of keys. Each partition is, in turn, replicated. The multiple coordinator nodes receive, from multiple worker nodes, requests to grant a lock for a key to permit a worker node to write a record for the key as part of executing a transaction. A given coordinator node of the multiple coordinator nodes sends an approval response for the lock to at most one of the worker nodes. A single worker node acquires the lock in response to receiving approval responses from a majority of the multiple coordinator nodes, and none of the multiple worker nodes acquire the lock in response to none of them receiving approval responses from a majority of the multiple coordinator nodes.

Inventors

Patrick James Helland

Assignees

SALESFORCE, INC.

Dates

Publication Date: 20260512
Application Date: 20240722

Claims (18)

1 . A method, comprising: storing, by a plurality of coordinator nodes of a database system, a respective replica of a data partition, wherein the respective replica includes information about locks granted to ones of a plurality of worker nodes that perform database transactions; determining, by a first coordinator node of the plurality of coordinator nodes, to relocate at least a portion of the respective replica to a second coordinator node; copying, by the first coordinator node, the portion of the respective replica to the second coordinator node; during the copying, receiving, by the plurality of coordinator nodes and the second coordinator node, requests from multiple worker nodes to grant a lock for a key that permits a worker node to write a record for the key as part of executing a database transaction; determining, by multiple of the plurality of coordinator nodes and the second coordinator node independently, whether to grant the lock based on whether the lock conflicts with granted locks stored in their respective replica; and sending, by a given coordinator node that determines to grant the lock, an approval response for the lock to at most one of the multiple worker nodes, wherein a single one of the multiple worker nodes acquires the lock in response to receiving approval responses from a majority of a quorum having ones of the plurality of coordinator nodes and the second coordinator node.
2 . The method of claim 1 , wherein the plurality of coordinator nodes form a first quorum for the data partition, and the second coordinator node and the plurality of coordinator nodes without the first coordinator node form a second quorum for the data partition, wherein, to acquire the lock, the single worker node has to acquire approval responses from a majority of coordinator nodes in at least one of the first and second quorums.
3 . The method of claim 1 , wherein a given one of the plurality of worker nodes is operable to issue requests to the second coordinator node for locks and to store new uncommitted work for the data partition in response to the first coordinator node determining to relocate the at least a portion of the first coordinator node's respective replica.
4 . The method of claim 1 , further comprising: determining, by the first coordinator node based on a set of characteristics of the first coordinator node's respective replica, to locally split the first coordinator node's respective replica into a plurality of replicas corresponding to a plurality of subpartitions representing a splitting of the data partition; and splitting, by the first coordinator node, the first coordinator node's respective replica into the plurality of replicas.
5 . The method of claim 4 , further comprises: informing, by the first coordinator node, one or more of remaining ones of the plurality of coordinator nodes about the splitting to cause the one or more coordinator nodes to split their respective replica of the data partition.
6 . The method of claim 4 , wherein the plurality of replicas include fewer replicas than a number of replicas into which a third coordinator node of the plurality of coordinator nodes has split the third coordinator node's respective replica of the data partition.
7 . The method of claim 1 , further comprising: determining, by the first coordinator node based on a set of characteristics of the first coordinator node's respective replica, to locally merge the first coordinator node's respective replica with a replica of an adjacent data partition into a single replica corresponding to a single data partition representing a merging of the data partition and the adjacent data partition; and merging, by the first coordinator node, the first coordinator node's respective replica with the replica of the adjacent data partition into the single replica.
8 . The method of claim 7 , further comprising: after the merging, the first coordinator node splitting the single data partition into a different number of data partitions than two data partitions.
9 . The method of claim 7 , wherein key ranges of the first coordinator node's respective replica and the replica of the adjacent data partition are adjacent.
10 . A non-transitory computer-readable medium having program instructions stored thereon that are capable of causing a computer system to implement a first coordinator node that performs operations comprising: storing a replica of a data partition, wherein the replica includes information about locks granted to ones of a plurality of worker nodes that perform database transactions for a database system, and wherein the first coordinator node is one of a plurality of coordinator nodes that is operable to ensure transactional consistency for database transactions; determining to relocate at least a portion of the replica to a second coordinator node; copying the portion of the replica to the second coordinator node; during the copying, receiving, from multiple worker nodes of the plurality of worker nodes, requests to grant a lock for a key that permits a worker node to write a record for the key as part of executing a database transaction, wherein the multiple worker nodes are operable to send requests to the second coordinator node to grant the lock; and sending an approval response for the lock to at most one of the multiple worker nodes, wherein a single one of the multiple worker nodes acquires the lock in response to receiving approval responses from a majority of a quorum having ones of the plurality of coordinator nodes and the second coordinator node.
11 . The non-transitory computer-readable medium of claim 10 , wherein the operations further comprise: performing a split operation on the replica to logically split the replica into two or more replicas corresponding to two or more subpartitions that represent a splitting of the data partition; and performing a merge operation on the replica to logically merge the replica and another replica into a single replica corresponding to a single partition representing a merging of two partitions, wherein the split and merge operations are performed independent of other ones of the plurality of coordinator nodes.
12 . The non-transitory computer-readable medium of claim 10 , wherein the operations further comprise: removing committed records from the replica based on committed records associated with the replica being persisted in a persistent store.
13 . The non-transitory computer-readable medium of claim 10 , wherein the operations further comprise: receiving a different replica from a different coordinator node of the database system; and while information of the different replica is being received, process requests from ones of the plurality of worker nodes, wherein processing at least one of the requests includes storing a lock in the different replica that is granted in association with the at least one request.
14 . A system, comprising: at least one processor; and memory having program instructions stored thereon that are executable by the at least one processor to cause the system to implement a first coordinator node that performs operations comprising: storing a replica of a data partition, wherein the replica includes information about locks granted to ones a plurality of worker nodes that perform database transactions for a database system, and wherein the first coordinator node is one of a plurality of coordinator nodes that is operable to ensure transactional consistency for database transactions; determining to relocate at least a portion of the replica to a second coordinator node; copying the portion of the replica to the second coordinator node; during the copying, receiving, from multiple worker nodes of the plurality of worker nodes, requests to grant a lock for a key that permits a worker node to write a record for the key as part of executing a database transaction, wherein the multiple worker nodes are operable to send requests to the second coordinator node to grant the lock; and sending an approval response for the lock to at most one of the multiple worker nodes, wherein a single one of the multiple worker nodes acquires the lock in response to receiving approval responses from a majority of a quorum having ones of the plurality of coordinator nodes and the second coordinator node.
15 . The system of claim 14 , wherein the operations further comprise: performing a split operation on the replica to logically split the replica into two or more replicas corresponding to two or more subpartitions that represent a splitting of the data partition, wherein the portion corresponds to one of the two or more replicas.
16 . The system of claim 14 , wherein the operations further comprise: receiving a different replica from a different coordinator node of the database system; and performing a merge operation to logically merge the replica and the different replica into a single replica corresponding to a single data partition.
17 . The system of claim 14 , wherein the operations further comprise: receiving a different replica from a different coordinator node of the database system; and while information of the different replica is being received, process requests from ones of the plurality of worker nodes, wherein processing at least one of the requests includes storing a lock in the different replica that is granted in association with the at least one request.
18 . The system of claim 14 , wherein the operations further comprise: removing committed records from the replica based on committed records associated with the replica being persisted in a persistent store.

Description

PRIORITY CLAIM The present application claims priority to U.S. Provisional Appl. No. 63/515,791, filed Jul. 26, 2023, and U.S. Provisional Appl. No. 63/515,792, filed Jul. 26, 2023, which are incorporated by reference herein in their entirety. BACKGROUND Technical Field This disclosure relates generally to database systems and, more specifically, to various mechanisms for implementing a quorum-based scalable database system. Description of the Related Art Modern database systems routinely implement management systems that enable users to store a collection of information in an organized manner that can be efficiently accessed and manipulated. In some cases, these management systems maintain a log-structured merge-tree (LSM tree) comprising multiple levels that each store information in database records as key-value pairs. A database system can include a persistent storage that houses the LSM tree and a database node having an in-memory buffer. During operation, the database node initially writes records into the in-memory buffer before later flushing them to the persistent storage. As a part of flushing the records, the database node writes them to new files that are stored in one of the many levels of the LSM tree. Over time, the records are rewritten into new files stored in lower levels as the records are moved down the LSM tree. BRIEF DESCRIPTION OF THE DRAWINGS FIG. 1 is a block diagram illustrating example elements of a system that has a database, worker nodes, and a transaction coordinator service, according to some embodiments. FIGS. 2A and 2B are block diagrams illustrating an example in which a request affecting partitions is processed at a time provided by a worker node, according to some embodiments. FIG. 3 is a block diagram illustrating an example in which multiple transactions attempt to acquire a lock but neither transaction wins the conflict, according to some embodiments. FIG. 4A-B are block diagrams illustrating example elements of partition operations that can be performed by a coordinator node, according to some embodiments. FIG. 5 is a block diagram illustrating an example in which a worker node interacts with multiple partition quorums during a relocation of a replica of a partition, according to some embodiments. FIG. 6 is a flow diagram illustrating an example method relating to coordinator nodes ensuring transactional consistency for transactions performed by worker nodes, according to some embodiments. FIG. 7 is a block diagram illustrating elements of a computer system for implementing various systems described in the present disclosure, according to some embodiments. DETAILED DESCRIPTION Many database systems process transactions according to certain guarantees that ensure transactional consistency. Read committed snapshot isolation (RCSI) is one example. RCSI is a guarantee that all reads made in a transaction will see a consistent snapshot of the database system, and the transaction will successfully commit only if the updates it made do not conflict with any concurrent updates made since that snapshot. In particular, a transaction can contain one or more database statements and, in RCSI, the transaction can be performed one statement at a time. Each statement acquires a snapshot time to use as it reads, locks, and updates records; the transaction can read records that were committed as—of the snapshot time. If the statement performs updates, then the database system ensures that the record being updated has not been modified since the snapshot time of that statement—a transaction can acquire a snapshot time at the transaction level, and thus the database system can ensure that the records being updated have not been modified since the snapshot time of the transaction when the transaction is being committed. Abiding by RCSI leads to scaling limitations on a database system that are inherent in the characteristics of RCSI. For example, if multiple transactions attempt to write a record for the same database key at relatively the same time, those transactions can block each other. As an example, one of the transactions may acquire a lock on the key (so that it can write a record for that key) that prevents the other transactions from acquiring a lock on that key. As another example, in the event of a conflicting update, a transaction can experience a rollback in which all changes made within a statement of the transaction are undone back to the beginning of that statement. Accordingly, a transaction can be stalled and/or at least partially rolled back under certain circumstances. If these circumstances happen rarely even under scale, then a database system may be scaled up to a significant size—that is, if transactions do not fight to update the same records, then there may be no limitations imposed by RCSI on database scaling. But while RCSI may impose database scaling limitations on a database system, modern database system implementations that support RCSI have artificial sca