
CN-121979622-A - RDMA-based two-phase transaction commit method and device for a distributed database

CN121979622A

Abstract

The invention discloses an RDMA-based two-phase transaction commit method and device for a distributed database, relating to the technical field of high-performance transaction processing in distributed databases. The method comprises: in a pre-commit stage, completing the data transfer using an RDMA one-sided write operation; confirming by polling that the transaction redo log data has been written successfully; in a commit stage, atomically updating a commit status flag through an RDMA atomic compare-and-swap operation; and, when the number of successes counted while polling the RDMA completion queue reaches a majority threshold, exiting the wait loop and executing consensus-achieved processing. The method and device eliminate the multiple network round trips and software processing delay of traditional protocols, reduce end-to-end transaction latency from the software-protocol level to the hardware-network level, free the system's CPU resources from protocol processing so that computing resources are used efficiently, and significantly improve the overall throughput and concurrent processing capacity of the distributed cluster.
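
The commit-stage wait loop can be illustrated with a small simulation. This is a minimal sketch, not the patented implementation: `Completion` and `wait_for_majority` are hypothetical names, the completion queue is modeled as a plain Python iterator rather than a real RDMA queue, and the early-abort branch follows the condition described in claim 7.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class Completion:
    """Hypothetical stand-in for one entry polled from the RDMA completion queue."""
    follower_id: int
    ok: bool  # True if the compare-and-swap on this Follower succeeded


def wait_for_majority(
    completions: Iterable[Completion], follower_count: int, threshold: int
) -> Tuple[str, int, int]:
    """Count successes and failures; stop as soon as the outcome is decided."""
    successes = failures = 0
    for c in completions:
        if c.ok:
            successes += 1
        else:
            failures += 1
        # Majority reached: leave the wait loop without draining the queue.
        if successes >= threshold:
            return "committed", successes, failures
        # Even if every outstanding operation succeeded, the threshold
        # could no longer be reached: abort early.
        pending = follower_count - successes - failures
        if successes + pending < threshold:
            return "aborted", successes, failures
    return "undecided", successes, failures


# Assumed scenario: 5 Followers and a preset threshold of 3.
events = iter([
    Completion(1, True), Completion(2, False), Completion(3, True),
    Completion(4, True), Completion(5, True),
])
result = wait_for_majority(events, follower_count=5, threshold=3)
leftover = list(events)  # completions never consumed because the loop exited early

aborted = wait_for_majority(
    [Completion(1, False), Completion(2, False), Completion(3, False)],
    follower_count=5, threshold=3,
)
```

In the first call the loop returns after the fourth completion, leaving the fifth unread, which is the latency saving the abstract refers to. In the second call three failures out of five make the threshold unreachable, so the loop stops without waiting for the last two Followers.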

Inventors

  • YANG WEIWEI
  • JIN RIZE
  • SONG KEHUI
  • ZHOU BAOHANG
  • PENG YANG

Assignees

  • Tiangong University (天津工业大学)

Dates

Publication Date
2026-05-05
Application Date
2026-04-07

Claims (10)

  1. An RDMA-based two-phase transaction commit method for a distributed database, comprising: S101, in a pre-commit stage, a Leader node writes transaction redo log data directly into pre-registered remote memory of a Follower node using an RDMA one-sided write operation, completing the data transfer; S102, by polling the RDMA completion queue, the Leader node confirms that the transaction redo log data has been durably written at a plurality of Follower nodes, completing the pre-commit stage; S103, in a commit stage, the Leader node uses an RDMA atomic compare-and-swap operation to write a commit value into a commit status flag preset in the remote memory of each Follower node that stores the transaction redo log data, thereby atomically updating the value of the commit status flag; S104, the Leader node polls the RDMA completion queue in a wait loop, counts the successes and failures of the atomic compare-and-swap operation for each Follower node, and, as soon as the counted successes reach a preset majority threshold, exits the wait loop, executes consensus-achieved processing, and returns a transaction-commit success response.
  2. The method according to claim 1, wherein S101 comprises: in the pre-commit stage, the Leader node packages the redo logs generated by transactions in batches to form the transaction redo log data; and writes the transaction redo log data directly from the local memory of the Leader node into a pre-allocated receive buffer slot in the remote memory of the Follower node through an asynchronous RDMA one-sided write operation, completing the data transfer.
  3. The method of claim 2, wherein the batch packaging comprises: according to a preset batch packaging threshold, recording the logs of all transactions within the threshold range and packaging them in a local log buffer; and adding a header to each log batch, wherein the header comprises a starting log sequence number, an ending log sequence number, the log batch size, a checksum and consensus identification information.
  4. The method according to claim 1, wherein S102 comprises: the Leader node actively polls a completion queue associated with the RDMA network card and confirms completion of the corresponding write operations from the completion entries in the queue; alternatively, according to a preset write-completion threshold, the Leader node confirms that the transaction redo log data has been written successfully once the number of completed write operations meets the threshold; for a failed write operation, the Leader node records the failure information and selects a new write address to retry the write; if the write still fails after the preset maximum number of retries, the Leader node marks the corresponding Follower node as unavailable and determines whether enough healthy Follower nodes remain to form a majority, aborting the transaction if a majority cannot be reached.
  5. The method according to claim 2, wherein S103 comprises: determining the address, key, compare value and swap value of the commit status flag corresponding to the receive buffer slot that holds the transaction redo log data in the remote memory of the Follower node; and posting the RDMA atomic compare-and-swap operation to an RDMA send queue, wherein the operation atomically compares the flag with the compare value and, if they match, overwrites the flag with the swap value to update the commit status flag, and, if they do not match, the operation is recorded as failed and no write is performed.
  6. The method of claim 2, wherein the consensus-achieved processing comprises: replaying the data modification operations in the transaction redo log data against the database of the Follower node, performing the same transaction redo as the Leader node; and marking the receive buffer slot that stores the current transaction redo log data as available and releasing the occupied resources.
  7. The method of claim 1, wherein S104 further comprises: after the counted successes reach the preset majority threshold and the consensus-achieved processing is executed, the atomic compare-and-swap operations still outstanding on the remaining Follower nodes continue to execute, and if such an operation succeeds, the consensus-achieved processing is still executed for the corresponding node; if the preset majority threshold could not be reached even if all outstanding atomic compare-and-swap operations on the remaining Follower nodes succeeded, exiting the wait loop early and executing consensus-failure processing: recording a transaction abort log, triggering transaction rollback on the Follower nodes where the operation succeeded, releasing the occupied resources, and returning a transaction-commit failure response; and for Follower nodes whose atomic compare-and-swap operation failed, the Leader node checks the state of those Follower nodes and, according to the result, performs the corresponding node removal, crash recovery and failover operations.
  8. An RDMA-based two-phase transaction commit apparatus for a distributed database, for implementing the RDMA-based two-phase transaction commit method of any of claims 1-7, comprising: a data transfer module, configured to cause the Leader node, in a pre-commit stage, to write transaction redo log data directly into pre-registered remote memory of a Follower node using an RDMA one-sided write operation; a pre-commit confirmation module, configured to poll the RDMA completion queue and confirm that the transaction redo log data has been durably written at a plurality of Follower nodes; an atomic compare-and-swap module, configured to enter a commit stage and cause the Leader node to use an RDMA atomic compare-and-swap operation to write a commit value into a commit status flag preset in the remote memory of each Follower node that stores the transaction redo log data, atomically updating the value of the commit status flag; and a consensus module, configured to cause the Leader node to poll the RDMA completion queue in a wait loop, count the successes and failures of the atomic compare-and-swap operation for each Follower node, and, as soon as the counted successes reach a preset majority threshold, exit the wait loop, execute consensus-achieved processing, and return a transaction-commit success response.
  9. An electronic device, comprising: one or more processors; and storage means for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the RDMA-based two-phase transaction commit method of any of claims 1-7.
  10. A storage medium containing computer-executable instructions which, when executed by a computer processor, perform the RDMA-based two-phase transaction commit method of any of claims 1-7.
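
The batch header of claim 3 and the compare-and-swap of claim 5 can be sketched in a few lines of Python. Everything specific here is an assumption made for illustration: the 32-byte little-endian layout, CRC32 as the checksum, a numeric term as the consensus identification information, and the `PREPARED`/`COMMITTED` flag values are not given in the claims. `FlagSlot` only simulates in local memory what an RDMA network card would do against a Follower's registered memory; like the hardware operation, it returns the value that was in the slot before the operation.

```python
import struct
import zlib

# Assumed header layout (not specified by the claims): start LSN, end LSN,
# payload size in bytes, CRC32 checksum, consensus term. Little-endian, 32 bytes.
HEADER = struct.Struct("<QQIIQ")

PREPARED, COMMITTED = 1, 2  # hypothetical commit-status flag values


def pack_batch(start_lsn: int, end_lsn: int, term: int, records: list[bytes]) -> bytes:
    """Concatenate redo records and prepend the batch header."""
    payload = b"".join(records)
    header = HEADER.pack(start_lsn, end_lsn, len(payload), zlib.crc32(payload), term)
    return header + payload


def unpack_batch(buf: bytes) -> tuple[int, int, int, bytes]:
    """Parse a batch and verify its size and checksum."""
    start_lsn, end_lsn, size, checksum, term = HEADER.unpack_from(buf)
    payload = buf[HEADER.size:HEADER.size + size]
    if len(payload) != size or zlib.crc32(payload) != checksum:
        raise ValueError("corrupt log batch")
    return start_lsn, end_lsn, term, payload


class FlagSlot:
    """Local stand-in for the commit status flag in a Follower's remote memory."""

    def __init__(self, value: int = PREPARED) -> None:
        self.value = value

    def compare_and_swap(self, compare: int, swap: int) -> int:
        """Swap only if the current value equals `compare`; return the old value."""
        old = self.value
        if old == compare:
            self.value = swap
        return old


batch = pack_batch(100, 102, term=7, records=[b"r100", b"r101", b"r102"])
slot = FlagSlot()
first = slot.compare_and_swap(PREPARED, COMMITTED)   # matches, flag is updated
second = slot.compare_and_swap(PREPARED, COMMITTED)  # mismatch, flag is untouched
```

The caller decides success by checking whether the returned old value equals the compare value: `first` equals `PREPARED`, so the first operation counts as a success, while `second` equals `COMMITTED`, so the second counts as a failure and nothing is written.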

Description

RDMA-based two-phase transaction commit method and device for a distributed database

Technical Field

The invention relates to the technical field of high-performance transaction processing in distributed databases, and in particular to an RDMA-based two-phase transaction commit method and device for a distributed database.

Background

To guarantee strong consistency of data across replicas, two-phase commit of distributed transactions is widely implemented with a consensus algorithm such as Raft. In the pre-commit stage (log replication), the transaction coordinator (Leader node) synchronizes the redo log of a transaction to all participants (Follower nodes) and waits until successful-persistence acknowledgements have been collected from a majority of nodes. In the take-effect stage (state commit), once the majority acknowledgements have been obtained, the Leader node initiates the commit and notifies all participants to apply the persisted log to their local state machines, so that the transaction takes effect. Meanwhile, to break through the performance bottleneck of traditional networks, RDMA (remote direct memory access) is adopted: its high bandwidth, low latency, kernel bypass and zero copy accelerate data transfer between nodes, allowing a machine's network card to read and write the user-space memory of a remote node directly, without involving the kernel protocol stack or the CPU of the remote operating system.
However, when RDMA is applied only at the communication layer, that is, when the TCP socket communication module in the protocol is simply replaced one-for-one with RDMA, the serialized "request-response" interaction logic inherent in the two-phase commit protocol (for example, multiple rounds of broadcast and acknowledgement) remains. Even with RDMA transport, the multiple network round trips and blocking waits persist, so the microsecond-level single-operation latency offered by RDMA hardware is not realized end to end and a new performance bottleneck forms. In particular, even in a high-bandwidth, low-latency RDMA network, the CPU on the Follower side must still take part in receiving and acknowledging, and reaching consensus still requires complex multi-round software interaction. Because the protocol depends on software interaction logic with multiple network round trips and deep involvement of the CPUs on both sides, the CPU of the Follower node is still interrupted frequently to process protocol logic, overall CPU overhead remains significant, and the potential of RDMA hardware cannot be fully exploited to achieve low transaction latency and reduced CPU overhead.

Disclosure of Invention

Embodiments of the invention provide an RDMA-based two-phase transaction commit method and device for a distributed database, to solve the following technical problem: in a two-phase-commit distributed transaction protocol, even when RDMA hardware is adopted, the CPU must still take part in the interaction, and issuing the commit instruction and reaching consensus require multiple rounds of network messages with all Follower nodes, so that CPU load is too high, replication latency is high, and the critical path is blocked, creating a performance bottleneck.
In a first aspect, an embodiment of the invention provides an RDMA-based two-phase transaction commit method for a distributed database, comprising: S101, in a pre-commit stage, a Leader node writes transaction redo log data directly into pre-registered remote memory of a Follower node using an RDMA one-sided write operation, completing the data transfer; S102, by polling the RDMA completion queue, the Leader node confirms that the transaction redo log data has been durably written at a plurality of Follower nodes, completing the pre-commit stage; S103, in a commit stage, the Leader node uses an RDMA atomic compare-and-swap operation to write a commit value into a commit status flag preset in the remote memory of each Follower node that stores the transaction redo log data, thereby atomically updating the value of the commit status flag; S104, the Leader node polls the RDMA completion queue in a wait loop, counts the successes and failures of the atomic compare-and-swap operation for each Follower node, and, as soon as the counted successes reach a preset majority threshold, exits the wait loop, executes consensus-achieved processing, and returns a transaction-commit success response.

In a second aspect, an embodiment of the invention provides an RDMA-based two-phase transaction commit apparatus for a distributed database, comprising: a data transfer module, configured to cause the Leader node to write the transaction redo log data directly into pre-registered remote memory of the Follower node using an RDMA one-sided