US-12619507-B2 - Method for seamless failback after unplanned failover

Abstract

Systems and methods are directed to seamless failback after an unplanned failover. The method involves executing a truncate-and-restore command on a primary topic of a primary cluster, performing checks on a secondary topic to ensure it is a mirror topic in a stopped state with valid identifiers and offsets, and transitioning the primary topic to immutable state. The method further includes comparing sequence numbers to ensure safe truncation, truncating partitions to match the secondary topic's log end offsets, and clamping consumer group offsets. The primary topic is then converted to a mirror state, enabling active mirroring. A reverse command is then executed to complete the failback process, restoring the primary topic to a writable state.
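The state changes that the abstract walks through for the primary topic can be pictured as a small state machine. The following is a minimal sketch only: the state names, event names, and the `next_state` and `run_failback` helpers are hypothetical labels chosen for illustration and are not taken from the disclosure.

```python
from enum import Enum, auto


class TopicState(Enum):
    WRITABLE = auto()   # normal topic that accepts new records
    IMMUTABLE = auto()  # writes blocked while truncation is performed
    MIRROR = auto()     # actively mirroring from the secondary topic


# Hypothetical transition table for the primary topic during failback.
# The first transition is assumed to happen only after all checks on the
# secondary topic have passed.
ALLOWED = {
    (TopicState.WRITABLE, "truncate-and-restore"): TopicState.IMMUTABLE,
    (TopicState.IMMUTABLE, "truncation-complete"): TopicState.MIRROR,
    (TopicState.MIRROR, "reverse"): TopicState.WRITABLE,
}


def next_state(state, event):
    """Return the state reached from `state` on `event`, or raise ValueError."""
    try:
        return ALLOWED[(state, event)]
    except KeyError:
        raise ValueError(f"{event!r} is not valid in state {state.name}") from None


def run_failback(events):
    """Apply the events in order, starting from WRITABLE; return every state visited."""
    state = TopicState.WRITABLE
    history = [state]
    for event in events:
        state = next_state(state, event)
        history.append(state)
    return history
```

Calling `run_failback(["truncate-and-restore", "truncation-complete", "reverse"])` walks the primary topic from writable through immutable and mirror and back to writable, while an out-of-order event such as issuing "reverse" on a writable topic is rejected.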

Inventors

Sanjana Kaundinya
Chern Yih CHEAH

Assignees

Confluent, Inc.

Dates

Publication Date: 2026-05-05
Application Date: 2024-10-10

Claims (18)

1. A method for seamless failback after an unplanned failover, the method comprising: executing a truncate-and-restore command on a primary topic of a primary cluster after the primary cluster becomes operational after the unplanned failover; responsive to executing the truncate-and-restore command, performing a plurality of checks on a corresponding secondary topic of a secondary cluster that was failed over to during the unplanned failover; determining stopped log end offsets of the corresponding secondary topic of the secondary cluster; converting the primary topic of the primary cluster to an immutable state based on successful validation of the plurality of checks; truncating partitions of the primary topic to log end offsets corresponding to the stopped log end offsets of the corresponding secondary topic; after the truncating, converting the primary topic to a mirror state and enabling active mirroring of new data written to the corresponding secondary topic after the unplanned failover; and at or near zero mirror lag, executing a reverse command to switch a direction of mirroring flow and convert the primary topic to a writable state.
2. The method of claim 1, wherein the plurality of checks comprises one or more of: verifying that the corresponding secondary topic is a mirror topic; verifying that the corresponding secondary topic is in a stopped mirror state; verifying that a source topic identifier of the corresponding secondary topic matches a local topic identifier of the primary topic; or confirming that the corresponding secondary topic has valid stopped log end offsets.
3. The method of claim 1, further comprising: comparing a sequence number associated with the primary topic with a sequence number associated with the corresponding secondary topic, wherein the truncating occurs based on the sequence number associated with the primary topic being lower than the sequence number associated with the corresponding secondary topic.
4. The method of claim 3, wherein the sequence number is a monotonically increasing integer that increments each time a mirror topic transitions to a stopped state.
5. The method of claim 1, further comprising: clamping consumer group offsets associated with the primary topic to a minimum of a persisted consumer group offset or truncated log end offset.
6. The method of claim 1, wherein the reverse command comprises a reverse-and-start command that immediately transitions the corresponding secondary topic to the mirror state after the failback.
7. The method of claim 1, wherein the reverse command comprises a reverse-and-pause command that places the corresponding secondary topic in a paused mirror state until a resume-mirror command is executed.
8. The method of claim 1, further comprising: performing a periodic remote call to the corresponding secondary topic to obtain the stopped log end offsets and epochs from the corresponding secondary topic.
9. The method of claim 1, further comprising: writing divergent records to a special internal topic before truncating the primary topic.
10. A system for seamless failback after an unplanned failover, the system comprising: one or more hardware processors; and one or more storage components storing instructions that, when executed by the one or more hardware processors, cause the one or more hardware processors to perform operations comprising: executing a truncate-and-restore command on a primary topic of a primary cluster after the primary cluster becomes operational after the unplanned failover; responsive to executing the truncate-and-restore command, performing a plurality of checks on a corresponding secondary topic of a secondary cluster that was failed over to during the unplanned failover; determining stopped log end offsets of the corresponding secondary topic of the secondary cluster; converting the primary topic of the primary cluster to an immutable state based on successful validation of the plurality of checks; truncating partitions of the primary topic to log end offsets corresponding to the stopped log end offsets of the corresponding secondary topic; after the truncating, converting the primary topic to a mirror state and enabling active mirroring of new data written to the corresponding secondary topic after the unplanned failover; and at or near zero mirror lag, executing a reverse command to switch a direction of mirroring flow and convert the primary topic to a writable state.
11. The system of claim 10, wherein the plurality of checks comprises one or more of: verifying that the corresponding secondary topic is a mirror topic; verifying that the corresponding secondary topic is in a stopped mirror state; verifying that a source topic identifier of the corresponding secondary topic matches a local topic identifier of the primary topic; or confirming that the corresponding secondary topic has valid stopped log end offsets.
12. The system of claim 10, wherein the operations further comprise: comparing a sequence number associated with the primary topic with a sequence number associated with the corresponding secondary topic, wherein the truncating occurs based on the sequence number associated with the primary topic being lower than the sequence number associated with the corresponding secondary topic.
13. The system of claim 12, wherein the sequence number is a monotonically increasing integer that increments each time a mirror topic transitions to a stopped state.
14. The system of claim 10, wherein the operations further comprise: clamping consumer group offsets associated with the primary topic to a minimum of a persisted consumer group offset or truncated log end offset.
15. The system of claim 10, wherein the reverse command comprises a reverse-and-start command that immediately transitions the corresponding secondary topic to the mirror state after the failback.
16. The system of claim 10, wherein the reverse command comprises a reverse-and-pause command that places the corresponding secondary topic in a paused mirror state until a resume-mirror command is executed.
17. The system of claim 10, wherein the operations further comprise: performing a periodic remote call to the corresponding secondary topic to obtain the stopped log end offsets and epochs from the corresponding secondary topic.
18. A machine-storage medium comprising instructions which, when executed by one or more hardware processors of a machine, cause the machine to perform operations for seamless failback after an unplanned failover, the operations comprising: executing a truncate-and-restore command on a primary topic of a primary cluster after the primary cluster becomes operational after the unplanned failover; responsive to executing the truncate-and-restore command, performing a plurality of checks on a corresponding secondary topic of a secondary cluster that was failed over to during the unplanned failover; determining stopped log end offsets of the corresponding secondary topic of the secondary cluster; converting the primary topic of the primary cluster to an immutable state based on successful validation of the plurality of checks; truncating partitions of the primary topic to log end offsets corresponding to the stopped log end offsets of the corresponding secondary topic; after the truncating, converting the primary topic to a mirror state and enabling active mirroring of new data written to the corresponding secondary topic after the unplanned failover; and at or near zero mirror lag, executing a reverse command to switch a direction of mirroring flow and convert the primary topic to a writable state.
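The checks, sequence number comparison, truncation, and offset clamping recited in the claims above can be illustrated with a short sketch. It is a simplified model under stated assumptions, not the claimed implementation: the `SecondaryTopic` fields, the "STOPPED" state label, and all function names are hypothetical, offsets are modeled as plain per-partition integers, and a partition already shorter than the stopped offset is assumed to be left unchanged.

```python
from dataclasses import dataclass


@dataclass
class SecondaryTopic:
    # All field names are hypothetical; they model what the checks inspect.
    is_mirror: bool
    mirror_state: str      # e.g. "STOPPED" (label assumed for illustration)
    source_topic_id: str
    stopped_seq: int       # increments each time the mirror is stopped
    stopped_leos: dict     # partition -> stopped log end offset


def validate(primary_topic_id, primary_seq, sec):
    """Return the list of failed checks; an empty list means truncation may proceed."""
    failures = []
    if not sec.is_mirror:
        failures.append("not a mirror topic")
    if sec.mirror_state != "STOPPED":
        failures.append("not in stopped mirror state")
    if sec.source_topic_id != primary_topic_id:
        failures.append("source topic id mismatch")
    if not sec.stopped_leos or any(o < 0 for o in sec.stopped_leos.values()):
        failures.append("invalid stopped log end offsets")
    # Truncation is only safe when the primary's sequence number is lower.
    if not primary_seq < sec.stopped_seq:
        failures.append("primary sequence number is not lower")
    return failures


def truncation_targets(primary_leos, stopped_leos):
    """Per-partition offset the primary log is cut back to.

    Assumption: a partition that is already shorter than the stopped
    offset is left as it is, hence the min().
    """
    return {p: min(leo, stopped_leos[p]) for p, leo in primary_leos.items()}


def clamp_group_offsets(committed, truncated_leos):
    """Clamp each committed offset to min(persisted offset, truncated log end offset)."""
    return {p: min(off, truncated_leos[p]) for p, off in committed.items()}
```

For example, with stopped log end offsets `{0: 100, 1: 50}` and primary log end offsets `{0: 120, 1: 40}`, partition 0 is cut back to 100 and partition 1 stays at 40; a consumer group that had committed offset 110 on partition 0 is clamped to 100 so it does not point past the end of the truncated log.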

Description

TECHNICAL FIELD

The subject matter disclosed herein generally relates to data storage technologies. Specifically, the present disclosure addresses systems and methods for seamless failback after an unplanned failover.

BACKGROUND

An unplanned failover occurs when a primary cluster experiences an outage, in which case all applications have to fail over to a secondary cluster. Once the primary cluster becomes operational again, a solution is needed to allow a user to mirror back and synchronize any data that was written to the secondary cluster during the outage. Traditionally, mirror topics and cluster links are deleted and recreated, sometimes twice: once to copy the data back to the primary cluster, and once more to re-copy the data to the secondary cluster. This results in significant operational overhead, which may not be acceptable due to various constraints.

BRIEF DESCRIPTION OF THE DRAWINGS

Some embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings.

FIG. 1 is a diagram illustrating a high-level distributed streaming architecture in which unplanned failover and failback can occur, in accordance with example embodiments.
FIG. 2 illustrates the distributed streaming architecture in the unplanned failover stage, in accordance with example embodiments.
FIG. 3 is a diagram of a stage of the failback process in which a primary topic is in pending setup for restore state, according to some example embodiments.
FIG. 4 is a diagram of a stage of the failback process in which truncation occurs, according to example embodiments.
FIG. 5 is a diagram of a stage of the failback process after truncation, according to example embodiments.
FIG. 6 is a diagram of a stage of the failback process in which the primary topic mirrors data from a secondary topic, according to example embodiments.
FIG. 7 is a diagram of a stage of the failback process in which the operations are reversed, according to example embodiments.
FIG. 8 is a flowchart illustrating operations of a method for performing the failback after the unplanned failover, according to some example embodiments.
FIG. 9 is a block diagram illustrating components of a machine, according to some example embodiments, able to read instructions from a machine-storage medium and perform any one or more of the methodologies discussed herein.

DETAILED DESCRIPTION

The description that follows describes systems, methods, techniques, instruction sequences, and computing machine program products that illustrate example embodiments of the present subject matter. In the following description, for purposes of explanation, numerous specific details are set forth in order to provide an understanding of various embodiments of the present subject matter. It will be evident, however, to those skilled in the art, that embodiments of the present subject matter may be practiced without some or other of these specific details. Examples merely typify possible variations. Unless explicitly stated otherwise, structures (e.g., structural components, such as modules) are optional and may be combined or subdivided, and operations (e.g., in a procedure, algorithm, or other function) may vary in sequence or be combined or subdivided.

Example embodiments provide a failback process to a primary cluster after an unplanned failover using a sequence of operations that does not require the deletion and recreation of mirror topics and cluster links. It is important that the failback results in little data loss and minimal disruption, including minimizing the amount of time during which events cannot be produced or consumed. Thus, example embodiments address the technical problem of how to efficiently fail back after an unplanned failover.

To address the technical problem, example embodiments provide a technical solution that utilizes a truncate-and-restore command to trigger a sequence of operations to be performed on both the primary and secondary clusters.
The sequence of operations includes performing a check of a secondary topic of the secondary cluster to ensure that it is a mirror topic in a stopped mirror state, that its source topic identifier is the same as the local topic identifier of a primary topic of the primary cluster, and that it has valid stopped log end offsets. Additionally, a check is performed that a stopped sequence number associated with the primary topic is lower than a stopped sequence number associated with the corresponding secondary topic. Assuming all these checks pass, the primary topic is truncated to the corresponding stopped log end offsets of the secondary topic and begins mirroring new data from the secondary topic. When mirror lag is zero (or close to zero), a reversal operation is executed that reverses the direction of the mirroring and completes the failback. Advantageously, by using the technical solution, example embodiments synchronize the topics in the primary and secondary clusters before the reve