US-20260125099-A1 - Systems and Methods for Efficiently Identifying Failed Computing Nodes
Abstract
Systems and methods for identifying failed computing nodes are disclosed. An example system includes a plurality of computing nodes, where each computing node of the plurality of computing nodes includes, or has access to, at least one hardware processor. An example method includes generating a unique identifier particularly corresponding to a particular node. The example method also includes periodically updating a node record in a persistent storage location with the unique identifier, where the node record is associated with the particular node. The example method further includes accessing the persistent storage location at a first time and identifying a most recent update to the node record. The example method also includes determining that the most recent update to the node record occurred more than a threshold amount of time prior to the first time.
Inventors
- Sean Elliot Roberts
- Ran Simha Biron
- Shishir Sharma
Assignees
- Egnyte, Inc.
Dates
- Publication Date: 2026-05-07
- Application Date: 2025-10-09
Claims (20)
- 1 . A method for identifying failed computing nodes, said method comprising: providing a plurality of computing nodes, each computing node of said plurality of computing nodes having access to at least one hardware processor and to shared data storage; generating a first unique identifier particularly corresponding to a first particular node of said plurality of computing nodes; generating a second unique identifier particularly corresponding to a second particular node of said plurality of computing nodes; storing task information in said shared data storage, said task information indicative of a plurality of computing tasks each available to be completed by one of said plurality of computing nodes; periodically updating node information in said shared data storage with new information associated with said first unique identifier; accessing said shared data storage at a first time; identifying a most recent update to said node information associated with said first unique identifier in said shared data storage; determining whether said most recent update to said node information associated with said first unique identifier in said shared data storage occurred more than a threshold amount of time prior to said first time; concluding, when said most recent update to said node information associated with said first unique identifier in said shared data storage occurred more than said threshold amount of time prior to said first time, that a first task identified by said task information as being processed by said first particular node is no longer being processed by said first particular node; and processing said first task that is concluded to be no longer being processed by said first particular node with said second particular node of said plurality of computing nodes.
- 2 . The method of claim 1 , further comprising: updating said task information to associate said first task with said second unique identifier to indicate said first task is being processed by said second particular node; and periodically updating said node information in said shared data storage with new information associated with said second unique identifier.
- 3 . The method of claim 1 , wherein said step of determining whether said most recent update to said node information in said shared data storage occurred more than said threshold amount of time prior to said first time is performed by said second particular node of said plurality of computing nodes.
- 4 . The method of claim 1 , wherein said task information includes a plurality of records, each record including: a first field including information identifying a particular task; a second field including information identifying a particular status associated with said particular task identified by said information of said first field; and a third field including information identifying a particular node of said plurality of computing nodes associated with said particular task identified by said information of said first field.
- 5 . The method of claim 4 , further comprising: after concluding that said first task is no longer being processed by said first particular node, writing a record to said task information wherein said second field includes a status indicating that said first task is available to be processed by another computing node of said plurality of computing nodes; and wherein said step of determining and said step of writing said record are performed by a separate process apart from said first particular node and said second particular node of said plurality of computing nodes.
- 6 . The method of claim 2 , wherein said step of periodically updating said node information includes updating said node information with a new time stamp associated with said first particular node at a predetermined interval.
- 7 . The method of claim 6 , wherein said step of updating said node information with said new time stamp includes overwriting a prior time stamp associated with said first particular node with said new time stamp.
- 8 . The method of claim 6 , wherein said step of updating said node information with said new time stamp includes adding an additional node record different from a prior node record, said additional node record including said first unique identifier and said new time stamp, and said prior node record including said first unique identifier and a prior time stamp.
- 9 . The method of claim 1 , wherein said first unique identifier includes: information identifying said first particular node; and time information indicative of a first particular time.
- 10 . The method of claim 9 , further comprising: after said step of concluding, generating a new unique identifier particularly corresponding to said first particular node of said plurality of computing nodes; and updating said node information with said new unique identifier; wherein said new unique identifier particularly corresponding to said first particular node includes said information identifying said first particular node and time information indicative of a second particular time different than said first particular time.
- 11 . A multi-node computing system, comprising: a plurality of computing nodes, each computing node of said plurality of computing nodes having access to associated node memory, for storing data and code, and at least one hardware processor configured to execute said code, said code including a set of native instructions configured to cause said at least one hardware processor to perform a corresponding set of native operations when executed by said at least one hardware processor; and shared data storage accessible to each node of said plurality of computing nodes, said shared data storage including task information and node information, said task information indicative of a plurality of tasks each available to be processed by one of said nodes, said node information indicative of a status of each of said nodes; and wherein said data and said code associated with a first one of said nodes include a first subset of said set of native instructions configured to generate a first unique identifier particularly corresponding to said associated first node, a second subset of said set of native instructions configured to periodically update said node information in said shared data storage with new information associated with said first unique identifier of said first node, a third subset of said set of native instructions configured to determine whether a first task, previously identified by said task information as being processed by another node, is available for processing by said first node, and a fourth subset of said set of native instructions configured to process said first task when it is determined to be available.
- 12 . The system of claim 11 , wherein said third subset of said set of native instructions is configured to: access said shared data storage at a first time; identify a most recent update to said node information by said another node; determine whether said most recent update to said node information by said another node occurred more than a threshold amount of time prior to said first time; and conclude, when said most recent update to said node information by said another node occurred more than said threshold amount of time prior to said first time, that said first task is no longer being processed by said another node.
- 13 . The system of claim 11 , wherein: said system includes a third node configured to access said shared data storage at a first time, identify a most recent update to said node information by said another node, determine whether said most recent update to said node information by said another node occurred more than a threshold amount of time prior to said first time, conclude, when said most recent update to said node information by said another node occurred more than said threshold amount of time prior to said first time, that said first task is no longer being processed by said another node, and update a status identifier in said task information to indicate that said first task is available for processing; and said third subset of said set of native instructions is configured to determine whether said first task is available for processing by said first node by checking said updated status identifier.
- 14 . The system of claim 11 , wherein said task information includes a plurality of records, each record including: a first field including information identifying a particular task; a second field including information identifying a particular status associated with said particular task identified by said information of said first field; and a third field including information identifying a particular node of said plurality of computing nodes associated with said particular task identified by said information of said first field.
- 15 . The system of claim 11 , wherein said second subset of said set of native instructions is configured to periodically update said node information by storing new time information associated with said first node by said first unique identifier in said node information.
- 16 . The system of claim 15 , wherein said storing new time information includes overwriting previous time information associated with said first node by said first unique identifier in said node information.
- 17 . The system of claim 15 , wherein said storing said new time information includes storing said new time information in said node information such that said updated node information includes previous time information and said new time information associated with said first node by said first unique identifier.
- 18 . The system of claim 11 , wherein said first unique identifier includes: information identifying said first node; and time information indicative of a first time.
- 19 . The system of claim 18 , wherein, upon restart of said first node, said first subset of said set of native instructions is configured to generate a new unique identifier particularly corresponding to said associated first node, said new unique identifier including: said information identifying said first node; and new time information indicative of a second time different than said first time.
- 20 . In a multi-node computing system, a computing node comprising: a hardware processor; memory for storing data and code, said code including a set of native instructions configured to cause said hardware processor to perform a corresponding set of native operations when executed by said hardware processor; and an interface to shared memory, said shared memory accessible to other computing nodes of said multi-node computing system; and wherein said data and said code include a first subset of said set of native instructions configured to generate a first unique identifier particularly corresponding to said computing node, a second subset of said set of native instructions configured to periodically update node information in said shared memory with new information associated with said first unique identifier of said computing node, a third subset of said set of native instructions configured to determine whether a first task, previously identified by task information in said shared memory as being processed by another node, is available for processing by said computing node, and a fourth subset of said set of native instructions configured to process said first task when it is determined to be available.
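For illustration only (this sketch is not part of the application), the three-field task records recited in claims 4 and 14 (a task identifier, a status, and an identifier of the associated node) might be modeled as a table in a shared store. The table and column names below are assumptions, and SQLite merely stands in for the claimed shared data storage.

```python
import sqlite3

# Hypothetical model of the three-field task records recited in
# claims 4 and 14. SQLite stands in for the shared data storage;
# all names are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE tasks (
        task_id  TEXT PRIMARY KEY,               -- first field: identifies the task
        status   TEXT NOT NULL DEFAULT 'PENDING' -- second field: task status
                 CHECK (status IN ('PENDING', 'PROCESSING', 'COMPLETED')),
        node_run TEXT                            -- third field: associated node, if any
    )
    """
)
conn.execute("INSERT INTO tasks (task_id) VALUES ('task-001')")
conn.commit()
```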
Description
RELATED CASES

This application is a continuation of co-pending U.S. patent application Ser. No. 18/486,485, filed on Oct. 13, 2023 by the same inventors, which claims the benefit of U.S. Provisional Patent Application Ser. No. 63/416,242, filed on Oct. 14, 2022 by the same inventors and entitled “Systems and Methods for Efficiently Identifying Failed Computing Nodes”, each of which is incorporated herein by reference in its respective entirety.

BACKGROUND OF THE INVENTION

Field of the Invention

This invention relates generally to nodal computing environments, and more particularly to identifying failed computing nodes.

BACKGROUND

In computing environments with multiple nodes, it is often desirable to ensure that a unit of work is run once, but only once. One common pattern is to store the status of this work unit in a persistent storage location (e.g., a database) that is visible and accessible to all of the nodes. For example, the status could be indicated by a field with PENDING, PROCESSING, and COMPLETED as possible states. Nodes are free to pick up work units with PENDING status, converting the status to PROCESSING at the time of pick up. If the node completes a work unit, it can update the status to COMPLETED, but, if it encounters an issue, it can revert the status to PENDING.

However, nodes may fail in unanticipated ways, such as a loss of power, thus preventing the status from being changed. This failure leaves the work unit in a stuck state; no other nodes will pick it up due to the PROCESSING status, but the underlying node is no longer working on it.

An example solution to the stuck work unit issue incorporates an expiry time with the status field. In other words, if the status is PROCESSING but the expiry time has been exceeded, then the work unit can be treated as though it is in a PENDING state and is free to be picked up by another node.

Choosing how to set this expiry time presents a number of issues. The time should be longer than the longest time that the underlying work unit is expected to run. If a work unit is being legitimately worked on, but the time taken exceeds the expiry time, then other nodes could also pick it up, thus violating the constraint that the work unit is run only once. On the other hand, if a node fails without being able to update the status of a work unit it was running, then that work unit is stuck for the remaining duration of the expiry time. These two considerations are at odds with each other. Additionally, the time taken to run a work unit can change over time as the dataset or implementation changes, so a tightly specified expiry time must be frequently reevaluated.

As an alternative example solution, it is possible to set a short expiry time and then continuously renew it for a period of time. However, it is desirable for this renewal process to occur at as high a frequency as possible in order to decrease the amount of time that a work unit can be stuck. A higher frequency requires that more resources be utilized by the shared data store where the status is stored, and the amount of required resources scales with the number of work units being worked on.
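A minimal sketch of the expiry-time workaround just described may clarify the trade-off. A Python dict stands in for a row in the shared data store; the names and the 600-second expiry are assumptions, and the non-atomic check-and-set here ignores the concurrency control a real shared store would require.

```python
import time

EXPIRY_SECONDS = 600  # assumed; must exceed the longest expected run time

def try_pick_up(task: dict, node_id: str) -> bool:
    """Claim a task that is PENDING, or PROCESSING with an exceeded expiry.

    Sketch only: a real shared store would need an atomic
    compare-and-set here to keep the run-only-once guarantee.
    """
    now = time.time()
    expired = task["status"] == "PROCESSING" and now > task["expires_at"]
    if task["status"] == "PENDING" or expired:
        task["status"] = "PROCESSING"
        task["owner"] = node_id
        task["expires_at"] = now + EXPIRY_SECONDS
        return True
    return False

task = {"status": "PENDING", "owner": None, "expires_at": 0.0}
assert try_pick_up(task, "node-A")      # claimed while PENDING
assert not try_pick_up(task, "node-B")  # refused: PROCESSING and not expired
```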
SUMMARY

In an illustrative example embodiment, a node run service is created on startup of a node. The node run service produces a unique identifier for the particular run of the particular node, called a “NodeRun”. It then establishes a periodic (e.g., every minute, every five minutes, etc.) job to continuously push the NodeRun and the current time to a set of recent NodeRuns stored in a shared data repository (e.g., a persistent storage location) visible to all nodes. The job additionally pulls this set into a collection stored in memory by the node run service.

In the example embodiment, when a node picks up a task and updates its status to PROCESSING, the node fetches its NodeRun from the node run service and stores it alongside the status. Then, when other processes are validating whether the work unit is stuck, they use the NodeRun associated with that work unit to check with their own node run service whether the associated node is still active. If the node is not active, the work unit is reverted to the PENDING state, allowing other nodes to pick up the task again.

Alternatively, rather than changing the status of the work unit back to the PENDING state, the process (e.g., a process running on a node) could simply pick up the work unit and substitute its own NodeRun for the NodeRun previously assigned to the work unit. In other words, PROCESSING work units associated with an inactive NodeRun may simply be treated by other processes and/or nodes as PENDING.

In effect, the exclusivity a node has for a specific task expires when its node run service is no longer active, as seen by the node run services running on other nodes. The time it takes for this to occur is primarily governed by the frequency of the node run service's periodic job to update its active NodeRun in the shared data repository.
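The following sketch illustrates the node run service described above under stated assumptions: an in-process dict stands in for the shared data repository of recent NodeRuns, the class and function names and the intervals are hypothetical, and the identifier format (node identity plus a run timestamp, in the spirit of claims 9 and 18) is one plausible reading rather than the application's own code.

```python
import threading
import time

HEARTBEAT_SECONDS = 60     # assumed interval for the periodic job
STALE_AFTER_SECONDS = 180  # a few missed heartbeats => run presumed dead

# Stands in for the shared data repository's set of recent NodeRuns,
# mapping each NodeRun identifier to its latest heartbeat time.
shared_node_runs: dict[str, float] = {}

class NodeRunService:
    """Per-run identifier plus a periodic heartbeat, per the summary above."""

    def __init__(self, node_id: str):
        # Combining node identity with time information gives a restarted
        # node a fresh NodeRun (cf. claims 9, 10, 18, and 19).
        self.node_run = f"{node_id}:{int(time.time())}"

    def start(self) -> None:
        """Record an immediate heartbeat, then keep beating in the background."""
        shared_node_runs[self.node_run] = time.time()
        threading.Thread(target=self._beat, daemon=True).start()

    def _beat(self) -> None:
        while True:
            time.sleep(HEARTBEAT_SECONDS)
            shared_node_runs[self.node_run] = time.time()

def is_active(node_run: str) -> bool:
    """Check whether the NodeRun attached to a PROCESSING work unit still
    belongs to a live node; if not, the unit may be treated as PENDING."""
    last_seen = shared_node_runs.get(node_run)
    return last_seen is not None and time.time() - last_seen < STALE_AFTER_SECONDS

svc = NodeRunService("node-A")
svc.start()
assert is_active(svc.node_run)  # node-A currently looks alive
```

Because every node consults the same set of recent NodeRuns, a work unit's exclusivity lapses at most STALE_AFTER_SECONDS after its owner stops beating, without any per-task expiry renewal; the staleness window is governed only by the heartbeat frequency, not by the number of work units in flight.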