US-12626540-B2 - Method for detecting an application progress and handling an application failure in a distributed system

Abstract

A method for detecting an application progress and handling an application failure in a distributed system. The method includes: monitoring an interaction between modules of at least one application, the at least one application being deployed across different physical nodes, the interaction being carried out by exchanging messages between the modules using a message broker, the monitoring being carried out at least partially using the message broker; detecting the application progress based on the monitoring; initiating a failure handling based on the detecting.

Inventors

Dakshina Narahari Dasari
Arne Hamann
Nuno PEREIRA

Assignees

ROBERT BOSCH GMBH
CARNEGIE MELLON UNIVERSITY

Dates

Publication Date: 20260512
Application Date: 20231115
Priority Date: 20230217

Claims (20)

1. A method for detecting an application progress and handling an application failure in a distributed system, the distributed system including a message broker through which messages are transmitted between a plurality of modules of at least one application deployed across different physical nodes, the message broker being programmed with a respective application manifest for each of one or more of the modules, the manifest specifying expected processing behavior of the respective module within a processing pipeline, including that when the message broker routes to the respective module a predefined input message output by another of the modules in the processing pipeline, the respective module is expected to generate a corresponding predefined output message for routing by the message broker to the other of the modules or to a further one of the modules in the processing pipeline, the method comprising the following steps: monitoring, by the message broker, for consistency of interactions between the modules with the application manifests; detecting, by the message broker, that a respective one of the modules has stalled when a result of the monitoring is that an expected processing behavior defined in the respective manifest is not met, wherein the expected processing behavior comprises any one or more of the following: an expected timing between receiving the input message and publishing the output message, a maximum backlog of unprocessed input messages, and a minimum number of messages to be processed successfully; and in response to detecting that the respective module has stalled, initiating, by the message broker, a recovery action comprising: restarting the respective module on a same physical node on which the respective module was running prior to the detection; restarting the respective module on a different node than on which the respective module was running prior to the detection; or starting a backup module to perform a processing of the stalled module.
2. The method of claim 1, wherein the monitoring is carried out using a publish-subscribe-mechanism, the message exchange being carried out to provide at least one functionality of the application including a driving functionality for a vehicle.
3. The method of claim 1, wherein the at least one application includes multiple applications, and wherein each of the multiple applications registers with the message broker and specifies messages to be exchanged, the monitoring being carried out based on observing the specified messages.
4. The method of claim 3, wherein the messages to be exchanged include messages that are subscribed to.
5. The method of claim 1, wherein the application manifest is specific for at least one of the following specifications of the respective application: a topology, interactions among the modules, requirements, a list of the modules, a list of messages published and/or subscribed by each of the modules, information about a timing behaviour for the interactions including a time between receiving input and publishing output messages and/or a maximum backlog of input messages before which an output is expected and/or a minimum amount of messages that must be processed successfully, the recovery action to be carried in response to the detection of the stall.
6. The method of claim 1, wherein the recovery action includes sending a result of the detection of the stall to a central orchestrator of the distributed system to initiate further actions, the central orchestrator being used at least for spawning and/or terminating and/or suspending and/or migrating the modules by providing commands from the orchestrator to a local module manager, the local module manager being provided for deploying and/or stopping and/or starting at least one of the modules on at least one of the different physical nodes based on the commands and/or for sending information about a status of the modules and resources of the nodes to the orchestrator, the orchestrator providing the commands based on the information.
7. The method of claim 1, wherein the detecting by the message broker includes determining a sequence of messages that are erroneously not processed by at least one of the modules, detecting a failure of at least one of the modules based on the monitoring and the determining of the sequence of unprocessed messages, and/or backtracking through a sequence of unprocessed messages for a diagnosis of a source of the failure.
8. The method of claim 1, wherein the monitoring includes determining a duration between receiving an input message and publishing an output message, and detecting the stall when the determined duration exceeds a predefined maximum according to the definition in the application manifest.
9. The method of claim 1, further comprising performing, by the message broker, a learning phase in which the message broker: records, during normal operation of the distributed system, historic timing with which the respective modules in the processing pipeline generate output messages in response to input messages output by other modules in the pipeline; establishes, as at least part of the respective manifests of the modules and based on the recorded historic timing, at least one baseline timeout value or backlog threshold representing the expected processing behavior of the respective modules in the processing pipeline, wherein the detection of the stall occurs when an actual timing of output messages generated by a module in response to input messages from another module deviates from the baseline timeout value or backlog threshold.
10. The method of claim 1, wherein the monitoring includes determining a number of unprocessed messages and detecting the stall where the determined number exceeds a predefined maximum according to the definition in the application manifest.
11. The method of claim 1, wherein the monitoring includes determining a number of processed messages and detecting the stall when the determined number falls below a predefined minimum according to the definition in the application manifest.
12. The method of claim 1, wherein the recovery action further comprises replaying, to the restarted module or to the backup module, at least some of the input messages that were routed by the message broker to the stalled module prior to the detection of the stall, so that the restarted module or the backup module resumes processing based on the prior messages leading up to the stall.
13. The method of claim 1, wherein the recovery action includes the restarting of the respective module on the same physical node on which the respective module was running prior to the detection.
14. The method of claim 1, wherein the recovery action includes the restarting of the respective module on the different node than on which the respective module was running prior to the detection.
15. The method of claim 1, wherein the recovery action includes the starting of the backup module to perform the processing of the stalled module.
16. The method of claim 1, wherein: the processing pipeline includes the modules performing different actions, including at least one of the modules sensing an environment and generating sensor data, and at least one other module processing the sensor data to produce an output that directly or indirectly provides the predefined input message to the respective module; and the application manifest of the respective module defines the corresponding predefined output message that the respective module is expected to generate in response to the predefined input message resulting from the sensor data within the expected timing specified in the respective manifest.
17. The method of claim 1, wherein the expected processing behavior comprises the expected timing between receiving the input message and publishing the output message.
18. The method of claim 1, wherein the expected processing behavior comprises the maximum backlog of unprocessed input messages.
19. A non-transitory computer-readable medium on which is stored a computer program including instructions for detecting an application progress and handling an application failure in a distributed system, the distributed system including a message broker through which messages are transmitted between a plurality of modules of at least one application deployed across different physical nodes, the message broker being programmed with a respective application manifest for each of one or more of the modules, the manifest specifying expected processing behavior of the respective module within a processing pipeline, including that when the message broker routes to the respective module a predefined input message output by another of the modules in the processing pipeline, the respective module is expected to generate a corresponding predefined output message for routing by the message broker to the other of the modules or to a further one of the modules in the processing pipeline, the instructions being executable by a computer of the message broker and, when executed by the computer, causing the computer to perform the following steps: monitoring for consistency of interactions between the modules with the application manifests; detecting that a respective one of the modules has stalled when a result of the monitoring is that an expected processing behavior defined in the respective manifest is not met, wherein the expected processing behavior comprises any one or more of the following: an expected timing between receiving the input message and publishing the output message, a maximum backlog of unprocessed input messages, and a minimum number of messages to be processed successfully; and in response to detecting that the respective module has stalled, initiating a recovery action comprising: restarting the respective module on a same physical node on which the respective module was running prior to the detection; restarting the respective module on a different node than on which the respective module was running prior to the detection; or starting a backup module to perform a processing of the stalled module.
20. A data processing apparatus configured to detect an application progress and handling an application failure in a distributed system, the distributed system including a message broker through which messages are transmitted between a plurality of modules of at least one application deployed across different physical nodes, the message broker being programmed with a respective application manifest for each of one or more of the modules, the manifest specifying expected processing behavior of the respective module within a processing pipeline, including that when the message broker routes to the respective module a predefined input message output by another of the modules in the processing pipeline, the respective module is expected to generate a corresponding predefined output message for routing by the message broker to the other of the modules or to a further one of the modules in the processing pipeline, the data processing apparatus comprising a computer of the message broker, the computer being configured to: monitor, by the message broker, for consistency of interactions between the modules with the application manifests; detect, by the message broker, that a respective one of the modules has stalled when a result of the monitoring is that an expected processing behavior defined in the respective manifest is not met, wherein the expected processing behavior comprises any one or more of the following: an expected timing between receiving the input message and publishing the output message, a maximum backlog of unprocessed input messages, and a minimum number of messages to be processed successfully; and in response to detecting that the respective module has stalled, initiate, by the message broker, a recovery action comprising: restarting the respective module on a same physical node on which the respective module was running prior to the detection; restarting the respective module on a different node than on which the respective module was running prior to the detection; or starting a backup module to perform a processing of the stalled module.
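
The three manifest-defined expectations recited in the claims (response timing, maximum backlog, minimum number of processed messages) can be illustrated with a minimal sketch. It is not the claimed implementation: the class names, field names, topic strings, and the rule that one output message answers the oldest pending input are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, Optional


@dataclass
class ModuleManifest:
    """Hypothetical manifest entry; all field names are illustrative."""
    module: str
    input_topic: str
    output_topic: str
    max_response_s: Optional[float] = None  # expected timing, input to output
    max_backlog: Optional[int] = None       # maximum unprocessed inputs
    min_processed: Optional[int] = None     # minimum outputs per window


@dataclass
class ProgressDetector:
    """Tracks one module from the broker's point of view."""
    manifest: ModuleManifest
    pending: Deque[float] = field(default_factory=deque)  # arrival times
    processed: int = 0

    def on_message(self, topic: str, now: float) -> None:
        """Called for every message the broker routes."""
        if topic == self.manifest.input_topic:
            self.pending.append(now)
        elif topic == self.manifest.output_topic and self.pending:
            # Assumption: one output answers the oldest pending input.
            self.pending.popleft()
            self.processed += 1

    def stalled(self, now: float) -> Optional[str]:
        """Return a reason if a timing or backlog expectation is violated."""
        m = self.manifest
        if (m.max_response_s is not None and self.pending
                and now - self.pending[0] > m.max_response_s):
            return "timeout"
        if m.max_backlog is not None and len(self.pending) > m.max_backlog:
            return "backlog"
        return None

    def close_window(self) -> bool:
        """End a counting window; True if too few messages were processed."""
        m = self.manifest
        too_few = m.min_processed is not None and self.processed < m.min_processed
        self.processed = 0
        return too_few
```

In this sketch the broker would call `on_message` while routing, and poll `stalled` and `close_window` periodically; a non-empty result would then trigger one of the recited recovery actions.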

Description

CROSS REFERENCE

The present application claims the benefit under 35 U.S.C. § 119 of German Patent Application No. DE 10 2023 201 398.3 filed on Feb. 17, 2023, which is expressly incorporated herein by reference in its entirety.

BACKGROUND INFORMATION

In distributed setups, where applications are deployed across different physical nodes, it is important to actively monitor the health of an application composed of a set of interacting modules or services which may be spread across the different nodes. The monitored health of application modules or services can be used to trigger recovery mechanisms, e.g., restarting the module or failing over to a redundant module. Conventional mechanisms address the problem of whether a node or application is alive (e.g., by responding to heartbeats or pings from a central coordinator). However, conventional methods are not able to determine whether an application is actually progressing. An application may be internally stalled due to livelocks or deadlocks while another thread in the same application continues to respond to the liveness checks.

SUMMARY

According to aspects of the present invention, a method, a computer program, and a data processing apparatus are provided. Features and details of the present invention are disclosed herein. Features and details described in the context of the method according to the present invention also correspond to the computer program as well as the data processing apparatus, and vice versa in each case.

One aspect of the present invention comprises a method for detecting an application progress and/or handling an application failure in a distributed system. According to an example embodiment of the present invention, the method comprises, as a first method step, monitoring an interaction between modules of at least one application. The at least one application may be deployed across different physical nodes, particularly hardware platforms for data processing.
The interaction may be carried out by exchanging messages between the modules, preferably using a message broker. Furthermore, the monitoring may be carried out at least partially using the message broker. According to another method step, the method may comprise detecting the application progress based on the monitoring. According to a further method step, the method may comprise initiating a failure handling based on the detecting. The method steps may be carried out one after the other and/or repeatedly. The present invention may thereby make it possible to detect application progress and to handle application failure in a distributed setup.

The method according to an example embodiment of the present invention may be implemented using a system setup comprising a message broker, particularly a centralized enhanced message broker (also referred to as EMB), particularly with an application progress detector (also referred to as APD), and/or a centralized orchestrator and/or a local module manager, the latter particularly on each physical node in the distributed setup. The EMB with an additional APD may be cognizant of the application graph and the interaction between the constituent modules. It may be configured to detect when an application module is not progressing, to deal with the unprocessed messages accordingly, and to hand them over to the application module when it is restarted. The EMB may interact with a central orchestrator which may have a global view of the deployment of applications across different nodes, particularly hardware platforms. On each of the nodes, a local module manager (also referred to as LMM) may be used to execute commands sent by the orchestrator. The LMM may also send information regarding the status of the modules and the node resource availability (periodically and on specific events) to the orchestrator.
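
The hand-over of unprocessed messages to a restarted module can be sketched as follows. This is only an illustration under stated assumptions: `BufferingBroker` and its method names are hypothetical, messages are plain strings, and the restart itself (which in the described setup would be carried out by the orchestrator and the LMM) is reduced to registering a fresh delivery callback.

```python
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, Set

Handler = Callable[[str], None]


class BufferingBroker:
    """Toy stand-in for the EMB; class and method names are hypothetical."""

    def __init__(self) -> None:
        self.handlers: Dict[str, Handler] = {}  # module -> delivery callback
        self.down: Set[str] = set()             # modules detected as stalled
        self.buffers: Dict[str, Deque[str]] = defaultdict(deque)

    def register(self, module: str, handler: Handler) -> None:
        self.handlers[module] = handler

    def mark_stalled(self, module: str) -> None:
        """Record that the progress detector flagged this module."""
        self.down.add(module)

    def deliver(self, module: str, message: str) -> None:
        """Route a message, buffering it while the module is down."""
        if module in self.down:
            self.buffers[module].append(message)
        else:
            self.handlers[module](message)

    def on_restarted(self, module: str, handler: Handler) -> int:
        """Re-register the module and replay buffered messages in order."""
        self.handlers[module] = handler
        self.down.discard(module)
        replayed = 0
        while self.buffers[module]:
            handler(self.buffers[module].popleft())
            replayed += 1
        return replayed
```

A first-in, first-out buffer is used here so that the restarted module sees the messages in their original order; an eviction policy could be substituted where stale messages should be dropped instead.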
According to an example embodiment of the present invention, each application may specify its static architecture (particularly the constituent modules and their interactions) to the APD and additionally specify, for each module, the messages it will publish and subscribe to in a corresponding application manifest. In addition, the application may optionally specify in the manifest how its constituent modules interact with each other (via messages) in a normal mode, which may then be used by the EMB to detect deviant behaviour. The manifest may also be augmented with information regarding how the broker must handle messages, e.g., by buffering or evicting them, when it detects that a module is down. The APD may monitor the interactions between different modules by monitoring the time when messages are received by a module and whether the module responds correspondingly within a given time, as specified in the application manifest. Alternatively, the APD may learn the trend of interactions and issue a warning when it detects a deviation from the regular behaviour. If the application does not specify specific timing details regarding the receipt and/or publishing of