US-12625755-B2 - Errors in distributed computing environments


Abstract

Embodiments of the present invention provide concepts for quantifying impact of one or more errors in a distributed computing environment. A processor may detect, at a caller entity of the distributed computing environment, an error resulting from a request from the caller entity to a callee entity of the distributed computing environment. The processor may associate the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity. The processor may quantify an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment.

Inventors

Christopher Neil Bailey

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 2026-05-12
Application Date: 2023-05-12

Claims (20)

1. A method for quantifying impact of one or more errors in a distributed computing environment, the method comprising: utilizing a distributed tracing to track end-to-end requests across applications, starting from a user; creating Service Level Objectives (SLOs) and Business Key Performance Indicators (KPIs) by creating calculations against available metrics and setting goals to the calculations; detecting, at a caller entity of the distributed computing environment, an error, based on the distributed tracing, resulting from a request from the caller entity to a callee entity of the distributed computing environment; associating the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity; quantifying an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment; generating an impact report based on the quantified impact to implement a remediation; providing a deterministic quantification of an impact associated with the impact report, wherein the quantification of the impact comprises: calculating an impact value based on the callee incident associated with the detected error and a count value of the service level metric; facilitating improved assessment by quantifying and measuring an impact based on the impact report in consideration of a service level metric of the distributed computing environment, wherein the measure measuring is propagated along a chain of services until it is assigned to the service that is the cause of the incident; and correcting a working of a service by providing a deterministic quantification of the impact of the detected error that affects a predetermined availability of the service and corrects the working of the service, information technology (IT), and business function by moving a measurement of an incident to a caller entity to enable fault tolerance capabilities and by associating the detected error with a callee incident and quantifying the impact of the error based on the callee incident and a service level metric of the distributed computing environment.
2. The method of claim 1, wherein the request is a service request for requesting performance or delivery of a service by the callee entity.
3. The method of claim 1, wherein the service level metric comprises a predetermined service level metric or key performance indicator associated with the callee entity.
4. The method of claim 1, wherein quantifying an impact of the error comprises calculating an impact value based on the callee incident associated with the detected error and a count value of the service level metric.
5. The method of claim 4, wherein calculating the impact value comprises calculating a value signifying a first count value representing non-occurrence of the error minus a second count value representing occurrence of the error.
6. The method of claim 1, wherein the quantified impact comprises a fractional value.
7. The method of claim 1, wherein the quantified impact comprises an averaged value.
8. The method of claim 1, wherein the distributed computing environment comprises a transaction processing environment.
9. A system comprising: a processor; and a computer-readable storage medium communicatively coupled to the processor and storing program instructions which, when executed by the processor, cause the processor to perform a method for quantifying impact of one or more errors in a distributed computing environment, the method comprising: utilizing a distributed tracing to track end-to-end requests across applications, starting from a user; creating Service Level Objectives (SLOs) and Business Key Performance Indicators (KPIs) by creating calculations against available metrics and setting goals to the calculations; detecting, at a caller entity of the distributed computing environment, an error, based on the distributed tracing, resulting from a request from the caller entity to a callee entity of the distributed computing environment; associating the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity; quantifying an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment; generating an impact report based on the quantified impact to implement a remediation; providing a deterministic quantification of an impact associated with the impact report, wherein the quantification of the impact comprises: calculating an impact value based on the callee incident associated with the detected error and a count value of the service level metric; facilitating improved assessment by quantifying and measuring an impact based on the impact report in consideration of a service level metric of the distributed computing environment, wherein the measure measuring is propagated along a chain of services until it is assigned to the service that is the cause of the incident; and correcting a working of a service by providing a deterministic quantification of the impact of the detected error that affects a predetermined availability of the service and corrects the working of the service, information technology (IT), and business function by moving a measurement of an incident to a caller entity to enable fault tolerance capabilities and by associating the detected error with a callee incident and quantifying the impact of the error based on the callee incident and a service level metric of the distributed computing environment.
10. The system of claim 9, wherein the request is a service request for requesting performance or delivery of a service by the callee entity.
11. The system of claim 9, wherein the service level metric comprises a predetermined service level metric or key performance indicator associated with the callee entity.
12. The system of claim 9, wherein quantifying an impact of the error comprises calculating an impact value based on the callee incident associated with the detected error and a count value of the service level metric.
13. The system of claim 12, wherein calculating an impact value comprises calculating a value signifying a first count value representing non-occurrence of the error minus a second count value representing occurrence of the error.
14. The system of claim 9, wherein the quantified impact comprises a fractional value.
15. A computer-readable storage medium having program instructions embodied therewith, the program instructions executable by a processor to cause the processor to perform a method for quantifying impact of one or more errors in a distributed computing environment, the method comprising: utilizing a distributed tracing to track end-to-end requests across applications, starting from a user; creating Service Level Objectives (SLOs) and Business Key Performance Indicators (KPIs) by creating calculations against available metrics and setting goals to the calculations; detecting, at a caller entity of the distributed computing environment, an error, based on the distributed tracing, resulting from a request from the caller entity to a callee entity of the distributed computing environment; associating the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity; quantifying an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment; generating an impact report based on the quantified impact to implement a remediation; providing a deterministic quantification of an impact associated with the impact report, wherein the quantification of the impact comprises: calculating an impact value based on the callee incident associated with the detected error and a count value of the service level metric; facilitating improved assessment by quantifying and measuring an impact based on the impact report in consideration of a service level metric of the distributed computing environment, wherein the measure measuring is propagated along a chain of services until it is assigned to the service that is the cause of the incident; and correcting a working of a service by providing a deterministic quantification of the impact of the detected error that affects a predetermined availability of the service and corrects the working of the service, information technology (IT), and business function by moving a measurement of an incident to a caller entity to enable fault tolerance capabilities and by associating the detected error with a callee incident and quantifying the impact of the error based on the callee incident and a service level metric of the distributed computing environment.
16. The computer-readable storage medium of claim 15, wherein the request is a service request for requesting performance or delivery of a service by the callee entity.
17. The computer-readable storage medium of claim 15, wherein the service level metric comprises a predetermined service level metric or key performance indicator associated with the callee entity.
18. The computer-readable storage medium of claim 15, wherein quantifying an impact of the error comprises calculating an impact value based on the callee incident associated with the detected error and a count value of the service level metric.
19. The computer-readable storage medium of claim 18, wherein calculating an impact value comprises calculating a value signifying a first count value representing non-occurrence of the error minus a second count value representing occurrence of the error.
20. The computer-readable storage medium of claim 15, wherein the quantified impact comprises a fractional value.
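
For illustration only, the impact value recited in the dependent claims (a first count minus a second count, optionally expressed as a fractional value) and the propagation of the measurement along a chain of services could be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the names `ImpactValue`, `expected_count`, `observed_count`, `attribute_to_root_cause`, and `failing_callee` are hypothetical, and reading the "first count value" as the metric count had the error not occurred is one possible interpretation.

```python
from dataclasses import dataclass


@dataclass
class ImpactValue:
    """Hypothetical container for the two count values of a service level metric."""
    expected_count: int  # assumed: metric count had the error not occurred
    observed_count: int  # assumed: metric count with the error occurring

    @property
    def absolute(self) -> int:
        # First count (non-occurrence) minus second count (occurrence).
        return self.expected_count - self.observed_count

    @property
    def fractional(self) -> float:
        # Impact expressed as a fraction of the expected count.
        if self.expected_count == 0:
            return 0.0
        return self.absolute / self.expected_count


def attribute_to_root_cause(service: str, failing_callee: dict[str, str]) -> str:
    """Follow caller -> failing-callee links until a service with no failing callee.

    `failing_callee` maps a service to the callee whose incident caused its
    errors (a hypothetical data shape). The `seen` set guards against cycles.
    """
    seen = set()
    while service in failing_callee and service not in seen:
        seen.add(service)
        service = failing_callee[service]
    return service
```

With these assumptions, `ImpactValue(1000, 940)` gives an absolute impact of 60 and a fractional impact of 0.06, and a chain `frontend -> orders -> payments` assigns the measurement to `payments`.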

Description

BACKGROUND

The present disclosure relates generally to distributed computing environments and, more specifically, to assessing the impact of errors or incidents in a distributed computing environment.

When an error or incident occurs within a distributed computing environment, the error/incident is typically categorized as affecting the availability (e.g., an outage), performance (e.g., a latency issue) or correctness (e.g., an unexpected error) of the computing environment.

With an increase in adoption of Service Level Objectives (SLOs) and associated error budgets, there is a trend towards measuring and reporting the impact of errors/incidents against those SLOs and associated error budgets. For example, a correctness SLO of 99.9% allows for an error budget of 0.1%, meaning that 0.1% of requests can respond with an error. When an incident occurs due to unexpected errors, the number of errors may be quantified as they reduce the remaining error budget.

It may be desirable to accurately quantify and measure the impact of errors/incidents in distributed computing environments, in order to understand the impact of outages, prioritize appropriately, and/or to take mitigating action(s).

SUMMARY

An embodiment of the present invention provides a computer-implemented method for quantifying impact of one or more errors in a distributed computing environment. The method comprises: detecting, at a caller entity of or in the distributed computing environment, an error resulting from a request from the caller entity to a callee entity of the distributed computing environment; associating the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity; and quantifying an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment.

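The three steps of the method summarized above (detecting at the caller, associating with a callee incident, quantifying against a service level metric) might be sketched as follows, using the error budget from the background as the metric. This is a hedged sketch, not the disclosed implementation: every name (`CallError`, `CalleeIncident`, `detect_error`, `associate`, `error_budget_consumed`) is hypothetical, and treating HTTP-style status codes of 500 and above as errors is an assumption made only for the example.

```python
from dataclasses import dataclass, field


@dataclass
class CallError:
    caller: str
    callee: str
    status: int


@dataclass
class CalleeIncident:
    callee: str
    description: str
    errors: list = field(default_factory=list)


def detect_error(caller: str, callee: str, status: int):
    """Detect an error at the caller; 5xx-as-failure is an assumption."""
    return CallError(caller, callee, status) if status >= 500 else None


def associate(error: CallError, incidents: dict) -> CalleeIncident:
    """Attach the error to the incident of its callee, creating one if needed."""
    incident = incidents.setdefault(
        error.callee,
        CalleeIncident(error.callee, "abnormal operating condition"),
    )
    incident.errors.append(error)
    return incident


def error_budget_consumed(incident: CalleeIncident, total_requests: int,
                          slo_target: float) -> float:
    """Fraction of the error budget used by the errors tied to this incident."""
    allowed_errors = total_requests * (1.0 - slo_target)
    if allowed_errors <= 0:
        raise ValueError("SLO target leaves no error budget")
    return len(incident.errors) / allowed_errors


# Usage: four requests from a hypothetical "checkout" caller to "payments".
incidents = {}
for status in (200, 503, 500, 200):
    err = detect_error("checkout", "payments", status)
    if err is not None:
        associate(err, incidents)

consumed = error_budget_consumed(
    incidents["payments"], total_requests=10_000, slo_target=0.999
)
```

Here a 99.9% SLO over 10,000 requests allows roughly 10 errors, so the two errors tied to the `payments` incident consume about 20% of the budget.
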
Another embodiment of the present invention provides a system comprising: one or more processors; and a computer-readable storage medium communicatively coupled to the one or more processors and storing program instructions which, when executed by the one or more processors, cause the one or more processors to perform a method for quantifying impact of one or more errors in a distributed computing environment. The method comprises: detecting, at a caller entity of or in the distributed computing environment, an error resulting from a request from the caller entity to a callee entity of the distributed computing environment; associating the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity; and quantifying an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment.

Another embodiment of the present invention provides a computer program product comprising a computer-readable storage medium having program instructions embodied therewith that, when executed, perform a method for quantifying impact of one or more errors in a distributed computing environment. The method comprises: detecting, at a caller entity of or in the distributed computing environment, an error resulting from a request from the caller entity to a callee entity of the distributed computing environment; associating the detected error with a callee incident, the callee incident describing an abnormal operating condition of the callee entity; and quantifying an impact of the error based on the callee incident associated with the detected error and a service level metric of the distributed computing environment.

Proposed embodiments may thus provide one or more concepts for determining and quantifying a technical performance impact of an error and/or a business performance impact of an error.
This may enable issues and/or actions to be prioritized (e.g., according to severity of performance impact). Thus, there may be provided a mechanism for determining failure impact, measuring and quantifying the effect on Information Technology (IT)/business metrics and Key Performance Indicators (KPIs), and associating that with information on individual impacted users and their activities.

The proposed embodiments may be employed in combination with conventional or existing distributed computing environments, such as transaction processing environments, for example. In this way, embodiments may be integrated into legacy systems so as to improve and/or extend their functionality and capabilities. An improved computing environment may therefore be provided by proposed embodiments.

The above summary is not intended to describe each illustrated embodiment or every implementation of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The drawings included in the present disclosure are incorporated into, and form part of, the specification. They illustrate embodiments of the present disclosure and, along with the description, serve to explain the principles of the dis