US-20260127515-A1 - Smart Incident Response

Abstract

An event management bus is configured to ingest events from a plurality of monitoring tools at a defined acceptance rate. Events received in excess of the acceptance rate are rejected, and a rejection notification is transmitted. For an ingested event, an incident is triggered. A machine-learning model, selected based on a determined incident type, initiates a process to identify a resolution for the incident. An action determined as a result of the process is executed by an action execution tool. Feedback data indicating the effectiveness of the executed action in resolving the incident is received. The machine-learning model is then retrained using the feedback data.
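The ingestion step described above can be sketched as follows. This is a minimal illustration only: the source does not say how the acceptance rate is enforced, so the fixed-window counter, the `EventBus` class, its field names, and the shape of the rejection notification are all assumptions made for this example.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class EventBus:
    """Hypothetical bus: accepts at most `max_events` per `window_seconds`."""

    max_events: int
    window_seconds: float
    # Injectable clock so the behavior can be exercised deterministically.
    clock: Callable[[], float] = time.monotonic
    _window_start: float = field(default=0.0, init=False)
    _count: int = field(default=0, init=False)
    accepted: List[Dict] = field(default_factory=list, init=False)

    def __post_init__(self) -> None:
        self._window_start = self.clock()

    def ingest(self, event: Dict) -> Dict:
        """Accept the event or return a rejection notification for its sender."""
        now = self.clock()
        # Start a new counting window once the current one has elapsed.
        if now - self._window_start >= self.window_seconds:
            self._window_start = now
            self._count = 0
        if self._count >= self.max_events:
            # Assumed notification format: tells the source the event was not
            # accepted for processing.
            return {
                "status": "rejected",
                "reason": "acceptance rate exceeded",
                "source": event.get("source"),
            }
        self._count += 1
        self.accepted.append(event)
        return {"status": "accepted", "source": event.get("source")}
```

With `EventBus(max_events=2, window_seconds=1.0)`, a third event arriving inside the same one-second window receives the rejection notification, while an event arriving after the window has elapsed is accepted again. A token bucket or sliding window would serve equally well; the fixed window is chosen here only for brevity.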

Inventors

Sanghamitra Goswami
Laura Ann Zuchlewski

Assignees

PagerDuty, Inc.

Dates

Publication Date: 20260507
Application Date: 20251030

Claims (20)

1. A method for managing event data in an event management system, the method comprising: configuring an event management bus to ingest one or more events from a plurality of event sources limited to a defined acceptance rate; rejecting, in response to determining that a received event is not received in accordance with the defined acceptance rate, the received event; transmitting, in response to the determining, a rejection notification to a system from which the received event was received, wherein the rejection notification indicates that the received event is not accepted for processing; and triggering, for an event ingested in accordance with the defined acceptance rate, an incident, initiating, based on an output from a machine-learning model, a process for identifying a resolution for the incident, wherein the machine-learning model is selected based on an incident type determined for the incident; executing an action determined as a result of the initiated process by an action execution tool; receiving, from the action execution tool, feedback data indicating an effectiveness of the executed action in resolving the incident; and retraining the machine-learning model using the feedback data.
2. The method of claim 1, wherein the incident type is selected from a set comprising a rare type, a novel type, and a frequent type.
3. The method of claim 2, further comprising: determining the incident type for the incident by: responsive to incident data meeting a first condition, determining that the incident is of the rare type; responsive to the incident data meeting a second condition, determining that the incident is of the novel type; and responsive to the incident data meeting a third condition, determining that the incident is of the frequent type.
4. The method of claim 3, wherein determining the incident type further comprises: evaluating the incident based on constituent data of the incident and historical data related to the incident.
5. The method of claim 2, wherein the machine-learning model is a collaborative filtering model when the incident is of the frequent type and is a content-based filtering model when the incident is of the rare type.
6. The method of claim 5, further comprising executing the collaborative filtering model, the collaborative filtering model including normalizing an incident title associated with the incident to identify one or more similar historical incidents.
7. The method of claim 6, wherein normalizing the incident title comprises tokenizing the incident title and replacing specific types of information with representative tokens, wherein the specific types of information comprises at least one of a timestamp, a unique identifier, or a network address.
8. The method of claim 6, wherein the collaborative filtering model further includes: executing a term frequency-inverse document frequency algorithm to build a user profile of one or more incident title words from the normalized incident title; and calculating cosine similarities using the one or more incident title words.
9. The method of claim 2, wherein the machine-learning model is a content-based filtering model, the method further comprising executing the content-based filtering model, the content-based filtering model including: analyzing one or more incidents selected from a list of incidents occurring over a specific time frame; or analyzing one or more incident title tokens common to title tokens related to the incident.
10. The method of claim 1, wherein at least some of the one or more events are ingested via one of a Short Message Service (SMS) message, a HyperText Transfer Protocol (HTTP) request, or an Application Programming Interface (API) call.
11. A method for managing event data in an event management system, the method comprising: configuring an event management bus to ingest one or more events at a defined acceptance rate and to reject any event received in excess of the defined acceptance rate; classifying, for an ingested event, a corresponding incident as one of at least a frequent type or a rare type; executing, responsive to the incident being classified as the frequent type, a collaborative filtering model to initiate a process to identify an action for resolving the incident; executing, responsive to the incident being classified as the rare type, a content-based filtering model to initiate a process to identify the action for resolving the incident; tracking feedback data resulting from the action being carried out; retraining at least one of the collaborative filtering model or the content-based filtering model using the feedback data to improve one or more later action recommendations.
12. The method of claim 11, wherein executing the collaborative filtering model comprises normalizing an incident title associated with the incident.
13. The method of claim 12, wherein normalizing the incident title comprises tokenizing the incident title and replacing specific types of information with representative tokens, wherein the specific types of information comprises at least one of a timestamp, a unique identifier, or a network address.
14. The method of claim 11, wherein executing the collaborative filtering model further comprises: executing a term frequency-inverse document frequency algorithm to build a user profile of one or more incident title words; and calculating cosine similarities using the one or more incident title words.
15. The method of claim 11, wherein executing the content-based filtering model comprises at least one of: analyzing one or more incidents selected from a list of incidents occurring over a specific time frame; or analyzing one or more incident title tokens common to title tokens related to the incident.
16. A method for improving incident response, the method comprising: configuring an event management bus to ingest one or more received events, wherein each received event indicates a condition detected by a monitoring tool; determining, responsive to an ingested event, an incident type for a corresponding incident from a set comprising at least one of a frequent type or a rare type; generating, by a machine-learning model, a recommendation for resolving the incident, wherein the machine-learning model is selected based on the determined incident type and the generating includes: normalizing incident titles by tokenizing the incident titles and replacing specific types of information with representative tokens, wherein the specific types of information includes timestamps, unique identifiers, and network addresses, executing a collaborative filtering model using the normalized incident titles to identify similar historical incidents, wherein the collaborative filtering model is trained based on features of incidents in a historical data set, calculating cosine similarities between the incident and the similar historical incidents, and generating the recommendation based on the cosine similarities and the similar historical incidents; executing an action determined as a result of the recommendation by an action execution tool; receiving, from the action execution tool, feedback data indicating an effectiveness of the executed action in resolving the incident; and retraining the machine-learning model using the feedback data.
17. The method of claim 16, wherein determining the incident type comprises: responsive to historical data meeting a first condition, determining that the incident is of the rare type; and responsive to the historical data meeting a second condition, determining that the incident is of the frequent type.
18. The method of claim 17, wherein determining the incident type further comprises evaluating the incident based on constituent data of the incident and the historical data related to the incident.
19. The method of claim 16, wherein at least some of the one or more received events are ingested via one of a Short Message Service (SMS) message, a HyperText Transfer Protocol (HTTP) request, or an Application Programming Interface (API) call.
20. The method of claim 16, wherein the machine-learning model is the collaborative filtering model when the incident is of the frequent type, and wherein the machine-learning model is a content-based filtering model when the incident is of the rare type.
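The title normalization, term frequency-inverse document frequency weighting, and cosine similarity steps recited in the claims above can be illustrated with a short sketch. It is not the claimed implementation: the regular expressions, the placeholder tokens (`<timestamp>`, `<id>`, `<ip>`), the smoothed IDF formula, and all function names are assumptions chosen for this example, and only the Python standard library is used.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical patterns for the "specific types of information" to mask.
_PATTERNS = [
    (re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"), "<timestamp>"),
    (re.compile(r"\b[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}\b"), "<id>"),
    (re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"), "<ip>"),
]


def normalize_title(title: str) -> List[str]:
    """Replace variable fields with representative tokens, then tokenize."""
    for pattern, token in _PATTERNS:
        title = pattern.sub(token, title)
    return re.findall(r"<[a-z]+>|[a-z0-9_]+", title.lower())


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors using a smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            tok: (count / len(doc)) * (math.log((1 + n) / (1 + df[tok])) + 1.0)
            for tok, count in tf.items()
        })
    return vectors


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(new_title: str, history: List[str]) -> List[Tuple[float, str]]:
    """Score each historical title against the new one, best match first."""
    docs = [normalize_title(t) for t in history] + [normalize_title(new_title)]
    vectors = tfidf_vectors(docs)
    query = vectors[-1]
    scored = [(cosine(query, vec), title) for vec, title in zip(vectors, history)]
    return sorted(scored, reverse=True)
```

Because timestamps and addresses are masked before weighting, two titles that differ only in those fields normalize to the same token sequence and score as near-identical, which is the property the normalization step is meant to provide.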

Description

CROSS-REFERENCE TO RELATED APPLICATIONS

This disclosure claims the benefit of U.S. Patent Application No. 17/876,727, filed July 29, 2022, the disclosure of which is incorporated by reference herein in its entirety.

BACKGROUND

Information technology (IT) systems are increasingly becoming complex, multivariate, and in some cases non-intuitive systems with varying degrees of nonlinearity. These complex IT systems may be difficult to model or accurately understand. Various monitoring systems may be arrayed to provide events, alerts, notifications, or the like, in an effort to provide visibility into operational metrics, failures, and/or correctness. However, the sheer size and complexity of these IT systems may result in a flood of event messages from disparate monitoring/reporting services. With the increased complexity of distributed computing systems, existing event reporting and/or management may not, for example, have the capability to effectively process events in complex and noisy systems. At enterprise scale, IT systems may have millions of components, resulting in a complex, inter-related set of monitoring systems that report millions of events from disparate subsystems. Manual techniques and pre-programmed rules are labor intensive, computing intensive, and expensive, especially in the context of large, centralized IT operations with very complex systems distributed across large numbers of components. Further, these manual techniques may limit the ability of systems to scale and evolve for future advances in IT systems capabilities.

In networked environments, network operators retain a certain number of responders to address incidents in networked systems and applications. A responder assigned to resolve an incident may need assistance from other responders.

SUMMARY

Disclosed herein are implementations of a system and method for responding to alerts in a networked environment.
In an aspect, a method for managing event data in an event management system is disclosed. The method includes configuring an event management bus to ingest one or more events from a plurality of disparate monitoring tools limited to a defined acceptance rate, rejecting a received event in response to determining that the received event is not received in accordance with the defined acceptance rate, and transmitting a rejection notification to a system from which the received event was received in response to the determining, wherein the rejection notification indicates that the received event is not accepted for processing. The method includes triggering an incident for an event ingested in accordance with the defined acceptance rate, initiating, based on an output from a machine-learning model, a process for identifying a resolution for the incident, wherein the machine-learning model is selected based on an incident type determined for the incident, executing an action determined as a result of the initiated process by an action execution tool, receiving feedback data from the action execution tool indicating an effectiveness of the executed action in resolving the incident, and retraining the machine-learning model using the feedback data.

In a second aspect, a method for managing event data in an event management system is disclosed. The method includes configuring an event management bus to ingest one or more events at a defined acceptance rate and to reject any event received in excess of the defined acceptance rate, classifying a corresponding incident as one of at least a frequent type or a rare type for an ingested event, and executing a collaborative filtering model to initiate a process to identify an action for resolving the incident, responsive to the incident being classified as the frequent type. The method includes executing a content-based filtering model to initiate a process to identify the action for resolving the incident, responsive to the incident being classified as the rare type, tracking feedback data resulting from the action being carried out, and retraining at least one of the collaborative filtering model or the content-based filtering model using the feedback data to improve one or more subsequent action recommendations.

In a third aspect, a method for improving incident response is disclosed. The method includes configuring an event management bus to ingest one or more received events, wherein each received event indicates a condition detected by a monitoring tool, and determining an incident type for a corresponding incident from a set comprising at least one of a frequent type or a rare type, responsive to an ingested event. The method includes generating, by a machine-learning model, a recommendation for resolving the incident, wherein the machine-learning model is selected based on the determined incident type. The generating includes normalizing incident titles by tokenizing the incident titles and replacing specific types of information with representative tokens, wherein the specific type