US-12626138-B2 - Causality detection for outlier events in telemetry metric data

Abstract

Identifying causal relationships between outlier telemetry events in telemetry metric data using machine learning ensembles of an autoencoder and an attention mechanism provides an automated framework for root cause analysis. Outlier telemetry events are detected across a cloud of telemetry events using unsupervised learning models. To establish a causal relationship between outlier telemetry events, autoencoder/attention mechanism ensembles are trained for pairs of telemetry metrics. When inputs of sequences of telemetry events of a first telemetry metric and a second telemetry metric to the ensemble have sufficiently high loss value, a causal relationship is inferred. Internal node values of the attention mechanism from the input identify specific time stamps for the first telemetry metric that have a causal relationship with the outlier telemetry event.
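
As an illustration only, the two signals named in the abstract (a loss value for the ensemble and internal values of the attention mechanism) can be sketched in a few lines. The sketch assumes a mean squared error between the input sequence and its reconstruction and assumes that the internal values are softmax weights with one weight per time stamp; the abstract specifies neither, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def reconstruction_loss(inputs: Sequence[float], outputs: Sequence[float]) -> float:
    """Mean squared error between an input sequence and its reconstruction.

    Assumption: the abstract only says "loss value"; MSE is one common choice.
    """
    if len(inputs) != len(outputs) or not inputs:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(inputs, outputs)) / len(inputs)


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the returned weights sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rank_time_stamps(
    time_stamps: Sequence[int], scores: Sequence[float]
) -> List[Tuple[int, float]]:
    """Pair each time stamp with its attention weight, highest weight first.

    Assumption: one raw attention score per time stamp of the first metric.
    """
    weights = softmax(scores)
    return sorted(zip(time_stamps, weights), key=lambda pair: pair[1], reverse=True)


# Hypothetical values, not taken from the patent.
loss = reconstruction_loss([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
ranked = rank_time_stamps([100, 101, 102], [0.1, 2.0, 0.3])
```

In this sketch a caller would compare `loss` against a chosen threshold and, if the loss is sufficiently high, read the leading entries of `ranked` as candidate time stamps for the cause.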

Inventors

Zhen Han Si
Claudionor Jose Nunes Coelho, JR.
Viswesh Ananthakrishnan
Eyal Firstenberg

Assignees

PALO ALTO NETWORKS, INC.

Dates

Publication Date: 20260512
Application Date: 20210913

Claims (20)

1. A method comprising: determining, with a set of one or more unsupervised learning models, a plurality of outliers in a first plurality of telemetry events across a plurality of telemetry metrics for computing resources; inputting a plurality of event indications into an ensemble of an attention mechanism and an autoencoder, wherein the plurality of event indications at least include indications of a first outlier of the plurality of outliers and a second outlier of the plurality of outliers that occurs after the first outlier, and wherein the first outlier is an outlier telemetry event of a first telemetry metric of the plurality of telemetry metrics and the second outlier is an outlier telemetry event of a second telemetry metric of the plurality of telemetry metrics; determining that a loss function value for the ensemble of the attention mechanism and the autoencoder on the input of the plurality of event indications satisfies a first criterion based on a threshold loss function value; based on determining that the loss function value satisfies the first criterion, determining which values of a set of one or more values output by a first internal layer of the attention mechanism satisfy a second criterion, wherein the second criterion comprises a criterion that a value output by the first internal layer at least one of exceeds one or more threshold values and exceeds a threshold percentile of values observed at one or more internal layers of the attention mechanism including the first internal layer; and based on determining that a first value of the set of one or more values output by the first internal layer corresponding to a first telemetry event of the first telemetry metric satisfies the second criterion, indicating the first telemetry event of the first telemetry metric as causing the second outlier.
2. The method of claim 1, wherein each of the plurality of telemetry metrics for the computing resources corresponds to one or more telemetry metric values and one or more corresponding time stamps.
3. The method of claim 1, further comprising: based on determining that the loss function value satisfies the first criterion, determining which of a set of one or more values at a plurality of layers of the attention mechanism satisfies the second criterion; and based on determining that a second value of the set of one or more values output by the plurality of layers corresponding to a second telemetry event of the first telemetry metric satisfies the second criterion, indicating a second telemetry event of the first telemetry metric as causing the second outlier.
4. The method of claim 1, wherein the second criterion comprises a determination that a value in the set of one or more values at the first internal layer of the attention mechanism is a maximal value of the set of one or more values at the first internal layer of the attention mechanism.
5. The method of claim 1, further comprising identifying a plurality of intervening telemetry events occurring after the first outlier and prior to the second outlier.
6. The method of claim 5, wherein the plurality of event indications at least includes indications of the plurality of intervening telemetry events.
7. The method of claim 5, wherein one or more of the plurality of intervening telemetry events comprise telemetry events of the first telemetry metric and the second telemetry metric.
8. The method of claim 1, wherein the attention mechanism is a neural network, further comprising training the ensemble of the autoencoder and the attention mechanism on metric values for the first telemetry metric and the second telemetry metric using backpropagation.
9. The method of claim 1, wherein determining that the loss function value for the ensemble of the attention mechanism and the autoencoder on the input of the plurality of event indications satisfies the first criterion comprises determining that the loss function value exceeds the threshold loss function value.
10. The method of claim 1, further comprising generating a causality chain for the first plurality of telemetry events based, at least in part, on the indication of the first telemetry event as causing the second outlier.
11. One or more non-transitory computer-readable media having program code stored thereon, the program code comprising instructions to: determine, with a set of one or more unsupervised learning models, a plurality of outliers in a first plurality of telemetry values of a plurality of telemetry metrics for computing resources; input, into an ensemble of an attention mechanism and an autoencoder, indications of a first outlier of the plurality of outliers and a second outlier of the plurality of outliers that occurs after the first outlier, wherein the first outlier is of a first telemetry metric of the plurality of telemetry metrics and the second outlier is of a second telemetry metric of the plurality of telemetry metrics; determine whether a loss function value output from the autoencoder based on input to the ensemble satisfies a first criterion based on a threshold loss function value; based on a determination that the loss function value satisfies the first criterion, evaluate a set of one or more values output by a first internal layer of the attention mechanism to identify a first telemetry value of the first telemetry metric as having a causal link to the second outlier, wherein the instructions to evaluate the set of one or more values to identify the first telemetry value comprise instructions to determine that the first telemetry value at least one of exceeds one or more threshold values and exceeds a threshold percentile of values observed at one or more internal layers of the attention mechanism including the first internal layer; and indicate the causal link between the first telemetry value and the second outlier.
12. The one or more non-transitory computer-readable media of claim 11, wherein the program code further comprises instructions to input, into the ensemble of the attention mechanism and the autoencoder, indications of a plurality of intervening telemetry values occurring after the first outlier and before the second outlier.
13. The one or more non-transitory computer-readable media of claim 11, wherein the instructions to determine whether the loss function value output from the autoencoder based on input to the ensemble satisfies the first criterion comprise instructions to determine that the loss function value exceeds the threshold loss function value.
14. The one or more non-transitory computer-readable media of claim 11, wherein the program code further comprises instructions to generate a causality chain for at least the plurality of telemetry metrics based, at least in part, on indications of the causal link between the first telemetry value and the second outlier.
15. An apparatus comprising: a processor; and a computer-readable medium having instructions stored thereon that are executable by the processor to cause the apparatus to, determine, with a set of one or more unsupervised learning models, a plurality of outliers in a first plurality of telemetry events across a plurality of telemetry metrics for computing resources; input a plurality of event indications into an ensemble of an attention mechanism and an autoencoder, wherein the plurality of event indications at least include indications of a first outlier of the plurality of outliers and a second outlier of the plurality of outliers that occurs after the first outlier, and wherein the first outlier is of a first telemetry metric of the plurality of telemetry metrics and the second outlier is of a second telemetry metric of the plurality of telemetry metrics; determine that a loss function value for the ensemble of the attention mechanism and the autoencoder on the input of the plurality of event indications satisfies a first criterion based on a threshold loss function value; based on determining that the loss function value satisfies the first criterion, determine which of a set of one or more values output by a first internal layer of the attention mechanism satisfies a second criterion, wherein the second criterion comprises a criterion that a value output by the first internal layer at least one of exceeds one or more threshold values and exceeds a threshold percentile of values observed at one or more internal layers of the attention mechanism including the first internal layer; and based on determining that a first value of the set of one or more values at the first internal layer corresponding to a first telemetry event of the first telemetry metric satisfies the second criterion, indicate the first telemetry event of the first telemetry metric as causing the second outlier.
16. The apparatus of claim 15, wherein each of the plurality of telemetry metrics for the computing resources corresponds to one or more telemetry metric values and one or more corresponding time stamps.
17. The apparatus of claim 15, further comprising: based on determining that the loss function value satisfies the first criterion, determining which of a set of one or more values at a plurality of layers of the attention mechanism satisfies the second criterion; and based on a second telemetry event of the first telemetry metric corresponding to a second value of the set of one or more values at the plurality of layers that satisfies the second criterion, wherein the second value corresponds to a second telemetry event of the first telemetry metric, indicating the second telemetry event of the first telemetry metric as causing the second outlier.
18. The method of claim 1, wherein the first internal layer comprises at least one of a mask layer and a softmax layer that focuses attention of the attention mechanism on one or more sections of the plurality of event indications that at least correspond to the first telemetry event.
19. The one or more non-transitory computer-readable media of claim 11, wherein the first internal layer comprises at least one of a mask layer and a softmax layer that focuses attention of the attention mechanism on one or more sections of the input to the ensemble of the attention mechanism and the autoencoder that at least correspond to the first telemetry value.
20. The apparatus of claim 15, wherein the first internal layer comprises at least one of a mask layer and a softmax layer that focuses attention of the attention mechanism on one or more sections of the plurality of event indications that at least correspond to the first telemetry event.
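
For readers who want a concrete picture of the two criteria recited in claims 1 and 9, the following is a minimal sketch and not the claimed implementation. It assumes that "at least one of" is read as "either condition suffices", that the percentile is computed by the nearest-rank method, and that the values of the first internal layer arrive as a flat list with one value per telemetry event. All names and numbers are hypothetical.

```python
import math
from typing import List, Sequence


def nearest_rank_percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0..100) of a non-empty sequence."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def cause_event_indices(
    loss: float,
    loss_threshold: float,
    layer_values: Sequence[float],
    observed_values: Sequence[float],
    value_threshold: float,
    pct: float,
) -> List[int]:
    """Return indices of events whose layer value satisfies the second criterion.

    layer_values: values output by the first internal layer, one per event.
    observed_values: values observed at the internal layers considered,
        including the first internal layer (used for the percentile cutoff).
    """
    # First criterion, read as in claim 9: the loss must exceed the threshold.
    if loss <= loss_threshold:
        return []
    cutoff = nearest_rank_percentile(observed_values, pct)
    # Second criterion (assumed reading): exceed the fixed threshold
    # or exceed the percentile cutoff.
    return [
        i
        for i, value in enumerate(layer_values)
        if value > value_threshold or value > cutoff
    ]


# Hypothetical attention values for four events of the first telemetry metric.
values = [0.05, 0.10, 0.60, 0.25]
flagged = cause_event_indices(0.9, 0.4, values, values, 0.5, 75.0)
not_flagged = cause_event_indices(0.1, 0.4, values, values, 0.5, 75.0)
```

With these hypothetical inputs only the third event (index 2) is flagged, and nothing is flagged when the loss stays below the threshold. Claim 4's alternative, selecting the maximal value, would replace the list comprehension with a single argmax over `layer_values`.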

Description

BACKGROUND

The disclosure generally relates to the field of information security, and to modeling, design, simulation, or emulation.

Autoencoders are neural networks which, contrary to typical neural network architectures, comprise a first set of contractive layers that progressively decrease the number of internal nodes at each layer, followed by a second set of expansive layers that progressively increase the number of internal nodes at each layer until the output layer, which has the same length as the input layer. The loss function that guides training (e.g., via gradient descent with backpropagation through the layers) is a loss between the input and the output, as opposed to the loss between outputs and labels for corresponding inputs that a supervised neural network uses. Once the autoencoder is trained, its loss function can be used for outlier detection. An input/output pair with loss above a threshold value indicates that the input deviates statistically from the training data, so the input can be identified as an outlier.

The use of attention mechanisms is a methodology in machine learning for isolating specific inputs to a neural network that have a significant impact on the output. The attention mechanism is itself a neural network comprising various layers including mask, softmax, scaling, alignment, and context operations. The attention mechanism is trained as an ensemble with the neural network so that inputs to the attention mechanism/neural network ensemble generate weights at internal nodes of the attention mechanism that indicate which parts of the input are significant for the output values.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the disclosure may be better understood by referencing the accompanying drawings.

FIG. 1 is a schematic diagram of an example system for determining event causality chains in telemetry event logs across data sources.
FIG. 2 is a schematic diagram of a system for identifying event causalities in telemetry events.
FIG. 3 is a schematic diagram of an event causality identification engine for telemetry events.
FIG. 4 is a flowchart of example operations for detecting causality chains in telemetry events using attention mechanism/autoencoder ensembles.
FIG. 5 is a flowchart of example operations for determining event causality chains in outlier telemetry events using attention mechanism/autoencoder ensembles.
FIG. 6 is a flowchart of example operations for evaluating values at internal layers of an attention mechanism to identify a cause event(s).
FIG. 7 is a flowchart of example operations for evaluating scores for potential cause events across internal layers of an attention mechanism to identify cause event(s).
FIG. 8 is a flowchart of example operations for training an attention mechanism/autoencoder ensemble to determine causal telemetry event pairs.
FIG. 9 depicts an example computer system with an event causality identification engine.

DESCRIPTION

The description that follows includes example systems, methods, techniques, and program flows that embody aspects of the disclosure. However, it is understood that this disclosure may be practiced without these specific details. For instance, this disclosure refers to identification of causality chains in telemetry metric data via attention mechanism/autoencoder ensembles in illustrative examples. Aspects of this disclosure can be instead applied to causality chain identification with ensembles of attention mechanisms and other unsupervised learning models. In other instances, well-known instruction instances, protocols, structures and techniques have not been shown in detail in order not to obfuscate the description.

Overview

Outlier detection and causal analysis for telemetry metrics across disparate devices, networks, programs, and other sources rely heavily on domain-level expertise.
To perform both outlier detection and causal analysis for telemetry events, and subsequently identify root causes for outlier telemetry events, an automated machine-learning framework is presented herein. A data center collects telemetry event logs across a cloud of sources. An event causality identification engine identifies telemetry metrics in the telemetry events and uses pretrained telemetry metric outlier models to detect outlier telemetry events. After outlier telemetry event detection, the event causality identification engine deploys an event causality model to identify telemetry event causality chains. The event causality identification engine determines, for each detected outlier telemetry event, sets of candidate causal telemetry events that may have caused the outlier telemetry event. An autoencoder/attention mechanism ensemble is trained on telemetry metric values for pairs of telemetry metrics to detect causal relationships between telemetry events across each pair of telemetry metrics. The attention mechanism/autoencoder ensemble is configured to identify specific events in each set of candidate c