US-12619518-B2 - Automated root cause identification using data flow analysis of plural execution traces

US12619518B2US 12619518 B2US12619518 B2US 12619518B2US-12619518-B2

Abstract

Automated root cause identification using data flow analysis of plural execution traces. A computer system generates data flow dependency graphs from first and second execution traces an entity. These graphs represent input/output data flows of corresponding executions of the entity. The computer system generates topological sortings of those graphs and identifies output pairings across these graphs based on outputs having common labels and topological correspondence. The computer system identifies output pairing(s) that are mismatched as having different values and, for at least one mismatched output pairing, traverses the graphs in order to identify input pairing(s) that are topological root(s) to the mismatched output pairing(s) and that are causal to the mismatch(es). Each input pairing comprises inputs that have a common label, a common topological correspondence, and mismatched values. The computer system returns these input pairings as a root cause for at least one difference between first and second execution traces.

Inventors

Jordi Mola

Assignees

MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date: 20260505
Application Date: 20240611
Priority Date: 20210506

Claims (18)

1 . A computer system for automated root cause identification using data flow analysis of plural execution traces, comprising: a processor; a memory; and a computer storage medium that stores computer-executable instructions that are executable by the processor to at least: based on accessing a first execution trace that records execution of an entity, (i) generate a first data flow dependency graph within the memory, the first data flow dependency graph representing a first data flow from a first set of one or more entity inputs to a first set of one or more entity outputs through a first set of one or more activities, each entity input in the first set of entity inputs and each entity output in the first set of entity outputs having a corresponding value and a corresponding label; and (ii) generate a first topological sorting for the first data flow dependency graph; based on accessing a second execution trace that records execution of the entity, (i) generate a second data flow dependency graph within the memory, the second data flow dependency graph representing a second data flow from a second set of one or more entity inputs to a second set of one or more entity outputs through a second set of one or more activities, each entity input in the second set of entity inputs and each entity output in the second set of entity outputs having a corresponding value and a corresponding label; and (ii) generate a second topological sorting for the second data flow dependency graph; identify a set of one or more entity output pairings, each entity output pairing in the set comprising (i) a corresponding first entity output from the first set of entity outputs, and (ii) a corresponding second entity output from the second set of entity outputs, and in which the corresponding first entity output and the corresponding second entity output in each entity output pairing have a common label and a common topological correspondence, wherein the topological correspondence is based on the generated topological sortings; identify a set of one or more mismatched entity output pairings from the set of entity output pairings, each mismatched entity output pairing in the set being mismatched due at least to a corresponding first value of its corresponding first entity output differing from a corresponding second value of its corresponding second entity output; prune at least one of the first data flow dependency graph or the second data flow dependency graph, including performing at least one of: removing at least one entity input in the data flow dependency graph that lacks any data flow path to an entity output in the data flow dependency graph that is part of one of the set of mismatched entity output pairings; removing at least one entity output in the data flow dependency graph that is part of one of the set of entity output pairings, and that is a matched entity output pairing due at least to a corresponding first value of its corresponding first entity output matching a corresponding second value of its corresponding second entity output; or removing at least one entity output in the data flow dependency graph that is part of one of the set of one or more mismatched entity output pairings, and whose mismatch is fully attributable to one or more ancestor entity outputs that are also part of the set of one or more mismatched entity output pairings; after pruning at least one of the first data flow dependency graph or the second data flow dependency graph, for at least one mismatched entity output pairing, traverse each of the first data flow dependency graph and the second data flow dependency graph toward a root of each data flow dependency graph, and identify a set of one or more entity input pairings as a topological root to the least one mismatched entity output pairing that is causal to the mismatch of the least one mismatched entity output pairing, each entity input pairing in the set comprising (i) a corresponding first entity input from the first set of entity inputs, and (ii) a corresponding second entity inputs from the second set of entity inputs, and in which the corresponding first entity inputs and the corresponding second entity inputs in each entity inputs pairing have (i) a common label, a (ii) common topological correspondence, and mismatched values; and return the set of entity input pairings as a root cause for at least one difference between the first execution trace and the second execution trace.
2 . The computer system of claim 1 , wherein each data flow dependency graph is a directed acyclic graph in which each directed edge connects a corresponding first vertex representing either an input or an output and a corresponding second vertex representing a corresponding activity, and in which, when the directed edge is directed from the corresponding first vertex to the corresponding second vertex, the corresponding first vertex represents an input to the corresponding activity; or when the directed edge is directed from the corresponding second vertex to the corresponding first vertex, the corresponding first vertex represents an output from the corresponding activity.
3 . The computer system of claim 1 , wherein each data flow dependency graph is a directed acyclic graph, in which each directed edge is directed from a corresponding first vertex representing a corresponding first activity to a corresponding second vertex representing a corresponding second activity, and in which the directed edge represents an output from the corresponding first activity and an input to the corresponding second activity.
4 . The computer system of claim 1 , wherein, the computer-executable instructions are also executable by the processor to: identify the first set of entity inputs and the first set of entity outputs from the first execution trace; and identify the second set of entity inputs and the second set of entity outputs from the second execution trace, and identifying at least one entity input or at least one entity output is based on identifying at least one of, an annotation describing a system call; or a region of shared memory.
5 . The computer system of claim 1 , wherein, the first execution trace records execution of a process, and the first set of entity inputs are one or more data values received from one or more system calls, and the first set of entity outputs are one or more data values passed to the one or more system calls; or the first execution trace records execution of a function, and the first set of entity inputs are one or more data values read by the function that the function did not produce, and the first set of entity outputs are one or more data values written by the function to memory that is not deallocated by the function.
6 . The computer system of claim 1 , wherein each of the first set of entity inputs and the second set of entity inputs comprises at least one of: (i) a first input comprising first data produced by a system call that was initiated by the entity; (ii) a second input comprising second data read by the entity from shared memory; (iii) a third input comprising third data read by the entity from an execution context associated with the entity, the third data having been placed into the execution context by an external entity; (iv) a fourth input comprising fourth data read by the entity from a memory-mapped input/output location; or (v) a fifth input comprising fifth data read by the entity from an executable binary.
7 . The computer system of claim 1 , wherein each of the first set of entity outputs and the second set of entity outputs comprises at least one of: (i) a first output comprising first data input by the entity to a system call; or (ii) a second output comprising second data written by the entity to shared memory.
8 . The computer system of claim 1 , wherein each entity input and each entity output also has a corresponding address.
9 . The computer system of claim 1 , wherein, for at least one entity input or at least one entity output, the corresponding label comprises at least one of: a file name, a file offset, a registry key, an object name, a system call name, or a parameter name.
10 . The computer system of claim 1 , wherein the first entity outputs and the second entity outputs in each entity output pairing are associated with a common system call.
11 . The computer system of claim 1 , wherein the first topological sorting for the first data flow dependency graph is generated based on the first set of entity outputs, and the second topological sorting for the second data flow dependency graph is generated based on the second set of entity outputs.
12 . The computer system of claim 1 , wherein the computer-executable instructions are also executable by the processor to return, for at least one of the set of entity input pairings, a system call chain to reach one or more of its corresponding first entity input, or its corresponding second entity input.
13 . The computer system of claim 1 , wherein the computer-executable instructions are also executable by the processor to return, for at least one of the set of entity input pairings, an indication of a corresponding first value of its corresponding first entity input, and of a corresponding second value of its corresponding second entity input.
14 . The computer system of claim 1 , wherein the computer-executable instructions are also executable by the processor to return, for at least one of the set of entity input pairings, an indication of a corresponding first label or first address of its corresponding first entity input, and of a corresponding second label or second address of its corresponding second entity input.
15 . A hardware storage device that stores computer-executable instructions that are executable by a processor to perform an automated root cause identification using data flow analysis of plural execution traces, the computer-executable instructions being executable by the processor at least: based on accessing a first execution trace that records execution of an entity, (i) generate a first data flow dependency graph within a memory, the first data flow dependency graph representing a first data flow from a first set of one or more entity inputs to a first set of one or more entity outputs through a first set of one or more activities, each entity input in the first set of entity inputs and each entity output in the first set of entity outputs having a corresponding value and a corresponding label; and (ii) generate a first topological sorting for the first data flow dependency graph; based on accessing a second execution trace that records execution of the entity, (i) generate a second data flow dependency graph within the memory, the second data flow dependency graph representing a second data flow from a second set of one or more entity inputs to a second set of one or more entity outputs through a second set of one or more activities, each entity input in the second set of entity inputs and each entity output in the second set of entity outputs having a corresponding value and a corresponding label; and (ii) generate a second topological sorting for the second data flow dependency graph; identify a set of one or more entity output pairings, each entity output pairing in the set comprising (i) a corresponding first entity output from the first set of entity outputs, and (ii) a corresponding second entity output from the second set of entity outputs, and in which the corresponding first entity output and the corresponding second entity output in each entity output pairing have a common label and a common topological correspondence, wherein the topological correspondence is based on the generated topological sortings; identify a set of one or more mismatched entity output pairings from the set of entity output pairings, each mismatched entity output pairing in the set being mismatched due at least to a corresponding first value of its corresponding first entity output differing from a corresponding second value of its corresponding second entity output; prune at least one of the first data flow dependency graph or the second data flow dependency graph, including performing at least one of: removing at least one entity input in the data flow dependency graph that lacks any data flow path to an entity output in the data flow dependency graph that is part of one of the set of mismatched entity output pairings; removing at least one entity output in the data flow dependency graph that is part of one of the set of entity output pairings, and that is a matched entity output pairing due at least to a corresponding first value of its corresponding first entity output matching a corresponding second value of its corresponding second entity output; or removing at least one entity output in the data flow dependency graph that is part of one of the set of one or more mismatched entity output pairings, and whose mismatch is fully attributable to one or more ancestor entity outputs that are also part of the set of one or more mismatched entity output pairings; after pruning at least one of the first data flow dependency graph or the second data flow dependency graph, for at least one mismatched entity output pairing, traverse each of the first data flow dependency graph and the second data flow dependency graph toward a root of each data flow dependency graph, and identify a set of one or more entity input pairings as a topological root to the least one mismatched entity output pairing that is causal to the mismatch of the least one mismatched entity output pairing, each entity input pairing in the set comprising (i) a corresponding first entity input from the first set of entity inputs, and (ii) a corresponding second entity inputs from the second set of entity inputs, and in which the corresponding first entity inputs and the corresponding second entity inputs in each entity inputs pairing have (i) a common label, a (ii) common topological correspondence, and mismatched values; and return the set of entity input pairings as a root cause for at least one difference between the first execution trace and the second execution trace.
16 . The hardware storage device of claim 15 , wherein each data flow dependency graph is a directed acyclic graph in which each directed edge connects a corresponding first vertex representing either an input or an output and a corresponding second vertex representing a corresponding activity, and in which, when the directed edge is directed from the corresponding first vertex to the corresponding second vertex, the corresponding first vertex represents an input to the corresponding activity; or when the directed edge is directed from the corresponding second vertex to the corresponding first vertex, the corresponding first vertex represents an output from the corresponding activity.
17 . The hardware storage device of claim 15 , wherein each data flow dependency graph is a directed acyclic graph, in which each directed edge is directed from a corresponding first vertex representing a corresponding first activity to a corresponding second vertex representing a corresponding second activity, and in which the directed edge represents an output from the corresponding first activity and an input to the corresponding second activity.
18 . The hardware storage device of claim 15 , wherein, the computer-executable instructions are also executable by the processor to: identify the first set of entity inputs and the first set of entity outputs from the first execution trace; and identify the second set of entity inputs and the second set of entity outputs from the second execution trace, and identifying at least one entity input or at least one entity output is based on identifying at least one of, an annotation describing a system call; or a region of shared memory.

Description

CROSS-REFERENCE TO RELATED APPLICATIONS The present application is a continuation of U.S. application Ser. No. 18/557,425, filed Oct. 26, 2023, and entitled “AUTOMATED ROOT CAUSE IDENTIFICATION USING DATA FLOW ANALYSIS OF PLURAL EXECUTION TRACES,” which is a U.S. National Stage of International Application No. PCT/US2022/072038, filed on May 2, 2022, designating the United States and claiming the priority of Luxembourg Patent Application No. LU500132 filed with the Luxembourg Intellectual Property Office on May 6, 2021. All of the aforementioned applications are incorporated herein in their respective entirety by this reference. TECHNICAL FIELD The present disclosure relates to systems, methods, and devices that analyze replayable execution traces for identification of execution behaviors. BACKGROUND Tracking down and correcting undesired software behaviors is a core activity in software development. Undesired software behaviors can include many things, such as execution crashes, runtime exceptions, slow execution performance, incorrect data results, data corruption, and the like. Undesired software behaviors are triggered by a vast variety of factors such as data inputs, user inputs, race conditions (e.g., when accessing shared resources), etc. Given the variety of triggers, undesired software behaviors are often rare and seemingly random, and extremely difficult to reproduce. As such, it is often very time-consuming and difficult for a developer to identify a given undesired software behavior. Once an undesired software behavior has been identified, it is again often time-consuming and difficult to determine its root cause (or causes). Developers use a variety of approaches to identify undesired software behaviors, and to then identify one or more locations in an application's code that cause the undesired software behavior. For example, developers often test different portions of an application's code against different inputs (e.g., unit testing). As another example, developers often reason about execution of an application's code in a debugger (e.g., by setting breakpoints/watchpoints, by stepping through lines of code, etc. as the code executes). As another example, developers often observe code execution behaviors (e.g., timing, coverage) in a profiler. As another example, developers often insert diagnostic code (e.g., trace statements) into the application's code. While conventional diagnostic tools (e.g., debuggers, profilers, etc.) have operated on “live” forward-executing code, an emerging form of diagnostic tools enable “historic” debugging (also referred to as “time travel” or “reverse” debugging), in which the execution of at least a portion of an execution context is recorded into one or more trace files (i.e., an execution trace). Using some tracing techniques, an execution trace contains “bit-accurate” historic execution trace data, which enables any recorded portion the traced execution context to be virtually “replayed” (e.g., via emulation) down to the granularity of individual instructions (e.g., machine code instructions, intermediate language code instructions, etc.). Thus, using bit-accurate trace data, diagnostic tools enable developers to reason about a recorded prior execution of a subject execution context, as opposed to conventional debugging which is limited to a “live” forward execution. For example, using replayable execution traces, some historic debuggers provide user experiences that enable both forward and reverse breakpoints/watchpoints, that enable code to be stepped through both forwards and backwards, etc. Some historic profilers, on the other hand, are able to derive code execution behaviors (e.g., timing, coverage) from prior-executed code. Since modern processors commonly execute at the rate of tens—to hundreds—of thousands of MIPS (millions of instructions per second), replayable execution traces of executing code can generate vast amounts of information, even if mere fractions of a second are captured. As such, analyzing and presenting information about execution traces can consume considerable computing resources (e.g., processor and memory resources). BRIEF SUMMARY At least some embodiments described herein are directed to automated root cause identification using data flow analysis of plural execution traces. In particular, embodiments use a data flow analysis of plural execution traces of execution of an entity (e.g., a thread or a process that previously executed at a computer processor) to identify which input (or inputs) into the entity caused it to produce different execution results (i.e., a divergence of outputs from the entity) between those execution traces. These embodiments operate by analyzing plural execution traces to identify the inputs to, and outputs from, different prior executions of an entity (e.g., based on interactions by a process with system calls and/or shared memory). For each of those prior executions, these embodiments generate a cor