US-12625688-B2 - Optimized execution of cells in directed acyclic graph-driven notebook environment
Abstract
A system performs optimized execution of cells of a notebook by pruning certain cells from execution while evaluating a particular cell, even though the particular cell depends on the pruned cells. The system generates a directed acyclic graph. The system transforms code of a target cell to include code of one or more source cells. The system receives a request to execute at least a portion of the notebook comprising a cell from the sequence of cells. The system identifies a subset of cells of the sequence of cells for execution based on the directed acyclic graph. The system determines based on various factors whether a source cell can be excluded from execution of the notebook based on properties of the source cell. If the system determines that the source cell can be excluded, the system executes the subset of cells without the source cell.
Inventors
- Glen Takahashi
- Caitlin Royden Colgrove
- Dylan McCoy Scott
Assignees
- HEX TECHNOLOGIES INC.
Dates
- Publication Date
- 20260512
- Application Date
- 20240329
Claims (20)
- 1 . A method comprising: receiving via a user interface, a sequence of cells of a notebook, wherein each cell of the sequence of cells comprises code; generating a directed acyclic graph comprising cells of the sequence of cells connected by edges, wherein the directed acyclic graph includes an edge from a first cell to a second cell if the second cell depends on execution of the first cell; transforming code of a target cell to include code of one or more source cells; receiving a request to execute at least a portion of the notebook comprising a cell from the sequence of cells; identifying a subset of cells of the sequence of cells for execution based on the directed acyclic graph, the subset of cells including precedent cells from which the cell depends; determining whether a source cell can be excluded from execution of the notebook based on properties of the source cell; and responsive to determining that the source cell can be excluded, executing the subset of cells without the source cell.
- 2 . The method of claim 1 , further comprising: receiving a request to execute a particular cell from the sequence of cells; traversing the directed acyclic graph to identify a subset of cells on which the execution of the particular cell depends; executing the subset of cells based on an order specified by the directed acyclic graph; and executing the particular cell based on values determined based on execution of the subset of cells.
- 3 . The method of claim 1 , wherein the source cell is a first source cell, the method further comprising: determining that a second source cell cannot be excluded from execution of the notebook if the second source cell generates a dataframe processed by a second target cell.
- 4 . The method of claim 3 , wherein the second target cell includes code in a programming language, wherein the code processes the dataframe.
- 5 . The method of claim 1 , wherein the source cell is a first source cell, the method further comprising: determining that a second source cell cannot be excluded from execution of the notebook if the second source cell includes code that performs a mutation operation.
- 6 . The method of claim 1 , further comprising: responsive to transforming code of a target cell optimizing the transformed code, such that execution of the target cell is more efficient compared to execution of the target cell prior to transformation.
- 7 . The method of claim 1 , wherein executing the subset of cells comprises: executing the subset of cells in an order determined by the directed acyclic graph, wherein a plurality of cells is executed in parallel responsive to each cell of the plurality of cells being unreachable from remaining of the plurality of cells by following edges of the directed acyclic graph.
- 8 . The method of claim 1 , wherein generating the directed acyclic graph comprises: identifying a subsequence of cells belonging to the sequence of cells; determining that a particular cell from the subsequence is unsafe for parallelization; and introducing at least one of (1) an edge from a cell of the subsequence that occurs before the particular cell to the particular cell, or 2) an edge from the particular cell to a cell of the subsequence that occurs after the particular cell.
- 9 . A non-transitory computer-readable medium comprising memory with instructions encoded thereon that, when executed, cause one or more computer processors to: receive via a user interface, a sequence of cells of a notebook, wherein each cell of the sequence of cells comprises code; generate a directed acyclic graph comprising cells of the sequence of cells connected by edges, wherein the directed acyclic graph includes an edge from a first cell to a second cell if the second cell depends on execution of the first cell; transform code of a target cell to include code of one or more source cells; receive a request to execute at least a portion of the notebook comprising a cell from the sequence of cells; identify a set of cells of the sequence of cells for execution based on the directed acyclic graph, the set of cells including precedent cells from which the cell depends; determine whether a source cell can be excluded from execution of the notebook based on properties of the source cell; and responsive to determining that the source cell can be excluded, execute the set of cells without the source cell.
- 10 . The non-transitory computer-readable medium of claim 9 , wherein the instructions further cause the one or more computer processors to: receive a request to execute a particular cell from the sequence of cells; traverse the directed acyclic graph to identify a subset of cells on which the execution of the particular cell depends; execute the subset of cells based on an order specified by the directed acyclic graph; and execute the particular cell based on values determined based on execution of the subset of cells.
- 11 . The non-transitory computer-readable medium of claim 9 , wherein the source cell is a first source cell, wherein the instructions further cause the one or more computer processors to: determine that a second source cell cannot be excluded from execution of the notebook if at least one of following is true: the second source cell generates a dataframe processed by a second target cell; or the second source cell includes code that performs a mutation operation.
- 12 . The non-transitory computer-readable medium of claim 9 , wherein the instructions further cause the one or more computer processors to: responsive to transforming code of a target cell, optimize the transformed code, such that execution of the target cell is more efficient compared to execution of the target cell prior to transformation.
- 13 . The non-transitory computer-readable medium of claim 9 , wherein instructions to execute the set of cells further cause the one or more computer processors to: execute the set of cells in an order determined by the directed acyclic graph, wherein a plurality of cells is executed in parallel responsive to each cell of the plurality of cells being unreachable from remaining of the plurality of cells by following edges of the directed acyclic graph.
- 14 . The non-transitory computer-readable medium of claim 9 , wherein instructions to generate the directed acyclic graph further cause the one or more computer processors to: identify a subsequence of cells belonging to the sequence of cells; determine that a particular cell from the subsequence is unsafe for parallelization; and introduce at least one of (1) an edge from a cell of the subsequence that occurs before the particular cell to the particular cell, or 2) an edge from the particular cell to a cell of the subsequence that occurs after the particular cell.
- 15 . A computer system comprising: one or more computer processors; and a non-transitory computer-readable medium comprising memory with instructions encoded thereon that, when executed, cause the one or more computer processors to: receive via a user interface, a sequence of cells of a notebook, wherein each cell of the sequence of cells comprises code; generate a directed acyclic graph comprising cells of the sequence of cells connected by edges, wherein the directed acyclic graph includes an edge from a first cell to a second cell if the second cell depends on execution of the first cell; transform code of a target cell to include code of one or more source cells; receive a request to execute at least a portion of the notebook comprising a cell from the sequence of cells; identify a set of cells of the sequence of cells for execution based on the directed acyclic graph, the set of cells including precedent cells from which the cell depends; determine whether a source cell can be excluded from execution of the notebook based on properties of the source cell; and responsive to determining that the source cell can be excluded, execute the set of cells without the source cell.
- 16 . The computer system of claim 15 , wherein the instructions further cause the one or more computer processors to: receive a request to execute a particular cell from the sequence of cells; traverse the directed acyclic graph to identify a subset of cells on which the execution of the particular cell depends; execute the subset of cells based on an order specified by the directed acyclic graph; and execute the particular cell based on values determined based on execution of the subset of cells.
- 17 . The computer system of claim 15 , wherein the source cell is a first source cell, wherein the instructions further cause the one or more computer processors to: determine that a second source cell cannot be excluded from execution of the notebook if at least one of following is true: the second source cell generates a dataframe processed by a second target cell; or the second source cell includes code that performs a mutation operation.
- 18 . The computer system of claim 15 , wherein the instructions further cause the one or more computer processors to: responsive to transforming code of a target cell optimize the transformed code, such that execution of the target cell is more efficient compared to execution of the target cell prior to transformation.
- 19 . The computer system of claim 15 , wherein instructions to execute the set of cells further cause the one or more computer processors to: execute the set of cells in an order determined by the directed acyclic graph, wherein a plurality of cells is executed in parallel responsive to each cell of the plurality of cells being unreachable from remaining of the plurality of cells by following edges of the directed acyclic graph.
- 20 . The computer system of claim 15 , wherein instructions to generate the directed acyclic graph further cause the one or more computer processors to: identify a subsequence of cells belonging to the sequence of cells; determine that a particular cell from the subsequence is unsafe for parallelization; and introduce at least one of (1) an edge from a cell of the subsequence that occurs before the particular cell to the particular cell, or 2) an edge from the particular cell to a cell of the subsequence that occurs after the particular cell.
Description
TECHNICAL FIELD Aspects of this disclosure generally relate to execution of cell-driven notebooks and more specifically to parallel execution and optimized execution of cells in a directed acyclic graph driven notebook environment. BACKGROUND A system provides an interactive development environment for users, for example, a web-based interactive computing platform. Users interact with the system using a client device to interactively develop documents including code using programming languages such as Python or database query languages such as the structured query language (SQL). The code provided by user may be complex and may involve computation intensive operations or input/output (I/O) intensive operations such as file read/write operations. Furthermore, the code may cause interactions with external systems such as web services that may be slow or cause the execution to hang indefinitely. A cell of the notebook that has slow performance or blocks on certain call slows the execution of entire notebook. This provides poor user experience and may consume significant processing resources. SUMMARY A system performs parallel execution of cells of a notebook. The system may be an online system or operate in an offline mode. The system receives a sequence of cells of a notebook. The cells of the sequence may include executable code, for example, code specified using programming languages such as Python or database query languages such as the structured query language (SQL). The system generates a directed acyclic graph of nodes and edges, where the nodes correspond to cells and edges represent dependencies between cells. The system generates the directed acyclic graph by identifying dependencies between cells. According to an embodiment, the system identifies a subsequence of cells and determines whether a particular cell from the subsequence is unsafe for parallelization. If the system determines that the particular cell is unsafe for parallelization, the system introduces at least one of (1) an edge from a cell of the subsequence that occurs before the particular cell to the particular cell, or 2) an edge from the particular cell to a cell of the subsequence that occurs after the particular cell. The system executes the cells in an order determined by the directed acyclic graph. The system executes a plurality of cells in parallel if the cells of the plurality of cells are independent of each other, i.e., each cell is unreachable from the remaining cells of the plurality of cells by following edges of the directed acyclic graph. According to an embodiment, a cell is determined unsafe for parallelization if the executable code of the cell comprises one or more of insert, update, delete, or an alter table operation. According to an embodiment, the system stores a list of functions determined to be unsafe for parallelization and determines that the particular cell is unsafe for parallelization if the particular cell invokes a function from the list of functions. According to an embodiment, the system dynamically updates the directed acyclic graph if the user modifies the code of the cells. The system receives a request to modify code of a particular cell. Responsive to the cell being modified, the system determines changes to inputs and outputs of the particular cell based on modification of the particular cell. The system modifies the directed acyclic graph based on the changes to inputs and outputs of the particular cell, the modifying causing one or more of: adding a new edge or deleting an existing edge. If subsequent requests to execute a cell are received the system traverses the modified directed acyclic graph to identify the set of cells for execution of the cell. According to an embodiment, the system performs pruning of cells by excluding certain cells from execution while evaluating a particular cell. The system may transform code of a target cell to include code of one or more source cells. For example, assume that there is an edge from cells C1 and C2 to cell C3 and cells C1 and C2 include database queries Q1 and Q2 respectively, and cell C3 includes a database query Q3 that joins queries Q1 and Q2. The query Q3 may refer to queries Q1 and Q2 by an identifier, for example, a query name. In this case, the system may modify the query Q3 to include the definition (or code) of the queries Q1 and Q2 rather than their identifiers. Accordingly, the code of the queries Q1 and Q2 is in-lined into the query Q3. For several queries, this results in improvement of the performance of the resulting query Q3 since the query compiler is better able to optimize the query since the resulting query includes all the information needed to optimize it. The optimized query executes more efficiently, for example, in terms of computation or storage. According to an embodiment, the user interface displaying the notebook allows users to execute the entire notebook or identify a specific cell for execution. Accordingly, the sy