
US-20260127413-A1 - AUTOMATED SETUP AND COMMUNICATION COORDINATION FOR TRAINING AND UTILIZING MASSIVELY PARALLEL NEURAL NETWORKS


Abstract

A method is disclosed for training and utilizing massively parallel neural networks. A distributed computing system may be configured to perform various operations. The distributed computing system may divide, among a plurality of nodes, a directed acyclic graph (“DAG”) that comprises a plurality of vertices linked in pairwise relationships via a plurality of edges. Each node may comprise a computing device. The distributed computing system may provide, to each of the vertices of the DAG, a map of the DAG that describes a flow of data through the vertices. The distributed computing system may perform a topological sort of the vertices of the DAG and may traverse the DAG.

Inventors

  • Bradley David Safnuk

Assignees

  • FORD GLOBAL TECHNOLOGIES, LLC

Dates

Publication Date
2026-05-07
Application Date
2026-01-06

Claims (20)

  1. A method, comprising: defining a flow of data through a plurality of worker nodes as a directed acyclic graph (DAG); transmitting, to each worker node of the plurality of worker nodes, information describing: an overall structure of a particular DAG that is either an original DAG or a cloned DAG, and data dependencies between vertices connected by edges of the particular DAG, wherein each vertex of the vertices corresponds to a computation assigned to a worker node of the plurality of worker nodes; determining, at each worker node, a deterministic execution order for traversing the particular DAG based on the overall structure; and executing, by each worker node of the plurality of worker nodes, one or more of the vertices assigned to the worker node in accordance with the deterministic execution order.
  2. The method of claim 1, further comprising transmitting, during or after executing the one or more vertices, tensor data between a subset of worker nodes of the plurality of worker nodes without prior knowledge of a size, rank, shape, or data type of the tensor data.
  3. The method of claim 2, wherein transmitting the tensor data comprises: transmitting metadata specifying the rank, the shape, and the data type of the tensor data; and transmitting the tensor data only after the metadata has been transmitted.
  4. The method of claim 1, further comprising identifying an edge of the edges, and inserting a data-exchange vertex at the edge to produce the cloned DAG, wherein transmitting the information is based on identifying the edge.
  5. The method of claim 4, wherein the data-exchange vertex comprises a recursive directed acyclic graph including a send vertex and a receive vertex that execute in coordination to synchronize communication between the worker nodes.
  6. The method of claim 1, wherein the particular DAG is the cloned DAG and is derived from the original DAG, and the method further comprises generating the cloned DAG by replicating vertices of the original DAG and assigning the replicated vertices to different worker nodes of the plurality of worker nodes.
  7. The method of claim 6, further comprising synchronizing parameters between the original DAG and the cloned DAG by aggregating gradient information associated with corresponding vertices of the original DAG and the cloned DAG.
  8. The method of claim 7, wherein aggregating the gradient information comprises performing an all-reduce operation in the deterministic execution order.
  9. The method of claim 1, wherein an entry vertex of the particular DAG is configured to receive input data, and the method further comprises signaling worker nodes, other than a worker node of the plurality of worker nodes to which the entry vertex is assigned, regarding an iteration state of the input data.
  10. The method of claim 9, wherein the signaling comprises traversing a subordinate directed acyclic graph to indicate whether one or more additional data iterations remain.
  11. A system, comprising: one or more memories; one or more processors, coupled to the one or more memories, configured to: define a flow of data through a plurality of worker nodes as a directed acyclic graph (DAG); transmit, to each worker node of the plurality of worker nodes, information describing: an overall structure of a particular DAG that is either an original DAG or a cloned DAG, and data dependencies between vertices connected by edges of the particular DAG, wherein each vertex of the vertices corresponds to a computation assigned to a worker node of the plurality of worker nodes; determine, at each worker node, a deterministic execution order for traversing the particular DAG based on the overall structure; and execute, by each worker node of the plurality of worker nodes, one or more of the vertices assigned to the worker node in accordance with the deterministic execution order.
  12. The system of claim 11, wherein the one or more processors are further configured to transmit, during or after executing the one or more vertices, tensor data between a subset of worker nodes of the plurality of worker nodes without prior knowledge of a size, rank, shape, or data type of the tensor data.
  13. The system of claim 11, wherein the one or more processors are further configured to identify an edge of the edges, and insert a data-exchange vertex at the edge to produce the cloned DAG, wherein transmitting the information is based on identifying the edge.
  14. The system of claim 11, wherein the particular DAG is the cloned DAG and is derived from the original DAG, and the one or more processors are further configured to generate the cloned DAG by replicating vertices of the original DAG and assigning the replicated vertices to different worker nodes of the plurality of worker nodes.
  15. The system of claim 11, wherein an entry vertex of the particular DAG is configured to receive input data, and the one or more processors are further configured to signal worker nodes, other than a worker node of the plurality of worker nodes to which the entry vertex is assigned, regarding an iteration state of the input data.
  16. A non-transitory computer-readable medium storing a set of instructions, the set of instructions comprising: one or more instructions that, when executed by one or more processors of a wireless node, cause the wireless node to: define a flow of data through a plurality of worker nodes as a directed acyclic graph (DAG); transmit, to each worker node of the plurality of worker nodes, information describing: an overall structure of a particular DAG that is either an original DAG or a cloned DAG, and data dependencies between vertices connected by edges of the particular DAG, wherein each vertex of the vertices corresponds to a computation assigned to a worker node of the plurality of worker nodes; determine, at each worker node, a deterministic execution order for traversing the particular DAG based on the overall structure; and execute, by each worker node of the plurality of worker nodes, one or more of the vertices assigned to the worker node in accordance with the deterministic execution order.
  17. The non-transitory computer-readable medium of claim 16, wherein the one or more instructions, when executed by the one or more processors of the wireless node, further cause the wireless node to transmit, during or after executing the one or more vertices, tensor data between a subset of worker nodes of the plurality of worker nodes without prior knowledge of a size, rank, shape, or data type of the tensor data.
  18. The non-transitory computer-readable medium of claim 16, wherein the one or more instructions, when executed by the one or more processors of the wireless node, further cause the wireless node to identify an edge of the edges, and insert a data-exchange vertex at the edge to produce the cloned DAG, wherein transmitting the information is based on identifying the edge.
  19. The non-transitory computer-readable medium of claim 16, wherein the particular DAG is the cloned DAG and is derived from the original DAG, and the one or more instructions, when executed by the one or more processors of the wireless node, further cause the wireless node to generate the cloned DAG by replicating vertices of the original DAG and assigning the replicated vertices to different worker nodes of the plurality of worker nodes.
  20. The non-transitory computer-readable medium of claim 16, wherein an entry vertex of the particular DAG is configured to receive input data, and the one or more instructions, when executed by the one or more processors of the wireless node, further cause the wireless node to signal worker nodes, other than a worker node of the plurality of worker nodes to which the entry vertex is assigned, regarding an iteration state of the input data.
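
As a concrete picture only (this is not the applicant's implementation), the metadata-first exchange recited in claims 2-3, 12, and 17 might look like the following Python sketch: the sender transmits the tensor's shape and data type first, so the receiver can allocate a correctly sized buffer before the payload arrives. The names send_tensor, recv_exact, and recv_tensor are hypothetical.

```python
# Illustrative sketch of a metadata-first tensor handshake; the rank is
# carried implicitly as the length of the shape list.
import json
import socket
import struct

import numpy as np

def send_tensor(sock: socket.socket, tensor: np.ndarray) -> None:
    # 1) Metadata first: shape and dtype as a length-prefixed JSON blob.
    meta = json.dumps({"shape": tensor.shape, "dtype": str(tensor.dtype)}).encode()
    sock.sendall(struct.pack("!I", len(meta)) + meta)
    # 2) Payload second, only after the metadata is on the wire (claim 3).
    sock.sendall(tensor.tobytes())

def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes or raise if the peer closes early."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed mid-transfer")
        buf += chunk
    return buf

def recv_tensor(sock: socket.socket) -> np.ndarray:
    meta_len = struct.unpack("!I", recv_exact(sock, 4))[0]
    meta = json.loads(recv_exact(sock, meta_len))
    shape, dtype = tuple(meta["shape"]), np.dtype(meta["dtype"])
    payload = recv_exact(sock, int(np.prod(shape)) * dtype.itemsize)
    return np.frombuffer(payload, dtype=dtype).reshape(shape)

if __name__ == "__main__":
    a, b = socket.socketpair()
    send_tensor(a, np.arange(6, dtype=np.float32).reshape(2, 3))
    print(recv_tensor(b))  # [[0. 1. 2.] [3. 4. 5.]]
```

A production system would presumably add framing, error handling, and a real transport (for example MPI or gRPC) rather than a raw socket pair.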

Description

RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 17/072,709, filed Oct. 16, 2020, which is incorporated herein by reference in its entirety.

BACKGROUND

Computer use increasingly dominates engineering design. This has improved design quality, but new design challenges are stretching the limits of current computing systems. For example, a computer can be used to simulate airflow over the exterior body of a car. Generating a useful simulation requires massive amounts of input data and calculation, and because the relationship between the input data and the outputs may be complex, the computational load for such a simulation may also be massive. Such difficulties can be addressed via distributed computing.

Currently, distributed computing operates according to a master-slave model in which one node maintains an overview of, and control over, computing operations performed by slaves. The slaves execute operations upon receiving instruction from the master but have no overview or knowledge of operations performed by other slaves, alone or in aggregate. Such master-slave configurations may have considerable benefits, but they also have drawbacks. Specifically, such configurations may rely on heavy user involvement in programming the operation of the slaves and in programming the master's control of the slaves. This near-custom programming may limit the flexibility of computing according to a master-slave model, as any change in the configuration or operation of one or several nodes in the distributed computing network may require re-programming. In light of growing computing demands and the limitations of current methods of distributed computing, new and improved methods of distributed computing, and specifically of distributed training, may be desired.

SUMMARY

In some embodiments, a method is disclosed for training and utilizing massively parallel neural networks. A distributed computing system may be configured to perform various operations. The distributed computing system may divide, among a plurality of nodes, a directed acyclic graph (“DAG”) that comprises a plurality of vertices linked in pairwise relationships via a plurality of edges. Each node may comprise a computing device. The distributed computing system may provide, to each of the vertices of the DAG, a map of the DAG that describes a flow of data through the vertices. The distributed computing system may perform a topological sort of the vertices of the DAG and may traverse the DAG.

In some embodiments, the distributed computing system can create at least one clone DAG identical to the DAG and/or to a portion of the DAG, the clone DAG comprising a plurality of clone vertices; identify a corresponding vertex in the DAG for each of the clone vertices; calculate aggregate gradient data based on gradient data from each of the clone vertices and its corresponding vertex in the DAG during training of the DAG and the clone DAG; and update at least one weight of the DAG and the clone DAG based on the aggregate gradient data.
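
As an informal illustration of the deterministic traversal described above (not code from the application), each worker can receive the same DAG map, independently run the same tie-broken topological sort, and execute only the vertices assigned to it; because the sort is deterministic, every node agrees on the global order without a master coordinating them. The names topo_order and run_assigned are hypothetical.

```python
# Sketch of a deterministic topological sort (Kahn's algorithm with a heap
# as a lexicographic tie-breaker) plus per-worker execution of assigned vertices.
import heapq

def topo_order(dag):
    """dag: {vertex: [successors]}. Every worker derives the identical order."""
    indeg = {v: 0 for v in dag}
    for succs in dag.values():
        for s in succs:
            indeg[s] += 1
    heap = [v for v, d in indeg.items() if d == 0]
    heapq.heapify(heap)  # heap pops smallest first, making ties deterministic
    order = []
    while heap:
        v = heapq.heappop(heap)
        order.append(v)
        for s in dag[v]:
            indeg[s] -= 1
            if indeg[s] == 0:
                heapq.heappush(heap, s)
    if len(order) != len(dag):
        raise ValueError("graph has a cycle; not a DAG")
    return order

def run_assigned(dag, assignment, my_node, compute):
    """Walk the shared order, executing only this worker's vertices."""
    for v in topo_order(dag):
        if assignment[v] == my_node:
            compute(v)

# Example: every node independently derives ['input', 'hidden', 'loss'].
dag = {"input": ["hidden"], "hidden": ["loss"], "loss": []}
assignment = {"input": 0, "hidden": 1, "loss": 1}
run_assigned(dag, assignment, my_node=1, compute=lambda v: print("exec", v))
```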
In some embodiments, one of the plurality of vertices of the DAG can be an entry vertex, and in some embodiments, the distributed computing system can identify the nodes underlying the DAG; generate a subordinate DAG in the entry vertex, the subordinate DAG including a plurality of subordinate vertices, each of the plurality of subordinate vertices corresponding to one of the nodes underlying the DAG; receive data and metadata at the entry vertex; deliver the data to a next vertex in the DAG; and communicate the metadata to the nodes underlying the DAG via the subordinate DAG.

In other embodiments, a system is disclosed for training and utilizing massively parallel neural networks. The system may include one or more processors and one or more memories storing computer-executable instructions that, when executed by the one or more processors, configure the one or more processors to perform various operations. The one or more processors may divide, among a plurality of nodes, a directed acyclic graph (“DAG”) that comprises a plurality of vertices linked in pairwise relationships via a plurality of edges. Each node may comprise a computing device. The one or more processors may provide, to each of the vertices of the DAG, a map of the DAG that describes a flow of data through the vertices. The one or more processors may perform a topological sort of the vertices of the DAG and may traverse the DAG.

In some embodiments, a method is disclosed for training and utilizing cloned neural networks. A computing system may be configured to perform various operations. The computing system may identify a directed acyclic graph (“DAG”), which DAG can include a plurality of vertices linked in pairwise relationships via a plurality of edges. The computing system can create at least one clone DAG identical to the DAG and/or identical to a portion of the DAG.
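
The clone-DAG parameter synchronization described above (and recited in claims 7-8) can be pictured with the following minimal sketch, again under assumed names (allreduce_mean, sync_step): gradients from corresponding vertices of the original DAG and its clones are averaged, and every replica applies the identical update, keeping all copies of the network in lockstep. A real system would use a collective all-reduce primitive rather than an in-process average.

```python
# Sketch of gradient aggregation across an original vertex and its clones.
import numpy as np

def allreduce_mean(grads):
    """Stand-in for a real all-reduce: average gradients across replicas."""
    return sum(grads) / len(grads)

def sync_step(replica_weights, replica_grads, lr=0.1):
    g = allreduce_mean(replica_grads)             # aggregate gradient data
    return [w - lr * g for w in replica_weights]  # identical update everywhere

# Two clones of one vertex, each with local gradients from its own data shard.
weights = [np.ones(3), np.ones(3)]
grads = [np.array([0.2, 0.0, 0.4]), np.array([0.0, 0.2, 0.0])]
weights = sync_step(weights, grads)
print(weights[0], weights[1])  # both replicas now hold the same weights
```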