
US-12619868-B2 - Techniques for combining independent operations in a graph structure


Abstract

Apparatuses, systems, and techniques to combine operations. In at least one embodiment, a processor causes two or more operations in a graph to be combined based, at least in part, on another combination of two or more independent operations.
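As one concrete reading of the abstract: two independent matrix multiplications over the same input can be combined into a single wider multiplication followed by a split, with numerically identical results. A minimal numpy sketch, with shapes chosen arbitrarily for illustration (this is not the patented implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))    # data shared by both operations
w1 = rng.standard_normal((8, 5))   # weights of the first matmul
w2 = rng.standard_normal((8, 3))   # weights of the second matmul

# Two independent operations on the same data.
y1, y2 = x @ w1, x @ w2

# The combined operation: one matmul against the concatenated
# weights, followed by a split that recovers the original outputs.
fused = x @ np.concatenate([w1, w2], axis=1)
z1, z2 = np.split(fused, [w1.shape[1]], axis=1)

assert np.allclose(y1, z1) and np.allclose(y2, z2)
```

The split and concatenation introduced here are themselves operations that later combinations may target, which is the "another combination" aspect of the abstract.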

Inventors

  • Alexander James Collins
  • Vinod Grover

Assignees

  • NVIDIA CORPORATION

Dates

Publication Date
2026-05-05
Application Date
2021-06-29

Claims (20)

  1. A processor, comprising: one or more circuits to: cause two or more portions of one or more graphs to be combined based, at least in part, on whether the two or more portions are to operate independently on the same data; cause a list of operations to be updated based, at least in part, on the combined two or more portions; and cause two or more additional portions to be combined based, at least in part, on the updated list of operations.
  2. The processor of claim 1, wherein the graph is a representation of a machine learning computer program.
  3. The processor of claim 1, wherein the two or more portions include a first independent operation in a first set of nodes of the graph, and a second independent operation in a second set of nodes of the graph, and the one or more circuits cause the first set of nodes to be combined with the second set of nodes according to a combination rule.
  4. The processor of claim 1, wherein the two or more additional portions include one or more operations introduced by combination of the two or more portions.
  5. The processor of claim 1, wherein the two or more portions include two or more matrix multiplication operations.
  6. The processor of claim 1, wherein the two or more portions include two or more convolution operations.
  7. The processor of claim 1, wherein the one or more graphs is a second version of a graph, and the one or more circuits cause the second version of the graph to be generated based, at least in part, on a first version of the graph and a combination of two or more independent operations, and the one or more circuits cause the two or more portions to be combined based, at least in part, on traversing the second version of the graph.
  8. A non-transitory machine-readable medium having stored thereon a set of instructions, which if performed by a processor, cause the processor to at least: combine two or more portions of a graph based, at least in part, on whether the two or more portions are to operate independently on the same data; update a list of operations based, at least in part, on the combined two or more portions; and combine two or more additional portions based, at least in part, on the updated list of operations.
  9. The non-transitory machine-readable medium of claim 8, wherein the graph is a representation of a neural network and the two or more portions include one or more convolution operations.
  10. The non-transitory machine-readable medium of claim 8, wherein the two or more portions include a split operation and a concatenation operation introduced to the graph by the combination of two or more independent operations.
  11. The non-transitory machine-readable medium of claim 8, wherein the instructions, which if performed by the processor, cause the processor to update a worklist based, at least in part, on the combined two or more portions, and combine two or more additional portions based, at least in part, on the updated worklist.
  12. The non-transitory machine-readable medium of claim 11, wherein the two or more portions include two or more convolution operations, and the two or more additional portions include a split operation and a concatenation operation.
  13. The non-transitory machine-readable medium of claim 8, wherein the two or more portions include two or more independent pointwise operations.
  14. A system, comprising: one or more processors to combine two or more portions of one or more graphs based, at least in part, on whether the two or more portions are to operate independently on the same data; and one or more memories to store an updated graph that includes a set of nodes based, at least in part, on the combined two or more portions; wherein the one or more processors are to combine two or more additional portions based, at least in part, on the updated graph.
  15. The system of claim 14, wherein the one or more graphs is a representation of a machine learning computer program and the two or more portions include a split operation and a concatenation operation introduced to the graph by a combination of two or more independent operations.
  16. The system of claim 14, wherein the one or more processors are to update a worklist based, at least in part, on the combined two or more portions, and combine two or more additional operations based, at least in part, on the updated worklist.
  17. The system of claim 16, wherein the two or more portions include two or more convolution operations and the two or more additional operations include a concatenation operation.
  18. The system of claim 16, wherein the two or more additional operations include a split operation.
  19. The system of claim 14, wherein the one or more processors are to update a worklist based, at least in part, on the combined two or more portions, wherein the worklist includes a group of operations associated with a group key, and the two or more portions comprise operations included in the group of operations.
  20. A method, comprising: combining two or more portions of a graph based, at least in part, on whether the two or more portions are to operate independently on the same data; updating a list of operations based, at least in part, on the combined two or more portions; and combining two or more additional portions based, at least in part, on the updated list of operations.
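Claims 11, 16, and 19 describe a worklist that is updated as portions are combined, with a group key bucketing candidate operations and newly introduced operations fed back for further combination. A minimal sketch of such a pass, assuming a toy dict-based graph encoding, placeholder tensor names, and a single horizontal matrix-multiplication rule — these are illustrative assumptions, not the patented implementation:

```python
from collections import defaultdict

def group_key(op):
    # Ops of the same kind sharing their first input are fusion candidates
    # (an illustrative stand-in for the group key of claim 19).
    return (op["kind"], op["inputs"][0])

def fuse_pass(ops):
    worklist = list(ops)
    result = list(ops)
    while worklist:
        groups = defaultdict(list)
        for op in worklist:
            groups[group_key(op)].append(op)
        worklist = []
        for (kind, shared), members in groups.items():
            if kind == "matmul" and len(members) >= 2:
                # Horizontal fusion: N matmuls sharing a left input become
                # one matmul over concatenated right inputs plus a split.
                # "concat_rhs" and "fused_out" are placeholder tensor names.
                fused = {"kind": "matmul", "inputs": [shared, "concat_rhs"]}
                split = {"kind": "split", "inputs": ["fused_out"]}
                for m in members:
                    result.remove(m)
                result += [fused, split]
                # Newly introduced ops return to the worklist (claim 11),
                # so later rules (e.g. split/concat elimination) can fire.
                worklist += [fused, split]
    return result
```

The loop terminates once an iteration introduces no new operations, since the worklist is rebuilt from scratch each round and only combinations refill it.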

Description

FIELD

At least one embodiment pertains to processing resources used to perform and facilitate artificial intelligence. For example, at least one embodiment pertains to processors or computing systems used to perform training and/or inferencing using neural networks according to various novel techniques described herein.

BACKGROUND

Training neural networks and/or inferencing using neural networks can use significant memory, time, or computing resources. The amount of memory, time, or computing resources used to train neural networks and/or inference using neural networks can be improved.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 is a block diagram that illustrates a system to combine operations, according to at least one embodiment;
FIG. 2 is a block diagram that illustrates a system to execute instructions that include combined operations, according to at least one embodiment;
FIG. 3 is a flowchart of a technique of generating instructions that include combined operations, according to at least one embodiment;
FIG. 4 is a flowchart of a technique of combining operations, according to at least one embodiment;
FIG. 5 is a block diagram that illustrates types of fusion rules, according to at least one embodiment;
FIG. 6 is a block diagram that illustrates versions of a graph following successive application of fusion rules, according to at least one embodiment;
FIG. 7 is a block diagram that illustrates an initial graph and worklist, according to at least one embodiment;
FIG. 8 is a block diagram that illustrates an initial graph with group key annotations, according to at least one embodiment;
FIG. 9 is a block diagram that illustrates a graph and a worklist after application of a horizontal matrix multiplication fusion rule, according to at least one embodiment;
FIG. 10 is a block diagram that illustrates updated graphs and worklists, according to at least one embodiment;
FIG. 11 is a block diagram illustrating a rule that merges repeated rectified linear unit activation function (relu) operations, according to at least one embodiment;
FIG. 12 is a block diagram illustrating a rule that pushes pointwise relu operations into neighboring convolutions, according to at least one embodiment;
FIG. 13 is a block diagram illustrating a rule that pushes transpose operations into matrix multiplications, according to at least one embodiment;
FIG. 14 is a block diagram illustrating a rule that removes redundant casts, according to at least one embodiment;
FIG. 15 is a block diagram illustrating a rule that removes split operations followed by concatenation operations, according to at least one embodiment;
FIG. 16 is a block diagram illustrating a rule that pushes split operations, according to at least one embodiment;
FIG. 17 is a block diagram illustrating a rule that pushes concatenation operations, according to at least one embodiment;
FIG. 18 is a block diagram illustrating a rule that combines nested splits, according to at least one embodiment;
FIG. 19 is a block diagram illustrating a rule that combines nested concatenation operations, according to at least one embodiment;
FIG. 20 is a block diagram illustrating a rule that pushes transpose operations through concatenation operations, according to at least one embodiment;
FIG. 21 is a block diagram illustrating a rule that pushes transpose operations through splits, according to at least one embodiment;
FIG. 22 is a block diagram illustrating a rule that fuses pointwise operations together, according to at least one embodiment;
FIG. 23 is a block diagram illustrating a rule that horizontally fuses matrix multiplications where left hand inputs are shared, according to at least one embodiment;
FIG. 24 is a block diagram illustrating a rule that horizontally fuses matrix multiplications where right hand inputs are shared, according to at least one embodiment;
FIG. 25 is a block diagram illustrating a rule that fuses matrix multiplication operations that have same shapes but different input tensors, according to at least one embodiment;
FIG. 26 is a block diagram illustrating a rule that fuses convolution operations that operate over a shared image, according to at least one embodiment;
FIG. 27 is a block diagram illustrating a rule that fuses convolution operations using a widened filter, according to at least one embodiment;
FIG. 28 is a block diagram illustrating a rule that fuses reduction operations that operate over same reduction axis, according to at least one embodiment;
FIG. 29 is a block diagram illustrating a rule that removes redundant subgraphs, according to at least one embodiment;
FIG. 30A illustrates inference and/or training logic, according to at least one embodiment;
FIG. 30B illustrates inference and/or training logic, according to at least one embodiment;
FIG. 31 illustrates training and deployment of a neural network, according to at least one embodiment;
FIG. 32 illustrates an example data center system, according to at least one embodiment;
FIG. 3
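Several of the rewrite rules listed above correspond to simple algebraic identities that can be checked numerically. A hedged numpy sketch of three of them — the transpose-into-matmul rule of FIG. 13, the split-then-concatenation elimination of FIG. 15, and the shared-image convolution fusion of FIG. 26 — where the 1-D "image", filter shapes, and the modeling of convolution as a patch-matrix multiplication are illustrative assumptions, not the patented forms:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

rng = np.random.default_rng(0)

# FIG. 13: pushing a transpose into a matrix multiplication uses
# the identity (A @ B).T == B.T @ A.T.
a, b = rng.standard_normal((3, 4)), rng.standard_normal((4, 6))
assert np.allclose((a @ b).T, b.T @ a.T)

# FIG. 15: a split immediately re-concatenated along the same axis
# is the identity, so both operations can be removed from the graph.
x = rng.standard_normal((2, 9))
assert np.allclose(np.concatenate(np.split(x, 3, axis=1), axis=1), x)

# FIG. 26: convolutions that operate over a shared image fuse by
# stacking their filter banks on the output-channel axis; convolution
# is modeled here as multiplication against an im2col patch matrix.
image = rng.standard_normal(16)              # shared 1-D "image"
f1 = rng.standard_normal((4, 3))             # bank 1: 4 output channels
f2 = rng.standard_normal((2, 3))             # bank 2: 2 output channels
patches = sliding_window_view(image, 3)      # (14, 3) patch matrix

y1, y2 = patches @ f1.T, patches @ f2.T      # independent convolutions
fused = patches @ np.concatenate([f1, f2], axis=0).T
z1, z2 = np.split(fused, [f1.shape[0]], axis=1)
assert np.allclose(y1, z1) and np.allclose(y2, z2)
```

In each case the combined form introduces split or concatenation operations, which is why rules such as FIG. 15 that later remove redundant split/concat pairs pair naturally with the fusion rules.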