
CN-121981208-A - Fine-grained compute communication execution for deep learning frameworks


Abstract

One embodiment provides a system to configure distributed training of a neural network. The system includes a memory to store a library to facilitate data transmission during distributed training of the neural network; a network interface to send and receive gradient data associated with trainable parameters of the neural network; a general-purpose processor to execute instructions provided by the library, the instructions to cause the general-purpose processor to configure the network interface to send and receive the gradient data during a workflow of a machine learning framework; and a graphics processor to perform compute operations associated with the machine learning framework workflow to generate the gradient data, wherein the library interleaves the compute operations on the graphics processor with the sending and receiving of the gradient data via the network interface based on the machine learning framework workflow.
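The interleaving the abstract describes (launching a layer's gradient communication as soon as that layer's gradients exist, while backward compute for earlier layers continues) can be sketched with public framework APIs. The example below is a minimal illustration, not the patent's library: it assumes PyTorch 2.1+ (for register_post_accumulate_grad_hook), a Gloo process group, and launch via torchrun, which sets the RANK/WORLD_SIZE/MASTER_ADDR environment variables.

```python
# Minimal sketch of compute/communication interleaving using the public
# torch.distributed API (not the library described in this patent).
import torch
import torch.distributed as dist
import torch.nn as nn

dist.init_process_group(backend="gloo")  # CPU-friendly backend for the sketch

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 10))
pending = []  # in-flight (work handle, parameter) pairs

def hook(param):
    # Fires as soon as this parameter's gradient has been accumulated during
    # backward; the nonblocking all-reduce of this layer's gradient then
    # overlaps with the backward compute of earlier layers.
    work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, async_op=True)
    pending.append((work, param))

for p in model.parameters():
    p.register_post_accumulate_grad_hook(hook)

x = torch.randn(32, 64)
loss = model(x).sum()
loss.backward()  # gradient all-reduces are issued layer by layer

# Drain outstanding communication, then average the summed gradients.
for work, p in pending:
    work.wait()
    p.grad /= dist.get_world_size()

dist.destroy_process_group()
```

Because autograd produces gradients in reverse layer order, each hook-issued all-reduce runs while the remaining layers' gradients are still being computed, which is the overlap the abstract attributes to the library.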

Inventors

  • S. Sridharan
  • D. Mudigere

Assignees

  • Intel Corporation

Dates

Publication Date
2026-05-05
Application Date
2018-05-07
Priority Date
2018-01-12

Claims (17)

  1. A system for configuring distributed training of a neural network using a plurality of interconnected worker nodes of a distributed training network, the plurality of interconnected worker nodes being interconnected via a communication fabric, the system comprising: a memory to store a library to facilitate data transmission during distributed training of the neural network, the data associated with trainable parameters of the neural network, the library to provide instructions to configure a worker node to send gradient data associated with the distributed training; and a plurality of worker nodes, wherein each worker node comprises: a fabric interface configured for connection to the communication fabric to send and receive gradient data associated with the trainable parameters, wherein during training the worker node sends and receives the gradient data associated with the trainable parameters via the fabric interface; and a graphics processor to perform compute operations associated with a machine learning framework workflow to generate the gradient data associated with the trainable parameters, wherein each worker node is to overlap compute operations with communication via the fabric interface based on the machine learning framework workflow and one or more of the instructions provided by the library.
  2. The system of claim 1, wherein a compute operation is configured to overlap with a communication operation that sends or receives gradient data via the fabric interface.
  3. The system of claim 2, wherein the graphics processor is to perform a compute operation associated with the machine learning framework workflow, the compute operation associated with a first portion of a first layer of the neural network.
  4. The system of claim 3, wherein one or more of the instructions provided by the library are to cause the fabric interface to send a result of the compute operation in response to a notification of completion of the compute operation associated with the first portion of the first layer of the neural network.
  5. The system of claim 4, the fabric interface to send the result according to a communication pattern for messages to be transmitted between the plurality of worker nodes during distributed training of the neural network.
  6. The system of claim 5, wherein the communication pattern is gather, scatter, allgather, all-to-all, reduce, reduce-scatter, or allreduce.
  7. The system of claim 1, wherein the fabric interface is a Peripheral Component Interconnect Express (PCIe) interface.
  8. The system of claim 1, wherein the fabric interface is an NVLink interface.
  9. The system of any of claims 1-8, wherein the graphics processor comprises at least a portion of the fabric interface.
  10. A method of performing distributed training of a neural network using a plurality of interconnected worker nodes of a distributed training network, the plurality of interconnected worker nodes being interconnected via a communication fabric and each comprising a fabric interface configured for connection to the communication fabric, the method comprising: storing a library in a memory, the library to facilitate data transmission during distributed training of the neural network, the data associated with trainable parameters of the neural network, the library to provide instructions to configure a worker node to send gradient data associated with the distributed training; and, in each worker node: sending and receiving gradient data associated with the trainable parameters via the fabric interface of the worker node; and performing, via a graphics processor of the worker node, a compute operation associated with a machine learning framework workflow to generate the gradient data associated with the trainable parameters, the compute operation overlapping with communication via the fabric interface.
  11. The method of claim 10, further comprising configuring compute operations to overlap with communication operations that send or receive gradient data via the fabric interface.
  12. The method of claim 11, further comprising performing, via the graphics processor, a compute operation associated with the machine learning framework workflow, the compute operation associated with a first portion of a first layer of the neural network.
  13. The method of claim 12, wherein one or more of the instructions provided by the library cause the fabric interface to send a result of the compute operation in response to a notification of completion of the compute operation associated with the first portion of the first layer of the neural network.
  14. The method of claim 13, further comprising sending the result according to a communication pattern for messages to be transmitted between the plurality of worker nodes during distributed training of the neural network.
  15. A non-transitory machine-readable medium storing instructions which, when executed by one or more processors, cause the one or more processors to perform the operations of the method of any of claims 10-14.
  16. A computer program product comprising a computer program which, when executed by one or more processors, causes the one or more processors to perform the operations of the method of any of claims 10-14.
  17. An apparatus comprising means for performing the operations of the method of any of claims 10-14.
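The communication patterns recited in claim 6 are the standard collective operations used for gradient exchange. As a concrete aside (not part of the claims), the numpy toy below models four worker nodes as rows of a matrix and checks the identity that underlies bandwidth-efficient allreduce implementations: a reduce-scatter followed by an allgather produces the same result as an allreduce.

```python
# Toy model of the collective patterns from claim 6, with worker nodes
# simulated as rows of a matrix (no real network involved).
import numpy as np

nodes = 4
chunk = 2                                      # elements per per-node chunk
grads = np.random.rand(nodes, nodes * chunk)   # row i: node i's local gradient

# "allreduce": every node ends with the elementwise sum over all nodes.
allreduce = grads.sum(axis=0)

# "reduce-scatter": node i is left holding only reduced chunk i.
owned = [grads[:, i * chunk:(i + 1) * chunk].sum(axis=0) for i in range(nodes)]

# "allgather": nodes exchange their owned chunks to reassemble the full sum.
reassembled = np.concatenate(owned)

assert np.allclose(reassembled, allreduce)
```

This decomposition also suggests why reduce-scatter and allgather are recited alongside allreduce: the pair can realize an allreduce in independent chunks, which is exactly the kind of work that can be pipelined against ongoing compute.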

Description

Fine-grained compute communication execution for deep learning frameworks

This application is a divisional of patent application No. 201810427289.3, filed on May 7, 2018, entitled "Fine-Grained Compute Communication Execution for Deep Learning Frameworks".

Cross-Reference

The present application claims the benefit of U.S. provisional application No. 62/502,453, filed in May 2017, which is hereby incorporated by reference.

Technical Field

Embodiments relate generally to data processing and, more particularly, to data processing via a general-purpose graphics processing unit.

Background

Current parallel graphics data processing includes systems and methods developed to perform specific operations on graphics data, such as linear interpolation, tessellation, rasterization, texture mapping, depth testing, and the like. Traditionally, graphics processors used fixed-function computational units to process graphics data; more recently, however, portions of graphics processors have been made programmable, enabling such processors to support a wider variety of operations for processing vertex and fragment data. To further increase performance, graphics processors typically implement processing techniques, such as pipelining, that attempt to process as much graphics data as possible in parallel throughout the different parts of the graphics pipeline. Parallel graphics processors with single-instruction multiple-thread (SIMT) architectures are designed to maximize the amount of parallel processing in the graphics pipeline. In an SIMT architecture, groups of parallel threads attempt to execute program instructions synchronously together as often as possible to increase processing efficiency. A general overview of software and hardware for SIMT architectures can be found in Shane Cook, CUDA Programming, Chapter 3, pages 37-51 (2013), and/or the CUDA Handbook, A Comprehensive Guide to GPU Programming, Sections 2.6.2 to 3.1.2 (June 2013).

Drawings

So that the manner in which the features of the present invention can be understood in detail, a more particular description of the invention may be had by reference to embodiments, some of which are illustrated in the appended drawings. It is to be noted, however, that the appended drawings illustrate only typical embodiments and are therefore not to be considered limiting of the scope of all embodiments.

FIG. 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;
FIGS. 2A-2D illustrate parallel processor components, according to an embodiment;
FIGS. 3A-3B are block diagrams of graphics multiprocessors, according to embodiments;
FIGS. 4A-4F illustrate an exemplary architecture in which a plurality of GPUs are communicatively coupled to a plurality of multi-core processors;
FIG. 5 illustrates a graphics processing pipeline, according to an embodiment;
FIG. 6 illustrates a machine learning software stack, according to an embodiment;
FIG. 7 illustrates a highly parallel general-purpose graphics processing unit, according to an embodiment;
FIG. 8 illustrates a multi-GPU computing system, according to an embodiment;
FIGS. 9A-9B illustrate layers of exemplary deep neural networks;
FIG. 10 illustrates an exemplary recurrent neural network;
FIG. 11 illustrates training and deployment of a deep neural network;
FIG. 12 is a block diagram illustrating distributed learning;
FIG. 13 illustrates an exemplary inference system on a chip (SOC) suitable for performing inference using a trained model;
FIGS. 14A-14E illustrate communication patterns used during distributed machine learning compute operations performed across multiple compute nodes, according to embodiments described herein;
FIGS. 15A-15C illustrate architectural details of a machine learning scaling library (MLSL) provided by embodiments described herein;
FIGS. 16A-16B illustrate distributed machine learning training enabled by embodiments described herein;
FIG. 16C illustrates inter-node communication using point-to-point primitives, according to an embodiment;
FIG. 17A illustrates a multi-node computing system, according to an embodiment;
FIG. 17B illustrates a point-to-point network with distributed virtual addresses, according to an embodiment;
FIG. 18 illustrates an alternative MLSL architecture, according to an embodiment;
FIG. 19A illustrates a tensor compute operation suitable for fine-grained compute and communication overlap;
FIG. 19B illustrates synchronized memory access between multi-node systems, according to an embodiment;
FIG. 19C illustrates memory communication semantics extended to enable coarse-grained cache coherency for cached memory data;
FIGS. 20A-20B illustrate flowcharts describing operations for enabling distributed machine learning via the MLSL API;
FIGS. 21A-21B illustrate a method of performing distributed training of a neural network.
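The fine-grained overlap referenced for FIG. 19A (chunks of a tensor handed to the fabric as they finish, rather than after the whole layer completes) can be mimicked in miniature with a producer/consumer pair. The sketch below is illustrative only; compute_chunk and send_chunk are hypothetical stand-ins for GPU work and a fabric-interface send, not MLSL API names.

```python
# Toy model of fine-grained compute/communication overlap: each finished
# gradient chunk is enqueued for "sending" while the next chunk is computed.
import queue
import threading
import time

NUM_CHUNKS = 8
outbox: queue.Queue = queue.Queue()
sent = []

def send_chunks():
    # Stand-in for a fabric-interface sender; runs concurrently with compute.
    while True:
        item = outbox.get()
        if item is None:          # sentinel: no more chunks
            break
        time.sleep(0.01)          # pretend network latency
        sent.append(item)

def compute_chunk(i):
    time.sleep(0.01)              # pretend GPU work for chunk i
    return f"grad-chunk-{i}"

comm = threading.Thread(target=send_chunks)
comm.start()

start = time.perf_counter()
for i in range(NUM_CHUNKS):
    outbox.put(compute_chunk(i))  # sending chunk i overlaps computing chunk i+1
outbox.put(None)
comm.join()

# With overlap, total time approaches max(compute, comm) instead of their sum.
print(f"{len(sent)} chunks sent in {time.perf_counter() - start:.2f}s")
```

Run serially, eight compute steps plus eight send steps would take roughly the sum of both; with the pipeline above, all but the first compute and last send are hidden behind each other, which is the effect the description attributes to fine-grained execution.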