US-12625745-B2 - Optimized placement for efficiency for accelerated deep learning
Abstract
Techniques in optimized placement for efficiency in accelerated deep learning provide improvements in one or more of accuracy, performance, and energy efficiency. An array of processing elements comprising a portion of a neural network accelerator performs flow-based computations on wavelets of data. Each processing element comprises a compute element to execute programmed instructions using the data and a router to route the wavelets. The routing is in accordance with virtual channel specifiers (colors) of the wavelets and is controlled by routing configuration information of the router. A software stack determines optimized placement based on a description of a neural network. The determined placement is used to configure the routers, including usage of the respective colors. The determined placement is also used to configure the compute elements, including the respective programmed instructions each compute element is configured to execute.
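As a rough, non-authoritative sketch of the data flow the abstract describes — wavelets carrying virtual channel specifiers ("colors") that per-router configuration information maps to forwarding decisions — consider the following Python fragment. The class names, port naming, and table layout are illustrative assumptions, not details taken from the patent.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Wavelet:
    """A unit of fabric data: a virtual channel specifier plus payload."""
    color: int       # virtual channel specifier; selects the route
    payload: bytes   # data the compute elements operate on

class Router:
    """Minimal per-processing-element router model (hypothetical).

    The routing configuration maps each color to a set of output ports,
    e.g. 'N', 'S', 'E', 'W' toward neighbor routers and 'CE' toward the
    local compute element.
    """
    def __init__(self, routing_config: Dict[int, List[str]]):
        self.routing_config = routing_config

    def route(self, wavelet: Wavelet) -> List[str]:
        # Forwarding is driven entirely by the wavelet's color and the
        # router's configured table, as the abstract describes.
        return self.routing_config.get(wavelet.color, [])

# Usage: color 3 fans out north and into the local compute element.
router = Router({3: ["N", "CE"]})
print(router.route(Wavelet(color=3, payload=b"\x01\x02")))  # ['N', 'CE']
```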
Inventors
- Vladimir KIBARDIN
- Michael Edwin JAMES
- Michael Morrison
- Sean Lie
- Gary R. Lauterbach
- Stanislav Funiak
Assignees
- CEREBRAS SYSTEMS INC.
Dates
- Publication Date
- 2026-05-12
- Application Date
- 2020-10-30
Claims (12)
- 1 . A method comprising: extracting a model from a neural network description; computing delays based on convergent nodes of the extracted model; determining routing to implement data communication based on arcs of the extracted model; determining accelerator configuration information usable to configure a deep learning accelerator to provide a trained model, wherein the accelerator configuration information indicates delay buffer placement based on the delays and on the routing, and wherein the deep learning accelerator comprises a fabric and a plurality of processing elements enabled to communicate packets with each other via the fabric in accordance with a plurality of communication pathways identifiable by respective virtual channel identifiers; and configuring the deep learning accelerator based on the accelerator configuration information.
- 2 . The method of claim 1 , wherein the determining the routing ignores interactions between routes.
- 3 . The method of claim 2 , further comprising scanning results based on the determining the routing to produce hotspot information to repeat the determining the routing in accordance therewith.
- 4 . The method of claim 1 , wherein the determining the routing ignores coloring and bandwidth interactions with other routes.
- 5 . A non-transitory computer-readable medium comprising one or more instructions encoded thereon that, when executed by one or more processors, cause the one or more processors to perform actions comprising: extracting a model from a neural network description; computing delays based on convergent nodes of the extracted model; determining routing to implement data communication based on arcs of the extracted model; determining accelerator configuration information usable to configure a deep learning accelerator to provide a trained model, wherein the accelerator configuration information indicates delay buffer placement based on the delays and on the routing, and wherein the deep learning accelerator comprises a fabric and a plurality of processing elements enabled to communicate packets with each other via the fabric in accordance with a plurality of communication pathways identifiable by respective virtual channel identifiers; and configuring the deep learning accelerator based on the accelerator configuration information.
- 6 . The non-transitory computer-readable medium of claim 5 , wherein the determining the routing ignores interactions between routes.
- 7 . The non-transitory computer-readable medium of claim 6 , wherein the actions further comprise scanning results based on the determining the routing to produce hotspot information to repeat the determining the routing in accordance therewith.
- 8 . The non-transitory computer-readable medium of claim 5 , wherein the determining the routing ignores coloring and bandwidth interactions with other routes.
- 9 . A deep learning accelerator comprising: a fabric; and circuitry configured as a plurality of processing elements that is enabled to communicate packets with each other via the fabric in accordance with a plurality of communication pathways identifiable by respective virtual channel identifiers, wherein the circuitry is configured to: extract a model from a neural network description; compute delays based on convergent nodes of the extracted model; determine routing to implement data communication based on arcs of the extracted model; determine accelerator configuration information usable to configure the deep learning accelerator to provide a trained model, wherein the accelerator configuration information indicates delay buffer placement based on the delays and on the routing; and configure the deep learning accelerator based on the accelerator configuration information.
- 10 . The deep learning accelerator of claim 9 , wherein the circuitry, when determining the routing, is configured to ignore interactions between routes.
- 11 . The deep learning accelerator of claim 10 , wherein the circuitry is further configured to scan results based on the determining the routing to produce hotspot information to repeat the determining the routing in accordance therewith.
- 12 . The deep learning accelerator of claim 9 , wherein the circuitry, when determining the routing, is configured to ignore coloring and bandwidth interactions with other routes.
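The independent claims (1, 5, and 9) recite the same flow in three statutory classes: extract a model from a neural network description, compute delays at convergent nodes, route each arc, and emit configuration information that places delay buffers accordingly; the dependent claims add routing that ignores inter-route interactions plus a hotspot rescan that repeats the routing. The following Python sketch illustrates one plausible reading of that flow. The graph encoding, operand-alignment delay rule, dimension-order router, and hotspot threshold are all assumptions for illustration, not the claimed method itself.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

PE = Tuple[int, int]     # (x, y) processing-element coordinate on the fabric
Arc = Tuple[str, str]    # (producer node, consumer node) in the extracted model

def compute_delays(arcs: List[Arc], latency: Dict[Arc, int]) -> Dict[Arc, int]:
    """Delay-buffer depth per incoming arc of each convergent node.

    Assumed alignment rule: at any node with more than one incoming arc
    (a 'convergent node'), pad every input path up to the longest one so
    operands arrive together.
    """
    preds = defaultdict(list)
    for src, dst in arcs:
        preds[dst].append(src)

    memo: Dict[str, int] = {}
    def arrival(node: str) -> int:   # longest latency from any model input
        if node not in memo:
            memo[node] = max((arrival(s) + latency[(s, node)]
                              for s in preds.get(node, ())), default=0)
        return memo[node]

    buffers: Dict[Arc, int] = {}
    for node, sources in preds.items():
        if len(sources) > 1:         # convergent node
            worst = arrival(node)
            for s in sources:
                buffers[(s, node)] = worst - (arrival(s) + latency[(s, node)])
    return buffers

def route_arc(src: PE, dst: PE, y_first: bool = False) -> List[PE]:
    """Dimension-order route for one arc, computed independently of all
    other routes (cf. 'ignores interactions between routes')."""
    x1, y1 = dst
    path = [src]
    for dim in (("y", "x") if y_first else ("x", "y")):
        while (path[-1][0] != x1 if dim == "x" else path[-1][1] != y1):
            x, y = path[-1]
            if dim == "x":
                x += 1 if x1 > x else -1
            else:
                y += 1 if y1 > y else -1
            path.append((x, y))
    return path

def place_and_route(endpoints: Dict[Arc, Tuple[PE, PE]], cap: int = 2):
    """Route all arcs, scan for hotspots, then reroute offenders once."""
    routes = {a: route_arc(s, d) for a, (s, d) in endpoints.items()}
    load = Counter(l for p in routes.values() for l in zip(p, p[1:]))
    hot = {l for l, n in load.items() if n > cap}          # hotspot scan
    for a, path in routes.items():
        if any(l in hot for l in zip(path, path[1:])):
            s, d = endpoints[a]
            routes[a] = route_arc(s, d, y_first=True)      # crude detour
    return routes, hot

# Usage: node "c" converges inputs from "a" (1 cycle away) and "b" (4 cycles);
# the shorter path gets a 3-deep delay buffer.
print(compute_delays([("a", "c"), ("b", "c")],
                     {("a", "c"): 1, ("b", "c"): 4}))
# -> {('a', 'c'): 3, ('b', 'c'): 0}
```

Routing each arc independently (claims 2, 6, and 10) keeps the first pass cheap; the hotspot rescan (claims 3, 7, and 11) then recovers from whatever congestion that independence ignored.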
Description
CROSS REFERENCE TO RELATED APPLICATIONS

To the extent permitted by the type of the instant application, this application incorporates by reference for all purposes the following applications, all commonly owned with the instant application not later than the effective filing date of the instant application:
- U.S. Provisional Application Ser. No. 62/928,198, filed 2019 Oct. 30, first named inventor Vladimir KIBARDIN, and entitled TENSOR FLOW ON A WAFER SCALE COMPUTE ENGINE; and
- U.S. Provisional Application Ser. No. 62/929,055, filed 2019 Oct. 31, first named inventor Vladimir KIBARDIN, and entitled TECHNIQUES FOR ACCELERATED DEEP LEARNING.

BACKGROUND

Field: Advancements in accelerated deep learning are needed to provide improvements in one or more of accuracy, performance, and energy efficiency.

Related Art: Unless expressly identified as being publicly or well known, mention herein of techniques and concepts, including for context, definitions, or comparison purposes, should not be construed as an admission that such techniques and concepts are previously publicly known or otherwise part of the prior art. All references cited herein (if any), including patents, patent applications, and publications, are hereby incorporated by reference in their entireties, whether specifically incorporated or not, for all purposes.

Synopsis

The invention may be implemented in numerous ways, e.g., as a process, an article of manufacture, an apparatus, a system, a composition of matter, and a computer readable medium such as a computer readable storage medium (e.g., media in an optical and/or magnetic mass storage device such as a disk, an integrated circuit having non-volatile storage such as flash storage), or a computer network wherein program instructions are sent over optical or electronic communication links. The Detailed Description provides an exposition of one or more embodiments of the invention that enable improvements in cost, profitability, performance, efficiency, and utility of use in the field identified above. The Detailed Description includes an Introduction to facilitate understanding of the remainder of the Detailed Description. The Introduction includes Example Embodiments of one or more of systems, methods, articles of manufacture, and computer readable media in accordance with concepts described herein. As is discussed in more detail in the Conclusions, the invention encompasses all possible modifications and variations within the scope of the issued claims.

BRIEF DESCRIPTION OF DRAWINGS

FIG. 1 illustrates selected details of an embodiment of a system for neural network training and inference, using a deep learning accelerator.
FIG. 2 illustrates selected details of an embodiment of software elements associated with neural network training and inference, using a deep learning accelerator.
FIG. 3 illustrates selected details of an embodiment of processing associated with training a neural network and performing inference using the trained neural network, using a deep learning accelerator.
FIG. 4A illustrates selected details of an embodiment of a deep learning accelerator.
FIG. 4B illustrates selected details of a first embodiment of a scaled compute fabric for a deep learning accelerator.
FIG. 4C illustrates selected details of a second embodiment of a scaled compute fabric for a deep learning accelerator.
FIG. 5 illustrates selected details of an embodiment of a processing element of a deep learning accelerator.
FIG. 6 illustrates selected details of an embodiment of a router of a processing element.
FIG. 7A illustrates selected details of an embodiment of processing associated with a router of a processing element.
FIG. 7B illustrates selected details of an embodiment of generating and providing backpressure information associated with a compute element of a processing element.
FIG. 7C illustrates selected details of an embodiment of generating and providing backpressure information associated with a router of a processing element.
FIG. 7D illustrates selected details of an embodiment of stalling processing associated with a compute element of a processing element.
FIG. 8 illustrates selected details of an embodiment of a compute element of a processing element.
FIG. 9A illustrates selected details of an embodiment of processing a wavelet for task initiation.
FIG. 9B illustrates selected details of an embodiment of task activating.
FIG. 10 illustrates selected details of an embodiment of a multiple operand instruction.
FIG. 11 illustrates selected details of an embodiment of a one source, no destination operand instruction.
FIG. 12 illustrates selected details of an embodiment of an immediate instruction.
FIG. 13A illustrates selected details of an embodiment of a sparse wavelet.
FIG. 13B illustrates selected details of an embodiment of a dense wavelet.
FIG. 14 illustrates selected details of an embodiment of creating and transmitting a wavelet.
FIG. 15 illustrates selected details of an embodiment of r