
US-12627726-B2 - Edge server with deep learning accelerator and random access memory

US 12627726 B2

Abstract

Systems, devices, and methods related to a Deep Learning Accelerator and memory are described. An edge server may be implemented using an integrated circuit device having: a Deep Learning Accelerator configured to execute instructions with matrix operands; random access memory configured to store first instructions of an Artificial Neural Network executable by the Deep Learning Accelerator and second instructions of a server application executable by a Central Processing Unit; and an interface to a communication device on a computer network. The Central Processing Unit may be part of the integrated circuit device, or be connected to the integrated circuit device. The server application may be configured to provide services over the computer network based on output of the Artificial Neural Network and input received from one or more local devices via a bus, or a wired or wireless local area network.
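The data flow summarized in the abstract (sensor input staged in shared random access memory, the Deep Learning Accelerator producing the Artificial Neural Network output, and the server application acting on that output) can be modeled in software. The sketch below is a hypothetical functional model, not the patented device: all class and function names are illustrative, and the toy threshold "network" merely stands in for ANN inference.

```python
# Hypothetical model of the edge-server data flow described in the abstract.
# Sensor data -> shared RAM -> DLA inference -> RAM -> server application.

class SharedRAM:
    """Random access memory shared by the accelerator and the CPU."""
    def __init__(self):
        self.regions = {}

    def store(self, region, data):
        self.regions[region] = data

    def load(self, region):
        return self.regions[region]

def dla_run_ann(ram):
    """Stand-in for the DLA executing ANN instructions with matrix operands."""
    x = ram.load("ann_input")
    # Toy "network": flag the reading as an event when it crosses a threshold.
    ram.store("ann_output", "event" if sum(x) > 10 else "normal")

def server_application(ram, send):
    """Stand-in for the CPU-hosted server application using the ANN output."""
    result = ram.load("ann_output")
    if result == "event":
        send("alert: event detected")  # e.g., notify a remote server
    return result

def edge_server_step(sensor_data, send):
    """One service cycle: stage input, run the accelerator, serve the output."""
    ram = SharedRAM()
    ram.store("ann_input", sensor_data)   # input from local devices/network
    dla_run_ann(ram)                      # accelerator writes output to RAM
    return server_application(ram, send)  # service provided over the network
```

In the actual device the two compute elements share the memory through dedicated interfaces rather than a Python object, but the staging pattern is the same.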

Inventors

  • Poorna Kale
  • Jaime Cummins

Assignees

  • MICRON TECHNOLOGY, INC.

Dates

Publication Date
2026-05-12
Application Date
2020-04-09

Claims (20)

  1. An apparatus, comprising: an edge server having: a printed circuit board; a transceiver configured on the printed circuit board; a central processing unit configured on the printed circuit board and coupled to the transceiver; and an integrated circuit device configured on the printed circuit board and enclosed within an integrated circuit package, the integrated circuit device having: a deep learning accelerator comprising at least one processing unit configured to execute instructions having matrix operands, wherein: the at least one processing unit comprises a matrix-matrix unit having: a plurality of matrix-vector units configured to operate in parallel; a plurality of maps banks each storing a vector of a first matrix operand of the matrix operands, each of the plurality of maps banks connected to each of the matrix-vector units via a crossbar array; and a plurality of kernel buffers each connected to a respective one of the plurality of matrix-vector units, each of the plurality of kernel buffers storing a vector of a second matrix operand of the matrix operands; and each of the plurality of matrix-vector units is configured to, concurrently with respect to one another, multiply the vector of the first matrix operand from each of the plurality of maps banks by the vector of a respective one of the plurality of kernel buffers to which each of the plurality of matrix-vector units is connected; random access memory configured to store: matrices of an artificial neural network; the instructions executable by the at least one processing unit to implement the artificial neural network; and a server application programmed for execution by the central processing unit to provide, using the artificial neural network, services over a computer network connected to the transceiver; a memory interface configured to provide the central processing unit with access to the random access memory; and a separate interface connected to the transceiver; wherein the transceiver is configured to receive sensor data from the computer network; wherein the server application executed by the central processing unit is configured to process the sensor data using the artificial neural network accelerated via the at least one processing unit to generate an output; and wherein the transceiver is further configured to communicate the output to a remote server via the computer network.
  2. The apparatus of claim 1, wherein the integrated circuit package further encloses the central processing unit.
  3. The apparatus of claim 2, wherein the transceiver is configured to communicate with one or more devices using a protocol of a local area network, a wireless local area network, or a wireless personal area network.
  4. The apparatus of claim 3, wherein the transceiver is configured to store data received from the one or more devices as an input to the artificial neural network; the at least one processing unit is configured to execute the instructions to generate the output and store the output in the random access memory; and the server application executed in the central processing unit provides the services based on the output.
  5. The apparatus of claim 4, wherein the server application executed in the central processing unit is configured to provide the output to the one or more devices.
  6. The apparatus of claim 5, wherein the server application executed in the central processing unit is configured to generate an alert, a notification, or a response to a query, or any combination thereof, based on the output.
  7. The apparatus of claim 6, wherein the server application executed in the central processing unit is configured to transmit the output to the remote server over a telecommunications network, a cellular communications network, or the Internet, or any combination thereof.
  8. The apparatus of claim 1, further comprising: circuitry of a network interface card, a router, a hub of internet of things, an access point of a wireless computer network, or a base station of a cellular communications network, or any combination thereof; wherein the transceiver is coupled to the circuitry.
  9. The apparatus of claim 1, further comprising: a port configured on the printed circuit board and adapted to be connected to a local area network.
  10. The apparatus of claim 1, further comprising: one or more sensors configured to provide data as input to the artificial neural network; and a user interface.
  11. The apparatus of claim 9, further comprising: an interface to a bus configured in a host device.
  12. The apparatus of claim 11, wherein the bus is in accordance with a protocol of Universal Serial Bus (USB), Serial Advanced Technology Attachment (SATA) bus, or Peripheral Component Interconnect express (PCIe).
  13. The apparatus of claim 1, wherein each of the plurality of matrix-vector units comprises a plurality of vector-vector units configured to operate in parallel; and wherein each of the plurality of vector-vector units comprises a plurality of multiply-accumulate units configured to operate in parallel.
  14. A method, comprising: storing, in random access memory configured in an integrated circuit device: matrices of an artificial neural network; first instructions executable by at least one processing unit enclosed within the integrated circuit device to implement the artificial neural network using the matrices, wherein the at least one processing unit comprises a matrix-matrix unit having a plurality of matrix-vector units configured to operate in parallel, and wherein each of the plurality of matrix-vector units is configured to, concurrently with respect to one another, multiply a vector of a first matrix operand stored in each of a plurality of maps banks by a vector of a second matrix operand stored in a respective one of a plurality of kernel buffers to which each of the plurality of matrix-vector units is connected; and second instructions of a server application programmed for execution by a central processing unit; loading data into the random access memory as input to the artificial neural network; executing, by the at least one processing unit, the first instructions to generate output from the artificial neural network responsive to the input; storing, into the random access memory, the output from the artificial neural network; and executing, by the central processing unit, the second instructions of the server application to provide, based on the output, services over a computer network.
  15. The method of claim 14, wherein the central processing unit is configured in the integrated circuit device; and the method further comprises: receiving, from a local area network, the data as the input to the artificial neural network; and providing the services as a proxy of a computer system configured on the Internet.
  16. The method of claim 15, wherein the integrated circuit device has a deep learning accelerator with processing units, a control unit, and local memory; the processing units comprise at least a matrix-matrix unit configured to execute an instruction having two matrix operands; the matrix-matrix unit comprises a plurality of matrix-vector units configured to operate in parallel; each of the matrix-vector units comprises a plurality of vector-vector units configured to operate in parallel; and each of the vector-vector units comprises a plurality of multiply-accumulate units configured to operate in parallel.
  17. The method of claim 15, further comprising: transmitting the output to the computer system in providing the services.
  18. The method of claim 17, further comprising: transmitting the data to the computer system in response to a request from the computer system.
  19. A system, comprising: a printed circuit board; a modem connected to the printed circuit board and adapted to be connected to a computer network; a central processing unit having at least one arithmetic-logic unit connected to the printed circuit board and coupled with the modem; a field-programmable gate array or application specific integrated circuit having a plurality of processing units configured to, in parallel, operate on two matrix operands of an instruction executable in the field-programmable gate array or application specific integrated circuit, wherein each of the plurality of processing units is configured to, concurrently with respect to one another, multiply a vector of a first matrix operand stored in each of a plurality of maps banks by a vector of a second matrix operand stored in a respective one of a plurality of kernel buffers to which each of the plurality of processing units is connected; and an integrated circuit package comprising random access memory, wherein the random access memory is coupled to the central processing unit and the field-programmable gate array or application specific integrated circuit, and the random access memory is configured to store: matrices of an artificial neural network; instructions executable by the field-programmable gate array or application specific integrated circuit to implement the artificial neural network; and a server application executable by the central processing unit to provide, using the artificial neural network, services over the computer network via the modem.
  20. The system of claim 19, wherein the integrated circuit package is configured to enclose at least the random access memory and the field-programmable gate array or application specific integrated circuit.
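The compute hierarchy recited in the claims (multiply-accumulate units inside vector-vector units, vector-vector units inside matrix-vector units, and matrix-vector units inside a matrix-matrix unit fed from maps banks and kernel buffers) can be modeled in software. The sketch below is a hypothetical functional model, not the claimed hardware: Python loops run sequentially where the claims require units to operate in parallel, and all function names are illustrative.

```python
# Functional model of the claimed compute hierarchy (illustrative only).

def mac_unit(acc, a, b):
    """One multiply-accumulate step."""
    return acc + a * b

def vector_vector_unit(vec_a, vec_b):
    """A dot product built from multiply-accumulate steps
    (the hardware runs its MAC units in parallel)."""
    acc = 0
    for a, b in zip(vec_a, vec_b):
        acc = mac_unit(acc, a, b)
    return acc

def matrix_vector_unit(maps_banks, kernel_vector):
    """Multiplies every maps-bank vector by this unit's kernel-buffer vector."""
    return [vector_vector_unit(row, kernel_vector) for row in maps_banks]

def matrix_matrix_unit(matrix_a, matrix_b):
    """Matrix-vector units operate concurrently in hardware, one per
    kernel buffer; maps banks broadcast rows of A via the crossbar."""
    maps_banks = matrix_a                  # each bank stores one row of A
    kernel_buffers = list(zip(*matrix_b))  # each buffer stores one column of B
    columns = [matrix_vector_unit(maps_banks, kv) for kv in kernel_buffers]
    return [list(row) for row in zip(*columns)]  # reassemble A @ B
```

The result equals an ordinary matrix product; the point of the claimed structure is that each level of the hierarchy exposes parallelism the accelerator can exploit.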

Description

FIELD OF THE TECHNOLOGY

At least some embodiments disclosed herein relate to edge servers in general and more particularly, but not limited to, edge servers implemented using integrated circuit devices having accelerators for Artificial Neural Networks (ANNs), such as ANNs configured through machine learning and/or deep learning.

BACKGROUND

An Artificial Neural Network (ANN) uses a network of neurons to process inputs to the network and to generate outputs from the network. For example, each neuron in the network receives a set of inputs. Some of the inputs to a neuron may be the outputs of certain neurons in the network; and some of the inputs to a neuron may be the inputs provided to the neural network. The input/output relations among the neurons in the network represent the neuron connectivity in the network.

For example, each neuron can have a bias, an activation function, and a set of synaptic weights for its respective inputs. The activation function may be in the form of a step function, a linear function, a log-sigmoid function, etc. Different neurons in the network may have different activation functions. For example, each neuron can generate a weighted sum of its inputs and its bias and then produce an output that is a function of the weighted sum, computed using the activation function of the neuron.

The relations between the input(s) and the output(s) of an ANN in general are defined by an ANN model that includes the data representing the connectivity of the neurons in the network, as well as the bias, activation function, and synaptic weights of each neuron. Based on a given ANN model, a computing device can be configured to compute the output(s) of the network from a given set of inputs to the network. For example, the inputs to an ANN may be generated based on camera inputs; and the outputs from the ANN may be the identification of an item, such as an event or an object.
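The per-neuron computation described above (a weighted sum of the inputs plus the bias, passed through the neuron's activation function) can be sketched as follows. This is an illustrative model only; the function names are not from the patent.

```python
import math

def log_sigmoid(x):
    """One of the activation functions named above."""
    return 1.0 / (1.0 + math.exp(-x))

def neuron_output(inputs, weights, bias, activation=log_sigmoid):
    """A neuron's output: its activation applied to the weighted sum plus bias."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)
```

Passing an identity function as `activation` yields a linear neuron, and a thresholding function yields a step-function neuron, matching the activation-function variants listed above.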
In general, an ANN may be trained using a supervised method where the parameters in the ANN are adjusted to minimize or reduce the error between known outputs associated with, or resulting from, respective inputs and the computed outputs generated by applying the inputs to the ANN. Examples of supervised learning/training methods include reinforcement learning and learning with error correction. Alternatively, or in combination, an ANN may be trained using an unsupervised method where the exact outputs resulting from a given set of inputs are not known before the completion of the training. The ANN can be trained to classify an item into a plurality of categories, or data points into clusters. Multiple training algorithms can be employed for a sophisticated machine learning/training paradigm.

Deep learning uses multiple layers of machine learning to progressively extract features from input data. For example, lower layers can be configured to identify edges in an image; and higher layers can be configured to identify, based on the edges detected using the lower layers, items captured in the image, such as faces, objects, events, etc. Deep learning can be implemented via Artificial Neural Networks (ANNs), such as deep neural networks, deep belief networks, recurrent neural networks, and/or convolutional neural networks. Deep learning has been applied to many application fields, such as computer vision, speech/audio recognition, natural language processing, machine translation, bioinformatics, drug design, medical image processing, games, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

The embodiments are illustrated by way of example and not limitation in the figures of the accompanying drawings, in which like references indicate similar elements.

FIG. 1 shows an integrated circuit device having a Deep Learning Accelerator and random access memory configured according to one embodiment.

FIG. 2 shows a processing unit configured to perform matrix-matrix operations according to one embodiment.

FIG. 3 shows a processing unit configured to perform matrix-vector operations according to one embodiment.

FIG. 4 shows a processing unit configured to perform vector-vector operations according to one embodiment.

FIG. 5 shows a Deep Learning Accelerator and random access memory configured to autonomously apply inputs to a trained Artificial Neural Network according to one embodiment.

FIG. 6 shows an integrated circuit device having a Deep Learning Accelerator and random access memory configured with separate memory access connections according to one embodiment.

FIG. 7 shows an integrated circuit device having a Deep Learning Accelerator and random access memory with a camera interface according to one embodiment.

FIG. 8 shows a system on a chip according to one embodiment.

FIG. 9 shows a user device configured with an edge server according to one embodiment.

FIG. 10 shows an edge server implemented according to one embodiment.

FIG. 11 shows a method implemented in an edge server according to one embodiment.

DETAILED DESCRIPTION

At lea