EP-3973464-B1 - SYSTEM, METHOD AND COMPUTER PROGRAM FOR ACCELERATING NEURON COMPUTATIONS IN ARTIFICIAL NEURAL NETWORKS WITH DUAL SPARSITY
Inventors
- LARZUL, LUDOVIC
- DELERSE, SEBASTIEN
Dates
- Publication Date
- 20260506
- Application Date
- 20190520
Claims (15)
- A system for accelerating computation of an artificial neural network, ANN, the system (1100) comprising: one or more arithmetic units (540-i); a number of accumulation units (560-i) that is greater than the number of arithmetic units (540-i); and one or more processing units (530) coupled with the one or more arithmetic units (540-i) and the accumulation units (560-i); wherein the one or more processing units (530) are configured to: receive (1005) a first plurality of first values and a second plurality of second values associated with a plurality of neurons of the ANN; determine (1010) a plurality of pairs, wherein each pair of the plurality of pairs has a first value of the first plurality and a second value of the second plurality and wherein the first value and the second value satisfy criteria; and assign an identifier to each pair of the plurality of pairs, each identifier being associated with an accumulation unit of the accumulation units (560-i), the accumulation unit being assigned to a neuron of the ANN; wherein an arithmetic unit of the one or more arithmetic units (540-i) is configured to perform (1015) mathematical operations on each pair of the plurality of pairs to obtain a result for each pair, and different accumulation units are configured to accumulate results of the mathematical operations performed by the arithmetic unit; wherein, for each pair of the plurality of pairs, the one or more processing units (530) are configured to select, based on the identifier, the associated accumulation unit from the accumulation units (560-i) to accumulate a result of mathematical operations performed on the pair; wherein the selected accumulation unit is configured to accumulate (1020) the result to obtain an accumulated result; wherein the one or more processing units (530) are configured to determine (1025), based on the accumulated result, an output of the corresponding neuron.
- The system of claim 1, wherein the one or more processing units (530) are configured to determine that the first value and the second value satisfy the criteria by comparing the first value to a first reference number or comparing the second value to a second reference number.
- The system of claim 2, wherein at least one of the first reference number or the second reference number is zero.
- The system of any of the preceding claims, wherein a count of pairs in the plurality of pairs is less than a count of all possible pairs including a first value of the first plurality and a second value of the second plurality.
- The system of any of the preceding claims, wherein the one or more processing units (530) are configured to: for each first value of the first plurality, generate a first label indicative of whether the first value can be omitted in the computations of the one or more neurons or the first value cannot be omitted in the computations of the one or more neurons; and for each second value of the second plurality, generate a second label indicative of whether the second value can be omitted in the computations of the one or more neurons or the second value cannot be omitted in the computations of the one or more neurons; and wherein the first label and the second label are used to configure the accumulation units (560-i).
- The system of claim 5, wherein the first label includes a first binary enable signal and the second label includes a second binary enable signal; and wherein the determining that the first value and the second value satisfy criteria includes a Boolean operation on the first binary enable signal and the second binary enable signal.
- The system of any of the preceding claims, wherein one of: the first plurality of first values includes inputs of the plurality of neurons of the ANN and the second plurality of the second values includes weights associated with the inputs of the plurality of neurons of the ANN; or the first plurality of first values includes the weights associated with the inputs of the plurality of neurons of the ANN and the second plurality of the second values includes the inputs of the plurality of neurons.
- The system of any of the preceding claims, wherein the arithmetic unit of the one or more arithmetic units (540-i) includes at least one electronic circuit configured to perform the mathematical operations; and wherein preferably the at least one electronic circuit includes one or more clock signals to trigger the performing of mathematical operations by the at least one electronic circuit.
- The system of claim 8, wherein the at least one electronic circuit includes one or more clock signals, wherein the arithmetic unit is configured to, based on a clock of the one or more clock signals: at a first cycle of the clock, execute a mathematical operation on a first pair of the plurality of pairs to obtain a first result to be accumulated by a first accumulation unit of the accumulation units (560-i); at a second cycle of the clock, execute a mathematical operation on a second pair of the plurality of pairs to obtain a second result to be accumulated by a second accumulation unit of the accumulation units (560-i); and at a third cycle of the clock, execute a mathematical operation on a third pair of the plurality of pairs to obtain a third result to be accumulated by the first accumulation unit of the accumulation units (560-i), wherein the first accumulation unit differs from the second accumulation unit.
- The system of claim 9, wherein the electronic circuit includes one or more enable signals to trigger at least one of the accumulation units (560-i) to accumulate a result of mathematical operations performed by the arithmetic unit.
- The system of any of the preceding claims, wherein the performing (1015) the mathematical operations on each pair of the plurality of the pairs includes multiplication of a first value of the pair and a second value of the pair.
- The system of any of the preceding claims, wherein at least one accumulation unit of the accumulation units (560-i) includes: at least one adder unit (720); at least one multiplexer unit (740); and a plurality of register units (710-1, 710-2, 710-N); wherein the at least one accumulation unit is configured to receive a result from at least one of the arithmetic units (540-i) and selection information; and wherein the at least one multiplexer unit (740) is configured to: select, based on the selection information, a register unit from the plurality of register units (710-1, 710-2, 710-N); and provide a value stored in the selected register unit to the at least one adder unit (720); wherein the adder unit (720) is configured to perform an addition of the stored value and the result to obtain a sum; and wherein the sum is stored back to a register unit of the plurality of register units (710-1, 710-2, 710-N).
- A method for accelerating computation of an artificial neural network, ANN, the method comprising: receiving (1005), by one or more processing units (530) coupled with one or more arithmetic units (540-i) and a number of accumulation units (560-i) larger than the number of arithmetic units (540-i), a first plurality of first values and a second plurality of second values associated with a plurality of neurons of the ANN; determining (1010), by the one or more processing units (530), a plurality of pairs, wherein each pair of the plurality of pairs has a first value of the first plurality and a second value of the second plurality and wherein the first value and the second value satisfy criteria; assigning, by the one or more processing units (530), an identifier to each pair of the plurality of pairs, each identifier being associated with an accumulation unit of the accumulation units (560-i), the accumulation unit being assigned to a neuron of the ANN; performing (1015), by an arithmetic unit of the one or more arithmetic units (540-i), mathematical operations on each pair of the plurality of pairs to obtain a result for each pair; for each pair, selecting, by the one or more processing units (530) and based on the identifier, the associated accumulation unit from the accumulation units (560-i) to accumulate the result of mathematical operations performed on the pair; accumulating (1020), by the selected accumulation unit, the result to obtain an accumulated result; and determining (1025), by the one or more processing units (530) and based on the accumulated result, an output of the corresponding neuron; wherein different accumulation units accumulate the results of the mathematical operations performed by the arithmetic unit.
- The method of claim 13, wherein the determining that the first value and the second value satisfy the criteria includes comparing the first value to a first reference number or comparing the second value to a second reference number.
- A computer program for accelerating computation of an artificial neural network, ANN, wherein the computer program comprises control instructions which, when executed by a system comprising one or more arithmetic units (540-i), a number of accumulation units (560-i) larger than the number of arithmetic units (540-i), and one or more processing units (530) coupled with the one or more arithmetic units (540-i) and the accumulation units (560-i), cause the system to perform the method according to claim 13 or 14.
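The method of claim 13 can be illustrated with a minimal software sketch (hypothetical function and variable names; the claims describe hardware units, not this code): only pairs whose first and second values both satisfy the criteria (here, comparison against a reference number of zero, per claims 2-3) are multiplied, and each product is routed by an identifier (here, the neuron index) to the accumulator assigned to the corresponding neuron.

```python
def compute_neurons(inputs, weight_rows, activation=lambda s: max(s, 0.0)):
    """Software model of the dual-sparsity scheme of claim 13:
    skip pairs with a zero operand, and route each product to the
    accumulation register assigned to its neuron."""
    # One accumulation register per neuron (the claimed system has
    # more accumulation units than arithmetic units; here a dict).
    accumulators = {n: 0.0 for n in range(len(weight_rows))}
    for neuron, weights in enumerate(weight_rows):
        for x, w in zip(inputs, weights):
            # Pairing criteria: both values must be nonzero
            # (comparison to a zero reference number, claims 2-3).
            if x != 0 and w != 0:
                # The neuron index acts as the identifier selecting
                # the accumulation unit for this pair's product.
                accumulators[neuron] += x * w
    # The neuron output is determined from the accumulated result.
    return [activation(accumulators[n]) for n in range(len(weight_rows))]
```

For inputs `[0.0, 2.0, 0.0, 1.0]` and two weight rows `[1.0, 0.5, 3.0, 0.0]` and `[0.0, -1.0, 0.0, 4.0]`, only three of the eight possible pairs are multiplied, illustrating claim 4's reduced pair count.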
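The accumulation unit of claim 22 can likewise be sketched in software (hypothetical class name; the claim describes an electronic circuit): a bank of register units, a multiplexer that selects one register based on the selection information, and an adder that writes the sum back to the selected register.

```python
class AccumulationUnit:
    """Software model of the accumulation unit of claim 22:
    register units (710-i), a multiplexer (740) selecting one of
    them, and an adder (720) storing the sum back."""

    def __init__(self, num_registers):
        self.registers = [0.0] * num_registers  # register units 710-i

    def accumulate(self, result, selection):
        # Multiplexer: select the register named by the selection info
        # and provide its stored value to the adder.
        stored = self.registers[selection]
        # Adder: add the incoming result to the stored value.
        total = stored + result
        # The sum is stored back to the selected register unit.
        self.registers[selection] = total
        return total
```

Repeated calls with the same selection index accumulate into the same register, so one unit can serve several neurons across clock cycles, as in claim 9.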
Description
TECHNICAL FIELD
The present disclosure relates generally to data processing and, more particularly, to a system and method for accelerating neuron computations in artificial neural networks (ANNs) by exploiting the dual sparsity of the ANN.

BACKGROUND
Artificial neural networks (ANNs) are simplified and reduced models that reproduce the behavior of the human brain. The human brain contains 10-20 billion neurons connected through synapses. Electrical and chemical messages are passed from neuron to neuron based on input information and their resistance to passing information. In ANNs, a neuron can be represented by a node performing a simple operation of addition coupled with a saturation function. A synapse can be represented by a connection between two nodes. Each connection can be associated with an operation of multiplication by a constant. ANNs are particularly useful for solving problems that cannot be easily solved by classical computer programs.

While forms of ANNs may vary, they all share the same basic elements, similar to the human brain. A typical ANN can be organized into layers, and each layer may include many neurons sharing similar functionality. The inputs of a layer may come from a previous layer, multiple previous layers, any other layer, or even the layer itself. Major ANN architectures include the Convolutional Neural Network (CNN), the Recurrent Neural Network (RNN), and the Long Short-Term Memory (LSTM) network, but other ANN architectures can be developed for specific applications. While some operations have a natural sequence, for example a layer depending on previous layers, most operations within the same layer can be carried out in parallel. ANNs can therefore be computed in parallel on many different computing elements, similar to the neurons of the brain. A single ANN may have hundreds of layers, and each layer can involve millions of connections.
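The neuron model described above (multiplication of each input by a constant weight on its connection, addition at the node, then a saturation function) can be sketched as follows. The names and the choice of a sigmoid as the saturation function are illustrative assumptions, not part of the disclosure:

```python
import math

def neuron_output(inputs, weights):
    # Each connection multiplies its input by a constant weight;
    # the node sums the products (the "simple operation of addition").
    s = sum(x * w for x, w in zip(inputs, weights))
    # A saturation function bounds the output; a sigmoid is one
    # common choice, squashing any sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-s))
```

Even this single-neuron model makes clear why a full ANN requires enormous numbers of multiplications and additions, as discussed next.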
Thus, a single ANN may potentially require billions of simple operations such as multiplications and additions. Because of the large number of operations and their parallel nature, ANNs can place a very heavy load on processing units (e.g., CPUs), even those running at high clock rates. Sometimes, to overcome the limitations of CPUs, graphics processing units (GPUs) are used to process large ANNs, because GPUs have a much higher operation throughput than CPUs. Because this approach solves, at least partially, the throughput limitation problem, GPUs appear to be more efficient than CPUs for computing ANNs. However, GPUs are not well suited to ANN computations because they were specifically designed to compute graphical images. GPUs provide a certain level of parallelism in computations, but they constrain the computations in long pipelines, which implies latency and a lack of reactivity. To deliver maximum throughput, very large GPUs can be used, which involves excessive power consumption, a typical issue with GPUs. Since GPUs may require considerable power for ANN computations, their deployment can be difficult.

To summarize, CPUs provide a very generic engine that can execute sequences of instructions with minimal programming effort, but they lack the computing power needed for ANNs. GPUs are somewhat more parallel and require a larger programming effort than CPUs, which can be hidden behind libraries at some performance cost, but they are not very well suited to ANNs. Field-programmable gate arrays (FPGAs) are components that can be programmed at the hardware level after they are manufactured. FPGAs can be configured to perform computations in parallel and can therefore be well suited to computing ANNs. One of the challenges of FPGAs is programming, which requires a much larger effort than programming CPUs and GPUs.
Adapting FPGAs to perform ANN computations can be more challenging than for CPUs and GPUs. Most attempts at programming FPGAs to compute ANNs have focused on a specific ANN or a subset of ANNs, requiring modification of the ANN structure to fit into a specific, limited accelerator, or have provided only basic functionality without solving the problem of computing ANNs on FPGAs globally. The computation scale is typically not considered in existing FPGA solutions, with much of the research being limited to a single or a few computation engines that could be replicated. Existing FPGA solutions do not solve the problem of the massive data movement required at large scale for the actual ANNs involved in real industrial applications. The inputs to be computed with an ANN are typically provided by an artificial intelligence (AI) framework. These programs are used by the AI community to develop new ANNs or global solutions based on ANNs. Furthermore, FPGAs lack integration in those software environments. U.S. Patent Application Publication No. US 2018/164866 A1 is related to an apparat