
DE-112019000336-B4 - MASSIVELY PARALLEL NEURAL INFERENCE DATA PROCESSING ELEMENTS


Abstract

A system comprising: a plurality of multipliers (206; 302; 504; 1202), wherein the plurality of multipliers is arranged in a plurality of equally sized groups, each of the plurality of multipliers being designed to apply a weight to an input activation in parallel to produce an output; a plurality of adders (204; 304; 506; 1204), each of the plurality of adders being operatively connected to one of the groups of multipliers, each of the plurality of adders being designed to add the outputs of the multipliers within its respective group in parallel to produce a partial sum (306); a first plurality of function blocks (508; 1206), wherein each of the first plurality of function blocks is operatively connected to one of the plurality of adders, wherein each of the first plurality of function blocks is designed to apply a function in parallel to the partial sum of its associated adder in order to produce an output value; a vector register (116), wherein the vector register is operatively connected to the first plurality of function blocks, wherein the vector register is designed to store the output values of the first plurality of function blocks, wherein the first plurality of function blocks is designed to combine the output values stored in the vector register with subsequently calculated output values of the first plurality of function blocks, wherein output values of this combination are stored in the vector register; and a second plurality of function blocks, each of which is operatively connected to the vector register and designed to apply a function in parallel to the stored output values.
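To make the claimed datapath easier to follow, here is a minimal software sketch of one pass through it. This is an illustration, not the patented hardware: the array shapes, the identity `f1`, the additive `combine`, and the ReLU `f2` are assumptions chosen for the example, whereas in the claims these functions are programmable and every multiplier, adder, and function block operates in parallel.

```python
import numpy as np

def datapath_step(weights, activations, vector_register,
                  f1=lambda s: s, combine=np.add,
                  f2=lambda v: np.maximum(v, 0.0)):
    """One parallel pass through the claimed datapath (illustrative).

    weights:         (G, M) array; G equally sized groups of M multipliers.
    activations:     (M,) input activations, one per multiplier in a group.
    vector_register: (G,) previously stored output values.
    """
    # Plurality of multipliers: every weight is applied to an input
    # activation; all G*M multiplications are independent (parallel).
    products = weights * activations

    # One adder per group sums the outputs of its M multipliers,
    # producing one partial sum per group.
    partial_sums = products.sum(axis=1)

    # First plurality of function blocks: apply f1 to each partial sum.
    outputs = f1(partial_sums)

    # Combine with the values already held in the vector register
    # and store the result back into the register.
    vector_register = combine(vector_register, outputs)

    # Second plurality of function blocks: apply f2 to the stored values.
    return f2(vector_register), vector_register
```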

Inventors

  • Jun Sawada
  • Hartmut Penner
  • Jennifer Klamo
  • Dharmendra Modha
  • John Vernon Arthur
  • Steven Kyle Esser
  • Rathinakumar Appuswamy
  • Brian Seisho Taba
  • Andrew Stephen Cassidy
  • Pallab Datta
  • Myron Flickner

Assignees

  • INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date
2026-05-13
Application Date
2019-03-11
Priority Date
2018-03-30

Claims (20)

  1. System comprising: a plurality of multipliers (206; 302; 504; 1202), wherein the plurality of multipliers is arranged in a plurality of equally sized groups, wherein each of the plurality of multipliers is designed to apply a weight to an input activation in parallel to produce an output; a plurality of adders (204; 304; 506; 1204), wherein each of the plurality of adders is operatively connected to one of the groups of multipliers, wherein each of the plurality of adders is designed to add the outputs of the multipliers within its respective group in parallel to produce a partial sum (306); a first plurality of function blocks (508; 1206), wherein each of the first plurality of function blocks is operatively connected to one of the plurality of adders, wherein each of the first plurality of function blocks is designed to apply a function to the partial sum of its associated adder in parallel to produce an output value; a vector register (116), wherein the vector register is operatively connected to the first plurality of function blocks, wherein the vector register is designed to store the output values of the first plurality of function blocks, wherein the first plurality of function blocks is designed to combine the output values stored in the vector register with subsequently computed output values of the first plurality of function blocks, wherein output values of this combination are stored in the vector register; a second plurality of function blocks, wherein each of the second plurality of function blocks is operatively connected to the vector register, wherein each of the second plurality of function blocks is designed to apply a function to the stored output values in parallel.
  2. System according to Claim 1, which is designed to receive a matrix of weights and a vector of activations.
  3. System according to Claim 1, wherein each of the plurality of adders comprises a tree of adders.
  4. System according to Claim 3, wherein the tree of adders is a binary tree.
  5. System according to Claim 3, wherein the tree of adders comprises a plurality of carry-save adders.
  6. System according to Claim 2, wherein each activation of the vector of activations is broadcast to all groups of multipliers.
  7. System according to Claim 2, further comprising a systolic pipeline that is operatively connected to each of the groups of multipliers.
  8. System according to Claim 1, wherein the groups of multipliers are organized as a pipeline.
  9. System according to Claim 1, wherein the weights are balanced ternary values.
  10. System according to Claim 1, wherein each of the plurality of multipliers comprises a multiplexer.
  11. System according to Claim 2, wherein the matrix of weights is compressed and wherein the system is designed to decompress the compressed matrix of weights.
  12. System according to Claim 1, wherein each of the plurality of multipliers comprises a ternary multiplier realized by a multiplexer (see the sketch after the claims).
  13. System according to Claim 1, further comprising: a plurality of shifters, wherein each shifter is operatively connected to one of the first plurality of function blocks, wherein each shifter is designed to shift the output value of its corresponding function block in parallel, and wherein the first plurality of function blocks is designed to combine the shifted values with subsequently calculated output values.
  14. System according to Claim 1, wherein the function of each of the first plurality of function blocks is an activation function.
  15. System according to Claim 1, wherein the function of each of the first plurality of function blocks is programmable.
  16. System according to Claim 1, wherein the function of each of the first plurality of function blocks is an addition.
  17. System according to Claim 1, wherein the function of each of the first plurality of function blocks is a multiplication.
  18. System according to Claim 1, wherein the function of each of the first plurality of function blocks is an identity function.
  19. System according to Claim 1, further comprising a lookup table, wherein the function of each of the first plurality of function blocks comprises a lookup into the lookup table.
  20. System according to Claim 19, wherein the lookup table is programmable.
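The sketch promised in Claim 12: with balanced ternary weights (Claim 9), no true multiplier is needed, since each weight only selects among the negated activation, zero, and the activation itself, which in hardware is a small multiplexer (Claims 10 and 12). The encoding of the weight as a Python integer in {-1, 0, +1} is chosen here purely for illustration.

```python
def ternary_multiply(weight, activation):
    """Multiplexer-style ternary multiplier (illustrative encoding).

    The weight in {-1, 0, +1} acts as a select signal: it picks one of
    -activation, 0, or +activation instead of performing a real multiply.
    """
    if weight == 0:
        return 0
    return activation if weight == 1 else -activation

# A group's partial sum is then a select-and-add reduction, which the
# adder tree of Claims 3-5 would compute in hardware.
weights = [1, -1, 0, 1]
activations = [3, 5, 7, 2]
partial_sum = sum(ternary_multiply(w, a) for w, a in zip(weights, activations))
assert partial_sum == 3 - 5 + 0 + 2
```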

Description

BACKGROUND

Embodiments of the present disclosure relate to neural network inference and, in particular, to massively parallel neural inference data processing elements.

The publication "A Massively Parallel Coprocessor for Convolutional Neural Networks" concerns a massively parallel coprocessor for convolutional neural networks (CNNs). The coprocessor features parallel clusters of vector processing elements, with each cluster consisting of hand-optimized 2D convolver units and other hardware specifically designed for CNNs. A key feature of the coprocessor is the use of off-chip memory on the coprocessor board as a scratchpad for intermediate CNN data. This is made possible by the high-bandwidth memory architecture and the reduction of data precision to pack multiple words per memory operation (SANKARADAS, M. et al.: A Massively Parallel Coprocessor for Convolutional Neural Networks. In: 2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors, 2009, pp. 53-60. https://ieeexplore.ieee.org/document/5200010).

The publication GB 2 552 243 A relates to a computer-implemented method for configuring a hardware implementation of a CNN, wherein the method comprises: determining, for each of a plurality of layers of the CNN, a first numerical format to represent weight values in the layer based on a distribution of weight values for the layer, wherein the first numerical format comprises a first integer of a first predetermined bit length and a first exponent value specified for the layer; determining, for each of a plurality of layers of the CNN, a second numerical format to represent data values in the layer based on a distribution of expected data values for the layer, wherein the second numerical format comprises a second integer of a second predetermined bit length and a second exponent value specified for the layer; and storing the determined numerical formats for use in configuring the hardware implementation of a CNN.

SUMMARY

According to embodiments of the present disclosure, systems, methods, and computer program products for massively parallel neural inference data processing are provided. A plurality of multipliers is arranged in a plurality of equally sized groups. Each of the plurality of multipliers is designed to apply a weight to an input activation in parallel to produce an output. A plurality of adders is provided, each operatively connected to one of the groups of multipliers. Each of the plurality of adders is designed to add the outputs of the multipliers within its respective group in parallel to produce a partial sum. A plurality of function blocks is provided, each operatively connected to one of the plurality of adders. Each of the plurality of function blocks is designed to apply a function to the partial sum of its associated adder in parallel to produce an output value.

According to one aspect, a system is provided comprising: a plurality of multipliers, wherein the plurality of multipliers is arranged in a plurality of equally sized groups, each of the plurality of multipliers being designed to apply a weight to an input activation in parallel to produce an output; and a plurality of adders, wherein each of the plurality of adders is operatively connected to one of the groups of multipliers, each of the plurality of adders being designed to add the outputs of the multipliers within its respective group in parallel to produce a partial sum.
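The per-layer numerical format described for GB 2 552 243 A above (a fixed-bit-length integer plus one exponent shared by the whole layer) can be sketched as follows. This is a hedged illustration only: the max-magnitude heuristic for choosing the exponent is one plausible way to derive the format from the layer's value distribution, not necessarily the method claimed in that publication.

```python
import numpy as np

def per_layer_format(values, bit_length=8):
    """Quantize a layer's values to signed integers sharing one exponent.

    Each value is represented as integer * 2**exponent, where the
    exponent is chosen per layer from the values' distribution so that
    the largest magnitude still fits in the given bit length.
    """
    max_int = 2 ** (bit_length - 1) - 1          # e.g. 127 for 8 bits
    peak = np.max(np.abs(values))
    # Smallest exponent such that peak / 2**exponent <= max_int.
    exponent = int(np.ceil(np.log2(peak / max_int))) if peak > 0 else 0
    integers = np.clip(np.round(values / 2.0 ** exponent),
                       -max_int - 1, max_int).astype(np.int32)
    return integers, exponent

weights = np.array([0.5, -3.25, 200.0, -117.0])
ints, exp = per_layer_format(weights)
reconstructed = ints * 2.0 ** exp                # approximates the weights
```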
According to one aspect, a method is provided comprising: applying a plurality of weights in parallel to a plurality of input activations by a plurality of equally sized groups of multipliers to generate a plurality of outputs for each group of multipliers; and adding the plurality of outputs from each group of multipliers in parallel to generate a partial sum from each group of multipliers.

According to one aspect, a system is provided comprising: a plurality of multipliers, wherein the plurality of multipliers is arranged in a plurality of equally sized groups; a plurality of adders, wherein each of the plurality of adders is operatively connected to one of the groups of multipliers; and a computer-readable storage medium containing program instructions, wherein the program instructions are executable to perform a method comprising: applying a weight to an input activation in parallel by the plurality of multipliers to produce an output; and adding the outputs of the multipliers within their respective groups in parallel by each of the plurality of adders to produce a partial sum.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

Embodiments of the invention are now described with reference to the accompanying drawings, which are to be understood as merely exemplary and in which:

FIG. 1 depicts an inference processor architecture with multiple neural cores according to embodiments of the present disclosure.

FIG. 2 depicts a massively parallel vector-matrix multiplier for computing partial sums according to embodiments of the present disclosure.
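As a usage illustration of the method aspect above, a weight matrix wider than one group of multipliers can be processed in column tiles, with the vector register accumulating the partial sums produced by successive passes. The tile width and the purely additive combination below are assumptions chosen for the example.

```python
import numpy as np

def matvec_by_tiles(W, x, tile=4):
    """Compute W @ x by repeating the parallel partial-sum step.

    Each pass applies one (G, tile) slice of weights to the matching
    slice of activations, sums within each group of multipliers, and
    combines the result with what the vector register already holds.
    """
    G, N = W.shape
    assert N % tile == 0, "illustration assumes N is a multiple of tile"
    vector_register = np.zeros(G)
    for start in range(0, N, tile):
        products = W[:, start:start + tile] * x[start:start + tile]
        partial_sums = products.sum(axis=1)      # one adder per group
        vector_register += partial_sums          # combine with register
    return vector_register

W = np.arange(32, dtype=float).reshape(4, 8)
x = np.ones(8)
assert np.allclose(matvec_by_tiles(W, x), W @ x)
```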