EP-3607496-B1 - CONDITIONAL GRAPH EXECUTION BASED ON PRIOR SIMPLIFIED GRAPH EXECUTION
Inventors
- BAJIC, Ljubisa
- TRAJKOVIC, Milos
- HAMER, Ivan
Dates
- Publication Date: 2026-05-06
- Application Date: 2018-04-06
Claims (12)
- A computer-implemented method for executing a directed graph, wherein (i) the directed graph is a neural network including a set of weights including at least one weight tensor, (ii) the directed graph includes a set of vertices and a set of edges interconnecting the set of vertices, (iii) the set of edges of the directed graph are calculations involving the set of weights for the neural network, (iv) at least a subset of the set of vertices are weights for the neural network, (v) a conditional execution of the directed graph produces an inference tensor, and (vi) the inference tensor is a response of the neural network to a live input tensor, and in which each step is conducted by a processor, wherein the method comprises: deriving (201) a simplified version of the directed graph, including reducing a number of bits used to represent a set of original values of the directed graph to obtain a set of replacement values for the simplified version of the directed graph; applying (202) a pilot input tensor to the simplified version of the directed graph; obtaining (203) a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph; storing the execution data in a distributed set of memory locations, wherein (i) the execution data includes a set of execution data values, (ii) the set of execution data values and the set of vertices have uniquely corresponding elements, (iii) each uniquely corresponding vertex in the set of vertices produces a contribution to the inference tensor in response to the pilot input tensor, and (iv) each execution data value in the set of execution data values is proportional in magnitude to the contribution to the inference tensor of each uniquely corresponding vertex in the set of vertices; generating a markup of the directed graph using the collection of execution data, wherein the markup identifies a priority value for the at least one weight tensor; applying (205) the 
live input tensor to the directed graph; obtaining, from a memory location in the distributed set of memory locations using a single address, both: (i) a subset of execution data from the execution data; and (ii) the at least one weight tensor from the set of weights; conditioning (204) the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data and using the markup, and wherein conditioning (204) the execution of the directed graph comprises reducing an accuracy of a computation using the at least one weight tensor based on the priority value; and obtaining (206) an output tensor from the conditional execution of the directed graph; wherein the conditioning (204) of the execution of the directed graph is conducted in real time using the execution data and the set of weights.
- The computer-implemented method from claim 1, wherein: the pilot input tensor and the live input tensor are not identical; and the pilot input tensor and the live input tensor are stochastically dependent.
- The computer-implemented method from claim 1, further comprising: priming the directed graph for the conditional execution, prior to the conditional execution of the directed graph, using the stored execution data.
- The computer-implemented method from claim 1, wherein: an edge in the set of edges is a calculation using a four dimensional tensor.
- The computer-implemented method from claim 1, wherein: the set of original values refers to the set of weights; the deriving of the simplified version of the directed graph includes replacing the set of original values with the set of replacement values; and the simplified version of the directed graph has a same number of layers as the directed graph.
- The computer-implemented method from claim 5, wherein the replacing comprises: calculating the set of replacement values using a set of exponents of the set of original values.
- The computer-implemented method from claim 1, further comprising: generating a markup of the directed graph using the collection of execution data, wherein the markup tags the directed graph with different levels of priority; storing the markup in the distributed set of memory locations; and conditioning an update of the set of weights using the markup.
- The computer-implemented method from claim 1, further comprising: storing the markup in the distributed set of memory locations, wherein the markup tags the directed graph with different levels of priority; and obtaining the priority value and the at least one weight tensor from the memory location in the distributed set of memory locations using the single address.
- A processor, operating with a computer-readable non-transitory medium storing instructions, which when executed by the processor, cause the processor to conduct a method for generating an inference from a neural network, wherein (i) the neural network includes a set of weights including at least one weight tensor, (ii) the neural network includes a set of vertices and a set of edges interconnecting the set of vertices, (iii) the set of edges of the neural network are calculations involving the set of weights for the neural network, (iv) at least a subset of the set of vertices are weights for the neural network, (v) a conditional execution of the neural network produces an inference tensor, and (vi) the inference tensor is a response of the neural network to a second input, the method comprising: deriving a simplified version of the neural network, including reducing a number of bits used to represent a set of original values of the neural network to obtain a set of replacement values for the simplified version of the neural network; applying a first input to the simplified version of the neural network; obtaining a collection of execution data during the application of the first input to the neural network; storing the execution data in a distributed set of memory locations, wherein (i) the execution data includes a set of execution data values, (ii) the set of execution data values and the set of vertices have uniquely corresponding elements, (iii) each uniquely corresponding vertex in the set of vertices produces a contribution to the inference tensor in response to the first input, and (iv) each execution data value in the set of execution data values is proportional in magnitude to the contribution to the inference tensor of each uniquely corresponding vertex in the set of vertices; generating a markup of the neural network using the collection of execution data, wherein the markup identifies a priority value for the at least one weight tensor; applying the 
second input to the neural network; obtaining, from a memory location in the distributed set of memory locations using a single address, both: (i) a subset of execution data from the execution data; and (ii) the at least one weight tensor from the set of weights; conditioning a computation of the neural network, during the application of the second input to the neural network, using the collection of execution data and using the markup, and wherein conditioning the execution of the neural network comprises reducing an accuracy of the computation using the at least one weight tensor based on the priority value; and obtaining the inference from the conditional computation of the neural network; wherein the conditional computation of the neural network is conducted in real time using the execution data and the set of weights.
- A computer-implemented method for executing a directed graph, wherein (i) the directed graph is a neural network including a set of weights including at least one weight tensor, (ii) the directed graph includes a set of vertices and a set of edges interconnecting the set of vertices, (iii) the set of edges of the directed graph are calculations involving the set of weights for the neural network, (iv) at least a subset of the set of vertices are weights for the neural network, (v) a conditional execution of the directed graph produces an inference tensor, and (vi) the inference tensor is a response of the neural network to a live input tensor, and in which each step is conducted by a processor, wherein the method comprises: deriving (201) a simplified version of the directed graph, including utilizing polynomial interpolation to produce a function that gives an approximation of the at least one weight tensor; applying (202) a pilot input tensor to the simplified version of the directed graph; obtaining (203) a collection of execution data during the application of the pilot input tensor to the simplified version of the directed graph; storing the execution data in a distributed set of memory locations, wherein (i) the execution data includes a set of execution data values, (ii) the set of execution data values and the set of vertices have uniquely corresponding elements, (iii) each uniquely corresponding vertex in the set of vertices produces a contribution to the inference tensor in response to the pilot input tensor, and (iv) each execution data value in the set of execution data values is proportional in magnitude to the contribution to the inference tensor of each uniquely corresponding vertex in the set of vertices; generating a markup of the directed graph using the collection of execution data, wherein the markup identifies a priority value for the at least one weight tensor; applying (205) the live input tensor to the directed graph; obtaining, from a 
memory location in the distributed set of memory locations using a single address, both: (i) a subset of execution data from the execution data; and (ii) the at least one weight tensor from the set of weights; conditioning (204) the execution of the directed graph, during the application of the live input tensor to the directed graph, using the collection of execution data and using the markup, and wherein conditioning (204) the execution of the directed graph comprises reducing an accuracy of a computation using the at least one weight tensor based on the priority value; and obtaining (206) an output tensor from the conditional execution of the directed graph; wherein the conditioning (204) of the execution of the directed graph is conducted in real time using the execution data and the set of weights.
- The computer-implemented method from claim 10, wherein: the deriving of the simplified version of the directed graph includes down-sampling the directed graph by a sampling factor; the simplified version of the directed graph is thereby a down-sampled version of the directed graph; a first complete set of tensors used for executing the simplified version of the directed graph has a rank; and a second complete set of tensors used for executing the directed graph has the rank.
- The computer-implemented method from claim 10, wherein: the deriving of the simplified version of the directed graph utilizes polynomial interpolation to produce a function that gives an approximation of the at least one weight tensor; and the function is a polynomial with a number of coefficients equal to one plus an order of the polynomial.
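The flow recited in claim 1 can be sketched in Python under simplifying assumptions: the directed graph is modeled as a small stack of fully connected layers, "reducing a number of bits" is modeled as rounding weights onto a low-precision uniform grid, the "execution data" is the mean absolute activation of each layer under the pilot input, and "reducing an accuracy of a computation" is modeled as reusing the low-precision weights for low-priority layers during the live run. All names (`quantize`, `run`, `markup`, etc.) are illustrative and do not come from the patent; this is a sketch of the idea, not the claimed implementation.

```python
import numpy as np

def quantize(w, bits=4):
    # Reduce the number of bits used to represent the weights:
    # round each weight onto a 2**bits-level uniform grid.
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

def run(layers, x, record=None):
    # Execute the directed graph layer by layer; optionally record
    # per-layer "execution data" (mean absolute activation), a proxy
    # for each vertex's contribution to the inference tensor.
    for i, w in enumerate(layers):
        x = np.maximum(w @ x, 0.0)  # ReLU vertex
        if record is not None:
            record[i] = float(np.abs(x).mean())
    return x

rng = np.random.default_rng(0)
layers = [rng.standard_normal((8, 8)) for _ in range(3)]

# 1. Derive a simplified (reduced-precision) version of the graph.
simple = [quantize(w) for w in layers]

# 2-3. Apply a pilot input tensor and collect execution data.
pilot = rng.standard_normal(8)
exec_data = {}
run(simple, pilot, record=exec_data)

# 4. Generate a markup: layers whose recorded contribution falls
#    below the median are tagged low priority.
cutoff = np.median(list(exec_data.values()))
markup = {i: ("high" if v >= cutoff else "low") for i, v in exec_data.items()}

# 5-6. Condition the live execution: low-priority layers run at
#      reduced accuracy (quantized weights), high-priority layers
#      run at full precision.
live = rng.standard_normal(8)
conditioned = [w if markup[i] == "high" else simple[i] for i, w in enumerate(layers)]
output = run(conditioned, live)
print(output.shape)  # (8,)
```

The design choice mirrored here is that the pilot run is cheap (low-precision) while the live run spends full precision only where the pilot run indicated it matters.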
Description
BACKGROUND
The recent surge in the performance of machine intelligence systems is not due to the development of revolutionary new algorithms. Indeed, the core algorithms used in machine intelligence applications today stem from a body of work that is now over half a century old. Instead, the recent surge has been fueled by improvements in the hardware and software that implement machine intelligence algorithms in an efficient manner. Algorithms that were once too computationally intensive to implement in a useful manner with even the most sophisticated of computers can now be executed with specialized hardware on an individual user's smartphone. The improvements in hardware and software take various forms. For example, graphical processing units traditionally used to process the vectors used to render polygons for computer graphics have been repurposed in an efficient manner to manipulate the data elements used in machine intelligence processes. As another example, certain classes of hardware have been designed from the ground up to implement machine intelligence algorithms by using specialized processing elements such as systolic arrays. Further advances have centered around using collections of transistors and memory elements to mimic, directly in hardware, the behavior of neurons in a traditional artificial neural network (ANN). There is no question that the field of machine intelligence has benefited greatly from these improvements. However, despite the intense interest directed to these approaches, machine intelligence systems still represent one of the most computationally and energy intensive computing applications of the modern age, and present a field that is ripe for further advances. The reason machine intelligence applications are so resource hungry is that the data structures being operated on are generally very large, and the number of discrete primitive computations that must be executed on each of those data structures is likewise immense.
A traditional ANN takes in an input vector, conducts calculations using the input vector and a set of weight vectors, and produces an output vector. Each weight vector in the set of weight vectors is often referred to as a layer of the network, and the output of each layer serves as the input to the next layer. In a traditional network, the layers are fully connected, which requires every element of the input vector to be involved in a calculation with every element of the weight vector. Therefore, the number of calculations involved increases with a power law relationship to the size of each layer. Furthermore, this aspect of machine intelligence algorithms makes them difficult to parallelize because the calculations for each layer depend on the output of the prior layer.

The problems mentioned in the prior paragraph are further exacerbated by modern ANNs. Modern ANN approaches are often referred to in the industry and literature as "deep learning" approaches. This is often a reference to the large number of layers involved, or to the complexity of the relationships between the outputs of one layer and the inputs of the other layers. For example, in a modern deep learning ANN, the outputs of a downstream layer could be fed back to a prior layer, which thereby adds a recursive element to the overall computation. Both the increase in layers and the additional complexity associated with recursive relationships between the layers increase the computational resources needed to implement a modern ANN.

Fig. 1 illustrates a directed graph 100 for the computation of a modern machine intelligence system. The input to directed graph 100 is an input tensor X. The output of directed graph 100 is an output tensor Y. The input could be an encoding for a picture, such as an image of a cat 101. In this example, execution of directed graph 100 involves the graph providing an encoding of a textual guess as to the content of the encoded image.
The graph output can be referred to as an inference generated by the directed graph because the machine intelligence system is effectively inferring what the picture shows from the encoding of the picture. As such, if directed graph 100 represented a properly trained machine intelligence system, execution of graph 100 with input tensor X would produce an output tensor Y which encoded the word "CAT" as illustrated. The edges of directed graph 100 represent calculations that must be conducted to execute the graph. In this example, the graph is broken into two sections: a convolutional section 102 and a fully connected section 103. The convolutional portion can be referred to as a convolutional neural network (CNN). The vertices in the directed graph of CNN 102 form a set of layers which includes layers 106, 107, and 108. The layers each include sets of tensors such as tensors 109, 110, and 111. The vertices in the directed graph of fully connected section 103 also form a set of layers which includes layers 112 and 113.
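The power-law growth in work for fully connected layers noted in the background can be made concrete with a short count of multiply-accumulate operations. The layer sizes below are illustrative, not taken from the patent:

```python
def fully_connected_macs(in_size, out_size):
    # Every input element participates in a calculation with every
    # row of the weight matrix, so a fully connected layer costs
    # in_size * out_size multiply-accumulate operations.
    return in_size * out_size

# Doubling the width of a square layer quadruples the work:
print(fully_connected_macs(256, 256))  # 65536
print(fully_connected_macs(512, 512))  # 262144
```

This quadratic scaling in layer width is one reason conditional execution, which skips or degrades low-priority computations, can yield large savings on the live run.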