US-12626113-B2 - Neural processing units (NPUs) and computational systems employing the same
Abstract
Introduced here are integrated circuits (also referred to as “chips”) that can be implemented in a neural processing unit. At a high level, the goal of these chips is to provide higher performance for machine learning algorithms than conventional processing units would. To accomplish this, the neural processing unit can include multiple computing components, each of which is able to independently determine the overlap between encoded data provided as input and values stored in a memory.
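The overlap determination described in the abstract can be illustrated with a minimal software sketch. The bit-vector encoding, the function name, and the thresholded comparison are assumptions drawn from the claims, not a definitive rendering of the hardware:

```python
def overlap(encoded_input, stored_values, threshold):
    """Count positions where a binary-encoded input coincides with a
    stored multi-bit value that exceeds a strength threshold.

    encoded_input: iterable of 0/1 bits (the encoded data).
    stored_values: iterable of multi-bit values held in memory.
    threshold: the comparator's programmable cutoff (an assumption here).
    """
    return sum(
        1
        for bit, strength in zip(encoded_input, stored_values)
        if bit and strength > threshold
    )
```

For example, with input bits `[1, 0, 1, 1]` and stored values `[9, 9, 2, 7]` at a threshold of 5, positions 0 and 3 contribute, giving an overlap of 2.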
Inventors
- Harold B Noyes
- David Roberts
- Russell Lloyd
- William Tiffany
- Jeffery Tanner
- Terrence Leslie
- Daniel Skinner
- Indranil Roy
Assignees
- Natural Intelligence Systems, Inc.
Dates
- Publication Date: 2026-05-12
- Application Date: 2021-11-19
Claims (20)
- 1. A neural processing unit comprising: a memory in which an array of multi-bit values is stored; a comparator circuit configured to compare each of the multi-bit values against a threshold, so as to produce a signal as output; and a calculator circuit configured to identify, based on the signal, each multi-bit value that exceeds the threshold, determine a count of the identified multi-bit values, and modulate the count by (i) multiplying the count by a programmable operand and (ii) adding the count to the programmable operand, so as to produce a boosted count.
- 2. The neural processing unit of claim 1, wherein the programmable operand is a multi-bit value that is provided to the calculator circuit as input.
- 3. The neural processing unit of claim 1, further comprising: a math unit configured to implement an algorithm that, in operation, indicates whether the multi-bit values should be updated.
- 4. The neural processing unit of claim 3, wherein the math unit is representative of an arithmetic logic unit that determines, for each multi-bit value, whether an update is necessary based on one or more inputs.
- 5. The neural processing unit of claim 4, wherein the arithmetic logic unit includes an adder that adds the one or more inputs to produce a signal that indicates, for each multi-bit value, whether an update is necessary.
- 6. The neural processing unit of claim 1, wherein the memory, the comparator circuit, and the calculator circuit are collectively representative of a computing component, wherein the computing component is one of multiple computing components, and wherein the neural processing unit further comprises: an activity monitor circuit configured to monitor a number of times that the boosted count is among a programmable number of highest boosted counts output by the multiple computing components.
- 7. The neural processing unit of claim 6, further comprising: a boosting factor table in which the programmable operand is stored.
- 8. The neural processing unit of claim 7, further comprising: an update control circuit configured to evaluate an output produced by the activity monitor circuit to determine whether an update of the programmable operand is necessary.
- 9. The neural processing unit of claim 8, wherein in response to a determination that the number of times falls below a lower bound of a window, the activity monitor circuit is further configured to generate an instruction to increase the programmable operand, and the update control circuit is further configured to generate a load command based on the instruction.
- 10. The neural processing unit of claim 8, wherein in response to a determination that the number of times exceeds an upper bound of a window, the activity monitor circuit is further configured to generate an instruction to decrease the programmable operand, and the update control circuit is further configured to generate a load command based on the instruction.
- 11. A method performed by a neural processing unit, the method comprising: storing, by a memory, an array of multi-bit values; comparing, by a comparator circuit, each of the multi-bit values against a threshold, so as to produce a signal as output; identifying, by a calculator circuit and based on the signal, each multi-bit value that exceeds the threshold; determining a count of the identified multi-bit values; and modulating the count by (i) multiplying the count by a programmable operand and (ii) adding the count to the programmable operand, so as to produce a boosted count.
- 12. The method of claim 11, wherein the programmable operand is a multi-bit value that is provided to the calculator circuit as input.
- 13. The method of claim 11, further comprising: implementing, by a math unit, an algorithm that, in operation, indicates whether the multi-bit values should be updated.
- 14. The method of claim 13, wherein the math unit is representative of an arithmetic logic unit that determines, for each multi-bit value, whether an update is necessary based on one or more inputs.
- 15. The method of claim 14, wherein the arithmetic logic unit includes an adder that adds the one or more inputs to produce a signal that indicates, for each multi-bit value, whether an update is necessary.
- 16. The method of claim 11, wherein the memory, the comparator circuit, and the calculator circuit are collectively representative of a computing component, wherein the computing component is one of multiple computing components, and wherein the method further comprises: monitoring, by an activity monitor circuit, a number of times that the boosted count is among a programmable number of highest boosted counts output by the multiple computing components.
- 17. The method of claim 16, further comprising: storing, by a boosting factor table, the programmable operand.
- 18. The method of claim 17, further comprising: evaluating, by an update control circuit, an output produced by the activity monitor circuit to determine whether an update of the programmable operand is necessary.
- 19. The method of claim 18, further comprising: in response to determining that the number of times falls below a lower bound of a window, generating, by the activity monitor circuit, an instruction to increase the programmable operand, and generating, by the update control circuit, a load command based on the instruction.
- 20. The method of claim 18, further comprising: in response to determining that the number of times exceeds an upper bound of a window, generating, by the activity monitor circuit, an instruction to decrease the programmable operand, and generating, by the update control circuit, a load command based on the instruction.
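The boosted-count computation of claim 1, together with the activity monitoring of claim 6 and the window-based boost adjustment of claims 9 and 10, can be sketched in software. All names here are illustrative, the one-step increment/decrement is an assumption, and the claims leave the exact combination of the multiply and add in steps (i) and (ii) open; the reading below is one plausible interpretation:

```python
from dataclasses import dataclass

@dataclass
class ComputingComponent:
    """Software sketch of one computing component from claim 1: a memory
    of multi-bit values, a comparator circuit, and a calculator circuit."""
    values: list           # array of multi-bit values stored in memory
    boost: int = 1         # programmable operand (boost factor)
    active_count: int = 0  # times this component was among the top k

    def boosted_count(self, threshold):
        # Comparator circuit: flag each value that exceeds the threshold.
        flags = [v > threshold for v in self.values]
        # Calculator circuit: count the flagged values, then modulate the
        # count. One reading of steps (i) and (ii): multiply the count by
        # the operand, then add the operand.
        count = sum(flags)
        return count * self.boost + self.boost

def update_boosts(components, threshold, k, lower, upper):
    """Sketch of claims 6, 9, and 10: track which components produce the
    k highest boosted counts, then nudge each boost factor toward a
    target activity window [lower, upper]."""
    counts = [(c.boosted_count(threshold), c) for c in components]
    for _, c in sorted(counts, key=lambda t: t[0], reverse=True)[:k]:
        c.active_count += 1  # activity monitor circuit
    for c in components:
        if c.active_count < lower:
            c.boost += 1                    # under-active: increase (claim 9)
        elif c.active_count > upper:
            c.boost = max(1, c.boost - 1)   # over-active: decrease (claim 10)
```

For instance, a component storing `[5, 1, 7]` with a boost factor of 3 and a threshold of 4 yields a count of 2 and a boosted count of 2 × 3 + 3 = 9.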
Description
CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to U.S. Provisional Application No. 63/116,608, titled “Neural Processing Units (NPUs) and Artificial Intelligence (AI) and/or Machine Learning (ML) Systems Employing the Same” and filed on Nov. 20, 2020, and U.S. Provisional Application No. 63/227,590, titled “Explainable Machine Learning (ML) and Artificial Intelligence (AI) Methods and Systems Using Encoders, Neural Processing Units (NPUs), and Classifiers” and filed on Jul. 30, 2021, each of which is incorporated by reference herein in its entirety.

TECHNICAL FIELD

Various embodiments concern processing units with hardware architectures suitable for artificial intelligence and machine learning processes, as well as computational systems capable of employing the same.

BACKGROUND

Historically, artificial intelligence (AI) and machine learning (ML) processes have been implemented by computational systems (or simply “systems”) that execute sophisticated software using conventional processing units, such as central processing units (CPUs) and graphics processing units (GPUs). While the hardware architectures of these conventional processing units are able to execute the necessary computations, actual performance is slow relative to desired performance. Simply put, performance is impacted because too much data and too many computations are required.

This impact on performance can have significant ramifications. As an example, if performance suffers to such a degree that delay occurs, then AI and ML processes may not be implementable in certain situations. For instance, delays of less than one second may prevent implementation of AI and ML processes where timeliness is necessary, such as for automated driving systems where real-time AI and ML processing affects passenger safety. Another real-time example is military targeting systems, where friend-or-foe decisions must be made and acted upon before loss of life occurs.
Any scenario where real-time decisions can impact life, safety, or capital assets is an application where faster AI and ML processing is needed. Entities have historically attempted to address this impact on performance by increasing the computational resources that are available to the system. There are several drawbacks to this approach, however. First, increasing the computational resources may be impractical or impossible. This is especially true if the AI and ML processes are intended to be implemented by systems that are included in computing devices such as mobile phones, tablet computers, and the like. Second, increasing the computational resources will lead to an increase in power consumption. The power available to a system can be limited (e.g., due to battery constraints), so limiting power consumption is an important aspect of developing new technologies.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 includes a diagrammatic illustration of a hardware-based architecture of a digital neuron that is implementable in a neural processing unit (NPU).

FIG. 2 includes a diagrammatic illustration of a hardware-based architecture of a digital neuron that is able to implement a basic learning mechanism.

FIG. 3 includes a diagrammatic illustration of a hardware-based architecture of a digital neuron that is able to implement an enhanced learning mechanism.

FIG. 4 includes a diagrammatic illustration of a hardware-based architecture of a digital neuron that is able to perform a learning process locally, so as to determine and then implement adjustments to synaptic strength values (SSVs) stored in memory as necessary.

FIG. 5 includes a simplified block diagram of one possible implementation of the update math unit of FIG. 4.

FIG. 6 includes a diagrammatic illustration of a hardware-based architecture of a digital neuron that can locally update a boost factor in an accelerated manner.

FIG. 7 includes a diagrammatic illustration of the activity monitor circuit of FIG. 6.
Features of the technology described herein will become more apparent to those skilled in the art from a study of the Detailed Description in conjunction with the drawings. Various embodiments are depicted in the drawings for the purpose of illustration. However, those skilled in the art will recognize that alternative embodiments may be employed without departing from the principles of the present disclosure. Accordingly, although specific embodiments are shown in the drawings, the technology is amenable to various modifications.

DETAILED DESCRIPTION

Introduced here are integrated circuit devices (also referred to as “chips”) that can be implemented in a neural processing unit. The terms “neural processing unit,” “neural processor,” and “NPU” may be used to refer to an electronic circuit that is designed to implement some or all of the control and arithmetic logic necessary to execute ML algorithms, usually with a separate data memory (or simply “memory”) and dedicated instruction set architecture. At a high level