US-20260127423-A1 - COMPUTE IN MEMORY-BASED MACHINE LEARNING ACCELERATOR ARCHITECTURE
Abstract
Certain aspects of the present disclosure provide techniques for processing machine learning model data with a machine learning task accelerator, including: configuring one or more signal processing units (SPUs) of the machine learning task accelerator to process a machine learning model; providing model input data to the one or more configured SPUs; processing the model input data with the machine learning model using the one or more configured SPUs; and receiving output data from the one or more configured SPUs.
Inventors
- Ren Li
Assignees
- QUALCOMM INCORPORATED
Dates
- Publication Date
- 20260507
- Application Date
- 20251230
Claims (20)
- 1 . A machine learning task accelerator, comprising: one or more mixed signal processing units (MSPUs), each respective MSPU of the one or more MSPUs comprising: a compute-in-memory (CIM) circuit; a local activation buffer connected to the CIM circuit and configured to store activation data for processing by the CIM circuit; one or more analog to digital converters (ADCs) connected to the CIM circuit and configured to convert analog computation result signals from the CIM circuit to digital computation result data; a first nonlinear operation circuit connected to one or more outputs of the one or more ADCs and configured to perform nonlinear processing on the digital computation result data; a hardware sequencer circuit configured to execute instructions received from a host system and control operation of the MSPU; and a local direct memory access (DMA) controller configured to control access to a shared activation buffer; a digital element-wise multiplication and accumulation circuit connected to the one or more MSPUs and configured to: perform element-wise multiplication and element-wise accumulation operations on activation data output from one or more of the one or more MSPUs; and generate an output signal based on the element-wise multiplication and element-wise accumulation operations, wherein the output signal is provided back as input to the digital element-wise multiplication and accumulation circuit via a loop; and a second nonlinear operation circuit connected to the one or more MSPUs, wherein the second nonlinear operation circuit is configured to receive the output signal.
- 2 . The machine learning task accelerator of claim 1 , further comprising one or more digital signal processing units (DSPUs), each respective DSPU of the one or more DSPUs comprising: a DSPU digital multiplication and accumulation (DMAC) circuit configured to perform digital multiplication and accumulation operations; a DSPU local activation buffer connected to the DMAC circuit and configured to store activation data for processing by the DMAC circuit; a DSPU nonlinear operation circuit connected to the DMAC circuit and configured to perform nonlinear processing on data output from the DMAC circuit; a DSPU hardware sequencer circuit configured to execute instructions received from the host system and control operation of the respective DSPU; and a DSPU local direct memory access (DMA) controller configured to control access to a shared activation buffer.
- 3 . The machine learning task accelerator of claim 1 , further comprising a shared activation buffer connected to the one or more MSPUs and configured to store output activation data generated by the one or more MSPUs.
- 4 . The machine learning task accelerator of claim 1 , wherein the first nonlinear operation circuit comprises a cubic approximator and a gain block.
- 5 . The machine learning task accelerator of claim 1 , wherein at least one respective MSPU of the one or more MSPUs further comprises a CIM finite state machine (FSM) configured to control writing of weight data and activation data to the respective MSPU's CIM circuit.
- 6 . The machine learning task accelerator of claim 1 , further comprising a plurality of registers connected to the one or more MSPUs and configured to enable data communication directly between the MSPUs.
- 7 . The machine learning task accelerator of claim 1 , wherein at least one respective MSPU of the one or more MSPUs further comprises a digital post processing circuit configured to apply one of a gain, a bias, a shift or a pooling operation.
- 8 . The machine learning task accelerator of claim 7 , wherein the digital post processing circuit comprises at least one ADC of the one or more ADCs of the respective MSPU.
- 9 . The machine learning task accelerator of claim 1 , further comprising a tiling control circuit configured to: cause weight data for a single layer of a neural network model to be loaded into at least two separate CIM circuits of two separate MSPUs of the one or more MSPUs; receive partial outputs from the two separate MSPUs; and generate a final output based on the partial outputs.
- 10 . The machine learning task accelerator of claim 9 , wherein the tiling control circuit is further configured to control an interconnection of rows between the at least two separate CIM circuits.
- 11 . The machine learning task accelerator of claim 1 , wherein the one or more MSPUs are configured to perform processing of a convolutional neural network layer of a convolutional neural network model.
- 12 . The machine learning task accelerator of claim 11 , wherein the one or more MSPUs are configured to perform processing of a fully connected layer of the convolutional neural network model.
- 13 . The machine learning task accelerator of claim 11 , further comprising: a shared nonlinear operation circuit configured to perform processing of a pointwise convolution of the convolutional neural network layer, wherein: the convolutional neural network layer comprises a depthwise separable convolutional neural network layer, and at least one of the one or more MSPUs is configured to perform processing of a depthwise convolution of the convolutional neural network layer.
- 14 . The machine learning task accelerator of claim 1 , wherein the one or more MSPUs are configured to perform processing of at least one of a recurrent layer of a neural network model, a long short-term memory (LSTM) layer of a neural network model, or a gated recurrent unit (GRU) layer of a neural network model.
- 15 . The machine learning task accelerator of claim 1 , wherein the loop comprises a delay loop.
- 16 . The machine learning task accelerator of claim 1 , further comprising a digital multiplication and accumulation (DMAC) circuit connected to the one or more MSPUs and configured to perform multiplication and accumulation operations on activation data output from one or more of the one or more MSPUs.
- 17 . The machine learning task accelerator of claim 1 , wherein the one or more MSPUs are configured to perform processing of a transformer layer of a neural network model.
- 18 . The machine learning task accelerator of claim 17 , wherein the transformer layer comprises an attention component and a feed forward component.
- 19 . The machine learning task accelerator of claim 1 , further comprising a hardware sequencer memory connected to the hardware sequencer circuit and configured to store the instructions received from the host system.
- 20 . A method of processing machine learning model data with a machine learning task accelerator, comprising: configuring one or more mixed signal processing units (MSPUs) of the machine learning task accelerator to process a machine learning model; providing model input data to the one or more configured MSPUs; processing the model input data with the machine learning model using the one or more configured MSPUs; and receiving output data from the one or more configured MSPUs, wherein the machine learning task accelerator comprises: a digital element-wise multiplication and accumulation circuit connected to the one or more MSPUs and configured to: perform element-wise multiplication and element-wise accumulation operations on activation data output from one or more of the one or more MSPUs; and generate an output signal based on the element-wise multiplication and element-wise accumulation operations, wherein the output signal is provided back as input to the digital element-wise multiplication and accumulation circuit via a loop; and a second nonlinear operation circuit connected to the one or more MSPUs, wherein the second nonlinear operation circuit is configured to receive the output signal.
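For orientation only, the tiled CIM computation recited in claim 9 and the looped element-wise multiply-accumulate recited in claims 1 and 15 can be modeled numerically. The following is an illustrative software sketch of the dataflow, not the claimed mixed-signal hardware; the function names, the LSTM-style gating, and the vector sizes are invented for the example.

```python
import numpy as np

def tiled_cim_layer(weights: np.ndarray, activations: np.ndarray) -> np.ndarray:
    # Claim 9: weight data for a single layer is split across two CIM
    # circuits; each tile produces a partial output over half of the
    # input rows, and the tiling control combines the partials.
    half = weights.shape[0] // 2
    partial_a = activations[:half] @ weights[:half]   # first MSPU's CIM tile
    partial_b = activations[half:] @ weights[half:]   # second MSPU's CIM tile
    return partial_a + partial_b                      # combined final output

def sigmoid(x: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-x))

def ewmac_step(c: np.ndarray, f: np.ndarray, i: np.ndarray,
               g: np.ndarray) -> np.ndarray:
    # Claim 1: element-wise multiplication and accumulation, with the
    # result fed back as the next step's input via the loop of claim 15.
    return f * c + i * g

# LSTM-style cell-state recurrence built from the looped element-wise circuit.
rng = np.random.default_rng(1)
c = np.zeros(4)
for _ in range(3):
    f = sigmoid(rng.standard_normal(4))   # forget gate
    i = sigmoid(rng.standard_normal(4))   # input gate
    g = np.tanh(rng.standard_normal(4))   # candidate values
    c = ewmac_step(c, f, i, g)            # output looped back as next input
h = np.tanh(c)  # second nonlinear operation circuit applied to the looped output
```

Because each partial product above is an exact sum over half of the weight rows, `tiled_cim_layer` reproduces the untiled matrix product; in the claimed hardware, the partial outputs would additionally pass through ADCs before being combined digitally.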
Description
CROSS-REFERENCE TO RELATED APPLICATION(S)
This application is a divisional of U.S. patent application Ser. No. 17/359,297, filed Jun. 25, 2021, which is hereby incorporated by reference herein.
INTRODUCTION
Aspects of the present disclosure relate to improved architectures for performing machine learning tasks, and in particular to compute-in-memory-based architectures for supporting advanced machine learning architectures.
Machine learning is generally the process of producing a trained model (e.g., an artificial neural network, a tree, or another structure), which represents a generalized fit to a set of training data that is known a priori. Applying the trained model to new data produces inferences, which may be used to gain insights into the new data. In some cases, applying the model to the new data is described as “running an inference” on the new data.
As the use of machine learning has proliferated for enabling various machine learning (or artificial intelligence) tasks, the need for more efficient processing of machine learning model data has arisen. In some cases, dedicated hardware, such as machine learning (or artificial intelligence) accelerators, may be used to enhance a processing system's capacity to process machine learning model data. However, such hardware requires space and power, which are not always available on the processing device. For example, “edge processing” devices, such as mobile devices, always-on devices, Internet of Things (IoT) devices, and the like, must balance processing capabilities against power and packaging constraints. Consequently, other aspects of a processing system are being considered for processing machine learning model data. Memory devices are one example of a processing system component that may be leveraged to process machine learning model data through so-called compute-in-memory (CIM) processes.
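The CIM idea described above, keeping weights resident in the memory array and digitizing analog per-column dot products, can be sketched numerically. This is an illustrative model under simplifying assumptions (one ideal, full-scale-calibrated signed ADC per column); the function name and quantization scheme are invented for the example.

```python
import numpy as np

def cim_matvec(weights: np.ndarray, activations: np.ndarray,
               adc_bits: int = 8) -> np.ndarray:
    # Weights stay resident in the memory array; activations drive the
    # array's input lines, each column accumulates an analog dot product,
    # and an ADC quantizes the per-column sums back to digital codes.
    analog_sums = activations @ weights
    full_scale = float(np.max(np.abs(analog_sums))) or 1.0
    levels = 2 ** (adc_bits - 1) - 1          # signed adc_bits-bit ADC
    codes = np.round(np.clip(analog_sums / full_scale, -1.0, 1.0) * levels)
    return codes / levels * full_scale        # dequantized digital result

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 4))
x = rng.standard_normal(16)
y = cim_matvec(w, x)   # tracks x @ w up to ADC quantization error
```

The only deviation from an exact matrix-vector product in this model is the ADC's quantization step, which is why CIM accelerators pair the memory array with digital post-processing rather than replacing it.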
Unfortunately, conventional CIM processes may not be able to perform processing of all aspects of advanced model architectures, such as recurrent neural networks (RNNs), attention models (e.g., attention-based neural networks), bidirectional encoder representations from transformers (BERT) models, and the like. These advanced model architectures have significant utility in many technical domains, including healthcare, natural language processing, speech recognition, self-driving cars, recommender systems, and others. Accordingly, systems and methods are needed for performing computation in memory for a wider variety of machine learning model architectures.
BRIEF SUMMARY
Certain aspects provide a machine learning task accelerator, comprising: one or more mixed signal processing units (MSPUs), each respective MSPU of the one or more MSPUs comprising: a compute-in-memory (CIM) circuit; a local activation buffer connected to the CIM circuit and configured to store activation data for processing by the CIM circuit; one or more analog to digital converters (ADCs) connected to the CIM circuit and configured to convert analog computation result signals from the CIM circuit to digital computation result data; a first nonlinear operation circuit connected to one or more outputs of the one or more ADCs and configured to perform nonlinear processing on the digital computation result data; a hardware sequencer circuit configured to execute instructions received from a host system and control operation of the MSPU; and a local direct memory access (DMA) controller configured to control access to the CIM circuit; a digital multiplication and accumulation (DMAC) circuit connected to the one or more MSPUs and configured to perform multiplication and accumulation operations on activation data output from one or more of the one or more MSPUs; a digital element-wise multiplication and accumulation circuit connected to the one or more MSPUs and configured to perform element-wise multiplication and element-wise accumulation operations on activation data output from one or more of the one or more MSPUs; and a second nonlinear operation circuit connected to the one or more MSPUs.
Further aspects provide a method of processing machine learning model data with a machine learning task accelerator, comprising: configuring one or more mixed signal processing units (MSPUs) of the machine learning task accelerator to process a machine learning model; providing model input data to the one or more configured MSPUs; processing the model input data with the machine learning model using the one or more configured MSPUs; and receiving output data from the one or more configured MSPUs.
Other aspects provide processing systems configured to perform the aforementioned methods as well as those described herein; non-transitory, computer-readable media comprising instructions that, when executed by one or more processors of a processing system, cause the processing system to perform the aforementioned methods as well as those described herein; a computer program product embodied on a