US-12619678-B2 - Acceleration of 1X1 convolutions in convolutional neural networks

Abstract

A convolutional accelerator includes a feature line buffer, a kernel buffer, a multiply-accumulate (MAC) cluster, and mode control circuitry. In a first mode of operation, the mode control circuitry stores feature data in the feature line buffer and stores kernel data in the kernel buffer. The data stored in the buffers is transferred to the MAC cluster of the convolutional accelerator for processing. In a second mode of operation, the mode control circuitry stores feature data in the kernel buffer and stores kernel data in the feature line buffer. The data stored in the buffers is transferred to the MAC cluster of the convolutional accelerator for processing. The second mode of operation may be employed to efficiently process 1×N kernels, where N is an integer greater than or equal to 1.
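
The buffer-role swap described in the abstract can be sketched in software as follows. This is a toy model for illustration only, not the disclosed circuitry: the names `Mode`, `AcceleratorModel`, `store`, `mac_operands`, and `multiply_accumulate` are hypothetical, and the buffers are modeled as plain lists with no size limits or timing.

```python
from enum import Enum


class Mode(Enum):
    FIRST = 1   # conventional roles: features in the feature line buffer
    SECOND = 2  # swapped roles: features in the kernel buffer


class AcceleratorModel:
    """Hypothetical software model of the two buffers and the mode control."""

    def __init__(self, mode):
        self.mode = mode
        self.feature_line_buffer = []
        self.kernel_buffer = []

    def store(self, feature_data, kernel_data):
        # The mode decides which physical buffer receives which kind of data.
        if self.mode is Mode.FIRST:
            self.feature_line_buffer = list(feature_data)
            self.kernel_buffer = list(kernel_data)
        else:
            self.kernel_buffer = list(feature_data)
            self.feature_line_buffer = list(kernel_data)

    def mac_operands(self):
        # Return (feature_data, kernel_data) as handed to the MAC cluster.
        if self.mode is Mode.FIRST:
            return self.feature_line_buffer, self.kernel_buffer
        return self.kernel_buffer, self.feature_line_buffer


def multiply_accumulate(feature_data, kernel_data):
    # A single multiply-accumulate over paired operands.
    return sum(f * k for f, k in zip(feature_data, kernel_data))


features = [1, 2, 3]
kernels = [4, 5, 6]
for mode in Mode:
    model = AcceleratorModel(mode)
    model.store(features, kernels)
    print(mode.name, multiply_accumulate(*model.mac_operands()))
```

In this sketch both modes deliver the same operands to the MAC step, so the result is unchanged; only the buffer that holds each kind of data differs.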

Inventors

Michele Rossi
Thomas Boesch
Giuseppe Desoli

Assignees

STMICROELECTRONICS S.R.L.
STMICROELECTRONICS INTERNATIONAL N.V.

Dates

Publication Date: 2026-05-05
Application Date: 2022-06-23

Claims (20)

1. A convolutional accelerator, comprising: a feature line buffer; a kernel buffer separate from the feature line buffer; a Multiply-ACcumulate (MAC) cluster; and mode control circuitry coupled to the feature line buffer, the kernel buffer and the MAC cluster, wherein the mode control circuitry: in a first mode of operation of the convolutional accelerator: stores feature data in the feature line buffer; stores kernel data in the kernel buffer; transfers feature data from the feature line buffer to the MAC cluster; and transfers kernel data from the kernel buffer to the MAC cluster; and in a second mode of operation of the convolutional accelerator: stores feature data in the kernel buffer; stores kernel data in the feature line buffer; transfers feature data from the kernel buffer to the MAC cluster; and transfers kernel data from the feature line buffer to the MAC cluster.
2. The convolutional accelerator of claim 1, wherein the mode control circuitry, in the first mode of operation: stores three lines of feature line data having a depth of up to 1024 elements in the feature line buffer; and stores 3×3 kernels in the kernel buffer.
3. The convolutional accelerator of claim 2, wherein the mode control circuitry, in the second mode of operation: stores six lines of feature line data having a depth of up to 128 elements in the kernel buffer; and stores 1×1 kernels in the feature line buffer.
4. The convolutional accelerator of claim 3, wherein the mode control circuitry, in the second mode of operation: transfers three lines of feature line data from the kernel buffer to the MAC clusters in a cycle; and transfers twenty-four kernel data values to the MAC clusters in the cycle.
5. The convolutional accelerator of claim 4, wherein the MAC clusters, in operation, generate 72 output values in the cycle.
6. The convolutional accelerator of claim 1, wherein: the feature line buffer is a single-port memory; and the kernel buffer comprises a plurality of dual-port buffers.
7. The convolutional accelerator of claim 6, wherein the mode control circuitry, in the second mode of operation: stores feature line data in a first subset of the plurality of dual-port buffers; and buffers kernel data in a second subset of the plurality of dual-port buffers.
8. The convolutional accelerator of claim 7, wherein the buffering kernel data in the second subset of the plurality of dual-port buffers comprises: storing kernel data in a first dual-port buffer of the second subset; transferring kernel data from the first dual-port buffer of the second subset to the feature line buffer; transferring kernel data from the feature line buffer to a second dual-port buffer of the second subset; and transferring kernel data from the second dual-port buffer of the second subset to the MAC clusters.
9. The convolutional accelerator of claim 7, wherein the buffering kernel data in the second subset of the plurality of dual-port buffers comprises: transferring kernel data from the feature line buffer to a dual-port buffer of the second subset of dual-port buffers; and transferring kernel data from the dual-port buffer of the second subset of dual-port buffers to the MAC clusters.
10. The convolutional accelerator of claim 1, wherein the mode control circuitry, in the second mode of operation, serializes output values generated by the MAC clusters.
11. The convolutional accelerator of claim 1, comprising a configuration register, wherein the mode control circuitry, in operation, determines whether to operate in the first mode of operation or the second mode of operation based on a configuration parameter stored in the configuration register.
12. The convolutional accelerator of claim 1, wherein in the second mode of operation, the kernel data has a size of 1×N, where N is an integer greater than or equal to 1.
13. A system, comprising: a stream engine, which, in operation, streams feature and kernel data; and a convolutional accelerator coupled to the stream engine, wherein the convolutional accelerator, in operation, receives streams of feature and kernel data from the stream engine, the convolutional accelerator including: a feature line buffer; a kernel buffer; a Multiply-ACcumulate (MAC) cluster coupled to the feature line buffer and the kernel buffer; and mode control circuitry coupled to the feature line buffer, the kernel buffer and the MAC cluster, wherein the mode control circuitry: in a first mode of operation of the convolutional accelerator: stores feature data in the feature line buffer; stores kernel data in the kernel buffer; transfers feature data from the feature line buffer to the MAC cluster; and transfers kernel data from the kernel buffer to the MAC cluster; and in a second mode of operation of the convolutional accelerator: stores feature data in the kernel buffer; stores kernel data in the feature line buffer; transfers feature data from the kernel buffer to the MAC cluster; and transfers kernel data from the feature line buffer to the MAC cluster.
14. The system of claim 13, wherein the mode control circuitry: in the first mode of operation: stores three lines of feature line data having a depth of up to 1024 elements in the feature line buffer; and in the second mode of operation: stores six lines of feature line data having a depth of up to 128 elements in the kernel buffer; and stores 1×N kernels in the feature line buffer, where N is an integer greater than or equal to 1.
15. The system of claim 14, wherein the mode control circuitry, in the second mode of operation: transfers three lines of feature line data from the kernel buffer to the MAC clusters in a cycle; and transfers twenty-four kernel data values to the MAC clusters in the cycle.
16. The system of claim 13, wherein: the feature line buffer is a single-port memory; and the kernel buffer comprises a plurality of dual-port buffers.
17. The system of claim 16, wherein the mode control circuitry, in the second mode of operation: stores feature line data in a first subset of the plurality of dual-port buffers; and buffers kernel data in a second subset of the plurality of dual-port buffers.
18. A method, comprising: streaming feature data and kernel data to a convolutional accelerator; and convolving streamed kernel data with streamed feature data, the convolving including: in a first mode of operation of the convolutional accelerator: storing feature data in a feature line buffer of the convolutional accelerator; storing kernel data in a kernel buffer of the convolutional accelerator; transferring feature data from the feature line buffer to a MAC cluster of the convolutional accelerator; and transferring kernel data from the kernel buffer to the MAC cluster; and in a second mode of operation of the convolutional accelerator: storing feature data in the kernel buffer; storing kernel data in the feature line buffer; transferring feature data from the kernel buffer to the MAC cluster; and transferring kernel data from the feature line buffer to the MAC cluster.
19. The method of claim 18, wherein: the first mode of operation includes storing three lines of feature line data having a depth of up to 1024 elements in the feature line buffer, and storing 3×3 kernels in the kernel buffer; and the second mode of operation includes storing six lines of feature line data having a depth of up to 128 elements in the kernel buffer, and storing 1×N kernels in the feature line buffer, where N is an integer greater than or equal to 1.
20. The method of claim 18, wherein: the kernel buffer comprises a plurality of dual-port buffers; and in the second mode of operation: the storing feature data in the kernel buffer comprises storing feature data in a first subset of the plurality of dual-port buffers; and the storing kernel data in the feature line buffer comprises buffering kernel data in a second subset of the plurality of dual-port buffers.
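
The per-cycle figures recited in claims 4 and 5 (three lines of feature data, twenty-four kernel data values, 72 output values) are consistent with each kernel value being multiplied with one feature value from each line, since 3 × 24 = 72. The claims state the counts but not this pairing, so the pairing is an assumption made here for illustration. The sketch below models a 1×1 convolution under that assumption; `cycle_products` and `conv1x1` are hypothetical names, and the sample data is arbitrary.

```python
LINES_PER_CYCLE = 3           # feature lines transferred per cycle (claim 4)
KERNEL_VALUES_PER_CYCLE = 24  # kernel data values transferred per cycle (claim 4)


def cycle_products(line_values, kernel_values):
    """Assumed pairing: every kernel value times one value from every line."""
    if len(line_values) != LINES_PER_CYCLE:
        raise ValueError("expected one feature value per line")
    if len(kernel_values) != KERNEL_VALUES_PER_CYCLE:
        raise ValueError("expected 24 kernel values")
    return [[f * k for k in kernel_values] for f in line_values]


def conv1x1(lines, kernels):
    """1x1 convolution: one accumulator per (line, kernel) pair.

    lines:   3 feature vectors, one per line, each of length `depth`
    kernels: 24 kernel vectors (1x1 kernels), each of length `depth`
    """
    depth = len(lines[0])
    acc = [[0] * len(kernels) for _ in lines]
    for c in range(depth):  # one modeled cycle per depth element
        products = cycle_products([line[c] for line in lines],
                                  [kernel[c] for kernel in kernels])
        for i, row in enumerate(products):
            for j, p in enumerate(row):
                acc[i][j] += p
    return acc


depth = 4
lines = [[line + c for c in range(depth)] for line in range(LINES_PER_CYCLE)]
kernels = [[(k + c) % 5 for c in range(depth)]
           for k in range(KERNEL_VALUES_PER_CYCLE)]
outputs = conv1x1(lines, kernels)
print(sum(len(row) for row in outputs))  # 3 lines x 24 kernels = 72 outputs
```

Each entry of `outputs` equals the dot product of one line's feature vector with one kernel vector, which is what a 1×1 kernel computes at a single spatial position.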

Description

BACKGROUND

Technical Field

The present disclosure generally relates to convolutional accelerators, such as convolutional accelerators used in a learning/inference machine (e.g., an artificial neural network (ANN), such as a convolutional neural network (CNN)).

Description of the Related Art

Various computer vision, speech recognition, and signal processing applications may benefit from the use of learning/inference machines, which may quickly perform hundreds, thousands, or even millions of concurrent operations. Learning/inference machines, as discussed in this disclosure, may fall under the technological titles of machine learning, artificial intelligence, neural networks, probabilistic inference engines, accelerators, and the like. Conventional learning/inference machines can deliver hundreds of teraflops (e.g., one million millions (10^12) floating-point operations per second) of computing power.

Such learning/inference machines may include or otherwise utilize CNNs, such as deep convolutional neural networks (DCNN). A DCNN is a computer-based tool that processes large quantities of data and adaptively “learns” by conflating proximally related features within the data, making broad predictions about the data, and refining the predictions based on reliable conclusions and new conflations. The DCNN is arranged in a plurality of “layers,” and different types of predictions are made at each layer. Hardware accelerators including convolutional accelerators are often employed to accelerate the processing of large amounts of data by a DCNN.

BRIEF SUMMARY

In an embodiment, a convolutional accelerator comprises a feature line buffer, a kernel buffer separate from the feature line buffer, a Multiply-ACcumulate (MAC) cluster, and mode control circuitry coupled to the feature line buffer, the kernel buffer and the MAC cluster. The mode control circuitry, in a first mode of operation of the convolutional accelerator, stores feature data in the feature line buffer, stores kernel data in the kernel buffer, transfers feature data from the feature line buffer to the MAC cluster, and transfers kernel data from the kernel buffer to the MAC cluster. In a second mode of operation of the convolutional accelerator, the mode control circuitry stores feature data in the kernel buffer, stores kernel data in the feature line buffer, transfers feature data from the kernel buffer to the MAC cluster, and transfers kernel data from the feature line buffer to the MAC cluster. The second mode of operation may be employed to efficiently process 1×N kernels, where N is an integer greater than or equal to 1.

In an embodiment, a system comprises a stream engine, which, in operation, streams feature and kernel data, and a convolutional accelerator coupled to the stream engine, wherein the convolutional accelerator, in operation, receives streams of feature and kernel data from the stream engine. The convolutional accelerator includes a feature line buffer, a kernel buffer, a multiply-accumulate (MAC) cluster coupled to the feature line buffer and the kernel buffer, and mode control circuitry coupled to the feature line buffer, the kernel buffer and the MAC cluster. The mode control circuitry, in a first mode of operation of the convolutional accelerator, stores feature data in the feature line buffer, stores kernel data in the kernel buffer, transfers feature data from the feature line buffer to the MAC cluster, and transfers kernel data from the kernel buffer to the MAC cluster. In a second mode of operation of the convolutional accelerator, the mode control circuitry stores feature data in the kernel buffer, stores kernel data in the feature line buffer, transfers feature data from the kernel buffer to the MAC cluster, and transfers kernel data from the feature line buffer to the MAC cluster. The second mode of operation may be employed to efficiently process 1×N kernels, where N is an integer greater than or equal to 1.

In an embodiment, a method comprises streaming feature data and kernel data to a convolutional accelerator, and convolving streamed kernel data with streamed feature data. The convolving includes, in a first mode of operation of the convolutional accelerator, storing feature data in a feature line buffer of the convolutional accelerator, storing kernel data in a kernel buffer of the convolutional accelerator, transferring feature data from the feature line buffer to a MAC cluster of the convolutional accelerator, and transferring kernel data from the kernel buffer to the MAC cluster. In a second mode of operation of the convolutional accelerator, the convolving includes storing feature data in the kernel buffer, storing kernel data in the feature line buffer, transferring feature data from the kernel buffer to the MAC cluster, and transferring kernel data from the feature line buffer to the MAC cluster. The second mode of operation may be employed to efficiently process 1×N kernels, where N is an integer greater than or equal to 1. I