WO-2026095918-A2 - LARGE TENSOR TILING

WO2026095918A2

Abstract

Techniques for performing large tensor tiling (LTT) in hardware are enabled. LTT divides a large tensor (e.g., of unsupported size) into overlapping tiles (e.g., having supported tensor size(s)). A tensor may be processed by processing the tiles. The output of each processed tile is stored, for example, in a systolic array, taking into account the tile's placement in the large tensor. The output of all processed tiles is identical to the output of processing the large tensor. Tiles may be processed by reusing data overlapping boundaries shared with other tiles. In some examples, overlapping data may be reused (e.g., written once) or partly reused (e.g., written twice). Tiling large tensors with boundary duplication supports dynamic adaptation to a wide variety of tensor sizes, avoids re-reading duplicated data, and avoids reorganizing hardware for large tiles, which reduces power consumption and area, reduces complexity, and increases processing efficiency.
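The equivalence stated in the abstract, that convolving overlapping tiles independently and placing each result according to the tile's position yields the same output as convolving the whole tensor, can be illustrated with a small single-channel model. The following NumPy sketch is an assumption-laden illustration (the function names, naive convolution, tile size, and kernel size are the author's here, not from the publication):

```python
import numpy as np

def conv2d_valid(x, w):
    """Naive 'valid'-mode 2-D convolution (cross-correlation, as in CNNs)."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * w)
    return out

def tiled_conv2d(x, w, tile):
    """Split x into tiles overlapping by (kernel - 1) rows/columns, convolve
    each tile independently, and store each result according to the tile's
    placement in the large tensor."""
    kh, kw = w.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    sh, sw = tile - kh + 1, tile - kw + 1      # output rows/cols per tile
    for i in range(0, oh, sh):
        for j in range(0, ow, sw):
            t = x[i:i + tile, j:j + tile]      # includes duplicated edge data
            out[i:i + sh, j:j + sw] = conv2d_valid(t, w)
    return out
```

With a 16x16 input, 9x9 tiles, and a 3x3 kernel, `tiled_conv2d` processes four overlapping tiles whose shared edges duplicate two rows/columns each, and its output matches `conv2d_valid` on the full input.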

Inventors

  • SHAPIRO, Yaron Baruch
  • WATTAD, Khalil Abdul-Hamid
  • ROYZEN, Evgeny
  • LEVY, Asaf

Assignees

  • MICROSOFT TECHNOLOGY LICENSING, LLC

Dates

Publication Date
2026-05-07
Application Date
2024-05-23
Priority Date
2023-06-08

Claims (20)

  1. A computing system (100/100A, 200), comprising: a systolic array (124) comprising an array of interconnected processing elements (PEs) (301), each PE (301) associated with a PE data memory (302) configured to store at least a portion of a tensor; and a data router (122) configured to perform tensor tiling of an input tensor (136), the data router (122) configured to: determine (504) a split of the input tensor (136) into a plurality of tiles (130) based on the array of interconnected PEs (301) and dimensions of the input tensor (136); and split (506) the input tensor (136) into the plurality of tiles (130), including a first tile (142) and a second tile (144) overlapping a shared edge (150), by routing the input tensor data (136) to the PE data memories (302) that store the plurality of tiles (130).
  2. The computing system of claim 1, further comprising: an input handler configured to provide an indication of the determined split to the data router.
  3. The computing system of claim 1, wherein each PE is associated with a PE convolution engine configured to perform a convolution on a respective portion of a tile stored in the associated PE data memory.
  4. The computing system of claim 3, further comprising a systolic controller configured to control each of the PE convolution engines to perform the convolution on the respective portion of the tile stored in the associated PE data memory based on the split.
  5. The computing system of claim 3, wherein the PE convolution engine is configured to perform the convolution on respective portions of multiple tiles stored in the associated PE data memory by reusing data in the associated PE data memory that overlaps the shared edge.
  6. The computing system of claim 1, wherein the data router is configured to route the input tensor to the PE data memories that store the plurality of tiles with data along the shared edge duplicated in a first PE data memory associated with a first PE and in a second PE data memory associated with a second PE.
  7. The computing system of claim 1, wherein the data router is further configured to transpose the plurality of tiles in the PE data memories by storing tile rows as columns in the PE data memories.
  8. The computing system of claim 1, wherein each PE is further associated with a PE weight memory and wherein the data router is further configured to route weights to the PE weight memories based on the routing of the input tensor to store the plurality of tiles in the PE data memories.
  9. The computing system of claim 1, wherein the data router comprises a hardware-implemented algorithm.
  10. The computing system of claim 1, wherein the systolic array comprises a scalable array of interconnected PEs.
  11. A method (500A), comprising: performing (502), by a data router, a tensor tiling of an input tensor comprising: determining (504) a split of the input tensor into a plurality of tiles based on dimensions of the input tensor and a systolic array comprising an array of interconnected processing elements (PEs), each PE associated with a PE data memory configured to store at least a portion of the input tensor; and splitting (506) the input tensor into the plurality of tiles, including a first tile and a second tile overlapping a shared edge, by routing the input tensor to the PE data memories that store the plurality of tiles.
  12. The method of claim 11, further comprising: performing a convolution on the input tensor by performing, by a PE convolution engine associated with each PE, a convolution on respective portions of the input tiles stored in the associated PE data memory.
  13. The method of claim 12, wherein the PE convolution engine is configured to perform the convolution on respective portions of multiple tiles stored in the associated PE data memory by reusing data in the associated PE data memory that overlaps the shared edge.
  14. The method of claim 11, wherein the routing of the input tensor comprises routing the input tensor to the PE data memories that store the plurality of tiles with data along the shared edge duplicated in a first PE data memory associated with a first PE and in a second PE data memory associated with a second PE.
  15. The method of claim 11, wherein the routing of the input tensor comprises transposing the plurality of tiles in the PE data memories by storing tile rows as columns in the PE data memories.
  16. The method of claim 11, further comprising: routing weights to PE weight memories associated with each PE based on the routing of the input tensor to store the plurality of tiles in the PE data memories.
  17. A neural processing unit (NPU) (108, 208), comprising: a systolic array (124) comprising an array of interconnected processing elements (PEs) (301), each PE (301) associated with a PE data memory (302) configured to store at least a portion of a tensor; and a data router (122) configured to perform tensor tiling of an input tensor (136), the data router (122) configured to: determine (504) a split of the input tensor (136) into a plurality of tiles (130) based on the array of interconnected PEs (301) and dimensions of the input tensor (136); and split (506) the input tensor (136) into the plurality of tiles (130), including a first tile (142) and a second tile (144) overlapping a shared edge (150), by routing the input tensor data (136) to the PE data memories (302) that store the plurality of tiles (130).
  18. The NPU of claim 17, wherein the data router is further configured to route weights to the PE weight memories based on the routing of the input tensor to store the plurality of tiles in the PE data memories; and wherein each PE is associated with a PE convolution engine configured to perform a convolution on the input tensor by performing a convolution on respective portions of the input tiles stored in the associated PE data memory with the weights stored in the associated PE weight memories.
  19. The NPU of claim 18, wherein the PE convolution engine is configured to perform the convolution on respective portions of multiple tiles stored in the associated PE data memory by reusing data in the associated PE data memory that overlaps the shared edge.
  20. The NPU of claim 17, wherein the routing of the input tensor comprises routing the input tensor to the PE data memories that store the plurality of tiles with data along the shared edge duplicated in a first PE data memory associated with a first PE and in a second PE data memory associated with a second PE.
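The "determine a split" step recited in claims 1, 11, and 17 can be sketched as a simple planning computation. This hypothetical helper (its name and signature are assumptions, not from the publication) returns per-dimension tile spans in which adjacent tiles overlap by (kernel - 1) elements, the duplicated shared-edge data; assuming a 3x3 kernel, a 16-wide dimension splits into two 9-wide spans, so a 16x16 plane yields four 9x9 tiles, consistent with the 16x16x4 into four 9x9x4 example in the description:

```python
def plan_tile_spans(length, tile, kernel):
    """Return (start, end) spans of overlapping tiles along one dimension.

    Adjacent spans share (kernel - 1) elements: the data duplicated along
    the shared edge so each tile can be convolved independently.
    """
    step = tile - kernel + 1   # output elements each tile contributes
    return [(i, min(i + tile, length))
            for i in range(0, length - kernel + 1, step)]
```

For example, `plan_tile_spans(16, 9, 3)` gives spans `(0, 9)` and `(7, 16)`, sharing elements 7 and 8.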

Description

LARGE TENSOR TILING

BACKGROUND

[0001] A convolutional neural network (CNN) is a type of artificial neural network with various applications, including the analysis of images. CNNs implement at least one convolution, a mathematical operation. CNNs commonly convolve data tensors (e.g., image data) with weight tensors. Data tensors that are processed by one or more layers in CNNs may be of different sizes.

SUMMARY

[0002] This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.

[0003] Techniques for large tensor tiling (LTT) are disclosed herein that accommodate convolution of input tensors of varying sizes. LTT divides a large input tensor into smaller tiles, with at least some of the smaller tiles being overlapping or crossover tiles with duplicated or otherwise reused edges. Adjacent tiles are considered “overlapping” when one or both tiles have a row/column of data of the other tile added (“duplicated”) at an edge at which the tiles meet (a “shared edge”). A tensor is processed (e.g., convolved) by processing the tiles into which the tensor is divided. The output of each processed tile is stored, for example, in a systolic array, taking into account the placement of the tile in the large tensor. The output of all processed tiles is identical to the output of processing the large tensor. Tiles are processed by reusing data in overlapping boundaries shared with other tiles. In some aspects, overlapping data may be reused (e.g., written once) or partly reused (e.g., written twice).
Tiling large tensors with boundary duplication supports dynamic adaptation to a wide variety of tensor sizes, avoids re-reading duplicated data, and avoids reorganizing hardware for large tiles, which reduces power consumption and area, reduces complexity, and increases processing efficiency.

[0004] In aspects, a computing system includes a neural processing unit (NPU) that includes a systolic array and a data router. The systolic array includes a scalable array of interconnected processing elements (PEs). Each PE has an associated PE data memory configured to store at least a portion of an input tensor. The data router is configured to perform tensor tiling of an input tensor by determining or receiving an indication of how to split the input tensor into a plurality of tiles based on the array of interconnected PEs and dimensions of the input tensor, and splitting the input tensor into the plurality of tiles, including a first tile and a second tile overlapping a shared edge, by routing the input tensor data to the PE data memories that store the plurality of tiles. Tiles may be processed (e.g., convolved) using data that overlaps the tile boundaries. Depending on the configuration of the array of interconnected PEs and/or the routing/storage of tile data, the overlapping data at shared tile boundaries may be stored once and reused, or may be duplicated, e.g., stored in multiple PE data memories. For example, a 16x16x4 tensor may be split into four 9x9x4 overlapping tensors. The overlapping nature of the tensors may result in reuse or duplication of stored tensor tile data.

[0005] In aspects, an input handler is configured to provide the indication to the data router. Each PE may be associated with a PE convolution engine (PE processing logic) configured to perform a convolution on a respective portion of a tile stored in the associated PE data memory.
A systolic controller is configured to control the systolic array, with this control pipelined throughout the PEs, to perform the convolution on the respective portions of one or more tiles stored in the PE data memory based on the split and routing. The PE convolution engine is configured to perform the convolution on respective portions of multiple tiles stored in the associated PE data memory by reusing data in the associated PE data memory overlapping the shared edge. In some examples, the input tensor may be routed to the PE data memories that store the plurality of tiles, including the first and second tiles, with data overlapping the shared edge written once, or duplicated in a first PE data memory associated with a first PE and in a second PE data memory associated with a second PE. In some examples, the tiles may be transposed in the PE data memories by storing tile rows as columns in the PE data memories. Each PE may be associated with a PE weight (e.g., convolution filter) memory. Weights may be routed to the PE weight memories based on the routing of the input tensor to store the plurality of tiles in the PE data memories. The data router may be a hardware-implemented algorithm. The systolic array may include a scalable array of interconnected PEs.

[0006] Further features and advantages of the embodiments
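The routing with boundary duplication, and the optional storage of tile rows as columns, described above can be sketched in a simplified software model (the dictionary-of-arrays stand-in for PE data memories and all names here are illustrative assumptions, not the hardware design):

```python
import numpy as np

def route_tiles(x, spans_h, spans_w, transpose=False):
    """Write each overlapping tile of x into its own simulated PE data memory.

    Data along a shared edge is duplicated: it is written into both
    neighbouring PE memories. With transpose=True, tile rows are stored
    as columns, modeling the transposed storage option.
    """
    pe_mem = {}
    for r, (i0, i1) in enumerate(spans_h):
        for c, (j0, j1) in enumerate(spans_w):
            t = x[i0:i1, j0:j1]
            pe_mem[(r, c)] = t.T.copy() if transpose else t.copy()
    return pe_mem
```

In this model, routing a 16x16 plane with spans (0, 9) and (7, 16) duplicates rows 7 and 8 into both vertically adjacent PE memories, so each PE can convolve its tile without re-reading its neighbour's data.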