
US-20260127023-A1 - COMMAND STREAM STITCHING FOR HARDWARE ACCELERATION


Abstract

Command stream stitching for hardware acceleration includes generating, by a host processor, a stitched block representing a plurality of commands for a hardware accelerator. The host processor generates a stitched command from the plurality of commands. The stitched command references the stitched block. The hardware accelerator executes the stitched block in response to invoking the stitched command. The hardware accelerator generates a single notification directed to the host processor for the stitched command.

Inventors

  • Cheng Zhen
  • Sonal Santan
  • MIN MA
  • Pat Truong
  • Satish Rangarajan
  • Soren T. Soe
  • Yu Liu

Assignees

  • ADVANCED MICRO DEVICES, INC.
  • ATI TECHNOLOGIES ULC
  • XILINX, INC.

Dates

Publication Date
2026-05-07
Application Date
2024-11-05

Claims (20)

  1. A method, comprising: generating, by a host processor, a stitched block representing a plurality of commands for a hardware accelerator; generating, by the host processor, a stitched command from the plurality of commands, wherein the stitched command references the stitched block; executing, by the hardware accelerator, the stitched block in response to invoking the stitched command; and generating, by the hardware accelerator, a single notification directed to the host processor for the stitched command.
  2. The method of claim 1, wherein the stitched command is a meta command and the stitched block comprises the plurality of commands concatenated together as a command list stored in a host memory of the host processor.
  3. The method of claim 2, wherein the executing the stitched block comprises: fetching the command list from the host memory; and executing, by the hardware accelerator, each command of the command list; wherein each command of the command list points to a control code that is executable by the hardware accelerator.
  4. The method of claim 2, wherein the hardware accelerator executes the stitched block as a plurality of individual commands.
  5. The method of claim 1, wherein the stitched command is a fused command and the stitched block comprises merged control code including a control code extracted from each command of the plurality of commands.
  6. The method of claim 5, wherein the generating the stitched block comprises: extracting control codes from the plurality of commands; and combining the control codes as extracted into the merged control code.
  7. The method of claim 5, wherein the hardware accelerator executes the merged control code.
  8. The method of claim 1, wherein the stitched block comprises a plurality of sub-lists that are linked.
  9. The method of claim 8, wherein the host processor is capable of adding one or more additional sub-lists to the plurality of sub-lists while the hardware accelerator executes at least one of the plurality of sub-lists.
  10. The method of claim 8, wherein each sub-list includes two or more commands for the hardware accelerator.
  11. The method of claim 8, wherein each sub-list includes two or more control codes extracted from two or more commands for the hardware accelerator.
  12. A system, comprising: a hardware accelerator; a host processor coupled to the hardware accelerator; wherein the host processor is capable of implementing operations including: generating a stitched block representing a plurality of commands for the hardware accelerator; generating a stitched command from the plurality of commands, wherein the stitched command references the stitched block; wherein the hardware accelerator is capable of implementing operations including: executing the stitched block in response to invoking the stitched command; and generating a single notification directed to the host processor for the stitched command.
  13. The system of claim 12, wherein the stitched command is a meta command and the stitched block comprises the plurality of commands concatenated together as a command list stored in a host memory of the host processor.
  14. The system of claim 13, wherein the executing the stitched block by the hardware accelerator comprises: fetching the command list from the host memory; and executing each command of the command list; wherein each command of the command list points to a control code that is executable by the hardware accelerator.
  15. The system of claim 13, wherein the hardware accelerator executes the stitched block as a plurality of individual commands.
  16. The system of claim 12, wherein the stitched command is a fused command and the stitched block comprises merged control code including a control code extracted from each command of the plurality of commands.
  17. The system of claim 16, wherein the generating the stitched block by the host processor comprises: extracting control codes from the plurality of commands; and combining the control codes as extracted into the merged control code.
  18. The system of claim 16, wherein the hardware accelerator executes the merged control code.
  19. The system of claim 12, wherein the stitched block comprises a plurality of sub-lists that are linked.
  20. The system of claim 19, wherein the host processor is capable of adding one or more additional sub-lists to the plurality of sub-lists while the hardware accelerator executes at least one of the plurality of sub-lists.
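The claims above distinguish two stitching variants: a meta command, whose stitched block is the commands themselves concatenated as a command list in host memory (claims 2-4), and a fused command, whose stitched block is a single merged control code extracted from each command (claims 5-7). A minimal Python sketch of the two data structures follows; all names (`Command`, `MetaBlock`, `FusedBlock`, the builder functions) are illustrative and not taken from the disclosure.

```python
from dataclasses import dataclass

@dataclass
class Command:
    """A single accelerator command pointing at executable control code."""
    control_code: bytes

@dataclass
class MetaBlock:
    """Meta-command variant: commands concatenated as a command list."""
    command_list: list

@dataclass
class FusedBlock:
    """Fused-command variant: one merged control code for the whole block."""
    merged_control_code: bytes

def build_meta_block(commands):
    # Concatenate the commands themselves; each still points to its own
    # control code, which the accelerator fetches and runs one by one.
    return MetaBlock(command_list=list(commands))

def build_fused_block(commands):
    # Per the fused variant: extract each command's control code and
    # combine the extracted codes into a single merged control code.
    return FusedBlock(
        merged_control_code=b"".join(c.control_code for c in commands)
    )
```

Either builder yields one stitched block that a single stitched command can reference, so the accelerator reports one completion notification regardless of how many original commands went in.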

Description

TECHNICAL FIELD

This disclosure relates to hardware acceleration and, more particularly, to stitching together commands from an application for execution by a hardware accelerator.

BACKGROUND

A hardware accelerator is a device or circuitry adapted to perform particular processing tasks. The processing tasks may be delegated from a host processor such as a Central Processing Unit (CPU). In many cases, the data set operated on by a hardware accelerator is too large to fit in the available memory of the hardware accelerator or too large to be processed in a single invocation of the hardware accelerator. As such, the data set and/or task to be performed must be broken into smaller parts for processing by the hardware accelerator. Such is often the case for Neural Processing Unit (NPU) type hardware accelerators that are adapted to perform a task such as an artificial intelligence (AI) based inferencing operation.

In the typical case, the inferencing operation is broken into many smaller parts that can be performed by the NPU. Each smaller part of the inferencing operation is initiated by way of a corresponding command. For example, the inferencing operation may be broken into hundreds or thousands of smaller operations, each invoked by a corresponding command provided to the NPU. This approach also may be used when processing a data set through a plurality of different stages. Each stage may be broken down into smaller processing stages, and each smaller processing stage is initiated by a corresponding command. These commands and corresponding operations traverse the software and hardware layers of the host processor and the hardware accelerator. With this approach, the number of commands issued from the host processor to the hardware accelerator to perform even a single inferencing operation increases significantly. Each command has overhead in terms of command submission and completion.
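A toy cost model (not from the disclosure; the numbers below are illustrative) makes the overhead argument concrete: when each of N commands pays a fixed submission-plus-completion cost, stitching the N commands into one command pays that fixed cost only once.

```python
# Illustrative latency model: per-command overhead vs. a stitched command.

def per_command_latency(n_commands, exec_time, overhead):
    # Each command individually incurs the fixed submission/notification
    # overhead in addition to its execution time.
    return n_commands * (exec_time + overhead)

def stitched_latency(n_commands, exec_time, overhead):
    # One stitched command: all N parts execute, but the fixed overhead
    # and the completion notification occur once.
    return n_commands * exec_time + overhead

# Example: 1000 sub-operations whose execution time is smaller than the
# fixed per-command overhead, as the background section describes.
n, t_exec, t_ovh = 1000, 5e-6, 20e-6
assert stitched_latency(n, t_exec, t_ovh) < per_command_latency(n, t_exec, t_ovh)
```

In this regime, where execution is cheaper than the fixed overhead, the stitched total is dominated by useful work rather than by per-command bookkeeping.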
With respect to command submission, the command must be forwarded from the host processor to the hardware accelerator. In terms of command completion, for each command submitted to the NPU that is successfully executed, the NPU generates a notification to the host processor indicating that execution of the command has completed. This overhead for each command is fixed and usually time consuming. In some cases, the time required to execute a command by the NPU is less than the amount of time needed for command submission and completion.

SUMMARY

In one or more embodiments, a method includes generating, by a host processor, a stitched block representing a plurality of commands for a hardware accelerator. The method includes generating, by the host processor, a stitched command from the plurality of commands. The stitched command references the stitched block. The method includes executing, by the hardware accelerator, the stitched block in response to invoking the stitched command. The method includes generating, by the hardware accelerator, a single notification directed to the host processor for the stitched command.

In one or more embodiments, a system includes a hardware accelerator and a host processor coupled to the hardware accelerator. The host processor is capable of implementing operations including generating a stitched block representing a plurality of commands for the hardware accelerator. The host processor is capable of implementing operations including generating a stitched command from the plurality of commands. The stitched command references the stitched block. The hardware accelerator is capable of implementing operations including executing the stitched block in response to invoking the stitched command. The hardware accelerator is capable of implementing operations including generating a single notification directed to the host processor for the stitched command.
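The summarized method can be sketched end to end: the host builds a stitched block and a stitched command that references it, the accelerator executes the block when the command is invoked, and exactly one completion notification comes back. This is a hypothetical simplification; the `Host` and `Accelerator` classes and their members are invented for illustration.

```python
# Hypothetical end-to-end flow of the summarized method.

class Host:
    def __init__(self):
        self.notifications = []

    def stitch(self, commands):
        stitched_block = list(commands)               # block representing N commands
        stitched_command = {"block": stitched_block}  # command references the block
        return stitched_command

class Accelerator:
    def invoke(self, host, stitched_command):
        for op in stitched_command["block"]:          # execute the stitched block
            op()
        host.notifications.append("done")             # single notification for all N

host, accel = Host(), Accelerator()
results = []
cmd = host.stitch([lambda i=i: results.append(i) for i in range(4)])
accel.invoke(host, cmd)
assert results == [0, 1, 2, 3] and len(host.notifications) == 1
```

The key contrast with the background scenario is the last line of `invoke`: four sub-operations run, yet the host receives one notification rather than four.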
This Summary section is provided merely to introduce certain concepts and not to identify any key or essential features of the claimed subject matter. Many other features and embodiments of the disclosed technology will be apparent from the accompanying drawings and from the following detailed description.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings show one or more embodiments of the disclosed technology. The drawings, however, should not be construed to be limiting of the inventive arrangements to only the embodiments shown. Various aspects and advantages will become apparent upon review of the following detailed description and upon reference to the drawings.

FIG. 1 illustrates a computing environment in accordance with one or more embodiments of the disclosed technology.

FIG. 2 is a method of operation of the computing environment of FIG. 1 in accordance with one or more embodiments of the disclosed technology.

FIG. 3 illustrates a meta command implementation in accordance with one or more embodiments of the disclosed technology.

FIG. 4 illustrates a method of using meta commands in accordance with one or more embodiments of the disclosed technology.