US-20260127028-A1 - APPARATUS AND METHOD FOR EFFICIENT SCHEDULING OF ACCELERATOR WORKLOADS
Abstract
An apparatus and method for scheduling multiple contexts on a plurality of neural processing unit (NPU) tiles. One embodiment of an apparatus comprises: a plurality of neural processing unit (NPU) tiles; a scheduler to schedule a plurality of workloads for execution on the plurality of NPU tiles, the scheduler to: schedule a first workload associated with a first priority for execution on at least a first NPU tile of the plurality of NPU tiles; and responsive to an indication of a second workload associated with a second priority which is higher than the first priority submitted for execution, determine whether to: preempt the first workload, execute the second workload on a second NPU tile, or provide a grace period for the first workload to complete execution before executing the second workload on the first NPU tile.
Inventors
- Paul Murphy
Assignees
- INTEL CORPORATION
Dates
- Publication Date
- 2026-05-07
- Application Date
- 2025-12-18
Claims (19)
- 1 . An apparatus comprising: a plurality of neural processing unit (NPU) tiles; and a scheduler to schedule a plurality of workloads for execution on the plurality of NPU tiles, wherein the scheduler is configured to: schedule a first workload associated with a first priority for execution on at least a first NPU tile of the plurality of NPU tiles; and responsive to an indication of a second workload associated with a second priority which is higher than the first priority submitted for execution, determine whether to: preempt the first workload to execute the second workload on the first NPU tile, execute the second workload on a second NPU tile in parallel with the first workload, or provide a grace period for the first workload to complete execution before executing the second workload on the first NPU tile, wherein the determination is based, at least in part, on one or more of: (i) whether there are any idle NPU tiles in the plurality of NPU tiles; (ii) estimated deadlines associated with the first workload and/or the second workload; (iii) a scheduling policy associated with workloads executing at different priority levels; and (iv) current power or thermal conditions of the plurality of NPU tiles or a processor in which the plurality of NPU tiles are integrated.
- 2 . The apparatus of claim 1 , wherein, based on an estimated deadline associated with the second workload, the scheduler is to provide a grace period for the first workload to complete execution on the first NPU tile before executing the second workload on the first NPU tile, the grace period having a duration selected so that the execution of the second workload is completed in accordance with the estimated deadline associated with the second workload.
- 3 . The apparatus of claim 2 , wherein the scheduler is to: cause a first context state associated with the first workload to be saved to memory and/or persistent storage; and preempt the first workload to execute the second workload on the first NPU tile when the first workload has not completed execution in accordance with the grace period.
- 4 . The apparatus of claim 3 , wherein, following execution of the second workload, the scheduler is to: resume execution of the first workload on the first NPU tile and cause the first context state to be restored to the first NPU tile from the memory and/or persistent storage.
- 5 . The apparatus of claim 1 , wherein the indication of the second workload is to be provided to the scheduler in a doorbell register or memory location updated by a host processor to indicate the second workload.
- 6 . The apparatus of claim 1 , wherein the first workload is associated with a first user context and the second workload is associated with a second user context, and wherein during execution of the first workload, the scheduler is to track a first context state corresponding to the first user context and during execution of the second workload, the scheduler is to track a second context state corresponding to the second user context.
- 7 . The apparatus of claim 6 , wherein the scheduler is to perform per-user context timeout tracking, comprising a first period of time or first number of execution cycles within which the first user context must complete execution and a second period of time or second number of execution cycles within which the second user context must complete execution.
- 8 . The apparatus of claim 7 , wherein if the first user context fails to complete execution within the first period of time or first number of execution cycles, then the scheduler is to generate a notification to a host processor, which is to subsequently cause a reset of at least the first NPU tile.
- 9 . The apparatus of claim 6 , wherein the first context state includes any errors generated during execution of the first workload and the second context state includes any errors generated during execution of the second workload.
- 10 . The apparatus of claim 1 , wherein when the scheduling policy indicates that any workload executing at the first priority will not be scheduled in parallel with any other workload, then the scheduler is to preempt the first workload in favor of the second workload.
- 11 . The apparatus of claim 1 , wherein if there is at least a second NPU tile which is idle and which is capable of executing the second workload in accordance with an estimated deadline associated with the second workload, then the scheduler is to schedule the second workload for execution on the second NPU tile, the second workload to be executed on the second NPU tile in parallel with the first workload being executed on the first NPU tile.
- 12 . The apparatus of claim 1 , wherein the scheduler is to determine to not execute the second workload in parallel with the first workload and/or is to determine to reduce a frequency of one or more of the plurality of NPU tiles if the current power or thermal conditions of the plurality of NPU tiles indicate that a power threshold or temperature threshold is exceeded.
- 13 . A machine-readable medium having program code stored thereon which, when executed by one or more processors, is to cause the one or more processors to perform operations, comprising: scheduling, by a scheduler, a plurality of workloads for execution on a plurality of NPU tiles, wherein scheduling further comprises: scheduling a first workload associated with a first priority for execution on at least a first NPU tile of the plurality of NPU tiles; and determining, responsive to an indication of a second workload associated with a second priority which is higher than the first priority submitted for execution, whether to: preempt the first workload to execute the second workload on the first NPU tile, execute the second workload on a second NPU tile in parallel with the first workload, or provide a grace period for the first workload to complete execution before executing the second workload on the first NPU tile, the determining being based, at least in part, on one or more of: (i) whether there are any idle NPU tiles in the plurality of NPU tiles; (ii) estimated deadlines associated with the first workload and/or the second workload; (iii) a scheduling policy associated with workloads executing at different priority levels; and (iv) current power or thermal conditions of the plurality of NPU tiles or a processor in which the plurality of NPU tiles are integrated.
- 14 . The machine-readable medium of claim 13 , wherein the operations further comprise: providing, by the scheduler, based on an estimated deadline associated with the second workload, a grace period for the first workload to complete execution on the first NPU tile before executing the second workload on the first NPU tile, the grace period having a duration selected to ensure that the estimated deadline associated with the second workload will be met.
- 15 . The machine-readable medium of claim 14 , wherein the scheduler is to preempt the first workload to execute the second workload on the first NPU tile when the first workload has not completed execution in accordance with the grace period.
- 16 . The machine-readable medium of claim 15 , wherein to preempt the first workload, the scheduler is to cause a first context state associated with the first workload to be saved to memory and/or persistent storage.
- 17 . The machine-readable medium of claim 16 , wherein following execution of the second workload, the scheduler is to resume execution of the first workload on the first NPU tile, the scheduler to cause the first context state to be restored to the first NPU tile from the memory and/or persistent storage.
- 18 . The machine-readable medium of claim 13 , wherein the indication of the second workload is to be provided to the scheduler in a doorbell register or memory location updated by a host processor to indicate the second workload.
- 19 . The machine-readable medium of claim 13 , wherein the first workload is associated with a first user context and the second workload is associated with a second user context, and wherein during execution of the first workload, the scheduler is to track a first context state corresponding to the first user context and during execution of the second workload, the scheduler is to track a second context state corresponding to the second user context.
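For illustration only, the three-way determination recited in claim 1, together with the grace-period sizing of claims 2 and 14, can be sketched in C++ as follows. The types, field names, and units (Workload, Tile, estRuntimeUs, and the microsecond timebase) are hypothetical and do not appear in the claims; the sketch assumes the scheduler already knows estimated runtimes and deadlines.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical descriptors; the claims do not define concrete structures.
struct Workload {
    int      priority;      // claim 1 premise: the incoming workload outranks the running one
    uint64_t deadlineUs;    // estimated completion deadline (absolute microseconds)
    uint64_t estRuntimeUs;  // estimated execution time if started now
};

struct Tile {
    bool idle;
    bool thermallyConstrained;
};

enum class Decision { RunOnIdleTile, GrantGracePeriod, PreemptNow };

// Sketch of the claim 1 determination, driven by (i) idle tiles,
// (ii) estimated deadlines, (iii) scheduling policy, and (iv) power/thermal state.
Decision schedule(const Workload& incoming, const std::vector<Tile>& tiles,
                  uint64_t nowUs, bool policyAllowsParallelism,
                  bool powerBudgetExceeded) {
    // (iii)/(iv): only consider parallel execution if the policy permits it and
    // the power/thermal budget is not already exceeded (cf. claims 10 and 12).
    if (policyAllowsParallelism && !powerBudgetExceeded) {
        // (i)/(ii): an idle tile that can still meet the incoming deadline (cf. claim 11).
        for (const Tile& t : tiles)
            if (t.idle && !t.thermallyConstrained &&
                nowUs + incoming.estRuntimeUs <= incoming.deadlineUs)
                return Decision::RunOnIdleTile;
    }
    // Claims 2/14: the grace period is the slack between now and the latest
    // start time at which the second workload can still meet its deadline.
    uint64_t latestStartUs = incoming.deadlineUs > incoming.estRuntimeUs
                                 ? incoming.deadlineUs - incoming.estRuntimeUs
                                 : 0;
    uint64_t graceUs = latestStartUs > nowUs ? latestStartUs - nowUs : 0;
    // Claims 3/15: if the first workload has not drained when the grace period
    // expires, it is preempted and its context state is saved for later resume.
    return graceUs > 0 ? Decision::GrantGracePeriod : Decision::PreemptNow;
}
```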
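The workload indication of claims 5 and 18 can likewise be sketched as a single doorbell location shared between the host processor and the scheduler. The atomic-exchange handshake shown here is one plausible realization, not a requirement of the claims, which cover any doorbell register or memory location updated by the host.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical doorbell layout; claims 5 and 18 require only a doorbell
// register or memory location updated by a host processor.
struct Doorbell {
    std::atomic<uint64_t> workloadId{0};  // 0 = empty; host writes a nonzero ID
};

// Host side: announce the higher-priority workload to the scheduler.
void hostSubmit(Doorbell& db, uint64_t id) {
    db.workloadId.store(id, std::memory_order_release);
}

// Scheduler side: poll (or service a doorbell interrupt) and consume the
// indication atomically, returning 0 when nothing was submitted.
uint64_t schedulerPoll(Doorbell& db) {
    return db.workloadId.exchange(0, std::memory_order_acq_rel);
}
```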
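The per-user context timeout tracking of claims 7 and 8 might be kept in a small table keyed by context, as in the following sketch. The budget representation (wall-clock microseconds rather than execution cycles) and the host-notification callback are assumptions for illustration.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical bookkeeping for claims 7-8: each user context must complete
// within its own time budget; on expiry the host is notified and may then
// reset the NPU tile(s) running that context.
struct ContextTimeout {
    uint64_t startUs;
    uint64_t budgetUs;  // per-context period within which it must complete
};

class TimeoutTracker {
public:
    void onContextStart(int ctx, uint64_t nowUs, uint64_t budgetUs) {
        active_[ctx] = {nowUs, budgetUs};
    }
    void onContextComplete(int ctx) { active_.erase(ctx); }

    // Called periodically by the scheduler; invokes notifyHost for every
    // context that has exceeded its budget (claim 8: host may reset the tile).
    template <typename NotifyHost>
    void check(uint64_t nowUs, NotifyHost notifyHost) {
        for (auto it = active_.begin(); it != active_.end();) {
            if (nowUs - it->second.startUs > it->second.budgetUs) {
                notifyHost(it->first);  // host subsequently resets the NPU tile
                it = active_.erase(it);
            } else {
                ++it;
            }
        }
    }

private:
    std::unordered_map<int, ContextTimeout> active_;
};
```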
Description
BACKGROUND

Field of the Disclosure

This disclosure relates generally to the field of computer processors. More particularly, this disclosure relates to an apparatus and method for efficient scheduling of accelerator workloads.

Description of the Related Art

In current system-on-chip (SoC) implementations, firmware is responsible for scheduling user contexts on accelerators such as neural processing units (NPUs). When deciding which user context to schedule next, the absolute and relative priorities of each user context and the allowed quantum of each user context are evaluated. However, only one user context is selected to run at a time, potentially leaving hardware resources under-utilized even when other user contexts could be scheduled to make use of the unused resources.

In addition, the management algorithms that determine frequency and resource constraints of an NPU operate at a global level, controlling a single NPU frequency for all running workloads. For NPU devices which support concurrency, this means that lower priority work receives the benefit of the higher frequencies granted to higher priority work executing in parallel, contrary to the expectation of a user who anticipated a lower power impact for the lower priority work. Additionally, the lower priority work may even slow down the higher priority work because the two compete for shared resources.

Further, in current implementations workloads are scheduled on NPUs using the maximum available resources. Because reduced resource options that would also satisfy the requirements of a given workload are not considered, the NPU scheduler must schedule work against fixed, maximum resource requirements, resulting in higher power consumption and in reduced performance due to preemption that is not needed to satisfy the workload requirements.

BRIEF DESCRIPTION OF THE DRAWINGS

A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:

- FIG. 1 illustrates an example processor or system-on-chip (SoC) including a neural processing unit (NPU) accelerator.
- FIG. 2 illustrates an example NPU partitioned into a plurality of NPU tiles.
- FIG. 3 illustrates an example accelerator workload processing environment comprising sequences of operations of multiple user contexts, an NPU scheduler, and NPU hardware.
- FIG. 4A illustrates an example of per-context timeout tracking for two user contexts.
- FIG. 4B illustrates an example of per-context timeout tracking for three user contexts.
- FIG. 5 illustrates an example power management subsystem including power circuits associated with an NPU, a host processor, and input-output (IO) circuitry.
- FIG. 6 illustrates an implementation in which the frequency and voltage of each NPU tile can be independently adjusted.
- FIG. 7A illustrates an SoC including techniques for communicating between an NPU and a CPU.
- FIG. 7B illustrates an example in which a frame is generated by a graphics processing unit (GPU) and then processed by an NPU.
- FIG. 8 illustrates an example indicating the state of a GPU and an NPU based on actions performed by an application.
- FIG. 9 illustrates an example in which a grace period and context deadlines for accelerator workload scheduling are determined for two contexts.
- FIG. 10 illustrates an example with preemption in which a grace period and context deadlines for accelerator workload scheduling are determined for two contexts.
- FIG. 11 illustrates an example in which a grace period allows a context to complete without preemption by a higher priority context.
- FIG. 12 illustrates a method for scheduling accelerator workloads on NPU tiles in accordance with one or more embodiments of this disclosure.
- FIG. 13 illustrates a method in accordance with some embodiments of this disclosure in which additional resource allocations are performed based on quality-of-service (QoS) levels associated with different workloads.

DETAILED DESCRIPTION OF EMBODIMENTS

Disclosed herein are embodiments of instructions, embodiments of processors to perform the instructions, embodiments of methods performed by the processors when performing the instructions, embodiments of systems incorporating one or more processors to perform the instructions, and embodiments of programs or machine-readable media storing or otherwise providing the instructions. In the following description, numerous specific details are set forth (e.g., specific instruction operations, data formats, processor configurations, microarchitectural details, sequences of operations, etc.). However, embodiments may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the understanding of the description.

Implementations of this disclosure efficiently schedule multiple user contexts on an accelerator, such as a neural processing unit (NPU).
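FIG. 6 and claim 12 suggest per-tile frequency control in place of the global frequency management criticized in the Background. A minimal sketch of such throttling follows, assuming a hypothetical TileClock state and an illustrative 100 MHz step; lower-priority tiles are throttled first so that higher-priority work keeps its frequency as long as possible.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical per-tile clock state; field names and the 100 MHz step are
// illustrative assumptions, not taken from the disclosure.
struct TileClock {
    uint32_t freqMhz;
    uint32_t minMhz;
    bool     runsLowPriorityWork;
};

// When a package power or temperature threshold is exceeded (cf. claim 12),
// step down the clocks of tiles running lower-priority work first, so that
// higher-priority workloads retain their frequency as long as possible.
void throttle(std::vector<TileClock>& tiles, bool powerExceeded, bool tempExceeded) {
    if (!powerExceeded && !tempExceeded) return;
    for (TileClock& t : tiles) {
        if (t.runsLowPriorityWork && t.freqMhz > t.minMhz)
            t.freqMhz -= std::min<uint32_t>(100, t.freqMhz - t.minMhz);
    }
}
```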