US-20260127035-A1 - TARGETED ACCELERATOR DISPATCH


Abstract

Computer-implemented methods, systems, and computer program products include program code, executing on one or more processors, that initiates, from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload. The program code determines that a local accelerator on the first chip is not available to accept a dispatch of the work. The program code obtains hardware monitoring data from local hardware counters associated with various elements of the computing system. The program code determines, based on the hardware monitoring data, a best remote accelerator to which to dispatch the work by selecting, from one or more remote accelerators, the accelerator whose bandwidth and one or more caches are minimally accessed by other workloads processing in the computing system concurrently with the given workload, when compared to the other remote accelerators.

Inventors

Cedric Lichtenau
Simon Friedmann
Simon Bubeck
Craig R. Walters

Assignees

INTERNATIONAL BUSINESS MACHINES CORPORATION

Dates

Publication Date: 2026-05-07
Application Date: 2024-11-01

Claims (20)

1. A computer-implemented method for performance sensitive targeted accelerator dispatch, comprising: initiating, by one or more processors from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload; determining, by the one or more processors, that a local accelerator on the first chip is not available to accept a dispatch of the work; obtaining, by the one or more processors, hardware monitoring data from local hardware counters associated with various elements of the computing system; and determining, based on the hardware monitoring data, a best remote accelerator for dispatching the work to, wherein the best remote accelerator comprises selecting an accelerator from one or more remote accelerators in the computing system, wherein based on the hardware monitoring data the selected accelerator utilizes bandwidth and one or more caches which are minimally accessed by other workloads processing in the computing system concurrently with the given workload when compared to the other accelerators comprising the one or more remote accelerators.
2. The computer-implemented method of claim 1, further comprising: determining, by the one or more processors, if activity on a chip or core associated with the best remote accelerator is below a pre-defined threshold.
3. The computer-implemented method of claim 2, further comprising: based on determining that the activity on the chip or the core associated with the best remote accelerator is below the pre-defined threshold, dispatching the work to the best remote accelerator.
4. The computer-implemented method of claim 3, further comprising: determining, by the one or more processors, that the local accelerator is available; and dispatching, by the one or more processors, the work to the local accelerator.
5. The computer-implemented method of claim 1, wherein determining the best remote accelerator for dispatching the work to comprises ranking the one or more remote accelerators based on the hardware monitoring data.
6. The computer-implemented method of claim 5, wherein the ranking comprises ranking the one or more remote accelerators by cache usage.
7. The computer-implemented method of claim 5, wherein the ranking comprises ranking the one or more remote accelerators based on accelerator-unrelated activity on each chip or core comprising each remote accelerator of the one or more accelerators.
8. The computer-implemented method of claim 5, wherein the ranking comprises evaluating link activity to reach each remote accelerator of the one or more remote accelerators from the first chip.
9. The computer-implemented method of claim 5, wherein the ranking comprises evaluating potential cross-workload interference and impact to individual workload performance for workloads running on the computing system.
10. The computer-implemented method of claim 6, further comprising: determining, by the one or more processors, the cache usage of the one or more remote accelerators independent of accelerator availability.
11. The computer-implemented method of claim 10, wherein the cache usage is selected from the group consisting of: L2 eviction intensity, L3 eviction intensity, and L4 eviction intensity.
12. The computer-implemented method of claim 1, wherein the hardware monitoring data is selected from the group consisting of: links usage, memory access, and cache usage.
13. The computer-implemented method of claim 1, further comprising: implementing, by the one or more processors, the local hardware counters.
14. The computer-implemented method of claim 1, wherein obtaining the hardware monitoring data comprises: obtaining, by the one or more processors, system wide non-accelerator related real-time data from the local hardware counters.
15. The computer-implemented method of claim 14, wherein the system wide non-accelerator related real-time data comprises cache eviction activity for each cache hierarchy level of caches comprising the computing system.
16. The computer-implemented method of claim 14, wherein the system wide non-accelerator related real-time data comprises a processor cache footprint of each workload running in the computing system.
17. The computer-implemented method of claim 14, wherein the system wide non-accelerator related real-time data comprises bandwidth utilization between chip and to memory for each chip and each memory comprising the computing system.
18. The computer-implemented method of claim 1, wherein the given workload comprises an artificial intelligence (AI) process and the other workloads processing in the computing system concurrently with the given workload do not comprise AI processes.
19. A computer system for performance sensitive targeted accelerator dispatch, comprising: a memory; and one or more processors in communication with the memory, wherein the computer system is configured to perform a method, said method comprising: initiating, by the one or more processors from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload; determining, by the one or more processors, that a local accelerator on the first chip is not available to accept a dispatch of the work; obtaining, by the one or more processors, hardware monitoring data from local hardware counters associated with various elements of the computing system; and determining, based on the hardware monitoring data, a best remote accelerator for dispatching the work to, wherein the best remote accelerator comprises selecting an accelerator from one or more remote accelerators in the computing system, wherein based on the hardware monitoring data the selected accelerator utilizes bandwidth and one or more caches which are minimally accessed by other workloads processing in the computing system concurrently with the given workload when compared to the other accelerators comprising the one or more remote accelerators.
20. A computer program product for performance sensitive targeted accelerator dispatch, the computer program product comprising: one or more computer readable storage media and program instructions collectively stored on the one or more computer readable storage media readable by at least one processing circuit to: initiate, from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload; determine that a local accelerator on the first chip is not available to accept a dispatch of the work; obtain hardware monitoring data from local hardware counters associated with various elements of the computing system; and determine, based on the hardware monitoring data, a best remote accelerator for dispatching the work to, wherein the best remote accelerator comprises selecting an accelerator from one or more remote accelerators in the computing system, wherein based on the hardware monitoring data the selected accelerator utilizes bandwidth and one or more caches which are minimally accessed by other workloads processing in the computing system concurrently with the given workload when compared to the other accelerators comprising the one or more remote accelerators.
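
For illustration only, the selection flow recited in claims 1 to 9 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation. The `AcceleratorStats` fields, the equal-weight `interference_score`, the normalization of every metric to the range 0 to 1, and the decision to defer (return `None`) when the best candidate's chip is too busy are all hypothetical and do not come from the claims.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass(frozen=True)
class AcceleratorStats:
    """Hypothetical snapshot of hardware-counter data for one remote accelerator.

    All metrics are assumed to be normalized to the range 0..1.
    """
    name: str
    available: bool
    cache_eviction_intensity: float      # cache pressure from other workloads
    link_utilization: float              # links from the first chip to this one
    memory_bandwidth_utilization: float  # memory traffic on the remote chip
    unrelated_activity: float            # accelerator-unrelated chip/core activity


def interference_score(s: AcceleratorStats) -> float:
    """Lower means less expected disturbance of concurrently running workloads.

    Equal weighting is an assumption made only for this sketch.
    """
    return (s.cache_eviction_intensity
            + s.link_utilization
            + s.memory_bandwidth_utilization)


def rank_remote(candidates: Sequence[AcceleratorStats]) -> List[AcceleratorStats]:
    """Rank the available remote accelerators, least interference first."""
    usable = [c for c in candidates if c.available]
    return sorted(usable, key=interference_score)


def choose_target(local_available: bool,
                  remotes: Sequence[AcceleratorStats],
                  activity_threshold: float) -> Optional[str]:
    """Return "local", the name of a remote accelerator, or None to defer."""
    if local_available:
        return "local"  # the local accelerator is preferred when it is free
    ranked = rank_remote(remotes)
    if not ranked:
        return None  # no remote accelerator can accept the work
    best = ranked[0]
    # Dispatch only if the best candidate's chip is below the activity threshold.
    if best.unrelated_activity < activity_threshold:
        return best.name
    return None  # assumption: defer rather than try the next-ranked candidate
```

With two available candidates, one showing heavy cache eviction and one showing light eviction and light link traffic, `choose_target(False, remotes, 0.5)` would return the name of the lighter candidate as long as that chip's unrelated activity is below 0.5. A weighted score, or a fallback to the next-ranked accelerator, would be equally plausible design choices.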

Description

BACKGROUND

One or more aspects relate, in general, to facilitating processing within a computing environment, and in particular, to protecting workloads running in parallel with workloads requesting to utilize a particular accelerator.

Artificial intelligence (AI) refers to intelligence exhibited by machines. AI research includes search and mathematical optimization, neural networks, and probability. AI solutions involve features derived from research in a variety of science and technology disciplines, including computer science, mathematics, psychology, linguistics, statistics, and neuroscience. Machine learning has been described as the field of study that gives computers the ability to learn without being explicitly programmed.

Performance accelerators, also known as accelerators (including hardware accelerators), are microprocessors or specialized circuits or functions that are capable of accelerating certain workloads. Workloads that can be accelerated are offloaded to the performance accelerators, which are much more efficient at performing workloads such as AI, machine vision, and deep learning. Performance acceleration can integrate general-purpose processors and more specific purpose processors to work together simultaneously to perform a task. Performance accelerators are capable of performing parallel computations rather than serial computing.

SUMMARY

Shortcomings of the prior art are overcome, and additional advantages are provided through the provision of a computer-implemented method for performance sensitive targeted accelerator dispatch.
The method can include: initiating, by one or more processors from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload; determining, by the one or more processors, that a local accelerator on the first chip is not available to accept a dispatch of the work; obtaining, by the one or more processors, hardware monitoring data from local hardware counters associated with various elements of the computing system; and determining, based on the hardware monitoring data, a best remote accelerator for dispatching the work to, wherein the best remote accelerator comprises selecting an accelerator from one or more remote accelerators in the computing system, wherein based on the hardware monitoring data the selected accelerator utilizes bandwidth and one or more caches which are minimally accessed by other workloads processing in the computing system concurrently with the given workload when compared to the other accelerators comprising the one or more remote accelerators.

Shortcomings of the prior art are overcome, and additional advantages are provided through the provision of a computer program product for performance sensitive targeted accelerator dispatch. The computer program product comprises a storage medium readable by one or more processors and storing instructions for execution by the one or more processors for performing a method.
The method includes, for instance: initiating, by the one or more processors from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload; determining, by the one or more processors, that a local accelerator on the first chip is not available to accept a dispatch of the work; obtaining, by the one or more processors, hardware monitoring data from local hardware counters associated with various elements of the computing system; and determining, based on the hardware monitoring data, a best remote accelerator for dispatching the work to, wherein the best remote accelerator comprises selecting an accelerator from one or more remote accelerators in the computing system, wherein based on the hardware monitoring data the selected accelerator utilizes bandwidth and one or more caches which are minimally accessed by other workloads processing in the computing system concurrently with the given workload when compared to the other accelerators comprising the one or more remote accelerators.

Shortcomings of the prior art are overcome, and additional advantages are provided through the provision of a system for performance sensitive targeted accelerator dispatch. The system includes: a memory, one or more processors in communication with the memory, and program instructions executable by the one or more processors via the memory to perform a method.

The method can include initiating, by the one or more processors from a first chip in a computing system, a request to utilize an accelerator in the computing system to process work associated with a given workload; determining, by the one or more processors, that a local accelerator on the first chip is not available to accept a dispatch of the work; obtaining, by the one or more processors, hardware monitoring data from local hardware counters associated with var