CN-122019070-A - Task dynamic scheduling method and system based on hardware state awareness


Abstract

The invention discloses a task dynamic scheduling method and system based on hardware state awareness, and relates to the technical field of computer system resource scheduling. The method comprises: collecting, in real time, hardware state data of at least one hardware computing unit, such as the video memory utilization rate, core temperature and power consumption; dynamically allocating the task to be executed, according to that data, to a hardware computing unit whose load meets preset conditions; when the hardware state data of a target hardware computing unit exceeds a preset threshold, starting a task overflow migration mechanism to migrate the task to other available hardware computing units; and iteratively optimizing the task allocation strategy based on feedback data after task execution and the hardware state change history. The system comprises a hardware monitoring module, a task allocation module, a task migration module and a feedback optimization module. The method can improve the utilization rate of hardware resources and enhance the continuity and stability of task execution, and is suitable for high-performance computing scenarios such as deep learning, scientific computing and real-time rendering.
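As a rough illustration of the threshold check and overflow migration summarized above, the following Python sketch moves one task off each over-threshold unit. The field names, the numeric limits and the rule of choosing the healthy unit with the lowest video memory utilization are all assumptions made for this example; the source only speaks of a "preset threshold" and "other available hardware computing units".

```python
from dataclasses import dataclass, field


@dataclass
class UnitState:
    """Snapshot of one hardware computing unit (field names are illustrative)."""
    name: str
    vram_util: float    # video memory utilization rate, 0.0 to 1.0
    core_temp_c: float  # core temperature in degrees Celsius
    power_w: float      # power consumption in watts
    tasks: list = field(default_factory=list)


# Hypothetical limits; the source does not give concrete threshold values.
LIMITS = {"vram_util": 0.90, "core_temp_c": 85.0, "power_w": 250.0}


def exceeds_threshold(unit: UnitState) -> bool:
    """True if any monitored metric is above its preset limit."""
    return (unit.vram_util > LIMITS["vram_util"]
            or unit.core_temp_c > LIMITS["core_temp_c"]
            or unit.power_w > LIMITS["power_w"])


def migrate_overflow(units):
    """Move one task from each over-threshold unit to a healthy unit.

    Returns a list of (task, source_name, destination_name) tuples.
    The destination is the healthy unit with the lowest video memory
    utilization in the snapshot (an assumed selection rule); the snapshot
    is not refreshed after a move.
    """
    healthy = [u for u in units if not exceeds_threshold(u)]
    moves = []
    for src in units:
        if not exceeds_threshold(src) or not src.tasks or not healthy:
            continue
        dst = min(healthy, key=lambda u: u.vram_util)
        task = src.tasks.pop()  # take the most recently added task
        dst.tasks.append(task)
        moves.append((task, src.name, dst.name))
    return moves


units = [
    UnitState("gpu0", 0.95, 88.0, 240.0, ["render", "train"]),
    UnitState("gpu1", 0.40, 60.0, 120.0),
    UnitState("npu0", 0.20, 55.0, 30.0),
]
moves = migrate_overflow(units)
```

In this example gpu0 is over both the video memory and temperature limits, so one of its tasks is handed to npu0, the healthy unit with the most free video memory. A real migration would also have to carry over the execution state of the task, which this sketch does not model.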

Inventors

HU QUNCHAO
XIANG TIAN

Assignees

合肥速显微电子科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251203

Claims (10)

1. A task dynamic scheduling method based on hardware state awareness, characterized by comprising the following steps: collecting hardware state data of at least one hardware computing unit in real time, wherein the hardware state data comprises a video memory utilization rate, a core temperature and power consumption; dynamically allocating, according to the hardware state data, the task to be executed to a hardware computing unit whose load meets preset conditions; when the hardware state data of a target hardware computing unit is detected to exceed a preset threshold, starting a task overflow migration mechanism and migrating at least one task on the target hardware computing unit to other available hardware computing units; and iteratively optimizing the task allocation strategy based on feedback data after task execution and the hardware state change history.
2. The task dynamic scheduling method based on hardware state awareness according to claim 1, wherein collecting hardware state data of at least one hardware computing unit in real time specifically comprises: periodically acquiring the running state indicators of the GPU and/or the NPU, by polling or in an event-driven manner, through an integrated hardware monitoring module; and performing abnormality diagnosis on the acquired state data, and generating an early warning signal when an abnormality is diagnosed.
3. The task dynamic scheduling method based on hardware state awareness according to claim 1, wherein dynamically allocating the task to be executed to a hardware computing unit whose load meets preset conditions comprises: evaluating the current load level of each hardware computing unit according to the real-time hardware state data; constructing a load priority queue according to the load evaluation result, and preferentially allocating new tasks to a hardware computing unit with a lower load level; and when the resource utilization of any hardware computing unit approaches its capacity threshold, restricting or suspending the allocation of new tasks to that unit.
4. The task dynamic scheduling method based on hardware state awareness according to claim 1, wherein, in starting the task overflow migration mechanism: the preset threshold comprises a core temperature upper limit, a video memory utilization rate upper limit or a power consumption upper limit; and the task migration process guarantees the continuity of the execution state of the migrated task, which resumes execution once the target hardware resources are available.
5. The task dynamic scheduling method based on hardware state awareness according to claim 1, wherein iteratively optimizing the task allocation strategy specifically comprises: collecting performance data during task execution, including task execution delay and hardware resource utilization rate; and combining the historical hardware state data with the performance data, and adjusting scheduling parameters through machine learning or a rule engine, so as to improve the adaptability and efficiency of subsequent task scheduling.
6. A task dynamic scheduling system based on hardware state awareness, wherein the system is configured to implement the task dynamic scheduling method based on hardware state awareness according to any one of claims 1 to 5, and includes the following modules: a hardware monitoring module, configured to collect hardware state data of at least one hardware computing unit in real time, wherein the hardware state data comprises a video memory utilization rate, a core temperature and power consumption; a task allocation module, configured to dynamically allocate the task to be executed, according to the hardware state data, to a hardware computing unit whose load meets preset conditions; a task migration module, configured to start a task overflow migration mechanism when the hardware state data of a target hardware computing unit exceeds a preset threshold, and to migrate at least one task on the target hardware computing unit to other available hardware computing units; and a feedback optimization module, configured to iteratively optimize the task allocation strategy based on feedback data after task execution and the hardware state change history.
7. The task dynamic scheduling system based on hardware state awareness of claim 6, wherein the hardware monitoring module is further configured to: perform abnormality diagnosis on the acquired video memory utilization rate, core temperature and power consumption data; and, when an abnormality is diagnosed, send an early warning signal to the task migration module and the task allocation module so as to trigger adjustment of the scheduling strategy.
8. The task dynamic scheduling system based on hardware state awareness of claim 6, wherein the task allocation module is further configured to: evaluate the current load level of each hardware computing unit according to the real-time hardware state data, and construct a load priority queue; and generate a task allocation instruction according to the load priority queue.
9. The task dynamic scheduling system based on hardware state awareness of claim 6, wherein the task migration module is further configured to: maintain a pool of available hardware resources; and, when the task overflow migration mechanism is started, select a suitable spare hardware computing unit from the resource pool and execute the task migration operation.
10. The task dynamic scheduling system based on hardware state awareness of claim 6, wherein the feedback optimization module is further configured to: dynamically adjust decision parameters in the scheduling strategy by analyzing historical associated data on task execution delay, resource utilization rate and hardware state, so as to continuously optimize scheduling efficiency.
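
The load evaluation and load priority queue recited in claims 3 and 8 could be sketched as follows in Python. The weighted score, the normalization constants and the suspension limit are assumptions chosen for illustration; the claims do not specify how the load level is computed or what the capacity threshold is.

```python
import heapq

# Hypothetical weights and normalization constants (not from the source).
WEIGHTS = {"vram": 0.5, "temp": 0.3, "power": 0.2}
TEMP_MAX_C = 100.0
POWER_MAX_W = 300.0
SUSPEND_AT = 0.85  # assumed video memory utilization at which allocation is suspended


def load_score(vram_util, core_temp_c, power_w):
    """Combine the three monitored metrics into one load level (lower is better)."""
    return (WEIGHTS["vram"] * vram_util
            + WEIGHTS["temp"] * core_temp_c / TEMP_MAX_C
            + WEIGHTS["power"] * power_w / POWER_MAX_W)


def build_queue(states):
    """Build a min-heap of (score, name) from name -> (vram_util, temp_c, power_w).

    Units at or above SUSPEND_AT are left out, which models suspending
    new allocations to a unit that is close to its capacity threshold.
    """
    heap = [(load_score(*metrics), name)
            for name, metrics in states.items()
            if metrics[0] < SUSPEND_AT]
    heapq.heapify(heap)
    return heap


def pick_unit(states):
    """Return the name of the least-loaded eligible unit, or None if none is eligible."""
    heap = build_queue(states)
    return heap[0][1] if heap else None


states = {
    "gpu0": (0.90, 80.0, 250.0),  # excluded: video memory nearly full
    "gpu1": (0.50, 65.0, 150.0),
    "npu0": (0.30, 70.0, 40.0),
}
chosen = pick_unit(states)
```

With these sample readings gpu0 is excluded from the queue and npu0 has the lowest score, so a new task would go to npu0. A min-heap keeps the least-loaded unit at the front, so the queue can be rebuilt cheaply each time fresh monitoring data arrives.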

Description

Task dynamic scheduling method and system based on hardware state awareness

Technical Field

The invention relates to the technical field of computer system resource scheduling, in particular to a task dynamic scheduling method and system based on hardware state awareness.

Background

Graphics Processing Units (GPUs) and Neural Processing Units (NPUs) have become key computing resources in high-performance computing scenarios such as deep learning, scientific computing, real-time rendering and edge computing. Their efficient parallel processing capability accelerates the execution of complex tasks. However, scheduling tasks efficiently and stably onto appropriate hardware resources in a runtime environment where multiple tasks execute concurrently has become a key factor affecting the overall performance and stability of the system. Existing task scheduling mechanisms are mostly based on static configuration or simple load balancing strategies, and lack awareness of the real-time state of the hardware.

Specifically, the prior art has the following disadvantages:

1. Insufficient hardware state monitoring capability. Traditional scheduling systems usually depend on a preset scheduling strategy and cannot acquire and respond to dynamic hardware state information in real time, such as key indicators including the video memory utilization rate, core temperature and power consumption. This makes it difficult for the system to adjust in time when hardware resources run short or anomalies occur.
2. Task allocation disconnected from hardware state. Existing scheduling strategies usually ignore the actual load of the hardware, so some hardware resources are easily overloaded while others sit idle. This causes resource contention, longer task queuing delays and even hardware overheating, and affects task execution efficiency and system reliability.
3. High task failure rate under high load. When resources are tight or the hardware state is abnormal (such as video memory overflow or excessive core temperature), there is no effective task migration or dynamic scheduling mechanism, so the task execution failure rate rises and the overall stability of the system is challenged.

Therefore, the field urgently needs an intelligent scheduling mechanism that can sense hardware state in real time, dynamically adjust task scheduling strategies and migrate overflowing tasks, so as to improve the utilization rate of system resources, guarantee the continuity and stability of task execution, and suit application scenarios with high concurrency and high load.

Disclosure of Invention

Accordingly, embodiments of the invention provide a task dynamic scheduling method and system based on hardware state awareness, to solve the problems in the prior art of uneven resource utilization, system overload and high task failure rate, which arise because a static scheduling strategy cannot adapt to dynamic hardware state.

In order to solve the above technical problems, an embodiment of the present invention provides a task dynamic scheduling method based on hardware state awareness, which includes the following steps: collecting hardware state data of at least one hardware computing unit in real time, wherein the hardware state data comprises a video memory utilization rate, a core temperature and power consumption; dynamically allocating, according to the hardware state data, the task to be executed to a hardware computing unit whose load meets preset conditions; when the hardware state data of a target hardware computing unit is detected to exceed a preset threshold, starting a task overflow migration mechanism and migrating at least one task on the target hardware computing unit to other available hardware computing units; and iteratively optimizing the task allocation strategy based on feedback data after task execution and the hardware state change history.

Preferably, collecting hardware state data of at least one hardware computing unit in real time specifically includes: periodically acquiring the running state indicators of the GPU and/or the NPU, by polling or in an event-driven manner, through an integrated hardware monitoring module; and performing abnormality diagnosis on the acquired state data, and generating an early warning signal when an abnormality is diagnosed.

Preferably, dynamically allocating the task to be executed to a hardware computing unit whose load meets preset conditions specifically includes: evaluating the current load level of each hardware computing unit according to the real-time hardware state data; constructing a load priority queue according to the loa