CN-121807504-B - Self-adaptive load resource scheduling method and device for embedded GPU
Abstract
The invention belongs to the technical field of embedded artificial intelligence and real-time system scheduling, and particularly relates to an adaptive load resource scheduling method and device for an embedded GPU. In an offline stage, the method characterizes and models the thermal-throttling behavior of the platform, constructs DNN variants with different computational loads and accuracies, and profiles them under multiple frequency gears and TPC configurations to obtain a delay-power-consumption mapping table. In an online stage, based on the task deadline margin and the current temperature/frequency state, it selects from the Profiling table the variant, together with its TPC configuration, that has the highest accuracy while still meeting the deadline in the current thermal state; when necessary, the variant load is cut online in stages and switched at low cost, and the scheduler cooperates with system-level DVFS and task mapping to optimize energy efficiency. The method guarantees the real-time performance of inference tasks even under passive frequency reduction and significantly reduces the performance fluctuation caused by thermal throttling.
Inventors
- PANG WEIGUANG
- GAO LONGXIANG
- Fu Kexue
- WANG CHANGWEI
- QU YOUYANG
- GU SHUJUN
- GE SHUXIN
Assignees
- Shandong Computer Science Center (National Supercomputer Center in Jinan)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-03-06
Claims (10)
- 1. An adaptive load resource scheduling method for an embedded GPU, applied to an embedded GPU platform with a passive thermal-throttling characteristic, characterized by comprising an offline preprocessing stage and an online real-time scheduling stage. The offline preprocessing stage comprises the following steps: S1, collecting and analyzing the dynamic temperature and frequency characteristics of the target edge SoC platform under the passive thermal-throttling mechanism, establishing a semi-empirical temperature-frequency-execution-time-energy-consumption model, and obtaining the worst-case execution time bounds of inference tasks under different thermal states; S2, selecting a shared reference CNN model, constructing a group of DNN variants with different computational loads and inference accuracies, completing the joint training of all DNN variants with a shared-weight, alternate-activation strategy, and testing to obtain the reference delay and accuracy of each DNN variant; S3, profiling each variant under a plurality of GPU frequency gears and TPC activation configurations, and establishing a delay table C and a power consumption table P covering the variants under the different resource configurations, the delay table C and the power consumption table P together forming a Profiling table. The online real-time scheduling stage comprises the following steps: S4, computing the task deadline margin when each inference task arrives, and conservatively estimating the GPU frequency during the task execution stage to determine the estimated frequency gear for the task; S5, querying the delay table C based on the task deadline margin and the estimated frequency gear, selecting the DNN variant model with the highest accuracy, selecting, among the feasible TPC configurations for that model, the configuration with the fewest active TPCs, and starting inference.
- 2. The adaptive load resource scheduling method for an embedded GPU according to claim 1, wherein step S1 specifically comprises the following steps: consulting the chip manual and thermal-management documentation of the target edge SoC platform to determine the working parameters of the passive thermal-throttling mechanism, including the thermal-throttling trigger temperature threshold T_th, the frequency gear set corresponding to each temperature interval, the control strategy for frequency reduction and recovery, and the temperature and frequency sensor interfaces readable by the system; designing single-variable controlled experiments that change one variable at a time while keeping the AI inference task type, model structure and running environment consistent, and periodically collecting data tuples {T_mb, T_gpu, f_gpu} together with the corresponding GPU inference time and total platform power consumption, forming a sample set relating the GPU frequency state to the task inference time, where the three tuple variables T_mb, T_gpu and f_gpu are the current mainboard temperature, GPU temperature and GPU clock frequency of the system, respectively; based on the sample-set data, establishing the semi-empirical temperature-frequency-execution-time-energy-consumption model, obtaining the distribution characteristics of the task inference execution time, identifying the frequency-stable interval, frequency-reduction trigger point and recovery hysteresis in each temperature region, and extracting the worst-case execution time.
- 3. The adaptive load resource scheduling method for an embedded GPU according to claim 1, wherein step S2 specifically comprises the following steps: selecting a shared reference CNN model and constructing a set of DNN variants M = {M_1, ..., M_K}, each variant model sharing part of the parameters of the reference model; training the DNN variants with a shared-weight, alternate-activation strategy, that is, first training the complete shared reference CNN model to obtain shared initialization weights, and then, during training, randomly activating with a set probability the sub-networks corresponding to given parameter proportions for forward and backward propagation, so that the shared parameters are updated synchronously across all activated sub-networks while the exclusive parameters of a sub-network are updated only when that sub-network is activated; after training, performing inference tests of each variant on the target edge SoC platform and recording the reference delay and accuracy of each variant.
- 4. The adaptive load resource scheduling method for an embedded GPU according to claim 3, wherein step S3 specifically comprises the following steps: for each variant M_k, under several offline-selected frequency gears f_j and several TPC mask configurations τ_h, running representative inputs and recording the maximum observed GPU inference delay C(k, j, h) and the corresponding power consumption P(k, j, h), where k is the DNN variant index, j the frequency gear index and h the TPC configuration index; organizing all data into a delay table C and a power consumption table P, the delay table C and the power consumption table P together forming a Profiling table, with the variants reordered by reference delay and accuracy and dominated entries eliminated.
- 5. The adaptive load resource scheduling method for an embedded GPU according to claim 1, wherein step S4 specifically comprises the following steps: when each inference task arrives or its job starts, reading the current time t, the job start time and the absolute deadline d, and computing the task deadline margin S = d − t; simultaneously reading the currently reported GPU frequency gear index j and the node temperature T(t), with a preset temperature margin ΔT; if T(t) ≥ T_th − ΔT, where T_th is the thermal-throttling trigger temperature threshold, indicating that frequency reduction is about to trigger, setting the estimated gear j' to one gear lower than the current gear j in the Profiling table; otherwise setting j' = j.
- 6. The adaptive load resource scheduling method for an embedded GPU according to claim 5, wherein step S5 specifically comprises the following steps: under the determined gear j', finding in the delay table C all entries satisfying C(k, j', h) ≤ S, where S is the task deadline margin, to obtain the feasible DNN variant model set; selecting from the feasible set the DNN variant model with the highest accuracy, and then selecting, from the feasible TPC configurations of that model, the configuration with the fewest active TPCs; if the feasible model set is empty, marking the job as non-schedulable; after the selection, atomically applying the chosen model index and TPC configuration at the task boundary and starting inference.
- 7. The adaptive load resource scheduling method for an embedded GPU according to claim 6, wherein step S5 further comprises dynamic adjustment during inference: if lagging progress or a rapid temperature rise is detected during inference, the scheduler, starting from the remaining network stages, decreases the load proportion α of the remaining stages according to a predetermined rule, so that the estimated remaining execution time is again no greater than the remaining deadline margin.
- 8. The adaptive load resource scheduling method for an embedded GPU according to claim 1, wherein the online real-time scheduling stage further comprises system coordination and adaptive adjustment, specifically as follows: the online scheduler cooperates with the system-level DVFS and task mapping modules; when a system-level requirement to reduce power consumption arises or the temperature reaches a critical value, the scheduler synchronously reduces the load proportion α and allows DVFS to moderately reduce the frequency; the system keeps track of historical statistics, and if the GPU frequency is repeatedly reduced passively within a set time window, the scheduling policy preferentially selects a more conservative gear j' or a lower load proportion α.
- 9. The adaptive load resource scheduling method for an embedded GPU according to claim 1, wherein each decision of the scheduler at run time only needs, in the worst case, to traverse K models and H TPC configurations and look up a table, giving complexity O(K+H); all variants reside in GPU memory, so switching the variant model only requires updating an index or activation mask and is applied atomically at task boundaries; the construction of the Profiling table can be parallelized and sampled, and conservative WCET values are adopted when selecting table entries.
- 10. A device implementing the adaptive load resource scheduling method for an embedded GPU, characterized by comprising an offline preprocessing module and an online real-time scheduling module. The offline preprocessing module comprises a passive frequency-reduction analysis unit, a DNN variant construction unit and a Profiling table construction unit: the passive frequency-reduction analysis unit collects and analyzes the dynamic temperature and frequency characteristics of the target edge SoC platform under the passive thermal-throttling mechanism, establishes the semi-empirical temperature-frequency-execution-time-energy-consumption model, and obtains the worst-case execution time bounds of inference tasks under different thermal states; the DNN variant construction unit selects a shared reference CNN model, constructs a group of DNN variants, completes the joint training of all DNN variants with the shared-weight, alternate-activation strategy, and tests to obtain the reference delay and accuracy of each DNN variant; the Profiling table construction unit profiles each variant under a plurality of GPU frequency gears and TPC activation configurations and builds the delay table C and power consumption table P covering each variant under the different resource configurations. The online real-time scheduling module comprises a runtime monitoring and estimation unit, a scheduling decision unit and a system-cooperative adaptive unit: the runtime monitoring and estimation unit computes the task deadline margin when each inference task arrives and conservatively estimates the GPU frequency during task execution to determine the estimated frequency gear for the task; the scheduling decision unit queries the delay table C based on the task deadline margin and the estimated frequency gear, selects the DNN variant model with the highest accuracy, selects the TPC configuration with the fewest active TPCs among the feasible TPC configurations for that model, and starts inference; the system-cooperative adaptive unit performs cooperative optimization across the online scheduling process, the system-level DVFS module and the task mapping module, continuously records historical statistics of task operation, and performs closed-loop adaptive adjustment of the scheduling policy based on those statistics.
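The online decision in steps S4 and S5 can be illustrated with a short sketch. This is not the patented implementation: the table values, the gear/TPC indexing, and the constant names (DELAY_TABLE, T_TRIGGER, T_MARGIN) are all hypothetical, chosen only to show the lookup logic of a conservative gear estimate followed by a highest-accuracy, fewest-TPCs selection.

```python
# Hypothetical Profiling delay table C: (variant k, gear j, TPC mask h) -> worst-case delay (ms).
# Variants are indexed so that a larger k means higher accuracy; gear 0 is the lowest frequency.
DELAY_TABLE = {
    (0, 0, 0): 18.0, (0, 0, 1): 12.0,
    (1, 0, 0): 30.0, (1, 0, 1): 21.0,
    (0, 1, 0): 9.0,  (0, 1, 1): 6.0,
    (1, 1, 0): 15.0, (1, 1, 1): 11.0,
}
ACTIVE_TPCS = {0: 2, 1: 4}   # TPC mask index -> number of active TPCs
T_TRIGGER = 85.0             # assumed thermal-throttling trigger threshold (deg C)
T_MARGIN = 5.0               # assumed preset temperature margin (deg C)

def conservative_gear(gear, temperature):
    """S4: if throttling is about to trigger, assume one gear lower than reported."""
    if temperature >= T_TRIGGER - T_MARGIN:
        return max(gear - 1, 0)
    return gear

def schedule(deadline_margin, gear, temperature):
    """S5: pick the highest-accuracy variant whose worst-case delay fits the
    deadline margin, then the feasible TPC config with the fewest active TPCs.
    Returns (variant, tpc_mask) or None if the job is non-schedulable."""
    jp = conservative_gear(gear, temperature)
    feasible = [(k, h) for (k, j, h), c in DELAY_TABLE.items()
                if j == jp and c <= deadline_margin]
    if not feasible:
        return None                       # mark the job as non-schedulable
    best_k = max(k for k, _ in feasible)  # highest-accuracy variant
    best_h = min((h for k, h in feasible if k == best_k),
                 key=lambda h: ACTIVE_TPCS[h])
    return best_k, best_h
```

With the sample table, a cool node with a 16 ms margin at gear 1 gets the accurate variant on the smaller TPC mask, while a node near the throttling threshold is evaluated at the lower gear and falls back to the lighter variant.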
Description
Self-adaptive load resource scheduling method and device for embedded GPU
Technical Field
The invention belongs to the technical field of embedded artificial intelligence and real-time system scheduling, and particularly relates to an adaptive load resource scheduling method and device for an embedded GPU.
Background
As the demand for low-latency inference in edge AI applications (e.g., video-stream object detection, autonomous-driving perception) grows, embedded SoCs (e.g., the NVIDIA Jetson series) have become an important deployment platform. An embedded GPU integrates an on-chip temperature sensor and hardware thermal manager; when the node temperature exceeds a threshold, the hardware triggers thermal throttling, forcibly reducing the GPU frequency, which significantly increases inference delay and may violate the deadline requirements of real-time tasks. Existing model compression and dynamic network techniques can generate model variants with different loads, but most of them focus on optimizing average delay or throughput and do not combine model-level adaptation with system-level resource allocation (such as TPC activation) under passive thermal throttling to provide real-time guarantees. System-level DVFS or task mapping methods generally assume that the frequency can be directly controlled by software and cannot directly cope with frequency reduction triggered passively by hardware. For example, Chinese patent document CN121166377A discloses an edge AI computing-power and power-consumption collaborative system, which constructs a multi-scheme matrix through task decomposition and computing-power/energy-consumption prediction, and introduces a mechanism in which temperature monitoring triggers dynamic task migration.
Chinese patent document CN120354954A discloses an AI inference optimization method and system for edge devices using a dynamic model-switching framework, which corrects quantization errors through a quantization compensation model trained with federated learning, constructs a hierarchical heterogeneous resource management system for multi-core collaborative scheduling, integrates the compensation parameters into an FPGA (field-programmable gate array) acceleration circuit with an intelligent power-consumption balancing technique, and builds a dynamic decision engine with a graph neural network to trigger model switching, thereby providing a feasible path for dynamic adjustment at the model level. The present invention aims to remedy the remaining defects: by combining offline Profiling with conservative online decisions, it achieves predictable and robust scheduling and energy-efficiency management under passive frequency-reduction scenarios.
Disclosure of Invention
The invention aims to overcome at least one defect of the prior art and provides an adaptive load resource scheduling method for an embedded GPU that, under passive frequency reduction of the embedded GPU, dynamically adjusts the DNN load and cooperates with system-level resources, thereby optimizing energy efficiency and reducing the probability of triggering thermal throttling while guaranteeing real-time performance, and solving the problems of inference tasks missing deadlines and reduced energy efficiency caused by thermal throttling (passive frequency reduction) of the embedded GPU. The invention also discloses a device implementing the adaptive load resource scheduling method for the embedded GPU.
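On the offline-Profiling side, claim 4 mentions that the table entries are reordered and "dominated" entries eliminated. A minimal sketch of that pruning, assuming entries are (variant_id, accuracy, worst_case_delay) triples (an illustrative format, not taken from the patent), could look as follows: an entry is dropped when another entry is at least as accurate and no slower.

```python
def prune_dominated(entries):
    """Keep only Pareto-optimal Profiling entries.

    entries: list of (variant_id, accuracy, worst_case_delay_ms).
    Returns the surviving entries sorted by ascending delay; along that
    order, accuracy is strictly increasing (slower entries must pay off
    with higher accuracy, otherwise they are dominated and removed).
    """
    kept = []
    # Sort by delay; on delay ties, consider the more accurate entry first
    # so the less accurate duplicate is recognized as dominated.
    for vid, acc, delay in sorted(entries, key=lambda e: (e[2], -e[1])):
        if not kept or acc > kept[-1][1]:
            kept.append((vid, acc, delay))
    return kept
```

After pruning, the online table lookup never has to consider a variant that is both slower and less accurate than an alternative, which keeps the per-decision scan small.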
The detailed technical scheme of the invention is as follows. An adaptive load resource scheduling method for an embedded GPU is applied to an embedded GPU platform with a passive thermal-throttling characteristic and comprises an offline preprocessing stage and an online real-time scheduling stage. The offline preprocessing stage comprises the following steps: S1, collecting and analyzing the dynamic temperature and frequency characteristics of the target edge SoC platform under the passive thermal-throttling mechanism, establishing a semi-empirical temperature-frequency-execution-time-energy-consumption model, and obtaining the worst-case execution time bounds of inference tasks under different thermal states; S2, selecting a shared reference CNN model, constructing a group of DNN variants with different computational loads and inference accuracies, completing the joint training of all DNN variants with a shared-weight, alternate-activation strategy, and testing to obtain the reference delay and accuracy of each DNN variant; S3, profiling each variant under a plurality of GPU frequency gears and TPC activation configurations, and establishing a delay table C and a power consumption table P covering the variants under the different resource configurations, the delay table C and the power consumption table P together forming a Profiling table.