CN-121996384-A - FPGA-based adaptive hybrid scheduling accelerator and implementation method thereof

Abstract

The invention provides an FPGA-based adaptive hybrid scheduling accelerator and an implementation method thereof. The accelerator comprises a control interface unit (CIU), a multi-level priority queue array (MPQA) and an adaptive scheduling core (ASH), and aims to solve problems such as large task scheduling delay, slow interrupt response and system throughput bottlenecks caused by traditional software schedulers in heterogeneous multi-core SoCs, demanding embedded systems and edge computing nodes.

Inventors

LI SIZHAO
ZHU XIAOLI
ZHU YUHANG
WANG HAN
GUO ZHUJUN
SHI SHAOHUA
LI QIULIANG

Assignees

Harbin Engineering University (哈尔滨工程大学)

Dates

Publication Date: 2026-05-08
Application Date: 2026-01-29

Claims (10)

1. An FPGA-based adaptive hybrid scheduling accelerator, characterized by comprising a control interface unit CIU, a multi-level priority queue array MPQA and an adaptive scheduling core ASH; the control interface unit CIU interacts with the ARM core through AXI-Lite, receives the current task set state and policy switching commands, and writes the selected next-task ID back to the core; the multi-level priority queue array MPQA instantiates N parallel FIFOs in FPGA Block RAM, each corresponding to one priority, and the queue head/tail pointers are synchronized using independent clock domains, so that all queue heads are read out in parallel in a single cycle; the adaptive scheduling core ASH comprises a policy decision machine and a task migration machine, wherein the policy decision machine, implemented as a hardware state machine, samples the queue length, the CPU utilization and the task waiting time, and switches among the RR, SJF and AHS modes in a single cycle according to a threshold table without reconfiguring the bitstream; the task migration machine is responsible for cross-core load balancing: when a core is detected to have been idle for more than 256 clock cycles, tasks at the tail of the furthest non-empty queue are automatically migrated to the idle core's queue, and the migration is performed by DMA on the FPGA side and is transparent to the CPU.
2. A method for implementing the FPGA-based adaptive hybrid scheduling accelerator of claim 1, the method comprising: step 1, system power-on and hardware resource initialization, wherein the FPGA side completes global reset and clock domain synchronization, the CIU initializes its internal registers, and the MPQA empties all FIFO queues and verifies the available UltraRAM capacity, ensuring context storage space for at least 32 tasks; step 2, establishing the shared memory mapping and data structures, wherein the ARM side loads the scheduling driver module through insmod, calls mmap() to map 2 MB of contiguous physical memory starting at physical address 0x1_0000_0000 into the user-space virtual address space, and constructs a ring TCB pool in the shared region, wherein each TCB entry is 128 bytes and comprises a task ID, a priority, a remaining run time, a context pointer and a checksum; step 3, batch injection of the initial task TCBs; step 4, initial loading of task contexts and UltraRAM allocation; step 5, scheduler activation and entry into monitoring mode; step 6, scheduling event triggering and fast interrupt response; step 7, parallel arbitration and ASH dynamic decision; step 8, zero-copy context switching; step 9, dynamic update and seamless switching of the adaptive policy; and step 10, anomaly monitoring and system consistency maintenance.
3. The implementation method according to claim 2, wherein in step 3, the ARM side writes the TCB data of the initial task set into the shared RAM through the ioctl() interface according to application requirements, and triggers a DMA batch transfer request of the CIU once for every 16 TCBs written; the CIU asynchronously transfers the TCB data to the corresponding priority FIFO of the MPQA through the AXI-Lite bus, and sets the TCB_READY status flag when the transfer is completed.
4. The implementation method according to claim 2, wherein in step 4, the ARM side writes the initial context snapshot of each task into the data register of the CIU in task-ID order through the write() system call; the address generation unit AGU in the CIU allocates a fixed UltraRAM address offset to each task, completes on-chip storage of the context in 3 clock cycles, and establishes a direct mapping table from TCBs to UltraRAM addresses.
5. The implementation method according to claim 2, wherein in step 5, the ARM side writes 0x01 into the CIU command register and the ASH enters the active monitoring state; the load monitoring unit in the MPQA then starts to count the depth of each priority queue, the task arrival rate and the service time, calculates the system load index ρ=λ/μ in real time, and writes it into a visual monitoring register of the ASH for the ARM side to read by polling.
6. The implementation method according to claim 2, wherein in step 6, when a timer interrupt or peripheral interrupt arrives, the ARM-side GIC executes only a minimal ISR, which disables interrupt nesting, writes a 1-byte command to the sched_trigger register of the CIU and returns immediately; this operation reduces the ARM intervention delay to <20 ns and avoids the overhead of saving the kernel-mode context.
7. The implementation method according to claim 2, wherein in step 7, the FPGA side reads the queue heads of the MPQA FIFOs on the next rising clock edge and sends them to the parallel decision machine of the ASH; the ASH calculates the dynamic priority weight of each task according to the algorithm specified by the current policy register and, in combination with the load index ρ, performs a two-stage pipelined decision, in which the first stage selects the task with the highest priority and the second stage resolves weight conflicts; the next-task ID is finally output, and the total decision delay is less than or equal to 2 clock cycles.
8. The implementation method according to claim 2, wherein in step 8, the switching engine starts immediately after ASH arbitration is completed: in the save stage, the 16 general-purpose registers + PC + PSR of the current task are written back in parallel from the CPU interface to their original UltraRAM addresses through a wide bus, taking 1 clock cycle; in the restore stage, the new context is read from UltraRAM according to the next-task ID and written into the CPU register file through the AXI-HP port in 3 clock cycles; the whole process involves no DDR access, and the switching delay is less than 50 ns.
9. The implementation method according to claim 2, wherein in step 9, if the load monitoring unit detects an abrupt change in ρ or a task timeout rate >5%, it automatically triggers a policy update interrupt to the ASH; the ASH loads new policy parameters from the policy configuration table during an idle cycle and atomically updates the "current policy register"; subsequent MPQA enqueue operations are immediately ordered according to the new policy, while the decision flow already in progress is not affected, thereby achieving zero-jitter policy switching.
10. The implementation method according to claim 2, wherein in step 10, a watchdog timer built into the CIU monitors the time taken by each scheduling operation; if the switching time is >100 ns, a sched_err interrupt is triggered to the ARM, and the ARM-side exception handler reads the diagnostic register of the CIU and performs selective task termination or partial reconfiguration of the FPGA; meanwhile, after every 128 scheduling operations, the CIU automatically writes the queue statistics of the MPQA into the audit area of the shared RAM, so that the ARM side can perform long-term load analysis and energy consumption optimization.
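
By way of non-limiting illustration, the arbitration behaviour recited in claims 1, 5 and 7 can be modelled in software as follows. This is a behavioural sketch in Python, not the claimed hardware. The threshold values, the weight formulas for RR, SJF and AHS, the reading of μ as a service rate, the rule that a smaller priority number means higher priority, and all function and field names are assumptions introduced only for illustration; the claims do not specify them. Only the formula ρ=λ/μ, the three mode names, the two-stage decision and the 256-cycle idle limit come from the claims.

```python
from collections import deque
from dataclasses import dataclass

IDLE_MIGRATION_CYCLES = 256  # idle limit stated in claim 1

# Hypothetical threshold table (upper bound on rho -> mode); the claims give no values.
THRESHOLD_TABLE = [(0.3, "SJF"), (0.7, "AHS")]
DEFAULT_POLICY = "RR"  # assumed mode once rho exceeds every bound


@dataclass
class Task:
    task_id: int
    priority: int    # assumption: smaller number = higher priority
    remaining: int   # remaining run time, arbitrary units
    waited: int = 0  # waiting time, arbitrary units


def load_index(arrival_rate: float, service_rate: float) -> float:
    """System load index rho = lambda / mu (mu assumed to be a service rate)."""
    return arrival_rate / service_rate


def select_policy(rho: float) -> str:
    """Pick a mode from the threshold table; a pure lookup, no reconfiguration."""
    for upper_bound, policy in THRESHOLD_TABLE:
        if rho < upper_bound:
            return policy
    return DEFAULT_POLICY


def dynamic_weight(task: Task, policy: str, rho: float) -> float:
    """Assumed weight functions; the larger weight wins."""
    if policy == "SJF":
        return -task.remaining  # shortest remaining time first
    if policy == "RR":
        return task.waited      # longest-waiting head first
    # AHS (assumed blend): shift from short-job bias to fairness as load rises
    return (1.0 - rho) * -task.remaining + rho * task.waited


def arbitrate(queues, policy: str, rho: float):
    """Two-stage decision over all queue heads; returns the next-task ID or None."""
    heads = [q[0] for q in queues if q]  # all heads are read together
    if not heads:
        return None
    weights = [dynamic_weight(t, policy, rho) for t in heads]
    best = max(weights)
    # Stage 1: keep the heads that share the best weight.
    candidates = [t for t, w in zip(heads, weights) if w == best]
    # Stage 2: resolve weight conflicts by priority, then by task ID.
    winner = min(candidates, key=lambda t: (t.priority, t.task_id))
    return winner.task_id


def migrate_tail(donor: deque, idle_queue: deque, idle_cycles: int) -> bool:
    """Move the donor queue's tail task once a core has idled for more than 256 cycles."""
    if idle_cycles > IDLE_MIGRATION_CYCLES and donor:
        idle_queue.append(donor.pop())
        return True
    return False
```

In this model a caller would compute `rho = load_index(...)`, obtain the mode with `select_policy(rho)`, and pass both to `arbitrate` together with one `deque` per priority level. The choice of the donor queue for `migrate_tail` is left to the caller, because the sketch does not interpret which queue counts as the "furthest" one.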

Description

FPGA-based adaptive hybrid scheduling accelerator and implementation method thereof

Technical Field

The invention relates to the technical field of accelerators, and in particular to an FPGA-based adaptive hybrid scheduling accelerator and an implementation method thereof.

Background

The task scheduler of an operating system manages the task execution order and resource allocation, and its performance directly affects the response speed, throughput and real-time behavior of a heterogeneous multi-core system. The traditional Linux Completely Fair Scheduler (CFS) and the real-time scheduling classes (SCHED_RR/SCHED_FIFO) run on general-purpose CPU cores; they must frequently traverse a red-black tree or linked-list ready queue, update vruntime and calculate time slices, and they also bear the overhead of interrupt handling and context save/restore. As the number of tasks and CPU cores increases, the scheduling path grows nonlinearly: the context switching delay is generally more than 5 μs, the interrupt response time is close to 10 μs, and millisecond-level jitter occurs in high-concurrency or hard real-time scenarios, so sub-millisecond determinism requirements cannot be met.

To reduce scheduling delay, researchers in China and abroad have successively proposed various hardware/software co-design schemes, but each has obvious shortcomings:

In 2022, Xu et al. published "communication delay aware real-time scheduling for FPGA multi-core systems" in Microprocessors and Microsystems; task priority is adjusted after the FPGA monitors inter-core communication overhead, but scheduling decisions are still completed on the CPU side, and the measured context switch delay is still as high as 5.14 μs, with an interrupt response of 9.82 μs.

In 2023, Rodriguez-Canal et al. proposed "preemptive task scheduling based on partial reconfiguration" in Concurrency and Computation: Practice and Experience, supporting the two algorithms EDF and RR, but the reconfigurable region occupies 40% of the LUT resources and policy switching requires 50 ms, creating a risk of real-time task starvation.

In 2024, Paul and Danelutto proposed "data center FPGA power consumption aware task scheduling" at Euromicro PDP, which only accelerates task-to-FPGA binding decisions and does not involve end-to-end paths such as context save and interrupt arbitration; the measured interrupt response is still >9 μs.

In 2025, Karabulut et al. proposed THEMIS, a multi-tenant fair scheduler supporting dual time/energy objectives, but it adopts a static compilation approach: task classification and the scheduling policy are fixed at compile time and cannot be dynamically adjusted according to the instantaneous load at run time, and the average tardiness increases by 38% when a burst of real-time tasks arrives.

The common defects of the above schemes are:

(1) The policy is deeply coupled with the hardware; switching the scheduling algorithm requires re-synthesis or reconfiguration, and the delay is large;
(2) Only the task-selection sub-step is accelerated; critical paths such as context switching, interrupt arbitration and cross-core migration are not implemented in hardware, so the improvement in end-to-end delay is limited;
(3) A runtime awareness mechanism is lacking, so the scheduling policy cannot be dynamically adjusted according to CPU utilization, task real-time characteristics, queue length and the like, resulting in insufficient fairness under high load and insufficient responsiveness under low load;
(4) Heterogeneous multi-core scaling is not supported, and inter-core task stealing and load balancing units are absent, making scalable deployment on CPU-FPGA SoC platforms difficult.

Therefore, the prior art still cannot complete the 'perception-decision-switching' closed loop within microseconds, and cannot meet the combined requirements of heterogeneous multi-core hard real-time systems for determinism, low delay and high throughput.

Disclosure of Invention

The invention provides an FPGA-based adaptive hybrid scheduling accelerator and an implementation method thereof. The invention designs an adaptive hybrid scheduler on a field-programmable gate array (FPGA), and aims to solve problems such as large task scheduling delay, slow interrupt response and system throughput bottlenecks caused by traditional software schedulers in heterogeneous multi-core SoCs, demanding embedded systems and edge computing nodes.

The invention is realized by the following technical scheme. The invention provides an FPGA-based adaptive hybrid scheduling accelerator, which comprises a control interface unit CIU, a multi-level priority queue array MPQA and an adaptive scheduling core ASH; the control interface unit CIU interacts with the ARM core through AXI-Lite, receives the current task set state and policy switching commands, and writes the selected next-task ID back to the core; The multi-level p