CN-122019152-A - Fine-grained lightweight DPU performance isolation method
Abstract
The present disclosure provides a fine-grained lightweight DPU performance isolation method. The method first enters an offline stage, in which accelerator resources are characterized by systematically profiling task execution patterns and the mapping between task parameters and resource consumption is modeled. It then enters an online stage, which intercepts application task submissions, estimates resource demand with the offline model, applies adaptive task splitting and result recombination to decompose large tasks, and executes a workload-guided scheduling process in which dynamic allocation is realized through a global scheduler and per-application agents; the global scheduler allocates time slices or subtasks based on workload characteristics, and the results are recombined and returned after execution. Because the method is built directly on existing vendor SDKs without modifying hardware or proprietary software, it achieves precise resource management, effectively addresses the DPU isolation challenge, and promotes efficient use of cloud provider resources under PaaS and SaaS modes.
Inventors
- ZHANG MENGHAO
- PENG QIYANG
- WANG FEIYANG
Assignees
- 北京航空航天大学 (Beihang University)
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-01-21
Claims (7)
- 1. A fine-grained lightweight DPU performance isolation method, comprising: first entering an offline stage, in which accelerator resources are characterized by systematically profiling task execution patterns offline and a time prediction model is built that maps task parameters to resource consumption; then entering an online stage, in which application task submissions are intercepted, resource demand is estimated with the offline model, adaptive task splitting and result recombination are applied to decompose large tasks, and a workload-guided scheduling process is executed in which dynamic allocation is realized through a global scheduler and per-application agents, the global scheduler allocating time slices or subtasks based on workload characteristics; after execution, the results are recombined and returned to the application.
- 2. The fine-grained lightweight DPU performance isolation method according to claim 1, wherein the offline resource characterization is implemented by representing accelerator resources as time slices and, exploiting the accelerator's FCFS scheduling, mapping a task's resource usage onto its execution time; in the offline stage, a variety of representative tasks are run, their execution times and input parameters are collected, and a regression model is built.
- 3. The fine-grained lightweight DPU performance isolation method according to claim 1, wherein the adaptive task splitting and result recombination are implemented by first evaluating, for a given accelerator task, the task scale and its relation to execution time; if the task exceeds a preset threshold, it is split into a plurality of subtasks, each subtask is submitted independently to the hardware queue for execution, and the splitting process takes the characteristics of the algorithm into consideration; meanwhile, several optimizations are adopted: a reusable object pool pre-allocates task objects and buffers to reduce the latency of dynamic memory allocation; software and hardware are overlapped in a pipelined manner, so that the input of the next subtask is prepared immediately after a subtask is submitted, achieving parallel processing; and zero-copy scatter/gather lists reference data fragments only through pointers, avoiding unnecessary memory copy operations.
- 4. The fine-grained lightweight DPU performance isolation method according to claim 1, wherein the workload-guided scheduling process comprises a global scheduler and an agent for each application; the global scheduler periodically performs a low-frequency decision every Δt ms, predicts application resource demand using an exponentially weighted moving average algorithm, and implements a compensation mechanism according to historical SLA performance; meanwhile, the global scheduler dynamically adjusts the task splitting granularity, namely splitting is increased to ensure fine-grained fairness when contention is intense, and splitting is reduced to maximize throughput when load is low.
- 5. The fine-grained lightweight DPU performance isolation method according to claim 3, wherein the step of evaluating the task scale and its relation to execution time and splitting a task into subtasks if it exceeds a preset threshold is carried out as follows: for a given accelerator task, the task scale is evaluated with the time prediction model, which maps the task parameters, namely the number of data blocks k, the number of redundant blocks m and the block size s, to a predicted execution time T(k, m, s); a dynamic granularity control mechanism determines the task splitting threshold, with the current splitting granularity denoted g and drawn from a predefined ordered set of discrete values G = {g_1, g_2, …, g_L}, where g ∈ G and L is the total number of granularity levels; the splitting granularity is set dynamically by the ASTRAEA scheduler according to the system running state and is passed to each application's agent through shared memory; for an original task with data block size s, the application agent calculates the number of subtasks as n = ⌈s / g⌉; when there are multiple competing applications in the system and s > g, the original task is split into n subtasks, and otherwise the task is submitted as-is to avoid unnecessary splitting overhead; the splitting process for an erasure-code accelerator task is based on the mathematical separability of the erasure-code algorithm: for a source data matrix D containing k data blocks, D is divided into n sub-matrices, the i-th sub-matrix D_i being defined as the sub-matrix of D spanning columns (i−1)·g+1 through min(i·g, s), so that each sub-matrix has at most g columns and its encoding or decoding operation can be carried out independently; for an encoding task, let the coding matrix be C; the encoded output of the original task is P = C·D, the output of each subtask P_i = C·D_i is computed independently, and the complete result is finally recombined by horizontal splicing: P = [P_1 | P_2 | … | P_n]; the predicted execution time of each subtask, T_i = T(k, m, s_i) with s_i the sub-block size of subtask i, is also calculated, and the predicted values are used for subsequent time quota management and scheduling decisions.
- 6. The fine-grained lightweight DPU performance isolation method according to claim 5, wherein, when splitting into a plurality of subtasks, a complete task processing pipeline is designed to minimize the overhead introduced by splitting, comprising four stages: task splitting, buffer management, queue scheduling and result recombination; stage one, task interception and splitting decision: the agent intercepts the application's calls to the accelerator API through dynamic link library preloading; when the application submits an original task, the agent first reads the current splitting granularity g set by the scheduler from shared memory and calculates the number of subtasks n = ⌈s / g⌉; at the same time, the agent calculates the expected completion time of the task, t_ddl = t_now + T_SLO, where t_now is the current timestamp and T_SLO is the service level objective configured for the application; if n > 1 the task enters the splitting path, and otherwise the original task is submitted directly to the subsequent stage; stage two, subtask creation and buffer allocation: for a task that needs to be split, the agent performs the following operations for each subtask i: source buffer construction, in which sub-source buffers are created by pointer referencing using scatter/gather list techniques, the starting address of the sub-buffer for each data block j (j = 1, …, k) being computed as base_j + (i−1)·g with data length min(g, s − (i−1)·g), and the sub-buffers are chained through a linked-list structure to form the complete subtask input, the whole process requiring no data copying; target buffer acquisition, in which an idle target buffer is obtained from a pre-allocated reusable buffer pool that is created at system initialization and contains a fixed number of pre-allocated buffer objects, each large enough to hold the output data at the maximum granularity, buffers being obtained by circular indexing to achieve fast allocation and avoid dynamic memory allocation overhead at run time; and subtask encapsulation, in which the source buffer, the target buffer, the encoding/decoding matrix and the user callback information are packed into a subtask object, together with the subtask's predicted execution time T_i, the subtask index i, and a flag indicating whether it is the last subtask; stage three, virtual queue management and pipelined submission: the packed subtasks are placed into a virtual queue maintained by the agent; the queue is designed as a lock-free ring buffer with capacity Q, head pointer h and tail pointer t, the enqueue operation being t ← (t + 1) mod Q; the agent prepares the buffers of the next subtask immediately after a subtask is enqueued; the agent maintains a dedicated commit thread that continually polls the virtual queue, and when it detects subtasks waiting to be committed, the commit thread first queries the shared memory for the current time quota B; a subtask is submitted to the hardware queue if one of the following conditions is met: condition 1, quota sufficient: u + T_i ≤ B, i.e., there is still remaining time quota in the current scheduling period; condition 2, tail-latency protection: t_now + T_i ≥ t_ddl, i.e., the task is about to time out, triggering an active tail-latency protection mechanism to avoid an SLA violation; after each successful commit, the time quota accounting and accumulated usage are updated as u ← u + T_i, where u is the application's accumulated resource usage within the scheduling period; stage four, result callback and data recombination: after the hardware accelerator finishes executing a subtask, the registered callback function is triggered and performs the following operations: data copying, in which, if the task was split, the subtask's output data are copied from the temporary target buffer to the corresponding position in the original task's target buffer, and for an encoding task whose output comprises m redundant blocks the copy target address for the j-th redundant block is computed as dst_j + (i−1)·g; buffer recycling, in which the source buffer references held by the subtask are released and the data length of the temporary target buffer is reset to 0 so that it can be reused by subsequent subtasks; and completion judgment and result return, in which, if the current subtask is the last subtask of the original task, the actual completion time t_done is checked against the expected time t_ddl, the SLA violation counter in shared memory is incremented (v ← v + 1) if t_done > t_ddl, and the original callback function registered by the application is invoked to return the complete recombined result.
- 7. The fine-grained lightweight DPU performance isolation method according to claim 4, wherein the global scheduler performs low-frequency decisions every Δt ms, predicts application resource demand using an exponentially weighted moving average algorithm, and implements a compensation mechanism based on historical SLA performance, by carrying out resource allocation decisions periodically with period Δt ms as follows: the scheduler communicates with each application's agent through shared memory, reading each application's resource usage u_i and SLA violation count v_i for the last period, and writing the time quota B_i and splitting granularity g_i for the new period; the scheduler predicts each application's resource demand with an exponentially weighted moving average: let the actual resource usage of application i in scheduling period t be u_i(t) and its allocation be B_i(t); the demand prediction for the next period, P_i(t+1), is computed as P_i(t+1) = α·u_i(t) + (1−α)·P_i(t), where α is a smoothing coefficient in the range (0, 1); based on the demand prediction, the scheduler further performs compensatory resource allocation in combination with the SLA violation history: let the SLA violation count of application i in period t be v_i(t); the sum of the predicted demands of all applications is P_total = Σ_{i=1..N} P_i(t+1) and the sum of the SLA violations is V_total = Σ_{i=1..N} v_i(t), where N is the number of currently active applications; the total allocable time per scheduling period is Δt, of which a portion is reserved as an SLA compensation quota, B_comp = β·Δt, where β is the reserved proportion, and the remainder is the regular allocation quota, B_reg = (1−β)·Δt; the time quota B_i(t+1) of application i in period t+1 is calculated according to the following rules: case one, when V_total = 0 (no SLA violation), resources are allocated entirely in proportion to predicted demand without compensation, B_i(t+1) = Δt·P_i(t+1)/P_total; case two, when V_total > 0 (SLA violations occurred), the regular quota B_reg is allocated in proportion to demand and the reserved compensation quota B_comp is distributed to the affected applications in proportion to their SLA violations, B_i(t+1) = B_reg·P_i(t+1)/P_total + B_comp·v_i(t)/V_total, realizing long-term fairness compensation; to prevent resource starvation in extreme cases, a lower bound is enforced if the calculated allocation is too small, B_i(t+1) ← max(B_i(t+1), B_min); after each scheduling period ends, the scheduler clears each application's SLA violation counter and usage accumulator in preparation for the statistics of the next period; in addition to time quota allocation, the scheduler also dynamically adjusts the task splitting granularity of each application: let the current splitting granularity of application i be g_i, whose index in the ordered granularity set G is j, i.e., g_i = G[j]; the resource utilization of application i is defined as the ratio of actual usage to allocation, ρ_i = u_i(t)/B_i(t); the scheduler adjusts the granularity of each application in each scheduling period according to the following rules: rule one, utilization-driven granularity increase: when an application's resource utilization is low, ρ_i < ρ_low, the scheduler increases the application's splitting granularity to the next larger level in the ordered set, g_i ← G[min(j+1, L)], i.e., the successor of the current granularity in G is selected, and the granularity remains unchanged if it is already the maximum; rule two, SLA-violation-driven granularity decrease: when the SLA violation count of competing applications exceeds a threshold, Σ_{i'≠i} v_{i'}(t) > v_thr, the scheduler decreases the splitting granularity of application i to the next smaller level in the ordered set, g_i ← G[max(j−1, 1)], i.e., the predecessor of the current granularity in G is selected, and the granularity remains unchanged if it is already the minimum.
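Claim 2 builds a regression model from offline profiling runs that maps task parameters to execution time. The sketch below is a minimal illustration, assuming a single-feature linear model (execution time versus total payload bytes, i.e., (k+m)·s) fitted by ordinary least squares; the function names, the sample data, and the reduction to a one-dimensional feature are illustrative assumptions, not the patent's actual model.

```c
#include <stdio.h>
#include <stddef.h>

/* One offline profiling sample: total payload bytes and measured time (us). */
typedef struct {
    double bytes;
    double exec_us;
} sample_t;

/* Ordinary least squares fit for exec_us = a * bytes + b. */
static void fit_linear(const sample_t *s, size_t n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; i++) {
        sx  += s[i].bytes;
        sy  += s[i].exec_us;
        sxx += s[i].bytes * s[i].bytes;
        sxy += s[i].bytes * s[i].exec_us;
    }
    double denom = n * sxx - sx * sx;
    *a = (n * sxy - sx * sy) / denom;
    *b = (sy - *a * sx) / n;
}

/* Predict execution time for a task with k data blocks, m redundant blocks,
 * and block size s_bytes, using the fitted model. */
static double predict_us(double a, double b, int k, int m, double s_bytes)
{
    return a * (k + m) * s_bytes + b;
}

int main(void)
{
    /* Hypothetical offline measurements (payload bytes, microseconds). */
    sample_t samples[] = {
        {   64 * 1024.0,   55.0 }, {  256 * 1024.0,  180.0 },
        { 1024 * 1024.0,  690.0 }, { 4096 * 1024.0, 2700.0 },
    };
    double a, b;
    fit_linear(samples, sizeof samples / sizeof samples[0], &a, &b);
    printf("model: t_us = %.6f * bytes + %.2f\n", a, b);
    printf("predicted time for k=8, m=2, s=128KiB: %.1f us\n",
           predict_us(a, b, 8, 2, 128 * 1024.0));
    return 0;
}
```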
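Claim 5 splits an erasure-coding task along the block-size dimension and recombines the per-subtask outputs by horizontal splicing, relying on the identity C·D = [C·D_1 | … | C·D_n]. Below is a minimal sketch over small integer matrices, assuming k data blocks as rows of D, a block size of s bytes (columns), an m×k coding matrix C, and a granularity g; the toy dimensions and data are illustrative only.

```c
#include <stdio.h>
#include <string.h>

#define K 4   /* data blocks (rows of D)        */
#define M 2   /* redundant blocks (rows of P)   */
#define S 8   /* block size in bytes (columns)  */
#define G 3   /* splitting granularity in bytes */

/* P[M][.] = C[M][K] * D[K][.], restricted to a column slice of D and P. */
static void encode_cols(const int C[M][K], const int D[K][S],
                        int P[M][S], int col0, int cols)
{
    for (int r = 0; r < M; r++)
        for (int c = col0; c < col0 + cols; c++) {
            P[r][c] = 0;
            for (int j = 0; j < K; j++)
                P[r][c] += C[r][j] * D[j][c];
        }
}

int main(void)
{
    int C[M][K] = { {1, 1, 1, 1}, {1, 2, 3, 4} };
    int D[K][S], P_whole[M][S], P_split[M][S];
    for (int i = 0; i < K; i++)
        for (int j = 0; j < S; j++)
            D[i][j] = i * S + j;               /* arbitrary test data */

    /* Number of subtasks: n = ceil(S / G); split only if S > G. */
    int n = (S + G - 1) / G;
    printf("block size %d, granularity %d -> %d subtask(s)\n", S, G, n);

    /* Reference: encode the whole task at once. */
    encode_cols(C, D, P_whole, 0, S);

    /* Split path: encode each column slice D_i independently, each writing
     * its own slice of the output (horizontal splicing of the P_i). */
    for (int i = 0; i < n; i++) {
        int col0 = i * G;
        int cols = (col0 + G <= S) ? G : S - col0;
        encode_cols(C, D, P_split, col0, cols);
    }

    printf("recombined output matches whole-task output: %s\n",
           memcmp(P_whole, P_split, sizeof P_whole) == 0 ? "yes" : "no");
    return 0;
}
```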
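In stage three of claim 6, the commit thread releases a subtask to the hardware queue only if the period's time quota still covers the subtask's predicted time, or if the deadline is close enough that tail-latency protection overrides the quota. A minimal sketch of that gate and the usage accounting follows; the struct fields, the helper name, and the numeric values are illustrative assumptions, and the shared-memory and hardware-queue plumbing is omitted.

```c
#include <stdio.h>
#include <stdbool.h>

/* Per-application accounting for the current scheduling period (in the real
 * system this would live in shared memory between agent and scheduler). */
typedef struct {
    double quota_us;     /* time quota B granted for this period        */
    double used_us;      /* accumulated usage u in this period          */
    double deadline_us;  /* expected completion time of the parent task */
} app_state_t;

/* Decide whether a subtask with predicted time t_pred may be submitted now.
 * Returns true and charges the usage if either condition holds. */
static bool try_submit(app_state_t *st, double now_us, double t_pred_us)
{
    bool quota_ok  = st->used_us + t_pred_us <= st->quota_us;   /* condition 1 */
    bool tail_prot = now_us + t_pred_us >= st->deadline_us;     /* condition 2 */
    if (!quota_ok && !tail_prot)
        return false;                 /* hold the subtask in the virtual queue */
    st->used_us += t_pred_us;         /* charge accumulated usage u            */
    /* ... hand the subtask to the hardware queue here ... */
    return true;
}

int main(void)
{
    app_state_t st = { .quota_us = 500.0, .used_us = 480.0, .deadline_us = 10000.0 };
    printf("submit (quota nearly spent, deadline far): %d\n",
           try_submit(&st, 1000.0, 100.0));   /* 0: held back             */
    printf("submit (deadline imminent):              %d\n",
           try_submit(&st, 9950.0, 100.0));   /* 1: tail-latency override */
    return 0;
}
```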
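Claim 7's scheduler combines an EWMA demand prediction with a reserved SLA-compensation quota and a two-rule granularity adjustment. The sketch below walks one scheduling decision for a few applications; the constants (α, the reserved proportion β, the granularity set, the thresholds) and the ordering chosen when both granularity rules would apply are illustrative assumptions, not values from the patent.

```c
#include <stdio.h>

#define N_APPS      3
#define PERIOD_US   1000.0          /* scheduling period / total allocable time */
#define ALPHA       0.5             /* EWMA smoothing coefficient               */
#define BETA        0.2             /* proportion reserved for SLA compensation */
#define B_MIN       50.0            /* starvation lower bound                   */
#define UTIL_LOW    0.3             /* rule 1 threshold                         */
#define VIOL_THR    0               /* rule 2 threshold                         */

static const double gran_set[] = { 64e3, 256e3, 1024e3, 4096e3 }; /* ordered g levels */
#define N_GRAN (sizeof gran_set / sizeof gran_set[0])

int main(void)
{
    double pred[N_APPS]  = { 300, 300, 300 };   /* previous predictions P_i(t)   */
    double used[N_APPS]  = { 500, 200,  50 };   /* measured usage u_i(t)         */
    double alloc[N_APPS] = { 400, 300, 300 };   /* previous allocations B_i(t)   */
    int    viol[N_APPS]  = {   2,   0,   0 };   /* SLA violations v_i(t)         */
    int    gidx[N_APPS]  = {   1,   1,   1 };   /* index into gran_set           */

    /* EWMA demand prediction: P_i(t+1) = alpha*u_i(t) + (1-alpha)*P_i(t). */
    double p_total = 0, v_total = 0;
    for (int i = 0; i < N_APPS; i++) {
        pred[i] = ALPHA * used[i] + (1 - ALPHA) * pred[i];
        p_total += pred[i];
        v_total += viol[i];
    }

    /* Quotas: proportional share, plus a reserved compensation pool when any
     * SLA violations occurred in the last period (case two of claim 7). */
    double b_comp = (v_total > 0) ? BETA * PERIOD_US : 0.0;
    double b_reg  = PERIOD_US - b_comp;
    for (int i = 0; i < N_APPS; i++) {
        double q = b_reg * pred[i] / p_total;
        if (v_total > 0)
            q += b_comp * viol[i] / v_total;
        if (q < B_MIN)
            q = B_MIN;                           /* starvation protection */

        /* Granularity rules: coarsen when utilization is low, refine when
         * competing applications report SLA violations. */
        double util = used[i] / alloc[i];
        if (util < UTIL_LOW && gidx[i] + 1 < (int)N_GRAN)
            gidx[i]++;                           /* rule 1: next larger level  */
        else if (v_total - viol[i] > VIOL_THR && gidx[i] > 0)
            gidx[i]--;                           /* rule 2: next smaller level */

        printf("app %d: predict=%.1f quota=%.1f granularity=%.0f bytes\n",
               i, pred[i], q, gran_set[gidx[i]]);
    }
    return 0;
}
```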
Description
Fine-grained lightweight DPU performance isolation method
Technical Field
The present disclosure relates to the field of computers, and more particularly, to a fine-grained lightweight DPU performance isolation method.
Background
The data processing unit (Data Processing Unit, DPU) is an advanced hardware device that evolved from the remote direct memory access (Remote Direct Memory Access, RDMA) network interface controller (RDMA Network Interface Card, RNIC) and the smart network interface card (Smart Network Interface Controller, SmartNIC). It integrates a variety of efficient hardware modules, including application-specific integrated circuit (Application-Specific Integrated Circuit, ASIC) network interface controllers (Network Interface Controller, NIC), system-on-chip (System on Chip, SoC) processor cores and memory, bypass ASIC accelerators (e.g., modules for encryption and redundancy processing), and programmable data path accelerators (Data Path Accelerator, DPA). These components enable the DPU to efficiently handle infrastructure tasks such as network virtualization, storage services, and security protocols in the data center, thereby significantly reducing the total cost of ownership (Total Cost of Ownership, TCO). Taking the NVIDIA BlueField-3 DPU as an example, its hardware architecture comprises a ConnectX-7 RNIC supporting communication between the host and the external network, 8-16 ARM cores with 16-32 GB of DRAM as SoC memory for running a standard Linux distribution, fixed-function ASIC accelerators capable of processing data at up to 100 Gbps, and a 16-core, 256-thread RISC-V processor for multi-threaded data plane applications. Furthermore, a PCIe switch ensures efficient interaction between the SoC, the host, and the NIC. In software, the DPU provides task abstractions through a unified API, such as NVIDIA DOCA, that represent the basic operations from which an application is built, such as generating redundant data for a buffer or performing an RDMA communication operation. DOCA adopts an asynchronous task submission model combined with callback functions to realize a pipelined application architecture. This design allows the DPU to bypass the host CPU without sacrificing performance, enabling high-throughput, low-latency data processing and transmission, and it has been widely deployed by hyperscale data centers to improve overall efficiency. In a cloud service environment, the deployment of DPUs provides powerful support for Platform as a Service (PaaS) and Software as a Service (SaaS). These service modes require the cloud provider to manage resources efficiently to meet tenants' performance requirements for diverse applications. The DPU significantly improves resource utilization by offloading infrastructure workloads such as virtual network functions, storage indexes, and TCP data planes. For example, in a distributed storage system, the DPU can accelerate indexing operations for disaggregated storage, and in a high-performance computing scenario, it can provide low-latency support for distributed deep learning. It is envisioned that, in the future, cloud providers may rent DPUs to tenants as PaaS or provide acceleration services in SaaS form, supporting more efficient applications. Such a multi-application coexistence mode is expected to further improve the resource utilization of the data center, but it is premised on being able to precisely control hardware resource allocation to avoid performance degradation caused by resource contention.
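The asynchronous submit-plus-callback pattern described above can be sketched generically as follows. This is not the DOCA API; all names and the toy "completion" logic are invented for the example, and a real accelerator would complete tasks asynchronously in hardware rather than inside the polling call.

```c
#include <stdio.h>
#include <stddef.h>

/* A toy asynchronous task: work is described at submit time, and the result
 * is delivered later through a completion callback (pipelined usage). */
typedef void (*completion_cb)(int task_id, int result, void *user_ctx);

typedef struct {
    int           task_id;
    int           input;
    completion_cb cb;
    void         *user_ctx;
} task_t;

#define QUEUE_DEPTH 8
static task_t queue[QUEUE_DEPTH];
static size_t pending;

/* Non-blocking submit: enqueue the task and return immediately. */
static int submit_task(int task_id, int input, completion_cb cb, void *ctx)
{
    if (pending == QUEUE_DEPTH)
        return -1;                                   /* queue full */
    queue[pending++] = (task_t){ task_id, input, cb, ctx };
    return 0;
}

/* Progress call: "completes" queued tasks and fires their callbacks. */
static void poll_completions(void)
{
    for (size_t i = 0; i < pending; i++)
        queue[i].cb(queue[i].task_id, queue[i].input * 2, queue[i].user_ctx);
    pending = 0;
}

static void on_done(int task_id, int result, void *ctx)
{
    (void)ctx;
    printf("task %d completed, result=%d\n", task_id, result);
}

int main(void)
{
    /* Pipelined use: keep submitting while earlier tasks complete. */
    for (int i = 0; i < 4; i++)
        submit_task(i, i + 10, on_done, NULL);
    poll_completions();
    return 0;
}
```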
The advantages of the DPU make it particularly suitable for cross-application collaboration in cloud environments; however, the lack of an effective isolation mechanism prevents cloud providers from guaranteeing service level agreements (Service Level Agreement, SLA) on critical metrics such as throughput and job completion time (Job Completion Time, JCT). Existing DPU performance isolation mechanisms have obvious deficiencies and cannot meet the requirements of multi-application coexistence. First, DPU vendors provide only limited mechanisms for abstracting hardware resources, typically only high-level proprietary APIs (e.g., DOCA) that encapsulate hardware functions in task form and obscure the relation between task parameters and underlying resource consumption, making it impossible to partition resources directly, for example by isolating ASIC accelerators with mechanisms similar to cgroups. Second, the accelerator adopts a coarse-grained First Come First Serve (FCFS) scheduling policy; task sizes vary from bytes to hundreds of megabytes and execution times range from microseconds to tens of milliseconds, so the large tasks of bandwidth-sensitive applications monopolize the hardware, violating the SLA of latency-sensitive applications and reducing overall throughput. For example, in a coexistence scenario, the small tasks of JCT-sensitive applications may see their average JCT increase by 308.32% due to head-of-line blocking by large tasks, with