CN-119645606-B - Workflow fault-tolerant scheduling method based on a reliability-driven end-edge-cloud cooperative system
Abstract
The invention discloses a workflow fault-tolerant scheduling method based on a reliability-driven end-edge-cloud cooperative system, which comprises the steps of: S1, reading the relevant information of the workflow and the information of the server resources; S2, scheduling and allocating resources for each task in the workflow using a fault-tolerant scheduling method to obtain a scheduling scheme; and S3, each server creating the corresponding numbers of data storage copies, data transmission copies and task execution copies according to the scheduling scheme, and executing the workflow according to the scheduling scheme. The technical problem solved by the invention is that, in an end-edge-cloud collaborative computing environment, the workflow can be executed with high reliability through a fault-tolerance mode that creates data storage copies, data transmission copies and task execution copies. An efficient scheduling scheme for the workflow is obtained under the constraints of the dependency relationships and deadlines among the tasks of the workflow application, the computing resources of the edge servers, and the total system cost, so that the reliability of the end-edge-cloud collaborative system is effectively improved.
Inventors
- MA LINHUA
- ZHANG YI
- SUN JIN
Assignees
- Nanjing University of Science and Technology (南京理工大学)
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-12-10
Claims (8)
- 1. A workflow fault-tolerant scheduling method based on a reliability-driven end-edge-cloud cooperative system, characterized by comprising the following steps:
Step S1: read the relevant information of the workflow and the information of the server resources, wherein the relevant information of the workflow includes, but is not limited to, the data volume of each task in the workflow and the average number of computation cycles required per MB of data, the partial-order relation among the tasks in the workflow, the deadline constraint, and the reliability requirements during data storage, data transmission and task execution.
Step S2: schedule and allocate resources for each task in the workflow using a fault-tolerant scheduling method to obtain a scheduling scheme; the scheduling scheme comprises the deployment decisions for the data storage copies, data transmission copies and task execution copies, together with the minimum total system cost. The fault-tolerant scheduling method comprises a task selection method and a copy creation method, and proceeds as follows:
Step S21: the task selection method selects the first task in the pool of workflow tasks to be completed as the current task, and parses the current workflow task information.
Step S22: an executable-task judgment method judges whether the current task can be executed.
Step S23: initialize the current task and the virtual machine pool.
Step S24: calculate the data transmission time, transmission cost and transmission reliability required for the current task to deploy a data transmission copy at each server. Judge whether the current task is a starting task; if so, either offload its input data to an edge server, or offload the input data generated on the Internet-of-Things device to an edge server first and then transmit it onward to the cloud data center; otherwise, transmit the output data of each immediate predecessor task to the servers required by the data transmission copies of the current task. In step S24:
The input data of a task is of two types: the input data of a starting task of the workflow is the data generated by the Internet-of-Things device, while a non-starting task takes the output data of its immediate predecessor tasks as its input data.
① If a data storage copy of the starting task is deployed at an edge base station, the data transmission only needs to offload the task to that edge base station. The data offloading rate r for offloading data from the mobile device to the edge base station is calculated as
r = B log₂(1 + P g₀ (d₀/d)^θ / (N₀ B)),
where B is the channel bandwidth, g₀ is the path-loss constant, d₀ is the reference distance, d is the actual distance from the user terminal device to the edge server, θ is the path-loss exponent, P is the transmission power of the terminal device, and N₀ is the noise power spectral density between the terminal equipment and the edge server. The data transmission time for offloading the data storage copy from the mobile device to the edge base station is t = D/r, where D denotes the data volume of the current task; the corresponding data transmission overhead is the product of the transmission unit price and the data volume, and the data transmission reliability equals the probability of successful transmission of the link.
② If the data storage copy of the starting task is deployed in the cloud data center, the task must be offloaded to an edge base station first and then transmitted from the edge base station to the cloud data center. In this case the transmission time is the sum of the offloading time and the edge-to-cloud transmission time, determined by the bandwidth, transmission unit price and probability of successful transmission of the link from the edge base station to the cloud data center; the transmission overhead is the sum of the overheads of the two hops, and the transmission reliability is the product of their success probabilities.
③ For a non-starting task, the input data are stored as the output data of all of its immediate predecessor tasks. The transmission of the output data of a single predecessor task depends on the data-storage-copy decisions of the current task; if any one data copy transmits the data successfully, the transmission of that data is reliable. The server-to-server transmission time, transmission overhead and transmission reliability of a data transmission copy are determined by the data transmission bandwidth, the transmission unit price and the probability of successful transmission of the corresponding link. The transmission reliability of a data item with n deployed data storage copies is R = 1 − Π_{j=1}^{n}(1 − q_j), where q_j is the transmission success probability of the j-th copy. The total transmission reliability of deploying the data copies of a non-starting task is the product of the transmission reliabilities over all of its immediate predecessor tasks, and the total transmission overhead is the sum of the overheads of all deployed transmission copies.
Step S25: calculate the preparation time for the current task to deploy task copies at each server.
Step S26: calculate the storage cost and the storage reliability of the data storage copies deployed by the current task at each server.
Step S27: calculate the copy deployment budget of the current task.
Step S28: judge whether the number of task copies deployed for the current task exceeds the maximum copy number; if not, traverse the virtual machine pool, select the VM with the highest reliability that satisfies the task budget constraint and the deadline constraint to deploy a task execution copy, update the virtual machine pool, the task budget and the task scheduling reliability, and then return to step S28.
Step S29: the scheduling of the current task is complete; remove the current task from the pool of tasks to be completed.
Step S210: judge whether the pool of tasks to be completed is empty; if so, end the task scheduling and go to step S211; otherwise go to step S21.
Step S211: output the scheduling scheme.
Step S3: each server creates the corresponding numbers of data storage copies, data transmission copies and task execution copies according to the scheduling scheme, and executes the workflow according to the scheduling scheme.
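The control flow of claim 1 (steps S21-S211) and the step-S24 offloading-rate calculation can be sketched as follows. This is an illustrative reconstruction, not the patented implementation: the `Task` class, the `vms` reliability map, and the omission of the budget and deadline checks are simplifying assumptions, and the rate formula assumes the standard distance-based path-loss model that the claim's listed terms suggest.

```python
import math
from collections import deque
from dataclasses import dataclass, field

def offload_rate(B, P, g0, d0, d, theta, N0):
    """Uplink rate for offloading to an edge base station (step S24).
    Assumes a Shannon-type rate with distance-based path loss
    g = g0 * (d0 / d) ** theta and noise power N0 * B."""
    g = g0 * (d0 / d) ** theta
    return B * math.log2(1 + P * g / (N0 * B))

@dataclass
class Task:
    name: str
    predecessors: list = field(default_factory=list)
    done: bool = False

def fault_tolerant_schedule(tasks, vms, max_copies):
    """Skeleton of steps S21-S211: take the head task (S21), requeue it at
    the tail if a predecessor is unfinished (S22), otherwise deploy up to
    max_copies execution copies on the most reliable VMs (S28).
    The budget and deadline constraints of the claim are elided here."""
    pending = deque(tasks)
    ranked = sorted(vms, key=vms.get, reverse=True)  # most reliable first
    schedule = {}
    while pending:                                   # S210: until pool is empty
        task = pending.popleft()                     # S21: head of the pool
        if not all(p.done for p in task.predecessors):
            pending.append(task)                     # S22: retry after predecessors
            continue
        schedule[task.name] = ranked[:max_copies]    # S28: copy deployment
        task.done = True                             # S29: task scheduled
    return schedule                                  # S211: output the scheme
```

For a two-task chain queued in reverse order, the dependent task is requeued once and the predecessor is scheduled first, matching the S22 tail-requeue rule.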
- 2. The method of claim 1, wherein the information parsed in step S21 includes, but is not limited to, the current workflow task information and the information of all virtual machines in the virtual machine pool, and the workflow task information includes, but is not limited to, the status information of all immediate predecessor tasks of the current task, the data volume of the task, and the number of CPU cycles required per MB of data of the task.
- 3. The method according to claim 1, wherein in step S22, whether the current task can be executed is determined by traversing all of its immediate predecessor tasks and checking whether they have completed; if they have all completed, the current task can be executed and the method proceeds to step S23, and if the current task has an incomplete predecessor task, the current task is moved to the tail of the pool of tasks to be completed and the method returns to step S21.
- 4. The method of claim 1, wherein in step S23 the reliability and the cost of the current task are both initialized to 0, and the virtual machine pool is initialized to the virtual machines of all servers on which a task execution copy can be deployed.
- 5. The method according to claim 1, characterized in that in step S25: ① The preparation time of a starting task of the workflow for deploying a data storage copy at a server is the time for transmitting its input data to that server: it equals the offloading time when the data copy is offloaded to and deployed at an edge base station, and equals the offloading time plus the edge-to-cloud transmission time when the data copy is offloaded and then transmitted onward to the cloud data center for deployment. ② The preparation time of a non-starting task for deploying a data storage copy at a server is obtained from the completion times of the copies of its immediate predecessor tasks and the transmission times of the corresponding transmission copies.
- 6. The method according to claim 5, wherein in step S26: ① The storage overhead is the product of the storage unit price and the data volume. ② The reliability of a data storage copy deployed at an edge server is obtained from the failure rate of that edge server. ③ The data storage reliability of a copy deployed in the cloud data center is obtained analogously from the failure rate of the cloud data center.
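A minimal sketch of the claim-6 quantities. The exponential fault model and the `duration` parameter are assumptions (chosen to match the Poisson transient-fault model that claim 8 states for execution), since the claim's own formulas are not reproduced in this text.

```python
import math

def storage_cost(unit_price, data_mb):
    # Claim 6 (1): storage overhead = storage unit price x data volume.
    return unit_price * data_mb

def storage_reliability(failure_rate, duration):
    # Assumed exponential fault model: the probability that a data copy
    # survives `duration` time units on a server with the given failure rate.
    return math.exp(-failure_rate * duration)
```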
- 7. The method according to claim 1, characterized in that in step S27, the task scheduling budget of the current task is obtained from the minimum offloading data budget, the governable budget, the actual residual budget of the workflow, and the total data volume of the tasks that have not yet been assigned.
- 8. The method according to claim 1, characterized in that in step S28: ① The execution time of a task execution copy on the k-th virtual machine of a server is the number of CPU cycles required by the task divided by the computing frequency of that virtual machine, and the execution overhead is obtained from the execution time and the task execution unit price of that virtual machine. ② Task execution reliability is defined as the probability of successful task execution; the occurrence of transient faults during execution follows a Poisson distribution, so the reliability of deploying a task execution copy on a virtual machine is R = e^(−λt), where λ is the failure rate of the virtual machine and t is the execution time. ③ The reliability of a task with n scheduling copies is R = 1 − Π_{k=1}^{n}(1 − r_k), where r_k, the probability that the k-th task copy is successfully scheduled, is the product of the data transmission reliability, the data storage reliability and the execution reliability of that copy.
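The claim-8 relations admit a compact numerical sketch: execution time from CPU cycles and VM frequency, Poisson-model execution reliability exp(−λt), and the replication rule that a task fails only if every copy fails. The price-times-time cost model is an assumption where the claim text is ambiguous.

```python
import math

def execution_metrics(cycles, vm_freq, unit_price):
    """Claim 8 (1): execution time = required CPU cycles / VM computing
    frequency; overhead = execution unit price x execution time (assumed)."""
    t = cycles / vm_freq
    return t, unit_price * t

def execution_reliability(failure_rate, exec_time):
    # Claim 8 (2): transient faults follow a Poisson process, so the
    # probability of fault-free execution is exp(-lambda * t).
    return math.exp(-failure_rate * exec_time)

def replicated_reliability(copy_reliabilities):
    # Claim 8 (3): with independent copies, the task succeeds unless
    # every copy fails: R = 1 - prod(1 - r_k).
    fail_all = 1.0
    for r in copy_reliabilities:
        fail_all *= 1.0 - r
    return 1.0 - fail_all
```

For example, two independent copies each with reliability 0.9 yield a task reliability of 0.99, which is why active replication raises workflow reliability at the price of extra cost.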
Description
Workflow fault-tolerant scheduling method based on a reliability-driven end-edge-cloud cooperative system
Technical Field
The invention relates to the technical field of data transmission and data backup in end-edge-cloud cooperative computing architectures, and in particular to a workflow fault-tolerant scheduling method in an end-edge-cloud cooperative system.
Background
With the advent of the age of worldwide interconnection, the number and variety of mobile terminal access devices have grown exponentially, and large amounts of data are collected on terminal devices that already carry heavy computational loads. However, due to physical size limitations, the computing resources and battery capacity of mobile terminals are limited; as delay increases, energy consumption rises, and the local computing mode struggles to meet the low-latency requirements of applications. To overcome the shortcomings of traditional local processing, new computing paradigms have been proposed, giving rise to mobile cloud computing and mobile edge computing. In a Mobile Cloud Computing (MCC) environment, the computing tasks of the terminal device are offloaded to cloud servers with powerful computing capability and processed centrally. Mobile Edge Computing (MEC) sinks computing that was originally concentrated in the cloud down to network edge devices, providing computing services closer to the data source. Compared with edge computing, cloud computing has abundant computing resources and generally smaller computation delay, but suffers from long transmission distances to the terminal devices; compared with cloud servers, edge servers offer low transmission delay, but edge computing is limited by the scarce resources at the network edge. For offloading problems under certain complex conditions, relying on a single computing architecture is infeasible, in which case the end-edge-cloud cooperative computing architecture can be an effective solution.
By combining the rich computing resources of cloud servers with the low latency of edge servers, a three-tier architecture can provide higher computing and transmission performance than cloud computing or edge computing alone. However, during workflow scheduling, the transmission links, edge servers and virtual machines inevitably suffer various failures, so reliability is also an important quality-of-service (QoS) indicator. In an end-edge-cloud cooperative computing environment, introducing fault-tolerance techniques based on active replication into workflow scheduling can effectively enhance the reliability of the workflow while satisfying the workflow's delay constraint, the resource constraints of the edge servers and the total system cost constraint. In existing end-edge-cloud cooperative scheduling work, the optimization targets are mostly delay and energy consumption, and the reliability requirements of users in an end-edge-cloud cooperative system have so far not been considered. For example, the document "Collaborative Cloud-Edge-End Task Offloading in Mobile-Edge Computing Networks With Limited Communication Capability" discloses a task offloading method in a cloud-edge-end collaborative computing environment, which aims to minimize the total delay of all mobile devices by jointly determining the computation offloading policy, computing resources, delivery rate and transmit power allocation, under the deadline and energy consumption constraints of the Mobile Devices (MD).
The document "AI-DRIVEN ENERGY-EFFICIENT CONTENT TASK OFFLOADING IN CLOUD-Edge-End Cooperation Networks" discloses a task offloading scheme in a Deep Reinforcement Learning (DRL)-based end-edge-cloud collaborative network environment, which makes collaborative caching and task offloading decisions in each time slot according to the content request information of the previous time slot and the current network state, thereby maximally reducing the total energy consumption of the system. The document "Collaborative cloud-edge-end task offloading with task dependency based on deep reinforcement learning" discloses a workflow offloading method in an end-edge-cloud collaborative scenario, which minimizes the average delay and average energy consumption of all Internet-of-Things devices under the constraints of the limited computing resources of the Internet-of-Things devices and a multi-core edge server and of the task dependencies. None of the above methods takes workflow fault-tolerant scheduling under the end-edge-cloud cooperative architecture into account. Existing task scheduling schemes in the end-edge-cloud cooperative computing environment do not fully consider the user's reliability requirements for workflow execution, in particular the faults that may occur during data storage, data transmission and task execution. The workflow needs to be executed successfully to improve the reliability of the system under the constraints