CN-121979645-A - Multi-machine concurrency-oriented task scheduling and dynamic scheduling method
Abstract
The application provides a multi-machine concurrency-oriented task scheduling and dynamic scheduling method, relating to the technical field of workflow engines and intelligent scheduling. The method comprises: receiving a task to be executed; writing the task to be executed into a task waiting queue; refreshing the task running state of the task to be executed in each task node; determining whether the available capacity of a task machine is greater than 0; if so, determining whether a task node in an idle state exists in the task machine; if so, determining whether a task to be executed exists in the task waiting queue corresponding to the task machine; if so, writing the task to be executed in the task waiting queue into the task node; and determining a task output result based on the task to be executed. The method addresses the problems that multi-computer environments currently often rely on manually writing and maintaining multiple sets of submission scripts, and that when the main control program exits abnormally, tasks and their states are difficult to recover consistently, which easily leads to resource idling, repeated submissions or lost results.
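The three nested checks in the abstract (available capacity, idle task node, non-empty waiting queue) can be sketched as a single dispatch pass over one task machine. This is a minimal illustration and not the applicant's implementation: the names `TaskNode`, `TaskMachine`, `dispatch` and the `capacity` field are hypothetical, and the sketch assumes that available capacity equals a fixed per-machine limit minus the number of occupied nodes.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional


@dataclass
class TaskNode:
    """Execution unit inside a task machine; holds at most one task here."""
    task: Optional[str] = None

    @property
    def idle(self) -> bool:
        return self.task is None


@dataclass
class TaskMachine:
    """A computer with its own waiting queue and a set of task nodes."""
    name: str
    nodes: List[TaskNode]
    capacity: int  # assumption: maximum number of tasks accepted at once
    waiting: Deque[str] = field(default_factory=deque)

    def available_capacity(self) -> int:
        # Assumption: capacity limit minus the number of occupied nodes.
        occupied = sum(1 for node in self.nodes if not node.idle)
        return self.capacity - occupied


def dispatch(machine: TaskMachine) -> List[str]:
    """Move queued tasks onto idle nodes while all three checks pass."""
    started: List[str] = []
    while machine.available_capacity() > 0:           # check 1: capacity > 0
        idle_node = next((n for n in machine.nodes if n.idle), None)
        if idle_node is None:                         # check 2: idle node exists
            break
        if not machine.waiting:                       # check 3: queue not empty
            break
        idle_node.task = machine.waiting.popleft()
        started.append(idle_node.task)
    return started


machine = TaskMachine("m1", [TaskNode(), TaskNode(), TaskNode()], capacity=2)
machine.waiting.extend(["t1", "t2", "t3"])
started = dispatch(machine)
```

With a capacity of 2 and three idle nodes, the pass starts the first two queued tasks and leaves the third in the waiting queue until capacity is freed.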
Inventors
- WANG ZHENYU
- LUO XIAOSHAN
- CHEN YONGSHUO
- LV JIAN
- WANG YANCHAO
Assignees
- Jilin University (吉林大学)
Dates
- Publication Date: 2026-05-05
- Application Date: 2026-04-09
Claims (10)
- 1. A multi-machine concurrency-oriented task scheduling and dynamic scheduling method, applied to a plurality of task machines, wherein each task machine is a computer device for executing tasks to be executed, at least one task node is provided in each task machine, and each task node is an execution unit in the task machine that runs the tasks to be executed, characterized by comprising the following steps: receiving a task to be executed; writing the task to be executed into a task waiting queue, wherein each task machine is provided with a corresponding task waiting queue, and the task waiting queues are used for caching the tasks to be executed; refreshing the task running state of the task to be executed in each task node, wherein the task running states comprise a running state, a completed state and an abnormal state; determining the task running state of the task to be executed, and if the task running state is the completed state, removing the corresponding executed task from the task node; if the task running state is the abnormal state, rewriting the corresponding abnormal task into the corresponding task node for execution; determining whether the available capacity of the task machine is greater than 0; if so, determining whether a task node in an idle state exists in the task machine; if so, determining whether a task to be executed exists in the task waiting queue corresponding to the task machine; and if so, writing the task to be executed in the task waiting queue into the task node in the idle state, wherein the available capacity is characterized by the number of tasks to be executed that the task machine can currently receive; and determining a task output result based on the task to be executed, wherein the task output result comprises completion, failure and retryable exception.
- 2. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, further comprising: normalizing the parameter formats of the task to be executed; and uniformly mapping the task to be executed into a schedulable object, wherein the schedulable object stores the task information corresponding to the task to be executed and provides the interfaces required by the scheduling flow, and the task information comprises task instructions, input and output file information, shared-file dependency information and execution directory information.
- 3. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, wherein after the steps of writing the task to be executed into a task waiting queue and rewriting the corresponding abnormal task into the corresponding task node for execution, the method further comprises: recording a state file for the task waiting queue, wherein the state file records the tasks to be executed that are currently stored in the task waiting queue.
- 4. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, wherein the step of determining the task running state of the task to be executed comprises: acquiring running evidence of the task to be executed, wherein the running evidence comprises process presence information, log text signals, completion marker files and timeout marker information; and determining the task running state of the task to be executed by using the running evidence.
- 5. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 4, wherein the step of determining the task running state of the task to be executed by using the running evidence comprises: determining whether the task to be executed is completed according to the process presence information and the completion marker file; and determining whether the task running state of the task to be executed is the abnormal state according to whether an abnormal log text signal exists among the log text signals or whether the timeout marker information exists.
- 6. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, wherein the step of rewriting the corresponding abnormal task into the corresponding task node for execution comprises: acquiring the number of executions of the abnormal task whose task running state is the abnormal state; and if the number of executions of the abnormal task is smaller than a preset retry threshold, rewriting the abnormal task into the corresponding task node for execution.
- 7. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, further comprising, after the step of determining whether the available capacity of the task machine is greater than 0: if not, writing the task to be executed into the task waiting queue corresponding to the task machine, and when the task machine has available capacity and a task node in an idle state exists in the task machine, writing the task to be executed into the task node in the idle state.
- 8. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, wherein the step of writing the task to be executed in the task waiting queue into the task node in the idle state comprises: determining batch parameters of the task nodes having available capacity, wherein the batch parameters comprise the number of tasks to be executed in each batch and dependency information among those tasks; determining task information of the tasks to be executed in the task waiting queue, wherein the task information comprises a task type and an execution priority; and organizing the tasks to be executed into task units according to the batch parameters and the task information, and writing the task units into the task nodes in the idle state according to the batch parameters.
- 9. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, further comprising: generating an execution script based on the task to be executed, wherein the execution script is used for executing the task to be executed; generating an execution identifier based on the execution script; and setting the task running state of the task to be executed to the running state according to the execution identifier.
- 10. The multi-machine concurrency-oriented task scheduling and dynamic scheduling method according to claim 1, wherein the step of determining the task output result based on the task to be executed comprises: if the task to be executed is completed, determining that the task output result of the task to be executed is completion, aggregating the completed task result to the node of the main control program, and removing the corresponding executed task from the task node; if the task to be executed is not completed, acquiring the number of executions of the task to be executed; if the number of executions is smaller than a preset retry threshold, determining that the task output result of the task to be executed is a retryable exception; and if the number of executions is greater than or equal to the preset retry threshold, determining that the task output result of the task to be executed is failure.
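Claims 4, 5, 6 and 10 together read as a small decision procedure: classify the running state from the running evidence, then map an unfinished task to a retryable exception or a failure by comparing its execution count with the retry threshold. The sketch below is one possible reading under stated assumptions, not the claimed implementation. The names `Evidence`, `classify` and `resolve` are hypothetical. It assumes that "completed" means the process is gone and the completion marker file exists, that completion is checked before abnormality, and that a task with no decisive evidence stays in the running state; the claims do not fix any of these details.

```python
from dataclasses import dataclass
from enum import Enum


class RunState(Enum):
    RUNNING = "running"
    COMPLETED = "completed"
    ABNORMAL = "abnormal"


class Outcome(Enum):
    COMPLETED = "completion"
    RETRYABLE = "retryable exception"
    FAILED = "failure"


@dataclass
class Evidence:
    process_alive: bool       # process presence information
    done_marker_exists: bool  # completion marker file
    error_in_log: bool        # abnormal log text signal
    timed_out: bool           # timeout marker information


def classify(ev: Evidence) -> RunState:
    """Derive the running state from the four kinds of evidence (claims 4-5)."""
    # Assumption: completed = process no longer present AND marker file exists.
    if not ev.process_alive and ev.done_marker_exists:
        return RunState.COMPLETED
    # Abnormal if an abnormal log signal or a timeout marker is present.
    if ev.error_in_log or ev.timed_out:
        return RunState.ABNORMAL
    # Assumption: anything else is still treated as running.
    return RunState.RUNNING


def resolve(completed: bool, executions: int, retry_threshold: int) -> Outcome:
    """Map a task to its output result using the retry threshold (claim 10)."""
    if completed:
        return Outcome.COMPLETED
    if executions < retry_threshold:
        return Outcome.RETRYABLE
    return Outcome.FAILED


finished = classify(Evidence(False, True, False, False))
crashed = classify(Evidence(True, False, True, False))
second_try = resolve(completed=False, executions=1, retry_threshold=3)
exhausted = resolve(completed=False, executions=3, retry_threshold=3)
```

Under this reading, a task whose execution count has reached the threshold is reported as a failure rather than being rewritten into a task node again, which matches the "smaller than" comparison in claim 6.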
Description
Multi-machine concurrency-oriented task scheduling and dynamic scheduling method
Technical Field
The application relates to the technical field of workflow engines and intelligent scheduling, and in particular to a multi-machine concurrency-oriented task scheduling and dynamic scheduling method.
Background
With the rapid development of high-performance computing and large-scale data processing, multi-computer and cluster environments have become core infrastructure for the complex computing tasks of many research institutions and enterprises. Users often need to distribute large amounts of computing work to different execution platforms to meet diverse computing needs. These jobs may involve complex scientific simulations, extensive data analysis, machine learning model training and the like; they demand substantial computing resources and are highly dynamic. Various job scheduling schemes have been proposed to meet these demands. Some schemes rely on manually writing and maintaining multiple sets of submission scripts, customizing a job submission flow for each execution platform so that jobs can run on different platforms. Moreover, current job scheduling schemes are biased toward a task execution and monitoring model built around a single computing resource (that is, one submission setup completes script generation, submission and querying on one specific machine or cluster). Such a model struggles with real scenarios in which multiple sets of computing resources (such as multiple login nodes, multiple clusters or multiple scheduling systems) are accessed simultaneously and jobs are distributed dynamically according to the real-time available capacity of each resource.
Other schemes use integrated tools that wrap the job submission interfaces of different platforms and offer users a relatively uniform submission interface; after a user submits a job through that interface, the tool distributes the job to the corresponding platform for execution according to the job configuration. Still other schemes focus on monitoring job status: they periodically query the job status information of each execution platform and feed the results back to the user, so that the user learns of the execution status in time. However, manually writing and maintaining multiple sets of submission scripts not only increases the user's workload and the probability of errors, but also makes the scripts extremely costly to modify and maintain when an execution platform changes or job requirements are adjusted. Secondly, manually uploading and downloading files and manually monitoring job status make the whole job management process inefficient, and human oversight easily causes file transfer errors or delayed status monitoring. More importantly, once the main control program exits abnormally, for example because of a network disconnection, a crash or a restart, consistent recovery of jobs and their states is difficult to guarantee. Computing resources may then run idle and be wasted, jobs may be submitted repeatedly and increase the computing cost, and computing results may even be lost, causing irrecoverable losses to the user.
Disclosure of Invention
The application provides a multi-machine concurrency-oriented task scheduling and dynamic scheduling method, which aims to solve the technical problems that multi-computer environments currently often rely on manually writing and maintaining multiple sets of submission scripts, and that when the main control program exits abnormally, jobs and their states are difficult to recover consistently, which easily leads to resource idling, repeated submissions or lost results. The application provides a multi-machine concurrency-oriented task scheduling and dynamic scheduling method applied to a plurality of task machines, wherein each task machine is a computer device for executing tasks to be executed, at least one task node is provided in each task machine, and each task node is an execution unit in the task machine that runs the tasks to be executed. The method comprises the following steps: receiving a task to be executed; writing the task to be executed into a task waiting queue, wherein each task machine is provided with a corresponding task waiting queue, and the task waiting queues are used for caching the tasks to be executed; refreshing the task running state of the task to be executed in each task node, wherein the task running states comprise a running state, a completed state and an abnormal state; determining the task running state of the task to be executed, and if the task running state is the completed state, removing the corr