
CN-122019099-A - Sample batch processing scheduling system based on workflow engine

CN122019099A

Abstract

The invention relates to a sample batch processing scheduling system based on a workflow engine, belonging to the technical field of bioinformatics analysis. The system comprises a tool construction module, a flow arrangement module, a dependency verification module, a flow execution module, a flow control module and a log management module. The tool construction module is used for configuring tool parameters, resource demand parameters and input/output specifications, generating command lines and constructing an operation environment; the flow arrangement module is used for defining flow content and establishing dependency relationships among tools; the dependency verification module is used for constructing a directed acyclic graph and executing multi-stage validity detection; the flow execution module is used for scheduling tool nodes according to node dependency relationships and executing flow tasks; the flow control module is used for controlling running flow tasks; and the log management module is used for creating log files named by task number and generating an independent log for each tool node. The invention supports two operation modes, a web page front end and manual JSON configuration, is easy to use and flexible, and can improve working efficiency.

Inventors

  • SHU PENG
  • WU MENGXI
  • YAN YAN
  • XIAO JINGJING

Assignees

  • 中国人民解放军陆军军医大学第二附属医院 (Second Affiliated Hospital of Army Medical University, PLA)

Dates

Publication Date
2026-05-12
Application Date
2026-02-03

Claims (8)

  1. A sample batch processing scheduling system based on a workflow engine, characterized by comprising a tool construction module, a flow arrangement module, a dependency verification module, a flow execution module, a flow control module and a log management module; the tool construction module is used for configuring tool parameters, resource demand parameters and input/output specifications, generating command lines and constructing an operation environment; the flow arrangement module is used for defining flow content, converting the tools constructed by the tool construction module into directed acyclic graph nodes and establishing dependency relationships among the tools; the dependency verification module is used for constructing a directed acyclic graph and executing multi-stage validity detection, ensuring loop-free and executable flow scheduling logic; the flow execution module schedules tool nodes according to the node dependency relationships defined in the directed acyclic graph and executes flow tasks; the flow control module is used for controlling running flow tasks, including suspending, stopping, resuming or re-running tasks; the log management module creates a log file named by the task number, appends records of state changes, key operations and timestamps throughout the process, and generates an independent log for each tool node during flow execution.
  2. The system of claim 1, wherein the tool construction module comprises a tool metadata definition unit, a resource configuration unit, an IO specification definition unit, a command line execution unit and an environment construction unit; the tool metadata definition unit is used for configuring tool parameters, wherein the tool parameters comprise the tool name, tool version, tool classification and tool description content; the resource configuration unit is used for configuring the resource demand parameters of the tool, wherein the resource demand parameters comprise quantized parameters for CPU core count, memory capacity, GPU count and maximum run duration; the IO specification definition unit is used for setting input and output specifications, defining the input file paths and expected output file paths required by the tool; the command line execution unit is used for generating an executable Shell command; the environment construction unit is used for constructing the environment in which the tool runs.
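A machine-readable tool description of the kind the tool construction module manages could look like the following minimal sketch. All field names (`resources`, `inputs`, `command`, etc.) and the example tool are illustrative assumptions, not taken from the patent:

```python
import json

# Hypothetical tool description covering claim 2's four units:
# metadata, resource configuration, IO specification, and command template.
tool_spec = {
    "name": "fastqc",                       # tool metadata
    "version": "0.12.1",
    "category": "quality-control",
    "description": "Per-sample read quality report",
    "resources": {                          # quantized resource demand parameters
        "cpus": 4, "memory_gb": 8, "gpus": 0, "max_runtime_min": 60,
    },
    "inputs": {"reads": "{sample_dir}/reads.fastq.gz"},       # IO specification
    "outputs": {"report": "{work_dir}/fastqc_report.html"},
    "command": "fastqc {reads} --outdir {work_dir}",          # command template
}

print(json.dumps(tool_spec, indent=2))
```

A description in this shape can be edited by hand (the "manual JSON configuration" mode mentioned in the abstract) or produced by a web form.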
  3. The system of claim 2, wherein the command line execution unit replaces all placeholders in the template with actual values to generate a complete executable Shell script.
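The placeholder substitution of claim 3 can be sketched as follows. The `{name}`-style placeholder syntax and the fail-fast behaviour on missing values are assumptions for illustration, not specified by the patent:

```python
import re

def render_command(template: str, params: dict) -> str:
    """Replace every {placeholder} in the template with its actual value.

    Raises KeyError on an unresolved placeholder so that an incomplete
    configuration fails before the Shell script is submitted.
    """
    def substitute(match: re.Match) -> str:
        key = match.group(1)
        if key not in params:
            raise KeyError(f"unresolved placeholder: {key}")
        return str(params[key])

    return re.sub(r"\{(\w+)\}", substitute, template)

template = "fastqc {reads} --threads {cpus} --outdir {work_dir}"
cmd = render_command(template, {"reads": "s1.fastq.gz", "cpus": 4, "work_dir": "/tmp/n1"})
print(cmd)  # fastqc s1.fastq.gz --threads 4 --outdir /tmp/n1
```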
  4. The system of claim 1, wherein the flow arrangement module comprises a flow definition unit, a tool node generation unit and a parameter dependency definition unit; the flow definition unit is used for setting the flow name, version number, flow description field and flow classification, wherein the flow name is a unique logical name identifying the flow, the version number is used for tracking flow iterations, the flow description field explains the flow's purpose, inputs and outputs, and applicable scenarios, and the flow classification identifies the application field or analysis type of the flow; the tool node generation unit instantiates the tools constructed by the tool construction module as directed acyclic graph nodes; the parameter dependency definition unit establishes dependency relationships between tools through parameter transfer rules between parent and child nodes.
  5. The system of claim 4, wherein instantiating the tools as directed acyclic graph nodes comprises identifying and parsing all tools selected by the user through the tool construction module, and instantiating each tool as a node with specific attributes according to the tool parameters, wherein each node contains all information required to execute the tool, including the command line script, environment variable settings and resource requirements; the parameter transfer rules specify that a parent node is a tool node that produces an output result upon completing execution, and a child node is a subsequent tool node whose input parameters reference the parent node's output result through predefined mapping rules.
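The parent-to-child parameter transfer rule of claims 4 and 5 can be illustrated with a toy sketch. The node/edge representation and the `{parent_output: child_input}` mapping format are assumptions chosen for brevity:

```python
# Two hypothetical nodes: 'align' produces a BAM file, 'sort' consumes it.
nodes = {
    "align": {"outputs": {"bam": "/work/align/sample.bam"}, "inputs": {}},
    "sort":  {"outputs": {}, "inputs": {}},
}
# Edge align -> sort with a mapping rule: parent output 'bam'
# feeds child input 'in_bam' (a predefined mapping rule).
edges = [("align", "sort", {"bam": "in_bam"})]

def propagate_parameters(nodes: dict, edges: list) -> None:
    """Copy each parent node's output value into the child input it maps to."""
    for parent, child, mapping in edges:
        for out_name, in_name in mapping.items():
            nodes[child]["inputs"][in_name] = nodes[parent]["outputs"][out_name]

propagate_parameters(nodes, edges)
print(nodes["sort"]["inputs"])  # {'in_bam': '/work/align/sample.bam'}
```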
  6. The system of claim 1, wherein the dependency verification module constructs a directed acyclic graph and performs multi-stage validity detection, ensuring that the flow scheduling logic is loop-free and executable, comprising: self-loop checking, namely if a node has an edge pointing to itself, an error is reported immediately; cycle detection, namely using depth-first search, maintaining a set of visited nodes and the current recursion path stack, to detect whether a closed path exists, and returning the specific cycle path if one is found; sub-flow verification, wherein the user designates a starting node and a terminating node, all intermediate nodes reachable from the starting node that can reach the terminating node are extracted to form a logically closed sub-graph, and forward and reverse reachability verification is then executed, comprising traversing all reachable nodes from the starting node along the dependency direction and marking them as a set ForwardSet, traversing all predecessor nodes from the terminating node along the reverse dependency direction and marking them as a set BackwardSet, setting the sub-graph node set as the intersection of ForwardSet and BackwardSet, and finally detecting whether any isolated node exists, an isolated node being a node that lies between the starting node and the terminating node but is not in the sub-graph node set; and verification result and execution guarantee, wherein if verification fails, structured error information is returned, and if verification passes, the valid directed acyclic graph is output as the scheduling basis for the flow execution module.
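The multi-stage checks of claim 6 map onto standard graph algorithms. The sketch below, with assumed data structures (edge lists and adjacency dicts), shows the self-loop check, DFS cycle detection with a recursion path stack, and the ForwardSet/BackwardSet intersection used for sub-flow verification:

```python
from collections import defaultdict

def check_dag(nodes, edges):
    """Stage 1: self-loop check. Stage 2: cycle detection by DFS,
    maintaining a visited set and the current recursion path stack."""
    for u, v in edges:
        if u == v:
            raise ValueError(f"self-loop at node {u}")
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    visited, path = set(), []

    def dfs(u):
        visited.add(u)
        path.append(u)
        for v in adj[u]:
            if v in path:                     # closed path found
                raise ValueError(f"cycle: {path[path.index(v):] + [v]}")
            if v not in visited:
                dfs(v)
        path.pop()

    for n in nodes:
        if n not in visited:
            dfs(n)

def subflow(adj, radj, start, end):
    """Stage 3: sub-flow node set = ForwardSet (reachable from start along
    the dependency direction) intersected with BackwardSet (predecessors of
    end along the reverse direction)."""
    def reach(graph, src):
        seen, stack = {src}, [src]
        while stack:
            for v in graph.get(stack.pop(), []):
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        return seen
    return reach(adj, start) & reach(radj, end)

check_dag(["a", "b", "c"], [("a", "b"), ("b", "c")])   # passes silently
adj = {"a": ["b", "x"], "b": ["c"]}
radj = {"b": ["a"], "c": ["b"], "x": ["a"]}
print(subflow(adj, radj, "a", "c"))  # {'a', 'b', 'c'}; 'x' cannot reach 'c'
```

A node such as `x` above, reachable from the start but unable to reach the end, is exactly the kind of isolated node the claim flags.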
  7. The system of claim 1, wherein the flow execution module schedules tool nodes according to node dependencies to execute flow tasks, comprising: loading the flow metadata and all node configurations, and skipping nodes the user has marked as skipped; creating a dedicated working directory for each node that is not skipped, and pre-generating its Shell script; checking each node's dependency satisfaction, wherein dependencies are satisfied when all input parameters of the node have obtained valid values and all parent nodes have completed successfully; before any node is executed, resolving all input parameters of the node in real time and replacing the placeholders in the pre-generated Shell script with the actual values of those input parameters; submitting the substituted Shell script according to the configured scheduler type; polling the scheduler state and reading the rc file under the node's working directory to judge the execution result, wherein the rc file contains the command exit code, a code of 0 indicating successful execution and any other value indicating failure; if a node fails, immediately triggering global failure, terminating all running nodes, and recording an error log; for successfully executed nodes, registering their output parameters in the global context; and when the flow task fails, returning the error information, rc value and context snapshot of the first failed node for user diagnosis.
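The per-node execution step of claim 7 (write the script into the node's working directory, run it, persist the exit code to an rc file) can be sketched as below. Running locally via `sh` stands in for the scheduler submission; the file names `run.sh` and `rc` are assumptions:

```python
import os
import subprocess
import tempfile

def run_node(script_text: str, work_dir: str) -> int:
    """Run a node's pre-generated Shell script inside its dedicated working
    directory and persist the exit code to an 'rc' file there; 0 means
    success, anything else triggers global failure in the execution module."""
    script = os.path.join(work_dir, "run.sh")
    with open(script, "w") as f:
        f.write(script_text)
    rc = subprocess.run(["sh", script], cwd=work_dir).returncode
    with open(os.path.join(work_dir, "rc"), "w") as f:
        f.write(str(rc))
    return rc

work = tempfile.mkdtemp()
rc = run_node("echo hello > out.txt\nexit 0\n", work)
print(rc)  # 0
```

In the patented system the poll loop would read the rc file after the scheduler reports completion, rather than relying on the local return code.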
  8. The system of claim 1, wherein the flow control module is configured to manage running flow tasks; when suspending or terminating a task, first reading the task metadata and identifying the running context of the current flow task, then, according to the scheduler type, invoking a job cancel command or sending an interrupt or forced termination signal to the operating system process, stopping computation, releasing computing resources, and updating the global state of the task; when a task is resumed, retaining completed results, clearing the pause flag, and continuing execution from the last unfinished node; when a task is rerun, creating a new task instance with a Rerun label, retaining the reference relationship to the original task, clearing the state marks of all nodes, and executing the flow task from the beginning.
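The pause/resume/rerun semantics of claim 8 amount to a small state machine over the task and its nodes. The class and attribute names below are illustrative assumptions; real suspension would also signal the scheduler or OS process:

```python
from enum import Enum

class TaskState(Enum):
    RUNNING = "running"
    PAUSED = "paused"

class FlowTask:
    """Toy state machine for the flow control module (sketch only)."""

    def __init__(self, nodes):
        self.state = TaskState.RUNNING
        self.nodes = list(nodes)
        self.done = set()          # completed results survive pause/resume
        self.label = None
        self.origin = None

    def pause(self):
        self.state = TaskState.PAUSED

    def resume(self):
        # Completed results are retained; return the nodes still to run,
        # i.e. execution continues from the first unfinished node.
        self.state = TaskState.RUNNING
        return [n for n in self.nodes if n not in self.done]

    def rerun(self):
        # A new instance tagged 'Rerun' keeps a reference to the original
        # task but clears all node state marks and starts from the beginning.
        clone = FlowTask(self.nodes)
        clone.label, clone.origin = "Rerun", self
        return clone

t = FlowTask(["qc", "align", "call"])
t.done.add("qc")
t.pause()
print(t.resume())  # ['align', 'call']
```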

Description

Sample batch processing scheduling system based on workflow engine

Technical Field

The invention belongs to the technical field of bioinformatics analysis and relates to a sample batch processing scheduling system based on a workflow engine.

Background

In the field of bioinformatics, data analysis depends heavily on complex specialized software tools and on the pipelines formed by combining them. Current technical schemes follow three main implementation paths. Command Line Interface (CLI) or scripted deployment (such as CELLRANGER parameter configuration and the SNAKEMAKE workflow engine) is flexible and extensible but requires users to master programming, creating a significant technical threshold for non-professional users such as experimental researchers. Graphical workflow systems (such as Galaxy) reduce operational difficulty through drag-and-drop operations, but preset components and fixed modes limit deep customization of flows. General task scheduling platforms (such as Apache Airflow) provide built-in DAG validity checks, but their design is not optimized for bioinformatics scenarios and they struggle with dynamic data dependencies and tool heterogeneity. Each of these paths has clear limitations: CLI and scripted deployment lack a standardized description format, leading to inefficient parameter management; graphical systems sacrifice flexibility for usability; and general scheduling platforms suffer in practice from insufficient scenario adaptation. In addition, the prior art exhibits multiple contradictions at the levels of tool configuration and flow management.
On the one hand, non-professional users find command lines or scripts difficult to handle, while professionals lack a unified machine-readable description format (e.g., JSON) for deploying tools in batches, making parameter configuration cumbersome and error-prone. On the other hand, existing flow arrangement approaches cannot balance usability and flexibility: graphical systems simplify operation through fixed templates but limit the customization of complex flows, while code-driven systems (such as SNAKEMAKE) support deep customization but require a programming background, forming a technical barrier. More importantly, most schemes do not integrate an automated DAG validity checking mechanism, so cyclic-dependency errors are often discovered only at runtime, adding significant debugging cost. Furthermore, task management functions are dispersed across different systems: there is no unified interface that provides both visual monitoring for non-professional users and centralized fine-grained control (such as task suspension and task restarting) for professionals, leaving flow reliability and management efficiency insufficient. Thus, current bioinformatics research urgently needs to break through these bottlenecks of the prior art. To address low tool configuration efficiency, a standardized description format must be established to improve deployment efficiency; a hybrid architecture combining graphical operation and code-level control must be designed to resolve the imbalance between usability and flexibility in flow arrangement; an automated DAG checking mechanism must be introduced to prevent logic errors at the source; and an integrated task management platform must be built to uniformly handle operations such as state monitoring and exception intervention.
These improvements must not only integrate graphical usability, script-level flexibility and an intelligent verification mechanism, but must also, through modular design and dynamic dependency handling, meet the requirements of large-scale tool deployment, complex flow design and multi-scenario adaptation, closing the technical gap between non-professional users and professionals.

Disclosure of Invention

In view of the above, the present invention aims to provide a sample batch processing scheduling system based on a workflow engine which, by integrating graphical usability, script-level flexibility and an intelligent verification mechanism, and through modular design and dynamic dependency handling, meets the requirements of large-scale tool deployment, complex flow design and multi-scenario adaptation, and fills the technical gap between non-professional users and professionals. To achieve the above purpose, the present invention provides the following technical solution: a workflow-engine-based sample batch scheduling system, comprising: a tool construction module used for configuring tool description parameters, resource demand parameters and input/output specifications, generating command lines and constructing an operation environment