CN-122019112-A - Multi-node cooperative control task scheduling optimization and fault tolerance processing method

Abstract

The invention relates to the technical field of task scheduling optimization and fault-tolerant processing, and in particular to a multi-node cooperative control task scheduling optimization and fault tolerance processing method. The method comprises the following steps: S1, obtaining an operation node list, and extracting the resource configuration of each operation node, the historical operation tasks of the operation node, and the corresponding operation data; S2, dividing the operation data of the historical operation tasks into data segments according to fluctuation amplitude to obtain a plurality of data segments corresponding to the historical operation tasks. Through the integrated design of scheduling and fault tolerance, node fault probability and fault-tolerant resource overhead are incorporated into the scheduling decision.

Inventors

Jing Zehui
CHEN HONG
WANG BOWEN

Assignees

杭州赋逸科技有限公司

Dates

Publication Date: 2026-05-12
Application Date: 2026-04-14

Claims (9)

1. A multi-node cooperative control task scheduling optimization and fault tolerance processing method, characterized by comprising the following steps:
S1, acquiring an operation node list, and extracting the resource configuration of each operation node, the historical operation tasks of the operation node, and the corresponding operation data;
S2, dividing the operation data of the historical operation tasks into data segments according to fluctuation amplitude to obtain a plurality of data segments corresponding to the historical operation tasks, associating similar data segments with one another to form a data segment set, and generating new data segments according to the operation data and the data segment lengths in the data segment set;
S3, setting a data interception length according to the lengths of the data segments of the historical operation tasks, intercepting the latest operation data at the operation node according to the interception length, extracting the data segment set corresponding to the operation node, performing similarity analysis between the data segment set and the latest operation data, selecting the most similar data segment as the target data segment, and performing fault coefficient analysis on the target data segment in combination with the resource configuration of the operation node to obtain the fault coefficient corresponding to the operation node;
S4, setting a fault optimization threshold according to the remaining resource configuration of each operation node, comparing the fault coefficient of each operation node with the corresponding optimization threshold, and judging the operation node to be a node to be protected by fault tolerance when the fault coefficient is larger than the optimization threshold;
S5, for each node to be protected by fault tolerance, screening the remaining resource configurations of the other operation nodes according to the corresponding target data segment and selecting other operation nodes accordingly, and performing cooperative scheduling of the same operation task between the node to be protected by fault tolerance and the selected other operation nodes until the fault coefficient of the node to be protected by fault tolerance is smaller than the optimization threshold.
2. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S1, a multi-node operation system is connected, and online state detection is performed on all nodes through the multi-node system; the operation nodes in an online operation state are screened out, each operation node is assigned a unique identifier, and the unique identifiers of the operation nodes are compiled to obtain the operation node list; the computing power, storage, network bandwidth, current configured value of hardware load, and remaining available value of each operation node are counted, analyzed, and quantified to obtain the resource configuration of the operation node; and the task execution logs and data acquisition logs of each operation node are retrieved along the time dimension, the completed and uncompleted historical operation tasks are screened out and classified by task number, and the operation data of each historical operation task during execution is extracted.
3. The multi-node cooperative control task scheduling optimization and fault tolerance processing method, characterized in that, in step S2, a fluctuation amplitude threshold for the operation data is set, and a sliding window method is used to segment the time-series operation data of the historical operation tasks; when the fluctuation amplitude of the operation data within the sliding window exceeds the threshold, a segment cut is made, and this continues until the operation data of all historical operation tasks has been divided, yielding a plurality of data segments corresponding to the historical operation tasks; a data feature matching algorithm is used to calculate the operation data similarity value between any two data segments, and an association threshold is set; when the similarity value of two data segments exceeds the association threshold, the two data segments are associated, and otherwise they are not associated; the same data segment may be associated with a plurality of data segments whose similarity meets the threshold; the data segment sets are node-specific data segment sets divided independently per operation node, and the historical operation task data segments of each operation node belong only to the node-specific data segment set corresponding to that operation node.
4. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S2, at least two data segments are selected from a data segment set; feature fusion and interpolation completion are performed on the operation data of the selected data segments, and the result is spliced with reference to the lengths of the selected data segments to generate a new data segment that conforms to the characteristic pattern of the operation data; the length of the new data segment is the average of the lengths of the selected data segments; and the new data segment is tagged with corresponding identification information and added to the original data segment set, thereby expanding the data segment set.
5. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S3, the average length of all data segments of the historical operation tasks corresponding to each operation node is calculated and used as the data interception length for the latest operation data of that operation node; the latest operation data matching the data interception length is intercepted from the real-time data acquisition end of each operation node, in order from newest to oldest; and the node-specific data segment set corresponding to each operation node is extracted, the similarity value between the latest operation data and each data segment in the data segment set is calculated, and the data segment with the largest similarity value is selected as the target data segment.
6. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S3, multi-dimensional fitting analysis is performed on the operation data of the target data segment and the resource configuration of the operation node, a fault weight coefficient is set for each dimension, and an initial fault coefficient of the operation node is obtained by weighted calculation; the initial fault coefficient is corrected according to the length of the target data segment, and the corrected value is the final fault coefficient of the operation node; the larger the final fault coefficient, the higher the node fault risk; the smaller the final fault coefficient, the lower the node fault risk.
7. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S4, the remaining resource configuration of each operation node is obtained by analyzing the resource configuration corresponding to that operation node, the resource configuration consisting of a used resource configuration and a remaining resource configuration; the fault optimization threshold is set on the basis of the remaining resource configuration of each operation node, such that the more remaining resources a node has, the lower its fault optimization threshold, and the fewer remaining resources it has, the higher its fault optimization threshold; the fault coefficient of each operation node is compared with the corresponding optimization threshold; when the fault coefficient is larger than the optimization threshold, the operation node is judged to be a node to be protected by fault tolerance; and when the fault coefficient is smaller than the optimization threshold, the operation node is judged not to be a node to be protected by fault tolerance.
8. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S5, for a node to be protected by fault tolerance, a resource demand parameter of the target data segment of that node is calculated; taking the resource demand parameter as a reference, the other operation nodes in the multi-node system whose remaining resource configuration meets the resource demand parameter are screened out, and normally operating nodes whose remaining resource configuration values are higher than the reference parameter and whose node load rate is the lowest are preferentially selected, wherein one or more other operation nodes may be selected.
9. The multi-node cooperative control task scheduling optimization and fault tolerance processing method according to claim 1, wherein in step S5, all operation tasks currently executed by the node to be protected by fault tolerance are transmitted to the selected other operation nodes by a task synchronization method, and the node to be protected by fault tolerance and the selected other operation nodes execute the same operation tasks under cooperative scheduling; during cooperative scheduling, the change in the fault coefficient of the node to be protected by fault tolerance is monitored in real time, and once the fault coefficient falls below the optimization threshold, the fault-tolerant protection provided by the selected other operation nodes is stopped and independent task scheduling of each node is restored.
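
By way of non-limiting illustration only, and forming no part of the claims, the sliding-window segmentation of step S2, the target-segment matching of step S3, the resource-dependent threshold of step S4, and the helper-node screening of step S5 may be sketched in Python as follows. The claims do not fix any formula, so every concrete choice here is an assumption made for the sketch: fluctuation amplitude is taken as the maximum minus the minimum within the window, similarity as 1 / (1 + mean absolute difference), the threshold as a linear function of a remaining-resource ratio, and all function names, parameter names, and default values are hypothetical.

```python
from statistics import mean


def segment_by_fluctuation(series, window, amp_threshold):
    """Step S2 sketch: cut the series when the window's amplitude exceeds the threshold.

    Assumption: amplitude = max - min of the samples in the trailing window.
    """
    segments, start = [], 0
    for end in range(1, len(series) + 1):
        win = series[max(start, end - window):end]
        if len(win) > 1 and max(win) - min(win) > amp_threshold:
            # Close the segment just before the sample that caused the jump.
            segments.append(series[start:end - 1])
            start = end - 1
    if start < len(series):
        segments.append(series[start:])
    return segments


def similarity(a, b):
    """Assumed similarity metric: 1 / (1 + mean absolute difference) over aligned tails."""
    n = min(len(a), len(b))
    if n == 0:
        return 0.0
    diff = mean(abs(x - y) for x, y in zip(a[-n:], b[-n:]))
    return 1.0 / (1.0 + diff)


def pick_target_segment(segment_set, latest_data):
    """Step S3 sketch: intercept the newest samples and return the most similar segment."""
    # Interception length = average segment length (rounded to whole samples).
    cut = round(mean(len(s) for s in segment_set))
    recent = latest_data[-cut:]
    return max(segment_set, key=lambda s: similarity(s, recent))


def fault_threshold(remaining_ratio, low=0.3, high=0.8):
    """Step S4 sketch: more remaining resources give a lower threshold.

    Assumption: linear map from remaining_ratio in [0, 1] onto [high, low].
    """
    r = min(max(remaining_ratio, 0.0), 1.0)
    return high - (high - low) * r


def needs_protection(fault_coefficient, remaining_ratio):
    """A node is flagged when its fault coefficient exceeds its own threshold."""
    return fault_coefficient > fault_threshold(remaining_ratio)


def select_helpers(nodes, demand, count=1):
    """Step S5 sketch: keep nodes whose remaining resources exceed the demand,
    then prefer the lowest load rate. `nodes` is a list of dicts with the
    hypothetical keys 'id', 'remaining', and 'load'."""
    eligible = [n for n in nodes if n["remaining"] > demand]
    return [n["id"] for n in sorted(eligible, key=lambda n: n["load"])[:count]]


history = [1, 1, 1, 5, 5, 5, 1, 1]
segments = segment_by_fluctuation(history, window=3, amp_threshold=2)
print(segments)                                        # [[1, 1, 1], [5, 5, 5], [1, 1]]
print(pick_target_segment(segments, [0, 1, 5, 5, 5]))  # [5, 5, 5]
```

In this sketch the same fault coefficient can flag a node with ample remaining resources while leaving a resource-starved node unflagged, which mirrors the per-node threshold of step S4 as opposed to a globally uniform one. The fault coefficient itself (multi-dimensional weighted fitting with a length correction) is not modeled here and is passed in as a plain number.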

Description

Multi-node cooperative control task scheduling optimization and fault tolerance processing method

Technical Field

The invention relates to the technical field of task scheduling optimization and fault-tolerant processing, and in particular to a multi-node cooperative control task scheduling optimization and fault tolerance processing method.

Background

Task scheduling and fault-tolerant processing under multi-node cooperative control is a core supporting technology in fields such as distributed computing, the industrial Internet, and cloud-edge-end collaboration. Its role is to allocate tasks reasonably and execute them efficiently in a system composed of multiple heterogeneous nodes, while coping with unexpected conditions such as node faults, resource fluctuation, and network anomalies, so as to guarantee stable operation of the system and on-schedule completion of tasks.

In practical applications, two problems arise. First, scheduling and fault-tolerance design are separated from each other: most scheduling algorithms focus only on a single objective such as resource utilization or latency, and do not incorporate node fault probability or fault-tolerant resource overhead into scheduling decisions, while the fault-tolerance mechanism is mostly after-the-fact remediation. As a result, when a fault occurs the scheduling scheme becomes invalid, the recovery delay is excessive, and redundant resources are seriously wasted. Second, threshold settings are poorly adapted to node state: the prior art mostly uses a globally uniform fault judgment threshold and does not consider the differences in remaining resource configuration between nodes, so that fault tolerance is triggered late on nodes with ample remaining resources, or excessive fault tolerance on nodes with insufficient remaining resources exhausts their resources. A multi-node cooperative control task scheduling optimization and fault tolerance processing method is therefore provided.
Disclosure of Invention

The invention aims to provide a multi-node cooperative control task scheduling optimization and fault tolerance processing method that solves the problems described in the background above.

To achieve this aim, a multi-node cooperative control task scheduling optimization and fault tolerance processing method is provided, comprising the following steps:
S1, acquiring an operation node list, and extracting the resource configuration of each operation node, the historical operation tasks of the operation node, and the corresponding operation data;
S2, dividing the operation data of the historical operation tasks into data segments according to fluctuation amplitude to obtain a plurality of data segments corresponding to the historical operation tasks, associating similar data segments with one another to form a data segment set, and generating new data segments according to the operation data and the data segment lengths in the data segment set;
S3, setting a data interception length according to the lengths of the data segments of the historical operation tasks, intercepting the latest operation data at the operation node according to the interception length, extracting the data segment set corresponding to the operation node, performing similarity analysis between the data segment set and the latest operation data, selecting the most similar data segment as the target data segment, and performing fault coefficient analysis on the target data segment in combination with the resource configuration of the operation node to obtain the fault coefficient corresponding to the operation node;
S4, setting a fault optimization threshold according to the remaining resource configuration of each operation node, comparing the fault coefficient of each operation node with the corresponding optimization threshold, and judging the operation node to be a node to be protected by fault tolerance when the fault coefficient is larger than the optimization threshold;
S5, for each node to be protected by fault tolerance, screening the remaining resource configurations of the other operation nodes according to the corresponding target data segment and selecting other operation nodes accordingly, and performing cooperative scheduling of the same operation task between the node to be protected by fault tolerance and the selected other operation nodes until the fault coefficient of the node to be protected by fault tolerance is smaller than the optimization threshold.

As a further improvement of the technical scheme, in step S1, a multi-node operation system is connected, and online state detection is performed on all nodes through the multi-node system; the operation nodes in an online operation state are screened out, each operation node is assigned a unique identifier, and the unique identifiers of the operation nodes are compiled to obtain the operation node list; counting, analyzing, and quantifying the computing power, storage, network bandwidth, current configured value of hardware load, and remaining available value of each