CN-115168096-B - Cloud computing node control method, device and medium


Abstract

The application relates to the field of cloud computing and discloses a cloud computing node control method, device and medium. When fault nodes of a cloud computing platform are detected, the target cloud hosts corresponding to each fault node are determined. Recovery tasks, which comprise cloud node recovery tasks and cloud host recovery tasks, are prioritized to obtain a priority queue. The recovery tasks in the priority queue are executed, and whether each one is executed successfully is judged; if a recovery task is not executed successfully, the priority of the failed task is reduced and the failed task is added back to the priority queue. Thus, when hosts fail on a large scale and a recovery task fails to execute, the failed task is demoted and re-queued, so that cloud hosts are recovered in a stable and orderly manner. Executing failed tasks at reduced priority ensures stable recovery of the cloud hosts and improves the reliability and stability of the cloud platform.

Inventors

SU ZHENGWEI

Assignees

济南浪潮数据技术有限公司

Dates

Publication Date: 20260512
Application Date: 20220727

Claims (7)

1. A cloud computing node control method, comprising:
when fault nodes of a cloud computing platform are detected, determining the target cloud hosts corresponding to each fault node;
prioritizing recovery tasks to obtain a priority queue, wherein the recovery tasks comprise cloud node recovery tasks and cloud host recovery tasks;
executing each recovery task in the priority queue, and judging whether each recovery task in the priority queue is executed successfully;
if a recovery task is not executed successfully, reducing the priority of the failed task and adding the failed task to the priority queue, so that each recovery task in the priority queue is executed again;
wherein said executing each of said recovery tasks in said priority queue comprises:
judging whether the recovery task in the priority queue satisfies a fusing rule;
if the fusing rule is satisfied, judging whether the recovery task in the priority queue satisfies a current limiting rule;
if the current limiting rule is satisfied, performing recovery processing on the recovery task;
if the current limiting rule is not satisfied, setting the state of the corresponding recovery task to a waiting state and returning to the step of prioritizing the recovery tasks corresponding to the fault cloud nodes and the cloud hosts to obtain a priority queue;
if the fusing rule is not satisfied, not performing recovery processing on the corresponding recovery task;
wherein the fusing rules are formulated as follows:
the fusing rules comprise cloud host fusing rules and node fusing rules, wherein the cloud host fusing rules comprise at least a first fusing rule and a second fusing rule, and the node fusing rules comprise at least a third fusing rule, a fourth fusing rule and a fifth fusing rule;
the first fusing rule is a rule that, when the number of recovery failures of the cloud host recovery tasks of the cloud hosts exceeds a first threshold, the cloud host currently being recovered is acquired, the remaining unrecovered cloud hosts are determined according to the cloud host currently being recovered, and the remaining unrecovered cloud hosts are controlled not to perform recovery processing;
the second fusing rule is a rule that, when the number of recoveries of the cloud host recovery task of a current cloud host exceeds a second threshold within a preset time, the cloud host corresponding to that cloud host recovery task does not perform recovery processing;
the third fusing rule is a rule that the fault cloud nodes do not perform recovery processing when the number of fault cloud nodes exceeds a third threshold;
the fourth fusing rule is a rule that, when the number of recovery failures of the node recovery tasks of the fault cloud nodes exceeds a fourth threshold, the fault cloud node currently being recovered is acquired, the fault cloud nodes still to be recovered are determined according to the fault cloud node currently being recovered, and the fault cloud nodes still to be recovered are controlled not to perform recovery processing;
the fifth fusing rule is a rule that the current fault cloud node does not perform recovery processing when the number of recoveries of the node recovery task of the current fault cloud node exceeds a fifth threshold within a preset time;
wherein the current limiting rules are formulated as follows:
the current limiting rules comprise cloud host current limiting rules and node current limiting rules;
the cloud host current limiting rule is a rule that the number of cloud hosts corresponding to the current fault cloud node within a specified time is acquired and, if that number exceeds a first current limiting threshold, the cloud hosts are evacuated according to the priority of the recovery task of each cloud host corresponding to the current fault cloud node;
the node current limiting rule is a rule that, when the number of fault cloud nodes acquired within a specified time exceeds a second current limiting threshold, the fault cloud nodes are evacuated according to the priorities of their corresponding recovery tasks.
2. The cloud computing node control method of claim 1, wherein prioritizing the recovery tasks to obtain a priority queue comprises: separately prioritizing the cloud node recovery tasks and the cloud host recovery tasks to obtain a node recovery task queue and a cloud host recovery task queue.
3. The cloud computing node control method of claim 1, wherein said reducing the priority of the failed task comprises: judging whether the failed task satisfies a preset condition; and if the preset condition is satisfied, reducing the priority of the failed task.
4. The cloud computing node control method according to claim 3, wherein the preset condition includes: the priority of the failed task is greater than a preset priority, and the number of executions of the failed task is less than a count threshold.
5. A cloud computing node control apparatus, comprising:
a determining module, configured to determine the target cloud hosts corresponding to each fault node when fault nodes of a cloud computing platform are detected;
an acquisition module, configured to prioritize recovery tasks to obtain a priority queue, wherein the recovery tasks comprise cloud node recovery tasks and cloud host recovery tasks;
a judging module, configured to execute each recovery task in the priority queue and judge whether each recovery task in the priority queue is executed successfully;
a priority reducing module, configured to, if a recovery task is not executed successfully, reduce the priority of the failed task and add the failed task to the priority queue, so that each recovery task in the priority queue is executed again;
wherein said executing each of said recovery tasks in said priority queue comprises:
judging whether the recovery task in the priority queue satisfies a fusing rule;
if the fusing rule is satisfied, judging whether the recovery task in the priority queue satisfies a current limiting rule;
if the current limiting rule is satisfied, performing recovery processing on the recovery task;
if the current limiting rule is not satisfied, setting the state of the corresponding recovery task to a waiting state and returning to the step of prioritizing the recovery tasks corresponding to the fault cloud nodes and the cloud hosts to obtain a priority queue;
if the fusing rule is not satisfied, not performing recovery processing on the corresponding recovery task;
wherein the fusing rules are formulated as follows:
the fusing rules comprise cloud host fusing rules and node fusing rules, wherein the cloud host fusing rules comprise at least a first fusing rule and a second fusing rule, and the node fusing rules comprise at least a third fusing rule, a fourth fusing rule and a fifth fusing rule;
the first fusing rule is a rule that, when the number of recovery failures of the cloud host recovery tasks of the cloud hosts exceeds a first threshold, the cloud host currently being recovered is acquired, the remaining unrecovered cloud hosts are determined according to the cloud host currently being recovered, and the remaining unrecovered cloud hosts are controlled not to perform recovery processing;
the second fusing rule is a rule that, when the number of recoveries of the cloud host recovery task of a current cloud host exceeds a second threshold within a preset time, the cloud host corresponding to that cloud host recovery task does not perform recovery processing;
the third fusing rule is a rule that the fault cloud nodes do not perform recovery processing when the number of fault cloud nodes exceeds a third threshold;
the fourth fusing rule is a rule that, when the number of recovery failures of the node recovery tasks of the fault cloud nodes exceeds a fourth threshold, the fault cloud node currently being recovered is acquired, the fault cloud nodes still to be recovered are determined according to the fault cloud node currently being recovered, and the fault cloud nodes still to be recovered are controlled not to perform recovery processing;
the fifth fusing rule is a rule that the current fault cloud node does not perform recovery processing when the number of recoveries of the node recovery task of the current fault cloud node exceeds a fifth threshold within a preset time;
wherein the current limiting rules are formulated as follows:
the current limiting rules comprise cloud host current limiting rules and node current limiting rules;
the cloud host current limiting rule is a rule that the number of cloud hosts corresponding to the current fault cloud node within a specified time is acquired and, if that number exceeds a first current limiting threshold, the cloud hosts are evacuated according to the priority of the recovery task of each cloud host corresponding to the current fault cloud node;
the node current limiting rule is a rule that, when the number of fault cloud nodes acquired within a specified time exceeds a second current limiting threshold, the fault cloud nodes are evacuated according to the priorities of their corresponding recovery tasks.
6. A cloud computing node control device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the cloud computing node control method according to any of claims 1 to 4 when executing the computer program.
7. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the cloud computing node control method according to any of claims 1 to 4.
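
The second and fifth fusing rules of claim 1 both stop recovery once the number of recovery attempts for a single cloud host or node exceeds a threshold within a preset time. The following is a minimal sketch of one way such a check could be kept, assuming a sliding time window; the class name `AttemptWindow`, its methods, and the use of plain numeric timestamps are illustrative assumptions and do not come from the patent text.

```python
from collections import deque


class AttemptWindow:
    """Counts recovery attempts for one host or node inside a time window.

    Hypothetical helper: the patent only states that recovery stops when the
    count exceeds a threshold within a preset time, not how it is counted.
    """

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._stamps = deque()  # timestamps of recorded attempts, oldest first

    def record(self, now: float) -> None:
        """Register one recovery attempt at time `now` (seconds)."""
        self._stamps.append(now)

    def tripped(self, now: float) -> bool:
        """Return True if attempts inside the window exceed the threshold."""
        # Discard attempts that are older than the preset time window.
        while self._stamps and now - self._stamps[0] > self.window_seconds:
            self._stamps.popleft()
        return len(self._stamps) > self.threshold
```

With a threshold of 2 and a 60-second window, three attempts at t = 0, 10 and 20 would trip the rule at t = 20, while at t = 75 only the last attempt remains inside the window and recovery would be allowed again.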

Description

Cloud computing node control method, device and medium

Technical Field

The present application relates to the field of cloud computing, and in particular to a cloud computing node control method, device and medium.

Background

A cloud computing platform manages various physical devices through hardware virtualization technology in order to provide cloud computing services to users. To provide stable cloud services, a cloud computing platform is generally required to offer a high-availability function: when a host node in the platform fails, the cloud hosts running on the failed node can be distributed to other normal nodes so that user services are not affected.

However, the high-availability function currently provided by cloud computing platforms only migrates the cloud hosts running on a fault node to a normal node when the fault node is discovered; the recovery process itself is not effectively controlled. When a large number of fault nodes exist in the platform and a recovery task fails to execute, the cloud host fault recovery system can fall into disorder, so that recovery tasks cannot be executed normally and the cloud computing platform eventually crashes.

Therefore, how to provide a cloud computing node control method that ensures the normal operation of cloud hosts when large-scale node faults occur in a cloud computing platform is a problem to be solved by those skilled in the art.
Disclosure of Invention

The application aims to provide a cloud computing node control method, device and medium that prevent the cloud computing platform from crashing when a recovery task fails to execute during a large-scale node fault, so that the platform continues to work normally.

The application provides a cloud computing node control method, comprising:

when fault nodes of a cloud computing platform are detected, determining the target cloud hosts corresponding to each fault node;

prioritizing recovery tasks to obtain a priority queue, wherein the recovery tasks comprise cloud node recovery tasks and cloud host recovery tasks;

executing each recovery task in the priority queue, and judging whether each recovery task in the priority queue is executed successfully;

if a recovery task is not executed successfully, reducing the priority of the failed task and adding the failed task to the priority queue, so that each recovery task in the priority queue can be executed again.
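
The four steps above can be sketched as a small loop over a priority queue. This is only an illustration under stated assumptions: the names `RecoveryTask` and `run_recovery`, the convention that a larger number means higher priority, the decrement of 1 on failure, and the `max_rounds` safety cap are all hypothetical and are not specified in the source.

```python
import heapq
from dataclasses import dataclass
from itertools import count
from typing import Callable, List


@dataclass
class RecoveryTask:
    name: str
    kind: str          # "node" or "host" (assumed labels)
    priority: int      # assumption: larger value = more urgent
    attempts: int = 0


def run_recovery(tasks: List[RecoveryTask],
                 execute: Callable[[RecoveryTask], bool],
                 max_rounds: int = 100) -> List[str]:
    """Run tasks in priority order; demote and re-queue any that fail.

    Returns the names of tasks in the order they were attempted.
    """
    seq = count()  # tie-breaker so equal priorities keep insertion order
    heap = [(-t.priority, next(seq), t) for t in tasks]
    heapq.heapify(heap)  # heapq is a min-heap, hence the negated priority
    attempted: List[str] = []
    rounds = 0
    while heap and rounds < max_rounds:  # cap is an assumed safety limit
        rounds += 1
        _, _, task = heapq.heappop(heap)
        task.attempts += 1
        attempted.append(task.name)
        if execute(task):
            continue  # recovered successfully, nothing more to do
        task.priority -= 1  # reduce the priority of the failed task
        heapq.heappush(heap, (-task.priority, next(seq), task))
    return attempted


if __name__ == "__main__":
    demo = [RecoveryTask("vm-b", "host", 1),
            RecoveryTask("node-1", "node", 3),
            RecoveryTask("vm-a", "host", 2)]
    # Hypothetical executor: vm-a fails on its first attempt only.
    print(run_recovery(demo, lambda t: not (t.name == "vm-a" and t.attempts == 1)))
```

In this sketch a failed task drops behind tasks that now share its priority, so one stubborn task cannot block the rest of the queue, which is the effect the method is aiming for.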
Preferably, said executing each of said recovery tasks in said priority queue comprises: judging whether the recovery task in the priority queue satisfies a fusing (circuit-breaking) rule; if the fusing rule is satisfied, judging whether the recovery task in the priority queue satisfies a current limiting (rate-limiting) rule; if the current limiting rule is satisfied, performing recovery processing on the recovery task; if the current limiting rule is not satisfied, setting the state of the corresponding recovery task to a waiting state and returning to the step of prioritizing the recovery tasks corresponding to the fault cloud nodes and the cloud hosts to obtain a priority queue; and if the fusing rule is not satisfied, not performing recovery processing on the corresponding recovery task.

Preferably, prioritizing the recovery tasks to obtain a priority queue includes: separately prioritizing the cloud node recovery tasks and the cloud host recovery tasks to obtain a node recovery task queue and a cloud host recovery task queue.

Preferably, reducing the priority of the failed task includes: judging whether the failed task satisfies a preset condition; and if the preset condition is satisfied, reducing the priority of the failed task.

Preferably, the preset condition includes: the priority of the failed task is greater than a preset priority, and the number of executions of the failed task is less than a count threshold.
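
The gating sequence and the demotion condition described above can be written as two small decision functions. This is a sketch only: it reads "the rule is satisfied" as "the task is permitted to proceed", and the names `Decision`, `gate`, `should_demote`, `min_priority` and `max_attempts` are assumptions introduced for illustration. The text here does not say what happens to a failed task that does not satisfy the preset condition, so the sketch leaves that case to the caller.

```python
from enum import Enum


class Decision(Enum):
    RECOVER = "recover"  # perform recovery processing
    WAIT = "wait"        # set waiting state, go back to prioritization
    SKIP = "skip"        # no recovery processing for this task


def gate(passes_fusing: bool, passes_limit: bool) -> Decision:
    """Apply the fusing check first, then the current limiting check."""
    if not passes_fusing:
        return Decision.SKIP
    if not passes_limit:
        return Decision.WAIT
    return Decision.RECOVER


def should_demote(priority: int, attempts: int,
                  min_priority: int, max_attempts: int) -> bool:
    """Preset condition: priority above a floor and attempts below a cap.

    Assumes a larger number means a higher priority.
    """
    return priority > min_priority and attempts < max_attempts
```

Checking the fusing rule before the current limiting rule means a task that has been cut off is never counted against the rate limit, which keeps the two mechanisms independent.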
Preferably, the fusing rules are formulated as follows: the fusing rules comprise cloud host fusing rules and node fusing rules, wherein the cloud host fusing rules comprise at least a first fusing rule and a second fusing rule, and the node fusing rules comprise at least a third fusing rule, a fourth fusing rule and a fifth fusing rule; the first fusing rule is a rule that, when the number of recovery failures of the cloud host recovery tasks of the cloud hosts exceeds a first threshold, the cloud host currently being recovered is acquired, the remaining unrecovered cloud hosts are determined according to the cloud host currently being recovered, and the remaining unrecovered cloud hosts are controlled not to perform recovery processing; The second fu