CN-122027626-A - Cluster operation method and server cluster


Abstract

Embodiments of the application disclose a cluster operation method and a server cluster. The method comprises: monitoring load scheduling failure events issued by the server cluster; when the accumulated number of monitored load scheduling failure events reaches a first threshold, obtaining the total amount of resources required by the loads to be scheduled in the server cluster; querying a plurality of bound public clouds for candidate instances, the candidate instances being instances that meet the total amount of resources; summarizing the instance information of each candidate instance to construct a candidate instance list, and selecting a target instance from the candidate instance list; sending an instance creation request to the target public cloud from which the target instance originates; and, after detecting that the target working node has joined the server cluster, scheduling the loads to be scheduled to the target working node so that they run on the target working node. By binding a plurality of public clouds, unified multi-cloud scheduling is realized in the server cluster, and the efficiency of instance type selection is improved.

Inventors

Ci Kaiyu

Assignees

青岛聚看云科技有限公司

Dates

Publication Date: 20260512
Application Date: 20251226

Claims (10)

1. A method of cluster operation, comprising: monitoring a load scheduling failure event issued by a server cluster; when the accumulated number of monitored load scheduling failure events reaches a first threshold, acquiring the total amount of resources required by the load to be scheduled in the server cluster, wherein the load to be scheduled comprises a load whose scheduling has failed; querying a plurality of bound public clouds for candidate instances, the candidate instances being instances that meet the total amount of resources; summarizing the instance information of each candidate instance, constructing a candidate instance list, and selecting a target instance from the candidate instance list; sending an instance creation request to a target public cloud, so that the target public cloud, in response to the instance creation request, creates the target instance and adds a target working node corresponding to the target instance to the server cluster; and, after detecting that the target working node has joined the server cluster, scheduling the load to be scheduled to the target working node so as to run the load to be scheduled on the target working node.
2. The method of claim 1, wherein the instance information includes a price, an operation resource, a network delay, and an interruption risk rate of an instance, the interruption risk rate characterizing the probability that the instance is interrupted, and wherein selecting a target instance from the candidate instance list comprises: calculating the score of each candidate instance based on the price, the operation resource, the network delay and the interruption risk rate of each candidate instance in the candidate instance list, as well as a first weight coefficient, a second weight coefficient, a third weight coefficient and a fourth weight coefficient, wherein the first weight coefficient represents the price weight, the second weight coefficient represents the operation resource weight, the third weight coefficient represents the network delay weight, and the fourth weight coefficient represents the interruption risk weight; and selecting the candidate instance with the highest score as the target instance.
3. The method according to claim 2, wherein the method further comprises: acquiring the interruption risk rate of the candidate instance based on the historical operation log of the candidate instance in the server cluster, wherein the historical operation log comprises the start time, the release time and the release mode of the candidate instance, and the release mode comprises active release and passive interruption.
4. The method according to claim 1, wherein the method further comprises: when the trigger time of node reclamation is reached, acquiring working nodes to be reclaimed, namely working nodes in the server cluster whose resource utilization rate is smaller than a second threshold; migrating a load to be migrated in the working node to be reclaimed to a first working node, wherein the first working node is the working node with the most remaining resources among the migratable working nodes, the migratable working nodes are the working nodes in the server cluster that support scheduling of the load to be migrated, and the migratable working nodes do not include the working nodes to be reclaimed or working nodes marked as having completed the current scheduling period; and repeating the above steps until all loads in the working node to be reclaimed have been migrated, then reclaiming the working node to be reclaimed and releasing the instance corresponding to the working node to be reclaimed.
5. The method of claim 4, wherein migrating the load to be migrated in the working node to be reclaimed to the first working node comprises: obtaining the load interruption budget associated with the load to be migrated, wherein the load interruption budget specifies the minimum number of available loads or the maximum number of unavailable loads of the working node under voluntary interruption, and voluntary interruption refers to a load termination operation actively initiated by the server cluster; when the load to be migrated meets the rescheduling conditions and satisfies the constraint of the load interruption budget, acquiring the migratable working nodes that have affinity with the load to be migrated and satisfy the taint toleration rule, and acquiring the first working node from the migratable working nodes; and, if the first working node can bear the load to be migrated, migrating the load to be migrated to the first working node.
6. The method of claim 5, wherein migrating the load to be migrated to the first working node comprises: deleting the load to be migrated from the working node to be reclaimed; and creating a new load associated with the load to be migrated and scheduling the new load onto the first working node, so as to run the new load on the first working node.
7. The method of claim 5, wherein migrating the load to be migrated to the first working node comprises: recording a first parent controller to which the load to be migrated belongs, wherein the first parent controller is the component that creates and manages the load to be migrated; deleting the load to be migrated from the working node to be reclaimed, and intercepting the process by which the load, once rebuilt, would return to the working node to be reclaimed; creating a new load associated with the load to be migrated through a second parent controller, wherein the second parent controller is the component that creates and manages the new load; if the load information recorded by the second parent controller is consistent with the load information recorded by the first parent controller, writing the name of the first working node into the working node name field corresponding to the new load; and cancelling the interception, so that the new load is scheduled to the first working node after it is created.
8. The method of claim 4, wherein acquiring the working nodes to be reclaimed whose resource utilization rate in the server cluster is smaller than the second threshold when the trigger time of node reclamation is reached comprises: when a scheduled time point is reached, or when it is detected that any working node in the server cluster has deleted a load, determining that the trigger time of node reclamation has been reached, and acquiring the resource utilization rate of each working node in the server cluster; and determining the working nodes whose resource utilization rate is smaller than the second threshold as the working nodes to be reclaimed.
9. A server cluster, comprising: at least one server, wherein the server is a physical server or a virtual server, each server corresponds to a working node, and the working nodes are used for running loads through instances; and a decision system deployed within a control node in the server cluster and bound to a plurality of public clouds, the decision system being configured to: monitor a load scheduling failure event issued by the server cluster; when the accumulated number of monitored load scheduling failure events reaches a first threshold, acquire the total amount of resources required by the load to be scheduled in the server cluster, wherein the load to be scheduled comprises a load whose scheduling has failed; query the bound public clouds for candidate instances, wherein the candidate instances are instances that meet the total amount of resources; summarize the instance information of each candidate instance, construct a candidate instance list, and select a target instance from the candidate instance list; send an instance creation request to a target public cloud, so that the target public cloud, in response to the instance creation request, creates the target instance and adds a target working node corresponding to the target instance to the server cluster; and, after detecting that the target working node has joined the server cluster, schedule the load to be scheduled to the target working node so as to run the load to be scheduled on the target working node.
10. The server cluster of claim 9, wherein the decision system is further configured to: when the trigger time of node reclamation is reached, acquire working nodes to be reclaimed, namely working nodes in the server cluster whose resource utilization rate is smaller than a second threshold; migrate a load to be migrated in the working node to be reclaimed to a first working node, wherein the first working node is the working node with the most remaining resources among the migratable working nodes, the migratable working nodes are the working nodes in the server cluster that support scheduling of the load to be migrated, and the migratable working nodes do not include the working nodes to be reclaimed or working nodes marked as having completed the current scheduling period; and repeat the above steps until all loads in the working node to be reclaimed have been migrated, then reclaim the working node to be reclaimed and release the instance corresponding to the working node to be reclaimed.
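
The node reclamation loop recited in claims 4 and 10 can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the applicant's implementation: the `Node` class, its fields and the function `reclaim_idle_nodes` are hypothetical names, resources are reduced to a single number, and the additional checks of claim 5 (load interruption budget, affinity, taint toleration) as well as the actual release of the cloud instance are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical model of a working node (not taken from the application)."""
    name: str
    capacity: float                            # schedulable units, assumed > 0
    loads: dict = field(default_factory=dict)  # load name -> requested units
    done_this_cycle: bool = False              # marked complete for this scheduling period

    @property
    def used(self):
        return sum(self.loads.values())

    @property
    def free(self):
        return self.capacity - self.used

    @property
    def utilization(self):
        return self.used / self.capacity


def reclaim_idle_nodes(nodes, second_threshold):
    """Drain under-utilized nodes and return the names of those fully drained."""
    to_reclaim = [n for n in nodes if n.utilization < second_threshold]
    reclaim_names = {n.name for n in to_reclaim}
    # Migration targets exclude nodes being reclaimed and nodes already
    # marked as finished for the current scheduling period.
    migratable = [
        n for n in nodes
        if n.name not in reclaim_names and not n.done_this_cycle
    ]
    reclaimed = []
    for node in to_reclaim:
        for load, request in list(node.loads.items()):
            if not migratable:
                break
            # "First working node": the migratable node with the most free resources.
            first = max(migratable, key=lambda n: n.free)
            if first.free < request:
                break  # even the emptiest target cannot bear this load
            first.loads[load] = request
            del node.loads[load]
        if not node.loads:
            # All loads migrated: the node can be reclaimed and its instance released.
            reclaimed.append(node.name)
    return reclaimed
```

Picking the target again for every load, rather than once per node, mirrors the "repeating the steps" wording: each migration changes the remaining resources, so the node with the most free resources may differ from one load to the next.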

Description

Cluster operation method and server cluster

Technical Field

The present application relates to the field of cloud computing technologies, and in particular to a cluster operation method and a server cluster.

Background

A server cluster is a collection of servers (physical servers or virtual servers), each of which corresponds to a working node. A working node carries the running and resource requirements of loads through instances provided by cloud vendors. An instance is the carrier on which applications and services run and acts as a fully controllable virtual server, so that enterprises can obtain computing capacity without purchasing physical hardware and can deploy quickly and scale flexibly.

A preemptible instance is a kind of pay-as-you-go instance: a low-cost, interruptible computing resource provided by cloud vendors, which users acquire by bidding. A preemptible instance can be created successfully when the market price is lower than the bid price and inventory is sufficient. When the bid price is lower than the market price or inventory is insufficient, the instance is interrupted and reclaimed, and it is automatically released after 5 minutes. Preemptible instances are therefore suited to load tasks that have low stability requirements and can tolerate interruption, such as big data analysis and image rendering, and can save 90% of the cost relative to standard pay-as-you-go instances.

A server cluster is bound to one public cloud, so cross-cloud scheduling of preemptible instances cannot be realized, and the selection of preemptible instances relies on manual price comparison: the enterprise assigns operation and maintenance personnel to go through the console of every public cloud and query the prices of preemptible instances.
This approach has the following drawbacks: manual price comparison covers only a limited number of time points and can hardly cope with minute-level price fluctuations, so selection lags behind or the optimal price window is missed, and considerable labor cost is consumed.

Disclosure of Invention

Some embodiments of the application provide a cluster operation method and a server cluster, which realize unified multi-cloud scheduling in the server cluster by binding a plurality of public clouds, thereby improving the efficiency and quality of instance selection.

In a first aspect, some embodiments of the present application provide a cluster operation method, including: monitoring a load scheduling failure event issued by a server cluster; when the accumulated number of monitored load scheduling failure events reaches a first threshold, acquiring the total amount of resources required by the load to be scheduled in the server cluster, wherein the load to be scheduled comprises a load whose scheduling has failed; querying a plurality of bound public clouds for candidate instances, the candidate instances being instances that meet the total amount of resources; summarizing the instance information of each candidate instance, constructing a candidate instance list, and selecting a target instance from the candidate instance list; sending an instance creation request to a target public cloud, so that the target public cloud, in response to the instance creation request, creates the target instance and adds a target working node corresponding to the target instance to the server cluster; and, after detecting that the target working node has joined the server cluster, scheduling the load to be scheduled to the target working node so as to run the load to be scheduled on the target working node.
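
The decision part of the first aspect (threshold check, total demand, candidate query across clouds, target selection) can be sketched in Python as follows. This is a hedged illustration rather than the method as filed: the `Candidate` fields, the function names, the default weights and above all the scoring formula are assumptions. The application names four factors and four weight coefficients but does not give a formula here, so a weighted sum over min-max normalized values is used as a stand-in, and CPU cores stand in for the "operation resource". Cloud API calls, instance creation and node registration are not modelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical instance offer returned by one bound public cloud."""
    cloud: str
    instance_type: str
    cpu: float             # cores offered
    memory_gib: float
    price: float           # price per hour
    latency_ms: float      # network delay to the cluster
    interrupt_risk: float  # interruption probability in [0, 1]


def total_demand(pending):
    """Sum the (cpu, memory_gib) requests of all loads awaiting scheduling."""
    return (sum(c for c, _ in pending), sum(m for _, m in pending))


def _norm(value, low, high):
    """Min-max normalize to [0, 1]; a constant column maps to 0."""
    return 0.0 if high == low else (value - low) / (high - low)


def select_target(candidates, weights=(0.4, 0.2, 0.2, 0.2)):
    """Return the highest-scoring candidate (assumed weighted-sum formula)."""
    w_price, w_res, w_lat, w_risk = weights
    prices = [c.price for c in candidates]
    cpus = [c.cpu for c in candidates]
    lats = [c.latency_ms for c in candidates]

    def score(c):
        # Cheaper, larger, closer and less interruptible instances score higher.
        return (
            w_price * (1 - _norm(c.price, min(prices), max(prices)))
            + w_res * _norm(c.cpu, min(cpus), max(cpus))
            + w_lat * (1 - _norm(c.latency_ms, min(lats), max(lats)))
            + w_risk * (1 - c.interrupt_risk)
        )

    return max(candidates, key=score)


def plan_scale_up(failure_count, first_threshold, pending, offers_by_cloud):
    """Return the chosen Candidate, or None if no scale-up is needed or possible."""
    if failure_count < first_threshold:
        return None  # not enough scheduling failures accumulated yet
    need_cpu, need_mem = total_demand(pending)
    # Candidate list: offers from every bound cloud that cover the total demand.
    candidates = [
        c
        for offers in offers_by_cloud.values()
        for c in offers
        if c.cpu >= need_cpu and c.memory_gib >= need_mem
    ]
    return select_target(candidates) if candidates else None
```

In this sketch a single instance must cover the whole demand, which follows the wording "instances meeting the total amount of resources"; the caller would then send the creation request to `result.cloud` and wait for the new working node to join before scheduling the pending loads.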
The embodiment of the first aspect has the following advantages. The server cluster is bound to a plurality of public clouds in advance. When the accumulated number of monitored load scheduling failure events reaches the first threshold, this indicates that resources in the server cluster are seriously insufficient and that nodes must be expanded and instances added. The total amount of resources required by all loads to be scheduled whose scheduling has failed in the cluster is then acquired, and candidate instances capable of meeting that total are obtained from all the bound public clouds to make up the resource gap. The instance information of each candidate instance is summarized and integrated, a candidate instance list is constructed, and a target instance is selected from the candidate instance list. The target public cloud from which the target instance originates then creates the instance, and the target working node corresponding to the created target instance automatically joins the server cluster, so that the loads to be scheduled can be scheduled to that working node and the target instance can bear the running of the loads. The method breaks through the limitati