CN-121387647-B - Kubemark-based Pod fault injection method and system in Kubernetes clusters

Abstract

The invention provides a Kubemark-based Pod fault injection method and system for Kubernetes clusters. A fault collection tool gathers Pod fault information from a real cluster. A modified Kubemark creates virtual nodes in a simulation cluster with the same specification as the nodes of the real cluster, and a CRI-Proxy (Container Runtime Interface proxy) component is deployed to perform fault injection and container life cycle management. Fault scenarios are reproduced in a fault recording and fault replay mode, and abnormal Pod and container life cycles are simulated through fault injection, so that the stability and recovery capability of the system can be tested. The approach provides performance test data and optimization guidance for the real cluster, saves resources, improves test efficiency, helps reproduce problems and find potential defects, and helps avoid risks in advance.

Inventors

Sun Qingjiao
FAN KANG
YANG XIANGJUN
YI XIAOMENG

Assignees

Zhejiang Lab (之江实验室)

Dates

Publication Date: 20260508
Application Date: 20251225

Claims (9)

1. A Kubemark-based Pod fault injection method in Kubernetes clusters, the method comprising the following steps: step one, in a real Kubernetes cluster, using a fault collection tool to collect real Pod fault information and generating a fault policy configuration file; step two, creating virtual nodes in a simulation cluster using a modified Kubemark, each node being set to the same resource configuration as a real node, so as to simulate a cluster with the same specification as the real cluster; step three, deploying a CRI-Proxy component in the simulation cluster, the CRI-Proxy component being used to receive Kubemark requests and perform fault injection on specified Pods, implemented by the following substeps: (3.1) creating a remote runtime instance, starting a gRPC server, and processing client requests; (3.2) constructing simulated services of the remote container runtime, namely simulating the real runtime service and image service by implementing the CRI Runtime Service interface as Fake_run_service and the Image Service interface as Fake_image_service; and (3.3) loading the fault policy configuration file and replaying fault information at the designated fault-trigger stage according to the configuration file, wherein different types of faults and events are defined for different stages of the container and Pod life cycle, and an interface is called at a designated time to inject the specific simulated fault; that is, each fault object defines the subject, time, and fault type of the fault, multiple faults can be injected within the life cycle of one Pod, and fault execution is triggered at the designated time.
2. The Pod fault injection method according to claim 1, wherein the fault collection tool in step one collects real Pod fault information based on the Prometheus cluster monitoring system, analyzes the collected Pod fault data, extracts the specific attributes of each fault, and generates a configuration file that can be recognized when faults are injected into the simulation cluster.
3. The Pod fault injection method according to claim 2, wherein the real Pod fault information comprises CRI call interface errors and Pod state exception information, and the specific attributes comprise the Pod name, namespace, container name, fault type, fault time, and fault error prompt.
4. The Pod fault injection method according to claim 1, wherein the modification of Kubemark in step two specifically comprises: first, adding a CRI-Proxy client for interacting with CRI-Proxy; and second, rewriting all CRI implementations so that the original CRI interface calls are forwarded to CRI-Proxy and Kubemark is only responsible for receiving the response results.
5. The Pod fault injection method according to claim 1, wherein each virtual node created in step two consists of a Hollow Kubelet and a Hollow Proxy, which simulate the Kubelet and Kube Proxy of a real node respectively and reproduce the behavior of the real node by simulating API calls and status reports, without actually running a container or executing a workload.
6. The Pod fault injection method according to claim 1, wherein the life cycle of a Pod and its internal containers in step (3.3) follows a strict sequence: creation of the Pod starts the life cycle; the internal containers then enter the creation stage in turn; when a container finishes running, it is terminated according to its own exit logic; finally the Pod is deleted, marking the end of the life cycle; the state of the Pod at each stage is simulated based on this life cycle, so that the creation, running, or deletion of the Pod is controlled outside the Kubernetes logic; after the Pod is created, the container creation flow is executed and a preset creation fault is triggered at that moment, the creation of the container being formally completed after the fault-handling logic has finished; when the running period of the container ends, the container is terminated according to its state, and finally the Pod is deleted, ending the life cycle; and each fault object defines a fault occurrence time, a duration, a fault injection target, a fault injection point, fault completion, and a fault error prompt, wherein the fault injection target specifies the Pod name, the Pod namespace, and the container name, and the fault injection point specifies a particular CRI interface, that is, the step at which the fault logic is to be executed.
7. A Kubemark-based Pod fault injection system in Kubernetes clusters, the system comprising the following modules: a fault policy configuration module, used for collecting real Pod fault information in a real Kubernetes cluster with a fault collection tool and generating a fault policy configuration file; a simulation cluster construction module, used for creating virtual nodes in a simulation cluster using a modified Kubemark, each node being set to the same resource configuration as a real node, so as to simulate a cluster with the same specification as the real cluster; and a fault injection module, used for deploying a CRI-Proxy component in the simulation cluster, receiving Kubemark requests, and performing fault injection on specified Pods, implemented by the following steps: creating a remote runtime instance, starting a gRPC server, and processing client requests; constructing simulated services of the remote container runtime, specifically simulating the real runtime service and image service by implementing the CRI Runtime Service interface as Fake_run_service and the Image Service interface as Fake_image_service, a fault-handling mechanism being embedded in the implementation logic of these interfaces so that they can simulate the corresponding fault scenario according to the information passed in during the subsequent fault injection process; and defining different types of faults and events for different stages of the container and Pod life cycle and calling an interface at a designated time to inject the specific simulated fault; that is, each fault object defines the subject, time, and fault type of the fault, multiple faults can be injected within the life cycle of one Pod, and fault execution is triggered at the designated time.
8. An electronic device comprising a memory and a processor, the memory being coupled to the processor, wherein the memory is configured to store program data and the processor is configured to execute the program data to implement the Kubemark-based Pod fault injection method in Kubernetes clusters as claimed in any one of claims 1 to 6.
9. A computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the Kubemark-based Pod fault injection method and system in Kubernetes clusters as claimed in any one of claims 1 to 6.
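
Illustrative note (not part of the claims): the fault-handling mechanism embedded in the simulated runtime service, as recited in claims 1 and 7, can be pictured with the minimal Python sketch below. It assumes that a fault object is keyed by namespace, Pod name, container name, and the CRI call at which it fires. The class names `Fault`, `FaultPolicy`, `InjectedFault`, and `FakeRuntimeService`, the field names, and the simplified state strings are hypothetical and do not come from the patent; only the CRI call names `CreateContainer` and `StartContainer` are real CRI RPCs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Fault:
    """One fault object. All field names are illustrative assumptions."""
    pod: str
    namespace: str
    container: str
    cri_call: str    # CRI interface at which the fault fires, e.g. "CreateContainer"
    fault_type: str
    message: str     # error prompt returned to the caller


class FaultPolicy:
    """Holds faults loaded from a policy file and looks them up per CRI call."""

    def __init__(self, faults):
        self._faults = list(faults)

    def match(self, namespace, pod, container, cri_call) -> Optional[Fault]:
        for fault in self._faults:
            key = (fault.namespace, fault.pod, fault.container, fault.cri_call)
            if key == (namespace, pod, container, cri_call):
                return fault
        return None


class InjectedFault(Exception):
    """Raised in place of a normal response when a fault is injected."""


class FakeRuntimeService:
    """Simulated runtime service: it records state but never runs a container."""

    def __init__(self, policy):
        self._policy = policy
        self.states = {}  # (namespace, pod, container) -> last simulated state

    def _call(self, cri_call, namespace, pod, container, new_state):
        # Check the policy first; a matching fault replaces the normal result.
        fault = self._policy.match(namespace, pod, container, cri_call)
        if fault is not None:
            raise InjectedFault(f"{fault.fault_type}: {fault.message}")
        self.states[(namespace, pod, container)] = new_state
        return new_state

    def create_container(self, namespace, pod, container):
        return self._call("CreateContainer", namespace, pod, container, "CREATED")

    def start_container(self, namespace, pod, container):
        return self._call("StartContainer", namespace, pod, container, "RUNNING")
```

Because the lookup is keyed by the CRI call, several `Fault` entries for the same Pod can fire at different life cycle stages, which matches the statement that multiple faults can be injected within the life cycle of one Pod. A real CRI-Proxy would expose these methods over gRPC; that layer is omitted here.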

Description

Kubemark-based Pod fault injection method and system in Kubernetes clusters

Technical Field

The invention relates to the technical field of cloud computing and container orchestration, and in particular to a Kubemark-based Pod fault injection method and system in Kubernetes clusters.

Background

Kubernetes, an open-source container orchestration system, is widely used for the automated deployment, elastic scaling, and management of containerized applications. It provides strong resource management and scheduling capabilities and helps ensure efficient use of the underlying computing and storage resources. With the rapid development of AI technology, using distributed computing resources for large-scale model training is becoming increasingly important. Kubernetes meets the needs of large AI models in terms of heavy resource demand, deployment, and operation and maintenance, and has become a standard foundation for model training and inference environments.

Although Kubernetes provides high availability and fault tolerance, a system in a real production environment may still encounter unexpected situations, so development and operations personnel need to construct abnormal scenarios to test the stability and recovery capability of the system. Introducing faults directly into the production environment, however, carries risk. Fault injection is therefore a key technique for testing how the system and its applications behave under various fault scenarios. Conventional fault injection methods tend to operate directly on real nodes and Pods, but in a large-scale cluster it is then difficult to control the scope and conditions of fault injection precisely, and resources are wasted.
Developing a safe, efficient, and easily configured method for testing the behavior of Kubernetes components under fault scenarios, and for providing valuable performance simulation data for a real cluster environment, is therefore of significant value.

Disclosure of Invention

To address the shortcomings of the prior art, the invention provides a Kubemark-based Pod fault injection method and system. By using and modifying Kubemark, the performance testing tool published by the Kubernetes project, it simulates a large-scale Kubernetes cluster consistent with the logical specification of a real cluster, which greatly reduces resource consumption and allows the scope and conditions of fault injection to be controlled precisely. It addresses the high cost, the difficulty of reproduction and troubleshooting, and the low repeatability of existing large-scale chaos testing, and it simulates and evaluates the performance of a real Kubernetes cluster through a "fault recording, fault replay" simulation mode.

The technical scheme adopted by the invention is as follows:

In a first aspect of the invention, a Kubemark-based Pod fault injection method in Kubernetes clusters comprises the following steps:

Step one: in a real Kubernetes cluster, use a fault collection tool to collect real Pod fault information and generate a fault policy configuration file.

Step two: create virtual nodes in the simulation cluster using the modified Kubemark, with each node set to the same resource configuration as a real node, so as to simulate a cluster with the same specification as the real cluster.

Step three: deploy a CRI-Proxy component in the simulation cluster; the CRI-Proxy component receives Kubemark requests and performs fault injection on the specified Pods.
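
As an illustration of the recording half of this flow, the Python sketch below turns collected fault records into a policy document that a later replay stage could load. It is only a sketch under stated assumptions: the patent does not specify the file format, so JSON is assumed; the field names in `REQUIRED` and the functions `build_policy`, `dump_policy`, and `load_policy` are hypothetical; and fault times are assumed to be ISO-8601 UTC strings in one fixed format, so that sorting them as text orders them chronologically.

```python
import json

# Hypothetical field names for one recorded fault: Pod name, namespace,
# container name, fault type, fault time, and error prompt.
REQUIRED = ("pod", "namespace", "container", "fault_type", "fault_time", "message")


def build_policy(records):
    """Turn raw fault records into a policy document ordered by fault time."""
    faults = []
    for record in records:
        missing = [key for key in REQUIRED if key not in record]
        if missing:
            raise ValueError(f"fault record is missing fields: {missing}")
        # Keep only the attributes the replay stage needs.
        faults.append({key: record[key] for key in REQUIRED})
    # Assumes ISO-8601 UTC timestamps in one format, so text order is time order.
    faults.sort(key=lambda fault: fault["fault_time"])
    return {"faults": faults}


def dump_policy(policy):
    """Serialize the policy document; JSON is an assumed file format."""
    return json.dumps(policy, indent=2, sort_keys=True)


def load_policy(text):
    """Read the fault list back, as the replay stage would before injecting."""
    return json.loads(text)["faults"]
```

Sorting at recording time means the replay stage can walk the list in order without re-deriving the timeline. How records are obtained from the monitoring system is outside this sketch.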
Specifically, the fault collection tool in step one collects real Pod fault information based on the Prometheus cluster monitoring system, analyzes the collected Pod fault data, extracts the specific attributes of each fault, and then generates a configuration file that can be recognized when faults are injected into the simulation cluster. The real Pod fault information comprises CRI call interface errors and Pod state exception information, and the specific attributes comprise the Pod name, namespace, container name, fault type, fault time, and fault error prompt.

Kubemark is modified in step two in two ways: first, a CRI-Proxy client is added for interacting with CRI-Proxy; second, all CRI implementations are rewritten so that the original CRI interface calls are forwarded to CRI-Proxy and Kubemark is only responsible for receiving the response results.

Further, the virtual node created in step two consists of a Hollow Kubelet and a Hollow Proxy, which simulate the Kubelet and Kube Proxy of a real node respectively. The virtual node does not need to actually run a container or execute a workload, but simulates the behavior of the real node through si