CN-122027447-A - Fault processing method and device for micro service cluster, electronic equipment and medium

Abstract

Embodiments of the invention provide a fault handling method and apparatus for a microservice cluster, an electronic device, and a medium, belonging to the technical field of microservices. The method comprises: obtaining a fault scenario formed by combining a plurality of fault actions targeting a plurality of microservices; determining a fault repair scheme corresponding to the fault scenario, the fault repair scheme being obtained by performing a simulation test on the fault scenario in a sandbox environment; when a fault occurs in the microservice cluster, judging whether the fault matches the fault scenario; and, if the fault matches, automatically executing the fault repair scheme corresponding to the fault scenario. Simulation testing of complex fault scenarios in a sandbox environment improves the system's fault tolerance when faults occur concurrently across multiple dimensions. Automatically triggering execution of the fault repair scheme reduces labor cost and shortens fault response time, further improving the safety and reliability of the system.

Inventors

  • WANG GE

Assignees

  • 中国电信股份有限公司
  • 中电信数政科技有限公司

Dates

Publication Date
2026-05-12
Application Date
2026-01-28

Claims (12)

  1. A fault handling method for a microservice cluster, the method comprising: obtaining a fault scenario, wherein the fault scenario is obtained by combining a plurality of fault actions targeting a plurality of microservices; determining a fault repair scheme corresponding to the fault scenario, wherein the fault repair scheme is obtained by performing a simulation test on the fault scenario in a sandbox environment; when a fault occurs in the microservice cluster, judging whether the fault matches the fault scenario; and if the fault matches the fault scenario, automatically executing the fault repair scheme corresponding to the fault scenario.
  2. The fault handling method for a microservice cluster according to claim 1, wherein obtaining the fault scenario comprises: acquiring the dependency relationships among the microservices in the microservice cluster; and determining the fault scenario according to the dependency relationships among the microservices.
  3. The fault handling method for a microservice cluster according to claim 2, wherein determining the fault scenario according to the dependency relationships among the microservices comprises: constructing a microservice topology graph according to the dependency relationships among the microservices; determining, for each microservice in the microservice topology graph, one or more fault actions from a predefined fault action library; determining, according to the microservice topology graph, a group of microservices having dependency relationships as a microservice group; and combining the fault actions corresponding to the microservice group to obtain the fault scenario.
  4. The fault handling method for a microservice cluster according to claim 3, wherein determining the fault scenario according to the dependency relationships among the microservices further comprises: evaluating the fault scenario to obtain an evaluation result; and screening the fault scenario according to the evaluation result to determine a final fault scenario.
  5. The fault handling method for a microservice cluster according to claim 1, wherein determining the fault repair scheme corresponding to the fault scenario comprises: constructing a sandbox environment, wherein the sandbox environment is used for performing the simulation test on the fault scenario; and injecting the fault scenario into the sandbox environment to perform the simulation test, and determining the fault repair scheme corresponding to the fault scenario.
  6. The fault handling method for a microservice cluster according to claim 5, wherein injecting the fault scenario into the sandbox environment to perform the simulation test and determining the fault repair scheme corresponding to the fault scenario comprises: after the fault scenario is injected into the sandbox environment, collecting operation data of the sandbox environment; determining a fault cause according to the operation data; determining a candidate fault repair scheme according to the fault cause; executing the candidate fault repair scheme in the sandbox environment; and if the sandbox environment shows no abnormality after the candidate fault repair scheme is executed, determining the candidate fault repair scheme as the final fault repair scheme.
  7. The fault handling method for a microservice cluster according to claim 5, wherein constructing the sandbox environment comprises: constructing the sandbox environment according to a preset sandbox resource template, wherein the dependency relationships of the microservices in the sandbox environment are consistent with the dependency relationships of the microservices in the microservice cluster; and copying the service traffic of the microservice cluster and injecting it into the sandbox environment so that the simulation test can be performed in the sandbox environment.
  8. The fault handling method for a microservice cluster according to claim 3, wherein the fault scenario and the corresponding fault repair scheme are recorded in a preset fault handling library, and the method further comprises: monitoring operation data of the microservice cluster; performing fault prediction according to the microservice topology graph and the operation data of the microservice cluster to obtain a fault prediction result; judging whether a fault scenario matching the fault prediction result exists in the fault handling library; and if a fault scenario matching the fault prediction result exists in the fault handling library, directly executing the corresponding fault repair scheme.
  9. The fault handling method for a microservice cluster according to claim 8, wherein the method further comprises: if no fault scenario matching the fault prediction result exists in the fault handling library, generating a fault scenario for the fault prediction result according to the fault prediction result; and injecting the fault scenario for the fault prediction result into the sandbox environment to perform the simulation test, so as to obtain a fault repair scheme for the fault prediction result.
  10. A fault handling apparatus for a microservice cluster, the apparatus comprising: a fault scenario acquisition module, configured to obtain a fault scenario, wherein the fault scenario is obtained by combining a plurality of fault actions targeting a plurality of microservices; a fault repair scheme determining module, configured to determine a fault repair scheme corresponding to the fault scenario, wherein the fault repair scheme is obtained by performing a simulation test on the fault scenario in a sandbox environment; a fault matching judging module, configured to judge, when a fault occurs in the microservice cluster, whether the fault matches the fault scenario; and a fault repair scheme execution module, configured to automatically execute the fault repair scheme corresponding to the fault scenario if the fault matches the fault scenario.
  11. An electronic device comprising a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the fault handling method for a microservice cluster according to any one of claims 1 to 9.
  12. A readable storage medium, wherein the readable storage medium stores a program or instructions which, when executed by a processor, implement the steps of the fault handling method for a microservice cluster according to any one of claims 1 to 9.
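
As an illustration only, the sandbox verification of claim 6 and the match-and-execute steps of claim 1 might be sketched in Python as follows. Everything named here is hypothetical and not taken from the patent: the `Sandbox` class is a toy stand-in that merely tracks active faults, the two repair functions are invented, and exact set equality is assumed as the matching rule because the claims do not specify one. The root cause analysis step of claim 6 is omitted; the candidate repairs are simply supplied by the caller.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Optional, Tuple

FaultAction = Tuple[str, str]        # (service, action), e.g. ("order", "pod_kill")
Scenario = FrozenSet[FaultAction]


class Sandbox:
    """Toy stand-in for a sandbox (assumption): tracks faults that are still active."""

    def __init__(self) -> None:
        self.active: set = set()

    def inject(self, scenario: Iterable[FaultAction]) -> None:
        self.active = set(scenario)

    def is_healthy(self) -> bool:
        return not self.active


Repair = Callable[[Sandbox], None]


def restart_service(env: Sandbox) -> None:
    # Hypothetical repair: only clears "pod_kill" faults.
    env.active = {f for f in env.active if f[1] != "pod_kill"}


def restart_and_reroute(env: Sandbox) -> None:
    # Hypothetical repair: clears every active fault.
    env.active = set()


def validate_repair(scenario: Scenario, candidates: List[Repair]) -> Optional[Repair]:
    """Return the first candidate that leaves a fresh sandbox with no abnormality."""
    for repair in candidates:
        sandbox = Sandbox()
        sandbox.inject(scenario)      # inject the fault scenario
        repair(sandbox)               # execute the candidate repair scheme
        if sandbox.is_healthy():      # no abnormality: accept as final scheme
            return repair
    return None


def handle_fault(observed: Iterable[FaultAction],
                 library: Dict[Scenario, Repair],
                 cluster: Sandbox) -> bool:
    """Execute the stored repair if the observed fault matches a known scenario."""
    repair = library.get(frozenset(observed))  # assumed rule: exact match
    if repair is None:
        return False
    repair(cluster)
    return True
```

In this sketch, a scenario combining `("order", "pod_kill")` and `("gateway", "network_delay")` would reject `restart_service`, which leaves the network delay active, and accept `restart_and_reroute`; the accepted repair is then stored in a dictionary that plays the role of the fault handling library.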

Description

Fault processing method and device for micro service cluster, electronic equipment and medium

Technical Field

The present invention relates to the field of microservices, and in particular to a fault handling method and apparatus for a microservice cluster, an electronic device, and a readable storage medium.

Background

Chaos engineering is an engineering method that verifies a system's capabilities by actively injecting faults. It tests how well the system holds up under non-ideal conditions by simulating abnormal scenarios from real environments, such as server downtime and network delay. Potential defects in the system are thereby discovered proactively, so that problems are exposed and repaired before real faults occur, improving the fault tolerance and availability of the system. Chaos engineering is of irreplaceable value for keeping a complex system running stably.

In the related art, the chaos engineering workflow is as follows: a developer manually defines possible faults based on an understanding of the system's business architecture, and then manually draws up a corresponding emergency plan for each preset fault scenario. When a fault injection experiment is executed, the system is responsible for triggering the preset fault; the subsequent observation of fault impact, root cause localization, and recovery are carried out manually or with other operation and maintenance tools. This approach relies heavily on manual experience. It increases the cost of manual intervention, and because manual operation is slow, faults cannot be resolved in time, which lowers the reliability of the system.
Disclosure of Invention

In view of the foregoing, embodiments of the present invention provide a fault handling method and apparatus for a microservice cluster, an electronic device, and a readable storage medium, which overcome or at least partially solve the foregoing problems.

In a first aspect, an embodiment of the present invention provides a fault handling method for a microservice cluster, the method including:

  • obtaining a fault scenario, wherein the fault scenario is obtained by combining a plurality of fault actions targeting a plurality of microservices;
  • determining a fault repair scheme corresponding to the fault scenario, wherein the fault repair scheme is obtained by performing a simulation test on the fault scenario in a sandbox environment;
  • when a fault occurs in the microservice cluster, judging whether the fault matches the fault scenario;
  • if the fault matches the fault scenario, automatically executing the fault repair scheme corresponding to the fault scenario.

Optionally, obtaining the fault scenario includes:

  • acquiring the dependency relationships among the microservices in the microservice cluster;
  • determining the fault scenario according to the dependency relationships among the microservices.

Optionally, determining the fault scenario according to the dependency relationships among the microservices includes:

  • constructing a microservice topology graph according to the dependency relationships among the microservices;
  • determining, for each microservice in the microservice topology graph, one or more fault actions from a predefined fault action library;
  • determining, according to the microservice topology graph, a group of microservices having dependency relationships as a microservice group;
  • combining the fault actions corresponding to the microservice group to obtain the fault scenario.
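
The scenario generation steps just described (topology graph, per-service fault actions, microservice groups, combination) could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the service names, the contents of `FAULT_ACTION_LIBRARY`, and the grouping rule (each caller together with its direct callees) are all hypothetical, since the text does not define how a group is chosen or how actions are combined. Here one fault action per service is combined using a Cartesian product.

```python
from itertools import product
from typing import Dict, List, Set, Tuple

# Hypothetical fault action library: service name -> candidate fault actions.
FAULT_ACTION_LIBRARY: Dict[str, List[str]] = {
    "gateway": ["network_delay"],
    "order": ["pod_kill", "cpu_stress"],
    "db": ["disk_full"],
}

# Hypothetical dependency edges as (caller, callee) pairs.
DEPENDENCIES: List[Tuple[str, str]] = [("gateway", "order"), ("order", "db")]


def build_topology(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an adjacency map: service -> set of services it depends on."""
    graph: Dict[str, Set[str]] = {}
    for caller, callee in edges:
        graph.setdefault(caller, set()).add(callee)
        graph.setdefault(callee, set())
    return graph


def service_groups(graph: Dict[str, Set[str]]) -> List[Tuple[str, ...]]:
    """Assumed grouping rule: a caller plus its direct callees forms one group."""
    return [
        tuple([svc] + sorted(deps))
        for svc, deps in sorted(graph.items())
        if deps
    ]


def fault_scenarios(group: Tuple[str, ...],
                    library: Dict[str, List[str]]) -> List[Tuple[Tuple[str, str], ...]]:
    """Combine one fault action per service in the group into scenarios."""
    choices = [[(svc, action) for action in library[svc]] for svc in group]
    return [tuple(combo) for combo in product(*choices)]
```

With the sample data, `service_groups(build_topology(DEPENDENCIES))` yields the groups `("gateway", "order")` and `("order", "db")`, and each group expands into two scenarios because `order` has two candidate fault actions. The number of combinations grows multiplicatively with group size, which is presumably why the method goes on to evaluate and screen the scenarios.
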

Optionally, determining the fault scenario according to the dependency relationships among the microservices further includes:

  • evaluating the fault scenario to obtain an evaluation result;
  • screening the fault scenario according to the evaluation result to determine a final fault scenario.

Optionally, determining the fault repair scheme corresponding to the fault scenario includes:

  • constructing a sandbox environment, wherein the sandbox environment is used for performing the simulation test on the fault scenario;
  • injecting the fault scenario into the sandbox environment to perform the simulation test, and determining the fault repair scheme corresponding to the fault scenario.

Optionally, injecting the fault scenario into the sandbox environment to perform the simulation test and determining the fault repair scheme corresponding to the fault scenario includes:

  • after the fault scenario is injected into the sandbox environment, collecting operation data of the sandbox environment;
  • determining a fault cause according to the operation data;
  • determining a candidate fault repair scheme according to the fault cause;
  • executing the candidate fault repair scheme in the sandbox environment;
  • if the sandbox environment shows no abnormality after the candidate fault repair scheme is executed, determining the candidate fault repair scheme as