CN-122001752-A - Fault checking method and device, electronic equipment and storage medium

Abstract

The application discloses a fault checking method and device, an electronic device, and a storage medium, and relates to the field of computer technology. The method periodically and automatically detects the access connectivity of cluster services. When a service is inaccessible, it triggers targeted fault checking and classified diagnosis, and then automatically matches and executes a repair strategy according to the diagnosis result. This addresses the technical problems of existing K8S operation and maintenance schemes, in which reliance on manual investigation makes fault localization inefficient and misjudgment tends to delay recovery, affecting service continuity. The technical effects are improved fault localization efficiency, fewer recovery delays caused by misjudgment, and safeguarded service continuity.

Inventors

DING KAI
MA BAO

Assignees

Jinan Inspur Data Technology Co., Ltd. (济南浪潮数据技术有限公司)

Dates

Publication Date: 20260508
Application Date: 20260210

Claims (10)

1. A fault checking method, comprising: configuring a monitoring strategy and periodically detecting the access connectivity of each service in a target cluster; in response to a service being inaccessible, triggering a fault checking procedure; determining, according to the type of the service, whether the service is an internal access address or an external access address, and executing the corresponding fault diagnosis steps; and matching a preset fault type code according to the diagnosis result, and invoking the corresponding automatic repair strategy for processing.
2. The method of claim 1, wherein after matching the preset fault type code according to the diagnosis result and invoking the corresponding automatic repair strategy for processing, the method further comprises: in response to the fault type matching failing, generating alarm information and recording the alarm information to message middleware, without executing a repair operation.
3. The method of claim 1, wherein configuring the monitoring strategy and periodically detecting the access connectivity of each service in the target cluster comprises: setting a maximum timeout, a maximum number of failed attempts, and a maximum inspection period when configuring the monitoring strategy, so as to control the detection frequency and sensitivity; and binding the monitoring strategy to the target cluster information, the service access address, and the service type, so as to monitor different service scenarios in a differentiated manner.
4. The method of claim 1, wherein determining, according to the type of the service, whether the service is an internal access address or an external access address, and executing the corresponding fault diagnosis steps, comprises: when the service is an internal access address, acquiring the port parameters defined in the container image and comparing them with the ports defined by the service, to determine whether a port configuration error exists; and when the service is an external access address, detecting whether the terminal is configured with correct domain name resolution, to determine whether a fault of the ingress controller error type exists.
5. The method of claim 1, wherein matching the preset fault type code according to the diagnosis result and invoking the corresponding automatic repair strategy for processing comprises: in response to a fault of the port error type, invoking the target cluster to update the port configuration in the service resource, and verifying whether the updated port can be accessed normally; and in response to a fault of the Pod network error type, creating a test Pod, performing packet capture detection, identifying a missing routing policy or an abnormal iptables rule, and repairing it.
6. The method of claim 1, wherein the method further comprises: deploying the fault detection and recovery device in the target cluster as a container image, and realizing unified monitoring and management of a plurality of clusters through an authentication configuration file.
7. The method of claim 1, wherein the diagnosis of the internal access address includes checking the consistency of the container image port with the service port, inter-Pod communication connectivity, and the network policy restrictions of the namespace; the diagnosis of the external access address includes checking at least one of connectivity from outside the cluster to node IPs and domain name resolution validity; and the fault diagnosis step includes updating the service port configuration, creating a test Pod for packet capture detection, adjusting NetworkPolicy policies, ruling out firewall restrictions, and configuring domain name mappings.
8. A fault checking device, comprising: a configuration unit for configuring a monitoring strategy and periodically detecting the access connectivity of each service in a target cluster; a triggering unit for triggering a fault checking procedure in response to a service being inaccessible; a determining unit for determining, according to the type of the service, whether the service is an internal access address or an external access address, and executing the corresponding fault diagnosis steps; and a processing unit for matching a preset fault type code according to the diagnosis result and invoking the corresponding automatic repair strategy for processing.
9. An electronic device, comprising: a memory for storing a computer program; and a processor for implementing the steps of the fault checking method according to any one of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the fault checking method according to any one of claims 1 to 7.
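
By way of illustration only, the matching of a fault type code to a repair strategy in claim 1, together with the alarm fallback of claim 2, might be sketched as below. The code values, the class name RepairDispatcher, the in-memory alerts list standing in for the message middleware, and the placeholder fix_port function are all assumptions made for this sketch; the application does not specify them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# Hypothetical fault type codes; the application does not list concrete values.
PORT_ERROR = "PORT_ERROR"
POD_NETWORK_ERROR = "POD_NETWORK_ERROR"


@dataclass
class RepairDispatcher:
    """Maps fault type codes to repair callables; unmatched codes become alarms."""
    strategies: Dict[str, Callable[[str], bool]] = field(default_factory=dict)
    # Stand-in for the message middleware (assumption: a plain in-memory list).
    alerts: List[dict] = field(default_factory=list)

    def register(self, code: str, strategy: Callable[[str], bool]) -> None:
        self.strategies[code] = strategy

    def handle(self, service: str, code: Optional[str]) -> str:
        strategy = self.strategies.get(code) if code is not None else None
        if strategy is None:
            # No matching fault type: record an alarm and do not repair.
            self.alerts.append({"service": service, "fault_code": code})
            return "alerted"
        return "repaired" if strategy(service) else "repair_failed"


def fix_port(service: str) -> bool:
    # Placeholder: a real strategy would update the service port configuration
    # and then verify that the updated port is reachable.
    return True


dispatcher = RepairDispatcher()
dispatcher.register(PORT_ERROR, fix_port)
print(dispatcher.handle("orders", PORT_ERROR))
print(dispatcher.handle("orders", "UNKNOWN"))
```

Keeping the code-to-strategy mapping in a table means an unrecognized diagnosis result falls through to the alarm path by default, so no repair is attempted on a fault that was not positively identified.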

Description

Fault checking method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular to a fault checking method, a fault checking device, an electronic device, and a storage medium.

Background

K8S, the core container orchestration system of the cloud-native architecture, is widely used to deploy and manage large-scale distributed systems. The network access architecture of a containerized application covers the whole process from service discovery to network communication, and its stability directly affects service connectivity inside and outside the cluster. Existing K8S operation and maintenance schemes rely on manual investigation, so fault localization is inefficient, or misjudgment delays recovery, and service continuity is affected.

Disclosure of Invention

The application provides a fault checking method, a fault checking device, an electronic device, and a storage medium, which at least solve the problem in the related art that fault localization is inefficient, or that misjudgment delays recovery, so that service continuity is affected.

The application provides a fault checking method, comprising: configuring a monitoring strategy and periodically detecting the access connectivity of each service in a target cluster; in response to a service being inaccessible, triggering a fault checking procedure; determining, according to the type of the service, whether the service is an internal access address or an external access address, and executing the corresponding fault diagnosis steps; and matching a preset fault type code according to the diagnosis result, and invoking the corresponding automatic repair strategy for processing.
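
As a minimal sketch of the first two steps (periodic connectivity detection and triggering of the fault checking procedure), the following assumes a plain TCP connection attempt as the connectivity test. That choice, the MonitoringPolicy field names, and the monitor function with its injectable probe and sleep parameters are assumptions of this sketch, not details given in the application.

```python
import socket
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitoringPolicy:
    # Field names are illustrative, not taken from the application.
    cluster: str
    address: str           # host part of the service access address
    port: int
    service_type: str      # e.g. "ClusterIP" or "NodePort"
    timeout_s: float = 2.0
    max_failures: int = 3
    period_s: float = 30.0


def is_reachable(policy: MonitoringPolicy) -> bool:
    """Return True if a TCP connection succeeds within the policy timeout."""
    try:
        with socket.create_connection(
            (policy.address, policy.port), timeout=policy.timeout_s
        ):
            return True
    except OSError:
        return False


def monitor(policy, on_fault, probe=is_reachable, rounds=None, sleep=time.sleep):
    """Probe once per period; call on_fault when consecutive failures hit the limit."""
    failures = 0
    done = 0
    while rounds is None or done < rounds:
        if probe(policy):
            failures = 0
        else:
            failures += 1
            if failures >= policy.max_failures:
                on_fault(policy)   # hand over to the fault checking procedure
                failures = 0
        done += 1
        if rounds is None or done < rounds:
            sleep(policy.period_s)


if __name__ == "__main__":
    policy = MonitoringPolicy("demo-cluster", "10.96.0.10", 80, "ClusterIP",
                              max_failures=2)
    faults = []
    # Inject a probe that always fails so the demo needs no network access.
    monitor(policy, faults.append, probe=lambda p: False, rounds=4,
            sleep=lambda s: None)
    print(len(faults))
```

Requiring several consecutive failures before triggering trades detection speed for fewer false alarms from a single dropped connection.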
Optionally, after matching the preset fault type code according to the diagnosis result and invoking the corresponding automatic repair strategy for processing, the method further includes: in response to the fault type matching failing, generating alarm information and recording the alarm information to message middleware, without executing a repair operation.

Optionally, configuring the monitoring strategy and periodically detecting the access connectivity of each service in the target cluster includes: setting a maximum timeout, a maximum number of failed attempts, and a maximum inspection period when configuring the monitoring strategy, so as to control the detection frequency and sensitivity; and binding the monitoring strategy to the target cluster information, the service access address, and the service type, so as to monitor different service scenarios in a differentiated manner.

Optionally, determining, according to the type of the service, whether the service is an internal access address or an external access address, and executing the corresponding fault diagnosis steps, includes: when the service is an internal access address, acquiring the port parameters defined in the container image and comparing them with the ports defined by the service, to determine whether a port configuration error exists; and when the service is an external access address, detecting whether the terminal is configured with correct domain name resolution, to determine whether a fault of the ingress controller error type exists.
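
The two diagnosis branches just described could be sketched roughly as follows. The fault code names, the function names, and the idea of passing the image ports and expected addresses in as plain lists are assumptions for illustration. A real implementation would obtain them from the container image metadata and the cluster, which the application does not detail.

```python
import socket
from typing import Iterable, Optional

# Hypothetical fault type codes; the application does not list concrete values.
PORT_ERROR = "PORT_ERROR"
INGRESS_CONTROLLER_ERROR = "INGRESS_CONTROLLER_ERROR"


def diagnose_internal(image_ports: Iterable[int],
                      service_ports: Iterable[int]) -> Optional[str]:
    """Report a port configuration error if the service targets a port
    that the container image does not define."""
    declared = set(image_ports)
    mismatched = [p for p in service_ports if p not in declared]
    return PORT_ERROR if mismatched else None


def diagnose_external(hostname: str,
                      expected_ips: Iterable[str],
                      resolve=socket.gethostbyname_ex) -> Optional[str]:
    """Report an error if the domain name does not resolve, or resolves
    to none of the expected addresses."""
    try:
        _, _, addresses = resolve(hostname)
    except OSError:
        # Resolution failures (socket.gaierror, socket.herror) are OSError subclasses.
        return INGRESS_CONTROLLER_ERROR
    if set(addresses) & set(expected_ips):
        return None
    return INGRESS_CONTROLLER_ERROR


if __name__ == "__main__":
    # The image defines 8080 but the service targets 80: a mismatch.
    print(diagnose_internal([8080], [80]))
    # A stub resolver keeps the demo independent of real DNS.
    print(diagnose_external("shop.example.com", ["192.0.2.10"],
                            resolve=lambda h: (h, [], ["192.0.2.10"])))
```

Each function returns either a fault type code or None, so its result can be passed directly to whatever component matches codes against repair strategies.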
Optionally, matching the preset fault type code according to the diagnosis result and invoking the corresponding automatic repair strategy for processing includes: in response to a fault of the port error type, invoking the target cluster to update the port configuration in the service resource, and verifying whether the updated port can be accessed normally; and in response to a fault of the Pod network error type, creating a test Pod, performing packet capture detection, identifying a missing routing policy or an abnormal iptables rule, and repairing it.

Optionally, the method further includes: deploying the fault detection and recovery device in the target cluster as a container image, and realizing unified monitoring and management of a plurality of clusters through an authentication configuration file.

Optionally, the diagnosis of the internal access address includes checking the consistency of the container image port with the service port, inter-Pod communication connectivity, and the network policy restrictions of the namespace; the diagnosis of the external access address includes checking at least one of connectivity from outside the cluster to node IPs and domain name resolution validity; and the fault diagnosis step includes updating the service port configuration, creating a test Pod for packet capture detection, adjusting NetworkPolicy policies, ruling out firewall restrictions, and configuring domain name mappings.

The application also provides a fault