CN-122019231-A - Fault self-checking recovery method and system based on cloud deployment system

Abstract

The invention discloses a fault self-checking recovery method and system based on a cloud deployment system. The method builds a comprehensive real-time monitoring system that performs multidimensional, in-depth detection of the operating states of the operating system layer, the database layer and the application layer of a cloud platform, and automatically triggers an adaptive recovery mechanism as soon as an abnormality is found, so as to ensure continuous and stable operation of the system. The supporting tool integrates core modules including state acquisition, abnormality analysis, recovery execution and log recording. Together these modules collect operating data in real time, identify abnormalities with an intelligent algorithm, quickly match and carry out a recovery scheme, and log the whole process for troubleshooting and system optimization. The tool checks the server node, the database state, service accessibility and web page availability in sequence, and starts emergency handling immediately when an abnormality is found, thereby significantly improving the stability and reliability of the cloud platform and reducing the cost of manual intervention.
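The sequential inspection described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the `Check` type, the `run_inspection` function and the check names are hypothetical, and stopping at the first abnormal check is an assumption about how "immediately starts emergency handling" might be realized.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One inspection step (hypothetical structure, not from the patent)."""
    name: str                    # e.g. "server_node", "database_state"
    probe: Callable[[], bool]    # returns True when the item is healthy
    recover: Callable[[], None]  # emergency handling for this item


def run_inspection(checks: List[Check], log: List[str]) -> Optional[str]:
    """Run the checks in order and log every result.

    On the first abnormal check, trigger its recovery action and stop
    (assumption). Returns the name of the abnormal check, or None when
    every check passes.
    """
    for check in checks:
        healthy = check.probe()
        log.append(f"{check.name}: {'ok' if healthy else 'abnormal'}")
        if not healthy:
            check.recover()
            log.append(f"{check.name}: recovery triggered")
            return check.name
    return None
```

A caller would pass the four checks in the order given in the abstract (server node, database state, service accessibility, web page availability), each with its own probe and recovery callable.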

Inventors

HAN ZIBO
SUN XI
GAO XU
HAN XIAOGANG

Assignees

西安西热电站信息技术有限公司
西安热工研究院有限公司

Dates

Publication Date: 20260512
Application Date: 20260119

Claims (10)

1. A fault self-checking recovery system based on a cloud deployment system, characterized by comprising: a state acquisition module, connected to the operating system layer, the database layer and the application layer, configured to acquire the operating parameters of each layer in real time and transmit the raw operating parameter data to an abnormality analysis module; the abnormality analysis module, connected to both the state acquisition module and a threshold configuration module, configured to obtain the operating parameter data and the preset parameter thresholds, perform an in-depth comparative analysis of the real-time data against the thresholds to obtain the abnormality type and grade, synchronize the abnormality information, and trigger the subsequent recovery and processing flows; a recovery execution module, configured to, after receiving the abnormality type and grade information, retrieve the corresponding recovery scheme from an intelligent recovery strategy library maintained by a strategy updating module, and execute the recovery scheme; a log recording module, configured to record the system operating state, the abnormality information, and each key step and data item of the recovery process as log information; an alarm module, configured to respond after receiving the abnormality information and send alarm information to the relevant operation and maintenance personnel; the threshold configuration module, configured to allow the normal thresholds of the parameters to be configured flexibly according to actual service requirements; and the strategy updating module, configured to add new recovery strategies and to adjust the execution priority or parameter settings of existing strategies.
2. The fault self-checking recovery system based on a cloud deployment system according to claim 1, wherein the abnormality analysis module synchronizes the abnormality information to the recovery execution module, the alarm module and the log recording module.
3. The fault self-checking recovery system based on a cloud deployment system according to claim 1, wherein the recovery execution module monitors the progress of the recovery operation in real time during execution and feeds the recovery result back to the log recording module, so that the whole recovery process is recorded and tracked.
4. The fault self-checking recovery system based on a cloud deployment system according to claim 1, wherein the alarm module sends alarm information to the relevant operation and maintenance personnel through multiple preset channels, namely SMS, email and instant messaging tools, and the alarm information comprises the time and location of the fault, the abnormality type and the severity.
5. The fault self-checking recovery system based on a cloud deployment system according to claim 1, wherein the user adjusts, through the threshold configuration module, the threshold range of each parameter of the operating system layer, the database layer and the application layer according to the operating characteristics of the cloud platform system, the service load, and the user's requirements for system stability.
6. A fault self-checking recovery method based on a cloud deployment system, characterized by comprising the following steps: acquiring, in real time, the system operating state parameters, database operating state parameters and application operating state parameters of a cloud platform deployment system; comparing each acquired parameter with its corresponding preset normal threshold range, and judging whether the system has an abnormal operating state or a fault; when an abnormal operating state or a system fault is judged to exist, determining the type and grade of the abnormality or fault, retrieving the corresponding recovery strategy from a preset recovery strategy library according to that type and grade, and executing the recovery strategy to restore the system quickly; recording the system operating state parameters, the abnormality or fault information, and the recovery process information; when the system is abnormal, immediately sending alarm information to a system administrator, the alarm information comprising the abnormality type, the time of occurrence, the node and the abnormal item; and, after recovery, starting system verification three consecutive times: if the verification passes, judging that the system has returned to normal; if the verification fails, automatically raising the fault level, invoking the corresponding strategy and triggering a high-level alarm; and if the system has still not recovered after three consecutive escalations, triggering a manual intervention mechanism, locking the current state of the system and waiting for action by operation and maintenance personnel.
7. The fault self-checking recovery method based on a cloud deployment system according to claim 6, wherein: for a Windows system, the program is set as an automatically started service through the service management console, or a scheduled task is created whose trigger is set to "At system startup" and which is set to run under the "SYSTEM" identity; for a Linux system, automatic start of the application is realized by the systemctl enable command for the service or by a crontab @reboot entry; for an AIX system, an rc.local script file is added to the /etc/inittab file read by the init process, and the user applications to be started during system boot are written into the rc.local script file; and wherein, for a Linux system, CPU load and memory usage details are obtained in real time through the /proc file system, and for an AIX system, the memory paging activity state is collected through the nmon command.
8. The fault self-checking recovery method based on a cloud deployment system according to claim 6, wherein, when it is detected that the operating system is running normally but users cannot access the service normally and the database connection state is abnormal, the database service state is confirmed; and if the system has not switched nodes automatically within 3 periods, node switching, directory mounting and service restart are performed to restore normal operation of the database service.
9. The fault self-checking recovery method based on a cloud deployment system according to claim 8, wherein, when the number of database connections is detected to exceed 80% of a preset threshold continuously for 10 minutes, a database connection pool expansion mechanism is triggered automatically and the connection quota is temporarily increased by 50%.
10. The fault self-checking recovery method based on a cloud deployment system according to claim 6, wherein the fault self-checking and recovery operations are realized by comprehensive, uninterrupted monitoring of the server operating system, the running database and the application, and detailed fault information is sent after the system has been recovered successfully.
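The post-recovery verification and escalation logic of claim 6 can be illustrated with the sketch below. It is one possible reading under stated assumptions, not the applicant's implementation: the function names, the integer fault levels, the `strategies` mapping and the `events` list are all hypothetical. The counts of three verification runs and three escalations come from the claim.

```python
from typing import Callable, Dict, List

VERIFY_RUNS = 3      # consecutive verification runs after recovery (claim 6)
MAX_ESCALATIONS = 3  # escalations allowed before manual intervention (claim 6)


def verified(verify: Callable[[], bool], runs: int = VERIFY_RUNS) -> bool:
    """Return True only if every one of `runs` consecutive verifications passes."""
    return all(verify() for _ in range(runs))


def recover_with_escalation(
    level: int,
    strategies: Dict[int, Callable[[], None]],  # hypothetical: fault level -> strategy
    verify: Callable[[], bool],
    alarm: Callable[[int], None],                # receives the raised fault level
    events: List[str],
) -> str:
    """Execute the strategy for `level`, verify, and escalate on failure.

    Returns "recovered", or "manual_intervention" once MAX_ESCALATIONS
    escalations have failed to restore the system.
    """
    escalations = 0
    while True:
        # Fall back to the highest defined strategy if the level has none (assumption).
        strategies[min(level, max(strategies))]()
        events.append(f"strategy for level {level} executed")
        if verified(verify):
            return "recovered"
        if escalations == MAX_ESCALATIONS:
            events.append("manual intervention: state locked")
            return "manual_intervention"
        escalations += 1
        level += 1    # automatically raise the fault level
        alarm(level)  # trigger the high-level alarm
```

Under this reading, a fault that starts at level 1 and never verifies successfully runs four strategies (the initial one plus three escalations) before the state is locked for operation and maintenance personnel.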

Description

Fault self-checking recovery method and system based on cloud deployment system

Technical Field

The invention belongs to the technical field of computer system software, and particularly relates to a fault self-checking recovery method and system based on a cloud deployment system.

Background

With the arrival of the digital era, cloud platforms are widely used across industries and have become the first choice of many enterprises. Thanks to their high scalability, high security and cost-effectiveness, cloud platforms bring convenience to enterprises and individuals and promote the digital transformation and innovation of enterprises. However, a cloud platform system has a complex structure involving hardware devices, software components, network architecture and massive data, and various faults can easily occur during operation, such as server hardware faults, software errors, network congestion or interruption, and data corruption or loss. In the prior art, fault handling on conventional minicomputer servers often depends excessively on manual operation. In typical scenarios such as system downtime or an operating system that cannot be reached remotely, the only option is for machine room personnel to go on site and restart the machine manually. This is inefficient and, in an emergency, may interrupt the service for a long time and cause heavy losses to the enterprise. Even for cloud platform systems, current detection and recovery still rely largely on manual inspection and manual operation.
Manual inspection has a number of unavoidable defects. Detection is not timely, so transient abnormalities of the system are hard to capture in real time. The miss rate is high, and well-concealed faults are easily overlooked. Recovery is slow, with a long delay between discovering a fault and carrying out the recovery operation, which seriously affects normal operation of the cloud platform and cannot meet the strict requirements of enterprises for service continuity and stability. Some existing automatic detection tools improve detection efficiency to a certain extent, but they are limited in function and can generally detect only one specific aspect of a cloud platform, so comprehensive monitoring of the system cannot be achieved. Moreover, their recovery mechanisms lack sufficient flexibility and intelligence, making it difficult to formulate and carry out an effective recovery strategy quickly and accurately in the face of complex and varied system faults, so that faults are not resolved promptly and properly. Therefore, a method and a corresponding tool that can perform comprehensive self-checking of a cloud platform deployment system and recover quickly when an abnormality or fault occurs are urgently needed.

Disclosure of Invention

In order to solve the problems in the prior art, the invention provides a fault self-checking and recovery method and tool for a cloud deployment system. The method and tool address the incomplete detection, low recovery efficiency and excessive dependence on manual intervention found in the prior art.
With the method and the tool, the operating states of the operating system layer, the database layer and the application layer of the cloud platform system are monitored comprehensively and in real time, and when the system is abnormal a recovery mechanism is triggered automatically, quickly and accurately. This significantly improves the stability and reliability of the cloud platform system, greatly reduces operation and maintenance costs, and provides a solid guarantee for efficient and stable operation of the cloud platform. In order to achieve the above object, the present invention provides a fault self-checking recovery system based on a cloud deployment system, comprising: a state acquisition module, connected to the operating system layer, the database layer and the application layer, configured to acquire the operating parameters of each layer in real time and transmit the raw operating parameter data to an abnormality analysis module; the abnormality analysis module, connected to both the state acquisition module and a threshold configuration module, configured to obtain the operating parameter data and the preset parameter thresholds, perform an in-depth comparative analysis of the real-time data against the thresholds to obtain the abnormality type and grade, synchronize the abnormality information, and trigger the subsequent recovery and processing flows; The recovery execution module is used for calling a recovery scheme corr