CN-122027449-A - Method for automatically switching main and standby management software
Abstract
The application discloses a method for automatically switching main and standby management software, comprising the following steps: 1) a basic judging layer automatically detects the running state of the main node through a minimal kernel-mode heartbeat mechanism and a multi-dimensional fault fusion judging mechanism, and executes layered triggering logic according to the detection result; 2) an intelligent optimizing layer automatically adjusts the judging threshold of the basic judging layer using a sliding window weighted average algorithm based on the historical abnormal data of the basic judging layer, and automatically transmits the adjusted threshold to the basic judging layer; 3) when the basic judging layer triggers a medium-high risk abnormality, a consistency fallback ("pocket bottom") layer automatically and sequentially executes a local state snapshot check and a cross-node service digest comparison to confirm whether the main node has a real service fault; and 4) if the consistency fallback layer confirms that the main node has a real service fault, the standby node automatically modifies the main/standby role identifier through a kernel-mode atomic operation and automatically executes a switching script.
Inventors
- CUI JUNFENG
- LI SHILONG
- CHEN JUN
- CUI YONGQIANG
- WEI YONG
Assignees
- 中国人民解放军63729部队
Dates
- Publication Date
- 20260512
- Application Date
- 20260128
Claims (9)
- 1. A method for automatically switching main and standby management software, adopting a three-layer architecture of a basic judging layer, an intelligent optimizing layer and a consistency fallback ("pocket bottom") layer, and comprising the following steps: step 1) the basic judging layer automatically detects the running state of the main node through a minimal kernel-mode heartbeat mechanism and a multidimensional fault fusion judging mechanism, and executes layered triggering logic according to the detection result; step 2) the intelligent optimizing layer automatically adjusts the judging threshold of the basic judging layer using a sliding window weighted average algorithm based on the historical abnormal data of the basic judging layer, and automatically transmits the adjusted threshold to the basic judging layer; step 3) when the basic judging layer triggers a medium-high risk abnormality, the consistency fallback layer automatically and sequentially executes a local state snapshot check and a cross-node service digest comparison to confirm whether the main node has a real service fault; and step 4) if the consistency fallback layer confirms that the main node has a real service fault, the standby node automatically modifies the main/standby role identifier through a kernel-mode atomic operation and automatically executes the switching script, completing the automatic switching of the main and standby management software.
- 2. The method for automatically switching active/standby management software according to claim 1, wherein the minimal kernel-mode heartbeat mechanism in step 1 is as follows: a heartbeat transmission channel is constructed based on a kernel timer of the embedded system kernel and an existing communication link; the kernel-mode timer of the main node automatically generates a heartbeat frame at a fixed period, the heartbeat frame having a fixed structure comprising a 1-byte node ID, a 1-byte core process survival flag, a 4-byte timestamp and a 2-byte fault feature flag field, with the frame length fixed at 8 bytes; and the kernel mode of the standby node automatically listens for and parses the heartbeat frame without passing through the user-mode protocol stack.
- 3. The method for automatically switching active/standby management software according to claim 2, wherein, in the minimal kernel-mode heartbeat mechanism, the heartbeat timeout threshold is automatically linked to the system load: the threshold is 3 heartbeat cycles when the system CPU occupancy is less than or equal to 80%, and is automatically adjusted to 5 heartbeat cycles when the CPU occupancy is greater than 80%; and the fault feature flag field quantifies the CPU occupancy trend of the main node and the survival stability of the core process on a scale of level 0 to level 3, providing data support for subsequent automatic judgment.
- 4. The method for automatically switching active/standby management software according to claim 1, wherein the judgment dimensions of the multidimensional fault fusion judging mechanism in step 1 include a heartbeat timeout state, a core process survival state, a resource occupancy state and a service link reachability state; a medium-high risk abnormality is automatically determined if any one of the following combinations is met: heartbeat timeout together with core process survival abnormality, heartbeat timeout together with service link reachability abnormality, or core process survival abnormality together with resource occupancy abnormality; and a low risk is automatically determined when only a single dimension is abnormal and no fusion judgment condition is met.
- 5. The method for automatically switching active/standby management software according to claim 4, wherein the automatic decision logic for the resource occupancy state is: the kernel mode of the main node automatically calculates the memory increment within 10 seconds and the real-time CPU occupancy, and the state is judged abnormal if the memory increment exceeds 100 KB or the CPU occupancy exceeds 95%; and the automatic decision logic for the service link reachability state is: the standby node automatically sends a 1-2 byte probe instruction through the existing service port and automatically counts the number of probe response timeouts or response errors.
- 6. The method for automatically switching active/standby management software according to claim 1, wherein the intelligent optimizing layer in step 2 operates as follows: a fixed storage area of no more than 1 KB is set aside in the embedded non-volatile storage, and the abnormality timestamp, abnormality type and abnormality judgment result of the most recent 100 abnormal events are automatically stored in it; the sliding window size is fixed at 20 abnormal events, and the proportions of real faults and misjudgments within the window are automatically calculated; when the misjudgment proportion is greater than 30%, the heartbeat timeout threshold is automatically increased by 2 cycles from its current value; when the real fault proportion is greater than 80%, the heartbeat timeout threshold is automatically reduced by 1 cycle from its current value; and when the proportion lies in the interval of 30% to 80%, the current threshold is kept unchanged.
- 7. The method for automatically switching active/standby management software according to claim 1, wherein the local state snapshot check in step 3 is: the kernel mode of the main node automatically generates, at a fixed period, a core service state snapshot of no more than 50 bytes, the snapshot comprising the number of connections to managed devices, a task execution progress flag and a system error code; the main node automatically calculates a CRC16 hash digest of each generated snapshot and stores it in a kernel-mode buffer; and when the basic judging layer detects a medium-high risk abnormality, the main node automatically compares the hash digest of the current snapshot with that of the historical normal snapshot: if they are consistent, the abnormality is automatically judged to be system jitter and the switching flow is terminated, and if they are inconsistent, the cross-node service digest comparison is automatically started.
- 8. The method for automatically switching active/standby management software according to claim 7, wherein the cross-node service digest comparison is: the standby node automatically sends a 1-byte service probe instruction to the main node through the existing communication link, the instruction being used to read a preset key state of the managed device; the main node automatically returns a 4-byte CRC16 hash digest of the key state; and the standby node automatically compares the digest with a locally pre-stored normal service digest template: if they do not match, a service fault of the main node is automatically confirmed and the switching flow is triggered, and if they match, the abnormality is automatically judged to be a false alarm and the switching flow is terminated.
- 9. The method for automatically switching active/standby management software according to claim 1, wherein, in step 4, the kernel-mode atomic operation is: the standby node automatically modifies, through a kernel-mode atomic operation, the main/standby role identifier stored in the embedded non-volatile storage, the role identifier being 1 byte, where 0 represents the standby node and 1 represents the main node; and the switching script is automatically executed in the following steps: the standby node's own heartbeat monitoring thread is automatically stopped, the main-node core service process is automatically started, and a 1-byte role switching notification is automatically sent to the managed devices.
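The 8-byte heartbeat frame of claim 2 and the load-linked timeout of claim 3 can be illustrated with a short user-space sketch. It is not the kernel-mode implementation the claims describe: the claims fix only the field sizes, so the field order, the big-endian byte order and the function names below are assumptions made for illustration.

```python
import struct

# Assumed wire layout: node ID (1 byte), core process survival flag (1 byte),
# timestamp (4 bytes), fault feature flags (2 bytes). Byte order and field
# order are assumptions; only the sizes come from claim 2.
HEARTBEAT_FORMAT = ">BBIH"
HEARTBEAT_SIZE = struct.calcsize(HEARTBEAT_FORMAT)  # 1 + 1 + 4 + 2 = 8 bytes


def pack_heartbeat(node_id, process_alive, timestamp, fault_flags):
    """Build one fixed-length heartbeat frame."""
    return struct.pack(
        HEARTBEAT_FORMAT,
        node_id,
        1 if process_alive else 0,
        timestamp & 0xFFFFFFFF,  # keep the timestamp within 4 bytes
        fault_flags & 0xFFFF,    # keep the flags within 2 bytes
    )


def unpack_heartbeat(frame):
    """Parse a frame, rejecting anything that is not exactly 8 bytes."""
    if len(frame) != HEARTBEAT_SIZE:
        raise ValueError("heartbeat frame must be exactly 8 bytes")
    node_id, alive, timestamp, fault_flags = struct.unpack(HEARTBEAT_FORMAT, frame)
    return node_id, bool(alive), timestamp, fault_flags


def timeout_cycles(cpu_percent):
    """Claim 3: 3 heartbeat cycles at CPU <= 80%, 5 cycles above 80%."""
    return 3 if cpu_percent <= 80 else 5
```

A fixed-size frame lets the receiver validate length before parsing, which is one plausible reason for a fixed 8-byte structure.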
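The fusion judgment of claim 4 can be sketched as a small classifier. The sketch assumes the fusion combinations are read as three pairs (heartbeat timeout with core process abnormality, heartbeat timeout with service link abnormality, core process abnormality with resource abnormality). The `Observation` type, the risk labels, and the treatment of unlisted two-dimension combinations as low risk are illustrative assumptions, not part of the claims.

```python
from dataclasses import dataclass


@dataclass
class Observation:
    """One sample of the four judgment dimensions (hypothetical container)."""
    heartbeat_timeout: bool
    process_abnormal: bool
    resource_abnormal: bool
    link_unreachable: bool


def classify(obs):
    """Return 'medium_high', 'low' or 'normal' for one observation."""
    fused = (
        (obs.heartbeat_timeout and obs.process_abnormal)
        or (obs.heartbeat_timeout and obs.link_unreachable)
        or (obs.process_abnormal and obs.resource_abnormal)
    )
    if fused:
        return "medium_high"
    # Assumption: any abnormality that does not satisfy a fusion condition,
    # including an unlisted pair, is treated as low risk.
    if (obs.heartbeat_timeout or obs.process_abnormal
            or obs.resource_abnormal or obs.link_unreachable):
        return "low"
    return "normal"
```

For example, a heartbeat timeout alone yields `"low"`, while a heartbeat timeout together with an unreachable service link yields `"medium_high"`.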
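The threshold adjustment rule of claim 6 can be sketched as follows. Claim 6 gives only the proportions and step sizes, so this sketch weights every event in the window equally rather than implementing a particular weighted average. The class name, the wait for a full window before adjusting, and the lower bound of 1 cycle are assumptions.

```python
from collections import deque

HISTORY_LIMIT = 100  # claim 6: keep the most recent 100 abnormal events
WINDOW_SIZE = 20     # claim 6: sliding window of 20 abnormal events


class ThresholdTuner:
    """Hypothetical helper that adapts the heartbeat timeout threshold."""

    def __init__(self, threshold_cycles=3):
        self.threshold_cycles = threshold_cycles
        # True = event confirmed as a real fault, False = misjudgment.
        self.history = deque(maxlen=HISTORY_LIMIT)

    def record(self, was_real_fault):
        self.history.append(bool(was_real_fault))

    def adjust(self):
        window = list(self.history)[-WINDOW_SIZE:]
        if len(window) < WINDOW_SIZE:
            return self.threshold_cycles  # assumption: wait for a full window
        real = sum(window)
        misjudged = WINDOW_SIZE - real
        # Integer comparisons avoid floating-point edge cases at 30% and 80%.
        if misjudged * 100 > 30 * WINDOW_SIZE:
            self.threshold_cycles += 2
        elif real * 100 > 80 * WINDOW_SIZE:
            # Assumption: never drop below one heartbeat cycle.
            self.threshold_cycles = max(1, self.threshold_cycles - 1)
        return self.threshold_cycles
```

Raising the threshold by 2 but lowering it by only 1 makes the rule back off from false switching faster than it tightens detection.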
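The two-stage confirmation of claims 7 and 8 can be sketched in the same spirit. The claims do not name a CRC16 variant, so the sketch uses the CRC-CCITT routine `binascii.crc_hqx` from the Python standard library as a stand-in. The snapshot field widths, the function names and the result labels are assumptions.

```python
import binascii
import struct


def snapshot_digest(connections, progress_flag, error_code):
    """CRC16 digest of a packed core-service snapshot.

    Assumed layout: 2-byte connection count, 1-byte progress flag,
    2-byte error code (5 bytes, well under the 50-byte limit of claim 7).
    """
    snapshot = struct.pack(">HBH", connections, progress_flag, error_code)
    return binascii.crc_hqx(snapshot, 0xFFFF)


def confirm_fault(current_digest, normal_digest, fetch_remote_digest, template_digest):
    """Run the local check first, then the cross-node check only if needed."""
    # Stage 1 (claim 7): an unchanged local snapshot digest means system jitter.
    if current_digest == normal_digest:
        return "jitter"
    # Stage 2 (claim 8): the standby node probes the main node and compares
    # the returned digest with its pre-stored normal template.
    if fetch_remote_digest() == template_digest:
        return "false_alarm"
    return "switch"
```

Passing the remote probe as a callable keeps the cross-node traffic off the path whenever the local check already terminates the switching flow.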
Description
Method for automatically switching main and standby management software
Technical Field
The application relates to the technical field of fault tolerance and system redundancy processing, and in particular to a method for automatically switching main and standby management software.
Background
In key fields such as industrial control, intelligent transportation and Internet of Things terminals, high availability of the embedded system is a core requirement for guaranteeing continuous service operation. Automatic switching of main and standby management software is a core means of improving system fault tolerance: through redundant design, the standby node seamlessly takes over when the main node fails, and the technique has become standard in highly available embedded systems.
Current mainstream schemes for automatically switching main and standby management software fall into two broad types, hardware-redundancy-driven and software-logic-judgment, and both have technical shortcomings when applied to lightweight embedded scenarios.
The hardware-redundancy-driven scheme was the mainstream implementation of early main/standby switching. Its core idea is to detect main node faults and trigger switching by deploying dedicated hardware modules (such as dual-MCU lockstep cores, an independent arbitration chip or a dedicated heartbeat line). For example, some high-reliability embedded systems adopt a Triple Modular Redundancy (TMR) architecture, in which three independent processors run in parallel and their outputs are compared to detect faults, or use an external monitoring circuit to monitor the power supply and running state of the main node in real time and trigger a hardware-level switching signal.
This scheme has the advantage of fast response, but it has obvious drawbacks. On the one hand, the additional hardware modules greatly increase the cost, volume and power consumption of the system, which runs counter to the low-resource, low-cost design of lightweight embedded devices. On the other hand, hardware redundancy modules have poor compatibility: they must be custom-developed for each chip architecture and are difficult to reuse across platforms, which increases development and maintenance costs.
The software-logic-judgment scheme emerged to suit embedded scenarios. Its core is to judge faults through heartbeat detection and state monitoring at the software layer, reducing dependence on additional hardware. Existing software schemes are mainly of two kinds. The first is the A/B-partition-based switching scheme, in which main and standby application partitions are set up in the storage medium and the system boots from the standby partition when the main partition fails. This scheme needs no additional hardware but is mainly suited to fault rollback in firmware upgrade scenarios, and it is difficult to use for real-time dynamic switching of main and standby management software at run time. The second is the heartbeat-based software judgment scheme, in which the survival state of the main node is detected through periodic heartbeat exchange between the main and standby nodes, and switching is triggered when the heartbeat times out.
However, existing software judgment schemes still face several technical bottlenecks. First, the fault judgment dimension is single: most schemes rely on a single heartbeat signal, so non-fault factors such as momentary network interruptions and system load fluctuations easily cause false switching, and the false switching rate is especially high in embedded scenarios with complex electromagnetic environments. Second, the judgment threshold is fixed and cannot adapt to the operating conditions of different embedded devices (for example, the load fluctuation characteristics of industrial control equipment and consumer-grade Internet of Things terminals differ markedly); a threshold that is too loose causes faults to be missed, while a threshold that is too strict causes frequent misjudgment. Third, a service consistency check mechanism is lacking: faults are judged only by survival state, so switching may be triggered when the heartbeat of the main node is abnormal but the service is still running normally, which affects service continuity. Fourth, some schemes implement the fault detection and switching logic in user mode; when the core process of the main node hangs, the user-mode detection logic fails, causing switching to fail to trigger, and the reli