CN-121979644-A - Real-time abnormal monitoring local automatic recovery method and device


Abstract

The invention discloses a real-time anomaly monitoring local automatic recovery method and device. By injecting monitoring logic, behavior data of an interrupt service routine is collected in each interrupt execution period and compared with a baseline model in real time, so that sub-health states of the interrupt service routine are identified before a processor hardware exception is triggered and even before system scheduling visibly deteriorates. By grading each anomaly, recovery actions of different granularity are taken according to its severity and scope of influence. Through continuous monitoring and analysis of the operating behavior of ordinary interrupt service routines, abnormal trends are identified while the system is still operating normally, and graded, local automatic recovery is applied according to the anomaly level, which reduces the probability of a full system restart and improves the reliability and real-time performance of the system.

Inventors

WANG JIAN
RONG MINGKANG
LIANG HONGPEI
ZHOU XIULONG
LI HUI
HAO ZHICHAO
GUO XIAO

Assignees

广州市金其利信息科技有限公司

Dates

Publication Date: 20260505
Application Date: 20260330

Claims (10)

1. A real-time anomaly monitoring local automatic recovery method, characterized by comprising the following steps: when the system is initialized or running stably, constructing an operation behavior baseline model of an interrupt service routine, wherein the operation behavior baseline model comprises at least one standard statistical feature describing the interrupt service routine in a normal working state; injecting lightweight monitoring logic at the entry and exit points of the interrupt service routine to monitor its execution in real time, and collecting at least one real-time statistical feature of the interrupt service routine during execution; comparing the collected real-time statistical features with the standard statistical features to obtain a comparison result, and judging, based on the comparison result and a preset judgment rule, whether the current interrupt service routine is in an abnormal state, wherein the abnormal state comprises interrupt, abnormal nesting and interrupt task preemption; and when the current interrupt service routine is in an abnormal state, evaluating the anomaly level and triggering an automatic recovery action matched to the anomaly level, wherein the automatic recovery action comprises at least one local recovery operation that does not interrupt system operation.
2. The method of claim 1, wherein the lightweight monitoring logic is injected at the entry and exit points of the interrupt service routine without modifying the interrupt service routine itself.
3. The method of claim 1, wherein the lightweight monitoring logic includes at least: recording a first timestamp or cycle count value when an interrupt is entered and a second timestamp or cycle count value when the interrupt is exited, so as to calculate the interrupt execution duration; counting the interrupt trigger frequency per unit time; detecting whether stack usage reaches a threshold; and detecting abnormal nesting depth during execution of the interrupt service routine.
4. The method of claim 1, wherein the standard statistical features include average execution time, peak execution time, interrupt trigger frequency interval, maximum interrupt nesting depth, execution time jitter range, and stack usage.
5. The method of claim 3, wherein judging whether the current interrupt service routine is in an abnormal state based on the comparison result and a preset judgment rule comprises: determining that the interrupt service routine is in an abnormal state when any one or more of the following holds: the interrupt execution duration exceeds a first threshold, the interrupt trigger frequency is greater than a second threshold, or the abnormal nesting depth is greater than a third threshold; and determining that the interrupt service routine is in an abnormal state when any one or more of the interrupt execution duration, the interrupt trigger frequency, the abnormal nesting depth, and the like shows a continuously or monotonically increasing trend over a plurality of consecutive sampling periods.
6. The method of claim 1, wherein the anomaly level comprises a local instantaneous anomaly, a processing thread anomaly and a global stable anomaly, and evaluating the anomaly level comprises judging, according to a first dimension, a second dimension and a third dimension, whether the anomaly is a local instantaneous anomaly, a processing thread anomaly or a global stable anomaly, wherein the first dimension comprises whether the interrupt execution duration is greater than a first threshold, whether the interrupt trigger frequency is greater than a second threshold, whether the abnormal nesting depth is greater than a third threshold, and whether any one of the interrupt execution duration, the interrupt trigger frequency and the abnormal nesting depth shows a continuously increasing trend; the second dimension comprises whether the interrupt task preemption delay is greater than a set threshold; and the third dimension comprises whether the duration of the abnormal state over consecutive sampling periods exceeds a preset period threshold.
7. The method of claim 6, wherein triggering an automatic recovery action matched to the anomaly level comprises: when the anomaly level is a local instantaneous anomaly, executing an interrupt-level self-healing process, which comprises resetting the related peripherals, emptying a hardware buffer, or reinitializing the interrupt controller, while task scheduling keeps running normally; when the anomaly level is a processing thread anomaly, terminating and rebuilding the associated processing thread and recovering its context and message queue, while other processing threads unrelated to the interrupt service routine keep running; and when the anomaly level is a global stable anomaly, executing a preset system-level soft restart, during which preset key application data and diagnostic logs are preserved.
8. A real-time anomaly monitoring local automatic recovery device, characterized by comprising: a baseline model construction module, configured to construct an operation behavior baseline model of an interrupt service routine when the system is initialized or running stably, wherein the operation behavior baseline model comprises at least one standard statistical feature describing the interrupt service routine in a normal working state; a real-time feature acquisition module, configured to inject lightweight monitoring logic at the entry and exit points of the interrupt service routine to monitor its execution in real time, and to collect at least one real-time statistical feature of the interrupt service routine during execution; an execution feature comparison module, configured to compare the collected real-time statistical features with the standard statistical features to obtain a comparison result, and to judge, based on the comparison result and a preset judgment rule, whether the current interrupt service routine is in an abnormal state; and a local interrupt recovery module, configured to evaluate the anomaly level when the current interrupt service routine is in an abnormal state and to trigger an automatic recovery action matched to the anomaly level, wherein the automatic recovery action comprises at least one local recovery operation that does not interrupt system operation.
9. An electronic device comprising a memory storing executable program code, a processor coupled to the memory, the processor invoking the executable program code stored in the memory for performing the real-time anomaly monitoring local automatic recovery method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program causes a computer to execute the real-time anomaly monitoring local automatic recovery method according to any one of claims 1 to 7.
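
The monitoring, judgment and grading steps recited in claims 3, 5 and 6 can be illustrated with a minimal C sketch. It is not the applicant's implementation. All type, field and function names (isr_monitor_t, isr_run_monitored, isr_evaluate and so on) are hypothetical, demo_clock stands in for a hardware cycle counter, the trend check covers only the execution duration, and the way the three dimensions map onto the three anomaly levels is an assumption, since the claims name the dimensions without fixing how they combine.

```c
#include <stdbool.h>
#include <stdint.h>

#define TREND_WINDOW 4u /* assumed number of sampling periods for trend checks */

/* Baseline limits for one ISR; all field names are hypothetical. */
typedef struct {
    uint32_t max_cycles;           /* first threshold: execution duration */
    uint32_t max_rate;             /* second threshold: triggers per unit time */
    uint32_t max_nesting;          /* third threshold: nesting depth */
    uint32_t max_preempt_delay;    /* limit on task preemption delay */
    uint32_t max_abnormal_periods; /* limit on consecutive abnormal periods */
} isr_baseline_t;

/* Features collected for one interrupt execution. */
typedef struct {
    uint32_t cycles, rate, nesting, preempt_delay;
} isr_sample_t;

typedef enum { ISR_OK, ISR_TRANSIENT, ISR_THREAD, ISR_GLOBAL } isr_level_t;

typedef struct {
    isr_baseline_t base;
    uint32_t entry_cycles;          /* cycle count latched at ISR entry */
    uint32_t history[TREND_WINDOW]; /* recent durations, oldest first */
    uint32_t count;                 /* samples seen, capped at TREND_WINDOW */
    uint32_t abnormal_run;          /* consecutive abnormal periods */
} isr_monitor_t;

typedef void (*isr_fn_t)(void);
typedef uint32_t (*cycle_source_t)(void);

/* Stand-ins for a hardware cycle counter and a real ISR (demo only). */
static uint32_t demo_cycles;
static uint32_t demo_clock(void) { demo_cycles += 40u; return demo_cycles; }
static void demo_isr(void) { }

/* Entry/exit hooks wrapped around an unmodified ISR; returns its duration. */
static uint32_t isr_run_monitored(isr_monitor_t *m, isr_fn_t isr, cycle_source_t clk)
{
    m->entry_cycles = clk();        /* entry hook */
    isr();                          /* original routine, untouched */
    return clk() - m->entry_cycles; /* exit hook; unsigned math tolerates wrap */
}

/* True when the last TREND_WINDOW durations are strictly increasing. */
static bool rising_trend(const isr_monitor_t *m)
{
    if (m->count < TREND_WINDOW) return false;
    for (uint32_t i = 0; i + 1u < TREND_WINDOW; i++)
        if (m->history[i] >= m->history[i + 1u]) return false;
    return true;
}

/* Record one sample, compare it with the baseline, and grade the result. */
static isr_level_t isr_evaluate(isr_monitor_t *m, const isr_sample_t *s)
{
    for (uint32_t i = 0; i + 1u < TREND_WINDOW; i++)
        m->history[i] = m->history[i + 1u];
    m->history[TREND_WINDOW - 1u] = s->cycles;
    if (m->count < TREND_WINDOW) m->count++;

    bool over = s->cycles > m->base.max_cycles || s->rate > m->base.max_rate ||
                s->nesting > m->base.max_nesting;
    if (!over && !rising_trend(m)) {
        m->abnormal_run = 0;
        return ISR_OK;
    }
    m->abnormal_run++;
    /* Assumed mapping: persistence -> global, preemption delay -> thread. */
    if (m->abnormal_run > m->base.max_abnormal_periods) return ISR_GLOBAL;
    if (s->preempt_delay > m->base.max_preempt_delay) return ISR_THREAD;
    return ISR_TRANSIENT;
}
```

In this sketch, a caller would wrap each registered handler with isr_run_monitored, place the returned duration in an isr_sample_t, and pass it to isr_evaluate. The returned level would then select among the recovery actions of claim 7: peripheral or interrupt controller reset for ISR_TRANSIENT, thread rebuild for ISR_THREAD, and soft restart for ISR_GLOBAL.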

Description

Real-time abnormal monitoring local automatic recovery method and device

Technical Field

The invention relates to the technical field of communication exception handling, in particular to a real-time exception monitoring local automatic recovery method and device.

Background

In real-time operating systems (RTOS), interrupt and exception management typically includes functions such as registering and unregistering interrupt service routines, enabling and masking interrupts, and installing exception handlers, so that the system can enter a predefined processing flow when an exception occurs. However, the prior art still has the following drawbacks in interrupt and exception management:

1. Lack of continuous monitoring of normal interrupt operation behavior. In existing RTOSs, interrupt service routines (ISRs) are generally regarded as the normal running path of the system, and their execution behavior is assumed by default to be "correct and controllable". The system generally does not continuously monitor and statistically analyze ISR performance characteristics such as execution time, trigger frequency, nesting depth, and execution jitter during operation. Therefore, when an ISR's execution time grows abnormally, its trigger frequency rises abnormally, or abnormal nesting occurs, the system has difficulty sensing the problem in time, and it often goes unnoticed until task scheduling is severely disrupted or the system crashes.

2. The exception detection mechanism is triggered by processor exceptions and is therefore after-the-fact. The exception handling mechanism in the prior art relies mainly on exceptions raised by the processor (such as an illegal instruction, an illegal memory access or a bus error). Such exceptions are triggered only after the system has already entered a critical error state, which is typical after-the-fact handling. Abnormal behavior of ordinary interrupts that does not trigger a processor exception, but still affects the real-time performance and stability of the system, is difficult for the prior art to identify and handle in time.

3. Lack of a unified monitoring and analysis framework covering ordinary interrupts and exceptions. A conventional RTOS generally manages ordinary interrupts and exception handling as two independent mechanisms: ordinary interrupts are responsible only for event response, exceptions are responsible only for error handling, and there is no unified monitoring and analysis framework between the two. Because of this fragmented design, the system cannot model, classify and locate interrupt operation behavior in a unified way, and cannot analyze the influence of interrupt anomalies on task scheduling and system state from the perspective of the whole system.

4. Exception handling is coarse-grained and lacks local automatic recovery capability. In the related art, when a system detects an exception, it typically prints the exception information and then resets or halts the system, at the cost of data loss. For a multitasking real-time system, the prior art has difficulty locally repairing the faulty interrupt or its related threads without interrupting the overall operation of the system.

5. Lack of analysis and prediction based on ISR behavioral characteristics. A conventional RTOS generally does not build a historical model of the running behavior of interrupt service routines, cannot analyze and predict abnormal trends from the statistical characteristics of interrupt execution behavior, and has difficulty taking targeted intervention measures before an anomaly occurs or in its early stage.

In a real-time operating system (RTOS), the interrupt mechanism is the basis for real-time response and peripheral event processing, and it is active throughout the whole life cycle of the system. The RTOS responds quickly to external events through interrupt service routines and resumes task scheduling after the interrupt returns. Normally, an interrupt is part of the normal running behavior of the system and is not equivalent to an exception. In a real running environment, however, an ordinary interrupt may also show abnormal execution time, abnormal trigger frequency or abnormal nesting depth during execution, owing to program defects, peripheral faults or changes in system load. At first, such problems usually neither trigger a processor exception nor cause an immediate system crash, but they gradually degrade the real-time performance of task scheduling and the overall stability of the system, and so constitute a latent operational risk. The management of common interrupts by existing RTOSs is typi