CN-121984836-A - Router system service abnormity self-healing method based on cloud AI
Abstract
The application belongs to the technical field of communication and in particular relates to a cloud AI-based method for self-healing router system service anomalies. The method comprises: deploying an anomaly detection module at the router end, which collects system logs and state data in real time and generates a standardized anomaly report; uploading the report to a cloud AI server over an encrypted MQTT protocol; at the cloud, invoking a hybrid analysis model composed of a rule matching engine, a machine learning classifier and a reinforcement learning decision network to generate a self-healing policy instruction packet; at the router end, receiving and executing the policy to complete operations such as service restart, configuration rollback or hot patch loading, and verifying the self-healing effect; and, when a communication interruption exceeds a threshold, automatically activating an embedded loopback self-healing subsystem that handles anomalies independently based on a local policy library. With this technical scheme, millisecond-level anomaly response and autonomous recovery with a high success rate can be achieved, network availability and service continuity are significantly improved, and disaster-recovery self-healing capability is retained when the connection to the cloud is lost.
Inventors
- LIU MINGBO
- ZHOU LONG
- CHEN BEI
Assignees
- 成都飞鱼星科技股份有限公司
Dates
- Publication Date
- 20260505
- Application Date
- 20260202
Claims (10)
- 1. A cloud AI-based method for self-healing router system service anomalies, characterized by comprising the following steps: Step 1, an anomaly detection module at the router end detects system service anomalies in real time and generates a standardized anomaly report; Step 2, an anomaly reporting module at the router end reports the anomaly report to a cloud AI server through a secure transmission channel; Step 3, after receiving the anomaly report, the cloud AI server invokes an artificial intelligence analysis model to perform root cause diagnosis of the anomaly; Step 4, the cloud AI server generates a self-healing policy instruction adapted to the operating environment of the router end and sends the self-healing policy instruction to the router end through the secure channel; Step 5, a self-healing implementation module at the router end receives and executes the self-healing policy instruction, completes the repair of the anomalous service, and feeds the execution result back to the cloud AI server; and Step 6, a communication state monitoring unit at the router end continuously checks link connectivity with the cloud AI server, automatically activates a loopback self-healing subsystem when a communication anomaly is detected, and independently completes the diagnosis, handling and effect verification of subsequent service anomalies based on a pre-synchronized local policy library and a lightweight decision engine.
- 2. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the anomaly detection module works in a mode that combines active scanning with passive monitoring, specifically: an active scanning unit polls system kernel logs, service process liveness, hardware resource utilization and network session table entries at a configurable period; a passive monitoring unit captures system kernel events, application error signals and abnormal network protocol messages in real time; and a dynamic threshold is updated periodically based on statistical characteristics of historical operating data, and the sliding time window length is configured differently according to service importance level.
- 3. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the standardized anomaly report uses a hierarchical field design, comprising: an anomaly type identifier that uses a two-level coding scheme of main category and subtype, wherein the main categories cover service crash, resource overload, configuration conflict and protocol anomaly; anomaly feature parameters comprising a timestamp of the anomaly occurrence, a quantized anomaly metric value, an anomaly duration and an associated service identifier; and device environment information comprising the router hardware model, firmware version number, current load level, number of network topology connections and the most recent configuration change record.
- 4. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the secure transmission channel is built on an encrypted Message Queuing Telemetry Transport (MQTT) protocol, specifically: the MQTT protocol is the preferred transport protocol, and when the MQTT connection is abnormal, transmission automatically switches to the HTTPS protocol to provide redundant transmission; an end-to-end encryption mechanism protects the anomaly report, the encryption algorithm supports a Chinese national cryptographic algorithm or a mainstream symmetric encryption algorithm, keys are managed by combining device factory presets with dynamic negotiation with the cloud, and the key update period is dynamically adjusted based on the security level; and anomaly reports that fail to be reported are stored in a non-volatile storage medium at the router end, and resumable transmission after network recovery is supported.
- 5. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the artificial intelligence analysis model is a three-stage cascaded hybrid model comprising, in order: a rule matching engine with a built-in anomaly handling rule base constructed from expert experience, wherein the rules use a condition-to-action mapping structure, the matching response delay is kept within a preset threshold, and a standard self-healing policy is output directly when matching succeeds; a machine learning classifier trained on a labeled anomaly sample data set and built with an ensemble learning framework, which takes a high-dimensional anomaly feature vector as input, outputs a root cause category and a confidence score, and triggers generation of the corresponding policy when the confidence reaches a set threshold; and a reinforcement learning decision network that models the anomaly handling scenario as a Markov decision process, wherein the state space comprises service health, resource constraints and service priority weights, the action space defines a standardized set of self-healing operations, the reward function optimizes the balance between self-healing success rate and service interruption cost, and the model is deployed and run after offline training converges.
- 6. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the self-healing policy instruction is a structured data message encapsulated in a unified data exchange format, specifically comprising: mandatory fields, namely a globally unique policy identifier, an exact identifier of the target service, an operation instruction sequence, an execution timeout threshold, a rollback plan index and a digital signature field; and optional fields, namely an execution priority identifier, environment adaptation parameters, resource reservation configuration and an effect verification threshold; wherein the instruction message supports a version compatibility mechanism to ensure compatibility with routers running different firmware versions.
- 7. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the self-healing implementation module starts a dual-channel cooperative verification mechanism after executing a self-healing policy instruction: a first verification channel checks the liveness and responsiveness of the target service process by periodically sending heartbeat probes, and the first verification is deemed passed when multiple consecutive probe results meet the normal criteria; a second verification channel collects key network quality-of-service metrics in real time, including packet loss rate, delay jitter, throughput and connection establishment success rate, and the second verification is deemed passed when the metric values remain stably within a preset normal range; and self-healing is judged successful only when both the first verification and the second verification pass, otherwise a secondary handling process is triggered, the secondary handling process comprising policy adjustment or alarm reporting.
- 8. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the communication state monitoring unit is based on a multi-dimensional signal fusion determination mechanism, specifically: the monitored dimensions cover at least the number of consecutive lost heartbeat packets, the frequency of domain name resolution failures and the network time synchronization offset; an independent determination threshold and a duration threshold are set for each monitored dimension, and a communication anomaly is determined when any dimension meets its threshold condition continuously for the set duration; and after a communication anomaly is determined, a link recovery retry mechanism is started, and the loopback self-healing subsystem is formally activated after the retries fail.
- 9. The cloud AI-based method for self-healing router system service anomalies according to claim 1, wherein the loopback self-healing subsystem is scheduled and managed by a finite state machine controller, specifically: the finite state machine comprises five operating states, namely monitoring, diagnosis, decision, execution and verification, state transitions are driven by an event triggering mechanism, and the transition logic is preset in the controller; the local policy library stores high-success-rate self-healing policies verified by the cloud AI server together with their associated anomaly feature fingerprints, is updated from the cloud AI server through an incremental synchronization mechanism, and the synchronization process includes data integrity verification; and the local lightweight decision engine uses a simplified version of the hybrid analysis model architecture, matches against the local policy library first, executes a preset basic self-healing procedure when no match is found, and stores all operation logs in a local circular log buffer.
- 10. The method according to any one of claims 1 to 9, wherein the method enables access by heterogeneous routers from multiple vendors through an abstract device adaptation layer, specifically: the device adaptation layer has a modular design and provides a unified, standardized set of API interfaces covering four functional domains, namely service control, configuration read/write, log extraction and performance monitoring; the interfaces support both synchronous calls and asynchronous callbacks and accommodate differences in the hardware drivers and firmware interfaces of different routers; and the adaptation layer has a built-in compatibility adaptation module that supports the hardware specifications and operating system versions of mainstream router manufacturers and can be made compatible with new device models through extension adaptation plug-ins.
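The three-stage cascade recited in claim 5 (rule matching first, then a classifier gated by a confidence threshold, then a reinforcement learning decision) can be illustrated with a minimal Python sketch. This is not the applicant's implementation. All names (AnomalyReport, Policy, RULES, ROOT_CAUSE_ACTIONS, diagnose), the example rules and the 0.8 threshold are assumptions made for illustration, and the classifier and decision network are stand-in callables rather than trained models.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class AnomalyReport:
    main_category: str   # e.g. "service_crash", "resource_overload"
    subtype: str         # second-level code under the main category
    metric_value: float  # quantized anomaly metric
    service_id: str      # associated service identifier


@dataclass
class Policy:
    action: str  # e.g. "restart_service", "rollback_config"
    source: str  # which cascade stage produced the policy


# Stage 1: condition -> action rules (hypothetical entries).
RULES: List[Tuple[Callable[[AnomalyReport], bool], str]] = [
    (lambda r: r.main_category == "service_crash", "restart_service"),
    (lambda r: r.main_category == "config_conflict", "rollback_config"),
]

# Stage 2: mapping from root cause category to action (hypothetical).
ROOT_CAUSE_ACTIONS = {
    "memory_leak": "restart_service",
    "faulty_module": "load_hot_patch",
}


def rule_stage(report: AnomalyReport) -> Optional[Policy]:
    """Return the action of the first matching rule, if any."""
    for condition, action in RULES:
        if condition(report):
            return Policy(action, "rule")
    return None


def classifier_stage(report, classify, threshold: float = 0.8) -> Optional[Policy]:
    """Use the classifier result only if its confidence reaches the threshold."""
    root_cause, confidence = classify(report)
    action = ROOT_CAUSE_ACTIONS.get(root_cause)
    if confidence >= threshold and action is not None:
        return Policy(action, "classifier")
    return None


def diagnose(report, classify, rl_choose_action) -> Policy:
    """Run the stages in order and fall through when a stage yields nothing."""
    return (
        rule_stage(report)
        or classifier_stage(report, classify)
        or Policy(rl_choose_action(report), "rl")
    )
```

In this sketch a service crash is resolved by the rule stage alone, a resource overload with a confident classification is resolved by the second stage, and anything else falls through to the decision callable. The later stages are only consulted when the earlier, cheaper ones produce no policy.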
Description
Router system service abnormity self-healing method based on cloud AI
Technical Field
The invention belongs to the technical field of communication, and particularly relates to a cloud AI-based method for self-healing router system service anomalies.
Background
As the Internet of Everything era develops, network infrastructure grows ever more complex. Especially in medium and large-scale network environments, access terminals are diverse, service types are varied and topologies are dynamic, which poses unprecedented challenges to the functional completeness, cooperative response capability and operational stability of core network equipment such as routers. Modern routers have evolved from single-purpose data forwarding devices into intelligent gateways integrating security, QoS, service orchestration and other functions, with a high degree of coupling and interdependence among their service modules. Although manufacturers try to cover typical usage scenarios through multiple rounds of simulation testing during research and development, actual deployment environments vary widely and it is difficult to exhaust all potential anomaly combinations, so system-level faults may still arise in operation from service conflicts, configuration drift or resource contention. The cloud AI-based router system service anomaly self-healing method focuses on building a closed-loop autonomous architecture of edge detection, cloud decision and local execution, and aims to break the traditional operation and maintenance model's heavy dependence on manual intervention.
In the prior art, once a router service becomes abnormal, professional operation and maintenance personnel must step in to perform log analysis, path tracing and policy adjustment. The process is time-consuming, slow to respond and highly demanding of personnel experience. Even where some equipment has a basic alarm function, it can only report problems and lacks autonomous diagnosis and repair capability. More critically, when an anomaly interrupts the device's communication with the remote management platform, existing solutions tend to lose their handling capability entirely, resulting in prolonged service interruption. The prior art commonly suffers from coarse-grained anomaly perception, statically fixed analysis logic and a lack of intelligent adaptation in self-healing policies. On the one hand, local detection mechanisms rely on preset thresholds or simple rules and have difficulty identifying cross-service, low-frequency, high-risk composite anomalies; on the other hand, if the cloud does not introduce an AI engine capable of continuous learning, it cannot effectively reason about unknown fault modes or generate policies for them. In addition, the lack of standardized anomaly descriptions and policy execution interfaces makes it difficult for different types of equipment to share a unified self-healing knowledge base, which severely restricts operation and maintenance efficiency in large-scale deployments. Therefore, as the demand for highly available networks grows increasingly urgent, there is a need for a router system service anomaly self-healing method that integrates real-time edge sensing, intelligent cloud decision-making and reliable local execution, so as to achieve millisecond-level response, high-success-rate recovery and autonomous disaster recovery capability in communication interruption scenarios.
Disclosure of Invention
The invention aims to provide a cloud AI-based router system service anomaly self-healing method, so as to solve the problems in the prior art that, when network equipment suffers system-level faults caused by service conflicts, configuration drift or resource contention in a complex operating environment, reliance on manual operation and maintenance leads to slow response, difficult diagnosis and low repair efficiency, and handling capability is lost entirely in communication interruption scenarios. To solve the above technical problems, the invention adopts the following technical scheme: a cloud AI-based router system service anomaly self-healing method, comprising the following steps: Step 1, an anomaly detection module at the router end detects system service anomalies in real time and generates a standardized anomaly report; Step 2, an anomaly reporting module at the router end reports the anomaly report to a cloud AI server through a secure transmission channel; Step 3, after receiving the anomaly report, the cloud AI server invokes an artificial intelligence analysis model to perform root cause diagnosis of the anomaly; Step 4, the cloud AI server generates a self-healing policy instruction adapted to