CN-122027450-A - Predictive recovery method and system for cloud native service chain


Abstract

A predictive recovery method for a cloud-native service chain comprises: receiving and parsing a service chain requirement configuration submitted by a tenant, and generating corresponding service chain flow table rules; instantiating, based on the service chain flow table rules, a high-availability unit for each service node in the service chain, wherein the high-availability unit comprises a main container, a Sidecar container and a Mirror container; inputting operation time-series metrics collected by the Sidecar container into a pre-trained hybrid prediction model to obtain an anomaly prediction result, wherein the hybrid prediction model outputs the anomaly prediction result based on feature screening and time-series pattern analysis; and, if the anomaly prediction result indicates that the main container is at risk of an anomaly, isolating the main container, promoting the Mirror container to be the new main container, and updating the service chain flow table to switch service traffic to the new main container. The method can achieve high availability, fault prediction and seamless switching of the service chain in a cloud-native environment.
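The recovery step summarized above (isolate the main container, promote the Mirror container, repoint the flow table) can be illustrated with a minimal Python sketch. The class names, the `recover_if_at_risk` function and the in-memory flow table are hypothetical and are not taken from the application. The default threshold of 0.75 is an assumed value inside the 0.65 to 0.85 range recited in claim 4. A real deployment would drive Kubernetes and Kube-OVN rather than update a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HighAvailabilityUnit:
    """One service node with its main container and Mirror standby (hypothetical names)."""
    node: str
    main: str
    mirror: Optional[str]


@dataclass
class ServiceChain:
    """Toy stand-in for service chain state; not a Kube-OVN API."""
    units: List[HighAvailabilityUnit]
    # Maps each service node to the container that currently receives its traffic.
    flow_table: Dict[str, str] = field(default_factory=dict)
    # Containers taken out of service after a predicted anomaly.
    isolated: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        # Initially every node's traffic goes to its main container.
        for unit in self.units:
            self.flow_table.setdefault(unit.node, unit.main)


def recover_if_at_risk(
    chain: ServiceChain,
    unit: HighAvailabilityUnit,
    anomaly_probability: float,
    threshold: float = 0.75,  # assumed value within the claimed 0.65-0.85 range
) -> bool:
    """Fail over to the Mirror container when the predicted probability exceeds the threshold."""
    if anomaly_probability <= threshold or unit.mirror is None:
        return False  # no predicted risk, or no standby available to promote
    chain.isolated.append(unit.main)             # isolate the at-risk main container
    unit.main, unit.mirror = unit.mirror, None   # promote the Mirror container
    chain.flow_table[unit.node] = unit.main      # switch traffic to the new main container
    return True
```

In this sketch a probability of 0.90 for a unit with a Mirror container triggers the switch, while 0.60 leaves the flow table untouched.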

Inventors

XIANG JUNCHENG
ZHANG YAN
PAN HENG
ZHENG ZIHAO

Assignees

Computer Network Information Center, Chinese Academy of Sciences

Dates

Publication Date: 20260512
Application Date: 20260203

Claims (10)

1. A predictive recovery method for a cloud-native service chain, the method comprising: receiving and parsing a service chain requirement configuration submitted by a tenant, and generating corresponding service chain flow table rules; instantiating, in a Kubernetes environment and based on the service chain flow table rules, a high-availability unit for each service node in the service chain, wherein the high-availability unit comprises a main container, a Sidecar container and a Mirror container, the main container is used for executing business logic, the Sidecar container is used for collecting operation state metrics of the main container, and the Mirror container serves as a state-holding container for the main container and synchronously maintains key state information related to service chain continuity while the main container runs; collecting operation time-series metrics of the main container through the Sidecar container; inputting the operation time-series metrics into a pre-trained hybrid prediction model to obtain an anomaly prediction result, wherein the hybrid prediction model outputs the anomaly prediction result based on feature screening and time-series pattern analysis; and, if the anomaly prediction result indicates that the main container is at risk of an anomaly, isolating the main container, promoting the Mirror container to be the new main container, and updating the service chain flow table to switch service traffic to the new main container.
2. The method of claim 1, wherein the hybrid prediction model comprises a cascaded LightGBM model and one-dimensional convolutional neural network model, and wherein inputting the operation time-series metrics into the pre-trained hybrid prediction model comprises: performing feature importance screening on the input operation time-series metrics using the LightGBM model to obtain a key feature subset; inputting the key feature subset into the one-dimensional convolutional neural network model and performing a temporal convolution operation to extract anomaly pattern features; and outputting the anomaly prediction result based on the anomaly pattern features, wherein the anomaly prediction result comprises an anomaly probability, an anomaly type and a predicted time window in which the anomaly will occur.
3. The method of claim 2, wherein the anomaly type comprises at least one of CPU overload, memory leak, performance degradation and service blocking.
4. The method according to claim 1, wherein determining that the anomaly prediction result indicates that the main container is at risk of an anomaly comprises: presetting an anomaly probability threshold, and determining that an anomaly risk exists when the output anomaly probability exceeds the anomaly probability threshold, wherein the anomaly probability threshold is a configurable value in the range of 0.65 to 0.85.
5. The method of claim 1, wherein updating the service chain flow table to switch service traffic to the new main container comprises: updating the flow table rules of the data plane as an atomic operation by calling the Kube-OVN network control interface; and adopting a strategy of configuring the new path first and only then switching traffic over to it, so that no traffic is lost or reordered while the flow table is updated.
6. The method according to claim 1, wherein the method further comprises: placing the isolated main container into an isolation zone, wherein the isolation zone uses a stack data structure to manage isolated nodes; periodically polling the health status of the containers in the isolation zone; and, if a container has recovered to a healthy state, re-associating it with the service chain and redeploying it as a Mirror container or according to resource policies.
7. The method according to claim 1, wherein the method further comprises: when a change in the service chain requirement configuration is detected, comparing the flow table rules derived from the new and old configurations to determine their differences; if the difference is a local adjustment, performing an incremental update that modifies only the affected flow table rules and container configuration; and, if the difference involves topology reconstruction, performing a full update that replaces all flow table rules and readjusts the container deployment.
8. The method of claim 1, wherein the Sidecar container collects the operation time-series metrics of the main container in a manner that is non-intrusive to the business logic of the main container.
9. The method of claim 1, wherein the operation time-series metrics include CPU, memory, disk I/O, network latency and service response time.
10. A predictive recovery system for a cloud-native service chain, comprising: a service chain deployment module, configured to receive and parse a service chain requirement configuration submitted by a tenant, generate corresponding service chain flow table rules, and instantiate, in a Kubernetes environment and based on the service chain flow table rules, a high-availability unit for each service node in the service chain, wherein the high-availability unit comprises a main container, a Sidecar container and a Mirror container, the main container is used for executing business logic, the Sidecar container is used for collecting operation state metrics of the main container, and the Mirror container serves as a state-holding container for the main container and synchronously maintains key state information related to service chain continuity while the main container runs; a service chain prediction module, configured to input the operation time-series metrics of the main container collected through the Sidecar container into a pre-trained hybrid prediction model to obtain an anomaly prediction result, wherein the hybrid prediction model outputs the anomaly prediction result based on feature screening and time-series pattern analysis; and a change and recovery module, configured to, if the anomaly prediction result indicates that the main container is at risk of an anomaly, isolate the main container, promote the Mirror container to be the new main container, and update the service chain flow table to switch service traffic to the new main container.
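The choice between incremental and full updates described in claim 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `plan_update` and the rule layout (one dictionary entry per service node, kept in chain order) are hypothetical. The test for topology reconstruction (the ordered list of nodes differs) is an assumed criterion, since the claims do not define one.

```python
from typing import Dict, Tuple

# Hypothetical rule layout: service node name -> rule attributes, kept in chain order.
Rules = Dict[str, dict]


def plan_update(old_rules: Rules, new_rules: Rules) -> Tuple[str, Rules]:
    """Decide how to apply a configuration change.

    Returns ("full", all new rules) when the chain topology changed, otherwise
    ("incremental", only the rules whose attributes differ).
    """
    # Assumed criterion: the topology is rebuilt if nodes were added, removed or reordered.
    if list(old_rules) != list(new_rules):
        return "full", dict(new_rules)
    # Same nodes in the same order: keep only the rules that actually changed.
    changed = {
        node: rule for node, rule in new_rules.items() if old_rules[node] != rule
    }
    return "incremental", changed
```

With this criterion, changing a QoS attribute on one node yields an incremental plan containing only that node's rule, while inserting or reordering nodes yields a full plan.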

Description

Predictive recovery method and system for cloud native service chain

Technical Field

The invention relates to the technical field of service chains, and in particular to a predictive recovery method and system for a cloud-native service chain.

Background

With the maturation of cloud-native architecture and the development of Network Function Virtualization (NFV), virtual network functions (VNFs) are increasingly deployed on general-purpose computing platforms to replace network service capabilities that traditionally relied on dedicated hardware. The Service Function Chain (SFC), an important component of NFV, organizes multiple VNFs with different processing capabilities into a chain in a preset order, so that service traffic passes through each function node in turn according to the service policy, thereby realizing diverse network functions such as firewalling, security detection, load balancing and acceleration.

In cloud-native environments, Kubernetes has become the de facto standard for container orchestration. To meet the needs of deploying large-scale network services on Kubernetes, the industry has proposed a variety of service chain solutions based on the Container Network Interface (CNI). Kube-OVN, a cloud-native network system based on OVN/OVS, introduces capabilities such as virtual private cloud (VPC), subnet management, access control lists (ACLs) and QoS management, enabling Kubernetes to carry more complex multi-tenant network service architectures. As OVN's data plane capabilities have been enhanced, its control plane model extended and its programmable logic improved, OVN's support for SFC has gradually matured, giving Kube-OVN a preliminary technical foundation for carrying service chain workloads in a Kubernetes environment.
However, existing service chain implementations generally suffer from complex deployment, demanding requirements on the underlying network, high system resource consumption and difficulty achieving observability at scale, and most of them lack fault prediction capability.

Disclosure of Invention

To solve the problems in the prior art, embodiments of the present application provide a predictive recovery method and system for a cloud-native service chain, as well as a computing device, a computer storage medium and a product containing a computer program, which can achieve high availability, fault prediction and seamless switching of the service chain in a cloud-native environment.

In a first aspect, an embodiment of the present application provides a predictive recovery method for a cloud-native service chain, comprising: receiving and parsing a service chain requirement configuration submitted by a tenant, and generating corresponding service chain flow table rules; instantiating, in a Kubernetes environment and based on the service chain flow table rules, a high-availability unit for each service node in the service chain, wherein the high-availability unit comprises a main container, a Sidecar container and a Mirror container, the main container is used for executing business logic, the Sidecar container is used for collecting operation state metrics of the main container, and the Mirror container serves as a state-holding container for the main container and synchronously maintains key state information related to service chain continuity while the main container runs; collecting operation time-series metrics of the main container through the Sidecar container; inputting the operation time-series metrics into a pre-trained hybrid prediction model to obtain an anomaly prediction result, wherein the hybrid prediction model outputs the anomaly prediction result based on feature screening and time-series pattern analysis; and, if the anomaly prediction result indicates that the main container is at risk of an anomaly, isolating the main container, promoting the Mirror container to be the new main container, and updating the service chain flow table to switch service traffic to the new main container.

In some possible implementations, the hybrid prediction model comprises a cascaded LightGBM model and one-dimensional convolutional neural network model, and inputting the operation time-series metrics into the pre-trained hybrid prediction model comprises: performing feature importance screening on the input operation time-series metrics using the LightGBM model to obtain a key feature subset; inputting the key feature subset into the one-dimensional convolutional neural network model and performing a temporal convolution operation to extract anomaly pattern features; and outputting the anomaly prediction result based on the anomaly pattern features, wherein the abnormal pr