CN-122027452-A - Provider automatic failover and liveness probing method and system for a distributed API gateway


Abstract

The invention relates to the technical field of the Internet, and in particular to a provider automatic failover and liveness probing method and system for a distributed API gateway. The system comprises a distributed gateway cluster and a distributed coordination center. Each node of the distributed gateway cluster is provided with a local route cache module that stores provider state identifiers. The distributed coordination center is in communication connection with each gateway node, stores the global state identifiers, and provides atomic operations, distributed locks and message broadcasting. In the method, a node processing a service request queries its local cache and forwards the request if the primary provider is normal. When the response is abnormal, the node calls an atomic script of the coordination center to accumulate the error count within a sliding window; when the threshold is reached, the global state is updated to offline and synchronized to all nodes by broadcast. For liveness recovery, while the provider is offline the nodes elect a unique probing master node through a distributed lock, and only that node performs the probe; when the probe succeeds, the global state is updated to normal and synchronized by broadcast.

Inventors

MA YONGCHAO

Assignees

马永超

Dates

Publication Date: 20260512
Application Date: 20260303

Claims (10)

1. A provider failover and liveness probing system for a distributed API gateway, comprising: a distributed gateway cluster comprising a plurality of gateway nodes, wherein each gateway node is provided with a local route cache module for storing a real-time state identifier of each provider and for routing client requests to the corresponding provider interface based on the state identifier; and a distributed coordination center in communication connection with each gateway node in the distributed gateway cluster, configured to store and manage a global state identifier of each provider and to provide atomic operations, distributed locks and a message broadcasting service; wherein each gateway node is configured to: upon receiving a service request, if the local route cache module indicates that the state identifier of the primary provider is normal, forward the service request to the primary provider interface and determine from the returned result whether to trigger circuit breaking; if circuit breaking is triggered, invoke an atomic script of the distributed coordination center to atomically accumulate an error count within a sliding window and determine whether the circuit-breaking threshold has been reached; and when the circuit-breaking threshold is reached, the distributed coordination center updates the global state identifier of the primary provider to offline and notifies all gateway nodes, through a broadcast mechanism, to update their respective local route cache modules.
2. The provider failover and liveness probing system for a distributed API gateway of claim 1, wherein the distributed coordination center is further configured to: while the global state identifier of the primary provider is offline, respond to liveness probe requests triggered periodically by each gateway node, and determine a unique probing master node from among the competing nodes through a distributed mutual exclusion lock; and the probing master node is configured to initiate a single probe request to the primary provider interface, decide according to the probe result whether to update the global state identifier stored in the distributed coordination center to normal, and notify all gateway nodes through the broadcast mechanism to update their local route cache modules.
3. The provider failover and liveness probing system for a distributed API gateway of claim 2, wherein the distributed mutual exclusion lock is a lock with a time-to-live based on the Redis SETNX command, and a gateway node that fails to acquire the lock abandons the probe, so that only a single node performs the probing task in each probing period, eliminating the thundering herd effect in the distributed environment.
4. The provider failover and liveness probing system for a distributed API gateway of claim 1, wherein the distributed coordination center is Redis, the atomic script is a Lua script executed on the Redis server, and the script encapsulates reading the error count, accumulating it, comparing it against the threshold and updating the global state into a single atomic transaction unit, ensuring data consistency and ensuring that the circuit-breaking instruction is generated only once in high-concurrency scenarios.
5. The provider failover and liveness probing system for a distributed API gateway of claim 1, wherein the broadcast mechanism is a Redis-based publish/subscribe mode, and when the distributed coordination center updates the global state identifier, a state change message is immediately published to the channel subscribed to by all gateway nodes, achieving millisecond-level synchronization of state across the whole cluster.
6. The provider failover and liveness probing system for a distributed API gateway of claim 1, wherein the local route cache module is a data storage unit deployed in the local memory of each gateway node and configured to store the provider state identifiers synchronized from the distributed coordination center; each gateway node queries the local cache first when processing a service request, and updates the cache in real time upon receiving a broadcast message from the distributed coordination center, achieving high-performance routing and eventual consistency.
7. A provider failover and liveness probing method based on the distributed API gateway system of any one of claims 1-6, comprising the steps of: service request processing and circuit breaking: a gateway node receives a client request and queries the local route cache module to obtain the state identifier of the primary provider; if the state is normal, the gateway node forwards the request to the primary provider interface and determines from the returned result whether to trigger circuit breaking; when circuit breaking is triggered, the gateway node invokes the atomic script of the distributed coordination center to atomically accumulate the error count within a sliding window and determine whether the circuit-breaking threshold has been reached; if the threshold is reached, the distributed coordination center updates the global state identifier of the primary provider to offline and notifies all gateway nodes by broadcast to update their local route cache modules; and liveness probing and recovery: while the primary provider is in the offline state, each gateway node periodically triggers the liveness probing logic, and a unique probing master node is determined by competing for a distributed mutual exclusion lock; the probing master node initiates a single probe request to the primary provider interface and, if the probe succeeds, updates the global state identifier in the distributed coordination center to normal and notifies all gateway nodes by broadcast to update their local route cache modules, so that traffic is switched back automatically.
8. The provider failover and liveness probing method for a distributed API gateway of claim 7, wherein determining from the returned result whether to trigger circuit breaking comprises: determining whether the response of the primary provider interface has timed out or whether the returned HTTP status code is a 5xx server error; and if so, judging the response abnormal and triggering the circuit-breaking determination flow.
9. The provider failover and liveness probing method for a distributed API gateway of claim 7, wherein determining the unique probing master node by competing for the distributed mutual exclusion lock comprises: each gateway node sends a SETNX instruction to the distributed coordination center, attempting to write a lock key with a preset time-to-live; the first gateway node to write successfully becomes the probing master node for that probing period, and the other gateway nodes, whose writes fail, abandon the probing task.
10. The provider failover and liveness probing method for a distributed API gateway of claim 7, wherein after the distributed coordination center notifies all gateway nodes by broadcast to update their local route cache modules, each gateway node routes subsequent service requests to a provider interface whose state is normal according to the state identifiers in the updated local route cache module, completing a cluster-wide switch of traffic.
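
The circuit-breaking path recited in claims 1, 5, 7, 8 and 10 can be illustrated with a minimal Python sketch. It is an in-memory illustration under stated assumptions, not the claimed implementation: `CoordinationCenter` and `GatewayNode` are hypothetical names, the coordination center is a plain Python object rather than Redis, the broadcast is a list of callbacks rather than a publish/subscribe channel, and the threshold and window length are arbitrary example values. In the claimed system the body of `record_error` would have to execute atomically on the coordination center (claim 4 uses a Lua script for this).

```python
import time
from collections import deque


class CoordinationCenter:
    """Hypothetical in-memory stand-in for the distributed coordination center."""

    def __init__(self, threshold=3, window_seconds=10.0):
        self.threshold = threshold            # example value, not from the source
        self.window_seconds = window_seconds  # example value, not from the source
        self.errors = {}       # provider -> deque of error timestamps
        self.state = {}        # provider -> "normal" | "offline"
        self.subscribers = []  # callbacks standing in for subscribed channels

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def set_state(self, provider, state):
        # Update the global state identifier, then broadcast it to every node.
        self.state[provider] = state
        for notify in self.subscribers:
            notify(provider, state)

    def record_error(self, provider, now=None):
        """Accumulate one error, drop expired ones, compare with the threshold.

        Returns True only for the call that trips the breaker, so the
        offline transition is generated once.
        """
        now = time.monotonic() if now is None else now
        window = self.errors.setdefault(provider, deque())
        window.append(now)
        while window and window[0] <= now - self.window_seconds:
            window.popleft()
        is_normal = self.state.get(provider, "normal") == "normal"
        if len(window) >= self.threshold and is_normal:
            self.set_state(provider, "offline")
            return True
        return False


class GatewayNode:
    """Hypothetical gateway node with a local route cache."""

    def __init__(self, center, primary, backup):
        self.center, self.primary, self.backup = center, primary, backup
        self.local_cache = {}  # provider -> state, filled by broadcasts
        center.subscribe(self.on_broadcast)

    def on_broadcast(self, provider, state):
        self.local_cache[provider] = state

    @staticmethod
    def is_failure(status_code, timed_out):
        # Claim 8: a timeout or a 5xx status counts as abnormal.
        return timed_out or 500 <= status_code <= 599

    def route(self):
        # Only the local cache is consulted on the request path.
        if self.local_cache.get(self.primary, "normal") == "normal":
            return self.primary
        return self.backup

    def handle(self, call_provider, now=None):
        target = self.route()
        status_code, timed_out = call_provider(target)
        if target == self.primary and self.is_failure(status_code, timed_out):
            self.center.record_error(self.primary, now)
        return target, status_code
```

Because errors from every node accumulate in one shared window, three failures spread across two nodes trip the breaker in this sketch just as three failures on a single node would, and both nodes switch to the backup as soon as the broadcast callback runs.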

Description

Provider automatic failover and liveness probing method and system for a distributed API gateway

Technical Field

The invention relates to the technical field of the Internet, and in particular to a provider automatic failover and liveness probing method and system for a distributed API gateway.

Background

With the spread of microservice architecture, the API gateway serves as the system entry point and carries core functions such as request routing, protocol conversion and security verification. Particularly when aggregating multiple third-party AI service providers (e.g., DeepSeek, Tongyi Qianwen), the gateway needs high availability and fault tolerance so that traffic is automatically switched to a backup provider when a provider interface becomes unavailable.

Currently, mainstream gateway circuit-breaking and failover schemes (such as Netflix Hystrix or the default configuration of Sentinel) usually rely on single-node in-memory state statistics and independent probing. In a distributed environment, these schemes suffer from the following technical drawbacks:

High communication cost. The prior art typically uses an "active probing" mode, in which every gateway node in the cluster periodically (e.g., every 5 seconds) sends heartbeat packets or test requests to all provider interfaces. Since third-party interfaces are usually billed per call, multi-point, high-frequency active probing generates a large volume of charges unrelated to the business, sharply raising operating costs.

The "thundering herd effect" in a distributed environment. When a provider service suffers transient jitter or recovers from a failure, the gateway nodes, which operate independently and lack coordination, initiate probe requests to verify the service state almost simultaneously. This burst of concurrent probing not only consumes intranet bandwidth and gateway computing resources in an instant, but is also easily misjudged by the provider's firewall as a DDoS (distributed denial of service) attack, causing the gateway IP to be blocked and creating a greater availability risk.

Lagging and inconsistent cluster state awareness. Each gateway node judges the provider state solely from its local perspective (e.g., a local error count), with no centralized coordination mechanism. This leads to a "split cognition" phenomenon within the cluster: some nodes have already cut off the failed link while others are still trying to send requests to it, so clients receive highly unstable responses and the high-availability commitment of the service is broken.

Therefore, there is a need for a provider failover and liveness probing method and system for a distributed API gateway that reduces communication costs, eliminates the thundering herd effect, and ensures cluster state consistency.

Disclosure of Invention

In order to overcome the above-mentioned drawbacks of the prior art, the present invention provides a provider automatic failover and liveness probing method and system for a distributed API gateway, so as to solve the problems in the prior art.
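
The uncoordinated probing problem described in the background, and the lock-based election that claims 3 and 9 recite against it, can be illustrated with a minimal Python sketch. It rests on stated assumptions: `LockStore`, `set_if_absent` and `run_probe_round` are hypothetical names, the lock store is an in-memory dictionary standing in for a Redis key written with set-if-absent semantics and a time-to-live, and the nodes are simulated sequentially in one process, so the first node in the list always wins.

```python
import time


class LockStore:
    """Hypothetical in-memory stand-in for a key store with set-if-absent and expiry."""

    def __init__(self):
        self._locks = {}  # key -> (owner, expires_at)

    def set_if_absent(self, key, owner, ttl, now=None):
        """Write the key only if it is missing or expired; report success."""
        now = time.monotonic() if now is None else now
        current = self._locks.get(key)
        if current is not None and current[1] > now:
            return False  # another node still holds the lock
        self._locks[key] = (owner, now + ttl)
        return True


def run_probe_round(store, node_ids, provider, probe, ttl=5.0, now=None):
    """Simulate one timer tick on every node while the provider is offline.

    Each node competes for the lock; only the winner calls probe(provider).
    Returns (winner, recovered), or (None, False) if the lock was already held.
    """
    lock_key = f"probe-lock:{provider}"  # hypothetical key naming
    winner, recovered = None, False
    for node_id in node_ids:
        if store.set_if_absent(lock_key, node_id, ttl, now):
            winner = node_id
            recovered = probe(provider)  # the single probe request of this period
        # Nodes that fail to acquire the lock simply give up this round.
    return winner, recovered
```

With three nodes ticking in the same period, the provider receives one probe instead of three. The time-to-live ensures that a winner that crashes mid-probe does not block later rounds, since its lock simply expires.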
The invention provides a provider automatic failover and liveness probing system for a distributed API gateway, comprising: a distributed gateway cluster comprising a plurality of gateway nodes, wherein each gateway node is provided with a local route cache module for storing a real-time state identifier of each provider and for routing client requests to the corresponding provider interface based on the state identifier; and a distributed coordination center in communication connection with each gateway node in the distributed gateway cluster, configured to store and manage a global state identifier of each provider and to provide atomic operations, distributed locks and a message broadcasting service; wherein each gateway node is configured to: upon receiving a service request, if the local route cache module indicates that the state identifier of the primary provider is normal, forward the service request to the primary provider interface and determine from the returned result whether to trigger circuit breaking; if circuit breaking is triggered, invoke an atomic script of the distributed coordination center to atomically accumulate an error count within a sliding window and determine whether the circuit-breaking threshold has been reached; and when the circuit-breaking threshold is reached, the distributed coordination center updates the global state identifier of the primary provider to offline and notifies all gateway nodes, through a broadcast mechanism, to update their respective local route cache modules. Further, the distributed coordination center is further configured to: during the period that the global state of the primary