KR-20260066689-A - Parallel Computing System based on Dedicated Hardware Architecture for API Traffic Distribution
Abstract
The present invention discloses a parallel computing system based on a dedicated hardware architecture for API traffic distribution and processing. The system (100) includes a hardware-based dispatcher (110) that analyzes the protocol specification and data format of each API request in real time to classify its complexity, a plurality of API-dedicated processor cores (120) equipped with a serialization and deserialization instruction set (123), and a shared memory pool (130) connected via the CXL protocol to share API context data between cores at low latency. The dispatcher (110) monitors the real-time computational load and cache hit rate of each core as dual criteria and dynamically allocates API requests to the optimal core. The invention thereby eliminates the scheduling overhead of a software gateway, improves processing performance through a serialization-dedicated ISA, ensures session continuity through CXL-based context sharing, and removes cache cold-start penalties through preemptive context warm-up.
Inventors
- 안범주
Assignees
- 안범주
Dates
- Publication Date: 2026-05-12
- Application Date: 2026-04-24
Claims (19)
- A parallel computing system dedicated to API processing, comprising: a hardware-based dispatcher that classifies the complexity of each task by analyzing the protocol specification and data format of a received API request in real time; a plurality of API-dedicated processor cores that are allocated variable computational resources according to the task characteristics classified by the dispatcher and that include an instruction set specialized for serialization and deserialization of API requests; and a shared memory pool connected via the CXL (Compute Express Link) protocol so that API context data is shared among the plurality of API-dedicated processor cores with low latency, wherein the dispatcher monitors the real-time computational load and cache hit rate of each of the plurality of API-dedicated processor cores and dynamically schedules pending API requests to the optimal core. (An illustrative dual-criteria selection sketch follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that the dispatcher includes: a multi-factor complexity calculation circuit that computes complexity factors respectively quantifying the payload size, nesting depth, and authentication method type of a received API request; and a hardware classifier that generates a single complexity score by applying weights to the plurality of complexity factors and classifies the API request into one of the Light, Medium, or Heavy grades according to that score. (An illustrative scoring sketch follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that the dispatcher includes a speculative pre-dispatch circuit that parses the header fields of a received API request and, concurrently with that parsing, pre-selects a candidate core to process the request based on the parsed result, the speculative pre-dispatch circuit prefetching the relevant API context data from the shared memory pool into the cache of the candidate core before payload parsing is complete.
- A parallel computing system dedicated to API processing, characterized in that the dispatcher further includes a session affinity circuit that preferentially allocates consecutive API requests carrying the same client ID or session ID to the API-dedicated processor core that processed the previous request, the session affinity circuit switching the allocation to an alternative core only when the computational load of the preferred core exceeds a threshold and atomically transferring the relevant API context data through the shared memory pool during the switch. (An illustrative affinity sketch follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that the dispatcher includes a format-aware routing circuit that analyzes the payload of a received API request to identify its data format among JSON (JavaScript Object Notation), Protocol Buffer, MessagePack, and XML through a hardware pattern-matching circuit, and preferentially routes the deserialization instruction sequence corresponding to the identified data format to a core specialized in processing that format among the plurality of API-dedicated processor cores.
- A parallel computing system dedicated to API processing, characterized in that the plurality of API-dedicated processor cores form a heterogeneous structure including one or more low-power efficiency cores configured to process API requests classified into the Light grade and one or more high-throughput performance cores configured to process API requests classified into the Heavy grade, and that the dispatcher selectively allocates either an efficiency core or a performance core according to the complexity classification result.
- A parallel computing system dedicated to API processing, characterized in that the specialized instruction set includes: a VLI (Variable-Length Integer) instruction that encodes and decodes a variable-length integer within a single clock cycle; a Field Seek instruction that navigates to specific fields within a JSON or XML structure; and a vector serialization instruction that processes multiple serialization fields in parallel using SIMD (Single Instruction, Multiple Data). (A software baseline for VLI decoding follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that each of the plurality of API-dedicated processor cores includes a checkpoint recording circuit that generates a processing checkpoint containing the deserialization progress, current parsing position, and temporary accumulated data of the API request being processed and periodically writes it to the shared memory pool, and that the dispatcher, when a specific core fails, selects an alternative core to resume processing from the checkpoint stored in the shared memory pool, thereby ensuring uninterrupted API request processing.
- A parallel computing system dedicated to API processing, characterized in that each of the plurality of API-dedicated processor cores includes a response cache that stores the serialized results of processed API responses, and that the dispatcher includes an idempotency-aware cache-hit path that, when a received API request matches a previously processed request in URI (Uniform Resource Identifier), parameters, and request-body hash value, immediately returns the response data from the response cache of the corresponding core without repeating serialization. (An illustrative idempotency-key sketch follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that the shared memory pool is divided into a high-speed area (Hot Tier) storing frequently accessed, active API context data and a low-speed area (Cold Tier) storing infrequently accessed, idle API context data, and that the dispatcher includes a hierarchical memory management circuit that monitors the recent access time and access frequency of API context data and dynamically moves it between the two tiers. (An illustrative tiering policy follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that API context data stored in the shared memory pool is losslessly compressed by a hardware compression engine, and that when the plurality of API-dedicated processor cores read API context data from the shared memory pool, the hardware compression engine decompresses it in real time within the CXL transmission path, thereby reducing CXL link bandwidth consumption.
- A parallel computing system dedicated to API processing, characterized in that the shared memory pool is divided into a plurality of logical partitions by API endpoint type, and that each logical partition preferentially shares a CXL path with the set of API-dedicated processor cores dedicated to requests for the corresponding API endpoint, thereby blocking memory-access interference between endpoints.
- A parallel computing system dedicated to API processing, characterized in that the shared memory pool includes a hardware-based transaction coordination circuit that atomically merges the processing results of multiple API requests from the same client processed in parallel, the transaction coordination circuit maintaining coherency of context data between cores using snoop messages of the CXL.cache protocol.
- A parallel computing system dedicated to API processing, characterized in that the dispatcher further comprises a thermal-aware migration circuit that collects thermal information from the temperature sensor of each of the plurality of API-dedicated processor cores, stops allocating new API requests to a core whose temperature exceeds a thermal threshold, and transfers the API context data being processed by that core, through the shared memory pool, to a core with thermal headroom.
- A parallel computing system dedicated to API processing, characterized in that the dispatcher further comprises a predictive power management circuit that computes, in real time using a hardware statistics circuit, the moving average of API request complexity and the inter-arrival time distribution per client, predicts the complexity and volume of future API requests from these statistics, and, depending on the prediction, either pre-activates (wakes up) idle API-dedicated processor cores or switches surplus active cores into a power-gated state. (An illustrative moving-average sketch follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that the dispatcher includes a request coalescing circuit that groups multiple pending API requests sharing the same URI pattern or API schema into a single batch and assigns them sequentially to a single API-dedicated processor core, the request coalescing improving the instruction cache hit rate by keeping the deserialization instruction sequence for that schema loaded in the instruction cache of the corresponding core. (An illustrative coalescing sketch follows the claims list.)
- A parallel computing system dedicated to API processing, characterized in that the system further includes an inline anomaly detection module that analyzes the payload size, request frequency, and source IP of received API requests in real time using a hardware-based anomaly detection circuit and isolates API requests exhibiting abnormal traffic patterns at the hardware layer before they are allocated to the plurality of API-dedicated processor cores, the inline anomaly detection module being inserted into the allocation decision path of the dispatcher and operating without CPU intervention.
- A parallel computing system dedicated to API processing, characterized in that the shared memory pool includes an audit log region that immutably records, with timestamps, the processing history, response codes, and processing times of completed API requests, the audit log region being hardware-protected through write-once access control of the CXL protocol so that no API-dedicated processor core can modify the recorded data.
- A parallel computing system dedicated to API processing, characterized in that, when the cache hit rate of a specific core among the plurality of API-dedicated processor cores falls below a preset lower threshold, the dispatcher performs a proactive context warm-up that preemptively copies the relevant API context data from the shared memory pool into the cache lines of another core whose cache hit rate is above the threshold, without suspending the API requests currently allocated to the specific core, and preferentially allocates subsequent associated requests to the warmed-up core once the specific core's processing completes.
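Illustrative Sketches
The dual-criteria scheduling recited in claim 1 can be pictured in software. The sketch below is a minimal C model, not the patented hardware logic: the telemetry fields and the equal weighting of load and cache hit rate are illustrative assumptions.
```c
#include <stddef.h>

/* Per-core telemetry the dispatcher is assumed to sample continuously. */
struct core_stat {
    double load;           /* normalized computational load, 0.0..1.0 */
    double cache_hit_rate; /* recent cache hit rate, 0.0..1.0 */
};

/* Dual-criteria selection: favor low load and a high cache hit rate.
 * The 0.5/0.5 weighting is an illustrative choice, not claimed. */
size_t pick_core(const struct core_stat *cores, size_t n)
{
    size_t best = 0;
    double best_score = -1.0;
    for (size_t i = 0; i < n; i++) {
        double score = 0.5 * (1.0 - cores[i].load)
                     + 0.5 * cores[i].cache_hit_rate;
        if (score > best_score) {
            best_score = score;
            best = i;
        }
    }
    return best;
}
```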
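A minimal sketch of the multi-factor complexity scoring of claim 2. The specific weights, the auth-cost encoding, and the grade thresholds are hypothetical; the claim specifies only weighted factors and a three-grade classification.
```c
#include <stdint.h>

enum api_grade { GRADE_LIGHT, GRADE_MEDIUM, GRADE_HEAVY };

/* The three factors quantified by the complexity calculation circuit. */
struct complexity_factors {
    uint32_t payload_bytes;
    uint32_t nesting_depth;
    uint32_t auth_cost; /* e.g. 0 = none, 1 = API key, 2 = OAuth/JWT */
};

static uint32_t complexity_score(const struct complexity_factors *f)
{
    /* Weighted sum over the three factors (hypothetical weights). */
    return (f->payload_bytes / 1024) * 1
         + f->nesting_depth * 8
         + f->auth_cost * 16;
}

enum api_grade classify(const struct complexity_factors *f)
{
    uint32_t s = complexity_score(f);
    if (s < 32)  return GRADE_LIGHT;   /* thresholds are illustrative */
    if (s < 128) return GRADE_MEDIUM;
    return GRADE_HEAVY;
}
```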
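A sketch of the session affinity decision of claim 4, assuming a per-session table and a hypothetical load threshold; the atomic CXL context handover itself is only marked by a comment.
```c
#include <stdint.h>

#define NUM_CORES      8
#define LOAD_THRESHOLD 0.85 /* hypothetical switch-over threshold */

struct session_entry {
    uint64_t session_id;
    int      last_core; /* core that handled the previous request, -1 if none */
};

/* Keep a session on its previous core unless that core's load exceeds
 * the threshold; only then switch, at which point the hardware would
 * hand the API context over atomically through the CXL memory pool. */
int route_session(struct session_entry *e, const double load[NUM_CORES],
                  int least_loaded_core)
{
    if (e->last_core >= 0 && load[e->last_core] <= LOAD_THRESHOLD)
        return e->last_core; /* affinity preserved */

    /* Atomic context transfer through the shared memory pool occurs here. */
    e->last_core = least_loaded_core;
    return e->last_core;
}
```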
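Claim 7's VLI instruction collapses variable-length integer decoding into a single clock cycle. For contrast, the standard software loop (the LEB128-style varint used by Protocol Buffers) is shown below.
```c
#include <stddef.h>
#include <stdint.h>

/* Software baseline for variable-length integer (varint) decoding:
 * each byte carries 7 payload bits; a clear MSB marks the last byte.
 * A 64-bit varint occupies at most 10 bytes. */
size_t vli_decode(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    for (size_t i = 0; i < len && i < 10; i++) {
        value |= (uint64_t)(buf[i] & 0x7F) << (7 * i);
        if ((buf[i] & 0x80) == 0) { /* MSB clear: last byte */
            *out = value;
            return i + 1;           /* bytes consumed */
        }
    }
    return 0; /* truncated or over-long encoding */
}
```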
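A sketch of the idempotency match in claim 9: a single key hashed over URI, parameters, and body. FNV-1a is used here purely for brevity; the patent does not name a hash function.
```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* FNV-1a over a byte range, folded into a running hash. */
static uint64_t fnv1a(uint64_t h, const void *data, size_t len)
{
    const uint8_t *p = data;
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 0x100000001b3ULL; /* FNV-1a 64-bit prime */
    }
    return h;
}

/* One key over URI, parameters, and body: the three match criteria of
 * the idempotency-aware cache-hit path. */
uint64_t idempotency_key(const char *uri, const char *params,
                         const void *body, size_t body_len)
{
    uint64_t h = 0xcbf29ce484222325ULL; /* FNV-1a 64-bit offset basis */
    h = fnv1a(h, uri, strlen(uri));
    h = fnv1a(h, params, strlen(params));
    h = fnv1a(h, body, body_len);
    return h; /* used to index the per-core response cache */
}
```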
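A possible software rendering of claim 10's hot/cold tiering policy; the idle limit and access-count watermark are invented for illustration.
```c
#include <stdbool.h>
#include <stdint.h>

/* Recency and frequency metadata kept per context entry. */
struct ctx_meta {
    uint64_t last_access_ns;
    uint32_t access_count;  /* accesses within the current window */
    bool     in_hot_tier;
};

enum tier_action { TIER_STAY, TIER_PROMOTE, TIER_DEMOTE };

enum tier_action tier_decision(const struct ctx_meta *m, uint64_t now_ns)
{
    const uint64_t idle_limit_ns = 50ULL * 1000 * 1000; /* 50 ms, assumed */
    const uint32_t hot_count = 16;                      /* assumed */

    bool recently_used   = (now_ns - m->last_access_ns) < idle_limit_ns;
    bool frequently_used = m->access_count >= hot_count;

    if (m->in_hot_tier && !recently_used && !frequently_used)
        return TIER_DEMOTE;   /* move context to the cold tier */
    if (!m->in_hot_tier && (recently_used || frequently_used))
        return TIER_PROMOTE;  /* move context to the hot tier */
    return TIER_STAY;
}
```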
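Claim 15's per-client moving averages could be realized as exponentially weighted moving averages, as sketched below; the smoothing factor and the wake-up watermark are assumptions.
```c
#include <stdbool.h>

/* Per-client statistics; EWMA is one plausible hardware-friendly form
 * of the claimed moving average. */
struct client_stats {
    double avg_complexity;      /* EWMA of complexity scores */
    double avg_interarrival_us; /* EWMA of request inter-arrival time */
};

void update_stats(struct client_stats *s, double complexity,
                  double interarrival_us)
{
    const double alpha = 0.125; /* smoothing factor, illustrative */
    s->avg_complexity      += alpha * (complexity - s->avg_complexity);
    s->avg_interarrival_us += alpha * (interarrival_us - s->avg_interarrival_us);
}

/* Predicted demand = expected complexity per microsecond; cores are
 * woken above a high watermark (power-gating would use a low one).
 * Assumes the averages have been seeded with at least one sample. */
bool should_wake_idle_core(const struct client_stats *s)
{
    double demand = s->avg_complexity / s->avg_interarrival_us;
    return demand > 0.5; /* hypothetical high watermark */
}
```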
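A sketch of the request coalescing of claim 16, with the URI pattern abstracted into a schema ID; field names and the batch size are illustrative.
```c
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 32

/* Pending requests that share a schema are grouped into one batch so the
 * deserialization instruction sequence stays hot in one core's I-cache. */
struct pending_req { uint32_t schema_id; uint32_t req_id; };

/* Collect up to BATCH_MAX requests with the given schema into `batch`,
 * returning how many were coalesced. */
size_t coalesce(const struct pending_req *queue, size_t n,
                uint32_t schema_id, uint32_t batch[BATCH_MAX])
{
    size_t count = 0;
    for (size_t i = 0; i < n && count < BATCH_MAX; i++) {
        if (queue[i].schema_id == schema_id)
            batch[count++] = queue[i].req_id;
    }
    return count; /* the whole batch is assigned to a single core */
}
```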
Description
Parallel Computing System based on Dedicated Hardware Architecture for API Traffic Distribution

The present invention relates to a dedicated architecture for processing API (Application Programming Interface) requests received from a plurality of external clients in parallel at the hardware layer, and more specifically, to a system that provides a parallel computing environment dedicated to API processing, comprising a hardware-based dispatcher, a plurality of API-dedicated processor cores equipped with instruction sets dedicated to API serialization and deserialization, and a shared memory pool based on the CXL (Compute Express Link) protocol.

In modern cloud computing environments, it has become commonplace for hundreds of thousands of external clients to simultaneously send API requests in various protocol formats, such as REST, gRPC, and GraphQL. API gateways serve as entry points for this traffic, performing functions such as authentication, routing, serialization and deserialization, and load balancing. However, conventional API gateways operate as software stacks on general-purpose CPU cores, which imposes fundamental limits on latency, throughput, and energy efficiency.

The first problem with conventional software-based API gateways is serialization and deserialization overhead. Parsing API payloads encoded in formats such as JSON, Protocol Buffer, and MessagePack in software consumes a significant portion of CPU cycles. According to Google's internal research, the cycles spent on Protocol Buffer serialization and deserialization account for more than 3% of total fleet cycles, a waste amounting to hundreds of millions of dollars at data-center scale.

The second problem is the context-switching cost of general-purpose CPU cores. An API gateway running on a general-purpose operating system may be preempted by the OS scheduler in the middle of processing a request, triggering a context switch. The context data needed for API processing is then evicted from the cache, producing numerous cache misses when processing resumes. In particular, when multiple cores divide the processing of consecutive API requests from the same client session, context data is shared between cores via software message passing or shared memory access, adding a delay of several to tens of microseconds.

The third problem is the accuracy of load-balancing decisions. Conventional load balancers based on Round-Robin or Least-Connection algorithms do not consider the processing complexity of individual API requests; a simple health-check request and a complex transaction carrying a large payload are distributed with equal weight, so load becomes concentrated on specific cores.

Conventional API acceleration techniques such as FPGA-based API offloading and network-layer acceleration using SmartNICs have been studied, but they incur high data-movement overhead across the PCIe interface and lack a mechanism for sharing context data between cores at low latency, which limits their performance in stateful API processing scenarios.
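The third problem can be made concrete. If a per-request cost estimate is available (which is what the dispatcher of the present invention computes in hardware), least-outstanding-cost assignment avoids the bias of round-robin. The C below is an illustrative software comparison, not part of the disclosure.
```c
#include <stddef.h>

/* Round-robin ignores per-request cost, so a heavy transaction and a
 * trivial health check count the same. */
size_t assign_round_robin(size_t *next, size_t num_cores)
{
    size_t core = *next;
    *next = (*next + 1) % num_cores;
    return core;
}

/* Cost-aware assignment sends each request to the core with the least
 * outstanding estimated work, as a complexity-aware dispatcher would. */
size_t assign_cost_aware(double *pending_cost, size_t num_cores,
                         double request_cost)
{
    size_t best = 0;
    for (size_t i = 1; i < num_cores; i++)
        if (pending_cost[i] < pending_cost[best])
            best = i;
    pending_cost[best] += request_cost; /* book the new work */
    return best;
}
```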
Furthermore, a dedicated hardware architecture integrating API-complexity-based hardware-level scheduling, dedicated processor cores equipped with a serialization/deserialization instruction set, and a CXL-based memory pool for inter-core context sharing has not yet been proposed.

- FIG. 1 is an overall architecture block diagram of the parallel computing system (100) dedicated to API processing.
- FIG. 2 is an internal functional block diagram of the hardware dispatcher (110).
- FIG. 3 is a configuration diagram of the multi-factor complexity calculation circuit (111) and the hardware classifier (112).
- FIG. 4 is an operation timing diagram of the speculative pre-dispatch circuit (113).
- FIG. 5 is a flowchart of the session affinity circuit (114) and of the atomic context transfer during core switching.
- FIG. 6 is a configuration diagram of the format-aware routing circuit (115).
- FIG. 7 is a diagram showing the arrangement of efficiency cores (121) and performance cores (122) in the heterogeneous core structure.
- FIG. 8 is a diagram of the API-dedicated specialized instruction set (123).
- FIG. 9 is a flowchart of the checkpoint recording circuit (124) and the fault-recovery operation.
- FIG. 10 is a block diagram of the hit path of the idempotency-based response cache (125).
- FIG. 11 is a diagram showing the configuration of the high-speed area (131) and low-speed area (132) of the hierarchical CXL shared memory pool (130).
- FIG. 12 is a configuration diagram of the inline hardware compression engine (134) within the CXL transmission path.
- FIG. 13 is a mapping diagram of logical partitions (135) and core sets by API endpoint.
- FIG. 14 is a sequence