EP-3846398-B1 - HYPERSCALAR PACKET PROCESSING
Inventors
- KADU, SACHIN PRABHAKARRAO
Dates
- Publication Date: 2026-05-06
- Application Date: 2020-11-19
Claims (15)
- A method comprising: receiving (411) a plurality of network packets from a plurality of data paths (110A-110H); arbitrating (412), based at least in part on an arbitration policy, the plurality of network packets to a plurality of packet processing blocks comprising one or more full processing blocks (150A, 150B) and one or more limited processing blocks (130A, 130B, 130C, 130D), each of the one or more limited processing blocks supporting a limited set of packet processing features; processing, in parallel, the plurality of network packets via the plurality of packet processing blocks, wherein each of the one or more full processing blocks (150A, 150B) processes a first quantity of network packets during a clock cycle, and wherein each of the one or more limited processing blocks (130A, 130B, 130C, 130D) processes during the clock cycle a second quantity of network packets that is greater than the first quantity of network packets; and sending the processed plurality of network packets through a plurality of data buses (414).
- The method of claim 1, wherein the processing corresponds to processing of a start and end of packet, SEOP, of the plurality of network packets.
- The method of claim 1 or 2, wherein arbitrating, based at least in part on the arbitration policy, the plurality of network packets comprises: for each respective packet of the plurality of network packets: determining a packet size for each respective packet; routing each respective packet to the one or more full processing blocks (150A, 150B) when the packet size exceeds a threshold packet size of the arbitration policy; and routing each respective packet to the one or more limited processing blocks (130A, 130B, 130C, 130D) when the packet size does not exceed the threshold packet size of the arbitration policy.
- The method of any of claims 1 to 3, wherein the one or more limited processing blocks (130A, 130B, 130C, 130D) are grouped into shared logical limited processing engines comprising at least two of the one or more limited processing blocks with shared logic and lookup hardware.
- The method of any of claims 1 to 4, wherein the arbitration comprises queuing ancillary operations into the plurality of packet processing blocks that are not processing any of the plurality of network packets.
- The method of any of claims 1 to 5, wherein the second quantity of network packets is at least twice the first quantity of network packets.
- The method of any of claims 1 to 6, wherein the first quantity of network packets is one.
- The method of any of claims 1 to 7, wherein arbitrating the plurality of network packets comprises arbitrating, for each of the one or more full processing blocks, into at least one first in first out, FIFO, queue.
- The method of any of claims 1 to 8, wherein the one or more full processing blocks (150A, 150B) are grouped into shared full processing engines comprising at least two of the one or more full processing blocks (150A, 150B) with shared circuitry.
- The method of claim 9, wherein each of the shared full processing engines utilizes an aggregated packet processing pipe with coherent data access structures.
- A system (500) comprising: a first shared bus (180A, 180B) with a first arbiter configured to receive a plurality of network packets from a plurality of data paths (110A-110H); a plurality of packet processing blocks comprising one or more full processing blocks (150A, 150B) and one or more limited processing blocks (130A, 130B, 130C, 130D), each of the one or more limited processing blocks supporting a limited set of packet processing features, wherein each of the one or more full processing blocks (150A, 150B) is configured to process a first quantity of network packets during a clock cycle, and wherein each of the one or more limited processing blocks (130A, 130B, 130C, 130D) is configured to process during the clock cycle a second quantity of network packets that is greater than the first quantity of network packets; wherein the first arbiter is configured to arbitrate, based at least in part on an arbitration policy, the plurality of network packets for processing in parallel by the plurality of packet processing blocks; and a second shared bus with a second arbiter configured to arbitrate, based at least in part on the arbitration policy, the processed plurality of network packets to a plurality of data buses.
- The system of claim 11, wherein the plurality of packet processing blocks is configured to process a start and end of packet, SEOP, of the plurality of network packets.
- The system of claim 11 or 12, wherein the first arbiter is configured to: for each respective packet of the plurality of network packets: determine a packet size for each respective packet; route each respective packet to the one or more full processing blocks (150A, 150B) when the packet size exceeds a threshold packet size of the arbitration policy; and route each respective packet to the one or more limited processing blocks (130A, 130B, 130C, 130D) when the packet size does not exceed the threshold packet size of the arbitration policy.
- The system of any of claims 11 to 13, comprising at least one of the following features: wherein the one or more limited processing blocks (130A, 130B, 130C, 130D) are grouped into shared logical limited processing engines comprising at least two of the one or more limited processing blocks (130A, 130B, 130C, 130D) with shared logic and lookup hardware, and wherein the one or more full processing blocks (150A, 150B) are grouped into shared logical full processing engines comprising at least two of the one or more full processing blocks (150A, 150B) with shared logic and lookup hardware, wherein in particular each of the shared full processing engines comprises an aggregated packet processing pipe with coherent data access structures; wherein the first arbiter is configured to queue ancillary operations into the plurality of packet processing blocks that are not processing any of the plurality of network packets; wherein the second quantity of network packets is at least twice the first quantity of network packets; wherein the first quantity of network packets is one; wherein each of the one or more full processing blocks (150A, 150B) is further coupled to at least one first in first out, FIFO, queue.
- A non-transitory storage medium comprising instructions that, when read by one or more processors, cause the one or more processors to perform a method comprising: receiving a plurality of network packets from a plurality of data paths (110A-110H); arbitrating, based at least in part on an arbitration policy, the plurality of network packets to a plurality of packet processing blocks comprising one or more full processing blocks (150A, 150B) and one or more limited processing blocks (130A, 130B, 130C, 130D), each of the one or more limited processing blocks supporting a limited set of packet processing features; processing, in parallel, the plurality of network packets via the plurality of packet processing blocks, wherein each of the one or more full processing blocks (150A, 150B) processes a first quantity of network packets during a clock cycle, and wherein each of the one or more limited processing blocks (130A, 130B, 130C, 130D) processes during the clock cycle a second quantity of network packets that is greater than the first quantity of network packets; and sending the processed plurality of network packets through a plurality of data buses.
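The size-threshold arbitration recited in the claims (route a packet to a full processing block when its size exceeds a threshold, otherwise to a limited processing block) can be sketched as follows. This is an illustrative sketch only; the function name, threshold value, and use of byte strings as packets are assumptions, not details from the patent.

```python
# Illustrative sketch of the claimed size-threshold arbitration policy.
# THRESHOLD_BYTES is an assumed value; the patent does not specify one.
THRESHOLD_BYTES = 256

def arbitrate(packets, threshold=THRESHOLD_BYTES):
    """Partition packets between full and limited processing blocks by size."""
    full, limited = [], []
    for pkt in packets:
        if len(pkt) > threshold:
            full.append(pkt)      # size exceeds threshold -> full processing block
        else:
            limited.append(pkt)   # size does not exceed threshold -> limited block
    return full, limited

full, limited = arbitrate([b"x" * 64, b"y" * 1500, b"z" * 256])
# The 1500-byte packet exceeds the 256-byte threshold; the 64-byte and
# 256-byte packets do not, so they go to the limited processing blocks.
```

A hardware arbiter would apply the same comparison per packet per clock cycle; the list-based partition above only models the routing decision itself.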
Description
TECHNICAL FIELD
The present disclosure generally relates to packet processing, and more specifically to methods and systems for providing hyperscalar packet processing to optimize circuit integration, reduce power consumption and latency, and improve performance.
BACKGROUND
In packet processing devices such as network switches and routers, transitioning to smaller process nodes was often sufficient to meet ever-increasing performance targets. However, as the feature size of process nodes approaches physical limits, performance improvements become harder to achieve from process shrinkage alone. Meanwhile, high-performance computing and other demanding scale-out applications in the datacenter continue to require performance that conventional packet processing devices do not provide. Latency-sensitive applications further require specialized hardware features, such as ternary content addressable memory ("TCAM"), which in turn impose performance constraints that raise further hurdles in meeting performance targets.
US 8 775 685 B1 relates to parallel processing of network packets and discloses processing pipelines, each of which is configured and arranged to process packets having a size greater than or equal to the associated processing size. US 8 335 224 B1 relates to a data buffering apparatus and method; the buffering apparatus has a small primary buffer and a large secondary buffer. US 2006/0114914 A1 relates to a pipeline architecture of a network device. WO 2019/165355 A1 relates to technologies for NIC port reduction with accelerated switching. EP 3 562 110 B1 discloses a scaled-up shared-buffer architecture for a network switch that processes two packets per cycle. Each physical bank of the buffer memory implemented in the shared-buffer architecture is capable of supporting two random access reads within a single bank, because two packets scheduled for transmission may reside in the same bank at the same time.
The packets received in the same cycle can always be directed to be written into memory banks other than the ones being read, while avoiding collisions with other writes.
SUMMARY
It is an object of the invention to process packets efficiently. This object is solved by the subject matter of the independent claims. The claimed invention is defined by the independent claims. Further embodiments are described in the dependent claims.
DESCRIPTION OF THE FIGURES
Various objects, features, and advantages of the present disclosure can be more fully appreciated with reference to the following detailed description when considered in connection with the following drawings, in which like reference numerals identify like elements. The following drawings are for the purpose of illustration only and are not intended to limit this disclosure, the scope of which is set forth in the claims that follow.
- FIG. 1A depicts an example network environment in which hyperscalar packet processing may be implemented, according to various aspects of the subject technology.
- FIG. 1B depicts a logical block diagram of ingress/egress packet processing within an example network switch for providing hyperscalar packet processing, according to various aspects of the subject technology.
- FIG. 2A depicts an example system for processing a single packet from a single data path, according to various aspects of the subject technology.
- FIG. 2B depicts an example system for processing dual packets from two data paths, according to various aspects of the subject technology.
- FIG. 2C depicts an example system for logically grouping two dual packet processing blocks together, according to various aspects of the subject technology.
- FIG. 2D depicts an example system for using an arbiter on a shared bus to enforce an arbitration policy, according to various aspects of the subject technology.
- FIG. 2E depicts an example system for arbitrating data paths through individual packet processing pipes, according to various aspects of the subject technology.
- FIG. 2F depicts an example system for arbitrating data paths through an aggregate packet processing pipe, according to various aspects of the subject technology.
- FIG. 2G depicts an example system combining the logical grouping of FIG. 2C with the aggregate packet processing pipe of FIG. 2F, according to various aspects of the subject technology.
- FIG. 2H depicts an example system combining the features shown in FIGS. 2A-2G to provide hyperscalar packet processing, according to various aspects of the subject technology.
- FIG. 3A depicts an example system for supporting 8 data paths through 4 packet processing blocks, according to various aspects of the subject technology.
- FIG. 3B depicts an example system for reducing latency by providing slot event queues and a scheduler to read out events, according to various aspects of the subject technology.
- FIG. 4 depicts an example process for using hyperscalar packet processing to optimize circuit integration, reduce power consumption and latency, and improve performance, according to various aspects of the subject technology.
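The per-cycle quantities recited in the claims (each full processing block processes a first quantity of packets per clock cycle, and each limited processing block processes a greater second quantity, e.g. at least twice the first, with the first quantity being one) can be modeled roughly as below. The block counts and rates are illustrative assumptions chosen to match the figures' 2-full / 4-limited arrangement, not values mandated by the claims.

```python
import math

# Rough throughput model for a claimed mix of processing blocks.
# Counts and per-cycle rates are illustrative assumptions.
FULL_BLOCKS, FULL_RATE = 2, 1        # "first quantity": one packet per clock cycle
LIMITED_BLOCKS, LIMITED_RATE = 4, 2  # "second quantity": at least twice the first

def packets_per_cycle():
    """Aggregate packets processed per clock cycle with all blocks in parallel."""
    return FULL_BLOCKS * FULL_RATE + LIMITED_BLOCKS * LIMITED_RATE

def cycles_to_drain(n_large, n_small):
    """Clock cycles to process n_large packets on the full blocks and
    n_small packets on the limited blocks, running in parallel."""
    full_cycles = math.ceil(n_large / (FULL_BLOCKS * FULL_RATE))
    limited_cycles = math.ceil(n_small / (LIMITED_BLOCKS * LIMITED_RATE))
    return max(full_cycles, limited_cycles)  # parallel: bounded by the slower side
```

With these assumed rates, the mix sustains 2*1 + 4*2 = 10 packets per clock cycle, which is the kind of aggregate gain the parallel full/limited arrangement is aiming at.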