EP-4735970-A1 - SYSTEM AND METHODS FOR FAST-SWITCHED OPTICAL DATA CENTER NETWORKS

Abstract

The present invention relates to methods and devices for transmitting packets in a data center network (DCN), the data center comprising a multitude of host servers, a multitude of top-of-rack, ToR, switches connected to the host servers and an optical network fabric connected to the multitude of top-of-rack switches, wherein the optical network fabric operates according to a given schedule, wherein the given schedule defines, for each time slice in a sequence of time slices, which pairs of top-of-rack switches are connected by a dedicated optical circuit established by an optical controller of the optical network fabric for said time slice, wherein a top-of-rack switch: synchronizes to another top-of-rack switch; receives a packet at an ingress port; and sends the packet to an egress port. According to the invention, synchronizing comprises sending a synchronization message to said another top-of-rack switch in-band.

Inventors

  • YITING, Xia
  • JIALONG, Li
  • YIMING, Lei
  • DE MARCHI, FEDERICO
  • RAJ, Joshi
  • CHANDRASEKARAN, BALAKRISHNAN

Assignees

  • Max-Planck-Gesellschaft zur Förderung der Wissenschaften e.V.
  • Stichting VU

Dates

Publication Date
2026-05-06
Application Date
2023-06-29

Claims (19)

  1. Method for transmitting packets in a data center network (DCN), the data center comprising a multitude of host servers, a multitude of top-of-rack, ToR, switches connected to the host servers and an optical network fabric connected to the multitude of top-of-rack switches, wherein the optical network fabric operates according to a given schedule, wherein the given schedule defines, for each time slice in a sequence of time slices, which pairs of top-of-rack switches are connected by a dedicated optical circuit established by an optical controller of the optical network fabric for said time slice, wherein a top-of-rack switch: synchronizes to another top-of-rack switch; receives a packet at an ingress port; and sends the packet to an egress port, characterized in that synchronizing comprises sending a synchronization message to said another top-of-rack switch in-band, that is, via a circuit of the optical network fabric.
  2. The method of claim 1, wherein the synchronization message is sent based on the given schedule.
  3. The method of claim 2, wherein the given schedule is an initial synchronization schedule, used for synchronizing at coarse accuracy, before the data center network becomes operational.
  4. The method of claim 3, wherein the given schedule is an operational schedule, after the initial synchronization has terminated.
  5. The method of claim 4, wherein synchronizing comprises re-synchronizing the top-of-rack switch.
  6. The method of claim 5, wherein re-synchronization is performed when a current amount by which the synchronized clock of said top-of-rack switch has drifted from a master clock exceeds a pre-defined threshold T.
  7. The method of claim 6, wherein the amount of drift corresponds to a sum of synchronization and drift errors of the top-of-rack switch.
  8. The method of claim 7, wherein the synchronization error and the drift error are estimated based on an empirical profile or statistics of the synchronization and drift errors of the top-of-rack switch.
  9. The method of claim 8, wherein the pre-defined threshold T is determined based on an accuracy and an overhead of re-synchronization.
  10. The method of claim 9, wherein the pre-defined threshold T is further determined based on a relative significance of the synchronization error and the drift error.
  11. The method of claim 10, wherein, if the synchronization error dominates, T is made large to reduce a re-synchronization phase into re-synchronizing said top-of-rack switch directly with a lead top-of-rack switch, once per cycle of the schedule.
  12. The method of claim 11, wherein, if the drift error dominates, T is made small to re-synchronize each ToR multiple times in a cycle through intermediate reference ToRs before its clock drifts off.
  13. The method of claim 12, wherein a ToR with a minimal amount of drift is selected as said another ToR for re-synchronization.
  14. The method of claim 13, wherein a time slice and a port connected to said another top-of-rack switch are stored in a lookup table on said top-of-rack switch, the lookup table being used to direct synchronization messages in different time slices to specific ports connected to top-of-rack switches to synchronize with.
  15. The method of claim 14, wherein the lookup table is preloaded into a control plane of said top-of-rack switch.
  16. The method of claim 15, wherein batches of the lookup table are periodically injected into the data plane of said top-of-rack switch.
  17. The method of claim 16, wherein synchronization returns an offset for adjusting a local clock of the top-of-rack switch.
  18. The method of claim 17, wherein the synchronization message is a DPTP message.
  19. Top-of-rack switch, implementing the method steps of one of claims 1 to 18.
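The threshold-based re-synchronization of claims 6 to 13 and the slice-to-port lookup table of claims 14 to 16 can be sketched as follows. This is a minimal illustration only: all constants, names, and table entries are assumptions for the example, not values taken from the patent.

```python
# Hypothetical sketch of the re-synchronization decision (claims 6-13)
# and the time-slice -> port lookup table (claim 14). All constants are
# illustrative assumptions, not the patent's implementation.

SYNC_ERROR_NS = 50      # assumed per-synchronization error bound (empirical profile)
DRIFT_PPM = 10          # assumed clock drift rate, parts per million
THRESHOLD_T_NS = 500    # pre-defined threshold T (accuracy vs. overhead trade-off)

def estimated_drift_ns(elapsed_ns: int) -> int:
    """Drift estimate = synchronization error + accumulated drift error (claim 7)."""
    drift_error = elapsed_ns * DRIFT_PPM // 1_000_000
    return SYNC_ERROR_NS + drift_error

def needs_resync(elapsed_ns: int) -> bool:
    """Re-synchronize once the estimated drift exceeds threshold T (claim 6)."""
    return estimated_drift_ns(elapsed_ns) > THRESHOLD_T_NS

# Lookup table: time slice -> egress port facing the ToR to synchronize with.
# Per claims 15/16 such a table is preloaded into the control plane and
# injected into the data plane in batches; the entries here are made up.
sync_port_table = {0: 1, 3: 2, 7: 4}

def sync_port_for_slice(time_slice: int):
    """Port to direct a synchronization message to in this slice, if any."""
    return sync_port_table.get(time_slice)  # None if no sync in this slice
```

With these assumed numbers, a ToR that last synchronized 10 ms ago stays within the threshold, while one that last synchronized 100 ms ago triggers re-synchronization.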

Description

SYSTEM AND METHODS FOR FAST-SWITCHED OPTICAL DATA CENTER NETWORKS

The present invention relates to a system and to methods for transmitting data in data center networks (DCN).

TECHNICAL BACKGROUND

Designs for data center networks have largely benefited from Moore's law for networking: the bandwidth of electrical switches doubles every two years at the same cost and power. As this bandwidth scaling slows down, a series of optical DCN architectures have been proposed to leverage the bandwidth, power, and cost advantages of optical interconnects [13-16, 20, 23, 27, 31-33, 36, 37, 42-46]. Compared to electrical interconnects in traditional DCNs, optical interconnects use circuit switching to establish dedicated optical circuits between end points and shift the circuits across "time slices" to create time-shared networks. So-called slow-switched optical DCNs have tens of milliseconds of switching delay [20, 37, 42, 44-46]. Limited by the switching speed, this type of optical network has to work in tandem with an electrical network to avoid network partitioning, e.g., either augmenting the electrical DCN with on-demand circuits to offload heavy traffic [20, 42], or serving as "patch panels" for electrical switches that reconfigure the network topology on a seconds-to-hours granularity [14, 37, 44-46]. For example, Jupiter, Google's DCN fabric, has achieved a 5x capacity increase, a 41% power reduction, and a 30% cost reduction after deploying slow-switched optical interconnects in the network core [37]. These optical interconnects provide large port counts to interconnect electrical switches and reconfigure the DCN topology when needed, e.g., at device upgrade and failure times, or once every few hours as the DCN traffic evolves. Figure 1, on the other hand, shows an example of a typical, general-purpose fast-switched optical DCN that has increasingly been recognized as an alternative to slow-switched DCNs in recent years.
The fast-switched optical DCN comprises optical switches that interconnect electrical top-of-rack switches (ToRs) and end servers [31-33, 36]. The fabric uses circuit switching to establish dedicated optical circuits that are time-shared amongst the different ToR pairs for high-speed transmission of aggregated traffic. Once established, a circuit is retained for a fixed interval of time, called a time slice, during which the connected ToRs have exclusive use of it, i.e., no contention with other ToRs. An optical controller, e.g., an FPGA board [13, 27, 32, 33], controls the circuit switches to change the circuits continually, once per time slice, to route traffic in the optical domain on an all-optical network fabric. The sequence of ToR-wise connections associated with their time slices constitutes an optical schedule. Normally, the schedule is pre-defined and repeats every optical cycle. There is at least one circuit between every ToR pair per cycle. The removal of the electrical network further reduces cost compared to slow-switched optical DCNs, but at the same time deviates from the all-to-all connectivity assumed by conventional DCN designs. The switching delays of fast-switched optical DCNs vary between several nanoseconds [13] and tens of microseconds [32, 33, 36].

PRIOR ART

Table 1 summarizes the limitations of the fast-switched DCN architectures proposed until now.

Table 1: Implementation limitations of existing work.

As Table 1 shows, existing architectures have not gone beyond proof-of-concept prototypes of the proposed optical network fabric, and their systems, with bare-minimum testing functionalities, fall short of actual end-to-end system implementations. A common problem of ToRs emulated with Linux servers [33, 36] and hosts emulated with FPGA boards [13, 27] is time synchronization of the ToRs and hosts with the optical fabric.
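The pre-defined, repeating schedule described above must connect every ToR pair at least once per optical cycle. One simple construction with this property, sketched below, assigns each time slice a cyclic-shift permutation of the ToRs; this round-robin scheme is an illustrative assumption, not the schedule the patent prescribes.

```python
# Illustrative sketch of a fixed, repeating optical schedule in which every
# ordered ToR pair shares a dedicated circuit at least once per cycle.
# One cyclic-shift permutation per time slice is a simple (assumed) way to
# achieve this; the patent does not mandate this particular construction.

def round_robin_schedule(num_tors: int):
    """Return one optical cycle: a list of time slices, each mapping
    source ToR -> destination ToR for that slice's circuits."""
    schedule = []
    for shift in range(1, num_tors):  # one slice per nonzero cyclic shift
        slice_map = {src: (src + shift) % num_tors for src in range(num_tors)}
        schedule.append(slice_map)
    return schedule

# A 4-ToR fabric: the cycle has 3 slices, and over one cycle every ordered
# ToR pair (src != dst) is connected exactly once.
sched = round_robin_schedule(4)
```

The optical controller would replay this cycle indefinitely, reconfiguring the circuit switches once per time slice.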
Some architectures have the optical controller notify ToRs of upcoming circuits [27, 36], but it is hard to scale this design to large DCNs. Others rely on link up/down events on ToR/host ports to detect circuit on/offs [32, 33], but resetting ports incurs millisecond-scale delays, which is not responsive enough for fast-switched optical DCNs. Mordia [36] and RotorNet [33] apply multi-hop routing to both long or heavy ("elephant") flows and short or light ("mice") flows (latency-sensitive traffic), which requires large buffers on ToRs, e.g., ~70 MB per switch port in Mordia on a 100 Gbps DCN. Mordia and SiP-Ring use dynamic optical schedules computed from real-time traffic estimation [27, 36], which is hard in fast-switched optical DCNs, especially for bursty traffic, and neither implemented traffic estimation in their prototypes. Opera imposes a rigid relation between the number of ToRs and uplinks to ensure the existence of multi-hop paths for every ToR pair at any moment [32], making deployment and expansion challenging. Sirius and SiP-Ring require customized hardware [13, 27], e.g., custom optical modules and optical interfaces on GPUs, which are not manufacture