CN-122029783-A - Technique for achieving end-to-end flow isolation
Abstract
Mechanisms for establishing/constructing a network architecture for a cluster of GPUs are discussed herein. A plurality of GPU sets are created, wherein each GPU set is created by selecting one GPU from each of a plurality of host machines. Each GPU set is coupled to a different one of a plurality of groups of switches. The coupling includes (i) coupling each GPU in the set of GPUs to a unique ingress port of a first switch included in a corresponding group of switches associated with the set of GPUs, and (ii) virtually mapping each ingress port of the first switch to a unique egress port of a plurality of egress ports of the first switch. Data packets originating from a source GPU and destined for a destination GPU are communicated via the network architecture.
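For illustration only, the following minimal Python sketch models the GPU-set construction and per-switch ingress-to-egress mapping described above. It is not part of the claimed subject matter; the cluster dimensions (NUM_HOSTS, GPUS_PER_HOST) and the identity port mapping are hypothetical choices.

```python
# Illustrative sketch only; NUM_HOSTS, GPUS_PER_HOST, and the identity
# port mapping are hypothetical, not taken from the patent.

NUM_HOSTS = 8        # host machines in the cluster
GPUS_PER_HOST = 8    # GPUs per host machine

# GPU k of every host is placed in set k, so each set contains exactly
# one GPU from each host machine.
gpu_sets = [
    [(host, gpu) for host in range(NUM_HOSTS)]
    for gpu in range(GPUS_PER_HOST)
]

# Each GPU set is coupled to its own group of switches. On the first
# switch of that group, ingress port i is virtually mapped to a unique
# egress port (here, the identity mapping), so each arriving flow gets
# its own deterministic egress link.
switch_groups = {
    set_id: {"ingress_to_egress": {i: i for i in range(NUM_HOSTS)}}
    for set_id in range(GPUS_PER_HOST)
}

assert len(gpu_sets) == GPUS_PER_HOST
assert all(len(s) == NUM_HOSTS for s in gpu_sets)
```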
Inventors
- J. R. Yukel
- Jagwinder Singh Brar
Assignees
- Oracle International Corporation
Dates
- Publication Date: 2026-05-12
- Application Date: 2024-10-10
- Priority Date: 2023-10-13
Claims (20)
- 1. A method, comprising: creating, in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of groups of switches, a plurality of sets of GPUs, each of the plurality of host machines including a plurality of GPUs, each of the plurality of groups of switches arranged in a hierarchy including a first tier of switches and a second tier of switches, wherein each of the plurality of sets of GPUs is created by selecting one GPU from each of the plurality of host machines; coupling each of the plurality of sets of GPUs to a different one of the plurality of groups of switches, the coupling comprising (i) coupling each GPU in the set of GPUs to a unique ingress port of a first switch included in a corresponding group of switches associated with the set of GPUs, and (ii) virtually mapping each ingress port of the first switch to a unique egress port of a plurality of egress ports of the first switch; communicatively connecting the plurality of groups of switches via one or more switches included in a third tier of switches of the network fabric; and, for a data packet originating from a source GPU on a first host machine and destined for a destination GPU on a second host machine, transmitting the data packet from the source GPU to the destination GPU using the network fabric.
- 2. The method of claim 1, wherein the first switch included in a first group of switches associated with a first set of GPUs is included in the first tier of switches in the network fabric.
- 3. The method of claim 2, wherein the plurality of egress ports of the first switch are communicatively coupled to a first set of switches included in the first group of switches associated with the first set of GPUs, the first set of switches included in the second tier of switches in the network fabric.
- 4. The method of claim 1, wherein the plurality of host machines are contained in a first rack of a network environment.
- 5. The method of claim 2, wherein each GPU in the first set of GPUs is directly coupled to a unique ingress port of the first switch.
- 6. The method of claim 1, wherein each GPU contained in a host machine is assigned to a different set of GPUs.
- 7. The method of claim 1, wherein a switch included in the first tier of switches in the network fabric does not perform ECMP routing to forward the data packet.
- 8. The method of claim 3, wherein an equal number of egress ports of the first switch are coupled to each switch in the first set of switches, the first set of switches included in the first group of switches associated with the first set of GPUs.
- 9. One or more computer-readable non-transitory media storing computer-executable instructions that, when executed by one or more processors, cause: creating, in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of groups of switches, a plurality of sets of GPUs, each of the plurality of host machines including a plurality of GPUs, each of the plurality of groups of switches arranged in a hierarchy including a first tier of switches and a second tier of switches, wherein each of the plurality of sets of GPUs is created by selecting one GPU from each of the plurality of host machines; coupling each of the plurality of sets of GPUs to a different one of the plurality of groups of switches, the coupling comprising (i) coupling each GPU in the set of GPUs to a unique ingress port of a first switch included in a corresponding group of switches associated with the set of GPUs, and (ii) virtually mapping each ingress port of the first switch to a unique egress port of a plurality of egress ports of the first switch; communicatively connecting the plurality of groups of switches via one or more switches included in a third tier of switches of the network fabric; and, for a data packet originating from a source GPU on a first host machine and destined for a destination GPU on a second host machine, transmitting the data packet from the source GPU to the destination GPU using the network fabric.
- 10. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 9, wherein a first switch included in a first group of switches associated with a first set of GPUs is included in the first tier of switches in the network fabric.
- 11. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 10, wherein the plurality of egress ports of the first switch are communicatively coupled to a first set of switches included in the first group of switches associated with the first set of GPUs, the first set of switches included in the second tier of switches in the network fabric.
- 12. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 9, wherein the plurality of host machines are contained in a first rack of a network environment.
- 13. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 10, wherein each GPU in the first set of GPUs is directly coupled to a unique ingress port of the first switch.
- 14. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 9, wherein each GPU contained in a host machine is assigned to a different set of GPUs.
- 15. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 9, wherein a switch included in the first tier of switches in the network fabric does not perform ECMP routing to forward the data packet.
- 16. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 11, wherein an equal number of egress ports of the first switch are coupled to each switch in the first set of switches, the first set of switches included in the first group of switches associated with the first set of GPUs.
- 17. A computing device, comprising: one or more processors; and a memory comprising instructions that, when executed by the one or more processors, cause the computing device to at least: create, in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of groups of switches, a plurality of sets of GPUs, each of the plurality of host machines including a plurality of GPUs, each of the plurality of groups of switches arranged in a hierarchy including a first tier of switches and a second tier of switches, wherein each of the plurality of sets of GPUs is created by selecting one GPU from each of the plurality of host machines; couple each of the plurality of sets of GPUs to a different one of the plurality of groups of switches by (i) coupling each GPU in the set of GPUs to a unique ingress port of a first switch included in a corresponding group of switches associated with the set of GPUs, and (ii) virtually mapping each ingress port of the first switch to a unique egress port of a plurality of egress ports of the first switch; communicatively connect the plurality of groups of switches via one or more switches included in a third tier of switches of the network fabric; and, for a data packet originating from a source GPU on a first host machine and destined for a destination GPU on a second host machine, transmit the data packet from the source GPU to the destination GPU using the network fabric.
- 18. The computing device of claim 17, wherein the first switch included in a first group of switches associated with a first set of GPUs is included in the first tier of switches in the network fabric.
- 19. The computing device of claim 18, wherein the plurality of egress ports of the first switch are communicatively coupled to a first set of switches included in the first group of switches associated with the first set of GPUs, the first set of switches included in the second tier of switches in the network fabric.
- 20. The computing device of claim 17, wherein the plurality of host machines are contained in a first rack of a network environment.
Description
Technique for achieving end-to-end flow isolation
Cross-Reference to Related Applications
The present application claims the benefit of U.S. Provisional Application No. 63/590,269, filed October 13, 2023, and U.S. Provisional Application No. 63/611,948, filed December 19, 2023, each of which is incorporated herein by reference in its entirety for all purposes.
Technical Field
The present disclosure relates to a network infrastructure for performing artificial intelligence or machine learning workloads, such as Graphics Processing Unit (GPU) workloads.
Background
Organizations are continually moving business applications and databases to the cloud to reduce the cost of purchasing, updating, and maintaining on-premises hardware and software. High-performance computing applications consume all available computing power to achieve a particular outcome or result. Such applications require dedicated network performance, fast storage, high computing power, and large amounts of memory; these resources are in short supply in the virtualized infrastructure that constitutes today's commodity cloud. Cloud infrastructure service providers offer newer and faster Graphics Processing Units (GPUs) to address the requirements of these applications.
GPU workloads are typically executed on one or more host machines. Such workloads often fail to reach the expected throughput level. One contributing factor is a lack of flow entropy, e.g., equal-cost multi-path (ECMP) flow entropy. In ECMP, multiple flows (e.g., from different host machines) may be hashed in a manner such that they traverse the same outgoing link/port of a switch (a short illustrative sketch of this collision appears below). The fact that host machines (i.e., hosts) exchange traffic without regard to other hosts in their local network neighborhood exacerbates this problem. Other types of workloads are typically performed by selecting one or more host machines from an infrastructure in a random (i.e., arbitrary) manner; in other words, the workload is performed without consideration of locality information (e.g., the physical location of the host machine). As a result, the throughput of these workloads is low. This situation typically leads to bandwidth contention, commonly referred to in the literature as congestion caused by flow collisions. The embodiments discussed herein address these and other issues.
Disclosure of Invention
The present disclosure relates generally to a network infrastructure for executing Graphics Processing Unit (GPU) workloads. Various embodiments are described herein, including methods, systems, non-transitory computer-readable media storing programs, code, or instructions executable by one or more processors, and the like. These illustrative embodiments are mentioned not to limit or define the disclosure, but rather to provide examples to aid understanding of it. Additional embodiments are discussed in the detailed description section, where further description is provided.
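As a rough illustration of the flow-collision problem noted in the Background above, the following sketch contrasts ECMP hashing with a fixed ingress-to-egress mapping. The hash function, addresses, and port count are hypothetical stand-ins, not drawn from the disclosure.

```python
# Illustrative only: ECMP chooses an egress port by hashing a flow's
# identifying tuple, so two distinct flows can collide on one port.

NUM_EGRESS_PORTS = 4

def ecmp_egress_port(flow):
    # A stand-in for an ECMP hash over the flow's 5-tuple.
    return hash(flow) % NUM_EGRESS_PORTS

flow_a = ("10.0.0.1", "10.0.1.1", 4791)   # (src, dst, dst_port)
flow_b = ("10.0.0.2", "10.0.1.2", 4791)
# Both flows may hash to the same egress port -- a flow collision that
# splits the bandwidth of that link between the two flows.
print(ecmp_egress_port(flow_a), ecmp_egress_port(flow_b))

# By contrast, a fixed ingress-to-egress mapping keys the egress port to
# the ingress port alone, so flows entering on different ports never
# share an egress link at this switch.
ingress_to_egress = {i: i for i in range(NUM_EGRESS_PORTS)}
print(ingress_to_egress[0], ingress_to_egress[1])  # always distinct
```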
One embodiment of the present disclosure is directed to a method comprising: creating, in a network environment comprising a plurality of host machines communicatively coupled to each other via a network fabric comprising a plurality of groups of switches, a plurality of sets of GPUs, each of the plurality of host machines comprising a plurality of GPUs, each of the plurality of groups of switches arranged in a hierarchy comprising a first tier of switches and a second tier of switches, wherein each of the plurality of sets of GPUs is created by selecting one GPU from each of the plurality of host machines; coupling each of the plurality of sets of GPUs to a different one of the plurality of groups of switches, the coupling comprising (i) coupling each GPU in the set of GPUs to a unique ingress port of a first switch included in a corresponding group of switches associated with the set of GPUs, and (ii) virtually mapping each ingress port of the first switch to a unique egress port of a plurality of egress ports of the first switch; communicatively connecting the plurality of groups of switches via one or more switches included in a third tier of switches of the network fabric; and, for a data packet originating from a source GPU on a first host machine and destined for a destination GPU on a second host machine, transmitting the data packet from the source GPU to the destination GPU using the network fabric. An aspect of the present disclosure provides a computing device comprising one or more data processors and a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein. Another aspect of the present disclosure provides one or more computer-readable non-transitory media storing computer-executable instructions that, when executed by one or more processors, cause performance of part or all of one or more methods disclosed herein. The foregoing, together with other features and embodiments, will become more apparent upon referring to the following specification, claims, and accompanying drawings.
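A minimal sketch of the end-to-end path implied by this embodiment follows. The topology details and names below are assumptions for illustration (GPUs of set k attach to a tier-1 switch of switch group k; tier-2 switches sit above tier-1 within a group; tier-3 switches interconnect the groups) and do not reproduce the patent's figures.

```python
# Assumed structure for illustration: traffic between GPUs of the same
# set stays inside that set's switch group, and the tier-1 switch
# forwards each flow on the egress port virtually mapped to its ingress
# port, isolating it from other flows end to end.

def trace_path(src_host: int, dst_host: int, gpu_index: int) -> list:
    """Hops for a packet between the set-`gpu_index` GPUs of two hosts."""
    g = gpu_index  # one switch group per GPU set
    return [
        f"host{src_host}/gpu{g}",
        f"group{g}/tier1: ingress port for host{src_host} -> virtually mapped egress port",
        f"group{g}/tier2",
        f"group{g}/tier1: egress toward host{dst_host}",
        f"host{dst_host}/gpu{g}",
    ]

print("\n".join(trace_path(src_host=0, dst_host=3, gpu_index=5)))
```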