Search

CN-122029784-A - Interconnecting global virtual planes

CN122029784A

Abstract

The network environment includes a plurality of host machines coupled to one another via a network fabric including a plurality of switches, which in turn include a plurality of ports. Each host machine includes one or more GPUs. The first subset of ports is associated with a first virtual plane, wherein the first virtual plane identifies a first set of resources to be used for transmitting data packets from/to a host machine associated with the first virtual plane. The second subset of ports is associated with a second virtual plane that is different from the first virtual plane. The first host machine and the second host machine are associated with a first virtual plane and a second virtual plane, respectively. Data packets are transferred from a first host machine to a second host machine using ports in the first subset of ports and the second subset of ports.
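The port-to-plane and host-to-plane associations described in the abstract can be sketched as a simple mapping. All names and types below are illustrative assumptions; the patent does not prescribe any particular implementation:

```python
# Sketch of associating switch ports and host machines with virtual planes,
# per the abstract. Names (Port, VirtualPlane, etc.) are hypothetical.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Port:
    switch_id: str
    port_id: int

@dataclass
class VirtualPlane:
    name: str
    ports: set = field(default_factory=set)   # resources reserved for this plane
    hosts: set = field(default_factory=set)   # host machines assigned to this plane

def associate(plane: VirtualPlane, ports, hosts):
    plane.ports.update(ports)
    plane.hosts.update(hosts)

vp1 = VirtualPlane("plane-1")
vp2 = VirtualPlane("plane-2")
associate(vp1, {Port("sw-a", 1), Port("sw-b", 1)}, {"host-1"})
associate(vp2, {Port("sw-a", 2), Port("sw-b", 2)}, {"host-2"})

# A packet from host-1 (plane-1) to host-2 (plane-2) would use ports drawn
# from both planes' port subsets, as in the final step of the abstract.
path_ports = vp1.ports | vp2.ports
```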

Inventors

  • BRAR JAGWINDER SINGH
  • DAVID D. BECKER
  • SHETTY NITYANAND
  • KUNDU PARTHASARATHI

Assignees

  • Oracle International Corporation

Dates

Publication Date
2026-05-12
Application Date
2024-10-10
Priority Date
2023-10-13

Claims (20)

  1. A method, comprising: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associating a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be exclusively used for transmitting data packets from and to host machines associated with the first virtual plane; associating a second subset of the plurality of ports with a second virtual plane different from the first virtual plane; associating a first host machine with the first virtual plane and a second host machine with the second virtual plane; and, for a data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transferring the data packet from the first GPU to the second GPU using ports in the first subset of ports and the second subset of ports.
  2. The method of claim 1, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, wherein the plurality of host machines are directly coupled to switches included in the first tier of switches, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
  3. The method of claim 2, wherein at least one port of the first subset of ports belongs to a first switch that is included in the third tier of switches and associated with the first virtual plane.
  4. The method of claim 3, wherein at least one port in the second subset of ports belongs to a second switch that is included in the third tier of switches and associated with the second virtual plane.
  5. The method of claim 4, wherein the first switch included in the third tier of switches and the second switch included in the third tier of switches are directly coupled to each other via a cable.
  6. The method of claim 2, further comprising: for each switch of the plurality of switches included in the first tier of switches, creating a first Virtual Tunnel Endpoint (VTEP) associated with the first virtual plane and a second VTEP associated with the second virtual plane, the first VTEP having a first Autonomous System Number (ASN) and the second VTEP having a second ASN different from the first ASN.
  7. The method of claim 2, wherein a subset of host machines included in the plurality of host machines are directly coupled to a first switch included in the first tier of switches, each host machine in the subset of host machines being associated with a different virtual plane.
  8. The method of claim 7, wherein the number of virtual planes supported by the network fabric corresponds to the number of host machines included in the subset of host machines that are directly coupled to the first switch included in the first tier of switches.
  9. The method of claim 7, wherein each host machine is associated with a unique Virtual Tunnel Endpoint (VTEP) created in the first switch included in the first tier of switches, and wherein the one or more GPUs of the host machine communicate with one or more other GPUs of other host machines in the network environment via the unique VTEP associated with the host machine.
  10. One or more computer-readable non-transitory media storing computer-executable instructions that, when executed by one or more processors, cause the one or more processors to perform operations comprising: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associating a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be exclusively used for transmitting data packets from and to host machines associated with the first virtual plane; associating a second subset of the plurality of ports with a second virtual plane different from the first virtual plane; associating a first host machine with the first virtual plane and a second host machine with the second virtual plane; and, for a data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transferring the data packet from the first GPU to the second GPU using ports in the first subset of ports and the second subset of ports.
  11. The one or more computer-readable non-transitory media of claim 10, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, wherein the plurality of host machines are directly coupled to switches included in the first tier of switches, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
  12. The one or more computer-readable non-transitory media of claim 11, wherein at least one port in the first subset of ports belongs to a first switch that is included in the third tier of switches and associated with the first virtual plane.
  13. The one or more computer-readable non-transitory media of claim 12, wherein at least one port in the second subset of ports belongs to a second switch that is included in the third tier of switches and associated with the second virtual plane.
  14. The one or more computer-readable non-transitory media of claim 13, wherein the first switch included in the third tier of switches and the second switch included in the third tier of switches are directly coupled to each other via a cable.
  15. The one or more computer-readable non-transitory media of claim 11, wherein the operations further comprise: for each switch of the plurality of switches included in the first tier of switches, creating a first Virtual Tunnel Endpoint (VTEP) associated with the first virtual plane and a second VTEP associated with the second virtual plane, the first VTEP having a first Autonomous System Number (ASN) and the second VTEP having a second ASN different from the first ASN.
  16. The one or more computer-readable non-transitory media of claim 11, wherein a subset of host machines included in the plurality of host machines are directly coupled to a first switch included in the first tier of switches, each host machine in the subset of host machines being associated with a different virtual plane.
  17. The one or more computer-readable non-transitory media of claim 16, wherein the number of virtual planes supported by the network fabric corresponds to the number of host machines included in the subset of host machines that are directly coupled to the first switch included in the first tier of switches.
  18. The one or more computer-readable non-transitory media of claim 16, wherein each host machine is associated with a unique Virtual Tunnel Endpoint (VTEP) created in the first switch included in the first tier of switches, and wherein the one or more GPUs of the host machine communicate with one or more other GPUs of other host machines in the network environment via the unique VTEP associated with the host machine.
  19. A computing device, comprising: one or more processors; and a memory comprising instructions that, when executed by the one or more processors, cause the computing device to at least: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associate a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be exclusively used for transmitting data packets from and to host machines associated with the first virtual plane; associate a second subset of the plurality of ports with a second virtual plane different from the first virtual plane; associate a first host machine with the first virtual plane and a second host machine with the second virtual plane; and, for a data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transfer the data packet from the first GPU to the second GPU using ports in the first subset of ports and the second subset of ports.
  20. The computing device of claim 19, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, wherein the plurality of host machines are directly coupled to switches included in the first tier of switches, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
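Claims 6 and 15 describe creating, on each first-tier (leaf) switch, one VTEP per virtual plane, each with its own Autonomous System Number. A minimal sketch of that assignment follows; the ASN numbering scheme and all names are assumptions for illustration, not taken from the patent:

```python
# Illustrative sketch of claims 6/15: for each first-tier switch, create one
# VTEP per virtual plane, each with a distinct ASN. Numbering is hypothetical.

BASE_ASN = 64512  # start of the 16-bit private-use ASN range

def build_vteps(tier1_switches, planes):
    """Return a mapping (switch, plane) -> VTEP record with a unique ASN."""
    vteps = {}
    for s_idx, switch in enumerate(tier1_switches):
        for p_idx, plane in enumerate(planes):
            asn = BASE_ASN + s_idx * len(planes) + p_idx  # unique per (switch, plane)
            vteps[(switch, plane)] = {"asn": asn, "plane": plane}
    return vteps

vteps = build_vteps(["leaf-1", "leaf-2"], ["plane-1", "plane-2"])
# Each leaf switch ends up with two VTEPs (one per plane), and no two VTEPs
# share an ASN, matching the "second ASN different from the first ASN" clause.
```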

Description

Interconnecting global virtual planes

Cross Reference to Related Applications

The present application claims the benefit of U.S. provisional application No. 63/590,269, filed on October 13, 2023, and U.S. provisional application No. 63/611,948, filed on December 19, 2023, each of which is incorporated herein by reference in its entirety for all purposes.

Technical Field

The present disclosure relates to a network infrastructure for performing artificial intelligence or machine learning workloads, such as Graphics Processing Unit (GPU) workloads.

Background

Organizations are continually moving business applications and databases to the cloud to reduce the cost of purchasing, updating, and maintaining on-premise hardware and software. High-performance computing applications consume all available computing power to achieve a particular outcome or result. Such applications require dedicated network performance, fast storage, high computing power, and large amounts of memory; these resources are in short supply in the virtualized infrastructure that constitutes today's commodity cloud. Cloud infrastructure service providers offer newer and faster Graphics Processing Units (GPUs) to address the requirements of these applications.

GPU workloads are typically executed on one or more host machines. Typically, such workloads fail to reach the expected throughput level. One factor that contributes to this problem is a lack of flow entropy, e.g., equal-cost multi-path (ECMP) flow entropy. In ECMP, multiple flows (e.g., from different host machines) may be hashed in a manner such that both flows traverse the same outgoing link/port of a switch. Furthermore, the fact that host machines (i.e., hosts) exchange traffic without regard to other hosts in their local network neighborhood exacerbates this problem.
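The ECMP flow-collision problem described above can be illustrated with a toy model: an egress port is chosen by hashing a flow's 5-tuple, so two distinct flows can land on the same link. The hash function and 5-tuples below are illustrative only, not the scheme used by any particular switch:

```python
# Toy model of ECMP egress-port selection: two distinct flows may hash to the
# same outgoing port, in which case they share one link's bandwidth.
import zlib

def ecmp_port(five_tuple, num_ports):
    """Pick an egress port by hashing the flow's 5-tuple (classic ECMP)."""
    key = "|".join(map(str, five_tuple)).encode()
    return zlib.crc32(key) % num_ports

NUM_PORTS = 4
# (src_ip, dst_ip, protocol, src_port, dst_port) - values are made up
flow_a = ("10.0.0.1", "10.0.1.1", 6, 40000, 4791)
flow_b = ("10.0.0.2", "10.0.1.2", 6, 40001, 4791)

port_a = ecmp_port(flow_a, NUM_PORTS)
port_b = ecmp_port(flow_b, NUM_PORTS)
# With only a few large ("elephant") GPU flows and a small port count, such
# collisions are likely; whenever port_a == port_b, the two flows contend
# for the same link even though other links sit idle.
```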
Other types of workloads are typically performed by selecting one or more host machines from an infrastructure in a random (i.e., arbitrary) manner; in other words, the workload is performed without consideration of locality information (e.g., the physical location of the host machines). As a result, the throughput of these workloads is low. This situation typically leads to bandwidth contention, commonly referred to in the literature as congestion caused by flow collisions. The embodiments discussed herein address these and other issues.

Disclosure of Invention

The present disclosure relates generally to a network infrastructure for executing Graphics Processing Unit (GPU) workloads. Various embodiments are described herein, including methods, systems, non-transitory computer-readable media storing programs, code, or instructions executable by one or more processors, and the like. These illustrative embodiments are provided not to limit or define the disclosure, but to provide examples that aid understanding of it. Additional embodiments are discussed in the detailed description section, where further description is provided.
One embodiment of the present disclosure is directed to a method comprising, in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associating a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be dedicated to transmitting data packets from and to host machines associated with the first virtual plane; associating a second subset of ports of the plurality of ports with a second virtual plane different from the first virtual plane; associating a first host machine with the first virtual plane and a second host machine with the second virtual plane, the first host machine being directly coupled to a first switch of the plurality of switches and the second host machine being directly coupled to a second switch of the plurality of switches; and, for a data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transferring the data packet from the first GPU to the second GPU using ports in the first subset of ports and the second subset of ports.

An aspect of the present disclosure provides a computing device comprising one or more data processors, and a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein. Another aspect of the present disclosure provides one or more computer-readable non-transitory media storing computer-ex