
CN-122029782-A - Interconnecting global virtual planes

CN122029782A

Abstract

The network environment includes a plurality of host machines coupled to one another via a network fabric including a plurality of switches, which in turn include a plurality of ports. Each host machine includes one or more GPUs. The first subset of ports is associated with a first virtual plane, wherein the first virtual plane identifies a first set of resources to be used for transmitting data packets from/to a host machine associated with the first virtual plane. The second subset of ports is associated with a second virtual plane that is different from the first virtual plane. The first host machine and the second host machine are associated with a first virtual plane and a second virtual plane, respectively. Data packets are transferred from a first host machine to a second host machine using ports in the first subset of ports and the second subset of ports.
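The port-to-plane association described in the abstract can be sketched as follows. This is a minimal illustration only: the names (`Fabric`, `assign_ports`, `ports_for_transfer`) and the port identifiers are assumptions for the sketch, not taken from the patent.

```python
from collections import defaultdict

class Fabric:
    """Hypothetical model of a fabric that maps port subsets to virtual planes."""

    def __init__(self):
        self.plane_ports = defaultdict(set)   # virtual plane -> subset of switch ports
        self.host_plane = {}                  # host machine -> its virtual plane

    def assign_ports(self, plane, ports):
        # Associate a subset of the fabric's ports with a virtual plane.
        self.plane_ports[plane] |= set(ports)

    def assign_host(self, host, plane):
        # Associate a host machine with a virtual plane.
        self.host_plane[host] = plane

    def ports_for_transfer(self, src_host, dst_host):
        # Per the abstract, a transfer between hosts on different virtual
        # planes may use ports from both planes' port subsets.
        return (self.plane_ports[self.host_plane[src_host]]
                | self.plane_ports[self.host_plane[dst_host]])

fabric = Fabric()
fabric.assign_ports(1, ["sw1/p1", "sw2/p1"])   # first subset -> first virtual plane
fabric.assign_ports(2, ["sw1/p2", "sw2/p2"])   # second subset -> second virtual plane
fabric.assign_host("hostA", 1)
fabric.assign_host("hostB", 2)
usable = fabric.ports_for_transfer("hostA", "hostB")
```

A transfer from `hostA` to `hostB` may then draw on all four ports, spanning both virtual planes.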

Inventors

  • BRAR JAGWINDER SINGH
  • DAVID D. BECKER
  • SHETTY NITYANAND
  • KUNDU PARTHASARATHI

Assignees

  • Oracle International Corporation

Dates

Publication Date
2026-05-12
Application Date
2024-10-10
Priority Date
2023-10-13

Claims (20)

  1. A method, comprising: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associating a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be exclusively used for transmitting data packets from and to host machines associated with the first virtual plane; associating a second subset of the plurality of ports with a second virtual plane different from the first virtual plane; associating a first host machine with the first virtual plane and a second host machine with the second virtual plane, the first host machine directly coupled to a first switch of the plurality of switches and the second host machine directly coupled to a second switch of the plurality of switches, the second switch different from the first switch; and, for a first data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transferring the first data packet from the first GPU on the first host machine to the second GPU on the second host machine using ports in the first subset of ports and the second subset of ports, wherein the first data packet is processed by the first switch or the second switch to be transmitted on the second virtual plane instead of the first virtual plane.
  2. The method of claim 1, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
  3. The method of claim 2, further comprising: transmitting, by the first host machine, the first data packet to a third host machine associated with the first virtual plane, wherein the third host machine and the second host machine are directly coupled to the second switch, the first switch and the second switch are included in the first tier of switches, and wherein the first data packet is processed by the second switch for transmission to the second host machine on the second virtual plane.
  4. The method of claim 2, further comprising: processing, by the first switch, the first data packet for transmission on the second virtual plane to a third host machine associated with the second virtual plane; and transmitting, by the third host machine, the first data packet to the second host machine, wherein the third host machine and the first host machine are directly coupled to the first switch, the first switch and the second switch being included in the first tier of switches.
  5. The method of claim 1, wherein the first data packet is processed by the first switch or the second switch by (i) decapsulating the first data packet to extract information corresponding to a first header associated with the first virtual plane, and (ii) encapsulating the first data packet with information corresponding to a second header associated with the second virtual plane.
  6. The method of claim 1, wherein each of the first switch and the second switch is a VxLAN router.
  7. The method of claim 3, wherein each switch included in the first tier of switches is configured to transform data packets transmitted on the first virtual plane into data packets transmitted on the second virtual plane, and vice versa.
  8. The method of claim 2, wherein a subset of host machines included in the plurality of host machines are directly coupled to a first switch included in the first tier of switches, each host machine in the subset of host machines being associated with a different virtual plane.
  9. The method of claim 8, wherein the number of virtual planes supported by the network fabric corresponds to the number of host machines included in the subset of host machines that are directly coupled to the first switch included in the first tier of switches.
  10. One or more computer-readable non-transitory media storing computer-executable instructions that, when executed by one or more processors, cause: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associating a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be exclusively used for transmitting data packets from and to host machines associated with the first virtual plane; associating a second subset of the plurality of ports with a second virtual plane different from the first virtual plane; associating a first host machine with the first virtual plane and a second host machine with the second virtual plane, the first host machine directly coupled to a first switch of the plurality of switches and the second host machine directly coupled to a second switch of the plurality of switches, the second switch different from the first switch; and, for a first data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transferring the first data packet from the first GPU on the first host machine to the second GPU on the second host machine using ports in the first subset of ports and the second subset of ports, wherein the first data packet is processed by the first switch or the second switch to be transmitted on the second virtual plane instead of the first virtual plane.
  11. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 10, wherein the plurality of switches are arranged in a hierarchical structure comprising a first tier of switches, a second tier of switches, and a third tier of switches, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
  12. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 11, wherein the instructions further cause: transmitting, by the first host machine, the first data packet to a third host machine associated with the first virtual plane, wherein the third host machine and the second host machine are directly coupled to the second switch, the first switch and the second switch are included in the first tier of switches, and wherein the first data packet is processed by the second switch for transmission to the second host machine on the second virtual plane.
  13. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 11, wherein the instructions further cause: processing, by the first switch, the first data packet for transmission on the second virtual plane to a third host machine associated with the second virtual plane; and transmitting, by the third host machine, the first data packet to the second host machine, wherein the third host machine and the first host machine are directly coupled to the first switch, the first switch and the second switch being included in the first tier of switches.
  14. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 10, wherein the first data packet is processed by the first switch or the second switch by (i) decapsulating the first data packet to extract information corresponding to a first header associated with the first virtual plane, and (ii) encapsulating the first data packet with information corresponding to a second header associated with the second virtual plane.
  15. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 10, wherein each of the first switch and the second switch is a VxLAN router.
  16. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 12, wherein each switch included in the first tier of switches is configured to transform data packets transmitted on the first virtual plane to be transmitted on the second virtual plane, and vice versa.
  17. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 11, wherein a subset of host machines included in the plurality of host machines are directly coupled to a first switch included in the first tier of switches, each host machine in the subset of host machines being associated with a different virtual plane.
  18. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 17, wherein the number of virtual planes supported by the network fabric corresponds to a number of host machines included in the subset of host machines that are directly coupled to the first switch included in the first tier of switches.
  19. A computing device, comprising: one or more processors; and a memory comprising instructions that, when executed by the one or more processors, cause the computing device to at least: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associate a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be exclusively used for transmitting data packets from and to host machines associated with the first virtual plane; associate a second subset of the plurality of ports with a second virtual plane different from the first virtual plane; associate a first host machine with the first virtual plane and a second host machine with the second virtual plane, the first host machine directly coupled to a first switch of the plurality of switches and the second host machine directly coupled to a second switch of the plurality of switches, the second switch different from the first switch; and, for a first data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transfer the first data packet from the first GPU on the first host machine to the second GPU on the second host machine using ports in the first subset of ports and the second subset of ports, wherein the first data packet is processed by the first switch or the second switch to be transmitted on the second virtual plane instead of the first virtual plane.
  20. The computing device of claim 19, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
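Claim 5 describes moving a packet between virtual planes by decapsulating the header of the first plane and re-encapsulating with the header of the second. A minimal sketch of that swap, assuming VXLAN-style encapsulation where each plane maps to a distinct VNI (the `PLANE_VNI` values and the `VxlanPacket`/`move_to_plane` names are illustrative assumptions, not from the patent):

```python
from dataclasses import dataclass, replace

# Hypothetical per-plane VXLAN Network Identifiers; the patent does not
# specify concrete values, so these are placeholders for the sketch.
PLANE_VNI = {1: 10001, 2: 10002}

@dataclass(frozen=True)
class VxlanPacket:
    vni: int            # outer VXLAN header field identifying the virtual plane
    inner_frame: bytes  # original GPU-to-GPU payload, untouched by the swap

def move_to_plane(pkt: VxlanPacket, dst_plane: int) -> VxlanPacket:
    # (i) "Decapsulate": read and validate the outer header of the current plane.
    assert pkt.vni in PLANE_VNI.values(), "packet not on a known virtual plane"
    # (ii) Re-encapsulate with the header information of the destination plane;
    # the inner frame is carried over unchanged.
    return replace(pkt, vni=PLANE_VNI[dst_plane])

pkt = VxlanPacket(vni=PLANE_VNI[1], inner_frame=b"gpu-payload")
swapped = move_to_plane(pkt, dst_plane=2)
```

In this reading, a first-tier switch acting as a VxLAN router (claims 6 and 7) would perform exactly this header rewrite in both directions.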

Description

Interconnecting global virtual planes

Cross Reference to Related Applications

The present application claims the benefit of U.S. provisional application No. 63/590,269, filed on October 13, 2023, and U.S. provisional application No. 63/611,948, filed on December 19, 2023, each of which is incorporated herein by reference in its entirety for all purposes.

Technical Field

The present disclosure relates to a network infrastructure for performing artificial intelligence or machine learning workloads, such as Graphics Processing Unit (GPU) workloads.

Background

Organizations are continually moving business applications and databases to the cloud to reduce the cost of purchasing, updating, and maintaining on-premise hardware and software. High-performance computing applications typically consume all available computing power to achieve a particular outcome or result. Such applications require dedicated network performance, fast storage, high computing power, and large amounts of memory; these resources are in short supply in the virtualized infrastructure that constitutes today's commodity cloud. Cloud infrastructure service providers offer newer and faster Graphics Processing Units (GPUs) to address the requirements of these applications. GPU workloads are typically executed on one or more host machines. Such workloads often fail to reach the expected throughput level. One contributing factor is a lack of flow entropy, e.g., equal-cost multi-path (ECMP) flow entropy. With ECMP, multiple flows (e.g., from different host machines) may be hashed in a manner such that both flows traverse the same outgoing link/port of a switch. The fact that host machines (i.e., hosts) exchange traffic without regard to other hosts in their local network neighborhood exacerbates this problem.
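The ECMP flow-collision problem described in the background can be illustrated with a small sketch. The hash function and flow tuples below are assumptions chosen for illustration; real switches use hardware-specific hashes, but the collision behavior is the same: distinct flows can land on the same uplink.

```python
import hashlib

def ecmp_port(flow_5tuple, num_links):
    # A typical ECMP scheme hashes the flow's 5-tuple and selects an
    # outgoing link by modulo over the number of equal-cost uplinks.
    digest = hashlib.sha256(repr(flow_5tuple).encode()).hexdigest()
    return int(digest, 16) % num_links

# (src IP, dst IP, src port, dst port, protocol) for one RDMA-style flow.
base = ("10.0.0.1", "10.0.1.1", 49152, 4791, "UDP")
target_link = ecmp_port(base, 4)

# Search nearby source addresses for a second, distinct flow that hashes
# onto the same uplink -- a "flow collision" that halves usable bandwidth.
collision = next(
    f for f in ((f"10.0.0.{i}", "10.0.1.1", 49152, 4791, "UDP")
                for i in range(2, 50))
    if ecmp_port(f, 4) == target_link
)
```

With only 4 uplinks, a handful of simultaneous flows is enough to make such collisions likely, which is the bandwidth-contention problem the virtual-plane scheme is meant to avoid.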
Other types of workloads are typically performed by selecting one or more host machines from an infrastructure in a random (i.e., arbitrary) manner. In other words, the workload is performed without consideration of locality information (e.g., the physical location of the host machines), and the throughput of these workloads is therefore low. This situation typically leads to bandwidth contention problems, which are commonly referred to in the literature as congestion problems based on flow collisions. The embodiments discussed herein address these and other issues.

Disclosure of Invention

The present disclosure relates generally to a network infrastructure for executing Graphics Processing Unit (GPU) workloads. Various embodiments are described herein, including methods, systems, non-transitory computer-readable media storing programs, code, or instructions executable by one or more processors, and the like. These illustrative embodiments are mentioned not to limit or define the disclosure, but rather to provide examples to aid understanding of the disclosure. Additional embodiments are discussed in the detailed description section, and further description is provided there.
One embodiment of the present disclosure is directed to a method comprising, in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, the plurality of switches including a plurality of ports, each host machine of the plurality of host machines including one or more GPUs: associating a first subset of ports of the plurality of ports with a first virtual plane, the first virtual plane identifying a first set of resources to be dedicated to transmitting data packets from and to host machines associated with the first virtual plane; associating a second subset of ports of the plurality of ports with a second virtual plane different from the first virtual plane; associating a first host machine with the first virtual plane and a second host machine with the second virtual plane, the first host machine directly coupled to a first switch of the plurality of switches and the second host machine directly coupled to a second switch of the plurality of switches, the second switch different from the first switch; and, for a first data packet originating from a first GPU on the first host machine and destined for a second GPU on the second host machine, transferring the first data packet from the first GPU to the second GPU using ports in the first subset of ports and the second subset of ports, wherein the first data packet is processed by the first switch or the second switch to be transmitted on the second virtual plane instead of the first virtual plane.

An aspect of the present disclosure provides a computing device comprising one or more data processors, and a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein.

Another aspect of the present disclosure provides one or more computer-readable non-transitory media storing computer-ex