CN-122003838-A - Technique for handling overlay packets
Abstract
The network environment includes a plurality of host machines communicatively coupled to one another via a network fabric including a plurality of switches, which in turn include a plurality of ports. Each host machine includes one or more GPUs that execute a customer workload. Various approaches are described herein to address the problem of handling network overlay encapsulation without adversely affecting the performance of the workloads executing on the GPU cluster.
Inventors
- BRAR JAGWINDER SINGH
- DAVID D. BECKER
Assignees
- ORACLE INTERNATIONAL CORPORATION
Dates
- Publication Date: 2026-05-08
- Application Date: 2024-10-10
- Priority Date: 2023-10-13
Claims (20)
- 1. A method, comprising: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, receiving, by a source host machine, a first data packet from a first switch included in the plurality of switches, the first data packet indicating that congestion has occurred in the network fabric; configuring, by the source host machine, a network interface card associated with a GPU included in the source host machine in response to receiving the first data packet, the configuring comprising reducing a transmission rate of the GPU by a first predetermined amount; responsive to receiving, by the source host machine, a second data packet from the first switch, the second data packet indicating congestion in the network fabric, reconfiguring, by the source host machine, the network interface card associated with the GPU, the reconfiguring comprising further reducing the transmission rate of the GPU by a second predetermined amount; and in response to not receiving the second data packet, modifying, by the source host machine, the transmission rate of the GPU to correspond to an initial transmission rate of the GPU prior to receiving the first data packet.
- 2. The method of claim 1, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, wherein the plurality of host machines are directly coupled to switches included in the first tier, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
- 3. The method of claim 1, further comprising: responsive to receiving, by the source host machine, a third data packet from the first switch after the second data packet, the third data packet indicating an increase in congestion in the network fabric, reconfiguring, by the source host machine, the network interface card associated with the GPU by further reducing the transmission rate of the GPU by a third predetermined amount.
- 4. The method of claim 3, further comprising: responsive to receiving, by the source host machine, a fourth data packet from the first switch after the third data packet, the fourth data packet indicating a further increase in congestion in the network fabric, reconfiguring, by the source host machine, the network interface card associated with the GPU by further reducing the transmission rate of the GPU by a fourth predetermined amount.
- 5. The method of claim 1, wherein upon receiving the first data packet, the transmission rate is reduced by a first predetermined amount, the first predetermined amount being 1%.
- 6. The method of claim 1, wherein upon receiving the second data packet, the transmission rate is further reduced by a second predetermined amount, the second predetermined amount being 2%.
- 7. The method of claim 4, wherein the transmission rate is further reduced by a third predetermined amount upon receipt of the third data packet, the third predetermined amount being 5%, and the transmission rate is further reduced by 10% upon receipt of the fourth data packet.
- 8. The method of claim 2, wherein the first switch is included in the first tier of switches and the source host machine is directly coupled to the first switch.
- 9. The method of claim 1, further comprising: configuring a set of parameters associated with the first switch, the configuring comprising assigning a first threshold parameter associated with a number of data packets contained in a queue of the first switch to a first value, assigning a second threshold parameter associated with the queue of the first switch to a second value, and assigning a third parameter corresponding to a probability of marking to a third value, wherein the second value is greater than the first value.
- 10. The method of claim 9, wherein the first value is set to 63000, the second value is set to 80000, and the third value is set to 20%.
- 11. The method of claim 10, wherein the probability of marking increases linearly from 0% to 20% as the number of data packets contained in the queue of the first switch increases from the first value to the second value.
- 12. The method of claim 4, wherein each of the first data packet, the second data packet, the third data packet, and the fourth data packet is a congestion notification data packet transmitted by the destination host machine to the source host machine via the first switch.
- 13. One or more computer-readable non-transitory media storing computer-executable instructions that, when executed by one or more processors, cause: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, receiving, by a source host machine, a first data packet from a first switch included in the plurality of switches, the first data packet indicating that congestion has occurred in the network fabric; configuring, by the source host machine, a network interface card associated with a GPU included in the source host machine in response to receiving the first data packet, the configuring comprising reducing a transmission rate of the GPU by a first predetermined amount; responsive to receiving, by the source host machine, a second data packet from the first switch, the second data packet indicating congestion in the network fabric, reconfiguring, by the source host machine, the network interface card associated with the GPU, the reconfiguring comprising further reducing the transmission rate of the GPU by a second predetermined amount; and in response to not receiving the second data packet, modifying, by the source host machine, the transmission rate of the GPU to correspond to an initial transmission rate of the GPU prior to receiving the first data packet.
- 14. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 13, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, wherein the plurality of host machines are directly coupled to switches included in the first tier, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
- 15. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 13, further comprising: responsive to receiving, by the source host machine, a third data packet from the first switch after the second data packet, the third data packet indicating an increase in congestion in the network fabric, reconfiguring, by the source host machine, the network interface card associated with the GPU by further reducing the transmission rate of the GPU by a third predetermined amount.
- 16. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 15, further comprising: responsive to receiving, by the source host machine, a fourth data packet from the first switch after the third data packet, the fourth data packet indicating a further increase in congestion in the network fabric, reconfiguring, by the source host machine, the network interface card associated with the GPU by further reducing the transmission rate of the GPU by a fourth predetermined amount.
- 17. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 13, wherein upon receiving the first data packet, the transmission rate is reduced by a first predetermined amount, the first predetermined amount being 1%.
- 18. The one or more computer-readable non-transitory media storing computer-executable instructions of claim 13, wherein upon receiving the second data packet, the transmission rate is further reduced by a second predetermined amount, the second predetermined amount being 2%.
- 19. A computing device, comprising: one or more processors; and a memory comprising instructions that, when executed by the one or more processors, cause the computing device to at least: in a network environment including a plurality of host machines communicatively coupled to each other via a network fabric including a plurality of switches, receive, by a source host machine, a first data packet from a first switch included in the plurality of switches, the first data packet indicating that congestion has occurred in the network fabric; configure, by the source host machine, a network interface card associated with a GPU included in the source host machine in response to receiving the first data packet, the configuring comprising reducing a transmission rate of the GPU by a first predetermined amount; responsive to receiving, by the source host machine, a second data packet from the first switch, the second data packet indicating congestion in the network fabric, reconfigure, by the source host machine, the network interface card associated with the GPU by further reducing the transmission rate of the GPU by a second predetermined amount; and in response to not receiving the second data packet, modify, by the source host machine, the transmission rate of the GPU to correspond to an initial transmission rate of the GPU prior to receiving the first data packet.
- 20. The computing device of claim 19, wherein the plurality of switches are arranged in a hierarchy comprising a first tier of switches, a second tier of switches, and a third tier of switches, wherein the plurality of host machines are directly coupled to switches included in the first tier, and wherein the second tier of switches communicatively couples the first tier of switches to the third tier of switches.
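The staged rate control recited in claims 1 and 3 through 7 follows a simple pattern: each successive congestion notification received from the first-tier switch reduces the GPU's transmission rate by a slightly larger predetermined amount (1%, then 2%, then 5%, then 10%), and the rate returns to its initial value once no further notification arrives. The following Python sketch only illustrates that pattern; the class and method names, and the idea of an explicit "congestion cleared" callback, are assumptions made for readability and do not come from the disclosure.

```python
# Sketch of the staged rate reduction in claims 1 and 3-7 (illustrative only).
# Each congestion notification packet (CNP) triggers a progressively larger
# reduction; when no further CNP arrives, the rate is restored to its initial value.

REDUCTION_STEPS = [0.01, 0.02, 0.05, 0.10]  # 1%, 2%, 5%, 10% per claims 5-7


class NicRateController:
    def __init__(self, initial_rate_gbps: float):
        self.initial_rate = initial_rate_gbps
        self.current_rate = initial_rate_gbps
        self.cnp_count = 0  # consecutive congestion notifications observed

    def on_congestion_notification(self) -> float:
        """Apply the next predetermined reduction to the GPU's transmission rate."""
        step = REDUCTION_STEPS[min(self.cnp_count, len(REDUCTION_STEPS) - 1)]
        self.current_rate *= 1.0 - step
        self.cnp_count += 1
        return self.current_rate

    def on_congestion_cleared(self) -> float:
        """Restore the initial transmission rate once no further CNP is received."""
        self.current_rate = self.initial_rate
        self.cnp_count = 0
        return self.current_rate


nic = NicRateController(initial_rate_gbps=100.0)
for _ in range(4):
    print(f"rate after CNP: {nic.on_congestion_notification():.2f} Gbps")
print(f"rate after congestion clears: {nic.on_congestion_cleared():.2f} Gbps")
```

The contrast with the roughly 50% cut described in the Background section is the point: small, staged reductions let the NIC react to congestion signals caused by the slight encapsulation overhead without collapsing RDMA throughput.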
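Claims 9 through 11 describe how the first switch's marking behavior is parameterized: no packets are marked while the queue holds fewer than 63,000 packets, and the marking probability rises linearly toward 20% as the occupancy approaches 80,000 packets. The sketch below models that ramp; the function name and the assumption that every packet is marked above the upper threshold (a common WRED-style convention, not recited in the claims) are illustrative only.

```python
# Illustrative model of the marking-probability parameters in claims 9-11.
K_MIN = 63_000   # first threshold: queue occupancy below which nothing is marked
K_MAX = 80_000   # second threshold: upper end of the linear ramp
P_MAX = 0.20     # maximum marking probability on the ramp (20%)


def marking_probability(queue_occupancy: int) -> float:
    """Probability that a packet is marked at the given queue occupancy (in packets)."""
    if queue_occupancy <= K_MIN:
        return 0.0
    if queue_occupancy >= K_MAX:
        # Behavior above the second threshold is not recited in the claims;
        # marking every packet is assumed here, as in typical WRED/ECN setups.
        return 1.0
    # Linear ramp from 0% to P_MAX between the two thresholds (claim 11).
    return P_MAX * (queue_occupancy - K_MIN) / (K_MAX - K_MIN)


for occupancy in (50_000, 63_000, 71_500, 79_000, 85_000):
    print(f"queue={occupancy:>6} packets -> marking probability {marking_probability(occupancy):.3f}")
```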
Description
Technique for handling overlay packets

Cross Reference to Related Applications
The present application claims the benefit of U.S. provisional application No. 63/590,269, filed on October 13, 2023, and U.S. provisional application No. 63/611,948, filed on December 19, 2023, each of which is incorporated herein by reference in its entirety for all purposes.

Technical Field
The present disclosure relates to a network infrastructure for performing artificial intelligence or machine learning workloads, such as Graphics Processing Unit (GPU) workloads.

Background
Organizations are continually moving business applications and databases to the cloud to reduce the cost of purchasing, updating, and maintaining on-premise hardware and software. High performance computing applications often consume all available computing power to achieve a particular outcome or result. Such applications require dedicated network performance, fast storage, high computing power, and large amounts of memory; these resources are in short supply in the virtualized infrastructure that constitutes today's commodity cloud. Cloud infrastructure service providers offer newer and faster Graphics Processing Units (GPUs) to address the requirements of these applications.

GPU workloads are typically executed on one or more host machines. Because the network fabric that includes the GPUs supports multiple customers, i.e., it is multi-tenant, strong traffic isolation between customers is desirable. This is typically achieved via encapsulation, in which metadata is added to each data packet to uniquely identify each customer's traffic. Encapsulation, however, has an adverse effect on throughput. By itself, encapsulation has only a small (typically less than 1%) impact on the throughput of customer traffic. Ultra-high performance RDMA (remote direct memory access) services, for which performance is critical, may nonetheless be negatively impacted by such encapsulation. RDMA workloads use congestion control protocols (such as DC-QCN) to detect network congestion and respond to it by aggressively reducing throughput. Because the small encapsulation overhead slightly reduces throughput, RDMA services erroneously treat it as an indication of congestion in the network and cut throughput drastically (e.g., by about 50%). This means that a very small amount of encapsulation overhead can significantly reduce the throughput of RDMA services. The embodiments discussed herein address these and other issues.

Disclosure of Invention
The present disclosure relates generally to a network infrastructure for executing Graphics Processing Unit (GPU) workloads. Various embodiments are described herein, including methods, systems, non-transitory computer-readable media storing programs, code, or instructions executable by one or more processors, and the like. These illustrative embodiments are mentioned not to limit or define the disclosure, but rather to provide examples to aid its understanding. Additional embodiments are discussed in the detailed description section, and further description is provided there.
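To put rough numbers on the mismatch described above, the arithmetic below compares the overhead that encapsulation adds to each packet with the rate cut that a DC-QCN-style response imposes. The packet and header sizes are assumed for illustration; only the "less than 1%" and "about 50%" figures come from the passage above.

```python
# Back-of-the-envelope comparison (illustrative sizes, not from the disclosure).
payload_bytes = 8192        # assumed RDMA payload per packet
overlay_header_bytes = 50   # assumed overlay (encapsulation) header per packet

overhead = overlay_header_bytes / (payload_bytes + overlay_header_bytes)
print(f"encapsulation overhead: {overhead:.2%}")   # ~0.6%, i.e. well under 1%

# A DC-QCN-style reaction to the resulting congestion signals cuts the rate drastically.
dcqcn_cut = 0.50
print(f"throughput after a misattributed congestion event: {1 - dcqcn_cut:.0%} of line rate")
```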
One embodiment of the present disclosure is directed to a method that includes, in a network environment including a plurality of host machines communicatively coupled to one another via a network fabric including a plurality of switches, receiving, by a source host machine, a first data packet from a first switch included in the plurality of switches, the first data packet indicating congestion in the network fabric; configuring, by the source host machine, a network interface card associated with a GPU included in the source host machine in response to receiving the first data packet, the configuring including reducing a transmission rate of the GPU by a first predetermined amount; reconfiguring, by the source host machine, the network interface card associated with the GPU in response to receiving a second data packet from the first switch, the second data packet indicating congestion in the network fabric, the reconfiguring including further reducing the transmission rate of the GPU by a second predetermined amount; and modifying, by the source host machine, in response to not receiving the second data packet, the transmission rate of the GPU to correspond to an initial transmission rate of the GPU prior to receiving the first data packet. An aspect of the present disclosure provides a computing device comprising one or more data processors and a non-transitory computer-readable storage medium containing instructions that, when executed on the one or more data processors, cause the one or more data processors to perform part or all of one or more methods disclosed herein. Another aspect of the present disclosure provides one or more computer-readable non-transitory media storing computer-executable instructions that, wh