US-12621234-B2 - Distributed application call path performance analysis

US 12621234 B2

Abstract

In general, techniques are described for managing a distributed application based on call paths among the multiple services of the distributed application that traverse underlying network infrastructure. In an example, a method comprises determining, by a computing system, and for a distributed application implemented with a plurality of services executing on compute nodes interconnected by a network, a call path from an entry endpoint service of the plurality of services to a terminating endpoint service of the plurality of services; determining, by the computing system, a corresponding network path for each pair of adjacent services from a plurality of pairs of services that communicate for the call path; and based on a performance indicator for a network device of the corresponding network path meeting a threshold, performing, by the computing system, one or more of: reconfiguring the network; or redeploying one of the plurality of services to a different compute node of the compute nodes.
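The abstract's method can be illustrated with a minimal sketch. All names here (`Hop`, `Topology`, the latency threshold, the remediation strings) are illustrative assumptions, not the patent's actual implementation: the sketch walks adjacent service pairs of a call path, checks a per-device performance indicator against a threshold, and suggests one of the claimed remediations.

```python
# Hypothetical sketch of the claimed method; all names and the threshold
# value are assumptions for illustration, not the patented implementation.
from dataclasses import dataclass, field


@dataclass
class Hop:
    device: str          # network device (e.g., ToR or chassis switch) on the path
    latency_ms: float    # observed performance indicator for this device


@dataclass
class Topology:
    # maps a (caller service, callee service) pair to the network hops
    # traversed by calls between that pair of adjacent services
    paths: dict = field(default_factory=dict)

    def network_path(self, caller: str, callee: str) -> list:
        return self.paths.get((caller, callee), [])


LATENCY_THRESHOLD_MS = 50.0  # assumed threshold for the performance indicator


def analyze_call_path(call_path, topology):
    """Walk adjacent service pairs of a call path; flag devices whose
    performance indicator meets the threshold and suggest a remediation."""
    actions = []
    for caller, callee in zip(call_path, call_path[1:]):
        for hop in topology.network_path(caller, callee):
            if hop.latency_ms >= LATENCY_THRESHOLD_MS:
                # per the claims, either reconfigure the network or
                # redeploy a service to a different compute node
                actions.append(
                    (hop.device, f"redeploy {callee} or reroute around {hop.device}")
                )
    return actions
```

For example, given a call path `["gateway", "checkout"]` whose network path traverses a fast top-of-rack switch and a slow chassis switch, only the chassis switch would be flagged for remediation.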

Inventors

  • Tarun Banka
  • Rahul Gupta
  • Mithun Chakaravarrti Dharmaraj
  • Amandeep Chauhan
  • Thayumanavan Sridhar
  • Raj Yavatkar

Assignees

  • JUNIPER NETWORKS, INC.

Dates

Publication Date
2026-05-05
Application Date
2023-09-29

Claims (20)

  1. A computing system, comprising: memory; and processing circuitry in communication with the memory, and configured to: determine, for a distributed application implemented with a plurality of services executing on compute nodes interconnected by a network of network devices, a call path from an entry endpoint service of the plurality of services to a terminating endpoint service of the plurality of services; determine a corresponding network path traversed by corresponding calls for each pair of one or more pairs of adjacent services from a plurality of pairs of services that communicate for the call path; and for a pair of the one or more pairs of adjacent services, based on a performance indicator for a network device, of the network devices, of the corresponding network path traversed by the corresponding calls for the pair of the one or more pairs of adjacent services meeting a threshold, perform one or more of: reconfigure the network; or redeploy one of the plurality of services to a different compute node of the compute nodes.
  2. The computing system of claim 1, wherein to determine the corresponding network path, the processing circuitry is configured to: determine, for the pair of the one or more pairs of adjacent services, a source IP address and a destination IP address, wherein the source IP address is the IP address of a caller service of the pair of the one or more pairs of adjacent services, and wherein the destination IP address is the IP address of a callee service of the pair of the one or more pairs of adjacent services; and correlate, based on the source IP address and the destination IP address, a flow to the corresponding network path.
  3. The computing system of claim 2, wherein to correlate the flow to the corresponding network path, the processing circuitry is configured to determine, based on flow data, one or more of the network devices that have processed flows that include the source IP address and the destination IP address.
  4. The computing system of claim 2, wherein the source IP address is one of a physical IP address for a compute node that hosts the caller service or a virtual IP address for a workload for the caller service.
  5. The computing system of claim 1, wherein the network devices comprise at least one top-of-rack switch and at least one chassis switch.
  6. The computing system of claim 1, wherein to reconfigure the network, the processing circuitry is further configured to: identify a replacement service for either service of the pair of the one or more pairs of adjacent services, wherein the replacement service is another instance of a first service of the pair of the one or more pairs of adjacent services executing on a different compute node; and reconfigure the distributed application to execute using the replacement service.
  7. The computing system of claim 1, wherein the call path is a first call path, wherein the entry endpoint is a first entry endpoint, wherein the terminating endpoint is a first terminating endpoint, and wherein the processing circuitry is further configured to: based on determining that no performance indicators for the network device meet the threshold, determine, for the distributed application implemented with the plurality of services, a second call path from a second entry endpoint service of the plurality of services to a second terminating endpoint service of the plurality of services; determine a second corresponding network path traversed by corresponding calls for a different pair of the one or more pairs of adjacent services from the plurality of pairs of services that communicate for the second call path; identify a second network device on the second corresponding network path; and based on a performance indicator for the second network device meeting the threshold, perform one or more of: reconfigure the network; or redeploy one of the plurality of services to a different compute node of the compute nodes.
  8. The computing system of claim 1, wherein the performance indicator is one or more of: a packet loss rate, a transmission time, a resource utilization, or a latency.
  9. The computing system of claim 1, wherein the call path is a call path of a plurality of call paths for the distributed application that has a highest end-to-end latency of the plurality of call paths.
  10. A method comprising: determining, by a computing system, for a distributed application implemented with a plurality of services executing on compute nodes interconnected by a network of network devices, a call path from an entry endpoint service of the plurality of services to a terminating endpoint service of the plurality of services; determining, by the computing system, a corresponding network path traversed by corresponding calls for each pair of one or more pairs of adjacent services from a plurality of pairs of services that communicate for the call path; and for a pair of the one or more pairs of adjacent services, based on a performance indicator for a network device of the network devices, of the corresponding network path traversed by the corresponding calls for the pair of the one or more pairs of adjacent services meeting a threshold, performing, by the computing system, one or more of: reconfiguring the network; or redeploying one of the plurality of services to a different compute node of the compute nodes.
  11. The method of claim 10, wherein determining a corresponding network path further comprises: determining, for the pair of the one or more pairs of adjacent services, a source IP address and a destination IP address, wherein the source IP address is the IP address of a caller service of the pair of the one or more pairs of adjacent services, and wherein the destination IP address is the IP address of a callee service of the pair of the one or more pairs of adjacent services; and correlating, based on the source IP address and the destination IP address, a flow to the corresponding network path.
  12. The method of claim 11, wherein correlating the flow to the corresponding network path further comprises determining, based on flow data, one or more of the network devices that have processed flows that include the source IP address and the destination IP address.
  13. The method of claim 12, wherein the source IP address is one of a physical IP address for a compute node that hosts the caller service or a virtual IP address for a workload for the caller service.
  14. The method of claim 10, wherein the network devices comprise at least one top-of-rack switch and at least one chassis switch.
  15. The method of claim 10, wherein reconfiguring further comprises: identifying a replacement service for either service of the pair of the one or more pairs of adjacent services, wherein the replacement service is another instance of a first service of the pair of the one or more pairs of adjacent services executing on a different compute node; and reconfiguring the distributed application to execute using the replacement service.
  16. The method of claim 15, wherein the call path is a first call path, wherein the entry endpoint is a first entry endpoint, wherein the terminating endpoint is a first terminating endpoint, and further comprising: based on determining that no performance indicators for the network device meet the threshold, determining, by the computing system and for the distributed application implemented with the plurality of services, a second call path from a second entry endpoint service of the plurality of services to a second terminating endpoint service of the plurality of services; determining, by the computing system, a second corresponding network path traversed by corresponding calls for a different pair of the one or more pairs of adjacent services from the plurality of pairs of services that communicate for the second call path; identifying, by the computing system, a second network device on the second corresponding network path; and based on a performance indicator for the second network device meeting a threshold, performing one or more of: reconfiguring the network; or redeploying one of the plurality of services to a different compute node of the compute nodes.
  17. The method of claim 10, wherein the performance indicator is one or more of: a packet loss rate, a transmission time, a resource utilization, or a latency.
  18. The method of claim 10, wherein the call path is a call path of a plurality of call paths for the distributed application that has a highest end-to-end latency of the plurality of call paths.
  19. Non-transitory computer-readable storage media comprising instructions that, when executed, cause one or more processors to: determine, for a distributed application implemented with a plurality of services executing on compute nodes interconnected by a network of network devices, a call path from an entry endpoint service of the plurality of services to a terminating endpoint service of the plurality of services; determine a corresponding network path traversed by corresponding calls for each pair of one or more pairs of adjacent services from a plurality of pairs of services that communicate for the call path; and for a pair of the one or more pairs of adjacent services, based on a performance indicator for a network device, of the network devices, of the corresponding network path traversed by the corresponding calls for the pair of the one or more pairs of adjacent services meeting a threshold, perform one or more of: reconfigure the network; or redeploy one of the plurality of services to a different compute node of the compute nodes.
  20. The non-transitory computer-readable storage media of claim 19, wherein the instructions further cause the one or more processors to: determine, for the pair of the one or more pairs of adjacent services, a source IP address and a destination IP address, wherein the source IP address is the IP address of a caller service of the pair of the one or more pairs of adjacent services, and wherein the destination IP address is the IP address of a callee service of the pair of the one or more pairs of adjacent services; and correlate, based on the source IP address and the destination IP address, a flow to the corresponding network path.
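The flow-correlation step recited in claims 2–3 (and mirrored in claims 11–12 and 20) can be sketched as follows. The flow-record shape and field names here are assumptions for illustration (e.g., NetFlow/IPFIX-style records exported per device), not the patent's data model: the sketch returns the network devices that processed flows matching the caller's source IP address and the callee's destination IP address.

```python
# Illustrative sketch of claims 2-3: correlating a service pair's flow to the
# network devices that processed it, keyed on source/destination IP addresses.
# Record fields ("device", "src_ip", "dst_ip") are assumed names.


def correlate_flow_to_path(src_ip, dst_ip, flow_records):
    """Return, in order of first appearance, the network devices whose flow
    records match the caller's (source) and callee's (destination) IPs."""
    devices = []
    for record in flow_records:
        if record["src_ip"] == src_ip and record["dst_ip"] == dst_ip:
            if record["device"] not in devices:  # de-duplicate repeated exports
                devices.append(record["device"])
    return devices
```

Per claim 4, the source IP address passed in could be either the physical IP of the compute node hosting the caller service or a virtual IP of the caller's workload; the correlation logic is the same in either case.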

Description

TECHNICAL FIELD

The disclosure relates to computing systems and, more specifically, to managing distributed applications operating over a network.

BACKGROUND

Computer networks have become ubiquitous, and the number of network applications, network-connected devices, and types of network-connected devices is rapidly expanding. Such devices now include computers, smart phones, Internet-of-Things (IoT) devices, cars, medical devices, factory equipment, etc. An end-user network-connected device typically cannot directly access a public network such as the Internet. Instead, an end-user network device establishes a network connection with an access network, and the access network communicates with a core network that is connected to one or more packet data networks (PDNs) offering services. There are several different types of access networks currently in use. Examples include Radio Access Networks (RANs) that are access networks for 3rd Generation Partnership Project (3GPP) networks, trusted and untrusted non-3GPP networks such as Wi-Fi or WiMAX networks, and fixed/wireline networks such as Digital Subscriber Line (DSL), Passive Optical Network (PON), and cable networks. The core network may be that of a mobile service provider network, such as a 3G, 4G/LTE, or 5G network.

In a typical cloud data center environment, there is a large collection of interconnected servers that provide computing and/or storage capacity to run various applications. For example, a data center may comprise a facility that hosts applications and services for subscribers, i.e., customers of the data center. The data center may, for example, host all of the infrastructure equipment, such as networking and storage systems, redundant power supplies, and environmental controls. In a typical data center, clusters of storage systems and application servers are interconnected via a high-speed switch fabric provided by one or more tiers of physical network switches and routers.
More sophisticated data centers provide infrastructure spread throughout the world with subscriber support equipment located in various physical hosting facilities. Virtualized data centers are becoming a core foundation of the modern information technology (IT) infrastructure. In particular, modern data centers have extensively utilized virtualized environments in which virtual hosts, also referred to herein as virtual execution elements or workloads, such as virtual machines or containers, are deployed and executed on an underlying compute platform of physical computing devices. Workloads may also include bare metal processes. Virtualization within a data center can provide several advantages. One advantage is that virtualization can provide significant improvements to efficiency. As the underlying physical computing devices (i.e., servers) have become increasingly powerful with the advent of multicore microprocessor architectures with a large number of cores per physical central processing unit (CPU), virtualization becomes easier and more efficient. A second advantage is that virtualization provides significant control over the computing infrastructure. As physical computing resources become fungible resources, such as in a cloud-based computing environment, provisioning and management of the computing infrastructure becomes easier. Thus, enterprise IT staff often prefer virtualized compute clusters in data centers for their management advantages in addition to the efficiency and increased return on investment (ROI) that virtualization provides.

Containerization is a virtualization scheme based on operating system-level virtualization. Containers are lightweight and portable workloads for applications that are isolated from one another and from the host.
Because containers are not tightly coupled to the host hardware computing environment, an application can be tied to a container image and executed as a single lightweight package on any host or virtual host that supports the underlying container architecture. As such, containers address the problem of how to make software work in different computing environments. Containers offer the promise of running consistently from one computing environment to another, virtual or physical. Given containers' inherently lightweight nature, a single host can often support many more container instances than traditional virtual machines (VMs). These systems are characterized by being dynamic and ephemeral, as hosted services can be quickly scaled up or adapted to new requirements. Often short-lived, containers can be created and moved more efficiently than VMs, and they can also be managed as groups of logically related elements (sometimes referred to as "pods" for some orchestration platforms, e.g., Kubernetes). These container characteristics impact the requirements for container networking solutions: the network should be agile and scalable. VMs, containers, and bare metal servers may need to coexist in the same c