Employees of a NetCraftsmen client reported highly variable application performance, particularly for file transfers between two global locations. The problem was intermittent: performance was acceptable at some times and very poor at others. Productivity was impacted across all applications that operated between the two sites. It was very frustrating for the staff. Sometimes operations would complete quickly and other times those same operations would seem to take forever.
A detailed analysis of the reports showed that the problem occurred only between those two sites and that it affected all applications and users. The two sites were connected over an internet VPN, which suggested that the problem was likely to be somewhere on the internet. Network diagnostic tools like ping and traceroute could not identify the cause. We needed something that provided more detailed diagnostics and analysis.
NetCraftsmen decided to implement a WAN diagnostic system that would run continuously to capture evidence of the problem. It had to analyze network performance between servers in data centers at the two sites as well as to the enterprise’s staff endpoints (laptops and tablets). The diagnostics had to capture performance data from the internet to the servers in each data center individually, so that we could identify whether the problem was affecting only one data center or both. We also needed to gather data on the performance between staff endpoints and the data centers, so that we could identify problems with those network paths.
The diagnosis needed to collect data on paths within and between data centers, paths from the internet to each data center, and paths from staff endpoints to data centers. In particular, it needed visibility on a hop-by-hop basis. Then the data had to be correlated between the monitored paths to identify the performance problem’s location.
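The correlation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Catchpoint or our analysis actually works: the function name `suspect_hops`, the hop labels, and the data layout (a list of hops plus a degraded flag per monitored path) are all assumptions made for the example. The reasoning is that a hop present on every degraded path and absent from every healthy path is the most likely location of the problem.

```python
def suspect_hops(paths):
    """Return hops that appear on every degraded path and on no healthy path.

    `paths` is a list of (hops, degraded) tuples, where `hops` is a list of
    hop identifiers and `degraded` is True if the path showed the problem.
    This layout is hypothetical and chosen only for the illustration.
    """
    degraded = [set(hops) for hops, bad in paths if bad]
    healthy = [set(hops) for hops, bad in paths if not bad]
    if not degraded:
        return set()
    common_to_degraded = set.intersection(*degraded)
    seen_on_healthy = set().union(*healthy)
    return common_to_degraded - seen_on_healthy


# Illustrative data only; the hop names are invented.
example_paths = [
    (["ep-1", "isp-a", "ix-1", "ix-2", "dc-east"], True),
    (["probe-1", "isp-b", "ix-2", "dc-east"], False),
    (["dc-west", "isp-a", "ix-1", "ix-2", "dc-east"], True),
    (["ep-2", "isp-a", "dc-west"], False),
]

print(suspect_hops(example_paths))
```

With this sample data the only hop shared by both degraded paths and missing from both healthy paths is `ix-1`. Real measurements are noisier, so in practice the result would be a starting point for investigation rather than a verdict.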
NetCraftsmen worked with a number of vendors with products capable of decoding modern application delivery chains and chose to build a DX offering using Catchpoint instrumentation.
Catchpoint has lightweight agents that are loaded on the data center servers and staff endpoints. Catchpoint also has internet-located data collection nodes that could provide visibility into internet performance to each data center as well as with staff endpoints. A cloud-based management system makes it easy to configure all the tests and to correlate the results.
We started the analysis on a Monday morning and let it run for 48 hours. By Wednesday morning, we had identified significant packet loss every 10-15 minutes on one link within an internet exchange carrier site. The analysis found that Border Gateway Protocol (BGP) network path information was changing periodically and that there was high packet loss whenever the path transitioned with one ISP. The client’s direct ISP endpoint paths were fine. The point of packet loss was upstream from the client’s ISP. Further investigation tied it to BGP interactions between upstream providers that affected the client’s address space. Our client had no contracts with those providers and no way to enforce an SLA with them.
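The link between BGP path changes and packet loss can be checked with a simple time correlation. The sketch below is a hedged illustration only: the function names, the 60-second window, the 5% loss threshold, and the sample data are all assumptions, and the AS numbers come from the range reserved for documentation. It finds the times at which the observed AS path changed, then measures what fraction of high-loss samples fall close to one of those changes.

```python
def path_change_times(observations):
    """Return timestamps at which the AS path differs from the previous one.

    `observations` is a time-ordered list of (timestamp_seconds, as_path)
    tuples, where `as_path` is a tuple of AS numbers (hypothetical layout).
    """
    changes = []
    for (_, previous), (t, current) in zip(observations, observations[1:]):
        if current != previous:
            changes.append(t)
    return changes


def loss_near_changes(loss_samples, changes, window=60, threshold=5.0):
    """Fraction of high-loss samples within `window` seconds of a path change.

    `loss_samples` is a list of (timestamp_seconds, loss_percent) tuples.
    The window and threshold defaults are arbitrary example values.
    """
    high = [t for t, loss in loss_samples if loss >= threshold]
    if not high:
        return 0.0
    near = [t for t in high if any(abs(t - c) <= window for c in changes)]
    return len(near) / len(high)


# Illustrative data only; AS numbers are from the documentation range.
observations = [
    (0, (64500, 64501, 64510)),
    (600, (64500, 64502, 64510)),
    (1200, (64500, 64501, 64510)),
    (1800, (64500, 64501, 64510)),
]
loss_samples = [
    (0, 0.0), (590, 12.0), (620, 20.0),
    (900, 0.5), (1210, 15.0), (1500, 0.0),
]

changes = path_change_times(observations)
print(changes, loss_near_changes(loss_samples, changes))
```

In this sample every high-loss measurement falls within a minute of a path change, which is the kind of pattern that points at route instability rather than a congested or faulty link.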
To protect the client’s privacy, the screen captures below use an example from Catchpoint. Figure 1 shows the view from a number of backbone nodes into a sample data center to illustrate the process.
Once we had evidence of the packet drops, we used the tool to take a BGP Autonomous System (AS) view and identified the ISP whose flapping on the client’s IP routes was causing the issue.
Again, for reasons of confidentiality the data shown is just for illustration.
The evidence was clear. Our analysis provided enough information for the client’s ISP to identify the problem. The ISP took about a week to achieve full resolution with its upstream peer.
Soon after, our client reported that the problem was resolved.
NetCraftsmen is adept at solving challenging and complex network problems. Bring us your lingering and challenging network problems. We’ll help you resolve them so that you can Rest Assured®.