I’ve done a couple of application slowness (brownout) troubleshooting sessions recently.
This blog is my attempt to condense some observations from both engagements, to share lessons learned. “Condense” might not be the right word, seeing how long this blog got!
Troubleshooting with some process awareness can help! I have a troubleshooting process. Sometimes I do all the aspects, sometimes I do the short version. I think it helps!
As I was drafting this blog, ipspace.net posted a very comprehensive blog on the same topic. Recommended! Darn, Ivan beat me to print again!
I usually start with a diagram. I’ve learned that the troubleshooting team and I may make unwarranted assumptions, miss aspects of the topology, etc. I’ve learned to ask “and what’s not in the diagram?” because I want to know about every device and link, not the simplified abstraction that usually gets diagrammed. I’ve also seen situations where the network diagram shows only network devices, the security diagram shows only security devices, and I need some help figuring out the One Diagram to Rule Them All.
The diagram should cover where the affected users are, where the application in question is running, and where the key services are, all in relation to the network. If the problem is an app that has gone slow for everyone, then I want to know about services and internal flows in the application and include all of them in the diagram. The key thing here is to make the diagram completely cover all network traffic. So, for instance, “write to disk” might turn out to involve the network when the disk being written to is an NFS file share. Don’t assume “that’s just storage”!
Gathering application flow information, which is almost never documented, can take considerable time and multiple iterations, so I usually hold off doing it in detail unless the problem is clearly the application and not the network. That is, I try to check out and eliminate what I / we can do easily first (low hanging fruit), while setting the slower info gathering in motion.
Diagramming makes sure all those involved (especially me, the visiting consultant) have a clear idea of the relevant parts of the network. It saves locking into any assumptions prematurely. It also helps later when you’re stumped and need to revisit where you might have missed something.
List Possible Causes
After building the diagram (or first draft), I go through it and list the things that can be a cause in a broad sense: user PC, user network connection, access to distribution switch uplink, etc. They go into rows in a spreadsheet. This gives me a framework to summarize scoping, test results, and other information. It makes sure I / the team don’t focus on any one thing too soon. It helps the team divide up checking various items.
Document It as You Go
I’ve repeated work too many times, going back to look more closely at something, or going over it with someone else. That costs time. So, I document raw data as I go: good notes, and save screen or CLI output, etc. That takes time but is very useful if you later want to re-check what you did or what you saw.
I usually put the info and captures into a folder. When the items are not too verbose, or are screen captures, I use a Word document with section headers. Using section headers together with the Word Navigation pane helps you pull up the data quickly when needed.
Note: save screen caps separately in a file folder as well, because Word reduces image resolution. It’s painful to go back later, try to zoom in, and realize the resolution isn’t there anymore.
This also helps spot things I (or often, the local network staff) don’t know or have assumed about the application in question. Depending on how important the gaps seem, some can need immediate resolution, others can be postponed for resolution only if they start seeming more important.
Scoping means establishing what’s affected and what’s not affected. Most of us do this implicitly anyway, but it helps to do it consciously.
In a broader sense, it’s always useful to think about what you know, what you can eliminate as a possible cause, or where you should focus.
I like documenting this in the possible “causes” spreadsheet. I usually put a column in for “how do I know this” because sometimes you find that you don’t really know something — or communicated information might be vague and inconclusive. That’s why when someone presents me with a conclusion, I tend to ask “and how do you know that?” or “why do you think that?”
Sometimes I add a column to the spreadsheet for priority: 5 = top, 1 = low priority, 0 = clean / not a problem. Excel automated color coding (5 = red, 0 = green) can help, although inserting / deleting spreadsheet rows messes that up (hint to Microsoft: poorly coded feature!).
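A minimal sketch of that “possible causes” sheet, kept as CSV so it opens in Excel. The cause names echo the examples above; the column names, the evidence text, and the priority values are hypothetical and only there to illustrate the idea of tracking “how do I know this” next to each cause.

```python
import csv
import io

# Hypothetical column layout; adjust to whatever your own sheet uses.
FIELDS = ["possible_cause", "status", "how_do_i_know", "priority"]

# Illustrative rows only: the evidence and priorities are invented.
ROWS = [
    {"possible_cause": "user PC", "status": "suspect",
     "how_do_i_know": "site staff report old hardware", "priority": 5},
    {"possible_cause": "user network connection", "status": "unknown",
     "how_do_i_know": "", "priority": 3},
    {"possible_cause": "access to distribution switch uplink", "status": "clean",
     "how_do_i_know": "interface counters checked", "priority": 0},
]


def unverified(rows):
    """Return causes with no recorded evidence (the 'how do you know that?' gaps)."""
    return [r["possible_cause"] for r in rows if not r["how_do_i_know"].strip()]


def to_csv(rows):
    """Render rows as CSV text, highest priority first (5 = top, 0 = clean)."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(sorted(rows, key=lambda r: r["priority"], reverse=True))
    return out.getvalue()


if __name__ == "__main__":
    print(to_csv(ROWS))
    print("Needs evidence:", unverified(ROWS))
```

Listing the rows that have no evidence recorded is a quick way to see which “facts” are really still assumptions.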
Problem: doctors’ offices reached a hospital-based EPIC system by going via their business HQ, and users were reporting slowness.
Scoping: We knew that most sites were not complaining, but two had users that were experiencing slowness. That told us a couple of things up front, maybe not definitively, but well enough for first-cut elimination of some possible causes.
Verbal diagram: All sites were connected to a WAN, which connected back to HQ. Most sites’ Internet was also via HQ. So, application and Internet traffic were competing for WAN bandwidth.
HQ had a separate point-to-point connection back to the EPIC provider.
The user-based scoping information eliminated HQ, the link to the EPIC provider, and the EPIC provider’s network as likely causes. At least, it pointed that way. I’d consider this to be about 70% proven, given the evidence was purely anecdotal.
Possible causes: user workstations, user site LAN connection, or user site WAN connection.
Further data from site staff: after review, the users in question had rather old computer hardware.
I’ll note that one problem with problem reports from users is that there is usually a good bit of delay before it gets to the helpdesk and percolates to you. It can also be vague, e.g. as to when the problem started occurring, and / or stopped.
Even in this particular case study, there’s the question whether anyone else was trying to use the app at the same time a couple of people were experiencing slowness. It is all subjective evidence. That can make it hard to correlate with link utilization spikes, etc.
Lesson Learned: Train users (gently) to note down the time of onset and the time when things improved (if they improved), and report those. That can help you see if their problems matched up with other data. (Think about journalism’s “5 W’s and How”: “Who, what, when, where, how, and why.”)
Getting Hard Evidence
The EPIC provider staff had done something clever: EPIC provided centralized printing, meaning outbound traffic to printers at the “customer” site was allowed through the firewalls in the path. So, the staff set up smokeping to poll two printers at each customer site. In advance. Visibility for the win!
Where that trick isn’t feasible, tools like Appneta, Netbeez, or ThousandEyes can be useful. If you deploy them at each site, you can monitor things like ping response, DNS response time, or web application response time. Useful for site to Internet, SaaS, cloud, and internal apps. Having hard objective data with accurate timestamps about “user experience” lets you compare which sites were having problems at the same time, or whether the problems were independent.
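Once timestamped measurements exist for each site, comparing them can be mechanical. The following is a rough sketch, not how smokeping or any of the tools named above work internally: it assumes you have exported per-site samples as (epoch seconds, latency in ms) pairs, and the function names, the 5-minute bucket size, and the threshold are all illustrative choices.

```python
from collections import defaultdict


def slow_buckets(samples, threshold_ms, bucket_s=300):
    """Return the set of time buckets (bucket start, epoch seconds) with any slow sample.

    samples: iterable of (epoch_seconds, latency_ms) pairs.
    """
    return {int(ts) // bucket_s * bucket_s
            for ts, ms in samples if ms > threshold_ms}


def sites_slow_together(per_site, threshold_ms, bucket_s=300):
    """Map each bucket to the sorted list of sites that were slow during it."""
    hits = defaultdict(list)
    for site, samples in per_site.items():
        for bucket in slow_buckets(samples, threshold_ms, bucket_s):
            hits[bucket].append(site)
    return {bucket: sorted(sites) for bucket, sites in sorted(hits.items())}


if __name__ == "__main__":
    # Invented sample data for two hypothetical sites.
    per_site = {
        "site_a": [(0, 20), (310, 250), (620, 30)],
        "site_b": [(5, 22), (330, 300), (900, 400)],
    }
    for bucket, sites in sites_slow_together(per_site, threshold_ms=100).items():
        print(bucket, sites)
```

Buckets listing several sites suggest a shared cause (HQ, a shared link, the provider). Buckets listing a single site point back at that site.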
I’ll also note internal DNS is key to many things today, so it is a good idea to monitor its responsiveness.
A DNS Gotcha
DNS lookups are slow when the client doesn’t get a timely reply, whether due to lost packets or slow recursive lookup.
This can affect app logins or even database authentication logging, masquerading as DB slowness.
I’ve seen slow reverse DNS lookup because central DNS was not authoritative for some of the private address blocks in use at a site, causing recursion to the Internet. That in turn caused slow logins to a key application (and copious logged complaints from the application).
Most of the above monitoring tools won’t catch that because you have to specify the address or name they resolve.
Hint to tool vendors: perhaps allow for not only fixed DNS name resolution, but reverse resolution of random IPs in a block.
Lesson Learned Previously: Make sure your site DNS is authoritative for reverse lookups for all private or public address blocks in use. E.g. all of 10.0.0.0/8 rather than just the subnets MS AD knows about. Ditto for other private address blocks: make sure reverse lookups stay local (and fail quickly).
Responsibility to be authoritative about reverse lookups can fall through the cracks when the network or another team manages the site and datacenter DNS, and the Microsoft / server team manages user DNS.
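Until the tools do that, the reverse-lookup gotcha above can be spot-checked by hand. Here is a minimal sketch, using only the Python standard library, that times reverse lookups for random addresses in a block. The function name and parameters are my own invention, it uses whatever resolver the host is configured with, and the resolver’s own timeout settings will cap how slow a single lookup can look.

```python
import ipaddress
import random
import socket
import time


def time_reverse_lookups(block, samples=5, resolver=socket.gethostbyaddr, seed=None):
    """Time reverse (PTR) lookups for random addresses in a CIDR block.

    Returns a list of (ip, seconds, resolved) tuples. A lookup that fails
    quickly is fine; one that takes seconds before failing is the symptom.
    The resolver argument is injectable so the function can be tried offline.
    """
    net = ipaddress.ip_network(block)
    rng = random.Random(seed)
    results = []
    for _ in range(samples):
        ip = str(net[rng.randrange(net.num_addresses)])
        start = time.monotonic()
        try:
            resolver(ip)
            resolved = True
        except OSError:  # socket.herror and socket.gaierror are OSError subclasses
            resolved = False
        results.append((ip, time.monotonic() - start, resolved))
    return results
```

Running something like `time_reverse_lookups("10.0.0.0/8", samples=10)` from a workstation at the affected site and looking for multi-second entries would show whether unanswered reverse lookups are hanging rather than failing fast.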
I’ll briefly hit one of my favorite rants, ahem, themes. I hope you’re already using a network management tool that captures and graphs SNMP stats on all active interfaces, preferably with 5-minute or finer time granularity. You can then look along the traffic path for link problems. This is where having user port stats can help detect if the user’s network connection is the problem.
User slowness can be caused by link congestion, errors, or discards. As I’ve noted before, anything over 0.001% errors or discards should be fixed, as it can slow things down.
If your NPM platform won’t or can’t poll everything frequently, or won’t threshold below 1%, consider getting a better one. While money may be tight, your / staff’s time may be an even scarcer resource.
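To make the 0.001% figure concrete, here is a small sketch of the arithmetic on two successive counter readings for one interface. It assumes you already have the readings (for example from IF-MIB objects such as ifInErrors and ifInDiscards plus a packet counter). The dictionary layout and function names are hypothetical, and this is not how any particular NPM product computes it.

```python
THRESHOLD_PERCENT = 0.001  # the figure quoted in the text


def counter_delta(previous, current, bits=64):
    """Difference between two SNMP counter readings, allowing for one wrap."""
    return (current - previous) % (1 << bits)


def rate_percent(bad_delta, packets_delta):
    """Errors or discards as a percentage of packets seen in one polling interval."""
    if packets_delta <= 0:
        return 0.0
    return 100.0 * bad_delta / packets_delta


def needs_attention(prev, curr, threshold=THRESHOLD_PERCENT):
    """prev/curr: dicts with 'packets', 'errors', 'discards' readings (hypothetical layout)."""
    packets = counter_delta(prev["packets"], curr["packets"])
    worst = max(
        rate_percent(counter_delta(prev[key], curr[key]), packets)
        for key in ("errors", "discards")
    )
    return worst > threshold


if __name__ == "__main__":
    # Invented readings: 30 errors in 2,000,000 packets is 0.0015%.
    prev = {"packets": 1_000_000, "errors": 10, "discards": 0}
    curr = {"packets": 3_000_000, "errors": 40, "discards": 0}
    print(needs_attention(prev, curr))
```

Note how small the numbers are: a few dozen errors in a couple of million packets already crosses the line, which is why a tool that only thresholds at 1% will never flag it.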
Bonus points to Network Management products that take as input the endpoints, figure out the path(s) in each direction, taking ECMP into account, and then show you problems along the paths.
Back to Our Story
In the EPIC story, the WAN MetroEthernet data Comcast provided was rather summarized, only viewable by last day, week, or month. Pretty useless.
The bars shown were (apparently) averages over hours or days. Either that, or the readings were steady for long periods of time. I’ve noticed over the years that most network management products will graph data, but don’t tell you things you need to know to properly interpret the data. Is the graph showing the actual polled data, or is the data being lumped into bigger buckets and averaged?
When you average over hours or days, peaks of traffic get averaged with zeroes. In this case, I suspect the data was for business hours only, but with averaging.
All that vagueness is why I don’t like having to guess what the graph is actually plotting.
The key point here is that if you’re looking for congestion, and all you’re seeing is traffic at about 50% of max that might be an average over multiple 5-minute periods or even hours, chances are that for small time intervals, the link could be maxed out. Or not — there’s no way to tell unless you can zoom in.
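The averaging effect is easy to demonstrate with made-up numbers. In this sketch, the twelve 5-minute utilization readings are invented for illustration (they are not data from the case study): three intervals are saturated, yet the hourly average comes out at 50%.

```python
def bucket_average(samples, per_bucket):
    """Average consecutive samples into larger buckets, as a summarized graph might."""
    averages = []
    for i in range(0, len(samples), per_bucket):
        chunk = samples[i:i + per_bucket]
        averages.append(sum(chunk) / len(chunk))
    return averages


# Hypothetical link utilization (% of capacity) for twelve 5-minute polls in one hour.
five_minute = [40, 45, 100, 100, 100, 35, 40, 30, 25, 30, 30, 25]

hourly = bucket_average(five_minute, per_bucket=12)

if __name__ == "__main__":
    print("hourly average:", hourly[0])
    print("5-minute peak: ", max(five_minute))
    print("intervals at 100%:", sum(1 for u in five_minute if u >= 100))
```

The same thing happens one level down: a 5-minute sample of 50% can hide seconds of a fully saturated link, which is why finer polling is worth having where the tool supports it.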
To wrap up, the problematic sites did seem a bit more heavily utilized, and smokeping did show more ping time variability, suggesting congestion.
Wrapping Up the Case Study
Our primary recommendation was to get the problem workstations upgraded, and as new slowness reports come in, capture the workstation model in use.
A second tentative recommendation (due to poor supporting data) was to consider adding more bandwidth to the problem sites. At the very least, monitor the links internally using 1-minute or 5-minute polling, so that if problems remained after upgrading old workstations, or as the user load increased as anticipated, there would be better data (cost justification) for upgrading the WAN links.
The ultimate answer is of course to monitor every workstation, or selected workstations, right from the workstation itself.
One product I’ve run across (not used directly) is Aternity, now owned by Riverbed. It does actual user experience monitoring. Word of mouth says it can be costly. There are likely other products in that space. Knowing which users have a problem amounts to automated scoping data, which could be pretty useful!
One point to this blog is that you can get pretty useful data without a large investment. The investment needed is in free or cost-effective products and as much of your time as is needed to ensure you’ll have the data you want when you need it.
Scoping and anecdotal user input can help troubleshooting, but do not form a very strong objective basis for doing so. One problem is accurate time: correlating bad UX with other performance statistics.
Network troubleshooting is slow if you don’t already have the data in hand. You can end up with guess work, or with ongoing problems while you get set up to gather the data you need. Slow!
Set yourself up for success by getting a good SNMP performance tool, monitoring everything. And think about adding one or more tools that provide hard objective data about user experience, or user-like experience, at least by site. Tracking wired versus WLAN UX is possible with some of the above tools, e.g. Netbeez.
As noted in a prior blog, cloud changes the game. That’s where good transaction logs and measurements of service / microservice response times become more essential (In effect, adding service experience to user experience!). Start talking to your APM / application folks and learning the tools.
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!