We’ve all encountered situations where an application is slow and the network gets blamed. I’ve been having some fun working with our own Terry Slattery on consulting work to determine why six specific applications are slow. He’s come up with some good insights into the applications at this particular site, and we’ve been talking about some of the reasons applications might be slow. Yes, it might be the network. It also might be the application, particularly if the application writer or toolkit is oblivious to what it is doing in network terms.
I started brainstorming a list of things that could make an application slow, breaking it out by whether the cause is an application or a network problem. Some of these are items Terry touched upon in his recent blogs. I considered blogging about them individually or in small groups, then decided a checklist of things to consider might be more useful.
Please add your own favorite application slowness causes as comments to this blog!
Application Causes of Slowness
- Many round trips (times the round trip time — you can’t change the speed of light)
- DB match row by row vs. page or streaming approach (a common cause of many round trips; see the sketch after this list)
- Reading many files on a CIFS or NFS drive (CIFS can be slow, and directory recursion is round-trip intensive)
- Opening many TCP connections
- Pulling a lot of data across the network unnecessarily, e.g. fat client or server-based join rather than DB-based join or stored procedure
- Synchronous replication with latency and locking
- Making many Active Directory, LDAP or DNS calls (uncached)
- Overloaded / slow AD, LDAP, or DNS server
- Broadcast/multicast Altiris image distribution: poorly planned groups can clobber your WAN
- High traffic between different locations — lack of location awareness, or incautious VMotion to lighten the main datacenter load
- Massive numbers of Unix scripting shell invocations
- Server performance
- Lack of RAM or of Windows handles on the server
- Resource locks, lock contention
- Application that does Reverse DNS logging rather than IP logging, coupled with use of NAT (see Solving ASA Slowness).
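To make the "row by row" item above concrete, here is a minimal sketch of the two query patterns. It assumes a DB-API 2.0 style connection object; the `orders` table, column names, and the `conn` object are hypothetical, and the "?" placeholder style shown is sqlite3's (it varies by driver).

```python
# Hypothetical sketch: row-by-row lookups versus one set-based query.
# `conn` is assumed to be a DB-API 2.0 connection (sqlite3, psycopg2, ...);
# table and column names are made up.

def status_row_by_row(conn, order_ids):
    """N queries = N network round trips, each paying the full RTT."""
    cur = conn.cursor()
    results = {}
    for oid in order_ids:
        cur.execute("SELECT status FROM orders WHERE id = ?", (oid,))
        row = cur.fetchone()              # waits on the network every iteration
        results[oid] = row[0] if row else None
    return results

def status_set_based(conn, order_ids):
    """One query = one round trip; the DB returns all matching rows at once."""
    cur = conn.cursor()
    placeholders = ",".join("?" for _ in order_ids)
    cur.execute(
        "SELECT id, status FROM orders WHERE id IN (%s)" % placeholders,
        list(order_ids),
    )
    return dict(cur.fetchall())
```

At a 40 ms WAN round-trip time, 10,000 row-by-row queries spend roughly 400 seconds just waiting on the network, regardless of link speed; the set-based version pays that 40 ms once, plus transfer time.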
In general, the thing that makes troubleshooting all this challenging is getting good information about the application. It helps to know where the chatty (“ping-pong”) traffic occurs, where the massive data transfers occur (if any), or where lots of CIFS/NFS files get accessed. For that matter, it helps to know which other servers the main application talks to, and roughly why it is talking to each of them. One could hope that application documentation would cover that. I have yet to see it do so.
Network Side of Things
If you don’t have comprehensive monitoring of all network devices, servers, and links, you’re flying with your eyes closed. Pervasive monitoring and a pro-active stance are the best way to avoid heavy research every time the network is blamed for application slowness. As an article I saw put it: MTTR depends on MTTI, and MTTB = 0. That is, Mean Time to Blame (the network) is 0 seconds, and until you establish Innocence (MTTI) the real repair effort often doesn’t start. The better shops I’ve worked with have both network and server people doing research in parallel; that speeds up problem resolution should the application be at fault.
Here are some things to look out for on the network side:
- Sharing WAN or MAN links with Internet traffic and no QoS de-prioritizing the non-business Internet traffic
- Bufferbloat (see Terry’s blog Application Analysis Using TCP Retransmissions, Part 2)
- Client side buffer tuning / lack of tuning causing poor TCP throughput
- Link congestion (covert over-subscription / micro-bursts)
- Retransmissions = symptom of congestion, most visible on the server side
- Overruns / oversubscription of ASICs, backplane, device L2/L3 switching performance, etc.
- Server to network problems, e.g. duplex mismatch
- Poorly deployed / overloaded Packeteer
- Inappropriate QoS / policing / shaping
- Wrongly sized MTU, and fragmentation
A couple of those need more explanation.
End system MTU, TCP buffers, and TCP parameters can be tuned. There is a lot of advice and mis-advice on the web about TCP buffer tuning. Increasing buffers on end systems can help. See also bufferbloat (above). For an introduction to the topic, see http://en.wikipedia.org/wiki/TCP_tuning. The national supercomputer sites used to have good information on this; sites dealing with massive high-speed and international scientific data transfers probably still do.
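A quick way to check whether buffers (rather than the link) are the bottleneck is the bandwidth-delay product: a single TCP flow can never exceed its window size divided by the RTT. A small sketch of the arithmetic, with example numbers:

```python
# Bandwidth-delay product (BDP) arithmetic for TCP window sizing.

def bdp_bytes(link_bps, rtt_sec):
    """Bytes that must be in flight to keep the link full."""
    return link_bps * rtt_sec / 8.0

def window_limited_throughput_bps(window_bytes, rtt_sec):
    """Throughput ceiling a fixed window imposes over a given RTT."""
    return window_bytes * 8.0 / rtt_sec

# Example: 100 Mbps WAN link, 40 ms RTT -> about 500 KB must be in flight.
print(bdp_bytes(100e6, 0.040))                            # 500000.0 bytes

# A legacy 64 KB receive window on that path caps a flow near 13 Mbps,
# no matter how big the pipe is.
print(window_limited_throughput_bps(64 * 1024, 0.040))    # ~13.1 Mbps
```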
Retransmissions: it is hard to see these in the network. Windows will report retransmissions per second, which I consider fairly useless (what’s normal, what’s high — you can’t tell unless you have stored history). But retransmissions per second divided by packets (segments) per second gives retransmissions per segment transmitted, which can easily be turned into a percentage, something you can more readily threshold across a number of servers without setting a different threshold for each one. If the TCP MIB is supported, it will tell you segments retransmitted and total segments sent, which are all you need. I like to look at retransmissions since that tells me whether something bad is happening — it covers the case where my normal reporting is missing something in terms of drops or other counters, e.g. internal drops due to crypto capacity being exceeded.
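Here is a sketch of that percentage calculation from two successive polls of the TCP MIB counters (tcpOutSegs, OID 1.3.6.1.2.1.6.11, and tcpRetransSegs, 1.3.6.1.2.1.6.12). How the samples get collected, whether via an SNMP poller or by scraping netstat -s, is up to your management tool; the math is the same.

```python
# Retransmission percentage from two polls of TCP-MIB counters.
# tcpOutSegs = 1.3.6.1.2.1.6.11, tcpRetransSegs = 1.3.6.1.2.1.6.12

def retrans_percent(prev_out, prev_retrans, curr_out, curr_retrans):
    """Percent of segments retransmitted during the polling interval."""
    sent = curr_out - prev_out
    retrans = curr_retrans - prev_retrans
    if sent <= 0:
        return 0.0                 # idle interval (or counter wrap): skip it
    return 100.0 * retrans / sent

# Using the netstat totals quoted later in this post (6498 segments sent,
# 3148 retransmitted) against a zero baseline: roughly 48 percent.
print(retrans_percent(0, 0, 6498, 3148))
```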
Packeteer: I’ve had limited and unhappy experience with them. My impression is they can easily introduce an additional problem source that can be hard to troubleshoot, in part due to poor documentation about what the various types of QoS policies actually do under the hood. I once got told “we don’t document that — take the class or bring in a consultant”. Not a good reply — might as well say “our documentation lacks detail”. I don’t want to have to look at each flow; that’s playing whack-a-mole. I’ve seen Packeteers easily overwhelmed by auto-discovery of too many flows. Then there’s having to fork-lift upgrade them when you upgrade the WAN link speed. I’ve also seen a Packeteer with network-based crypto, where we had to figure out how to account for crypto headers and also needed to turn on LLQ in the Cisco router to complement the Packeteer VoIP QoS. I ended up feeling it would have been a whole lot simpler to just do the QoS on the routers.
QoS: my favorite source of Cisco IOS QoS confusion is the different units for the priority and bandwidth commands (Kbps) versus shaping and policing (bps). Also, the burst parameter for shape is in bits, while for police it is in bytes. The Nexus allows you to specify the units, reducing by one the number of easy ways to get it wrong. If you shape thinking the units are Kbps, you are allowing 1/1000 as much traffic as you intended. That’ll really slow things down!
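To put numbers on that factor-of-1000 trap (the 50 Mbps figure is just an example):

```python
# The units trap: priority/bandwidth take Kbps, shape/police take bps.
# Feeding the Kbps number to shape gives 1/1000 of the intended rate.

intended_rate_mbps = 50

correct_shape_bps = intended_rate_mbps * 1_000_000   # 50000000 -> shapes to 50 Mbps
kbps_number = intended_rate_mbps * 1_000             # 50000, what priority/bandwidth would want

# If 50000 is mistakenly given to the shaper, it is read as 50000 bps:
print(kbps_number / correct_shape_bps)               # 0.001, i.e. 1/1000 of the intent
```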
Real-World Examples
Some real-world “war stories” might amuse.
I still recall, from a while ago, the site that had to do a quick datacenter core and access switch upgrade after the application folks upgraded Lotus Notes and performance was terrible. I heard that 2-3 weeks later someone realized the new version did a lot more LDAP calls and was killing the LDAP server, also impacting other applications. My conclusion: instrument the heck out of your LDAP, AD, and DNS servers and the links to them; they have datacenter-wide impact.
There’s also the government agency that was doing nightly data rollups to HQ, which were taking 23 hours and a climbing number of minutes to complete. Obviously that wasn’t going to work much longer. It turned out the backup was a Unix shell script with nested loops that did a GZIP and an FTP transfer one file at a time. I heard that just doing TAR on the lowest-level directory, GZIP on that, and FTP of the single compressed file got the data rollup down to something like 1 hour and change, simply by reducing the number of individual shell invocations. The person consulting on this noted there were other optimizations possible (‘expr’ command sub-shell invocations, etc.), moved some constants out of loops, and wrote a small C program to do something in the middle of the remaining nested loops more efficiently — and got the task down to 7 minutes, as I recall. Moral of the story: if the application is slow and does something over and over, that repeated operation is where you most need to be efficient.
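For the curious, here is a rough sketch of the "compress once, transfer once" pattern, not the agency's actual script, using only the Python standard library; the host, credentials, and paths are placeholders.

```python
# Sketch: replace a per-file gzip + FTP loop with one tar.gz and one transfer.
import tarfile
from ftplib import FTP

def rollup_and_send(src_dir, host, user, password):
    archive = "/tmp/rollup.tar.gz"
    # One tar+gzip pass over the whole directory tree, instead of one gzip
    # process and one FTP session per file.
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname="rollup")
    # One FTP connection and one STOR for the single compressed archive.
    with FTP(host) as ftp:
        ftp.login(user, password)
        with open(archive, "rb") as f:
            ftp.storbinary("STOR rollup.tar.gz", f)

# Example with placeholder values:
# rollup_and_send("/data/nightly", "hq.example.com", "rollup", "secret")
```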
Relevant Prior Blogs
Here are some links to prior blogs on this topic by me or by Terry Slattery:
Our president, David Yarashus, points out the following:
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\David Yarashus>netstat -sP tcp
TCP Statistics for IPv4
Active Opens = 2854
Passive Opens = 0
Failed Connection Attempts = 2361
Reset Connections = 204
Current Connections = 0
Segments Received = 6975
Segments Sent = 6498
Segments Retransmitted = 3148
Active Connections
Proto Local Address Foreign Address State
C:\Users\David Yarashus>
My whole context in the article is automated repeated measurements that a net management tool can collect and report on. CLI is useful but not suitable for ongoing observation across many servers.
Good Article. Thanks! I can use some of the tips here 🙂
On server performance: some system administrators do not practice applying "Hardware, Optional" updates. In my dealings with them, they either believe they only need to apply "High Priority" updates, or that the hardware does not need updating, or they are using WSUS, which does not include updates for hardware firmware and drivers.
Applying NIC firmware and driver updates is important (so are firmware and driver updates for other parts of the server). In one of my real-world experiences, while transferring large files during a server migration over a 1 Gbps link, the throughput was below 100 Mbps. After I forced the system administrator to apply NIC firmware and driver updates, the throughput hit 600 Mbps.
I always recommend that system administrators apply "Hardware, Optional" updates for optimum server performance.
I have recently found the following to support this practice.
http://msdn.microsoft.com/en-US/library/cc615012(v=bts.10).aspx
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00264524/c00264524.pdf
Thanks for the feedback. Funny timing, in that I’ll shortly post a blog that contains some thoughts along a related line of reasoning…
I recently realized that it seems like nobody ever validates server-to-network connections. One team sets up the switch and specifies the port, someone cables the patch cable, and the server guys do (or already did) their thing. Link light green, must be done.
But does it actually work? There are potential duplex issues, teaming and switch config mismatches, etc. And bad NIC driver issues.
In a troubleshooting session a few months back, it turned out two weeks had been spent chasing a problem because the server team was doing an LACP port channel on an HP TCP Offload card. They told me HP didn’t support LACP in that driver yet. 25% retransmissions. I couldn’t think of a diplomatic version of "why did you waste 2 weeks of the network team’s time", so I kept my mouth shut. (And maybe they had just found out about the HP non-support.)
At another customer site, they were having FCoE issues on a 10 Gbps CNA. It turned out some QLogic CNA NICs worked fine, while others had intermittent issues that cleared up with a brand-new driver release. OK, testing probably would have missed that. (And why the heck QLogic CNAs come in copper and fiber versions escapes me — isn’t the whole point of an SFP+ to render the NIC copper/fiber neutral?)
Cutting to the chase… wouldn’t it make sense to actually validate the server / OS / NIC / driver version combination, stick with what works, and then only run known-good combinations? Rather than "this is the NIC and driver version I happened to install, it seems to be passing packets, on to the next server install"? If there’s a new version, maybe validate it – although that runs contrary to your advice to just upgrade the driver.
Do some sites actually do this? If you do, pat yourself on the back: well done!
My practice is to validate the update in UAT first, then in DR, before applying it to Production.
True, from the network side there are other possible causes, like duplex mismatch and interface errors. For a critical system with a fixed connection, these should be hardcoded and monitored automatically. In my infrastructure, I don’t want to see a single interface error whose root cause I cannot explain. There are systems that do not allow hardcoding for GE, though; the switch has to be monitored automatically for duplex mismatch.
Before I moved to networking, I was doing system administration, and NIC issues didn’t give details or errors (maybe I wasn’t using the right tool :D). But I have seen a reputable manufacturer bring their tools to check the system; they later told me it was an on-board NIC problem, then replaced the entire motherboard without even telling me how they found that out.
Unlike Cisco IOS, which details the bugs it fixed, server firmware and drivers do not tell you in detail what was fixed for which OS (e.g. a throughput problem).
So far, updating NIC firmware and drivers for Windows, Solaris, AIX, and Linux over the decades has not failed me (not all updates speed up network throughput, though). I find there are system admins who struggle with it, and the reason is that they are not doing it properly: either using the wrong firmware (for manual upgrades, in the case of those using WSUS) or not validating it first.
I have not worked on 10G yet. When I was working as a system admin, I found there are limiting factors on the server side that can cause network slowness, e.g. bus or disk. Lucky for me, TomsHardware published this article: http://www.tomshardware.com/reviews/gigabit-ethernet-bandwidth,2321.html
The article is old, but it can still be applied to today’s situation (10G). I was able to use it in the past to justify a server tech refresh from PCI to PCIe 🙂
Some system administrators (including application and database administrators) have the mistaken mindset that speed and throughput are a network problem and nothing to do with them, since the HTML error says to check with your network administrator 🙂
Thanks, Dandy. Agree, Test, Dev, then Prod. Although that may end up being mostly sanity checking, as in "it seems to work"?
I suspect NICs with bugs won’t necessarily report on the problems those bugs cause. Yes, driver fix notes generally leave a lot to be desired. For that matter, I’d say 50% or more of the Cisco bug tracker notes leave me scratching my head, and I often end up with "this sounds sort of like my problem, but I can’t tell for sure." Programmers, terse with language, in a hurry, etc.?
Re 10 Gbps, supposedly modern multi-core CPUs can do 50 Gbps of I/O, in part due to PCIe improvements. I suspect some of the 10 Gbps and TCP Offload gear has been a bit bleeding-edge, judging by the number of driver bugs I keep hearing about. I do hope we’re getting beyond that.