Buffer tuning has long been an interesting topic for me. I recently found a blog post by Brough Turner about a potential misconfiguration of AT&T wireless routers that causes ping round-trip times to be either < 200ms or around 8000ms (yes, 8 seconds!).
There was quite a bit of controversy in the comments, which makes for interesting reading. One of the things that I didn’t see covered in the original post or in the comments was TCP retransmissions. TCP measures the round-trip time and retransmits the unacknowledged segment if it does not receive an ACK within the retransmission timeout (see RFC 2988). Retransmissions occur at an interval that doubles upon each timeout (the “back-off timer”). An increase from 200ms to 8000ms is possible with six retransmissions. If there are a lot of TCP connections using the same link, and a lot of buffering is used, the buffers contain more and more retransmitted data, which increases the congestion on the link because the retransmitted data also has to be transmitted. If there are just a few TCP connections, then something else is causing the long delays.
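To make the back-off arithmetic concrete, here is a minimal Python sketch of a timer that doubles on every timeout. The 200ms starting value is only the figure from the ping results above, used as an assumption; RFC 2988 recommends a minimum retransmission timeout of 1 second, and real TCP stacks differ in their floors and caps. The function name `backoff_schedule` is made up for illustration.

```python
def backoff_schedule(initial_rto_ms, retransmissions):
    """Return (timer_ms, elapsed_ms) for each retransmission.

    timer_ms   -- the timer value that expired to trigger this retransmission
    elapsed_ms -- total time since the original segment was sent
    """
    schedule = []
    rto = initial_rto_ms
    elapsed = 0
    for _ in range(retransmissions):
        elapsed += rto                  # wait for the current timer to expire
        schedule.append((rto, elapsed))
        rto *= 2                        # back off: double the timer
    return schedule


if __name__ == "__main__":
    # Assumed starting timer of 200 ms, six retransmissions.
    for n, (rto, elapsed) in enumerate(backoff_schedule(200, 6), start=1):
        print(f"retransmission {n}: timer {rto} ms, sent at {elapsed} ms")
```

Under these assumptions the fifth retransmission goes out 6200ms after the original segment and the sixth at 12600ms, so an 8-second delay falls within the range that six retransmissions can produce.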
I can envision other mechanisms causing the long delays. The pings were data. If QoS were used to prioritize voice and there was a lot of voice traffic at the time of the test, the data could have been buffered for a long time. I’ve seen this in network testing. It is easy to replicate this in a relatively small network running old routing protocols like RIP and IGRP. Create a lot of routes, so the updates are relatively large, and use a slow link. Set up a workstation to do a ping at 1-second intervals over several minutes and capture the resulting data. Import the data into Excel and plot the sequence number against the round-trip time (you’ll need a ping output that includes the sequence number so you can detect packet loss). You’ll see the ping packets get delayed when the routing updates occur. A saw-tooth pattern appears in the plotted data. Ping packets can be delayed by several seconds when the updates are large and the links are slow. I am not familiar with the Layer 1/2 protocols used in cellular networks, but I could also believe that there’s a low-level protocol, maybe similar to X.25, that’s buffering the data and eventually getting it pushed through a very lossy link.
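The capture-and-plot step can be sketched in Python as follows. This assumes ping output in the common Linux style (`icmp_seq=N ... time=X ms`); other platforms format replies differently and would need a different pattern. The sample lines and the 192.0.2.1 address are invented for illustration and are not measurements.

```python
import csv
import io
import re

# Matches reply lines such as:
#   64 bytes from 192.0.2.1: icmp_seq=7 ttl=64 time=12.3 ms
REPLY = re.compile(r"icmp_seq=(\d+)\b.*?\btime[=<]([\d.]+)\s*ms")

# Made-up sample data; sequence number 3 is deliberately missing.
SAMPLE = """\
64 bytes from 192.0.2.1: icmp_seq=1 ttl=64 time=12.3 ms
64 bytes from 192.0.2.1: icmp_seq=2 ttl=64 time=2140 ms
64 bytes from 192.0.2.1: icmp_seq=4 ttl=64 time=11.8 ms
"""


def parse_ping(lines):
    """Return {sequence_number: rtt_ms} for every reply line found."""
    replies = {}
    for line in lines:
        match = REPLY.search(line)
        if match:
            replies[int(match.group(1))] = float(match.group(2))
    return replies


def missing_sequences(replies):
    """Sequence numbers between the first and last reply with no reply."""
    if not replies:
        return []
    return [s for s in range(min(replies), max(replies) + 1) if s not in replies]


def to_csv(replies):
    """Render seq,rtt_ms rows as CSV text that a spreadsheet can import."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["seq", "rtt_ms"])
    for seq in sorted(replies):
        writer.writerow([seq, replies[seq]])
    return out.getvalue()


if __name__ == "__main__":
    replies = parse_ping(SAMPLE.splitlines())
    print(to_csv(replies), end="")
    print("lost sequence numbers:", missing_sequences(replies))
```

Plotting the `seq` column against `rtt_ms` in a spreadsheet should show the saw-tooth if the delays line up with the routing update interval, and the gaps in the sequence numbers show where packets were lost.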
Back to buffer tuning. It is not well covered in router classes – it is something that you have to dig to find. I can certainly believe that network staff (or their managers) who don’t understand how TCP works would focus on packet loss and insist on configuring enough buffering to avoid packet loss. I can hear them now: “Our network is great! We don’t have any packet loss!” [I could see a manager thinking that packet loss is like dropped calls and wanting to minimize it.]
There are times when buffer tuning is valuable. I look for interfaces that have a lot of buffer misses. On Cisco gear, a miss occurs when a packet arrives, needs a buffer, and no buffers of the appropriate size are available. The packet is dropped and a new buffer is created to handle future packets of the same size. If an interface shows a high number of buffer misses for a particular packet size, I recommend increasing the number of fixed buffers by no more than 10% and then watching for further buffer misses.
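The "no more than 10%, then watch" rule can be written down as a small Python helper. This is only a sketch of the arithmetic: the function name, its parameters, and the numbers in the example are hypothetical, and the miss counts would have to be read from the device by hand or by some other tool.

```python
def next_buffer_count(current, misses_before, misses_after, max_increase_pct=10):
    """Suggest a new fixed buffer count for one buffer size.

    current        -- fixed buffers configured now (hypothetical input)
    misses_before  -- miss counter at the start of the observation period
    misses_after   -- miss counter at the end of the observation period
    """
    if misses_after <= misses_before:
        return current  # misses are not growing, so leave the pool alone
    # Round down so the step never exceeds the percentage cap.
    step = current * max_increase_pct // 100
    return current + step


if __name__ == "__main__":
    # Made-up numbers: 50 buffers, misses grew from 10 to 25.
    print(next_buffer_count(50, 10, 25))
```

Because the step is rounded down, a very small pool (fewer than 10 buffers at a 10% cap) is left unchanged, which is a choice made for this sketch rather than a recommendation.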
If there are more misses, I start to look at other mechanisms to handle the load. Increasing the interface speed is the preferred mechanism, where it can be done. QoS obviously allows important data to receive priority treatment. (Refer to my post “Cisco Router Interface Wedged” for an example of what happens when QoS isn’t properly implemented.)
Since some pings were 8000ms, it is clear that something somewhere is hanging onto the packets. I doubt that they circulated in the network that long without the TTL counting down to zero. Unfortunately, there is no way to know for certain what is happening without additional data like a packet trace. How many concurrent connections exist? What is the direction of the data flows? How many retransmissions are occurring? Flow monitoring tools (NetFlow, sFlow, IPFIX) do not provide the level of detail that is needed. TCP retransmission data from one of the endpoints is valuable, but it is often difficult to obtain and depends on the OS in use. See the Windows Performance Monitor Counters for an example.
I like to use Cisco’s IP SLA tool, combined with another tool to manage a lot of IP SLA tests, to let me know that a particular link or path to a remote site is experiencing high latency, jitter, or packet loss. Once I’ve identified a poorly operating path, I can apply whatever tools are needed to determine the origin of the problem.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html