In my last post, Application Analysis Using TCP Retransmissions, Part 1, I talked about TCP retransmissions that are caused by link errors. Another source of TCP retransmissions is too much buffering in network routers and L3 switches.
Some buffering allows the network to handle traffic bursts without dropping many packets, which is good. However, too much buffering causes problems. How much buffering is too much? Looking at how TCP works can help us understand what the limits should be.
TCP measures the Round Trip Time (RTT) of its connection (see Wikipedia on Transmission Control Protocol). A packet is assumed to be lost when an ACK has not been received within 2*RTT. So network buffering along the path needs to add less than 2*RTT of delay, or buffer-induced retransmissions will occur when a path with excessive buffering becomes congested. Because a faster link drains more bits in the same amount of time, the maximum buffering is proportional to the link speed: higher speed links can handle greater buffering than lower speed links. When excessive buffering exists and a link becomes congested, TCP cannot accurately measure the path capacity, and it oscillates between sending too much data and sending too little data for the path. When it sends too much data, it adds to network congestion, which impacts all applications using the path. The amount of buffering that is excessive is surprisingly small.
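To make the proportionality concrete, here is a minimal sketch that converts the 2*RTT limit into a byte count for a few link speeds. The function name, the parameter names, and the choice of 100 Mbps and 10 Gbps as comparison points are my own assumptions for illustration; only the 2ms RTT and the 1Gbps link come from the scenario in this post.

```python
def max_buffer_bytes(link_bps: int, rtt_us: int, rtt_multiple: int = 2) -> int:
    """Bytes a link can drain within rtt_multiple * RTT.

    Buffering more than this on a congested path delays packets past the
    2*RTT loss threshold used in this post.
    """
    budget_us = rtt_multiple * rtt_us             # delay budget in microseconds
    bits = link_bps * budget_us // 1_000_000      # bits drained in that time
    return bits // 8


# Hypothetical comparison points; the RTT is the 2ms used in this post.
links = [
    ("100 Mbps", 100_000_000),
    ("1 Gbps", 1_000_000_000),
    ("10 Gbps", 10_000_000_000),
]
for name, bps in links:
    print(f"{name}: {max_buffer_bytes(bps, rtt_us=2000):,} bytes")
```

With a 2ms RTT, the 1Gbps link in this sketch tops out at 500,000 bytes of total path buffering, and a link ten times slower at one tenth of that.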
Of course, when network operations staff sees interface discards, the temptation is to increase interface egress buffering until the discards are reduced or eliminated. But that creates a situation in which TCP can’t get an accurate measurement of the path’s available bandwidth and the result is increased network congestion, increased latency, and lower overall throughput. More things than just TCP retransmissions are at work here. Jim Gettys at Bell Labs (Alcatel/Lucent) has been working on the problem for the last year and coined the term Bufferbloat to describe the phenomenon. Watch the video of Jim’s Google Tech Talk and you’ll see the oscillations in TCP performance in the graphs that he has collected. Note that congestion induced by UDP has the same effect.
Let’s work through a specific scenario. The network topology is typical of a two data center organization where the data centers are connected by a 1Gbps link. The L3 switches are using FIFO queuing. See the figure below.
We’ll start with an unloaded link. Let’s say that the unloaded path RTT is 2ms. The network staff has seen a lot of egress discards on the L3 switch 1G links, so they increased the egress buffering to 2000 buffers, which reduced the discards to a level that they were happy with. But now the customers are complaining about slow network performance. The network team has tested the network and they can’t find anything wrong: discards are down to a level that should not be a problem, there are no errors on the interface, and ping shows 2ms almost every time they test it (they ignore the higher ping times since they don’t seem to occur very frequently). The sending server is sending big frames, averaging 900 bytes, and is capable of sending continuously at 10Gbps. The receiving server is configured with 2000 receive buffers, so the TCP window is scaled up.
It doesn’t take TCP very long to exceed 1Gbps, but because ACKs are still being received and no packet loss is detected, it continues to increase the number of segments that are transmitted per received ACK (see Wikipedia Slow-start).
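As a rough illustration of how quickly that happens, the sketch below counts how many round trips an idealized slow start needs before the window covers the path's bandwidth-delay product. It assumes the window doubles every RTT, that nothing is lost, that the receive window is not the limit, and that the sender starts with 10 segments. The initial window of 10 and the function name are my assumptions, not measurements from this network.

```python
def rtts_to_fill_pipe(link_bps: int, rtt_us: int, segment_bytes: int,
                      initial_window: int = 10) -> tuple[int, int]:
    """Return (round trips, window in segments) at the point where an
    idealized, loss-free slow start first covers the bandwidth-delay product."""
    bdp_bits = link_bps * rtt_us // 1_000_000        # bits in flight that fill the path
    bdp_segments = bdp_bits / (segment_bytes * 8)    # same, in average-sized segments
    window = initial_window
    rtts = 0
    while window < bdp_segments:
        window *= 2                                  # slow start doubles the window each RTT
        rtts += 1
    return rtts, window


# 1Gbps link, 2ms unloaded RTT, 900-byte average frames (values from this post).
rtts, window = rtts_to_fill_pipe(1_000_000_000, 2000, 900)
print(f"{rtts} round trips, window of {window} segments")
```

Under those assumptions the window passes the roughly 278 segments that fill the 2ms path after five round trips, which is about 10ms.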
However, the rate at which the L3 switch’s buffers drain is limited to 1Gbps. The switch can send 1Mb per millisecond, or about 139 frames per millisecond at an average frame size of 900 bytes (7200 bits). But there are 2000 egress buffers! A full queue holds 14,400,000 bits, which takes 0.0144 sec (about 14ms) to transmit. TCP’s RTT smoothing algorithm will not have caught up with that delay, so TCP will retransmit packets at 2*RTT. If we assume that the smoothed RTT has increased to 4ms, TCP will retransmit the last unacknowledged packet at 8ms, while the original is still sitting in the queue. So now there are two copies of this packet in the egress queue. An ACK will soon arrive, but TCP will time out a successive packet and retransmit it, because the value of the measured and smoothed RTT takes time to catch up with the delay induced by the buffering.
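The arithmetic above can be checked with a few lines of code. This is only a sketch of the calculation in this post; the function name is hypothetical, and the 4ms smoothed RTT is the same assumption made in the paragraph above.

```python
def queue_drain_ms(buffers: int, frame_bytes: int, link_bps: int) -> float:
    """Milliseconds needed to transmit a completely full egress queue."""
    queued_bits = buffers * frame_bytes * 8
    return queued_bits * 1000 / link_bps


LINK_BPS = 1_000_000_000    # 1Gbps inter-data-center link
FRAME_BYTES = 900           # average frame size in this scenario
BUFFERS = 2000              # egress buffers after the change
SMOOTHED_RTT_MS = 4.0       # assumed value of TCP's smoothed RTT

drain_ms = queue_drain_ms(BUFFERS, FRAME_BYTES, LINK_BPS)
timeout_ms = 2 * SMOOTHED_RTT_MS                       # the 2*RTT threshold
frames_per_ms = LINK_BPS / 1000 / (FRAME_BYTES * 8)    # drain rate in frames

print(f"drain rate: {frames_per_ms:.1f} frames per ms")
print(f"full queue drains in {drain_ms:.1f} ms, timeout at {timeout_ms:.1f} ms")
print("buffer-induced retransmission possible:", drain_ms > timeout_ms)
```

A packet that lands at the back of the full queue waits longer than the 8ms timeout, so the sender retransmits it even though nothing was dropped.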
Each retransmitted packet cuts into the 1Gbps link’s goodput (see Wikipedia Goodput); how much depends on the link speed and RTT. The link can lose a substantial percentage of bandwidth to needless retransmissions due to excessive buffering.
The message here is that you don’t need packet loss in order to cause TCP retransmissions. They can occur due to excessive buffering. So don’t experiment with the hold-queue sizes unless you’re open to a lot of experimentation and analyzing packet captures.
You can detect excessive buffering in the network by looking at server TCP retransmissions, or by examining packet captures on key network interfaces. A few retransmissions are expected, because packet loss is the feedback that TCP needs to learn the link capacity. As discussed in my prior blog post on link errors, Application Analysis Using TCP Retransmissions, Part 1, a retransmission rate that exceeds what a clean link provides is an indication that something is amiss.
What else could be amiss? The link could be overwhelmed. If there are a lot of systems communicating over the link, each opening multiple TCP connections (browsers tend to do this), the traffic bursts are likely to be significant. In an enterprise, if the link is running at greater than 20% utilization on a 5 minute polling interval for significant parts of the day, and the number of egress discards is greater than 0.00001% of all packets, more bandwidth is needed. 5 minutes is a long time and a lot of bits on a high speed link. It’s like the Washington, DC beltway; average utilization doesn’t mean much during rush hour. Most enterprises can increase link speeds and the above measurements can help you identify when adding bandwidth is needed.
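The rule of thumb above is easy to turn into a check against polled interface counters. The sketch below evaluates a single 5 minute sample; the function and parameter names are hypothetical, and deciding what counts as "significant parts of the day" is left to whoever runs it across many samples.

```python
UTILIZATION_LIMIT_PCT = 20.0     # utilization threshold from this post
DISCARD_LIMIT_PCT = 0.00001      # egress discards as a percentage of all packets


def needs_more_bandwidth(utilization_pct: float, egress_discards: int,
                         total_packets: int) -> bool:
    """True when one polling sample exceeds both thresholds."""
    if total_packets == 0:
        return False                                   # idle interface, nothing to judge
    discard_pct = 100 * egress_discards / total_packets
    return (utilization_pct > UTILIZATION_LIMIT_PCT
            and discard_pct > DISCARD_LIMIT_PCT)


# Made-up sample values for illustration only.
print(needs_more_bandwidth(35.0, 50, 100_000_000))   # busy link with discards
print(needs_more_bandwidth(15.0, 50, 100_000_000))   # discards but low utilization
print(needs_more_bandwidth(35.0, 5, 100_000_000))    # busy but very few discards
```

Only the first sample trips both thresholds. Note how small the discard limit is: 0.00001% is one discard per ten million packets.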
When researching this topic, I found a blog post that suggested using large buffers for Research & Engineering networks. The author doesn’t include any metrics for RTT, which is one of the key factors. Another page by the same author mentions RTT and Bufferbloat, but suggests that it isn’t needed in high speed network links. I’ve seen problems with excessive buffering in enterprise networks on 1Gbps links with low RTT. If the path has a high RTT, it isn’t a problem, because the amount of buffering typically won’t result in exceeding the TCP retransmit timeout period. I suspect that the paths over which this author is operating have large RTT values and therefore don’t exhibit the bufferbloat problem.
With the above as the background, I am recommending that one customer adjust the egress hold-queue. The RTT on the main path (between two data centers) is 2ms, as described above. Because there are major flows in each direction, I can only configure 1ms of buffering in each direction and still have some room for a margin of error and for packets bigger than the average. So I am recommending that they set the hold-queue to allow buffering of 1ms of data at the maximum packet size. The link can currently handle 1500 byte frames, which is 12,000 bits each, and 1ms at 1Gbps is 1,000,000 bits, so I am recommending 83 buffers (the default is 40 egress and 2000 ingress; there is no need to change ingress buffering). That way if a packet gets buffered behind another big flow that filled the buffer, it won’t experience more than 1ms of additional delay in that direction. Hopefully, the ACK coming back the other direction won’t experience more than 1ms of additional delay and TCP won’t retransmit. If the link is changed to support jumbo frames, the number of buffers would have to be reduced to reflect the larger amount of data that is possible per frame.
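The sizing rule used for this recommendation can be sketched as a small calculation. The function name is hypothetical, and the 9000 byte jumbo frame size is my assumption of a typical value, not a setting from this customer's network.

```python
def hold_queue_buffers(link_bps: int, max_delay_us: int, max_frame_bytes: int) -> int:
    """Largest whole number of maximum-size frames that drain within max_delay_us."""
    budget_bits = link_bps * max_delay_us // 1_000_000   # bits sent in the delay budget
    return budget_bits // (max_frame_bytes * 8)


LINK_BPS = 1_000_000_000    # 1Gbps link between the data centers
MAX_DELAY_US = 1000         # 1ms of buffering per direction

print(hold_queue_buffers(LINK_BPS, MAX_DELAY_US, 1500))   # current 1500 byte frames
print(hold_queue_buffers(LINK_BPS, MAX_DELAY_US, 9000))   # assumed jumbo frame size
```

For 1500 byte frames this gives the 83 buffers recommended above. With the assumed 9000 byte jumbo frames, the same 1ms budget allows only 13 buffers.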
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html