I’ve been doing a number of network assessments recently, and one of the factors that I use to determine network health is the number of TCP retransmissions that occur. I check servers and user workstations, and I also look for retransmissions using packet capture tools. There is a fair amount of information on the web about TCP retransmissions and how TCP works, but I haven’t seen much about the sources of TCP retransmissions. The few things I did find typically contained inaccuracies of some sort.
TCP guarantees data delivery by using an ACK mechanism to make sure that the data is received (note that a failure internal to the receiving system may corrupt the data after TCP hands it off to the application, but we’re not going to go there in this article). If TCP doesn’t receive an ACK before its retransmission timeout (RTO) expires, it will retransmit the earliest sent and unacknowledged segment. The RTO is derived from the smoothed Round Trip Time (RTT) plus a margin for RTT variation; older descriptions approximate it as 2*RTT. Features like Selective ACK (SACK) and fast retransmit speed up the process. See descriptions of TCP such as at Wikipedia for more information about how TCP works.
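To make the timeout concrete, here is a minimal sketch of the standard RTO estimator described in RFC 6298. The class and method names are my own for illustration, and real TCP stacks differ in details (for example, many use a minimum RTO well below the 1 second that the RFC recommends).

```python
class RtoEstimator:
    """Retransmission timeout estimator following the RFC 6298 formulas."""

    ALPHA = 1 / 8  # gain for the smoothed RTT
    BETA = 1 / 4   # gain for the RTT variance
    K = 4          # variance multiplier

    def __init__(self, clock_granularity=0.001, min_rto=1.0, max_rto=60.0):
        self.g = clock_granularity
        self.min_rto = min_rto
        self.max_rto = max_rto
        self.srtt = None    # smoothed RTT, seconds
        self.rttvar = None  # RTT variation, seconds
        self.rto = 1.0      # initial RTO before any RTT measurement

    def on_rtt_sample(self, r):
        """Update the estimator with a new RTT measurement `r` in seconds."""
        if self.srtt is None:
            # First measurement
            self.srtt = r
            self.rttvar = r / 2
        else:
            # RTTVAR is updated first, using the old SRTT
            self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - r)
            self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * r
        rto = self.srtt + max(self.g, self.K * self.rttvar)
        self.rto = min(max(rto, self.min_rto), self.max_rto)
        return self.rto

    def on_timeout(self):
        """Back off exponentially after a retransmission timeout."""
        self.rto = min(self.rto * 2, self.max_rto)
        return self.rto
```

On a stable path the variance term shrinks, so the timeout settles toward the smoothed RTT (subject to the configured minimum), while each expired timer doubles it.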
Several things can cause a retransmitted packet (technically, a TCP segment). Most people think of packet errors as a common reason for TCP to retransmit a segment and that’s correct. Packet errors can be caused by bad cabling or cables run near sources of EMI/RFI, such as high-voltage power lines or fluorescent lighting ballasts. A bad NIC, perhaps containing a memory chip that has a stuck bit in its internal buffer, could be another source of packet errors. Any source of electronic noise could potentially cause bit errors, which result in packet errors. These are sources of errors at the physical layer.
At the data link layer, duplex mismatch on Ethernet is probably the most common cause of packet errors. Old NICs that do not properly do auto-negotiation are one source. With more modern equipment, auto-negotiation works correctly, so the source that I most often see is an IT department that is still living in the past, setting speed/duplex manually and not getting it right in all cases. I’ve written before about auto-negotiation and prefer to use it because of its benefits with Gigabit and higher speed links. With duplex mismatch, the full-duplex side will see runts and FCS (CRC) errors. The half-duplex side will see late collisions. Depending on the specific configuration and the result of the negotiation, you can look for duplex mismatches by having your network management platform report these types of errors from the switch port. Ideally, you would get configuration information from the connected device, but that’s not necessary. Simply knowing the types of errors that are being generated is sufficient to determine the likely cause. A duplex mismatch will cause packet loss, regardless of which side is full or half duplex. One of the other signatures of a duplex mismatch is that the number of packet errors increases as the traffic load increases.
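The error signatures described above can be turned into a rough triage rule. This is only a sketch: the function name and the counter keys (`late_collisions`, `fcs_errors`, `crc_errors`, `runts`) are hypothetical, and you would need to map them to whatever counters your switches and NMS actually expose.

```python
def classify_duplex_symptoms(counters):
    """Guess which side of a duplex mismatch a port might be on.

    `counters` is a hypothetical dict of error-counter deltas for one
    polling interval, e.g. {"late_collisions": 0, "fcs_errors": 120, "runts": 45}.
    Missing keys are treated as zero.
    """
    late = counters.get("late_collisions", 0)
    fcs = counters.get("fcs_errors", 0) + counters.get("crc_errors", 0)
    runts = counters.get("runts", 0)

    if late > 0:
        # Late collisions are the signature of the half-duplex side.
        return "possible mismatch: this port looks like the half-duplex side"
    if fcs > 0 and runts > 0:
        # Runts plus FCS/CRC errors are the signature of the full-duplex side.
        return "possible mismatch: this port looks like the full-duplex side"
    if fcs > 0 or runts > 0:
        return "errors present: check cabling, NIC, and duplex settings"
    return "no duplex-mismatch symptoms"


print(classify_duplex_symptoms({"fcs_errors": 120, "runts": 45}))
```

The result is a hint about where to look, not a diagnosis; bad cabling can produce FCS errors too, so confirm against the port configuration when you can.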
The other source of TCP retransmissions is excessive buffering. That’s a relatively complex topic, so I’ll cover it in the next post.
We know we have interface errors, but how many is too many? As we saw with the Mathis Equation (see my blog on TCP Performance and the Mathis Equation and Pete Welcher’s blog on TCP/IP Performance Factors), dropped packets have a significant impact on TCP goodput (delivered user data). A modern network link will typically have a bit-error rate (BER) of fewer than one bit out of 1E10 bits (commonly written as 1E-10) for fiber and 1E-6 for copper T1 (see Phil Dykstra’s SuperComputing 2004 presentation, p. 155). A BER of 1E-10 is a reasonable expectation due to the number of 1G copper links in a typical network today. Yes, there are a lot of exceptions. Satellite systems may only achieve 1E-7 BER. When I mention 1E-10 to the military people, they laugh, because much of their infrastructure is based on satellite and microwave links, which often have much higher bit error rates. At the other end of the spectrum, fiber links are common and they have much lower bit error rates due to lower electromagnetic and RF interference.
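To show why small loss rates matter, here is a minimal sketch of the Mathis Equation, which bounds TCP throughput at roughly (MSS/RTT) * (C/sqrt(p)), where p is the packet loss rate and C is a constant of about 1.22 (the square root of 3/2) in the commonly quoted form. The function name and the example MSS and RTT values are my own assumptions for illustration.

```python
import math


def mathis_throughput_bps(mss_bytes, rtt_seconds, loss_rate, c=math.sqrt(3 / 2)):
    """Approximate upper bound on TCP throughput in bits per second.

    Uses the Mathis et al. approximation: rate <= (MSS / RTT) * (C / sqrt(p)).
    """
    if not 0 < loss_rate <= 1:
        raise ValueError("loss_rate must be in (0, 1]")
    if rtt_seconds <= 0:
        raise ValueError("rtt_seconds must be positive")
    return (mss_bytes * 8 / rtt_seconds) * (c / math.sqrt(loss_rate))


# Example values (assumed): 1460-byte MSS, 50 ms RTT
for p in (1e-6, 1e-4, 1e-2):
    mbps = mathis_throughput_bps(1460, 0.050, p) / 1e6
    print(f"loss {p:.0e}: about {mbps:.1f} Mbit/s")
```

Because the bound scales with 1/sqrt(p), a 100-fold increase in packet loss cuts the achievable throughput by a factor of 10.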
Let’s pick some figures and see how they turn out. If the transmitted frames are 10,000 bits long (1,250 bytes), a BER of 1E-10 would result in one error per 1 million frames/packets (1E-6 packet error rate). That’s 0.000001 or 0.0001% packet loss. If your average frame size is 1000 bits, then the packet error rate would be one tenth of that value, or 0.00001%. So now we know the range of the threshold for packet loss for a clean link. [Note: at the interface level, the loss is technically a frame loss, since a bit error in the framing header will cause the frame to be discarded just as if the error were in the payload, causing a checksum failure. Network management systems typically use the term ‘packet’ in the MIBs and in the UI, so when you see the word ‘packet’ or ‘frame’, they are referring to the same thing.]
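The arithmetic above can be checked with a few lines of code. This sketch assumes bit errors are independent, so the chance that a frame contains at least one bad bit is 1 - (1 - BER)^bits, which for small BER is very close to BER * bits. The function names are my own.

```python
import math


def packet_error_rate(ber, frame_bits):
    """Probability that a frame of `frame_bits` bits has at least one bit error,
    assuming independent bit errors at rate `ber`."""
    if not 0 <= ber < 1:
        raise ValueError("ber must be in [0, 1)")
    # 1 - (1 - ber) ** frame_bits, computed in a numerically stable way
    return -math.expm1(frame_bits * math.log1p(-ber))


def approx_packet_error_rate(ber, frame_bits):
    """Small-BER approximation used in the text: BER times frame length."""
    return ber * frame_bits


for bits in (10_000, 1_000):
    per = packet_error_rate(1e-10, bits)
    print(f"{bits} bits: packet error rate {per:.3e} ({per * 100:.6f}%)")
```

At these error rates the exact and approximate values are practically identical, so multiplying BER by the frame length is a safe shortcut.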
On network management systems that allow me to set the thresholds for alerting on interface errors, I typically set the NMS platform to use a threshold of 0.001% (1E-5 packet error rate), which is equivalent to a BER of 1E-9 with 10,000-bit frames. Unfortunately, many NMS platforms only allow one or two decimal places of precision in the packet loss rate for thresholding and alerting, so you may need to use larger values, like 0.01% or 0.1%. In these cases, you may want to configure the system to alert based on the total number of packet errors over a day or a week.
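If your NMS can’t express a threshold that small, the check is easy to script against polled counter deltas. The following is a sketch only: the function names and parameters are hypothetical, and it assumes the good-packet counter does not include errored packets, so the total is the sum of the two.

```python
def interface_error_rate(errors_delta, good_packets_delta):
    """Fraction of received packets that were errored during a polling interval."""
    total = errors_delta + good_packets_delta
    if total <= 0:
        return 0.0  # idle interface: nothing to measure
    return errors_delta / total


def should_alert(errors_delta, good_packets_delta,
                 rate_threshold=1e-5, max_errors=None):
    """Alert when the error rate exceeds `rate_threshold` (1E-5 is 0.001%),
    or when the absolute error count exceeds the optional `max_errors`
    limit for the interval (for example, errors per day)."""
    if max_errors is not None and errors_delta > max_errors:
        return True
    return interface_error_rate(errors_delta, good_packets_delta) > rate_threshold


print(should_alert(errors_delta=20, good_packets_delta=1_000_000))
```

The optional absolute limit covers the case mentioned above, where alerting on the total number of errors over a day or a week is more practical than a rate.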
When implementing these thresholds, you’ll find that duplex mismatch will create packet error rates that are much greater than the figures I recommend, so you may need to start with a more generous number, fix the common problems, then gradually tighten the threshold as you correct the sources of large numbers of packet errors. I know of one site that has about 20 interfaces with over 1,000,000 (yes, one million) packet errors per day, about 30% packet loss. The systems attached to those interfaces are obviously trying to transfer a lot of data and not having much success. Sites that have a lot of interfaces with high packet loss will need to tackle the problem incrementally. One suggestion I make is to have everyone on the network team investigate one or two interfaces a day. Keeping the number down means the work doesn’t interfere with other daily tasks and projects, yet it achieves the desired effect over time, which is to reduce network errors, increase network throughput, and improve customer productivity.
In the next post, I’ll talk about excess buffering as the other source of TCP retransmissions.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html