Understanding Interface Errors and TCP Performance

Not all network engineers understand the impact of interface errors on TCP performance. Interface errors can have a BIG impact, although this may not be intuitive at first glance.
We recently pointed out some interfaces with extremely high errors to a customer. We mentioned that the links with the highest percentage loss were likely getting very little useful data through them, and that they should investigate the cause of these errors. Initially the customer did not appear to be very concerned because the percent of errors was below 3%. We personally find error rates of greater than 0.001% to be a cause for concern.

Based on this experience, I thought I’d write up an article to illustrate the impact of interface errors.

Best TCP/IP Performance Expected
Perhaps the first question to consider is “What is the best TCP/IP performance you can expect on a Gigabit Ethernet link in the campus?”
First let’s look at the buffering required for TCP, which is the bandwidth delay product (BDP). With a Gigabit Ethernet link, the buffering required in a receiving system for maximum performance is the amount of data that can be sent between ACKs. The bandwidth of a Gigabit link is 1000 Mbps. If the data exchange is inside a campus, say between a data center server and a user, the RTT should be very small, perhaps 2 milliseconds or .002 seconds. So for a Gigabit link, the receiving system needs to be able to buffer bandwidth * delay:

BDP = 1000 Mb/s * .002 seconds
BDP = 1000 Mb/s * (1 byte/8 bits) * .002 seconds
BDP = 125,000,000 bytes/s * .002 seconds
BDP = 250,000 Bytes
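
The arithmetic above can be checked with a short Python sketch. The function name `bdp_bytes` and its parameter names are illustrative choices, not something from the original calculation.

```python
def bdp_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Bandwidth delay product in bytes for a link speed given in bits per second."""
    bytes_per_second = bandwidth_bps / 8  # 8 bits per byte
    return bytes_per_second * rtt_seconds


# Gigabit Ethernet (1000 Mb/s) with the assumed 2 ms campus RTT
gige_bdp = bdp_bytes(1_000_000_000, 0.002)
print(f"BDP = {gige_bdp:,.0f} bytes")  # prints "BDP = 250,000 bytes"
```

Changing either argument shows how quickly the required buffering grows with link speed or round trip time.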

When the BDP is less than the TCP window size, the path bandwidth is the limiting factor in throughput. For a Gigabit Ethernet link, the BDP of 250,000 Bytes is greater than the default TCP window size of 32,000 Bytes, so the window size, not the path bandwidth, will be the limiting factor.
When the TCP window size is less than the buffering required to keep the pipe filled, the mechanics of TCP operation affect the maximum throughput. In this case, the sending system sends a full TCP window worth of data, waits for an acknowledgement from the receiver, then sends again. The application is not using the send-window mechanism that would allow TCP to fill the bandwidth pipe. Only when an ACK is received can more data be sent. Therefore, the maximum throughput that can be achieved for a source and destination is the window size divided by the time it takes to get back an ACK (i.e., the round trip time). In this case, the best throughput you can achieve is the chunk size (amount of data sent per window) divided by the round trip time, or:

Max Throughput = chunk size / RTT
Max Throughput in bps = [Bytes * 8 (bits/byte) ] / RTT

Another question to consider is “What is the maximum throughput for a GE link in the data center?”

For this best case calculation, I assume the application sends a chunk of 64,000 Bytes of data across multiple TCP segments and waits for an ACK before sending more data. As before, the RTT inside the campus is assumed to be 2 milliseconds or .002 seconds. So the maximum rate for a single file transfer would be:

64,000 * 8 / .002 = 256,000,000 bps or 256 Mbps

Conclusion: If the RTT is 2ms, a maximum rate of about 256Mbps is possible in the campus across a Gigabit Ethernet link.
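
As a rough sketch of the window-limited calculation, the following Python compares a few window sizes against the Gigabit link rate. The function names are hypothetical, and capping the result at the link speed is an assumption made here to keep the numbers physically sensible.

```python
def window_limited_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Max throughput when one window (chunk) is sent per round trip."""
    return window_bytes * 8 / rtt_seconds  # bytes -> bits, then per second


def best_case_bps(link_bps: float, window_bytes: float, rtt_seconds: float) -> float:
    """The sender can never exceed the link rate, so take the smaller value."""
    return min(link_bps, window_limited_bps(window_bytes, rtt_seconds))


GIGE_BPS = 1_000_000_000
RTT_SECONDS = 0.002  # assumed campus round trip time

for window in (32_000, 64_000, 250_000):
    rate = best_case_bps(GIGE_BPS, window, RTT_SECONDS)
    print(f"{window:>7,} byte window: {rate / 1e6:,.0f} Mbps")
```

With these inputs the 32,000 Byte window tops out near 128 Mbps, the 64,000 Byte chunk near 256 Mbps, and only a window equal to the 250,000 Byte BDP reaches the full link rate.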

Expected TCP/IP Performance With Errors
A third question to consider is “What is the impact of errors on TCP/IP performance on a Gigabit Ethernet link in the campus?”

Note: There are several potential sources of interface errors, including interface discards when there is insufficient bandwidth to support the traffic volume, misconfigured duplex and speed settings, excessive buffering on interfaces, misconfigured EtherChannels, and faulty cables or hardware.

First we consider what an acceptable error rate is. Based on the IEEE 802.3ab standard, the Bit Error Rate (BER) considered acceptable for 1000BaseT circuits is 1 in 1*10^10 bits.

1 bit error in 1*10^10 bits = 1 bit error in 1.25*10^9 bytes

If we assume an average packet is 1000 bytes long, the 1000BaseT BER corresponds to 1 lost packet in 1.25*10^6 packets. On a percentage basis, 1 packet lost/1.25*10^6 packets = 8*10^-7 = .00008%
Therefore we could round this up and expect to see at most .0001% packet loss on the Gigabit Ethernet cable.
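
The conversion from bit error rate to packet loss can be sketched in Python as follows. The function name is hypothetical, and the calculation assumes, as the text above does, that errors are rare enough that each bit error ruins a different packet.

```python
def packet_loss_from_ber(bit_error_rate: float, packet_bytes: int) -> float:
    """Approximate packet loss probability for a given bit error rate.

    Assumption: every bit error lands in a different packet, which is
    reasonable only when errors are rare.
    """
    bits_per_packet = packet_bytes * 8
    return bit_error_rate * bits_per_packet


# 1000BaseT acceptable BER (1 in 10^10 bits) with 1000 byte packets
loss = packet_loss_from_ber(1e-10, 1000)
print(f"{loss * 100:.5f}% packet loss")  # prints "0.00008% packet loss"
```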

Note: This is a very generous packet size. An average of 300 to 450 bytes may be more common for enterprises that carry VoIP, but the 1000 byte packet size was chosen for easier math.

However, the TCP path can also experience packet loss due to performance and configuration issues with the servers and network devices. TCP performance is degraded as packets are lost and need to be retransmitted. The Mathis equation is a formula that approximates the actual impact of loss on the maximum throughput rate:

Max Rate in bps < (MSS/RTT)*(1 / sqrt(p))

where

MSS = maximum segment size in bytes
RTT = round trip time in seconds
p = the probability of packet loss

Note that this formula includes a constant with a value that is approximately 1, which is left out above, and that the calculations below treat the result as bits per second. The formula comes from a 1997 paper titled “The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm.”

Now we can apply the Mathis equation to the example GE link. For the MSS we will use 1460 bytes, since this will fit into one TCP packet (when the MTU of the network gear is 1500 bytes). We assume that the application sends a chunk of 1460 Bytes of data and waits for an ACK before sending more data. Since this data exchange is inside the campus, we are again assuming that the RTT is .002 seconds. So for a single file transfer at the standard 1000BaseT BER of 0.0001% losses, the maximum rate will be:

Mathis Max Rate in bps < (MSS/RTT)*(1 / sqrt(p))
Max rate in bps < (1460/.002)*(1/ sqrt(.000001))
Max rate in bps < 7.3*10^8 bps
Max rate in bps < 730 Mbps
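
A minimal Python sketch of this calculation is shown below. The function name `mathis_max_rate` is an illustrative choice, the constant is taken as 1, and the result is read directly as bits per second to match the arithmetic in this article; all three are assumptions of the sketch rather than a definitive implementation of the Mathis model.

```python
import math


def mathis_max_rate(mss_bytes: float, rtt_seconds: float, loss_probability: float) -> float:
    """Upper bound (MSS / RTT) * (1 / sqrt(p)), as the formula is written above.

    Assumption: the constant is 1 and the result is treated as bits per
    second, following the article's convention.
    """
    if not 0 < loss_probability <= 1:
        raise ValueError("loss_probability must be in the range (0, 1]")
    return (mss_bytes / rtt_seconds) / math.sqrt(loss_probability)


# 1460 byte MSS, 2 ms RTT, 0.0001% loss expressed as a probability
rate = mathis_max_rate(1460, 0.002, 0.000001)
print(f"{rate / 1e6:.0f} Mbps")  # prints "730 Mbps"
```

Note that the loss is passed as a probability (0.000001), not as a percentage (0.0001).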

The predicted Mathis rate exceeds the maximum rate of 256 Mbps we calculated without losses, so the maximum rate will be the lesser of these two calculations, or 256 Mbps. This result is reasonable: circuits that meet the acceptable BER for Gigabit Ethernet do not adversely impact TCP performance.
What happens at our threshold rate of concern? In this case, we have 0.001% losses, or 1 packet in 100,000.

Mathis Max Rate in bps < (MSS/RTT)*(1 / sqrt(p))
Max rate in bps < (1460/.002)*(1/ sqrt(.00001))
Max rate in bps < 2.3*10^8 bps
Max rate in bps < 231 Mbps

Since this is within 10% of the predicted 256 Mbps, we deem it “acceptable.”
However, what happens if the line has 0.01% losses, or 1 lost packet in 10,000 packets?

Mathis Max Rate in bps < (MSS/RTT)*(1 / sqrt(p))
Max rate in bps < (1460/.002)*(1/ sqrt(.0001))
Max rate in bps < 7.3*10^7 bps
Max rate in bps < 73 Mbps

This is significantly below the predicted 256Mbps. This reduced rate will cause a noticeable impact on application performance.
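
Rearranging the same formula for p gives the loss rate at which the loss-limited rate drops to a chosen target. The Python sketch below does this for the 256 Mbps window limit; the function name is hypothetical and the units follow this article's convention for the Mathis calculation.

```python
def loss_for_target_rate(mss_bytes: float, rtt_seconds: float, target_rate: float) -> float:
    """Solve rate = (MSS / RTT) * (1 / sqrt(p)) for p.

    Assumption: same simplified formula and units as used in this article.
    """
    return (mss_bytes / (rtt_seconds * target_rate)) ** 2


# Loss probability at which the Mathis bound equals the 256 Mbps window limit
p = loss_for_target_rate(1460, 0.002, 256_000_000)
print(f"{p * 100:.4f}% loss")  # prints "0.0008% loss"
```

Under these assumptions, loss rates above roughly 0.0008% make packet loss, rather than the window size, the limiting factor.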
Looking back at the beginning of the article, what is the impact of less than 3% errors? If we use an error rate of 3%, the maximum throughput on the Gigabit Ethernet link is drastically reduced:

Mathis Max Rate in bps < (MSS/RTT)*(1 / sqrt(p))
Max rate in bps < (1460/.002)*(1/ sqrt(.03))
Max rate in bps < 4.2*10^6 bps
Max rate in bps < 4.2 Mbps

This drastically reduced rate could easily cause sluggish application performance.
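
The four cases worked through above can be tabulated with a short Python sketch that takes the lesser of the window limit and the Mathis bound for each loss rate. The names and the list of loss rates are illustrative, and the Mathis result is again read as bits per second, following this article's convention.

```python
import math

MSS_BYTES = 1460
RTT_SECONDS = 0.002
WINDOW_LIMIT_BPS = 64_000 * 8 / RTT_SECONDS  # 256 Mbps best case without loss


def mathis_max_rate(loss_probability: float) -> float:
    """Simplified Mathis bound with the constant taken as 1 (assumption)."""
    return (MSS_BYTES / RTT_SECONDS) / math.sqrt(loss_probability)


def effective_rate(loss_probability: float) -> float:
    """Throughput is capped by whichever limit is lower."""
    return min(WINDOW_LIMIT_BPS, mathis_max_rate(loss_probability))


for loss_percent in (0.0001, 0.001, 0.01, 3.0):
    p = loss_percent / 100  # percentage -> probability
    print(
        f"{loss_percent:>7}% loss: "
        f"Mathis {mathis_max_rate(p) / 1e6:8.1f} Mbps, "
        f"effective {effective_rate(p) / 1e6:6.1f} Mbps"
    )
```

The effective rate stays at the window limit for the lowest loss rate and then falls off with the square root of the loss probability.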

Conclusion:
The following diagram illustrates the impact of interface errors on TCP throughput. We see that host-to-host system performance quickly degrades.

I believe that any interface error rates that exceed 0.01% should be a cause for alarm and immediate investigation/resolution. I hope this discussion helps explain why you should be very concerned about interface errors!

— cwr

_________________________________________________________________________________________

6 responses to “Understanding Interface Errors and TCP Performance”

Mike Courtney says:

November 5, 2011 at 4:20 am

This is a really great post! Thanks for taking the time to put it together.
Shiran says:

November 6, 2011 at 8:36 pm

very nice article 🙂
note: 125,000,000 * 0.002 = 250,000 not 625,000
Carole Warner Reece says:

November 7, 2011 at 2:35 am

thanks, updated my typo…

Carole
Ken says:

December 6, 2011 at 6:30 am

Excellent article. I looked in the details and may have missed it, but another issue with errors is TCP slow start:
http://www.faqs.org/rfcs/rfc2001.html

If the device sending data performs TCP slow start then that further slows down your transfer speeds while the data flow is "ramping back up". A 64k window with a fairly small amount of packets loss can cause a "sustained" throughput of about 20k.
Carole Warner Reece says:

December 19, 2011 at 2:10 am

Thanks for your comment Ken. Yes, TCP slow start will further slow down your throughput.

Carole
Dandy says:

June 27, 2012 at 4:38 am

Hi Carole,

Great article about interface errors impact on TCP performance.

May I have your permission to re-post this in our internal blog? I put the original author name and link on top of the article I re-posted in our internal blog.

Dandy
