Not all network engineers understand the impact of interface errors on TCP performance. Interface errors can have a big impact, although this may not be intuitive at first glance.
We recently pointed out some interfaces with extremely high errors to a customer. We mentioned that the links with the highest percentage loss were likely getting very little useful data through them, and that they should investigate the cause of these errors. Initially the customer did not appear to be very concerned because the percent of errors was below 3%. We personally find error rates of greater than 0.001% to be a cause for concern.
Based on this experience, I thought I’d write up an article to illustrate the impact of interface errors.
Best TCP/IP Performance Expected
Perhaps the first question to consider is “What is the best TCP/IP performance you can expect on a Gigabit Ethernet link in the campus?”
First let’s look at the buffering required for TCP, which is the bandwidth delay product (BDP). With a Gigabit Ethernet link, the buffering required in a receiving system for maximum performance is the amount of data that can be sent between ACKs. The bandwidth of a Gigabit link is 1000 Mbps. If the data exchange is inside a campus, say between a data center server and a user, the RTT should be very small, perhaps 2 milliseconds or .002 seconds. So for a Gigabit link, the receiving system needs to be able to buffer bandwidth * delay:
BDP = 1000 Mb/s * .002 seconds
BDP = 1000 Mb/s * (1 byte/8 bits) * .002 seconds
BDP = 125,000,000 bytes/s * .002 seconds
BDP = 250,000 Bytes
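The same arithmetic can be sketched in a few lines of Python. This is only an illustration of the calculation above; the function name `bandwidth_delay_product_bytes` is my own and does not come from any library.

```python
def bandwidth_delay_product_bytes(bandwidth_bps: float, rtt_seconds: float) -> float:
    """Return the bandwidth delay product in bytes (hypothetical helper)."""
    bytes_per_second = bandwidth_bps / 8  # 8 bits per byte
    return bytes_per_second * rtt_seconds


# Gigabit Ethernet link with the 2 ms campus round trip time assumed above.
bdp = bandwidth_delay_product_bytes(1_000_000_000, 0.002)
print(f"BDP = {bdp:,.0f} bytes")
```

Changing the RTT argument shows how quickly the required buffering grows on longer paths.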
When the BDP is less than the TCP window size, the path bandwidth is the limiting factor in throughput. For a Gigabit Ethernet link with a 2 ms RTT, the BDP of 250,000 Bytes is greater than the default TCP window size of 32,000 Bytes, so the path bandwidth will not be the limiting factor.
When the TCP window size is less than the buffering required to keep the pipe filled, the mechanics of TCP operation affect the maximum throughput. In this case, the sending system sends a full TCP window worth of data, waits for an acknowledgement from the receiver, then sends again. The application is not using the send-window mechanism that would allow TCP to fill the bandwidth pipe. Only when an ACK is received can more data be sent. Therefore, the maximum throughput that can be achieved for a source and destination is the window size divided by the time it takes to get back an ACK (i.e., the round trip time). In this case, the best throughput you can achieve is the chunk size (amount of data sent per window) divided by the round trip time, or
Max Throughput in bps = [Bytes * 8 (bits/byte) ] / RTT
Another question to consider is “What is the maximum throughput for a GE link in the data center?”
For this best case calculation, I assume the application sends a chunk of 64,000 Bytes of data across multiple TCP segments and waits for an ACK before sending more data. If the data exchange is inside a campus, say between a data center server and a user, the RTT should be very small, perhaps 2 milliseconds or .002 seconds. So the maximum rate for a single file transfer would be
Max rate in bps = [64,000 Bytes * 8 (bits/byte)] / .002 seconds
Max rate in bps = 2.56*10^8 bps
Max rate in bps = 256 Mbps
Conclusion: If the RTT is 2ms, a maximum rate of about 256Mbps is possible in the campus across a Gigabit Ethernet link.
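As a quick check of that conclusion, here is a minimal Python sketch of the window-divided-by-RTT calculation. It assumes the 64,000 byte chunk and 2 ms RTT used above, and the function name is just an illustrative choice.

```python
def window_limited_throughput_bps(window_bytes: float, rtt_seconds: float) -> float:
    """Best case throughput when one window of data is sent per round trip."""
    return window_bytes * 8 / rtt_seconds  # bytes -> bits, then divide by RTT


# 64,000 byte chunk per round trip, 2 ms RTT inside the campus.
rate_bps = window_limited_throughput_bps(64_000, 0.002)
print(f"Max rate = {rate_bps / 1e6:.0f} Mbps")
```

Since the RTT is in the denominator, doubling it to 4 ms would halve the best case rate for the same chunk size.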
Expected TCP/IP Performance With Errors
A third question to consider is “What is the impact of errors on TCP/IP performance on a Gigabit Ethernet link in the campus?”
Note: There are several potential sources of interface errors, including interface discards when there is insufficient bandwidth to support the traffic volume, misconfigured duplex and speed settings, excessive buffering on interfaces, misconfigured EtherChannels, and faulty cables or hardware.
First we consider what an acceptable error rate is. Based on the IEEE 802.3ab standard, the Bit Error Rate (BER) considered acceptable for 1000BaseT circuits is 1 in 1*10^10 bits.
If we assume an average packet is 1000 bytes (8000 bits) long, the 1000BaseT BER would correspond to 1 lost packet in 10^10/8000 = 1.25*10^6 packets. On a percentage basis, 1 packet lost/1.25*10^6 = 8*10^-7 = .00008%
Therefore we could round this up and really expect to see at most .0001% packet loss on the Gigabit Ethernet cable.
Note: This is a very generous packet size. An average of perhaps 300 to 450 bytes may be more common for enterprises, including VoIP. However, the 1000 byte packet size was chosen for easier math.
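The conversion from bit error rate to packet loss can be sketched as follows. This is a rough illustration that uses the same simple approximation as the text (loss is about BER times bits per packet, which only holds when the result is much smaller than 1). The 400 byte case is my own pick from the 300 to 450 byte range mentioned in the note.

```python
def packet_loss_from_ber(ber: float, packet_bytes: int) -> float:
    """Approximate packet loss probability for a given bit error rate.

    Assumes one bad bit loses the whole packet and that the result is
    much smaller than 1, so loss ~= BER * bits per packet.
    """
    bits_per_packet = packet_bytes * 8
    return ber * bits_per_packet


BER_1000BASET = 1e-10  # 1 errored bit in 10^10 bits

for size in (1000, 400):  # 400 bytes is an illustrative smaller average
    loss = packet_loss_from_ber(BER_1000BASET, size)
    print(f"{size} byte packets: loss probability {loss:.1e}")
```

Smaller packets carry fewer bits each, so the same BER produces an even lower packet loss percentage.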
However, the TCP path can experience packet loss due to performance and configuration issues with the servers and network devices. TCP performance is degraded as packets are lost and need to be retransmitted. The Mathis equation is a formula that approximates the actual impact of loss on the maximum throughput rate:
Max rate in bps < (MSS/RTT)*(1/ sqrt(p))
- MSS = maximum segment size in bytes
- RTT = round trip time in seconds
- p = the probability of packet loss
Note that this formula includes a constant with a value that is approximately 1 that resolves the bytes to bits… The formula is known as the Mathis equation, from a 1997 paper titled The Macroscopic Behavior of the TCP Congestion Avoidance Algorithm.
Now we can apply the Mathis equation to the example GE link. For the MSS we will use 1460 bytes, since this will fit into one TCP packet (when the MTU of the network gear is 1500 bytes). We assume that the application will send a chunk of 1460 Bytes of data and wait for an ACK before sending more data. Since this data exchange is inside the campus, we are again assuming that the RTT is .002 seconds. So for a single file transfer with the standard BER for 1000BaseT cable of 0.0001% losses (p = .000001), the maximum rate will be:
Max rate in bps < (1460/.002)*(1/ sqrt(.000001))
Max rate in bps < 7.3*10^8 bps
Max rate in bps < 730 Mbps
The predicted Mathis rate exceeds the maximum rate of 256 Mbps we calculated without losses, so the maximum rate will be the lesser of these two calculations, or 256 Mbps. This result is reasonable: circuits that meet the acceptable BER for Gigabit Ethernet do not adversely impact TCP performance.
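The "lesser of the two limits" logic can be written out as a short Python sketch. It uses the simplified form of the Mathis equation exactly as applied in this article, (MSS/RTT)*(1/sqrt(p)), with the constant taken as 1 and the result read as bps. The function and variable names are my own.

```python
import math


def mathis_limit_bps(mss_bytes: float, rtt_seconds: float, loss_probability: float) -> float:
    """Simplified Mathis bound as used in this article (constant taken as 1)."""
    return (mss_bytes / rtt_seconds) / math.sqrt(loss_probability)


WINDOW_LIMIT_BPS = 256e6  # best case rate calculated earlier without losses

# 1460 byte MSS, 2 ms RTT, 0.0001% loss (p = .000001)
mathis_bps = mathis_limit_bps(1460, 0.002, 0.000001)
expected_bps = min(mathis_bps, WINDOW_LIMIT_BPS)

print(f"Mathis limit: {mathis_bps / 1e6:.0f} Mbps")
print(f"Expected max: {expected_bps / 1e6:.0f} Mbps")
```

Taking the minimum reflects that loss can only lower the rate below the no-loss ceiling, never raise it.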
What happens at our threshold rate of concern? In this case, we have 0.001% losses, or 1 packet in 100,000.
Max rate in bps < (1460/.002)*(1/ sqrt(.00001))
Max rate in bps < 2.3*10^8 bps
Max rate in bps < 231 Mbps
Since this is within 10% of the predicted 256 Mbps, we deem it “acceptable.”
However, what happens if the line has 0.01% losses, or 1 lost packet in 10,000 packets?
Max rate in bps < (1460/.002)*(1/ sqrt(.0001))
Max rate in bps < 7.3*10^7 bps
Max rate in bps < 73 Mbps
This is significantly below the predicted 256Mbps. This reduced rate will cause a noticeable impact on application performance.
Looking back at the beginning of the article, what is the impact of less than 3% errors? If we use an error rate of 3%, the maximum throughput on the Gigabit Ethernet link is drastically reduced:
Max rate in bps < (1460/.002)*(1/ sqrt(.03))
Max rate in bps < 4.2*10^6 bps
Max rate in bps < 4.2 Mbps
This drastically reduced rate could easily cause sluggish application performance.
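To see all of these cases side by side, the calculations above can be repeated in a small loop. This is only a sketch under the same assumptions as the worked examples (1460 byte MSS, 2 ms RTT, 256 Mbps ceiling without losses, and the simplified Mathis form with the result read as bps). The helper name `expected_rate_bps` is hypothetical.

```python
import math

MSS_BYTES = 1460
RTT_SECONDS = 0.002
WINDOW_LIMIT_BPS = 256e6  # best case rate without losses


def expected_rate_bps(loss_percent: float) -> float:
    """Lesser of the no-loss ceiling and the simplified Mathis bound."""
    p = loss_percent / 100  # percent -> probability
    mathis = (MSS_BYTES / RTT_SECONDS) / math.sqrt(p)
    return min(mathis, WINDOW_LIMIT_BPS)


# Loss percentages discussed in this article.
for loss_percent in (0.0001, 0.001, 0.01, 3.0):
    rate = expected_rate_bps(loss_percent)
    share = 100 * rate / WINDOW_LIMIT_BPS
    print(f"{loss_percent}% loss: {rate / 1e6:.1f} Mbps ({share:.0f}% of best case)")
```

Because the bound falls with the square root of the loss probability, every 100x increase in loss cuts the achievable rate by a factor of 10 once the Mathis bound drops below the ceiling.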
The following diagram illustrates the impact of interface errors on TCP throughput. We see that host-to-host system performance quickly degrades.
I believe that any interface error rate that exceeds 0.01% should be a cause for alarm and immediate investigation and resolution. I hope this discussion helps explain why you should be very concerned about interface errors!
References on TCP Performance, Reliability, and the Mathis Equation
This article summarizes ideas from several sources of information: