Of course, the network management tools I was using reported interfaces with high percentages of errors, drops, or overruns. Fortunately for me, the NMS was not looking at the switch interfaces that connected to several key servers that were having performance problems, so I collected some basic stats using show interface. How could that be ‘fortunate’? Isn’t that a bad way to collect interface performance data? Normally, I would say yes, but having to fall back to the CLI for data collection taught me something new.
In the interface stats, I found ingress overruns, which occur when a server sends data faster than the switch can handle. The switch ports were on an old interface card with eight ports serviced by a single ASIC, and the ASIC had an aggregate throughput of 1Gbps. The server interfaces were configured for 1Gbps operation, so it only took a couple of these servers to overrun the ASIC. That’s what caused the ingress overruns.
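The arithmetic behind those overruns is worth making explicit: eight ports at 1Gbps each can offer up to 8Gbps to an ASIC that can only forward 1Gbps in aggregate. A minimal sketch of that math (port counts and speeds are taken from the card described above; the variable names are my own):

```python
# Oversubscription math for the ASIC described above:
# eight 1Gbps ports funneling into one ASIC with 1Gbps aggregate throughput.
PORTS = 8
PORT_SPEED_GBPS = 1
ASIC_CAPACITY_GBPS = 1

offered_load = PORTS * PORT_SPEED_GBPS     # worst case: every port at line rate
ratio = offered_load / ASIC_CAPACITY_GBPS  # 8:1 oversubscription

print(f"Worst-case offered load: {offered_load}Gbps")
print(f"Oversubscription ratio: {ratio:.0f}:1")
# Even two servers sending at line rate offer 2Gbps to a 1Gbps ASIC,
# so the excess is dropped and shows up as ingress overruns.
```

An 8:1 ratio is tolerable for bursty access traffic, but not for a handful of servers that can each sustain line rate.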
Finding the ingress overruns got me thinking about all the data that was being collected. Co-worker Carole Warner Reece cooked up a quick Python script to take the output of show interface from all interfaces and create a spreadsheet. I then sorted the spreadsheet by the error counts. Some of the interfaces had high total error counts and high traffic levels, and the error percentage on many of them was about 0.01%. These were key Gig links, so they were worth investigating. Looking back at my prior blog posts on application performance (see blog links above) and the one on the Mathis Equation, you will note that this is enough packet loss to cause problems for TCP throughput.
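To see why 0.01% loss matters, recall that the Mathis Equation approximates a single TCP flow’s throughput ceiling as MSS / (RTT × √p). A quick sketch of that calculation (the MSS and RTT values below are illustrative assumptions, not measurements from this network):

```python
import math

def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate):
    """Approximate TCP throughput ceiling from the Mathis Equation:
    rate <= MSS / (RTT * sqrt(p)), ignoring the small constant factor."""
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss_rate))

# Illustrative values: 1460-byte MSS, 10 ms RTT, 0.01% packet loss.
ceiling = mathis_throughput_bps(1460, 0.010, 0.0001)
print(f"TCP throughput ceiling: {ceiling / 1e6:.1f} Mbps")  # ~116.8 Mbps
```

With these assumptions, a 0.01% loss rate caps a single flow at roughly 117Mbps, well under what a Gig link can carry, so the loss on those key links is far from harmless.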
So I went back to the network assessment tools I had been using and found that the interfaces my tools reported all had much higher error percentages but very low data rates. The high-throughput interfaces I found in the CLI output had error percentages low enough to keep them out of the top few pages of interfaces sorted by error percentage. While it is important to identify the high-percentage-error interfaces (which also had low traffic volumes), it was the high-volume interfaces that were impacting the applications communicating across the network backbone.
The interfaces I was investigating carried very high traffic volumes, had hundreds of thousands of errors, and were key interfaces in the infrastructure. Now I had a clear understanding of my misconception in looking for interface errors. I had always thought that I should look for high error percentages. But here were key infrastructure interfaces exhibiting high error counts whose percentages were low, relative to other, low-volume interfaces, simply because of the total volume transiting them. How should I handle this case?
After thinking about it, I now think that proper interface error reporting should be based on two things. There should be a Top-N report that sorts interfaces by error percentage. This catches all interfaces with high error percentages; with a very high error percentage, an interface won’t pass much good traffic, which keeps the percentage high. A duplex mismatch on a busy interface will look like this. The second report should be a Top-N report that sorts interfaces by the total volume of errors. There may need to be a third report that identifies interfaces with more than 0.0001% errors and sorts them by total traffic volume. This would highlight the busiest interfaces in the infrastructure where enough errors are occurring to impact TCP throughput.
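All three reports can be prototyped from the same per-interface counters. A minimal sketch, assuming each interface has already been reduced to a total packet count and a total error count (the interface names, field names, and sample numbers are hypothetical):

```python
# Prototype of the three Top-N error reports described above.
# Sample data is hypothetical: name, total packets, total errors.
interfaces = [
    {"name": "Gi1/0/1", "packets": 5_000_000_000, "errors": 600_000},
    {"name": "Gi1/0/2", "packets": 20_000, "errors": 1_500},
    {"name": "Gi2/0/7", "packets": 8_000_000_000, "errors": 90_000},
]

for ifc in interfaces:
    ifc["pct_errors"] = 100.0 * ifc["errors"] / ifc["packets"]

# Report 1: Top-N by error percentage (catches duplex mismatches, etc.).
by_pct = sorted(interfaces, key=lambda i: i["pct_errors"], reverse=True)

# Report 2: Top-N by total error volume (catches busy backbone links).
by_volume = sorted(interfaces, key=lambda i: i["errors"], reverse=True)

# Report 3: interfaces above the TCP-impacting threshold (0.0001%),
# sorted by traffic volume so the busiest impacted links rise to the top.
impacted = sorted(
    (i for i in interfaces if i["pct_errors"] > 0.0001),
    key=lambda i: i["packets"],
    reverse=True,
)

for title, report in (("By %", by_pct), ("By errors", by_volume), ("Impacted", impacted)):
    print(title, [i["name"] for i in report])
```

Note how the busy backbone link Gi1/0/1 (0.012% errors) sits near the bottom of the percentage report but at the top of the volume report, which is exactly the blind spot described above.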
I’ll now need to start looking for products that can produce reports like those above. I know that NetMRI displays interface stats, and I can use the search and sort criteria in its displays to show interfaces with problems. I can also export the NetMRI data to a .csv file, open it in a spreadsheet or import it into a database, and do all the filtering and sorting I want. I’m not sure what other tools provide this level of flexibility, but I’ll certainly be looking now.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html.