Rethinking Interface Error Reports

Author
Terry Slattery
Principal Architect

Of course, the network management tools that I was using reported interfaces with high percentages of errors, drops, or overruns. Fortunately for me, the NMS was not looking at the switch interfaces that connected to several key servers that were having performance problems, so I collected some basic stats using show interface. How could that be ‘fortunate’? Isn’t that a bad way to collect interface performance data? Normally, I would say yes, but having to fall back to the CLI for data collection taught me something new.

In the interface stats, I found ingress overruns, which occur when a server sends data faster than the switch can handle. The switch ports were on an old interface card that had eight ports serviced by one ASIC, and the ASIC had an aggregate throughput of 1 Gbps. The server interfaces were configured for 1 Gbps operation. It only took a couple of these servers to overrun the ASIC. That’s what caused the ingress overruns.
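The arithmetic behind that overrun can be sketched in a few lines of Python. This is a minimal illustration using the figures above (eight ports per ASIC, 1 Gbps ports, 1 Gbps aggregate ASIC throughput); the function name is hypothetical, and it assumes every active server is sending at line rate at the same moment, which is a worst case rather than a measurement.

```python
# Figures from the article: eight 1 Gbps ports share one 1 Gbps ASIC.
ASIC_GBPS = 1.0
PORTS_PER_ASIC = 8
PORT_GBPS = 1.0


def oversubscription(active_ports, port_gbps=PORT_GBPS, asic_gbps=ASIC_GBPS):
    """Ratio of offered load to ASIC capacity if every active port sends at line rate."""
    return active_ports * port_gbps / asic_gbps


for active in (1, 2, PORTS_PER_ASIC):
    ratio = oversubscription(active)
    state = "overrun likely" if ratio > 1 else "within capacity"
    print(f"{active} active port(s): {ratio:.0f}:1 offered load ({state})")
```

Any ratio above 1:1 means the ASIC is being offered more traffic than it can forward, so two busy servers on the same ASIC are already enough.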

Finding the ingress overruns got me thinking about all the data that was being collected. Co-worker Carole Warner Reece cooked up a quick Python script to take the output of show interface from all interfaces and create a spreadsheet. I then sorted the spreadsheet by the error counts. Some of the interfaces had high total error counts and high traffic levels, and the percentage of errors on many of them was about 0.01%. These were key Gig links, so it was worth investigating. Looking back at my prior blog posts on application performance (see blog links above) and the one on the Mathis Equation, you will note that this is enough packet loss to cause problems for TCP throughput.
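The original script is not shown here, so the following is only a rough sketch of what such a tool might look like, not Carole's code. It assumes Cisco IOS-style show interface text; the sample output, counter names, and function names are invented for illustration, and real output varies by platform and software version.

```python
import csv
import io
import re

# Illustrative sample only: these counters are invented, not taken from the article.
SAMPLE = """\
GigabitEthernet1/1 is up, line protocol is up
     1200000000 packets input, 900000000000 bytes, 0 no buffer
     120000 input errors, 0 CRC, 0 frame, 120000 overrun, 0 ignored
     800000000 packets output, 600000000000 bytes, 0 underruns
     0 output errors, 0 collisions, 1 interface resets
GigabitEthernet1/2 is up, line protocol is up
     50000 packets input, 4000000 bytes, 0 no buffer
     2500 input errors, 2500 CRC, 0 frame, 0 overrun, 0 ignored
     40000 packets output, 3000000 bytes, 0 underruns
     0 output errors, 0 collisions, 0 interface resets
"""

# The first line of each interface block names the interface.
HEADER = re.compile(r"^(\S+) is .*line protocol is")

# Counters to extract; each pattern captures the number before its label.
COUNTERS = {
    "in_pkts": re.compile(r"(\d+) packets input"),
    "in_errors": re.compile(r"(\d+) input errors"),
    "overruns": re.compile(r"(\d+) overrun"),
    "out_pkts": re.compile(r"(\d+) packets output"),
    "out_errors": re.compile(r"(\d+) output errors"),
}


def parse_show_interface(text):
    """Return one dict of counters per interface found in the text."""
    rows, current = [], None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            current = {"interface": header.group(1), **{k: 0 for k in COUNTERS}}
            rows.append(current)
            continue
        if current is None:
            continue  # skip anything before the first interface header
        for name, pattern in COUNTERS.items():
            found = pattern.search(line)
            if found:
                current[name] = int(found.group(1))
    return rows


def to_csv(rows):
    """Render the parsed rows as CSV text that a spreadsheet can open."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["interface", *COUNTERS])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = parse_show_interface(SAMPLE)
# Sort by total error count, highest first, as was done in the spreadsheet.
rows.sort(key=lambda r: r["in_errors"] + r["out_errors"], reverse=True)
print(to_csv(rows))
```

Sorting in the script is optional; writing the raw counters to CSV and sorting in the spreadsheet works just as well.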

So I went back to the network assessment tools that I used and found that the interfaces reported by my tools all had much higher error percentages, but very low data rates. The high-throughput interfaces that I found in the CLI output had error percentages low enough to keep them off the top few pages of interfaces with high error percentages. While it is important to identify the high-percentage error interfaces (which also had low traffic volumes), it was the high-volume interfaces that were impacting the applications that communicated across the network backbone.
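To get a feel for why an error rate of about 0.01% on a busy link matters to applications, here is a small sketch of the Mathis Equation in its simplified form, throughput <= MSS / (RTT * sqrt(p)). Several things here are assumptions rather than facts from the article: the constant factor is taken as 1, the MSS is assumed to be 1460 bytes, the round-trip times are hypothetical, and the interface error rate is treated as if it were the end-to-end packet loss probability.

```python
import math


def mathis_max_throughput_bps(mss_bytes, rtt_seconds, loss_fraction):
    """Approximate upper bound on steady-state TCP throughput, in bits per second.

    Simplified Mathis form with the constant factor taken as 1 (an assumption).
    """
    return (mss_bytes * 8) / (rtt_seconds * math.sqrt(loss_fraction))


LOSS = 0.01 / 100  # 0.01% expressed as a fraction
MSS = 1460         # assumed MSS for a standard 1500-byte Ethernet MTU

for rtt_ms in (1, 10, 50):  # hypothetical round-trip times
    bps = mathis_max_throughput_bps(MSS, rtt_ms / 1000, LOSS)
    print(f"RTT {rtt_ms:>2} ms: at most about {bps / 1e6:,.1f} Mbps per TCP flow")
```

Under these assumptions, a single flow with a 10 ms round trip tops out at roughly 117 Mbps, well below what a Gig link can carry, which is why a percentage that looks negligible in a report can still hurt.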

The interfaces that I was investigating had very high traffic volume, had hundreds of thousands of errors, and were key interfaces in the infrastructure. Now I had a clear understanding of my misconception in looking for interface errors. I had always thought that I should look for high percent errors. But here were key infrastructure interfaces that were exhibiting high error counts, and because of the total volume transiting the interfaces, their percentage was low relative to other, low-volume interfaces. How should I handle this case?

After thinking about it, I now believe that the proper interface error sorting should be based on two reports, and possibly a third. The first should be a Top-N report that sorts interfaces by percent errors. This catches all interfaces that have high percent errors. When the error rate is very high, an interface won’t pass much good traffic, which keeps the error percentage high. A duplex mismatch on a busy interface will look like this. The second report should be a Top-N report that sorts interfaces by the total volume of errors. There may need to be a third report that identifies interfaces that have more than 0.0001% errors and sorts them by total traffic volume. This would highlight the busiest interfaces in the infrastructure where enough errors are occurring to impact TCP throughput.
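The three reports can be sketched as follows. This is a minimal illustration, not a feature of any particular product: the class, function names, and sample counters are all hypothetical, and only the 0.0001% threshold comes from the text above.

```python
from dataclasses import dataclass


@dataclass
class IfStats:
    name: str
    packets: int  # total packets seen on the interface
    errors: int   # total errors seen on the interface

    @property
    def error_pct(self):
        """Errors as a percentage of packets (0 when there is no traffic)."""
        return 100.0 * self.errors / self.packets if self.packets else 0.0


def top_by_error_pct(stats, n):
    """Report 1: Top-N interfaces by percent errors."""
    return sorted(stats, key=lambda s: s.error_pct, reverse=True)[:n]


def top_by_error_count(stats, n):
    """Report 2: Top-N interfaces by total volume of errors."""
    return sorted(stats, key=lambda s: s.errors, reverse=True)[:n]


def busiest_above_threshold(stats, threshold_pct, n):
    """Report 3: interfaces above an error threshold, sorted by traffic volume."""
    flagged = [s for s in stats if s.error_pct > threshold_pct]
    return sorted(flagged, key=lambda s: s.packets, reverse=True)[:n]


# Invented sample data: one busy backbone link and a few quieter ports.
STATS = [
    IfStats("Gi1/1", packets=1_200_000_000, errors=120_000),  # 0.01%
    IfStats("Gi1/2", packets=50_000, errors=2_500),           # 5%
    IfStats("Gi1/3", packets=900_000_000, errors=0),          # clean
    IfStats("Gi1/4", packets=2_000_000, errors=400),          # 0.02%
]

for title, report in (
    ("By percent errors", top_by_error_pct(STATS, 3)),
    ("By error count", top_by_error_count(STATS, 3)),
    ("Busiest above 0.0001%", busiest_above_threshold(STATS, 0.0001, 3)),
):
    print(title + ":", ", ".join(s.name for s in report))
```

With this sample data, the busy Gi1/1 link ranks third in the percent report but first in the other two, which is the situation described above.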

I’ll now need to start looking for products that allow me to produce reports like those above. I know that NetMRI displays interface stats, and I can use the search and sort criteria in the displays to show interfaces with problems. I can also export the NetMRI data to a .csv file, open it in a spreadsheet or import it into a database, and do all the filtering and sorting I want. I’m not sure what other tools provide this level of flexibility, but I’ll certainly be looking now.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html


 
