Network Monitoring: Top Performance Items to Watch, Part 2

Peter Welcher
Architect, Operations Technical Advisor

We’ve been discussing the many things that could be killing your network’s performance – often quietly and without your knowledge. Last time, we covered the value of using the right tools to get network management data that you need. Let’s continue with a discussion of syslog, debug, managing voice and video traffic, and more.

I encounter a lot of sites that ignore syslog. Yes, there’s a large noise-to-signal ratio there, but there are golden needles in that haystack, and there are free tools that summarize syslog data. A tool like Splunk or syslog-ng (free in most Linux distributions) can help you send alerts based on the items of interest. Splunk can also give you frequency-count-based reports to separate repeated events that might be of concern from one-time blips that aren’t worth investigating.

The one big syslog item that comes immediately to mind is Spanning Tree topology changes, which indicate instability. I don’t know of any other simple way to be alerted when your Spanning Tree gets unstable.

I also like to look at things like EIGRP or OSPF neighbor loss. If that’s happening too often, you have little periods of routing instability, which translates to little blips with possibly no application throughput. My recommendation is to capture syslog, then pull, say, a month or two’s worth, run the free Python scripts that are out there to summarize, and look for “gee, I wish I’d known that” items.
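As a rough idea of what such a summary script might look like, here is a minimal Python sketch. It assumes Cisco IOS-style messages carrying a %FACILITY-SEVERITY-MNEMONIC tag; the function names, the threshold, and the sample lines are illustrative and not taken from any particular tool or network.

```python
import re
from collections import Counter

# Cisco IOS-style messages carry a %FACILITY-SEVERITY-MNEMONIC tag.
TAG = re.compile(r"%([A-Z0-9_]+)-([0-7])-([A-Z0-9_]+)")


def summarize(lines):
    """Count how often each FACILITY-SEVERITY-MNEMONIC tag appears."""
    counts = Counter()
    for line in lines:
        match = TAG.search(line)
        if match:
            counts["-".join(match.groups())] += 1
    return counts


def repeated(counts, threshold):
    """Keep only tags seen at least `threshold` times, most frequent first."""
    return [(tag, n) for tag, n in counts.most_common() if n >= threshold]


# Illustrative sample lines, not captured from a real network.
SAMPLE = [
    "Mar  1 02:10:11 core1 %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on Gi0/1 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "Mar  1 02:10:52 core1 %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on Gi0/1 from LOADING to FULL, Loading Done",
    "Mar  1 03:15:40 core1 %OSPF-5-ADJCHG: Process 1, Nbr 10.0.0.2 on Gi0/1 from FULL to DOWN, Neighbor Down: Dead timer expired",
    "Mar  2 11:02:03 dist2 %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 10.1.1.2 (Gi0/0) is down: holding time expired",
    "Mar  2 11:02:09 dist2 %DUAL-5-NBRCHANGE: EIGRP-IPv4 1: Neighbor 10.1.1.2 (Gi0/0) is up: new adjacency",
    "Mar  3 08:00:00 acc7 %LINK-3-UPDOWN: Interface Gi1/0/5, changed state to up",
]

if __name__ == "__main__":
    for tag, n in repeated(summarize(SAMPLE), threshold=2):
        print(n, tag)
```

Since `summarize` accepts any iterable of lines, an open log file can be passed in directly. The threshold is the knob that separates repeated happenings from one-time blips.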

Another item that comes to mind is using Cisco debug when piloting something new. A while back I was working in the lab with an IPsec connection problem, and debug showed me that the endpoint was trying all sorts of combinations of policy and IPsec parameters looking for a match. How many of us settle for “I got it to connect” instead of “it is connecting in the most efficient way”? That’s the great thing about Cisco debug: it shows us what we could not otherwise see.

There’s another little tidbit I’ve tucked away, based on talking to our VMware/server/UCS team. If your storage can’t keep up, the server or VM will get slow. If a slow application is reported, particularly with, say, time of day sensitivity, check network traffic, CPU and memory, and free disk space, but also check your SAN IOPS. Are other servers using the same storage also running slow?

In the QoS arena, I’m a fan of monitoring. First you need a QoS design, in writing, and you need to know what traffic levels you’re expecting for your various classes. You can then monitor (via CBQOS MIB data, NetFlow, or another data source) what traffic levels your network actually has. There is a prerequisite for that: you need to have consistently and carefully deployed your QoS policy end-to-end. If there are gaps, you’ve got bigger problems than monitoring and traffic levels.
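Comparing the written design against measured traffic can be sketched in a few lines of Python. This is only an illustration: the class names, percentages, and measured rates below are hypothetical, and in practice the measured numbers would come from whatever CBQOS MIB or NetFlow collection you already have.

```python
def over_budget(link_bps, design_pct, measured_bps):
    """Return classes whose measured rate exceeds their designed share.

    design_pct:   class name -> percent of the link allotted in the QoS design
    measured_bps: class name -> observed rate in bits per second
    Result:       class name -> (measured rate, designed budget), both in bps
    """
    findings = {}
    for cls, pct in design_pct.items():
        budget = link_bps * pct / 100
        actual = measured_bps.get(cls, 0)
        if actual > budget:
            findings[cls] = (actual, budget)
    return findings


# Hypothetical design and measurements for a 100 Mbps link.
DESIGN = {"voice": 10, "video": 20, "best-effort": 70}
MEASURED = {"voice": 4_000_000, "video": 26_000_000, "best-effort": 30_000_000}

if __name__ == "__main__":
    for cls, (actual, budget) in over_budget(100_000_000, DESIGN, MEASURED).items():
        print(f"{cls}: measured {actual} bps exceeds designed {budget:.0f} bps")
```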

The big one to think about is “Call Admission Control” for voice and video traffic. The point is that with voice or video, someone should think about how much there might be and how that relates to the WAN and LAN bandwidth available (with QoS percentage applied to the respective voice and video classes). Better yet, your Call Server should be doing Call Admission, that is, denying calls or video that would exceed the amount of traffic the network can handle. The idea of Admission Control is to not allow calls that will put your bandwidth “over the top” and degrade other calls, like the old “the system is busy” message we used to get from the landline phone system. With cell phones, your call just doesn’t go through.
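The core idea of Admission Control fits in a few lines. The following Python sketch is a toy model, not how any particular Call Server implements it; the class name, method names, and bandwidth figures are assumptions made for illustration.

```python
class AdmissionControl:
    """Admit calls only while the class bandwidth budget has room."""

    def __init__(self, budget_kbps):
        self.budget_kbps = budget_kbps
        self.in_use_kbps = 0

    def request(self, call_kbps):
        """Reserve bandwidth and return True, or return False to deny the call."""
        if self.in_use_kbps + call_kbps > self.budget_kbps:
            return False  # admitting this call would degrade the others
        self.in_use_kbps += call_kbps
        return True

    def release(self, call_kbps):
        """Give bandwidth back when a call ends."""
        self.in_use_kbps = max(0, self.in_use_kbps - call_kbps)


if __name__ == "__main__":
    cac = AdmissionControl(budget_kbps=200)   # hypothetical voice budget
    for attempt in range(3):
        print(attempt, cac.request(call_kbps=80))  # hypothetical per-call rate
```

The denied caller gets the equivalent of a busy signal, while the calls already in progress keep their quality.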

When consulting on QoS, I usually present the topic of Admission Control as a spectrum or scale running from zero to 10.

  • Position zero on that scale is, of course, doing nothing. That may work if you have tons of bandwidth. In that case, your company clearly has too much money and you deserve a raise!
  • Position 1 on the scale is where you calculate something like the number of users multiplied by the per-call bandwidth including Layer 2 overhead, and then add 10% (multiply by 1.1). You then make sure your QoS voice and video percentages provide at least that much bandwidth. The extra 10% covers calls on hold, etc. If you do that and then monitor and manage bandwidth, you may well be OK. It’s quick and sloppy, but may be adequate. Generally, most people have at most one call in progress at any given time. A few may have calls on hold, etc.
  • Position 10 on my scale involves full coordination between Admission Control settings in the Call Server and the QoS settings in the network equipment, possibly factoring in key Busy Hour or other information. Unless a site is a Call Center, it is likely that only a fraction of the phones are in use. Doing what this requires takes some real work, plus configuring your Call Server with the appropriate bandwidth numbers. There are aspects of the way most Call Servers handle this that I’m not too impressed with, depending on your network topology.
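The Position 1 arithmetic can be sketched as follows. The codec numbers are assumptions for illustration: G.711 with 20 ms packetization (160 bytes of payload, 50 packets per second), 40 bytes of IP/UDP/RTP headers, and 18 bytes of Ethernet overhead. The user count and link size in the usage lines are hypothetical too.

```python
def per_call_kbps(payload_bytes=160, l3_header_bytes=40, l2_overhead_bytes=18, pps=50):
    """Per-call bandwidth in kbps, including Layer 2 overhead.

    Defaults assume G.711 at 20 ms packetization over Ethernet.
    """
    return (payload_bytes + l3_header_bytes + l2_overhead_bytes) * 8 * pps / 1000


def position_one_kbps(users, call_kbps, headroom=0.10):
    """Users times per-call bandwidth, plus 10% for calls on hold, etc."""
    return users * call_kbps * (1 + headroom)


def required_percent(users, call_kbps, link_kbps, headroom=0.10):
    """Minimum QoS percentage the voice class needs on a given link."""
    return 100 * position_one_kbps(users, call_kbps, headroom) / link_kbps


if __name__ == "__main__":
    rate = per_call_kbps()
    need = position_one_kbps(users=100, call_kbps=rate)
    share = required_percent(users=100, call_kbps=rate, link_kbps=100_000)
    print(f"{rate:.1f} kbps per call, {need:.0f} kbps total, {share:.1f}% of link")
```

Under these assumptions, 100 users need roughly 9.6 Mbps, so the voice class on a 100 Mbps link would need a little under 10%.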

There’s a lot more to be said on this topic, which I’ll have to leave for another blog post. The key point for now is, have you thought about having too much voice or video, and what are you doing about it? Yes, with QoS, voice and video probably only eat part of your network, but it is a highly visible piece. See also our Terry Slattery’s presentation, How to Keep Video from Blowing Up Your Network, which covers many forms of video, some of which would not be subject to Admission Control.

While on the topic of voice/phone calls, many sites are shifting to SIP trunking. If you do that and have a Call Center operation, your voice team does need to think about tracking and managing high call queue volumes. We’ve worked with a site where a deadline triggered days with about two hours of call queue and 700 Mbps of Music On Hold, all running over a main site’s 1 Gbps MPLS link and SIP trunk. That level of MPLS traffic would likely impact your other applications, while irritating your customers at the same time. Politely dropping calls earlier might be useful, as might web-based chat and other low-bandwidth scaling solutions for customer support.
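To get a feel for how queue depth turns into bandwidth, here is a back-of-the-envelope Python sketch. It assumes each held caller receives a separate stream at about 80 kbps (G.711 plus IP/UDP/RTP headers, before Layer 2 overhead). The codec and the number of queued calls at that site are not known, so the figures below are purely illustrative.

```python
def hold_queue_load(queued_calls, per_call_kbps, link_mbps):
    """Return (load in Mbps, fraction of the link) for callers on hold."""
    load_mbps = queued_calls * per_call_kbps / 1000
    return load_mbps, load_mbps / link_mbps


def max_queue_depth(link_mbps, per_call_kbps, max_share):
    """Largest queue whose hold audio stays within max_share of the link."""
    return int(link_mbps * 1000 * max_share // per_call_kbps)


if __name__ == "__main__":
    # Hypothetical queue depth chosen to show how 700 Mbps could arise.
    load, share = hold_queue_load(queued_calls=8750, per_call_kbps=80, link_mbps=1000)
    print(f"{load:.0f} Mbps of hold audio, {share:.0%} of the link")
    # Hypothetical policy: cap hold audio at 25% of the link.
    print(max_queue_depth(link_mbps=1000, per_call_kbps=80, max_share=0.25))
```

A cap like the one computed by `max_queue_depth` is one way to decide when to start politely turning callers away or steering them to chat.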

To sum up this series of two blogs: It pays to think about what you need to know and how to get your devices and your management software to provide the necessary information. It also pays to think about what’s “going on under the hood,” to understand how too much of a good thing might impact you, or what the hidden limitations of chips, network devices, backplanes, stacks of gear, etc. might actually be.


Comments are welcome, whether in agreement or in constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!

Twitter: @pjwelcher

Disclosure Statement
Cisco Certified 20 Years, Cisco Champion 2016

2 responses to “Network Monitoring: Top Performance Items to Watch, Part 2”

  1. Nice one Peter. Seems Cisco centric though. A followup article with telemetry analysis techniques and also other vendors would be great.

  2. In the monitoring space there are a lot of challenges. In my opinion, here are the 3 main challenges.

    1) Set of tools:
    Sometimes you don’t have the appropriate tools to achieve what you need, or you have an overlap of many tools that do the same thing.

    2) Threshold definition:
    One of the most difficult areas in monitoring. What is good for my application might not be good for your application. Complexity of custom monitors.
    E.g.: Business says that a transaction must run in 1 sec, but it was never validated before.

    3) What to Monitor:
    More is less, if you want to monitor everything you will probably monitor nothing. Less is Less!

    Also, in my opinion this is what needs to be monitored.

    BFD Problem
    BGP Peer Not Established
    Chassis Temperature
    Device Average CPU Utilization
    Device Fan Failure
    FR DLCI Link Down
    Module Problem (Minor/Major/Down)
    Network Outage
    Port Error Disable
    Port Inbound Fault High (Packet Corruption)
    Port Outbound Fault High (Transmit Errors)
    Port Status
    Syslog Events
