For many years now, I’ve been watching Network Management (NM) products and what their vendors do. I’d like to share some things to look for in a good-quality network management product, a small amount of cynicism, and a little bit of vision as to where network management products need to go. That destination, by the way, is not the one you might infer from the title.
The cynical observation is that some, maybe even many, NM products are built with eye candy so as to sell. If you look closely, you can even see trends in the eye candy. This year and last, it seemed to be NetFlow displays (and SD-WAN displays) with colorful pie charts. The message was, “You can see your application traffic mix on the link.”
Yes, that’s cool. My question: What is actionable about it? You can’t play whack-a-mole, tuning your traffic daily. So the information might be useful for spotting bandwidth-hogging applications, those that belong in a QoS Scavenger class. Once you’ve dealt with those, what’s left? If the graphs support it, it might be interesting to see what is increasing or shrinking over time. How many interfaces and specific applications are you going to have time to look at? In how much detail?
Conclusion: The pie charts are mildly and occasionally useful. The app mix breakouts, more so. If I can trend an app on an interface, that’s got some value. But maybe not of everyday value to me?
That leads to the vision part. Actionable information is part of it. Conserving the administrator’s time is the other.
Are you interested in loading SNMP MIBs and then picking variables to poll and manually setting thresholds and alerts? If so, you must have a lot of time on your hands. Most networking people don’t. So why, after 30 years, do we still have to do that? Why can’t the NM tool say, “That’s a Cisco switch,” and go get the right info, set thresholds, and tell us what we need to know right out of the box? Why are tools still being stingy about polling for data and storing it?
Yes, tools do some things out of the box now. But most are still way too “fiddly.”
As for actionable information, I recently took another look at the RRD-based tools, some of which I like. The price is right, and they provide visibility.
Okay, there may be some time lost fiddling to get them working and tuned. However, they are all graph-based. How do I rapidly figure out which few of several thousand interfaces are having utilization problems? I don’t want to look at the interface graphs one by one; I need some sort of summary table. It should probably show me the 95th percentile-and-above utilization over a specifiable period of time. For the large view, I’ve been using the Interface Performance table in NetMRI. It allows me to sort on columns by max in or out percent utilization, broadcast percent, error percent, or discard percent. That gives me a good quick read on what I need to know about performance. It’s not even the main purpose of NetMRI. Well done, NetMRI!
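A summary table of that kind is not hard to build once the samples are out of the graphs. Here is a minimal Python sketch, not a description of how NetMRI or any RRD-based tool works. The function names, the (in, out) sample format, and the nearest-rank percentile method are all assumptions made for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def top_interfaces(util_by_interface, pct=95, limit=10):
    """Rank interfaces by the worse of their in/out percentile utilization.

    util_by_interface maps an interface name to a list of
    (in_percent, out_percent) samples (a hypothetical input format).
    In and out stay in separate columns; only the sort key uses the max.
    """
    rows = []
    for name, samples in util_by_interface.items():
        p_in = percentile([s[0] for s in samples], pct)
        p_out = percentile([s[1] for s in samples], pct)
        rows.append((name, p_in, p_out))
    rows.sort(key=lambda row: max(row[1], row[2]), reverse=True)
    return rows[:limit]
```

Calling `top_interfaces(data, pct=95, limit=20)` on a few thousand interfaces returns the twenty worth looking at first, which is the quick read the graphs alone do not give.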
My point about actionable information is that having some data is great, but what really helps is finding anomalies, i.e. extreme values. Having thresholds and alerts that tell you about extreme conditions also helps — preferably ones that persist for some period, since one-time, short-lived anomalies are not so useful. Who has the time to track such “blips” down, to try to figure out why they blipped (so to speak)?
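One way to ignore blips is to alert only when a value stays over the threshold for several polls in a row. The sketch below assumes evenly spaced samples; the function name and the default of three consecutive polls are hypothetical choices, not taken from any product.

```python
def persistent_breaches(samples, threshold, min_consecutive=3):
    """Return (start_index, length) for each run of samples above
    threshold that lasts at least min_consecutive polls."""
    runs = []
    start = None  # index where the current run began, if any
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - start))
            start = None
    # A run may still be open when the data ends.
    if start is not None and len(samples) - start >= min_consecutive:
        runs.append((start, len(samples) - start))
    return runs
```

With 5-minute polling, `min_consecutive=3` means a condition has to last about 15 minutes before anyone hears about it.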
Most products currently report the data you ask for. For example, you might be able to find out a lot about a given user WLAN connection. That strikes me as backward. What I want is instead to find out about user WLAN connections with problems, sorted by seriousness. Tell me what I need to know. Don’t make me go looking for the needle in the haystack!
Think Internet of Things. The industry, and we, too, need to start thinking about managing a seriously large number of things. How do we scale that up?
Seven Deadly Sins
To put a positive spin on this, you might think of this as how to be an informed consumer. I titled these items “sins” because NM vendors keep making these mistakes over and over.
Here’s the deadly seven:
- The product does not automatically discover the network and continue doing so. The tool should find new devices; the admin shouldn’t have to add them. Make good use of the admin’s time!
- The NM tool has high per-interface licensing costs, so users do not manage all active interfaces (and/or disable auto-discovery). I want to know about all of the interfaces when I’m hunting problems or being proactive. There’s a user failing lurking here: namely, managing only “important” interfaces, and missing key problems by ignoring others.
- Reporting numbers instead of percentages. Yes, big numbers look scary. But if the huge number of errors is less than 0.001% of all packets, it’s not as big a problem as a slower interface with 1% errors. Percentages generally help us determine significance of the data — actionable or not.
- Combining in and out utilization. Just don’t. Especially if you don’t tell the user what you’re doing. The problem: When you see 100%, is that 50% in + 50% out? Or 99% in and 1% out? The former is OK, the latter a congested link. If you can’t readily tell the two apart, that’s bad. Showing the greater of the two is a bit better, but hides other issues, such as 90% in and 1% out utilization. As soon as you combine the in and the out, you lose critical information.
- Not reporting high broadcast percentages (in and out separately, of course). But you need to factor in that the MIB-2 variable is really non-unicast traffic (multicast + broadcast). What I’ve noticed in tools that report the data properly is that STP-blocked ports show a high percent of “broadcasts,” but only low bps or kbps of actual traffic. So the ideal vendor report would leave out apparently high broadcast percentages on low-traffic interfaces. With data from one product, I export to Excel, sort, and delete those rows by hand. I shouldn’t have to. Build intelligence in!
- Not reporting error and discard percentages on interfaces (in and out separately). Errors indicate bad cabling, dirty fiber optics, etc. And discards are the sign that an interface is over-tasked, or the device or ASIC chip is overloaded. Both are good to know about. Most products ignore them, or report the number or number per second. Those don’t tell me anything very useful. It’s the percentages that matter! And please (yes, you, HP OpenView!), packets per second are well nigh useless.
- Not providing good export of the data. I’ve been in situations such as capturing quarterly data for trend analysis and capacity planning, where I had to manually estimate the y-axis value and type a number into my spreadsheet. That precluded just taking the hourly data and running averaging and percentile functions on it within the spreadsheet, limiting me to just a few data points. They were probably good enough anyway, but the manual process just felt rather lacking.
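Several of the items above (percentages rather than raw counts, in and out kept separate, broadcast noise on idle ports) come down to the same small calculation. Here is a minimal Python sketch of it. The class and field names are hypothetical, `packets` is assumed to be every packet seen in that direction however your tool derives it, and the two cut-off values for flagging non-unicast traffic are placeholders, not recommendations.

```python
from dataclasses import dataclass


@dataclass
class DirectionDeltas:
    """Counter changes for ONE direction over one polling interval."""
    octets: int
    packets: int      # assumed: all packets seen in this direction
    nonunicast: int   # multicast + broadcast packets
    errors: int
    discards: int


def pct(part, whole):
    """Percentage of part in whole; 0.0 when there was no traffic."""
    return 100.0 * part / whole if whole else 0.0


def direction_report(d, speed_bps, interval_s,
                     min_bps=100_000.0, min_nonunicast_pct=20.0):
    """Percent figures for one direction of one interface."""
    bps = d.octets * 8 / interval_s
    nonunicast_pct = pct(d.nonunicast, d.packets)
    return {
        "bps": bps,
        "util_pct": pct(bps, speed_bps),
        "error_pct": pct(d.errors, d.packets),
        "discard_pct": pct(d.discards, d.packets),
        "nonunicast_pct": nonunicast_pct,
        # Only flag a high non-unicast share when the interface is
        # carrying real traffic, so nearly idle ports drop out.
        "nonunicast_flag": (nonunicast_pct >= min_nonunicast_pct
                            and bps >= min_bps),
    }


def interface_report(inbound, outbound, speed_bps, interval_s):
    """Keep in and out separate; never add or average the two."""
    return {
        "in": direction_report(inbound, speed_bps, interval_s),
        "out": direction_report(outbound, speed_bps, interval_s),
    }
```

A port that carried 500 packets in five minutes, all of them non-unicast, reports 100% non-unicast but is not flagged, because its bit rate is below `min_bps`. That is the filtering I currently do by hand in a spreadsheet.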
There’s a user tip or even sin lurking here, too. The biggest problem I see with user interpretation of SNMP data is forgetting that it is inherently averaged. That usually shows up when traffic gets averaged with a lot of zeroes. The result is that you see a graph or number that is far lower than the peaks of actual traffic.
An NM tool typically gets a counter reading every so often. If you divide the amount the counter changed by the polling interval, you get bits per second or whatever. But that is an average. The reality is that actual traffic was both above and below the average.
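The counter arithmetic can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function name is hypothetical, the counter is assumed to have wrapped at most once between polls, and a device reboot (which resets counters) is not detected.

```python
def rate_bps(prev_octets, curr_octets, interval_s, counter_bits=64):
    """Average bits per second between two octet-counter readings.

    Assumes at most one counter wrap between the two polls.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    modulus = 2 ** counter_bits
    # Modular subtraction handles a counter that wrapped past its maximum.
    delta = (curr_octets - prev_octets) % modulus
    return delta * 8 / interval_s
```

To see how much averaging hides, take a 1 Gbps link that runs flat out for 3 seconds and sits idle for the other 297 seconds of a 5-minute poll. It moves 375,000,000 octets, so `rate_bps(0, 375_000_000, 300)` comes out at 10 Mbps, which is 1% utilization on a link that was saturated for those 3 seconds.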
When you’re doing capacity planning, averages may not be what you want. Instantaneous traffic bursts (“microbursts”) and dropped VoIP, video, or VDI traffic may be important to your users. I’ll spare you more detail.
Thanks for sticking with me this far. We, the user community, need to work together to tell NM vendors what we need. My discussion above contains some of the things on my wish list.
Are “classic NM tools” useless? No, they give visibility, and they can help us eliminate obvious problems (congestion, errors, etc.) easily, if we manage all interfaces by percentages and fix problem interfaces.
Do they help us find the problem? Yes, they often help when the problem is an outage. Not so much when it is a “performance brownout.” (See my earlier and future blogs.)
Do they help us measure whether a change helps with a performance problem? Not really. That’s where tools for reporting on User Experience (UX) might help. But that’s a topic for another blog.
Comments are welcome, whether in agreement or informative disagreement with the above, and especially good questions to ask the NFD9 vendors! Thanks in advance!