I just returned from a week at CiscoLive 2011. The show floor had a lot of network management products on display, making it an easy place to walk around and investigate various products. I did an informal product survey of functions that are important to a good NMS. Without going into the details of what I found for each vendor, I thought it would be worth describing the key factors that I investigated.
Alerting is at the top of my list of key functions. A network with more than a few devices can generate a lot of data each day. It is important that the NMS sort through the data and report on exceptions. I call this “management by exception” because it is managing those things that appear as exceptions to normal operation. The alerting page should allow exceptions to be easily sorted by various criteria, such as severity, timestamp, description, and device or interface. I prefer to sort by severity, so that the most critical alerts are at the top of the action list. I was surprised at the number of products that didn’t provide a clear view of alerts. My time is too important to spend looking through all the data collected by an NMS product. I need the product to identify potential problems and alert me so that I can investigate.
Most vendors provided a way to manage alerts and generate emails when an alert is created. Some vendors didn’t provide good sorting mechanisms, often showing the most recent alerts while older alerts scrolled off the display. Other vendors mixed problem alerts and “it’s corrected” alerts together in the same list. What I’d like to have is a display of the currently active alerts. If something is corrected, simply remove the problem alert from the list (moving it to an alert archive).
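A minimal sketch of the active-alert view described above may make the idea concrete. It is not taken from any product on the show floor; the `Alert` and `AlertBoard` names, the four-level severity scale, and the device names are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical severity scale: a lower rank means a more critical alert.
SEVERITY_RANK = {"critical": 0, "major": 1, "minor": 2, "info": 3}


@dataclass
class Alert:
    device: str
    interface: str
    severity: str
    description: str
    raised_at: datetime


class AlertBoard:
    """Keeps only currently active alerts; cleared alerts move to an archive."""

    def __init__(self):
        self.active = {}    # keyed by (device, interface, description)
        self.archive = []   # cleared alerts, kept for later review

    def raise_alert(self, alert):
        key = (alert.device, alert.interface, alert.description)
        self.active[key] = alert

    def clear_alert(self, device, interface, description):
        # A "corrected" event removes the problem alert instead of
        # adding a second entry to the display.
        alert = self.active.pop((device, interface, description), None)
        if alert is not None:
            self.archive.append(alert)

    def action_list(self):
        # Most critical first; within a severity, most recent first.
        return sorted(
            self.active.values(),
            key=lambda a: (SEVERITY_RANK[a.severity], -a.raised_at.timestamp()),
        )


board = AlertBoard()
board.raise_alert(Alert("edge-sw9", "Fa0/12", "minor", "CRC errors",
                        datetime(2011, 7, 15, 9, 0)))
board.raise_alert(Alert("core-rtr1", "Gi0/0", "critical", "link down",
                        datetime(2011, 7, 15, 9, 5)))
board.raise_alert(Alert("dist-sw2", "Gi1/1", "major", "high utilization",
                        datetime(2011, 7, 15, 9, 10)))
board.clear_alert("edge-sw9", "Fa0/12", "CRC errors")

for alert in board.action_list():
    print(alert.severity, alert.device, alert.interface, alert.description)
```

Keying the active list by device, interface, and description is one simple way to let a later "corrected" event find and retire the matching problem alert; a real product would likely use its own event identifiers.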
Performance Data Collection
Collecting and displaying network performance data is next on my list. It doesn’t help if the system can’t collect network operational data at a rate that allows it to provide useful information. Can the system be configured to collect data from most interfaces at a low frequency, say every 10-15 minutes, while collecting data from important interfaces at a higher frequency, say every 1 or 5 minutes? There may even be a need for one or two interfaces to be monitored in real time, where the data is collected and graphed in 1-second intervals. This latter approach is very useful for showing small packet bursts, which may reveal periodic traffic surges that correlate with a protocol or application timer. I’ve used this before to show that debug was left enabled on a router, because traffic periodically stopped while the debug process dumped a chunk of data to the console.
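The tiered polling described above can be sketched as a default interval plus per-interface overrides. This is only an illustration under assumed names: the device and interface names, the `OVERRIDES` table, and the helper functions are hypothetical and do not reflect any vendor's configuration format.

```python
# Default for most interfaces: poll every 10 minutes.
DEFAULT_INTERVAL_S = 600

# Hypothetical per-interface overrides, in seconds.
OVERRIDES = {
    ("core-rtr1", "Gi0/0"): 60,   # important uplink: every minute
    ("core-rtr1", "Gi0/1"): 1,    # under investigation: every second
}


def poll_interval(device, interface):
    """Return the polling interval in seconds for one interface."""
    return OVERRIDES.get((device, interface), DEFAULT_INTERVAL_S)


def interfaces_due(last_polled, now):
    """Return the interfaces whose polling interval has elapsed.

    last_polled maps (device, interface) to the time of the last poll,
    expressed in seconds on the same clock as `now`.
    """
    return [
        key for key, polled_at in last_polled.items()
        if now - polled_at >= poll_interval(*key)
    ]


def polls_per_day(interval_s):
    """Number of polls per interface per day at a given interval."""
    return 86400 // interval_s


print(polls_per_day(DEFAULT_INTERVAL_S))            # 144
print(polls_per_day(poll_interval("core-rtr1", "Gi0/1")))  # 86400
```

The last two lines show why 1-second polling is normally reserved for one or two interfaces: it produces 600 times as many samples per interface as the 10-minute default.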
The presentation of the collected data is very important to its usability. I investigated how interface performance data is collected and displayed, which gives me a good idea of the usability of the product. Can I look back at historical data? Is historical data rolled up after some time period, and if it is rolled up, does the algorithm cause the data peaks to be obscured? Note that averaging samples over an interval causes peaks to be averaged out. The peaks are often what is needed for capacity planning or to investigate network congestion. Can I set up some interfaces for faster polling when I’m troubleshooting a problem?
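A small worked example shows how averaging hides a peak, and how a rollup that stores the maximum alongside the average preserves it. The sample values and the `rollup` function are made up for illustration; they are not any product's rollup algorithm.

```python
def rollup(samples, bucket_size):
    """Roll consecutive samples up into buckets, keeping average and peak."""
    buckets = []
    for start in range(0, len(samples), bucket_size):
        chunk = samples[start:start + bucket_size]
        buckets.append({"avg": sum(chunk) / len(chunk), "max": max(chunk)})
    return buckets


# Hypothetical interface utilization (percent), one sample per minute.
# One minute of near-saturation sits among otherwise quiet samples.
samples = [10, 12, 11, 95, 10, 12]

print(rollup(samples, 6))   # [{'avg': 25.0, 'max': 95}]
```

Looking only at the average, the interface appears to run at 25 percent; the stored maximum shows that it briefly reached 95 percent, which is the number that matters for capacity planning.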
The products I saw ranged from those that keep all the data all the time to products that roll up the data daily. Almost every vendor had a way to poll some number of interfaces at a higher frequency, but some could not poll at 1-second intervals.
I also like to investigate the NMS system architecture and scaling, because the architecture will often indicate how the system must be structured for a good implementation. Knowing the architecture also tells me how many servers will be needed and where they may need to be deployed. If multiple collectors are needed, will the system automatically divide the workload, or do I need to configure each collector? Is a separate database server required and where should it be located, relative to the location of the collectors? I investigate how well the system scales up to some number of interfaces and devices, which I typically put at about 100,000.
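To make the collector questions concrete, here is a rough sizing sketch. It assumes a per-collector capacity that you would have to get from the vendor, and it uses a simple greedy assignment (largest device first, to the least-loaded collector) as one possible way a system could divide the workload automatically. The function names, capacity figure, and device names are hypothetical.

```python
import heapq
import math


def collectors_needed(total_interfaces, per_collector_capacity):
    """Minimum number of collectors for a given per-collector capacity."""
    return math.ceil(total_interfaces / per_collector_capacity)


def assign_devices(interfaces_per_device, n_collectors):
    """Greedy split: largest device first, to the least-loaded collector."""
    heap = [(0, idx) for idx in range(n_collectors)]   # (load, collector)
    heapq.heapify(heap)
    assignment = {idx: [] for idx in range(n_collectors)}
    by_size = sorted(interfaces_per_device.items(),
                     key=lambda item: item[1], reverse=True)
    for device, count in by_size:
        load, idx = heapq.heappop(heap)
        assignment[idx].append(device)
        heapq.heappush(heap, (load + count, idx))
    return assignment


# Assumed capacity of 30,000 interfaces per collector, for illustration only.
print(collectors_needed(100_000, 30_000))   # 4

devices = {"core-sw1": 48, "core-sw2": 48, "edge-sw1": 24, "edge-sw2": 24}
print(assign_devices(devices, 2))
```

A real deployment would also weigh collector placement, WAN latency to the devices, and the location of the database server, none of which this sketch models.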
One vendor’s staff became very excited when I asked about their ability to manage 100,000 interfaces. I’m sure they were thinking “hot prospect,” when all I wanted was to understand their ability to scale to handle a large enterprise network. My next question is about “managed objects” because some NMS systems consider each device and each interface to be an independent managed object. Other vendors only count the number of devices, so I like to understand which method of operation a product uses and how that impacts the deployment and cost.
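The difference between the two counting methods is easy to underestimate, so a small worked example may help. The inventory below is invented, and `managed_objects` is a hypothetical helper, not any vendor's licensing formula.

```python
def managed_objects(interfaces_per_device, count_interfaces):
    """Count managed objects under the two counting methods.

    count_interfaces=False: only devices are counted.
    count_interfaces=True: each device and each interface is counted.
    """
    devices = len(interfaces_per_device)
    if not count_interfaces:
        return devices
    return devices + sum(interfaces_per_device.values())


# Hypothetical inventory: two 48-port switches and a 4-interface router.
inventory = {"sw1": 48, "sw2": 48, "rtr1": 4}

print(managed_objects(inventory, count_interfaces=False))  # 3
print(managed_objects(inventory, count_interfaces=True))   # 103
```

The same three devices count as 3 objects under one method and 103 under the other, which is why it is worth asking which method a product uses before comparing scaling claims or license tiers.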
The products I reviewed scaled from 10,000 interfaces per poller to over 100,000 interfaces per poller, which is a pretty wide range of performance. I didn’t get into the details of implementation with any vendor to determine whether the architecture that was chosen would have significant problems at 100,000 interfaces. That step would be next if I were getting serious about any particular vendor’s products.
Device Viewer
The last thing I checked was the existence of a device viewer. The NMS collects a lot of information about devices and it is surprising to me how few products include a way to view the collected data. The device viewer should show any of the common things that you would normally use the CLI to collect. I find this an extremely useful feature because it puts all the data in one place. I don’t have to log in to multiple devices to see the data, and often the data is displayed in a way that’s cleaner than the CLI’s output. Let me give a few examples.
1. An L3 switch was sending syslog messages about a MAC address flapping between two interfaces. The NetMRI device viewer allowed me to quickly look at the interface operational parameters as well as look through the configuration and verify the switch port configurations. A little research with the customer showed that the attached host was a Linux system that had been configured for bonding (Microsoft’s term is ‘teaming’). The switch ports needed to be configured into an EtherChannel to match, but the instructions overlooked that fact.
2. One of NetMRI’s analysis rules is to check that the subnet masks are consistent on interfaces across the network. At a recent site, several subnets appeared in this issue. One subnet reported inconsistency among five devices. I opened the device viewer for each device that was reported and drilled into the interface addressing page, which shows the IP address and subnet mask of each addressed interface. It was easy to compare the masks of the interfaces that shared the subnet in question and determine that one of them had the incorrect mask.
3. Viewing device neighbors gives me a quick view of adjacent network devices, which is useful for troubleshooting connectivity problems. Is the data taking the path that I think it should? Is a trunk link to a neighboring device experiencing problems? Is there a problem with HSRP to a neighboring router? All these come from the device neighbors view.
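The subnet mask comparison in the second example can be sketched with Python's standard `ipaddress` module. This is an illustration of the general idea only, not NetMRI's analysis rule; the `find_mask_mismatches` function and the sample interface data are hypothetical. The sketch flags any two interfaces whose subnets overlap but are not identical, which is what happens when one interface on a shared subnet carries the wrong mask.

```python
import ipaddress


def find_mask_mismatches(interfaces):
    """Return pairs of interfaces whose subnets overlap but differ.

    interfaces is an iterable of (device, interface, "address/prefix").
    """
    parsed = [
        (device, ifname, ipaddress.ip_interface(cidr))
        for device, ifname, cidr in interfaces
    ]
    mismatches = []
    for i, (dev_a, if_a, a) in enumerate(parsed):
        for dev_b, if_b, b in parsed[i + 1:]:
            # Same subnet with the same mask gives identical networks.
            # Overlapping but different networks mean the masks disagree.
            if a.network != b.network and a.network.overlaps(b.network):
                mismatches.append(
                    ((dev_a, if_a, str(a)), (dev_b, if_b, str(b)))
                )
    return mismatches


# Hypothetical interface addressing collected from four devices.
collected = [
    ("sw1", "Vlan10", "10.1.10.1/24"),
    ("sw2", "Vlan10", "10.1.10.2/24"),
    ("sw3", "Vlan10", "10.1.10.3/25"),   # wrong mask
    ("rtr1", "Gi0/0", "192.0.2.1/30"),
]

for left, right in find_mask_mismatches(collected):
    print(left, "<->", right)
```

In this data, sw3 appears in every reported pair, which points to it as the device with the incorrect mask. The pairwise comparison is quadratic in the number of interfaces, so it is suited to a quick check rather than a network-wide scan.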
A good device viewer seems to be a rare feature. Only NetMRI seemed to have anything like it.
The result of my comparison was a mix of good news and bad news. Some features are available across all vendors, while other features that I like were completely missing in some products. My synopsis is that many NMS vendors focus on data collection with simple alerting based on thresholds. I’d like to see the industry embrace more useful functions that don’t drive the network staff person back to the CLI for collecting the data necessary to diagnose problems. Otherwise, we will never have products that scale to the size and complexity that networks currently have, much less the size and complexity that they will have in a few years.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html