I’ve been participating on several network management product forums for the past six months and have some observations that I’d like to share.
First, most organizations are focused on network performance monitoring. The forum questions tend to be around the areas of monitoring network link performance and server parameters (CPU, disk, memory). While these can be important metrics, they seem to get an overwhelming level of interest. Interestingly, it is seldom performance that causes a total network outage.
I think that performance is easy to understand at one level, so that’s where we start when trying to use an NMS product. However, properly reporting performance metrics is not simple. Managers ask about link utilization and typically don’t understand metrics such as the 95th percentile. They want to know “What’s the link utilization?” Well, when a packet is being transmitted, the link is at 100% utilization. When no packet is being transmitted, it is at 0% utilization. Data transmission is bursty, and the key is picking an appropriate averaging period and reporting the average utilization for that period.
If I pick a 5-minute period, I get a very different utilization figure than if I pick a 10- or 15-minute period. Consider a 5-minute transfer at 97% utilization, followed by 10 minutes of 0% utilization. A single 15-minute sample averages the burst away to roughly 32%, while 5-minute samples preserve the 97% peak, so the 95th percentile utilization is higher with the 5-minute samples. Averaging these samples over a day nearly always produces useless data (I wouldn’t even call it “information”), because the long periods of low utilization skew the results. A lot of NMS products roll up the periodic samples into hourly, and eventually daily, figures, essentially turning useful data into useless data.
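To make the effect concrete, here is a small sketch (the sample values are invented for illustration) showing how a burst that the hourly average hides remains visible in the 95th percentile, computed here with the common nearest-rank method:

```python
import math

def percentile_95(samples):
    """95th percentile using the nearest-rank method."""
    ranked = sorted(samples)
    rank = math.ceil(0.95 * len(ranked))  # 1-indexed nearest rank
    return ranked[rank - 1]

# Hypothetical 5-minute utilization samples (percent) for one hour:
# a 5-minute 97% burst followed by 10 idle minutes, repeated.
samples_5min = [97, 0, 0] * 4

hourly_average = sum(samples_5min) / len(samples_5min)
print(round(hourly_average, 1))     # 32.3 -- the burst is hidden
print(percentile_95(samples_5min))  # 97   -- the burst is visible
```

The same arithmetic explains the rollup problem: once those twelve samples are collapsed into one hourly average, the 97% bursts can never be recovered.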
We should get smart enough to demand that vendors provide 95th percentile figures for each day instead of averages. Keep the 95th percentile figures to allow us to do better forecasting and trending. If this happens, we’ll have to educate the managers about what the new figures represent. I like to explain the 95th percentile figure as a rough approximation of the minimum “busy hour” utilization of a day (it winds up being about 72 minutes if you examine the math behind the determination of the 95th percentile metric). Stated another way, if the link had a 95th percentile utilization figure of 65% for a given day, then the link was at least 65% busy for about 72 minutes of the day. The accuracy of the calculation varies slightly as the polling period changes.
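The 72-minute figure is simply 5% of a day’s 1,440 minutes, since the samples above the 95th percentile account for 5% of the polling intervals. A quick sketch of the arithmetic, which also shows why the figure shifts slightly with the polling period:

```python
minutes_per_day = 24 * 60  # 1440

# With 1-minute polling, the top 5% of samples cover exactly 72 minutes.
print(minutes_per_day * 5 // 100)  # 72

# Coarser polling truncates the top-5% sample count, so the busy time
# attributable to the 95th percentile drifts with the polling period.
for poll_minutes in (1, 5, 15):
    n_samples = minutes_per_day // poll_minutes
    top_samples = n_samples * 5 // 100
    print(poll_minutes, top_samples * poll_minutes)  # 1 -> 72, 5 -> 70, 15 -> 60
```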
Try to get the 95th percentile figure into a dashboard or daily report. You’ll find that it is not very accessible in most network management platforms. The average and peak figures are always available, but because they get rolled up to longer intervals after a few days, their utility is greatly diminished once the rollup occurs. I recently spent some time trying to build a rolling 95th percentile chart and finally gave up. I wanted to chart the 95th percentile figure every hour for the trailing 24 hours: does the figure fall during the quiet parts of the day and climb during the busy parts? It would be interesting to try different trailing windows too, perhaps the trailing 12 or 8 hours.
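A rolling 95th percentile of this kind is straightforward to compute from raw samples, which makes its absence from the platforms all the more frustrating. A minimal sketch, assuming 5-minute polling and the nearest-rank percentile method:

```python
import math
from collections import deque

def percentile_95(values):
    ranked = sorted(values)
    return ranked[math.ceil(0.95 * len(ranked)) - 1]

def rolling_p95(samples, window):
    """Yield the 95th percentile over a trailing window of samples."""
    trailing = deque(maxlen=window)  # old samples fall off automatically
    for sample in samples:
        trailing.append(sample)
        if len(trailing) == window:
            yield percentile_95(trailing)

# A trailing 24 hours at 5-minute polling is 288 samples; to plot one
# point per hour, keep every 12th value that rolling_p95 yields.
```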
So we’re focused on performance, but haven’t really spent the time to understand which performance figures to trust and which figures to distrust and why.
But performance is only part of the story. Most network outages are due to poor network design. For example, I’ve done a number of network assessments where a layer 2 spanning tree was extended between two data centers. As I’ve written before (Spanning Layer 2 Between Data Centers), this isn’t a good idea. A spanning-tree loop will take out a major piece of an organization’s network, which is much more serious than poor performance on a link. Because spanning-tree loop outages are rarer, we tend to ignore the bad design and focus instead on link performance. The result is that network management tools don’t tend to let us look at the data that would reveal design problems.
A simpler example, and one that some products are now addressing, is duplex mismatch. A product I’ve been using simply looks at overall errors. However, a duplex mismatch produces some very specific errors, depending on the interface setting. If the interface is running half-duplex, it will occasionally have a collision. Some vendors treat this as an error. It is an error of sorts, but it is expected on a half-duplex interface. What isn’t expected are late collisions, which are often seen on a mismatched link when the local interface is running half-duplex (note: full-duplex links never experience collisions and should never increment this counter). Late collisions on a half-duplex interface are most frequently an indicator of a duplex mismatch. (Other causes, such as a cable that’s too long or a spanning tree with too many hops, are much less frequent.)
On the other side of the mismatch, the full-duplex interface will experience FCS and CRC errors, because the half-duplex remote end stops transmitting a frame when it detects a collision. Poor cabling, bad connectors, or dirty optics can also cause these errors, but much less frequently than a duplex mismatch.
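The symptoms above can be turned into a simple per-interface check. This is only a sketch of the heuristic; the function and field names are illustrative, not any particular vendor’s MIB objects or CLI output:

```python
def duplex_mismatch_suspected(duplex, late_collisions, fcs_errors):
    """Flag an interface whose error counters fit the mismatch pattern."""
    if duplex == "half":
        # Ordinary collisions are normal at half duplex; late collisions
        # almost always mean the far end is running full duplex.
        return late_collisions > 0
    if duplex == "full":
        # A full-duplex port never collides itself, but FCS/CRC errors
        # accumulate when the half-duplex far end aborts frames.
        # (Cabling or optics problems can also drive this counter.)
        return fcs_errors > 0
    return False

print(duplex_mismatch_suspected("half", late_collisions=3, fcs_errors=0))  # True
```

Checking the specific counter for the specific duplex setting, rather than an overall error total, is exactly the distinction the products I’ve used fail to make.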
Switching to a more complex routing design problem, the NMS should be able to show me the devices that have a default route statically configured. I maintain that having too many sources of a default route is a recipe for a disaster, such as the one that I described last week (Network Configuration Management – Know when it is wrong) in which a default route was accidentally injected into the core of a network, creating a black hole route.
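An NMS already holds the device configurations, so even a crude scan would surface this. A hypothetical sketch, assuming Cisco IOS syntax and a mapping of device names to saved configs (the device names and configs here are invented):

```python
import re

# Matches an IOS static default route at the start of a config line.
DEFAULT_ROUTE = re.compile(r"^ip route 0\.0\.0\.0 0\.0\.0\.0\s", re.MULTILINE)

def devices_with_static_default(configs):
    """configs: mapping of device name -> running-config text."""
    return sorted(name for name, text in configs.items()
                  if DEFAULT_ROUTE.search(text))

# Example with invented configs:
configs = {
    "core1": "hostname core1\nip route 0.0.0.0 0.0.0.0 10.1.1.1\n",
    "edge1": "hostname edge1\nip route 10.2.0.0 255.255.0.0 10.1.1.2\n",
}
print(devices_with_static_default(configs))  # ['core1']
```

A report like this, reviewed periodically, would show at a glance how many sources of a default route exist in the network.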
Similarly, in an OSPF network, show me the areas and the ABRs that interconnect them. Do I have the desired level of redundancy in the network? If I’m running an MPLS network, where do the MPLS VPNs extend? Are they carrying only the routes that I want them to have? These are tough problems to solve in large networks. These types of problems require some pretty deep thinking to design a useful solution, but the payoff is in being able to significantly improve our network designs and improve network uptime.
NMS vendors need to start using the data they collect to identify network design deficiencies. They won’t have an incentive to take this step until we network engineers begin asking for such features and voting with our purchasing dollars for products that provide visibility into network design problems.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html