Almost all network management products that collect performance data incorporate some type of rollup as the data ages. In most cases, the data over longer time intervals is averaged, and the resulting average value is what appears in the performance graphs. I find that the average value is seldom what I want. Particularly bad examples are where the data is averaged over several hours or over a day: almost all links have long periods of low utilization, which drags the average down and distorts the resulting data (and the graph that we tend to use to show it).
What do I want? My major interest in link performance data is typically the maximum values. A link that is running at less than its capacity is seldom a problem, so I'm not likely to be interested in its average value. I am more concerned with network utilization, congestion, and errors. Seeing the peak values allows me to see when a link needs to be upgraded, detect the potential for congestion, or spot bursts of errors. When I'm looking at long-term data, I need to see whether a link has been running near capacity during the busy period and how long it continues to run at that rate. A short burst at link capacity is acceptable, because TCP is designed to use the full path bandwidth. But if I see utilization running for long periods of time at near-capacity, I know that the link is a candidate for an upgrade to higher bandwidth. At a minimum, I know that the link needs to have QoS enabled so that important traffic gets priority over less important traffic.
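The burst-versus-sustained distinction above is easy to automate. Here is a minimal sketch; the 90% threshold, the 25% minimum fraction, and the sample data are illustrative assumptions, not values from the article:

```python
def sustained_high_utilization(samples, threshold=90.0, min_fraction=0.25):
    """Return True if at least min_fraction of the utilization samples
    (in percent) exceed threshold -- i.e., sustained load, not a TCP burst.
    Threshold and fraction are assumed values for illustration."""
    high = sum(1 for s in samples if s > threshold)
    return high / len(samples) >= min_fraction

# Hypothetical 5-minute utilization samples (percent) for two links:
bursty = [95, 98, 10, 12, 8, 15, 11, 9, 14, 10]    # one short burst
loaded = [95, 92, 96, 91, 94, 30, 25, 93, 97, 95]  # sustained near-capacity

sustained_high_utilization(bursty)  # False: only 20% of samples are high
sustained_high_utilization(loaded)  # True: 80% of samples exceed 90%
```

A real implementation would also consider consecutive high samples rather than just the overall fraction, but the idea is the same: a short burst should not flag a link for upgrade.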
Errors are particularly important not to average, because the absolute counts tend to be low, and averaging quickly trends them toward zero unless the link is particularly bad. I recently had a good example of this at a customer that had a set of fiber links, one of which exhibited some errors. It was showing about 50-100 errors per day. That's not a large number and wouldn't have a detrimental effect on the performance of flows taking that path. It stood out only because all the comparable links were running with zero errors. Replacing the patch cable on the affected interface eliminated the errors. (We're not sure whether it was the patch cable itself or dust/dirt on the ends of the fiber.)
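The averaging-toward-zero effect is easy to demonstrate. A sketch with assumed data: a link that is clean all day except for one bad hour:

```python
# Hypothetical hourly error counts for one day on a link that takes
# a single one-hour burst of errors.
hourly_errors = [0] * 24
hourly_errors[9] = 96  # 96 errors during the 09:00 hour

daily_average = sum(hourly_errors) / len(hourly_errors)
daily_peak = max(hourly_errors)

print(f"average: {daily_average} errors/hour")  # average: 4.0 errors/hour
print(f"peak:    {daily_peak} errors/hour")     # peak:    96 errors/hour
```

A rolled-up average of 4 errors/hour looks like background noise; the peak of 96 points straight at the hour worth investigating. Roll the same data up over a week and the average shrinks further while the peak stays put.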
My recommendation to NMS vendors is to store the peak values or the 95th percentile values for utilization data, and to store peak values for discards and errors. It should be easy to add an option to historical displays to select either average or peak data.
Regarding the storage of performance data, the ideal is to keep the raw performance data for historical analysis. Yes, the volume of data can be quite large, but the network operations team sometimes needs that data for trend analysis or to see whether a particular pattern has existed for a long period of time. Storing the data on a separate NAS makes a lot of sense, allowing me to archive as much data as I like. I can then go back to the data to answer the following questions:
- How long has the link utilization been exhibiting some characteristic (like nearly 100% load, or perhaps nearly 0% load)?
- What has been the typical network utilization during March Madness basketball season (or pick some other reason for people at work to be watching streaming video)?
- When did errors or discards start happening on a link?
- How frequently are errors or discards occurring on a link? Do the times correlate with other data?
- When will a link need to be upgraded (trend analysis)?
- Did the QoS implementation reduce drops in high priority traffic classes?
The NMS should provide a mechanism that allows me to easily answer the above questions.
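The trend-analysis question above ("when will a link need to be upgraded?") can be answered with a simple least-squares fit over historical peaks. A minimal sketch, with hypothetical monthly data and an assumed 80% upgrade threshold:

```python
# Hypothetical monthly busy-hour peak utilization (percent) for one link.
months = list(range(12))
peaks = [42, 44, 47, 49, 52, 55, 57, 60, 63, 65, 68, 71]

# Ordinary least-squares fit: peaks ~= slope * month + intercept
n = len(months)
mean_x = sum(months) / n
mean_y = sum(peaks) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(months, peaks))
         / sum((x - mean_x) ** 2 for x in months))
intercept = mean_y - slope * mean_x

# Project when the trend line crosses the (assumed) 80% upgrade threshold.
months_to_80 = (80 - intercept) / slope
print(f"~{slope:.1f} points/month; crosses 80% around month {months_to_80:.0f}")
```

On this data the link is gaining roughly 2.6 points of peak utilization per month and crosses 80% a few months past the end of the series, which is exactly the kind of lead time needed to budget and schedule an upgrade. Note that this only works if peak (not averaged) data was retained.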
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html.