Reachability SLA and SNMP Data Collection

Author
Terry Slattery
Principal Architect

Some network management systems track reachability by pinging network devices.  In fact, at a Networkers conference a few years ago, the Cisco IT team talked about using 5-second pings to measure network availability (what I prefer to call reachability) in the Cisco corporate network.

In addition to pings, most systems collect some sort of operational data, like interface utilization and error counters.  Since any successful communication with a network device implies that it is also reachable, SNMP data collection could be used instead of, or in addition to, pings.  If the NMS is collecting interface and device data every ten minutes, then each collection tells the NMS that the device was reachable at some point in the last ten minutes.  If the requests sent to each device were spread evenly across the ten-minute period instead of being sent in one burst, the uncertainty period could be made significantly smaller.  Additional instrumentation of the SNMP data collection process could also produce round trip time information.
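As a rough illustration of the spreading idea, the sketch below assumes that a device's collection is split into several separate requests and that they are spaced evenly over the polling period.  The function names and the 600 second default are my own choices for the example, not part of any particular NMS.

```python
def spread_offsets(num_requests: int, period_s: float = 600.0) -> list[float]:
    """Return start offsets (seconds) that space requests evenly over one period."""
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    spacing = period_s / num_requests
    return [i * spacing for i in range(num_requests)]


def uncertainty_window(num_requests: int, period_s: float = 600.0) -> float:
    """Longest gap (seconds) between consecutive contacts with the device."""
    if num_requests < 1:
        raise ValueError("num_requests must be at least 1")
    return period_s / num_requests


if __name__ == "__main__":
    # Five requests per ten-minute cycle: the device is contacted every 120 s.
    print(spread_offsets(5))      # [0.0, 120.0, 240.0, 360.0, 480.0]
    print(uncertainty_window(5))  # 120.0
```

With a single burst per cycle the window stays at the full ten minutes; with five evenly spaced requests it drops to two minutes for the same amount of traffic.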

There are a couple of advantages to using this approach.  The first is the reduction in overall network traffic, which admittedly may not be significant on high speed links.  The second advantage is that I’ve seen a number of cases where a device stops working but still responds to pings, because pings are handled at a very low level in the operating system.  SNMP requests, however, require more of the operating system to be running and therefore are a better measure of operational availability.  There is a disadvantage, which is that the SNMP data collection process needs more instrumentation to measure and record the round trip times and packet loss.  But this is simply moving the processing from a ping process to the SNMP data process.  So it isn’t really increasing the processing load on the NMS, and in fact, one could argue that having it in one process is more efficient than having a separate ping process, albeit with some additional complexity within the SNMP data collector.  A caveat is that some SNMP queries take longer than others, for example polling large routing tables.  These specific requests are easily excluded from the timing data.
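The extra instrumentation can be quite small.  Here is a minimal sketch, assuming the collector already has some function that sends one SNMP request and raises TimeoutError when no reply arrives.  That function is passed in as `fetch`; it, the request labels, and the `SLOW_REQUESTS` set are hypothetical names used only for this example.

```python
import time

# Hypothetical labels for queries that are known to be slow (for example
# large routing tables) and should stay out of the timing data.
SLOW_REQUESTS = {"routing_table"}


def timed_poll(fetch, device, request, samples, clock=time.monotonic):
    """Run one poll through `fetch` and record its round trip time.

    `samples` maps a device name to a list of round trip times in seconds.
    A lost request is recorded as None so packet loss can be counted later.
    """
    start = clock()
    try:
        result = fetch(device, request)
    except TimeoutError:
        samples.setdefault(device, []).append(None)
        return None
    rtt = clock() - start
    if request not in SLOW_REQUESTS:
        samples.setdefault(device, []).append(rtt)
    return result
```

The data is collected exactly as before; the only added work is two clock reads and a list append per request, which is the sense in which the processing simply moves from the ping process into the collector.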

If a reply is not received within the timeout window, a set of pings or other requests should be sent to determine whether the loss was due to congestion or to loss of connectivity.  These pings should probably be sent with a timeout that is similar to the TCP retransmit timer, roughly 2*Smoothed_Round_Trip_Time.
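One way to sketch the follow-up check is shown below.  It assumes the smoothed round trip time is kept as an exponentially weighted moving average with a gain of 1/8 (the gain used for SRTT in the standard TCP timer calculation), and that `probe` is a caller-supplied function returning True when the device answers within the given timeout.  The probe count of three and the two result labels are arbitrary choices for the example.

```python
def update_srtt(srtt, sample, alpha=0.125):
    """Fold one round trip time sample (seconds) into the smoothed value."""
    if srtt is None:
        return sample  # first measurement seeds the average
    return (1 - alpha) * srtt + alpha * sample


def classify_loss(probe, device, srtt, count=3):
    """Send follow-up probes after a missed reply and classify the loss.

    Each probe uses a timeout of 2 * SRTT.  If any probe is answered the
    original loss is attributed to congestion, otherwise the device is
    treated as unreachable.
    """
    timeout = 2 * srtt
    answered = sum(1 for _ in range(count) if probe(device, timeout))
    return "congestion" if answered else "unreachable"
```

A device that answers any of the follow-up probes would feed the congestion detection count, while one that answers none would start accumulating unreachable time.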

Regardless of whether SNMP or ping is used to collect the raw data, a reachability report would show the following:

  • Reachability SLA – Minimum, Average, and Maximum device reachability for the day.  Graph the reachability metrics over the past 90 days and have drill-down to other graphs.
  • Response time SLA – Minimum, Average, Maximum, and Standard Deviation response time per SNMP request or per ping.  Of course, the maximum value is infinity, indicating no response.  But to be practical, some sort of high end metric is needed that is roughly equivalent to infinity.  This could be user settable, and might make sense to default to something like 10 seconds.  Graph the response time metrics over the past 90 days and have drill-down to other graphs.
  • Top N unreachable devices (N is user selectable), sorted by the length of time that they were unreachable.  Note that a single failed interface or device may cause other systems to be unreachable, so this is a reachability metric, not an availability or uptime metric.
  • Top N response time devices, sorted from the device with the largest average response time to the device with the smallest.  To avoid conflict with the reachability table, this table would show only those devices where a response was received.
  • Top N congestion detection occurrences where devices didn’t respond to a request but did answer subsequent polls.
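To make the report items a little more concrete, here is a minimal sketch of the underlying calculations.  It assumes each device has a list of response times in seconds in which None marks a request that got no reply, and it uses the 10 second default mentioned above as the stand-in for infinity.  The function names and data layout are illustrative only.

```python
import statistics

NO_RESPONSE_S = 10.0  # practical stand-in for an infinite response time


def reachability_pct(samples):
    """Percentage of requests that received a reply."""
    return 100.0 * sum(s is not None for s in samples) / len(samples)


def response_stats(samples, ceiling=NO_RESPONSE_S):
    """Minimum, average, maximum and standard deviation of response time.

    Missing replies, and replies slower than the ceiling, count as the ceiling.
    """
    values = [ceiling if s is None else min(s, ceiling) for s in samples]
    return {
        "min": min(values),
        "avg": statistics.fmean(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),
    }


def top_n_unreachable(downtime_s, n):
    """Return the n devices with the most unreachable time, longest first."""
    return sorted(downtime_s.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    print(reachability_pct([0.1, None, 0.2, 0.3]))  # 75.0
    print(top_n_unreachable({"a": 30, "b": 120, "c": 5}, 2))
```

Whether lost requests should be folded into the response time statistics at the ceiling value or reported only in the reachability figures is a design choice; the sketch includes them so that the maximum reflects the worst case.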

Please leave a comment if you think of other ways to display or use reachability data.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html
