Reachability SLA and SNMP Data Collection

Author
Terry Slattery
Principal Architect

Some network management systems track reachability by pinging network devices.  In fact, at a Networkers conference a few years ago, the Cisco IT team talked about using five-second pings to measure network availability (what I prefer to call reachability) in the Cisco corporate network.

In addition to pings, most systems collect some sort of operational data, like interface utilization and error counters.  Since any successful communication with a network device implies that it is also reachable, SNMP data collection could be used instead of, or in addition to, pings.  If the NMS is collecting interface and device data every ten minutes, then each successful collection tells the NMS that the device was reachable at some point within the last ten minutes.  If the individual requests to a device were spread evenly across the ten-minute period instead of being sent in one burst, the uncertainty period could be made significantly smaller.  Additional instrumentation of the SNMP data collection process could also produce round trip time information.
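
To put rough numbers on the spreading idea, here is a minimal sketch.  It assumes that each device is polled with several separate SNMP requests per period and that any reply counts as proof of reachability.  The function names are hypothetical and are not taken from any particular NMS.

```python
def staggered_offsets(period_s: float, n_requests: int) -> list[float]:
    """Start offsets (in seconds) that spread n_requests evenly over one period."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    step = period_s / n_requests
    return [i * step for i in range(n_requests)]


def max_uncertainty_s(period_s: float, n_requests: int) -> float:
    """Longest gap between two consecutive contacts with the same device."""
    if n_requests <= 0:
        raise ValueError("n_requests must be positive")
    return period_s / n_requests


# Five request groups per device, polled on a ten-minute (600 s) cycle.
offsets = staggered_offsets(600, 5)
gap = max_uncertainty_s(600, 5)
```

With five request groups on a ten-minute cycle, the device is contacted every 120 seconds instead of once every 600 seconds, so the reachability uncertainty drops by the same factor.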

There are a couple of advantages to using this approach.  The first is a reduction in overall network traffic, which admittedly may not be significant on high speed links.  The second is that I’ve seen a number of cases where a device stops working but still responds to pings, because they are handled at a very low level in the operating system.  SNMP requests require more of the operating system to be running and are therefore a better measure of operational availability.

The disadvantage is that the SNMP data collection process needs more instrumentation to measure and record round trip times and packet loss.  But this simply moves the processing from a ping process to the SNMP data collection process, so it doesn’t really increase the processing load on the NMS.  In fact, one could argue that having it in one process is more efficient than running a separate ping process, albeit with some additional complexity within the SNMP data collector.  A caveat is that some SNMP queries take longer than others, for example polling large routing tables.  These specific requests are easily excluded from the timing data.
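
The extra instrumentation might look something like the sketch below.  It is only an illustration: `poll_fn` stands in for whatever SNMP library the collector actually uses and is assumed to raise `TimeoutError` when no reply arrives, and the `SLOW_GROUPS` names are made up for the example.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, Optional

# Request groups excluded from timing because their replies are large and slow
# (hypothetical group name, used only for this example).
SLOW_GROUPS = {"routing_table"}


@dataclass
class PollStats:
    rtts_s: list = field(default_factory=list)  # round trip times of timed replies
    sent: int = 0                               # requests sent
    lost: int = 0                               # requests that timed out

    @property
    def loss_ratio(self) -> float:
        return self.lost / self.sent if self.sent else 0.0


def timed_poll(poll_fn: Callable[[str, str], Any], device: str, group: str,
               stats: PollStats,
               clock: Callable[[], float] = time.monotonic) -> Optional[Any]:
    """Run one SNMP request and record its round trip time or its loss."""
    stats.sent += 1
    start = clock()
    try:
        result = poll_fn(device, group)  # assumed to raise TimeoutError on no reply
    except TimeoutError:
        stats.lost += 1
        return None
    elapsed = clock() - start
    if group not in SLOW_GROUPS:  # slow requests still prove reachability
        stats.rtts_s.append(elapsed)
    return result
```

A reply to a slow request still counts toward reachability and packet loss here; only its round trip time is left out of the statistics.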

If a reply is not received within the timeout window, a set of pings or other requests should be sent to determine whether the loss was due to congestion or to a loss of connectivity.  These pings should probably be sent with a timeout that is similar to the classic TCP retransmit timer: 2*Smoothed_Round_Trip_Time.
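
As a sketch of that follow-up logic, the code below keeps a smoothed round trip time and uses twice its value as the probe timeout.  The smoothing weight of 0.875, the probe count of three, and the `probe_fn` callable (assumed to return True when a reply arrives within the timeout) are my assumptions for the example, not part of any specific product.

```python
from typing import Callable, Optional


def update_srtt(srtt: Optional[float], sample_s: float,
                alpha: float = 0.875) -> float:
    """Exponentially weighted moving average of round trip time samples."""
    if srtt is None:  # first sample seeds the average
        return sample_s
    return alpha * srtt + (1 - alpha) * sample_s


def classify_miss(probe_fn: Callable[[str, float], bool], device: str,
                  srtt_s: float, probes: int = 3) -> str:
    """After a missed reply, send follow-up probes and label the miss."""
    timeout_s = 2 * srtt_s  # 2*Smoothed_Round_Trip_Time
    answered = sum(1 for _ in range(probes) if probe_fn(device, timeout_s))
    if answered == 0:
        return "unreachable"  # nothing came back: likely loss of connectivity
    return "congestion"       # device answers, so the miss was likely a drop
```

Any answered probe is treated as evidence that the original miss was a dropped packet rather than an outage, which is a simplification; a real collector might also look at how many of the probes were lost.
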

Regardless of whether SNMP or ping is used to collect the raw data, a reachability report would show the following:

  • Reachability SLA – Minimum, Average, and Maximum device reachability for the day.  Graph the reachability metrics over the past 90 days and have drill-down to other graphs.
  • Response time SLA – Minimum, Average, Maximum, and Standard Deviation response time per SNMP request or per ping.  Of course, the maximum value is infinity, indicating no response.  But to be practical, some sort of high end metric is needed that is roughly equivalent to infinity.  This could be user-settable, and it might make sense to default to something like 10 seconds.  Graph the response time metrics over the past 90 days and have drill-down to other graphs.
  • Top N unreachable devices (N is user selectable), sorted by the length of time that they were unreachable.  Note that a single failed interface or device may cause other systems to be unreachable, so this is a reachability metric, not an availability or uptime metric.
  • Top N response time devices, sorted from the device with the largest average response time to least average response time.  To avoid conflict with the reachability table, this table would show those devices where a response was received.
  • Top N congestion detection occurrences where devices didn’t respond to a request but did answer subsequent polls.
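
A few of these report values can be computed along the lines of the sketch below.  The input shapes are assumptions: each device has a non-empty list of poll results in which `None` means no response, plus a total unreachable time in seconds.  The 10-second cap is the stand-in for infinity suggested above.

```python
import statistics
from typing import Optional, Sequence

NO_RESPONSE_S = 10.0  # practical stand-in for an infinite response time


def response_time_summary(rtts_s: Sequence[Optional[float]],
                          cap_s: float = NO_RESPONSE_S) -> dict:
    """Min, average, max, and standard deviation, with misses counted as cap_s."""
    values = [cap_s if r is None else min(r, cap_s) for r in rtts_s]
    return {
        "min": min(values),
        "avg": statistics.fmean(values),
        "max": max(values),
        "stdev": statistics.pstdev(values),  # population standard deviation
    }


def reachability_pct(rtts_s: Sequence[Optional[float]]) -> float:
    """Percentage of polls that received a response."""
    answered = sum(r is not None for r in rtts_s)
    return 100.0 * answered / len(rtts_s)


def top_n_unreachable(outage_s_by_device: dict, n: int) -> list:
    """Devices with the longest total unreachable time, longest first."""
    ranked = sorted(outage_s_by_device.items(),
                    key=lambda item: item[1], reverse=True)
    return ranked[:n]


summary = response_time_summary([0.1, 0.3, None])
worst = top_n_unreachable({"rtr-a": 30, "rtr-b": 600, "sw-c": 120}, 2)
```

Counting a miss as the cap value keeps the maximum and average finite while still making unresponsive devices stand out in the response time graphs.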

Please leave a comment if you think of other ways to display or use reachability data.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

