Have you ever had a Cisco router or switch shut down due to a fan failure? While looking through NetMRI’s daily list of analysis issues, I found a fan failure issue (it is named “Device Fan Problem” in NetMRI’s Analysis page).
It was really interesting to me because a fan failure produces a syslog message, and it should have been caught by the NOC, which uses other tools to identify important syslog and trap messages. Of course, the problem with syslog and SNMP traps is that they typically use UDP as their transport. UDP does not retransmit a packet that is discarded due to congestion or damaged in transit. Most network people know that UDP packets may not arrive at their destination, but because most networks are pretty reliable, we rarely see it happen.
Because UDP messages may be lost in transit, what can we do about network management that depends on UDP for much of its operation? SNMP polling also runs over UDP, but unlike a one-way syslog message or trap, a missing response is visible to the poller, which can simply ask again. A good network management system will retry SNMP queries until it is able to retrieve the data that it needs. In this case, NetMRI was able to gather information about a fan failure that had not made it into the logs. Using SNMP polling to retrieve information similar to what syslog reports may seem like a waste, but I think it is important for tracking transient values and for detecting problems where the syslog message didn’t make it to the syslog server.
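To make the retry idea concrete, here is a minimal sketch in Python. It is not how NetMRI or any particular NMS implements polling; the names poll_with_retries and make_lossy_query are hypothetical, and the "query" is a stand-in for an SNMP GET that raises TimeoutError when no response arrives. Unlike the open-ended retry described above, this sketch gives up after a fixed number of attempts.

```python
import time
from typing import Callable, Optional


def poll_with_retries(
    query: Callable[[], str],
    retries: int = 3,
    backoff_seconds: float = 0.0,
) -> Optional[str]:
    """Call query() until it answers or the attempts run out.

    query is assumed to raise TimeoutError when no response arrives,
    which is how a lost UDP request or reply looks to the poller.
    """
    for attempt in range(1 + retries):
        try:
            return query()
        except TimeoutError:
            # Wait a little longer after each lost datagram before asking again.
            time.sleep(backoff_seconds * (2 ** attempt))
    return None  # every attempt timed out


def make_lossy_query(drops: int, answer: str) -> Callable[[], str]:
    """Hypothetical stand-in for an SNMP GET whose first `drops` replies are lost."""
    state = {"calls": 0}

    def query() -> str:
        state["calls"] += 1
        if state["calls"] <= drops:
            raise TimeoutError("no response (datagram lost)")
        return answer

    return query


if __name__ == "__main__":
    # Two replies are lost; the third attempt retrieves the fan status.
    print(poll_with_retries(make_lossy_query(2, "fan1: failed"), retries=3))
```

The point of the sketch is the asymmetry: a lost syslog message is simply gone, while a lost poll response shows up as a timeout that the poller can act on.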
When I saw the issue, I verified that the fan had actually failed. [I like to verify that my tools are operating correctly and that I can trust them; so many NMS products produce false alarms that I’ve grown accustomed to checking them for proper operation.] NetMRI was correct: the device CLI reported the failed fan. A quick email to the support team allowed them to dispatch someone to repair it before the device overheated and shut down, potentially causing an unplanned network outage.
I like to understand failure modes: how things should operate when a failure occurs, and what I can do to minimize its impact. In the case of UDP, I like to use alternate collection methods that aren’t as timely as a log message, but that still let me know when things break.
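One way to picture the alternate collection method is as a cross-check between what polling found and what the syslog server received. The sketch below assumes the polled fan states and the set of devices with a logged fan message have already been collected somehow; the function name, the device names, and the "normal"/"failed" state strings are all made up for illustration.

```python
from typing import Dict, List, Set


def find_unreported_failures(
    polled_fan_state: Dict[str, str],
    devices_with_fan_syslog: Set[str],
) -> List[str]:
    """Return devices whose polled fan state is bad but which have no fan syslog entry."""
    return sorted(
        device
        for device, state in polled_fan_state.items()
        if state != "normal" and device not in devices_with_fan_syslog
    )


if __name__ == "__main__":
    # Hypothetical data: both switches have a failed fan, but only one
    # fan message reached the syslog server.
    polled = {"core-sw1": "failed", "edge-rtr2": "normal", "dist-sw3": "failed"}
    logged = {"dist-sw3"}
    print(find_unreported_failures(polled, logged))
```

Anything this returns is a failure that polling caught but the log-based tools never saw, which is exactly the situation described above.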
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html