Working as a consultant at Chesapeake NetCraftsmen is great, because I get to see a lot of interesting network problems. Sharing information about these problems is valuable, because it is better to learn from someone else’s mistakes than to make them ourselves. In this case, it was a self-inflicted denial-of-service attack.
This customer has a few network management products, each of which provides utility for a component of their operation. One NMS, hosted in the main data center, is configured to perform ping reachability testing of devices (lots of products include this type of functionality). It sends an alert to pagers or mobile phones when a monitored device is no longer reachable, and 28 people receive these alerts. The NMS is normally quite well behaved, and the network operations staff is happy with how it works: it provides early indication of network problems when reachability is affected.
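As an aside, here is a minimal sketch of what this style of ping-based polling and paging looks like. It is illustrative only, not the customer's actual NMS: the device names, recipient identifiers, and the notify() stub are hypothetical, and the ping flags assume a Linux ping.

```python
# Illustrative sketch of ping-based reachability monitoring with alerting
# (not the customer's actual NMS; names and notify() are placeholders).
import subprocess

DEVICES = ["core-sw-1", "dist-sw-1", "access-sw-101"]   # hypothetical device names
RECIPIENTS = ["oncall-pager-1", "oncall-pager-2"]       # stand-ins for the 28 recipients

def is_reachable(host, timeout_s=2):
    """Return True if a single ICMP echo request to `host` succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

def notify(recipient, message):
    """Placeholder for the paging/SMS gateway call."""
    print(f"PAGE {recipient}: {message}")

def poll_once():
    """Ping every monitored device and page every recipient for each failure."""
    for device in DEVICES:
        if not is_reachable(device):
            # Without any suppression, every recipient is paged for every
            # unreachable device, so alert volume is recipients x devices.
            for recipient in RECIPIENTS:
                notify(recipient, f"{device} is unreachable")

if __name__ == "__main__":
    poll_once()
```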
One day, the NMS caused a major problem. A key distribution switch had a blade failure that made a big chunk of the network unreachable; suddenly, about 650 devices of various types could not be reached. The NMS quickly discovered this and started sending alerts. The problem was the number of alerts that were sent: 28 people on the alert list, multiplied by 650 unreachable devices.
28 people * 650 unreachable devices = 18,200 alerts generated
The paging and SMS systems were suddenly overwhelmed. It turned into an unintentional, self-inflicted DoS attack on the alerting system. It took a long time for the alerting system to process the alerts. New alerts couldn’t make it through, and even if they did, they were lost in the volume of alerts from the failed devices. Fortunately, there were no other critical outages at the same time, but the impact made it clear that the existing mechanism didn’t work well when a key switch failed.
So the customer has taken several steps to mitigate the impact of a similar failure in the future. The NMS includes a feature that allows an alerting hierarchy to be constructed, but building it is a manual process, not an automatic one, and they have spent the time to create the necessary hierarchy. Some NMS products incorporate the ability to suppress downstream events, often called “root cause analysis”; suppressing alerts about downstream devices requires an understanding of the topology. Some products do this suppression automatically based on discovered topology, but many do not. This customer has now taken the time to configure suppression for many of the downstream devices.
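To make the idea concrete, here is a minimal sketch of topology-based downstream suppression, assuming a simple hand-maintained parent/child dependency map. The device names and the map itself are hypothetical, not this customer's configuration.

```python
# Minimal sketch of topology-based alert suppression (illustrative only;
# real NMS products implement this against their own topology models).

# Hypothetical dependency map: device -> the upstream device it is reached
# through, from the monitoring station's point of view.
UPSTREAM = {
    "access-sw-101": "dist-sw-1",
    "access-sw-102": "dist-sw-1",
    "ap-floor3-07":  "access-sw-101",
    "dist-sw-1":     "core-sw-1",
}

def root_causes(unreachable):
    """Return only the unreachable devices whose upstream device is still up.

    A device that is unreachable because something upstream of it is also
    unreachable is treated as a downstream symptom, and its alert is suppressed.
    """
    down = set(unreachable)
    roots = []
    for device in down:
        parent = UPSTREAM.get(device)
        # Alert only if the parent is reachable (or the device has no known
        # parent); otherwise the parent's alert covers it.
        if parent is None or parent not in down:
            roots.append(device)
    return sorted(roots)

if __name__ == "__main__":
    outage = ["dist-sw-1", "access-sw-101", "access-sw-102", "ap-floor3-07"]
    print(root_causes(outage))   # -> ['dist-sw-1']
```

With suppression like this in place, a blade failure in a distribution switch would ideally collapse into a single root-cause alert for that switch, so each recipient would receive one page instead of hundreds.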
The second remediation step was to reduce the number of people who receive each alert; halving the recipient list halves the number of notifications generated. The combination of topology-based suppression and a smaller recipient list made a big difference in the number of alerts generated. The remaining task is maintaining the suppression list, and that’s where the NMS products that do automatic suppression are very useful.
What happens in your network management alerting system when a key network device fails?
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html