The first component of a good network management system is an event handling system. A basic event handling system can be easily implemented with open source tools like syslog-ng. Just watching the new events arrive can be very enlightening. Many network operations teams gain visibility into their network operations by using the Unix ‘tail -f’ command to monitor the event stream as it arrives.
It is easy to miss rare but important events using ‘tail -f’. I like to use the syslog summary script (Syslog Summary Script blog post) to summarize daily messages. A quick daily review of the summary allows me to find those important events, such as STANDBY-3-DUPADDR, C4K_SWITCHINGENGINEMAN-4-TCAMINTERRUPT, C6K_POWER-SP-1-PD_HW_FAULTY, OSPF-4-DUP_RTRID_NBR, and SYSTEM_CONTROLLER-3-ERROR. Creating a daily summary is what I view as the second phase of event handling. It is a simple form of event de-duplication, resulting in a count of each message type and a subsequent breakdown of the devices and interfaces reported in each message.
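The counting step can be sketched in a few lines of Python. This is not the script from the blog post mentioned above, only a minimal illustration of the idea. The function name `summarize`, the regular expression for Cisco-style message mnemonics, and the assumption that each line begins with the reporting device's address are all mine.

```python
import re
from collections import Counter, defaultdict

# Assumed format: Cisco-style mnemonics such as %LINK-3-UPDOWN: or
# %C6K_POWER-SP-1-PD_HW_FAULTY: (facility, optional sub-facility, severity, name).
MNEMONIC = re.compile(r"%([A-Z0-9_]+(?:-[A-Z0-9_]+)*-\d-[A-Z0-9_]+):")

def summarize(lines):
    """Count each message type and the devices that reported it."""
    by_type = Counter()
    by_device = defaultdict(Counter)
    for line in lines:
        match = MNEMONIC.search(line)
        if not match:
            continue  # not a message in the assumed format
        device = line.split()[0]  # assumes the line starts with the device address
        by_type[match.group(1)] += 1
        by_device[match.group(1)][device] += 1
    return by_type, by_device
```

Iterating over `by_type.most_common()` and then over `by_device[name].most_common()` gives the per-message count followed by the per-device breakdown. Sorting by ascending count instead puts the rare messages, which are often the interesting ones, at the top.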
A more advanced form of event correlation associates different events with one another. For example, assume that a device sends the following syslog messages:
10.1.1.18 Oct 5 16:43:26 edt: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/3, changed state to down
10.1.1.18 Oct 5 16:43:41 edt: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/3, changed state to up
The messages are different, but related because they are both about the same device and interface. Assuming that all the less important edge ports are configured with ‘no logging event link-status’, the link that’s reporting a problem is an important link. When the link goes down, I get a real-time notification of the event and can quickly begin troubleshooting the problem. When the link comes back up, I want the NMS to correlate the ‘up’ event with the ‘down’ event and automatically clear the ‘down’ event from the ‘active events’ list. Note that key server links, telepresence system links, and other key business process and control system links are often more important than network infrastructure links because the network is often engineered with a higher level of redundancy than the individual servers and edge devices.
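A rough sketch of this up/down correlation in Python follows. It keys each active event on the device and interface, so an 'up' message clears the matching 'down' entry. The pattern is written against the two sample messages above, and the names `LINK_EVENT` and `apply_event` are hypothetical; a real NMS would handle many more message types.

```python
import re

# Pattern built from the sample LINK-3-UPDOWN messages; other formats are ignored.
LINK_EVENT = re.compile(
    r"^(?P<device>\S+) .*%LINK-3-UPDOWN: Interface (?P<interface>\S+), "
    r"changed state to (?P<state>up|down)"
)

def apply_event(active, line):
    """Update the active-events dict in place for one syslog line."""
    match = LINK_EVENT.search(line)
    if not match:
        return
    key = (match["device"], match["interface"])
    if match["state"] == "down":
        active[key] = line       # add to the 'active events' list
    else:
        active.pop(key, None)    # the 'up' event clears the matching 'down'
```

Feeding the two sample messages through `apply_event` in order leaves the `active` dictionary empty again, which is the automatic clearing described above.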
What about links that only go down for a few seconds? No one may have reported it, but the NMS saw the event. If the NMS sends an alert and opens a trouble ticket for the interface down event, the operations team will find that the interface is back up and nothing needs to be done. After a few of these, the operations team will stop paying much attention to the interface down events. And they are right. It isn’t an interface down event. It is really an interface flapping event, which is an entirely different severity and troubleshooting scenario. Adding delays after event receipt but before generating an alert to the network operations team allows the system to detect intermittent events. In my experience, adding a delay of 30-60 seconds greatly reduces the number of interface down alerts and correctly categorizes flapping interfaces as such. Similar event correlation capabilities are useful for automatically clearing active events related to power supplies, device down, and many other network events. This frees the network staff from having to manually clear events, allowing them to spend their time doing more productive things.
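The hold-down delay might look something like the following Python sketch. It assumes the caller supplies timestamps in seconds and calls `on_tick` periodically; the class name `HoldDown`, its methods, and the alert tuples are hypothetical and do not come from any particular NMS.

```python
HOLD_SECONDS = 30  # lower end of the 30-60 second delay discussed above

class HoldDown:
    """Delay 'down' alerts so that short outages are reported as flapping."""

    def __init__(self, hold_seconds=HOLD_SECONDS):
        self.hold_seconds = hold_seconds
        self.pending = {}  # key -> time the 'down' event arrived

    def on_tick(self, now):
        """Return 'down' alerts for links that stayed down past the hold time."""
        expired = [key for key, started in self.pending.items()
                   if now - started >= self.hold_seconds]
        for key in expired:
            del self.pending[key]
        return [("down", key) for key in expired]

    def on_event(self, key, state, now):
        """Record one event and return any alerts that are now due."""
        alerts = self.on_tick(now)  # release anything whose hold time has passed
        if state == "down":
            self.pending.setdefault(key, now)
        elif key in self.pending:
            # Came back up inside the hold window: report a flap, not an outage.
            del self.pending[key]
            alerts.append(("flapping", key))
        return alerts
```

With the sample messages earlier in this post, the link was down for 15 seconds, so a 30-second hold produces a single flapping alert and no interface down alert.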
Event suppression is the process of ignoring events that are generated due to a higher-level event. Wikipedia refers to this as Event Masking or Topological Masking. A good example is where a device or link fails, causing some number of systems downstream of the failed entity to become unreachable. A reachability monitoring system will generate an event for every unreachable system, potentially generating hundreds of events. An event suppression system will have knowledge of the topology and will suppress the downstream events. The result is a significant reduction in the volume of events that the network operations team must handle, allowing them to see and focus on the event that caused the loss of connectivity. Event suppression is particularly important where email or pager alerts are generated. More than one organization has disabled certain alerts after experiencing a flood of hundreds of alerts when one key device or link went down.
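As a simplified illustration of topology-based suppression, the Python sketch below assumes a tree-shaped topology in which each device has a single upstream neighbor, recorded in a hypothetical `upstream` dictionary. Networks with redundant paths need a real graph reachability check, so treat this as a sketch of the idea only.

```python
def root_causes(unreachable, upstream):
    """Return the unreachable devices whose upstream neighbor is still reachable.

    `upstream` maps each device to its single upstream neighbor (assumed tree
    topology). Devices with no entry are treated as the top of the tree.
    """
    unreachable = set(unreachable)
    return {device for device in unreachable
            if upstream.get(device) not in unreachable}
```

Only the devices returned by `root_causes` would generate alerts. Every other unreachable device sits behind one of them, so its event can be suppressed or attached to the root-cause event for reference.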
Through the use of both Event Correlation and Event Suppression, network operations teams can make event processing much more useful. We like to know about the events, but sometimes it seems that events are just a lot of noise. By correlating events and suppressing downstream events, it is possible to reduce the volume of alerts that are generated. With lower volume, it is easier to identify the most important events (the signal-to-noise ratio goes up) and focus on correcting the problems behind them.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html