Handling Event Data, Part 1

Terry Slattery
Principal Architect

I’ve been examining the handling of event data and wanted to share what I see as common requirements.  For the next few posts, I’ll describe the requirements and why they are important.

Syslog Summary
My first requirement was addressed in the prior post about Syslog Summary Scripts. I need a way to easily see what events are happening in the network. It is especially important to find the critical, low-frequency events such as a power supply failure, a fan failure, or a pinnacle error. The Syslog Summary Script makes it easy to find these with a low-cost tool that doesn’t take much of your time to implement and use.
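The core of such a summary can be sketched in a few lines. This is a minimal illustration, not the author’s actual script: it assumes Cisco-style syslog messages where a mnemonic like %LINK-3-UPDOWN identifies the event type, and the regex and sample data are mine.

```python
# Count syslog message types so that rare, critical events stand out.
# Assumes Cisco-style mnemonics (e.g. "%LINK-3-UPDOWN") in each line.
import re
from collections import Counter

MNEMONIC = re.compile(r"%[A-Z0-9_]+-\d-[A-Z0-9_]+")

def summarize(lines):
    """Return (mnemonic, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        m = MNEMONIC.search(line)
        if m:
            counts[m.group(0)] += 1
    return counts.most_common()

if __name__ == "__main__":
    sample = [
        "Jan 1 00:01:02 sw1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to down",
        "Jan 1 00:01:05 sw1 %LINK-3-UPDOWN: Interface Gi0/1, changed state to up",
        "Jan 1 02:13:00 sw2 %ENVMON-2-FAN_FAILURE: Fan 1 failure detected",
    ]
    for mnemonic, count in summarize(sample):
        print(f"{count:6d}  {mnemonic}")
```

The low-frequency events (the fan failure above) sort to the bottom of the output, which is exactly where a quick scan looks for them.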

Event correlation
One network that I’ve seen has a large number of interface transitions per day, with an interface going down for only a few seconds before it comes back up. I need a way to correlate the ‘up’ event with the corresponding ‘down’ event and automatically clear the ‘Interface down’ alert, or reduce its severity. Any interface that goes down and stays down will continue to have an active, high-severity alert. Of course, you must now have an alert console, and that requires someone to monitor the console. This brings me to the next requirement.
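The matching logic described above can be sketched as follows. The event data model here (dicts with "interface", "state", and "time" keys) is my assumption for illustration, not the post’s format:

```python
# Pair each "up" event with the open "down" alert on the same interface.
# Interfaces that never come back up remain as active, high-severity alerts.
def correlate(events):
    open_alerts = {}   # interface -> "down" event still awaiting an "up"
    cleared = []       # "down" events matched by a subsequent "up"
    for ev in sorted(events, key=lambda e: e["time"]):
        if ev["state"] == "down":
            open_alerts[ev["interface"]] = ev
        elif ev["state"] == "up" and ev["interface"] in open_alerts:
            cleared.append(open_alerts.pop(ev["interface"]))
    return open_alerts, cleared
```

A brief bounce (down, then up a few seconds later) lands in `cleared` and can be demoted or discarded, while anything left in `open_alerts` stays on the console.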

Delayed notification of events

When an event occurs, it creates an alert that has a severity related to the importance of the network element that was affected, typically based upon the importance of the reporting device or interface. I’ve noticed in some networks that the event was created due to something else occurring in the network and that there is a subsequent correcting event logged. For example, moving a connection from one port to another as part of a known configuration or maintenance action produces an interface down event that is quickly followed by an interface up event. I’d like to see the alert automatically cleared, or reduced in severity to an informational alert. However, if there is no correlated correcting event within a certain amount of time, I want to escalate the alert by sending an email, creating a trouble ticket, or sending an SMS message to call attention to the alert without requiring that someone continuously watch the alert console.
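One way to express “escalate only if no correcting event arrives in time” is to hold each alert for a grace period before notifying. This is a sketch under my own assumptions: the 5-minute window, the tuple event format, and the `notify` callback are all illustrative.

```python
# Delay notification: a "down" alert escalates only if no matching "up"
# arrives within GRACE_SECONDS. Assumed data model: (time, interface, state).
import heapq

GRACE_SECONDS = 300

def escalate_due(pending, down_at, now, notify):
    """Fire notifications for alerts whose grace period has expired."""
    while pending and pending[0][0] <= now:
        _, iface = heapq.heappop(pending)
        if iface in down_at:          # still down past the grace period
            notify(iface)

def process(events, now, notify):
    """events: time-ordered iterable of (time, interface, state) tuples."""
    pending = []    # min-heap of (deadline, interface)
    down_at = {}    # interface -> time of the unmatched "down"
    for t, iface, state in events:
        escalate_due(pending, down_at, t, notify)
        if state == "down":
            down_at[iface] = t
            heapq.heappush(pending, (t + GRACE_SECONDS, iface))
        elif state == "up":
            down_at.pop(iface, None)  # correcting event: clear quietly
    escalate_due(pending, down_at, now, notify)
```

A brief bounce never reaches `notify`; an interface that stays down triggers the email, ticket, or SMS after the window expires.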

Real-time notification of key events
If a critical interface goes down, or another similarly important event occurs, the NMS must be able to notify the network staff in real-time (real-time in this case is within 60 seconds). The notification could be by email, SMS, Twitter, or a phone call. The criticality of the event could be set by the NMS using the rank of the device(s) involved. The NMS could also delay the notification if the affected component is less critical, for example, an interface that has returned to the ‘up’ state as I described above.
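A severity-to-channel mapping like the one described could look like the sketch below. The channel choices, the 1–5 severity scale, and the callback hooks are my assumptions; a real NMS would wire these to an SMS gateway, mail server, and ticketing system.

```python
# Route an alert to notification channels based on its severity.
# Assumed scale: 1 = critical ... 5 = informational.
def dispatch(alert, send_email, send_sms, open_ticket):
    severity = alert["severity"]
    if severity == 1:
        send_sms(alert)      # immediate, must reach staff within ~60 seconds
        send_email(alert)
    elif severity <= 3:
        send_email(alert)
        open_ticket(alert)
    # informational alerts (4-5) stay on the console only
```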

In order to set an appropriate severity on each event, the system needs to know something about the relative importance of the devices and interfaces. You may recall the same requirement in my post about variable interface polling frequency (Handling NMS Performance Data, Part 4). Events involving a single device could be ranked the same as the device rank. Events that involve multiple devices, such as a link up/down or an OSPF adjacency change, could be ranked according to the average of the rankings of the devices involved. The NMS should use previously learned neighbor information to determine the affected systems when setting the severity of events like link up/down.
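The averaging rule above is simple enough to state directly. The rank table, the default rank for unknown devices, and the convention that a lower number means a more important device are my illustrative assumptions:

```python
# Rank an event from the ranks of the devices it involves.
# Single-device event: inherits the device rank.
# Multi-device event (link up/down, OSPF adjacency): average of the ranks.
def event_rank(devices, ranks, default=3):
    values = [ranks.get(d, default) for d in devices]
    return sum(values) / len(values)
```

For example, a link between a rank-1 core router and a rank-4 access switch would score (1 + 4) / 2 = 2.5, sitting between the two devices in importance.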

I’ll have more event analysis requirements next time.



Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

