Handling Event Data, Part 1

Terry Slattery
Principal Architect

I’ve been examining the handling of event data and wanted to share what I see as common requirements.  For the next few posts, I’ll describe the requirements and why they are important.

Syslog Summary
My first requirement was addressed in the prior post about Syslog Summary Scripts.  I need a way to easily see what events are happening in the network.  It is especially important to find the critical, low-frequency events such as a power supply failure, a fan failure, or a pinnacle error.  The Syslog Summary Script makes it easy to find these with a low-cost tool that doesn’t take much of your time to implement and use.

Event correlation
One network that I’ve seen has a large number of interface transitions per day, with an interface going down for only a few seconds before it comes back up.  I need a way to correlate the ‘up’ event with the corresponding ‘down’ event and automatically clear, or reduce the severity of, the ‘Interface down’ alert.  Any interface that goes down and stays down will continue to have an active, high-severity alert.  Of course, you must now have an alert console, and that requires that someone monitor the console.  This brings me to the next requirement.
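The up/down pairing described above can be sketched in a few lines of Python.  This is only a minimal illustration: the Event fields, the "high" and "info" severity labels, and the correlate function are my own assumptions, not the data model of any particular NMS.

```python
from dataclasses import dataclass


@dataclass
class Event:
    timestamp: float  # seconds; any consistent clock works for this sketch
    device: str
    interface: str
    state: str        # "down" or "up"


def correlate(events):
    """Pair each 'up' with the earlier 'down' on the same interface.

    Returns a dict keyed by (device, interface).  An alert stays "high"
    until a matching 'up' arrives, then it is reduced to "info".
    """
    alerts = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.device, ev.interface)
        if ev.state == "down":
            alerts[key] = {"since": ev.timestamp, "severity": "high"}
        elif ev.state == "up" and key in alerts:
            alerts[key]["severity"] = "info"
            alerts[key]["cleared_at"] = ev.timestamp
    return alerts


events = [
    Event(0.0, "rtr1", "Gi0/1", "down"),
    Event(5.0, "rtr1", "Gi0/1", "up"),     # flap: back up after 5 seconds
    Event(10.0, "rtr2", "Gi0/2", "down"),  # goes down and stays down
]
alerts = correlate(events)
```

In this example the flapping interface on rtr1 ends up with an informational alert, while the interface on rtr2 keeps its high-severity alert.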

Delayed notification of events
When an event occurs, it creates an alert that has a severity related to the importance of the network element that was affected, typically based upon the importance of the reporting device or interface.  In some networks, I’ve noticed that an event is caused by something else occurring in the network and that a subsequent correcting event is logged.  For example, an interface down event may be followed quickly by an interface up event when a connection is moved from one port to another as part of a known configuration or maintenance action.  In that case, I’d like to see the alert automatically cleared, or reduced in severity to an informational alert.  However, if there is no correlated correcting event within a certain amount of time, I want to escalate the alert by sending an email, creating a trouble ticket, or sending an SMS message to call attention to the alert without requiring that someone continuously watch the alert console.
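The escalation half of this requirement could look something like the following sketch.  The alert dictionary layout, the 300-second hold time, and the escalate_stale function are illustrative assumptions; the notify callback stands in for whatever email, ticketing, or SMS hook a real system would provide.

```python
def escalate_stale(alerts, now, hold_seconds=300.0, notify=print):
    """Escalate high-severity alerts that have had no correcting event.

    alerts maps (device, interface) to a dict with "since" and "severity".
    Each alert is escalated at most once; the keys escalated on this
    pass are returned.
    """
    escalated = []
    for (device, interface), alert in alerts.items():
        if alert["severity"] != "high" or alert.get("escalated"):
            continue  # already corrected, or already escalated
        age = now - alert["since"]
        if age >= hold_seconds:
            notify(f"{device} {interface} down for {age:.0f}s with no recovery")
            alert["escalated"] = True
            escalated.append((device, interface))
    return escalated


sent = []  # stand-in for an email, ticket, or SMS gateway
open_alerts = {
    ("rtr1", "Gi0/1"): {"since": 0.0, "severity": "info"},    # recovered
    ("rtr2", "Gi0/2"): {"since": 10.0, "severity": "high"},   # still down
    ("rtr3", "Gi0/3"): {"since": 350.0, "severity": "high"},  # too recent
}
first_pass = escalate_stale(open_alerts, now=400.0, notify=sent.append)
second_pass = escalate_stale(open_alerts, now=460.0, notify=sent.append)
```

Running the check on a timer means nobody has to watch the console: only the rtr2 alert, which has been open longer than the hold time, produces a notification, and it does so only once.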

Real-time notification of key events
If a critical interface goes down, or another similarly important event occurs, the NMS must be able to notify the network staff in real time (real time in this case means within 60 seconds).  The notification could be by email, SMS, Twitter, or a phone call.  The criticality of the event could be set by the NMS using the rank of the device(s) involved.  The NMS could also delay the notification if the affected component is less critical.  For example, it could hold the notification for interfaces that have returned to the ‘up’ status, as I described above.
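One way to express that policy is a small function that maps an event's rank to a notification plan.  Everything specific here is an assumption for illustration: the rank scale (1 is most important), the cutoff for what counts as critical, the hold time, and the channel names.

```python
REALTIME_SECONDS = 60  # "real time" as defined in this post


def notification_plan(rank, critical_rank=2, hold_seconds=300):
    """Decide how quickly and by which channels to notify.

    rank uses an assumed scale where 1 is the most important.
    Critical events are sent at once and must arrive within
    REALTIME_SECONDS; less critical ones are held first, so a
    correcting event can cancel them.
    """
    if rank <= critical_rank:
        return {"hold": 0,
                "deliver_within": REALTIME_SECONDS,
                "channels": ["sms", "email"]}
    return {"hold": hold_seconds,
            "deliver_within": hold_seconds + REALTIME_SECONDS,
            "channels": ["email"]}


core_plan = notification_plan(rank=1)
access_plan = notification_plan(rank=4)
```

With these settings a rank 1 event is sent immediately over both channels, while a rank 4 event waits out the hold period before an email goes out.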

In order to set an appropriate severity on each event, the system needs to know something about the relative importance of the devices and interfaces.  You may recall the same requirement in my post about variable interface polling frequency (Handling NMS Performance Data, Part 4).  Events involving a single device could take the rank of that device.  Events that involve multiple devices, such as a link up/down or an OSPF adjacency change, could be ranked according to the average of the rankings of the devices involved.  The NMS should use previously learned neighbor information to determine the affected systems when setting the severity of events like link up/down.
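The ranking rule above can be sketched as follows.  The device names, the rank values, the neighbor table, and the default rank for unknown devices are all hypothetical; a real NMS would fill these tables from its own inventory and from learned neighbor data.

```python
from statistics import mean

# Hypothetical device ranks (1 = most important) and learned neighbors.
DEVICE_RANK = {"core1": 1, "dist1": 2, "access7": 4}
NEIGHBORS = {
    ("core1", "Te1/1"): "dist1",
    ("dist1", "Te1/1"): "core1",
    ("access7", "Gi0/1"): "dist1",
}


def event_rank(device, interface=None, default_rank=5):
    """Rank an event from the devices it involves.

    A device-only event takes that device's rank.  A link event also
    looks up the neighbor on the far end of the interface and averages
    the two ranks.  Unknown devices get default_rank.
    """
    devices = [device]
    if interface is not None:
        peer = NEIGHBORS.get((device, interface))
        if peer is not None:
            devices.append(peer)
    return mean(DEVICE_RANK.get(d, default_rank) for d in devices)


fan_failure_rank = event_rank("core1")            # single-device event
core_link_rank = event_rank("core1", "Te1/1")     # core1 to dist1 link
access_link_rank = event_rank("access7", "Gi0/1") # access7 to dist1 link
```

Averaging pulls an access uplink toward the rank of the distribution switch it connects to, so the link event is treated as more important than an event on the access switch alone.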

I’ll have more event analysis requirements next time.



Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html



