Defining Event Correlation and Event Suppression

Author
Terry Slattery
Principal Architect

The first component of a good network management system is an event handling system. A basic event handling system can be easily implemented with open source tools like syslog-ng. Just watching the new events arrive can be very enlightening. Many network operations teams gain visibility into their network operations by using the Unix ‘tail -f’ command to monitor the event stream as it arrives.

It is easy to miss rare but important events using ‘tail -f’. I like to use the syslog summary script (Syslog Summary Script blog post) to summarize daily messages. A quick daily review of the summary allows me to find those important events, such as STANDBY-3-DUPADDR, C4K_SWITCHINGENGINEMAN-4-TCAMINTERRUPT, C6K_POWER-SP-1-PD_HW_FAULTY, OSPF-4-DUP_RTRID_NBR, and SYSTEM_CONTROLLER-3-ERROR. Creating a daily summary is what I view as the second phase of event handling. It is a simple form of event de-duplication, resulting in a count of each message type and a subsequent breakdown of the devices and interfaces reported in each message.
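The counting idea behind such a daily summary can be sketched in a few lines of Python. This is not the syslog summary script mentioned above, only a minimal illustration. It assumes that the first field of each line is the device address, that messages carry a Cisco-style `%FACILITY-SEVERITY-MNEMONIC` tag, and that interface names follow the word "Interface". The names `summarize`, `MNEMONIC`, and `INTERFACE` are made up for this example.

```python
import re
from collections import Counter, defaultdict

# Cisco-style message tag, e.g. %LINK-3-UPDOWN or %C6K_POWER-SP-1-PD_HW_FAULTY.
MNEMONIC = re.compile(r"%([A-Z0-9_]+(?:-[A-Z0-9_]+)*-\d-[A-Z0-9_]+)")
# Interface name as it appears in link messages: "Interface Gi1/0/3, ...".
INTERFACE = re.compile(r"Interface (\S+?),")


def summarize(lines):
    """Count each message type, then break each count down by device and interface."""
    totals = Counter()
    breakdown = defaultdict(Counter)
    for line in lines:
        tag = MNEMONIC.search(line)
        if tag is None:
            continue  # no recognizable tag; this sketch simply skips the line
        device = line.split()[0]  # assumption: device address is the first field
        port = INTERFACE.search(line)
        interface = port.group(1) if port else None
        totals[tag.group(1)] += 1
        breakdown[tag.group(1)][(device, interface)] += 1
    return totals, breakdown
```

Sorting `totals` in ascending order of count puts the rare messages at the top of the report, which is where the important ones tend to hide.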

A more advanced form of event correlation associates different events with one another. For example, assume that a device sends the following syslog messages:

10.1.1.18 Oct  5 16:43:26 edt: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/3, changed state to down
10.1.1.18 Oct  5 16:43:41 edt: %LINK-3-UPDOWN: Interface GigabitEthernet1/0/3, changed state to up

The messages are different but related: both concern the same device and interface. Assuming that all the less important edge ports are configured with ‘no logging event link-status’, a link that reports a problem is an important one. When the link goes down, I get a real-time notification of the event and can quickly begin troubleshooting. When the link comes back up, I want the NMS to correlate the ‘up’ event with the ‘down’ event and automatically clear the ‘down’ event from the ‘active events’ list. Note that key server links, telepresence system links, and other key business process and control system links are often more important than network infrastructure links, because the network is typically engineered with a higher level of redundancy than the individual servers and edge devices.
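A minimal sketch of this up/down correlation, assuming the message layout shown in the two sample lines above, might look like the following. The class name `ActiveEvents` and the return values are hypothetical; a real NMS would keep much richer state.

```python
import re

# Layout assumed from the sample messages: device address first, then the
# %LINK-3-UPDOWN tag, the interface name, and the new state.
LINK_EVENT = re.compile(
    r"^(?P<device>\S+) .*%LINK-3-UPDOWN: Interface (?P<interface>\S+), "
    r"changed state to (?P<state>up|down)"
)


class ActiveEvents:
    """Track 'down' events and clear them when the matching 'up' arrives."""

    def __init__(self):
        self.active = {}  # (device, interface) -> original 'down' message

    def process(self, line):
        event = LINK_EVENT.match(line)
        if event is None:
            return None  # not a link state message
        key = (event["device"], event["interface"])
        if event["state"] == "down":
            self.active[key] = line
            return "raised"
        # An 'up' clears the 'down' recorded for the same device and interface.
        if self.active.pop(key, None) is not None:
            return "cleared"
        return "ignored"  # 'up' with no matching active 'down'
```

Feeding the two sample messages through `process` in order raises the event on the first line and clears it on the second, leaving the active list empty.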

What about links that only go down for a few seconds? No one may have reported it, but the NMS saw the event. If the NMS sends an alert and opens a trouble ticket for the interface down event, the operations team will find that the interface is back up and nothing needs to be done. After a few of these, the operations team will stop paying much attention to the interface down events. And they are right. It isn’t an interface down event. It is really an interface flapping event, which is an entirely different severity and troubleshooting scenario. Adding delays after event receipt but before generating an alert to the network operations team allows the system to detect intermittent events. In my experience, adding a delay of 30-60 seconds greatly reduces the number of interface down alerts and correctly categorizes flapping interfaces as such. Similar event correlation capabilities are useful for automatically clearing active events related to power supplies, device down, and many other network events. This frees the network staff from having to manually clear events, allowing them to spend their time doing more productive things.
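One way to picture the delay is a hold-down filter that waits before alerting and then decides whether it saw a plain outage or a flap. The sketch below is an illustration under stated assumptions, not a description of any particular NMS: the 45 second default is simply a value inside the 30-60 second range mentioned above, timestamps are passed in as plain seconds, and the names `HoldDownFilter`, `on_down`, `on_up`, and `flush` are invented for the example.

```python
class HoldDownFilter:
    """Delay alerts so that short outages are reported as flapping, not down."""

    def __init__(self, hold_seconds=45):  # assumed value within 30-60 seconds
        self.hold = hold_seconds
        self.pending = {}  # key -> {"since": time, "down": bool, "recoveries": int}

    def on_down(self, key, now):
        # The hold window starts at the first 'down' seen for this key.
        state = self.pending.setdefault(
            key, {"since": now, "down": True, "recoveries": 0}
        )
        state["down"] = True

    def on_up(self, key, now):
        state = self.pending.get(key)
        if state is not None:
            state["down"] = False
            state["recoveries"] += 1

    def flush(self, now):
        """Return (key, kind) alerts for every hold window that has expired."""
        alerts = []
        for key, state in list(self.pending.items()):
            if now - state["since"] < self.hold:
                continue  # still inside the hold window; keep waiting
            if state["recoveries"] > 0:
                alerts.append((key, "flapping"))
            elif state["down"]:
                alerts.append((key, "down"))
            del self.pending[key]
        return alerts
```

Calling `flush` periodically, for example once per second, means an interface that drops and recovers within the window produces a single "flapping" alert instead of a "down" alert that is already stale when someone looks at it.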

Event suppression is the process of ignoring events that are generated due to a higher-level event. Wikipedia refers to this as Event Masking or Topological Masking. A good example is where a device or link fails, causing some number of systems downstream of the failed entity to become unreachable. A reachability monitoring system will generate an event for every unreachable system, potentially generating hundreds of events. An event suppression system will have knowledge of the topology and will suppress the downstream events. The result is a significant reduction in the volume of events that the network operations team must handle, allowing them to more easily see, and focus on, the event that caused the loss of connectivity. Event suppression is particularly important where email or pager alerts are generated. More than one organization has disabled certain alerts after experiencing a flood of hundreds of alerts when one key device or link went down.
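As a rough sketch of topological suppression, suppose the topology is known as a simple tree in which every device has a single upstream neighbor on the path from the monitoring station. That single-path assumption is a simplification; a redundant network would need a real reachability computation over the full graph. The function name `classify_unreachable` and the `upstream` mapping are hypothetical.

```python
def classify_unreachable(unreachable, upstream):
    """Split unreachable devices into root causes and suppressed downstream events.

    `upstream` maps each device to its single upstream neighbor, or to None for
    the device closest to the monitoring station (assumed tree topology).
    """
    unreachable = set(unreachable)
    root_causes, suppressed = [], []
    for device in sorted(unreachable):
        # Walk toward the monitoring station looking for a failed ancestor.
        ancestor = upstream.get(device)
        while ancestor is not None and ancestor not in unreachable:
            ancestor = upstream.get(ancestor)
        if ancestor is None:
            root_causes.append(device)  # nothing upstream is down: alert on it
        else:
            suppressed.append(device)  # explained by an upstream failure
    return root_causes, suppressed
```

With a made-up topology of `core` feeding `dist1` and `dist2`, and `dist1` feeding `access1` and `access2`, a failure of `dist1` makes three devices unreachable, but only `dist1` is returned as a root cause. The two access devices are suppressed, so the operations team receives one alert instead of three.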

Through the use of both Event Correlation and Event Suppression, network operations teams can make event processing much more useful. We like to know about the events, but sometimes it seems that events are just a lot of noise. By correlating events and suppressing downstream events, it is possible to reduce the volume of alerts that are generated. With lower volume, it is easier to identify the most important events (the signal-to-noise ratio goes up) and focus on taking action to correct those events.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

