Handling Event Data, Part 2

Author
Terry Slattery
Principal Architect

I’m continuing the list of NMS event handling requirements.

Unstable interfaces
In the last post, Handling Event Data, Part 1, I talked about correlating interface up/down events.  I would also like to know if an interface is unstable, meaning that it goes up and down often enough to hint that something needs further investigation.  More generally, any event type whose count exceeds some threshold over a given time interval would create an alert or appear in a report.  For example, report any interface that has had more than two ‘down’ events in a week.  I would use this information to identify and research intermittent problems, make the network more stable, and reduce churn in the routing and switching protocols.
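As a rough illustration of the threshold idea, here is a minimal Python sketch that counts ‘down’ events per interface over a one-week window. The event tuple layout, the function name unstable_interfaces, and the sample device and interface names are assumptions made for this example; they do not come from any particular NMS.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def unstable_interfaces(events, now, window=timedelta(days=7), threshold=2):
    """Return {(device, interface): count} for every interface with more
    than `threshold` 'down' events inside the window ending at `now`.

    `events` is an iterable of (timestamp, device, interface, state)
    tuples; this layout is hypothetical.
    """
    cutoff = now - window
    counts = defaultdict(int)
    for timestamp, device, interface, state in events:
        if state == "down" and cutoff <= timestamp <= now:
            counts[(device, interface)] += 1
    return {key: n for key, n in counts.items() if n > threshold}


# Illustrative sample data, not taken from a real network.
now = datetime(2024, 1, 8)
events = [
    (datetime(2024, 1, 2, 9, 0), "rtr1", "Gi0/1", "down"),
    (datetime(2024, 1, 2, 9, 1), "rtr1", "Gi0/1", "up"),
    (datetime(2024, 1, 4, 14, 30), "rtr1", "Gi0/1", "down"),
    (datetime(2024, 1, 6, 3, 15), "rtr1", "Gi0/1", "down"),
    (datetime(2024, 1, 5, 11, 0), "sw2", "Gi1/0/4", "down"),
    (datetime(2023, 12, 20, 8, 0), "sw2", "Gi1/0/4", "down"),  # outside the window
]

report = unstable_interfaces(events, now)
print(report)
```

With this sample data only rtr1 Gi0/1 is reported, since it has three ‘down’ events inside the week. The same counting approach would work for any event type, not just interface state changes.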

Alert severity changes
When an event (syslog or trap) is received, it is processed and may generate an alert.  An alert is a notification to the administrators that something needs attention (e.g., link down).  When a correlating event is received (link up), the alert could be cleared, but it is useful to keep tracking it at a lower priority instead.  Then, if the same events recur, a counter on the alert gets incremented.  It isn’t good to keep alerts around forever, so they must age out.  What’s useful is a daily, weekly, or monthly event/alert report.  There are two ways to track repeated events.  One is to record the information in the alert status.  The other is to have reports that consolidate the data stored in the alerting and event database.  (Note that I am not necessarily advocating that all events be kept in a database.  The filesystem may be the most suitable place to store event data because of the high volume of events and the efficiency with which they can be stored.)
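The first approach, recording the history in the alert itself, might look something like the following Python sketch. The Alert and AlertTable classes, the "high" and "low" severity labels, and the seven-day age-out period are all assumptions chosen for illustration, not a description of any specific product.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    key: tuple           # e.g. (device, interface); hypothetical layout
    severity: str        # "high" while the problem is active, "low" once corrected
    count: int           # how many times the problem event has been seen
    last_seen: datetime  # time of the most recent related event


class AlertTable:
    def __init__(self, max_age=timedelta(days=7)):
        self.max_age = max_age
        self.alerts = {}

    def problem(self, key, when):
        """A problem event (e.g. link down): raise a new alert or re-raise an old one."""
        alert = self.alerts.get(key)
        if alert is None:
            self.alerts[key] = Alert(key, "high", 1, when)
        else:
            alert.severity = "high"
            alert.count += 1
            alert.last_seen = when

    def corrected(self, key, when):
        """A correlating event (e.g. link up): lower the priority instead of clearing."""
        alert = self.alerts.get(key)
        if alert is not None:
            alert.severity = "low"
            alert.last_seen = when

    def age_out(self, now):
        """Remove low-priority alerts with no activity for longer than max_age."""
        expired = [k for k, a in self.alerts.items()
                   if a.severity == "low" and now - a.last_seen > self.max_age]
        for k in expired:
            del self.alerts[k]
        return expired


# Illustrative usage with made-up timestamps.
table = AlertTable()
key = ("rtr1", "Gi0/1")
t0 = datetime(2024, 1, 1, 9, 0)
table.problem(key, t0)
table.corrected(key, t0 + timedelta(minutes=1))
table.problem(key, t0 + timedelta(hours=2))
table.corrected(key, t0 + timedelta(hours=2, minutes=1))

snapshot = (table.alerts[key].severity, table.alerts[key].count)
print(snapshot)

expired = table.age_out(t0 + timedelta(days=8))
print(expired)
```

After two down/up cycles the alert sits at low priority with a count of two, and it is removed once it has been quiet for longer than the age-out period. Alerts that are still at high priority are never aged out in this sketch, which is a design choice rather than a requirement.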

Identifying rare events
Ideally, the common events are all identified in some way.  Think of this as a filter that identifies known problems and how to handle them.  Interface up/down events may be handled by generating alerts if a correcting event is not received within a certain time period, as described previously.  What’s left after recognizing the common set of events are the rare events.  Pinnacle errors, fan failures, and power supply failures fall into this category.  Simply because they are rare, such events should receive extra attention and should generate an immediate, high severity alert.  The network staff can then examine whether the event is critical to the operation of the network and take appropriate action.
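One way to picture this filter is as a list of patterns for the known, common events, with anything that matches none of them treated as rare. The Python sketch below assumes syslog messages arrive as plain strings; the pattern list, the function names, and the sample messages (including the fan failure text) are illustrative assumptions, and a real deployment would need a much longer list of known patterns.

```python
import re

# Patterns for common events that already have their own handling.
# The list is a small assumed sample, not a complete catalog.
KNOWN_PATTERNS = [
    re.compile(r"%LINK-3-UPDOWN"),
    re.compile(r"%LINEPROTO-5-UPDOWN"),
    re.compile(r"%SYS-5-CONFIG_I"),
]


def classify(message):
    """Return "known" if the message matches a common pattern, else "rare"."""
    for pattern in KNOWN_PATTERNS:
        if pattern.search(message):
            return "known"
    return "rare"


def rare_event_alerts(messages):
    """Build a high severity alert for every message the filter does not recognize."""
    return [{"severity": "high", "message": m}
            for m in messages if classify(m) == "rare"]


# Illustrative messages; the wording is made up for this example.
messages = [
    "rtr1: %LINK-3-UPDOWN: Interface GigabitEthernet0/1, changed state to down",
    "rtr1: %LINEPROTO-5-UPDOWN: Line protocol on Interface GigabitEthernet0/1, changed state to down",
    "sw2: fan 2 failure detected",
]

alerts = rare_event_alerts(messages)
for alert in alerts:
    print(alert["severity"], alert["message"])
```

Only the fan failure line produces an alert here, because it is the only message that falls through the filter of known events.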

I’ll cover additional requirements in the next post, Syslog Summary Script.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

