Real-Time Network Failure Detection

Author
Terry Slattery
Principal Architect

One of our customers had an interesting problem recently that caused a network outage in a critical part of the network. An interface blade was inserted into a Cisco Cat6500, but the blade didn’t pass diagnostics. The blade had been checked out in a lab switch, but for some reason, perhaps a bent pin, it was failing diagnostics in the operational switch chassis. The internal diagnostics detected the failure and shut down the blade. Unfortunately, a problem in the IOS meant that the internal hardware was not properly reset, and the forwarding engine stopped working. Strangely, though, the OSPF adjacencies didn’t die. That means that the 6500 could still send and receive its own packets; only forwarded packets were affected. The result was that the 6500 became a network black hole. Packets that should have transited the 6500 were silently discarded.

Because the interfaces were still up/up and routing adjacencies were maintained, the routing protocols in adjacent routers continued to include the 6500 as a valid next hop. Once the outage was reported and the path through the 6500 was traced, it didn’t take long to determine what had changed. A quick extraction of the card brought the 6500’s forwarding engine back to life.

Now I’m looking at how the network management system could have more quickly identified the problem. Note: I don’t think there would have been a way to prevent the problem, which seemed to originate with the hardware on the inserted interface card.

OSPF retained its adjacencies, and the interfaces were still up/up, so no syslog messages or SNMP traps were generated. Since packet origination and reception still worked for OSPF, utilities like BFD (Bidirectional Forwarding Detection) and UDLD (Unidirectional Link Detection) would not have detected a problem either. That leaves some type of reachability test.

Many network management systems include a ping-based reachability test, sourced from the network management system itself. The problem here is that the failure was on a path between two points in the network that could not have been tested with packets from the network management system. What tests run between routers and switches and can be controlled by a network management system?

Cisco’s built-in ping can be controlled via SNMP, and some network management systems contain functionality to set up a test and verify the results. However, there is no way for such a test to run and send a syslog message or SNMP trap when a failure occurs.

The other mechanism is Cisco’s IP SLA. Numerous network management systems can automate the management and data collection of IP SLA tests. Tests can be set up to run from any Cisco device whose IOS supports IP SLA to other Cisco devices or to user endpoints, using a variety of protocols, including ping, DNS, HTTP, and UDP. The network management system needs to be able to monitor the IP SLA results, either by polling them through SNMP or by receiving SNMP traps, if the IP SLA test is configured to send traps when defined operational thresholds are exceeded.
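On the management-system side, the polling variant comes down to comparing each collected result against a threshold. The following Python sketch shows that comparison only. The `ProbeResult` and `Thresholds` types, their field names, and the limit values are hypothetical; how the numbers are actually retrieved (SNMP polling or trap receipt) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One collected test result (hypothetical structure, not a Cisco API)."""
    name: str
    sent: int
    received: int
    jitter_ms: float


@dataclass
class Thresholds:
    """Illustrative limits; real values depend on the network."""
    max_loss_pct: float = 1.0
    max_jitter_ms: float = 30.0


def loss_pct(result):
    """Percentage of probe packets that were not answered."""
    if result.sent == 0:
        return 100.0  # treat "nothing sent" as a failed test
    return 100.0 * (result.sent - result.received) / result.sent


def evaluate(result, limits):
    """Return a list of alert strings; an empty list means the path is healthy."""
    alerts = []
    loss = loss_pct(result)
    if loss > limits.max_loss_pct:
        alerts.append(
            f"{result.name}: packet loss {loss:.1f}% exceeds {limits.max_loss_pct:.1f}%"
        )
    if result.jitter_ms > limits.max_jitter_ms:
        alerts.append(
            f"{result.name}: jitter {result.jitter_ms:.1f} ms exceeds {limits.max_jitter_ms:.1f} ms"
        )
    return alerts


limits = Thresholds()
healthy = ProbeResult("accA->accB", sent=10, received=10, jitter_ms=2.0)
black_hole = ProbeResult("accA->accB", sent=10, received=0, jitter_ms=0.0)

print(evaluate(healthy, limits))
print(evaluate(black_hole, limits))
```

A black hole like the one described here would show up as total packet loss on any test whose path transits the failed device.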

It isn’t necessary to configure a full mesh of tests. It is frequently possible to configure a small subset of tests across the network that will provide visibility into connectivity problems. Setting thresholds for packet loss and jitter can provide useful information about the health of each network path. In most medium to large enterprises that have reasonable network topologies, a few dozen tests should be sufficient for full testing. That’s a small enough number that it is reasonable to either use a network management tool to configure them, or to build a template and manually configure it on a few central routers. All that’s needed is the ability to generate alerts from the receipt of the SNMP traps that are generated when a threshold is crossed. Now you have a real-time network connectivity alerting tool.
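One way to pick such a subset, offered here as an illustration rather than as the method used with the customer, is a greedy set cover: repeatedly choose the candidate test whose path transits the most devices not yet covered. The candidate tests and the devices they transit in this Python sketch are invented, and a greedy choice is an approximation that is not guaranteed to find the smallest possible set.

```python
def choose_tests(candidates, must_cover):
    """Greedy set cover.

    candidates: dict mapping a test name to the set of devices its path transits.
    must_cover: devices whose forwarding should be exercised by at least one test.
    Returns (chosen test names in pick order, devices left uncovered).
    """
    remaining = set(must_cover)
    chosen = []
    while remaining:
        # sorted() makes tie-breaking deterministic: first name alphabetically wins.
        best = max(sorted(candidates), key=lambda name: len(candidates[name] & remaining))
        gained = candidates[best] & remaining
        if not gained:
            break  # no candidate covers anything that is still missing
        chosen.append(best)
        remaining -= gained
    return chosen, remaining


# Hypothetical candidate tests and the devices each one transits.
candidates = {
    "accA->accB": {"cat6500"},
    "accA->core1": {"distA"},
    "accB->core1": {"distB"},
    "accA->distB": {"distA", "core1"},
    "accB->distA": {"distB", "core1"},
}
must_cover = {"cat6500", "distA", "distB", "core1"}

chosen, uncovered = choose_tests(candidates, must_cover)
print("Tests to configure:", chosen)
print("Devices still uncovered:", sorted(uncovered))
```

Any device reported as uncovered is a signal that another candidate test is needed, for example one sourced from a different router.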

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html


 
