Do You Conduct Network Failure Testing?

Author
Terry Slattery
Principal Architect

At Enterprise Connect a few weeks ago, I contributed to the three-hour workshop on “QoS and Network Design for Converged Networks.” The focus of my presentation was network design and resilient network infrastructure. What does it mean for a network to be resilient? The Free OnLine Dictionary defines resilient as “marked by the ability to recover readily, as from misfortune.” A resilient network will continue to operate when a failure occurs, and a good design will continue to operate at reduced capacity when multiple failures occur. Most networks rely on redundant links and redundant devices as the primary method of implementing resiliency.

So you have a redundant network design. How do you know that your network is truly resilient? You have to test it!

I know that large financial firms perform failure testing. NetCraftsmen has created resilient designs and verified that they performed as desired. One example was a large financial network that required rapid failover to a backup data center, either under manual control or in the event of a failure in the main data center. Failover testing proved that the transition time was just a few seconds, and that the network could fail over without loss of business services.
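
A rough way to see that transition time for yourself is continuous probing: send a probe to a service behind the failover pair every half second and time the gap in responses. Here is a minimal sketch in Python, assuming a Linux host with the standard ping utility; the target name is a placeholder, not a detail from that engagement.

```python
#!/usr/bin/env python3
"""Rough sketch: measure failover time by probing a target continuously.

Assumes a Linux host with the standard 'ping' binary. The target name
is a placeholder for a service reachable through the failover pair.
"""
import subprocess
import time

TARGET = "app.example.com"   # hypothetical service behind the redundant path
INTERVAL = 0.5               # seconds between probes

def reachable(host: str) -> bool:
    """Return True if a single ICMP echo to the host succeeds."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0

outage_start = None
print(f"Probing {TARGET} every {INTERVAL}s; trigger the failover when ready.")
try:
    while True:
        if reachable(TARGET):
            if outage_start is not None:
                gap = time.monotonic() - outage_start
                print(f"Recovered after {gap:.1f} seconds")
                outage_start = None
        elif outage_start is None:
            outage_start = time.monotonic()
            print("Target unreachable; outage timer started.")
        time.sleep(INTERVAL)
except KeyboardInterrupt:
    pass
```

The measured gap slightly overstates the true failover time (by up to the probe interval plus the ping timeout), but it is good enough to distinguish a few seconds from a few minutes.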

We see a lot of network designs that incorporate redundancy, but very few organizations test whether the redundancy actually results in a resilient network. Most network teams look at the design, imagine failures, and hope that the outcome is what they imagine it will be. Unfortunately, things are often overlooked. What kinds of failures can be handled? What happens if power to the main data center is lost because of a failure of the main power grid, as has happened several times in the northeastern US? What happens if a natural disaster cuts off links to the main data center? What happens if one of the core devices in the data center fails or is misconfigured? (Misconfiguration is much more frequent.)
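
One low-effort way to keep those questions from being forgotten is to maintain an explicit inventory of the failures the design is supposed to survive and whether each has ever been tested. The sketch below is purely illustrative; the scenario names and fields are assumptions, not details of any particular design.

```python
"""Minimal sketch of a failure-scenario inventory (names are illustrative)."""
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scenario:
    description: str
    expected_behavior: str
    last_tested: Optional[str] = None   # date string, or None if never tested

SCENARIOS = [
    Scenario("Main data center loses grid power",
             "Services fail over to the backup data center"),
    Scenario("WAN links to the main data center are cut",
             "Traffic reroutes over secondary paths"),
    Scenario("Core device in the data center fails",
             "Redundant core carries the full load"),
    Scenario("Core device is misconfigured",
             "Change is detected and rolled back"),
]

untested = [s.description for s in SCENARIOS if s.last_tested is None]
print("Scenarios never tested:", untested)
```

An inventory like this makes the gap between “we designed for it” and “we have proven it” visible at a glance.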

I think more organizations should perform failure testing. If a lab is available that can replicate the network core, start with some basic failure testing there. Begin with a plan: what you will fail, what you expect to see, and what you actually find when you conduct the test. The test plan should list the types of failures that you expect to handle. Create a spanning tree loop or a routing loop and learn how to diagnose and fix it. Simulate a link or device failure. Then look at what other services must keep working for your business applications to continue running during a failure. In the production network, select a place where a failure would have limited impact, create a failure test plan, and execute it. Verify that what you expected is what you found, and clearly document anything you didn’t expect to see. Finally, make sure that your NMS detects and reports the failure and then shows the ‘all clear’ when it is corrected.
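
The test plan doesn’t need to be elaborate. A simple structured record per test, capturing what you failed, what you expected, what you observed, and whether the NMS alarmed and cleared, is enough to keep everyone honest. The field names below are invented for illustration:

```python
"""Sketch of a per-test record for manual failure testing (field names assumed)."""
from dataclasses import dataclass, field
from typing import List

@dataclass
class FailureTest:
    failure_injected: str              # e.g. shut down one distribution uplink
    expected: str                      # what the design says should happen
    observed: str = ""                 # filled in while running the test
    nms_detected: bool = False         # did the NMS alarm on the failure?
    nms_cleared: bool = False          # did it show 'all clear' after repair?
    surprises: List[str] = field(default_factory=list)   # anything unexpected

    def passed(self) -> bool:
        return (self.observed == self.expected
                and self.nms_detected
                and self.nms_cleared
                and not self.surprises)

test = FailureTest(
    failure_injected="Shut down distribution uplink A in the lab",
    expected="Traffic reconverges onto uplink B within 5 seconds",
)
# ... run the test, then record what actually happened ...
test.observed = "Traffic reconverges onto uplink B within 5 seconds"
test.nms_detected = True
test.nms_cleared = True
print("PASS" if test.passed() else "REVIEW", "-", test.failure_injected)
```

Comparing expected and observed as free text is obviously a simplification; the point is that both get written down.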

By starting with limited, low-impact tests, you become accustomed to the process of failure testing and gradually learn which services may be affected. You also learn how the NMS detects such failures, and you can tune how it detects, reports, and reacts when a failure occurs. When a key device fails, do you get flooded with alerts?
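
Alert flooding is a good example of something you only tune once you have seen it. One common approach is correlation: when a core device goes down, suppress the ‘down’ alerts for everything that is reachable only through it. The toy sketch below uses an invented alert format and dependency map just to show the idea; a real NMS supplies both.

```python
"""Toy sketch of collapsing an alert flood into a single incident."""
from collections import namedtuple

Alert = namedtuple("Alert", ["timestamp", "device", "event"])

# Hypothetical dependency map: which upstream device each node sits behind.
UPSTREAM = {
    "access-sw-01": "core-rtr-01",
    "access-sw-02": "core-rtr-01",
    "dist-sw-01": "core-rtr-01",
}

def correlate(alerts, window=30.0):
    """Split alerts into root-cause incidents and suppressed symptoms."""
    roots, suppressed = [], []
    down_at = {}   # device -> time it was reported down
    for a in sorted(alerts, key=lambda a: a.timestamp):
        if a.event != "down":
            roots.append(a)
            continue
        parent_down = down_at.get(UPSTREAM.get(a.device))
        if parent_down is not None and a.timestamp - parent_down <= window:
            suppressed.append(a)          # symptom of the upstream failure
        else:
            roots.append(a)               # its own incident
            down_at[a.device] = a.timestamp
    return roots, suppressed

if __name__ == "__main__":
    flood = [
        Alert(0.0, "core-rtr-01", "down"),
        Alert(2.5, "access-sw-01", "down"),
        Alert(3.1, "access-sw-02", "down"),
        Alert(4.0, "dist-sw-01", "down"),
    ]
    roots, suppressed = correlate(flood)
    print("Incidents:", [a.device for a in roots])        # ['core-rtr-01']
    print("Suppressed:", [a.device for a in suppressed])  # downstream symptoms
```

Whether this kind of logic lives in the NMS itself or in a script that post-processes its alerts matters less than knowing, before a real outage, what the flood looks like.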

Good failure testing requires an investment in time and planning, but it pays off in understanding how your network reacts to failures and lets you build the processes and tools to handle those failures quickly. I like to think of it as fire drills for the network team.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared on the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html
