At Enterprise Connect a few weeks ago, I contributed to the three-hour workshop on “QoS and Network Design for Converged Networks.” The focus of my presentation was network design and resilient network infrastructure. What does it mean for a network to be resilient? The Free OnLine Dictionary defines it as “marked by the ability to recover readily, as from misfortune.” A resilient network continues to operate when a failure occurs, and a good design continues to operate at reduced capacity when multiple failures occur. Most networks rely on redundant links and redundant devices as the primary means of implementing resiliency.
So you have a redundant network design. How do you know that your network is truly resilient? You have to test it!
I know that big financial firms perform failure testing. NetCraftsmen has created resilient designs and verified that they performed as desired. One example was a large financial network that required rapid failover to a backup data center, either under manual control or upon a failure in the main data center. Failover testing proved that the transition took only a few seconds, and that the network could fail over without the loss of business services.
We see a lot of network designs that incorporate redundancy. But very few organizations test whether that redundancy actually results in a resilient network. Most network teams look at the design, imagine failures, and hope that the outcome is what they imagine it will be. Unfortunately, things often get overlooked. What kinds of failure can be handled? What happens if power to the main data center is lost through a failure of the main power grid, as has happened several times in the northeastern US? What happens if a natural disaster cuts off links to the main data center? What happens if one of the core devices in the data center fails or is misconfigured? (Misconfiguration is much more frequent.)
I think more organizations should perform failure testing. If a lab is available that can replicate the network core, perform some basic failure testing there. Start with a plan: what you will fail, what you expect to see, and what you actually observe when you conduct the test. The test plan should list the types of failures that you expect to handle. Create a spanning tree loop or a routing loop and learn how to diagnose and fix it. Simulate a link or device failure. Then look at what other services must keep working for your business applications to continue running in a failure situation. Select a place in the network where a failure would have limited impact, create a failure test plan, and execute it. Verify that what you expected is what you found, and clearly document anything you didn’t expect to see. Make sure that your NMS detects and reports the failure and then shows the ‘all clear’ when it is corrected.
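The plan-then-verify loop described above can be captured in a simple checklist structure. Here is a minimal sketch in Python; the field names, failure descriptions, and expected outcomes are hypothetical illustrations, not drawn from any particular tool or test plan:

```python
from dataclasses import dataclass

@dataclass
class FailureTest:
    """One entry in a failure test plan (hypothetical structure)."""
    failure: str          # what you will fail, e.g. "pull uplink on access switch"
    expected: str         # what you expect to see
    observed: str = ""    # filled in during the test window
    passed: bool = False  # did the observation match the expectation?

    def record(self, observed: str) -> None:
        # Pass only if the outcome matches expectations; anything
        # unexpected must be documented, not discarded.
        self.observed = observed
        self.passed = (observed == self.expected)

# A tiny example plan with invented entries:
plan = [
    FailureTest("pull uplink on access switch",
                expected="traffic reroutes via redundant uplink in < 3 s"),
    FailureTest("power off secondary core switch",
                expected="no loss of business services"),
]

plan[0].record("traffic reroutes via redundant uplink in < 3 s")
plan[1].record("voice calls dropped for 40 s")  # an unexpected result to investigate

for test in plan:
    status = "PASS" if test.passed else "INVESTIGATE"
    print(f"{status}: {test.failure} -> {test.observed}")
```

Writing the expectation down before pulling the cable is the point: a test without a predicted outcome can only confirm what you already believe.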
By starting with limited, low-impact tests, you become accustomed to the process of failure testing and gradually learn which services may be affected. You also learn how the NMS detects such failures, and you can tune how it detects, reports, and reacts to them. When a key device fails, do you get flooded with alerts?
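One common way to tame an alert flood is to collapse bursts of related alerts into a single summary before they reach the on-call engineer. The sketch below is a minimal illustration of that idea, assuming a simple tuple format for alerts and a fixed deduplication window; it is not tied to any specific NMS:

```python
from collections import defaultdict

def collapse_alerts(alerts, window=60):
    """Collapse repeated alerts into one summary per (device, alert type)
    within a time window. `alerts` is a list of (timestamp_seconds, device,
    kind) tuples. Returns (first_timestamp, device, kind, count) summaries."""
    groups = defaultdict(list)  # (device, kind) -> list of [first_ts, count]
    for ts, device, kind in sorted(alerts):
        key = (device, kind)
        if groups[key] and ts - groups[key][-1][0] <= window:
            groups[key][-1][1] += 1        # fold into the open window
        else:
            groups[key].append([ts, 1])    # start a new window
    return sorted((ts, dev, kind, n)
                  for (dev, kind), spans in groups.items()
                  for ts, n in spans)

# A failed core device often triggers the same alert over and over,
# plus side effects from its neighbors (device names are invented):
flood = [(0, "core-1", "down"), (1, "core-1", "down"), (2, "core-1", "down"),
         (5, "edge-3", "neighbor-loss"), (300, "core-1", "down")]
print(collapse_alerts(flood, window=60))
# The three "down" alerts at t=0..2 collapse into one summary with count 3;
# the alert at t=300 falls outside the window and starts a new summary.
```

Failure testing is exactly when you find out whether your alerting behaves this way, since you can watch the flood happen under controlled conditions.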
Good failure testing requires an investment in time and planning, but it pays off in understanding how your network reacts to failures and lets you build the processes and tools to handle them quickly. I like to think of it as fire drills for network teams.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html