Network Stability Through Resilience Engineering

Terry Slattery
Principal Architect

Resilience Engineering

Resilience engineering is the practice of designing networks (and other systems, such as airplanes) so that failures are handled gracefully. But how do you know that your engineering efforts are successful? You can start with network emulators, which can teach you a lot about how your network is currently configured and how it responds to your tests. But ultimately, there’s really only one way to know: real-world testing.

When I say “real-world testing,” I mean that you actually have to turn off parts of the network and see whether the applications properly fail over to the redundant infrastructure. Halfway testing doesn’t work. See my associated article on how not to do failure testing: Resilience Engineering: Holy Grail of Business Continuity.
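As a minimal sketch of what "properly fail over" could mean in a test script, the Python below passes only when the primary endpoint is really unreachable and the secondary is answering. The endpoint names, the TCP-connect probe, and the function names are illustrative assumptions, not part of any particular tool.

```python
import socket
from typing import Callable, Tuple

Endpoint = Tuple[str, int]  # (host, port); hypothetical values are supplied by the caller


def tcp_probe(endpoint: Endpoint, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within the timeout."""
    try:
        with socket.create_connection(endpoint, timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def failover_succeeded(
    primary: Endpoint,
    secondary: Endpoint,
    probe: Callable[[Endpoint], bool] = tcp_probe,
) -> bool:
    """Check the outcome of a deliberate failure test.

    The test counts as a pass only if the primary is actually down
    (otherwise nothing was really tested) and the secondary answers.
    """
    primary_is_down = not probe(primary)
    secondary_is_up = probe(secondary)
    return primary_is_down and secondary_is_up
```

The probe is passed in as a parameter so the same check can be exercised with a fake probe before it is pointed at production. A TCP connect only shows that a port is open; an application-level health check would be a stronger signal.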

Application Resilience

True resilience may require application architecture changes. An application that can quickly switch between data centers is going to be much more resilient than an application that must be restarted or reconnected when a failure occurs. Unfortunately, software architecture changes are unlikely if you’re running software from a third party. You can get close with global server load balancing, although the details depend greatly on the application. Your best bet is to work with the vendor to understand what is possible in the short term and to identify how the vendor can improve the application’s resilience over the long term.
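To illustrate the difference between an application that switches data centers on its own and one that must be restarted, here is a small Python sketch of client-side failover. The site names and the request callable are hypothetical; a real application would also need to deal with session state and data consistency, which this sketch ignores.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


class AllSitesFailed(Exception):
    """Raised when no data center could serve the request."""


def call_with_failover(sites: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each data center in priority order and return the first successful result.

    `request` is a caller-supplied function that takes a site name and either
    returns a result or raises ConnectionError when that site is unreachable.
    """
    errors: List[str] = []
    for site in sites:
        try:
            return request(site)
        except ConnectionError as exc:
            # Remember why this site failed, then move on to the next one.
            errors.append(f"{site}: {exc}")
    raise AllSitesFailed("; ".join(errors))
```

With this pattern, losing the first site costs the user one failed attempt rather than a restart of the application.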

The Benefits of Testing

There are additional benefits from application testing. The IT team gets to practice troubleshooting and remediation processes. Did the failure get properly logged in the ticket system? Did all the processes work correctly? How long did it take to find and fix the failure, assuming that you’re simulating a real failure? Even if you’re not doing full failure testing, it is useful to see the symptoms of each type of failure and the diagnostic steps that uncover its cause.
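The questions above are easier to answer if each test run records a few timestamps. The following Python sketch shows one possible record; the field names are assumptions chosen for illustration and would need to match whatever your ticket system and monitoring tools actually report.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FailureTestRecord:
    """Timestamps captured during one failure test (hypothetical field names)."""

    injected_at: datetime   # when the failure was deliberately introduced
    detected_at: datetime   # when monitoring or staff first noticed it
    restored_at: datetime   # when service was confirmed healthy again
    ticket_opened_at: Optional[datetime] = None  # None if no ticket was logged

    def time_to_detect(self) -> timedelta:
        """How long the failure went unnoticed."""
        return self.detected_at - self.injected_at

    def time_to_restore(self) -> timedelta:
        """Total time from failure injection to restored service."""
        return self.restored_at - self.injected_at

    def was_ticketed(self) -> bool:
        """Whether the failure was logged in the ticket system at all."""
        return self.ticket_opened_at is not None
```

Comparing these values across repeated tests gives a simple way to see whether troubleshooting and remediation are improving with practice.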

The Risk of Testing

There will inevitably come a time when a failure test doesn’t go as planned, typically because some unforeseen part of the infrastructure also fails, or a dependency isn’t well understood. For example, you may think that your DNS server infrastructure is fully redundant, but for some reason it isn’t. Everything works as long as the primary data center is operational. But when you disconnect it to simulate a failure, the secondary DNS server is found lacking. Maybe it experienced an undetected failure, or perhaps it isn’t able to handle the full production load. There may also be problems with applications that suddenly go split-brain, where the instances on both sides are functioning but not communicating with each other. Map out all the failure tests that make sense and try some. I’ll bet that you find new corner cases that you didn’t anticipate.
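One way to catch an undetected failure in a secondary DNS server before a test exposes it is to query each server directly instead of relying on the resolver's normal fallback. The Python sketch below does this with the standard library only. It assumes IPv4 servers listening on UDP port 53, the server addresses and the test name are supplied by you, and it checks only that a server answers, not that it can carry the full production load.

```python
import socket
import struct
from typing import Callable, Dict, Iterable

QUERY_ID = 0x1234  # arbitrary identifier used to match the reply to the query


def build_a_query(name: str, query_id: int = QUERY_ID) -> bytes:
    """Build a minimal DNS query for the A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, 0 answer/authority/additional.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE = 1 (A record), QCLASS = 1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def server_answers(server_ip: str, name: str, timeout: float = 2.0) -> bool:
    """Return True if this specific server returns at least one answer for `name`."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_a_query(name), (server_ip, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:
            return False  # timeout, unreachable network, and similar errors
    if len(reply) < 12:
        return False
    reply_id, flags, _qdcount, ancount = struct.unpack("!HHHH", reply[:8])
    rcode = flags & 0x000F  # 0 means "no error"
    return reply_id == QUERY_ID and rcode == 0 and ancount > 0


def check_each_server(
    servers: Iterable[str],
    name: str,
    check: Callable[[str, str], bool] = server_answers,
) -> Dict[str, bool]:
    """Query every listed server independently and report which ones answered."""
    return {ip: check(ip, name) for ip in servers}
```

Running `check_each_server` against the full list of servers on a schedule, and not only during a failure test, helps surface a silent secondary failure while the primary is still carrying the load.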


Your organization’s management needs to understand the benefits and risks of testing. Propose that network and IT testing follow the lead of security penetration testing and phishing tests, which are becoming commonplace. Start with small tests and work up to larger, more impactful tests. Hold a briefing before each test to review what will happen and a debriefing afterward to review what you learned.

Good luck with your testing!



