Network Stability Through Resilience Engineering

Terry Slattery
Principal Architect

Resilience Engineering

Resilience engineering is the practice of designing networks (and other systems, such as airplanes) so that failures are handled gracefully. But how do you know that your engineering efforts are successful? You can start with network emulators, where you can learn a lot about how your network is currently configured and how it responds to the tests you run. But ultimately, there’s really only one way: real-world testing.

When I say “real-world testing,” I mean that you really have to turn off parts of the network and see whether the applications properly fail over to the redundant infrastructure. Halfway testing doesn’t work. See my associated article, Resilience Engineering: Holy Grail of Business Continuity, for how not to do failure testing.

Application Resilience

True resilience may require application architecture changes. An application that can quickly switch between data centers is going to be much more resilient than an application that must be restarted or reconnected when a failure occurs. Unfortunately, software architecture changes are unlikely if you’re running software from a third party. You can get close with global server load balancing, which steers clients toward a data center that is still healthy, although the details depend greatly on the application. Your best bet will be to work with the vendor to understand what is possible in the short term and to identify how the vendor can improve the application’s resilience over the long term.

The Benefits of Testing

There are additional benefits from application testing. The IT team gets to practice troubleshooting and remediation processes. Did the failure get properly logged in the ticket system? Did all the processes work correctly? How long did it take to find and fix the failure, assuming that you’re simulating a real failure? Even if you’re not doing full failure testing, it is useful to see the symptoms of each type of failure and the diagnostic steps that uncover its cause.
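
The questions above can be turned into a few numbers that you record after every test. The following is a minimal sketch, not a prescribed process: the `FailureTestTimeline` fields and the `summarize` helper are hypothetical names chosen for illustration, and it assumes that someone notes the four timestamps by hand or pulls them from the ticket system.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class FailureTestTimeline:
    """Timestamps noted during one failure test (hypothetical record layout)."""
    injected: datetime            # when the failure was introduced
    detected: datetime            # when monitoring or staff first noticed it
    ticketed: Optional[datetime]  # when a ticket was opened; None if never logged
    restored: datetime            # when service was confirmed healthy again


def summarize(timeline: FailureTestTimeline) -> dict:
    """Answer the debriefing questions: how long to find, how long to fix, was it logged."""
    return {
        "time_to_detect": timeline.detected - timeline.injected,
        "time_to_restore": timeline.restored - timeline.injected,
        "ticket_logged": timeline.ticketed is not None,
    }


if __name__ == "__main__":
    test = FailureTestTimeline(
        injected=datetime(2024, 1, 1, 10, 0),
        detected=datetime(2024, 1, 1, 10, 4),
        ticketed=None,
        restored=datetime(2024, 1, 1, 10, 30),
    )
    for name, value in summarize(test).items():
        print(f"{name}: {value}")
```

Comparing these figures from one test to the next gives the debriefing something concrete to discuss, and a missing ticket is itself a finding.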

The Risk of Testing

There will inevitably come a time when a failure test doesn’t go as planned, typically because some unforeseen part of the infrastructure also fails, or a dependency isn’t well understood. For example, you may think that your DNS server infrastructure is fully redundant, but for some reason it isn’t. Everything works as long as the primary data center is operational. But when you disconnect it to simulate a failure, the secondary DNS server is found lacking. Maybe it experienced an undetected failure or perhaps it isn’t able to handle the full production load. There may also be problems with applications that suddenly go split-brain, where the instances in both data centers keep functioning but can’t communicate with each other. Map out all the failure tests that make sense and try some. I’ll bet that you find new corner cases that you didn’t anticipate.
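
Before pulling a real cable, it can help to map the dependencies on paper and walk through each planned failure. The sketch below is one hedged way to do that, under stated assumptions: the `available` function, the component names, and the dependency map are all hypothetical, and the map deliberately includes a secondary DNS server that quietly depends on the primary data center, in the spirit of the example above. A model like this only finds the dependencies you already wrote down; it complements real-world testing and does not replace it.

```python
def available(service, deps, failed, _seen=frozenset()):
    """Return True if `service` can still run while everything in `failed` is down.

    `deps` maps a component to a list of requirement groups. Each group lists
    interchangeable (redundant) components, and at least one member of every
    group must itself be available. Components absent from `deps` have no
    dependencies. `_seen` guards against circular dependencies.
    """
    if service in failed or service in _seen:
        return False
    seen = _seen | {service}
    return all(
        any(available(member, deps, failed, seen) for member in group)
        for group in deps.get(service, [])
    )


# Hypothetical dependency map for illustration only.
DEPS = {
    "app": [["dns-primary", "dns-secondary"], ["db-dc1", "db-dc2"]],
    "dns-primary": [["dc1"]],
    "dns-secondary": [["dc1"]],  # the hidden flaw: the "redundant" DNS also sits in dc1
    "db-dc1": [["dc1"]],
    "db-dc2": [["dc2"]],
}

if __name__ == "__main__":
    for scenario in [set(), {"dns-primary"}, {"dc1"}, {"dc2"}]:
        status = "up" if available("app", DEPS, scenario) else "DOWN"
        print(f"failed={sorted(scenario)} -> app is {status}")
```

In this made-up map, losing only the primary DNS server is survivable, but disconnecting the primary data center takes the application down because both DNS servers go with it.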


Your organization’s management needs to understand the benefits and risks of testing. Propose that network and IT testing follow the lead of security penetration testing and phishing tests, which are becoming commonplace. Start with small tests and work up to larger, more impactful tests. Perform pre- and post-test debriefings to review what will happen and what you learned.

Good luck with your testing!



