What is the future of Disaster Recovery? That’s the obvious question after my blog on Improving Disaster Recovery, at https://netcraftsmen.com/blogs/entry/improving-disaster-recovery.html. To recapitulate, that blog concluded that by doing datacenter basics well, we can help ourselves out in terms of present performance and troubleshooting, while easing our preparations for DR/COOP. We also gain by reducing complexity, specifically by going to a limited set of application and network services architectures. After all, the fewer variants we have to wrap our brains around for DR/COOP purposes, the more likely we are to get them right, and the more likely we’ll be able to find the time to do them well.
This blog looks at new techniques that might be useful for BC/DR (Business Continuity / Disaster Recovery). There is so much to say that there will be a follow-on Part 2 blog to complete the discussion, and I’m only trying to scratch the surface of this topic.
New Tools for DR?
Looking at new / recent technologies, we have some tools which might help us change the game, do a more comprehensive and effective job of DR/COOP planning, and avoid the “DR Dance of Due Diligence.” (See the above blog for an explanation of this term I coined.)
These new tools include:
- Enterprise Load Balancers (“Application Delivery Controller” or ADC) — OK, old tool, but still not that commonly used for DR
- Network technologies for Layer 2 Data Center Interconnect (DCI, be it OTV, EoMPLS, VPLS).
- VMware / Cisco VXLAN
- VMware capabilities (basic)
- VMware advanced capabilities (in particular Fault Tolerance and Site Recovery Manager)
- Storage-based solutions (active-active or active-cached storage)
- (Am I missing anything else here?)
It is always helpful to examine one’s assumptions. Wrong conclusions may be due to an incorrect starting point. Anyway, the following are my assumptions that seem relevant to DR/COOP/failover.
- Test it or your failover or spare won’t work. Using your spare actively for something greatly increases the odds failover to it will work when needed. Dare I say “use it or lose it”?
- Layer 2 is simple, but becomes evil when carried too far.
- Clustering or stateful failover technologies are never perfect … too many programmers code and test on LAN, but WAN and dark fiber have other failure modes. See also RFC 3439.
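As a small illustration of the “test it” assumption, here is a minimal sketch of a reachability check you might run against standby services during a drill. The function names and the endpoint list are hypothetical, and a successful TCP connect only shows that something is listening; it says nothing about whether the application behind it actually works.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def drill_report(endpoints):
    """Map each (host, port) standby endpoint to its reachability result."""
    return {(host, port): tcp_reachable(host, port) for host, port in endpoints}


if __name__ == "__main__":
    # Hypothetical DR-site endpoints; substitute your own.
    standby = [("dr-web.example.com", 443), ("dr-db.example.com", 5432)]
    for (host, port), ok in drill_report(standby).items():
        print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

Run regularly, a check like this is one cheap way to notice that a spare has quietly stopped being usable before you actually need it.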
Doing It The Old Way
A lot of sites use or used to use the old “the IP subnets are now over here” trick. Bring up the replacement servers, activate / address the connecting router or switch interface(s), using dynamic routing internally and externally to direct traffic to the DR site servers. Main drawback: most sites do it with manual intervention, and the RTO is generally on the order of hours to days involving shuffling hardware, restoring server backups, and all sorts of activity.
One thing I’ve learned is how little we can troubleshoot quickly. If you’re doing a datacenter move, DR, whatever, and one thing goes wrong, you can probably fix it, maybe in a fairly short period of time. When you have tens or hundreds of servers, switch connections, etc., if more than a couple of things don’t work, it may take you quite a while to work through them all, especially if you let yelling management or server people cause you to thrash (as in swap back and forth between problems rather than tackling them one at a time).
With all the changes being made in the DR environment, in a hurry to hit the RTO objective, what is the probability that everything will happen correctly? I suspect the answer is “low”, and that many little problems will accumulate (cabling, port config errors, etc.), on top of server / data recovery issues. To me, the more self-contained things are (and ready to roll), the more likely they are to work in a DR situation. Murphy’s Law will still be present, but to a lesser degree.
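To put a rough number on that intuition, here is a minimal sketch. It assumes every change (a cable, a port config, a server restore) succeeds independently and with the same probability, which is a simplification. The 99% figure and the function name are illustrative assumptions, not measured data.

```python
def probability_all_succeed(per_step_success: float, steps: int) -> float:
    """Chance that every one of `steps` independent changes works."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must not be negative")
    # Independent events: multiply the per-step probabilities together.
    return per_step_success ** steps


if __name__ == "__main__":
    # Assumed 99% success per change; vary the number of changes.
    for steps in (10, 50, 100, 200):
        p = probability_all_succeed(0.99, steps)
        print(f"{steps:>3} changes at 99% each: {p:.0%} chance nothing goes wrong")
```

Under these assumptions, 100 changes that are each 99% reliable leave only about a 37% chance of a completely clean cutover, which is the arithmetic behind preferring self-contained, ready-to-roll DR setups with fewer moving parts.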
What Does L2 Data Center Interconnect Add?
DCI techniques (OTV, A-VPLS, EoMPLS, etc.) allow us to have a split subnet, one subnet or VLAN that is bridged between two locations.
The first price of bridging L2 between data centers in any form is that BUM (Broadcast, Unknown Unicast, Multicast) traffic must be transported anywhere in the Spanning Tree Protocol (STP) domain. By the way, you do have free WAN bandwidth, so lots of BUM traffic doesn’t matter?
BUM traffic relates to what happens when STP fails and a loop occurs (aka “evil”): you get recycled BUM traffic too. The STP or bridged domain becomes the failure domain. It’s bad enough having an entire datacenter, hospital, etc. down for hours (or days). When your backup datacenter goes down as well, that’s not only business impact, it’s a Time for New Resume and Job event. See also my recent blog OTV Best Practices for defensive measures you should take.
Balance this against convenience. From a server / management perspective, building a Microsoft or other cluster split across two sites may look like a simple way to do DR. So may VMotion or related server / database snapshotting. But consider (a) sub-optimal traffic flows and how to resolve them, (b) integration with stateful firewalls, load balancers, etc., and (c) the possibility of “shared fate”, i.e. something at one datacenter taking down its clone at the other datacenter. All of a sudden it looks like one of those easy-for-the-server-team and hard-from-the-network-perspective things.
As far as failover goes, some of us have experienced problems with stateful failover. I’ve personally seen it with CheckPoint firewall pairs “losing it” when there is a bouncing interconnect: they mutually corrupted each other’s rules databases. You do have a backup copy of the config you can blow back into the firewalls at 3 AM? Assuming enough of the rules survived so that you can remotely access them?
Rumor says Cisco may soon have the ability to “cluster” ASA firewalls. How do you feel about clustering between two datacenters? Rumor says that is not a recommended practice. Heck, I’ll recommend not doing it. It has the potential to be a lot simpler … at least until you have a bad day. Then it’s not so simple.
So DCI offers some interesting possibilities, and my (and others’) main concern is shared failure modes.
This approach does potentially meet the “simpler DR alternative” criterion. If key VLANs are in both datacenters, and if you’re using VMotion-based techniques, then there’s a lot less change happening in your network. What I can see being a potential subtle problem is in application dependencies and unanticipated latency changes impacting critical applications, e.g. if DR is split across a couple of datacenters or cloud.
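To get a feel for the latency concern, here is a back-of-the-envelope sketch. It counts only propagation delay in fiber (roughly 200,000 km/s, so about 1 ms of round-trip time per 100 km of fiber path) and ignores queuing, serialization, and device latency. The distance and the round-trip count are hypothetical inputs, and the function name is mine.

```python
FIBER_KM_PER_MS = 200.0  # approximate propagation speed in fiber: ~200,000 km/s


def added_latency_ms(fiber_km: float, round_trips: int) -> float:
    """Extra propagation delay when each round trip crosses the inter-site link."""
    if fiber_km < 0 or round_trips < 0:
        raise ValueError("fiber_km and round_trips must not be negative")
    one_way_ms = fiber_km / FIBER_KM_PER_MS
    return 2 * one_way_ms * round_trips


if __name__ == "__main__":
    # Hypothetical case: app server at one site, its database at the other,
    # 100 km of fiber apart, 200 sequential queries per user transaction.
    extra = added_latency_ms(fiber_km=100, round_trips=200)
    print(f"Added delay per transaction: {extra:.0f} ms")
```

In that hypothetical case a chatty application picks up about 200 ms per transaction purely from distance, which is the kind of unanticipated change that shows up only after an application dependency ends up split across sites.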
I have VXLAN high on my list of topics to write about. For the purposes of this discussion, consider VXLAN to be a VMware- or 1000v-based DCI technique, sort of like VMware-based OTV. Like OTV, VXLAN allows you to split subnets across datacenters. It is another L2-tunneling-over-L3 technique.
Using VXLAN therefore implies some degree of BUM traffic radiation. It is likely somewhat chattier on your WAN / dark fiber than OTV is.
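To make “L2 tunneling over L3” a bit more concrete, here is a minimal sketch of the 8-byte VXLAN header and the per-frame encapsulation overhead. It assumes the header layout as later published in RFC 7348 and an IPv4 outer header without an outer VLAN tag. The function names are hypothetical, and this is not how any particular vSwitch implements it.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # "I" flag: the VNI field is valid

# Bytes added in front of the original Ethernet frame (IPv4, no outer VLAN tag).
OUTER_ETHERNET = 14
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8


def build_vxlan_header(vni: int) -> bytes:
    """Pack the 8-byte VXLAN header for a 24-bit VXLAN Network Identifier."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    # Word 1: flags in the top byte, remaining 24 bits reserved (zero).
    # Word 2: VNI in the top 24 bits, low 8 bits reserved (zero).
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def parse_vni(header: bytes) -> int:
    """Extract the VNI from a VXLAN header, checking the valid-VNI flag."""
    flags_word, vni_word = struct.unpack("!II", header[:8])
    if not (flags_word >> 24) & VXLAN_FLAG_VNI_VALID:
        raise ValueError("VNI-valid flag is not set")
    return vni_word >> 8


def encapsulation_overhead() -> int:
    """Total bytes added to each tunneled frame."""
    return OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER


if __name__ == "__main__":
    header = build_vxlan_header(5000)
    print(header.hex(), parse_vni(header), encapsulation_overhead())
```

Every tunneled frame, BUM traffic included, carries that extra 50 bytes across the WAN or dark fiber, so the transport MTU needs to allow for it.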
The biggest current concern with VXLAN is getting traffic into or out of the VXLAN (distributed VLAN). Currently that requires a VXLAN gateway. The ones I’ve heard of to date are vShield Edge (VSE) and ASA 1000v. (See also http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps9902/white_paper_c11-685115.html.)
There is no way at present to have redundant VSE gateways for a VXLAN. ASA 1000v gatewaying probably allows stateful failover. Google search is not turning up any documents confirming that, however. To me, that rules it out as a DR candidate in any form. It also appears to suffer from the same sub-optimal traffic or tromboning problems as the DCI techniques (above). BZZT! I end up thinking OTV is a more mature solution, and the nicest (in the sense of least nasty) of the DCI variants.
Please do comment if you agree or disagree. I’d especially welcome other perspectives that reach different conclusions than I do, or technologies I missed above or in the coming Part 2 (technologies listed above)!
RFC 3439 on Simplicity is always relevant: http://tools.ietf.org/html/rfc3439
Concerning testing, I like the Netflix idea of deliberately failing to make sure things will keep working. See also http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html and http://www.codinghorror.com/blog/2011/04/working-with-the-chaos-monkey.html.
See also one of my prior blogs, Cloud and Latency, at https://netcraftsmen.com/blogs/entry/cloud-and-latency.html.
For OTV and Data Center Interconnect (DCI), see my prior blogs. One way to find them: google search “otv site:netcraftsmen.net”, which gets you to https://www.google.com/search?q=otv+site:netcraftsmen.net. Or the just-posted OTV Best Practices, at https://netcraftsmen.com/blogs/entry/otv-best-practices.html, which contains some links to NetCraftsmen blogs. My posted CMUG sessions have some VXLAN coverage. As noted above, I’ve been meaning to blog about VXLAN, but too many topics, too little time…
I’d like to list specific Ivan Pepelnjak blogs relevant to L2 failover, Data Center Interconnect, VXLAN, sub-optimal paths, failure domains, and lower risk DR techniques. They’re all great reading. However, the task is daunting: too many blogs, too little time. Go read ipspace.net, especially the blogs. I used to worry that Ivan and I were becoming grumpy old men, too questioning of new technology. We may still both be grumpy about some new technologies, but I’ve stopped worrying about it.
Here are some good relevant Ivan P. blogs that google search helped me rediscover:
For DCI in general, start at Cisco’s technology page, http://www.cisco.com/en/US/netsol/ns975/index.html. You can drill down from there to OTV and other resources. Networkers presentations are another great source of fresh DCI info!