The New Disaster Recovery — Part 2 - NetCraftsmen, a BlueAlly Company

I’ve been blogging about Disaster Recovery, most recently about new techniques that might be used for BC/DR (Business Continuity / Disaster Recovery). Prior blogs on the topic: Improving Disaster Recovery and The New Disaster Recovery — Part 1. This blog is Part 2, briefly discussing some new (and old) techniques that one might use for BC/DR. See Part 1 for the list of relevant techniques / technologies.

Load Balancers and DR

We start with a classic (but one that doesn’t seem to get as much use as it should).

Cisco GSS or other vendors’ GSLB-type solutions allow you to use Load Balancers (Application Delivery Controllers) to control failover between sites. If your applications can be written to allow active servers at both datacenters, you can then be constantly using servers at both sites. This meets the “testing” axiom in the Part 1 blog.

My current inclination (preference?) is to use a GSLB and SLB / ADC solution for DR/COOP between datacenters. The relevant point is that while it introduces some mild complexity, it tends to minimize the shared fate between datacenters. Yes, your staff has to get the Global Site Load Balancing and Load Balancing configurations right, but once that’s done and tested, it’s fairly cookie cutter. To maintain optimal flows, you probably want a VIP for each application / datacenter combination, i.e. one or several VIP subnets per datacenter.
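To make the "VIP per application / datacenter combination" idea concrete, here is a minimal sketch of the decision a GSLB makes when answering a DNS query. It is not any vendor's algorithm: the SiteVip record, the gslb_answer function, the site names, and the least-connections ordering are all assumptions chosen for illustration, and the addresses come from documentation ranges.

```python
from dataclasses import dataclass


@dataclass
class SiteVip:
    """One VIP for one application at one datacenter (hypothetical record)."""
    site: str
    vip: str
    healthy: bool            # result of the GSLB's health probe
    active_connections: int  # load reported by the local SLB / ADC


def gslb_answer(vips):
    """Return the VIPs of healthy sites, least loaded first.

    An empty list means no site can currently serve the application.
    """
    healthy = [v for v in vips if v.healthy]
    return [v.vip for v in sorted(healthy, key=lambda v: v.active_connections)]


# Hypothetical application with one VIP in each of two datacenters.
app_vips = [
    SiteVip("dc-east", "192.0.2.10", True, 120),
    SiteVip("dc-west", "198.51.100.10", True, 80),
]

print(gslb_answer(app_vips))   # ['198.51.100.10', '192.0.2.10']

# dc-west fails its health probe: the GSLB simply stops handing out that VIP.
app_vips[1].healthy = False
print(gslb_answer(app_vips))   # ['192.0.2.10']
```

The point of the sketch is how little state crosses between sites: each site only needs to be probed, and the failover decision is a filter on the list of VIPs.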

One argument for the GSLB / SLB approach is simplicity. Very little changes. You’re running with the same everything, it’s just that some of your servers have gone away and the GSLB and SLB deal with that. Potential issue: the classic dual {server, network, resource} setting where nobody notices that both are running at, say, 75% of capacity until a failure / DR takes place and the surviving resource is trying to run at 150% of its max capacity. (This seems to happen a bit with dual WAN connections to a second datacenter, for example.)
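The 75% / 150% arithmetic is easy to check ahead of time. The following is a small sketch, assuming the failed member's load lands entirely on the survivors; the function names and the sample numbers are illustrative, not taken from any monitoring tool.

```python
def surviving_utilization(loads, capacities, failed):
    """Utilization of the survivors (1.0 == 100%) if member `failed` is lost.

    loads and capacities are per-member values in the same unit
    (Mbps, requests/sec, ...). All load is assumed to move to the survivors.
    """
    surviving_capacity = sum(c for i, c in enumerate(capacities) if i != failed)
    if surviving_capacity == 0:
        raise ValueError("no surviving capacity")
    return sum(loads) / surviving_capacity


def max_safe_utilization(members, tolerated_failures=1):
    """Highest per-member utilization that still survives the given failures,
    assuming equally sized members."""
    return (members - tolerated_failures) / members


# Two equal links, each running at 75% of a 100 Mbps capacity.
print(surviving_utilization([75, 75], [100, 100], failed=0))  # 1.5, i.e. 150%
print(max_safe_utilization(2))                                # 0.5, i.e. 50%
```

For a dual setup, anything above 50% per member means the survivor is oversubscribed after a failure, which makes 50% a natural alerting threshold.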

In writing this I realized there is a good question lurking: with GSLB you have some sort of availability tracking and state mechanism between sites. (NetScaler for instance can tunnel traffic arriving at one GSLB to the one at the other site.) With split-datacenter clustering of ASA or ACE, you have something similar. So why not just do the latter?

Answering my own question after some superficial analysis, my thought is that with GSLB you’re tracking less state and you’re handling TCP flow connections rather than packet by packet (unless failover triggers tunneling, anyway). Furthermore, the GSLB devices are standalone as to configuration and behavior, whereas clustered firewalls or SLBs are sharing state and possibly configurations as well. Propagation of configuration information is where I’ve seen problems in the past with CheckPoint firewalls. In short, GSLB seems to have less shared state.

I suspect deeper analysis might mean thinking through the details of how each vendor’s SLB products handle redirection of traffic to the other datacenter.

What do you think? Can you share some experience or knowledge about any of this in a comment?

Basic VMware

In googling around, I found some apparently older (but still valid) VMware documents talking about how VMware virtualization makes doing DR easier (my summary of what the sales document says). With older DR techniques, you end up with clones of hardware at a very detailed level (down to NIC cards and drivers). Or “shoe-horning”, where one big server / mainframe is to be reconstituted by splitting its workload across a couple of smaller machines at the DR site. That strikes me as requiring a lot of attention to detail and planning.

By contrast, at least for x86-based OSs and apps, VMware or another hypervisor abstracts VMs from the actual hardware, so that instead of managing physical servers / NICs / CPU / RAM / disk at the per-server level, you get the aggregation benefit of doing it at the hypervisor host system level. If you figure 12 to 50+ VMs per hypervisor host system, that’s a large reduction in managed entities, and clearly a fairly big win. There are fewer things to manage and get right in a DR scenario, so you can either handle more systems, free up time (staff) for other things, or be up and running faster.
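As a back-of-the-envelope illustration of that reduction, the sketch below counts how many hypervisor hosts stand in for a given number of servers at the two densities mentioned above. The figure of 600 servers is an assumption picked only to make the numbers round.

```python
def hosts_needed(vm_count, vms_per_host):
    """Number of hypervisor hosts needed for vm_count VMs (rounded up)."""
    return -(-vm_count // vms_per_host)  # ceiling division on integers


servers = 600  # hypothetical number of workloads to protect

for density in (12, 50):
    hosts = hosts_needed(servers, density)
    print(f"{density} VMs/host: {hosts} hosts instead of {servers} servers")
# 12 VMs/host: 50 hosts instead of 600 servers
# 50 VMs/host: 12 hosts instead of 600 servers
```

The VMs themselves still have to be tracked, of course; what shrinks is the amount of physical hardware that must be matched at the DR site.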

Interpolating a bit, if you set up hypervisor hosts based on application / VM criticality, that might help you manage VM snapshotting and data replication, potentially using tools VMware would love to sell you.

The VMware vmbook (see URL below) suggests alternatives for your overall approach, with VM retention of IP address or with IP re-addressing / alternative NIC activation. It looked like it might be useful reading even if you don’t use VMware.

Advanced VMware

This category includes VMware Fault Tolerance and SRM. (I’m sure if I missed another VMware solution, someone will comment. Ditto for other hypervisors.)

Some people think VMotion and related tools are a panacea (cure all). You might find Ivan Pepelnjak’s blog about long distance VMotion for disaster (see References below) interesting. Maintaining state between a running VM and a stored snapshot can require a lot of bandwidth. And you still have to have some replication mechanism for any back-end storage used by the VM. Worse, even if you do invest in the bandwidth (which also supports an “evacuate the main datacenter” approach), there’s the question of software crashes. If you fully replicate VM state and the primary VM crashes, and the state that caused the crash gets replicated to the snapshot copy that then “wakes up”, might the replica not crash as well? (This leaves me with the thought that some very small lag in state replication might be useful in preventing such problems.)
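The "small lag" thought can be sketched as a replica that holds incoming state updates for a short window before applying them, and throws away whatever is still in the window when it takes over. This is purely a thought experiment in code, not how VMware Fault Tolerance or SRM work; the LaggedReplica class and its methods are hypothetical.

```python
from collections import deque


class LaggedReplica:
    """Replica that applies state updates only after they are lag_seconds old."""

    def __init__(self, lag_seconds):
        self.lag = lag_seconds
        self.pending = deque()  # (timestamp, state) pairs not yet applied
        self.applied = None     # last state actually applied

    def receive(self, timestamp, state):
        """Queue an update sent by the primary at `timestamp`."""
        self.pending.append((timestamp, state))

    def advance(self, now):
        """Apply every queued update that is at least lag_seconds old."""
        while self.pending and now - self.pending[0][0] >= self.lag:
            _, self.applied = self.pending.popleft()

    def fail_over(self):
        """Primary crashed: drop unapplied updates, resume from last applied state."""
        self.pending.clear()
        return self.applied


replica = LaggedReplica(lag_seconds=2.0)
replica.receive(0.0, "good-1")
replica.receive(1.0, "good-2")
replica.receive(2.5, "state-that-crashed-primary")
replica.advance(now=3.0)    # only the updates from t=0.0 and t=1.0 are old enough
print(replica.fail_over())  # good-2
```

The trade-off is visible in the last line: the replica avoids the bad state, but it also loses any legitimate updates that were still inside the lag window, so the lag directly becomes lost data at failover.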

Let’s summarize this item with: VMware SRM manages a lot of the process of replication and recovery for you. See the URL below.

VMware SRM: http://www.vmware.com/files/pdf/products/SRM/VMware-vCenter-Site-Recovery-Manager-with-vSphere-Replication-Datasheet.pdf

Storage Based DR

I ran across an interesting document from EMC. I imagine other SAN vendors may have similar solutions. What caught my eye was the notion of interaction with VMware: when you do VMotion, you always have the question of which SAN devices the VM is interacting with, and whether you tie VMotion of the VM to Storage VMotion to localize storage I/O. If you have SAN caching, then conceivably that simplifies the “where’s my storage” aspect of this.

Quoting the best practices guide: “VPLEX uses a unique clustering architecture to help customers break the boundaries of the data center and allow servers at multiple data centers to have concurrent read and write access to shared block storage devices.”

Paraphrasing: EMC VPLEX GeoSynchrony allows access to a single copy of data in different datacenters, supporting transparent migration of running virtual machines between data centers. It is a SAN-based approach that provides local and distributed federation of storage resources. The AccessAnywhere technology enables a single copy of data to be shared, accessed, and relocated over distance.

Other words: … scale-out clustering … advanced data caching with SDRAM cache … distributed cache coherence.

Reading between the lines: VPLEX Metro is for sites within 5 msec round trip time (synchronous replication). VPLEX Geo is the longer-distance version, and does not support live VMotion.
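To get a feel for what a 5 msec round trip means in distance, here is a rough sketch. It assumes light travels about 200 km per millisecond in fiber (roughly two thirds of its speed in a vacuum) and ignores equipment and queuing delay, so real-world RTTs will be higher. The function names are made up for illustration, and any upper limit that applies to VPLEX Geo is not modelled.

```python
FIBER_KM_PER_MS = 200.0  # approximate one-way propagation speed in fiber


def min_rtt_ms(fiber_path_km):
    """Best-case round trip time over a fiber path of the given length."""
    return 2 * fiber_path_km / FIBER_KM_PER_MS


def vplex_variant(rtt_ms, metro_limit_ms=5.0):
    """'Metro' if the RTT fits the synchronous limit quoted above, else 'Geo'."""
    return "Metro" if rtt_ms <= metro_limit_ms else "Geo"


for km in (100, 500, 1200):
    rtt = min_rtt_ms(km)
    print(f"{km} km of fiber: at least {rtt} ms RTT -> {vplex_variant(rtt)}")
# 100 km of fiber: at least 1.0 ms RTT -> Metro
# 500 km of fiber: at least 5.0 ms RTT -> Metro
# 1200 km of fiber: at least 12.0 ms RTT -> Geo
```

Since fiber rarely follows a straight line and every device adds latency, it is better to measure the actual RTT between sites than to estimate it from map distance.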

I’d want a good understanding of how the replication and caching works under failure / recovery conditions. If my DR site was in the VPLEX Geo distance range, I’d also want to understand the impact of failing over (cache performance under failover conditions, how do I avoid losing data that didn’t yet get replicated, etc.).

See the document Using VMware vSphere with EMC VPLEX. Also interesting:
Vblock, VPLEX and VDI and VMware View Disaster Recovery Scenarios & Options.

Summary / Conclusions

The new technologies listed above open up some exciting opportunities to improve DR practices and greatly speed up recovery time (important since the global Internet is 24 x 7).

DR / COOP is complicated. COOP brings with it other issues, such as hotelling staff / staff access, as well.

As you can see from the above, I have lots of possible answers. Which is best? It depends. As I have blogged elsewhere, I’m a big fan of simplicity. Even with that as a major criterion, there are a couple of good answers possible.

I do get the impression that getting out in front of the DR issue might be good for organizations, in terms of coming up with a limited set of application architectures along with DR strategy for each. Otherwise you end up with the various ad hoc techniques each application developer, internal administrator, and/or project manager came up with. If every application is a one-off, then your DR planning will be very complicated and costly — and probably not work so well when needed. If you have only a couple of application recovery scenarios, then you can do more thorough planning, executing the plan will be simpler, and you’ll be up and running faster, probably with fewer problems along the way.

Comments Please

Please do comment if you agree or disagree. I’d especially welcome other perspectives, different opinions, conclusions that differ from mine, or technologies I missed above! Also tell us what works well for you. If we share best practices and successes, we all benefit!