Disaster recovery is one of those important things that seldom seems to get the attention it deserves. My experience is that most large organizations, and some smaller organizations, have disaster recovery plans that at least get partial testing.
But then there are small to medium-sized organizations that store backups in some form offsite but have not yet gotten around to having a DR site, for financial or other reasons. The NetCraftsmen team recently rallied to support a customer in exactly that position whose luck ran out (a transformer blew, causing water and smoke damage).
Subsequently, I had the opportunity to discuss priorities and overall uptime and DR strategy with a regional hospital customer, and the topic arose there. There were many Single Points of Failure and other things for the hospital team to address. Compared to that list of urgent actions, setting up an actual DR site looked like it might be a lower priority. We had a good discussion, and I’d like to share some of my thoughts coming out of it.
Minimal “ad hoc” DR plans seem to come down to:
- Protect data with some form of offsite backup
- If the worst happens, we’ll find some space, write a big check for gear, and be back up and running in a few days
Well, that’s all well and good, but what about lead time? Specifically, if you don’t already have enough space for racks, servers, storage, and network gear, it can take time to find suitable space, get funding approved, sign the lease, get the space prepped for occupancy, etc. Similarly, if that space doesn’t already have sufficient power for all that gear, it can take time to get power into the space, get the electrical work done, buy and install UPS systems and PDUs, etc. Ditto HVAC – all that power will generate heat.
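To put a rough number on that last point, here’s a back-of-envelope conversion from IT load to cooling requirement. The 20 kW figure is just an illustrative placeholder, not anything from a real site.

```python
# Quick sizing reminder: nearly every watt delivered to the racks comes back out
# as heat that the HVAC must remove. The load figure below is a placeholder.

it_load_kw = 20                    # projected DR-site IT load (substitute your own)
btu_per_hr = it_load_kw * 3412     # 1 kW is roughly 3,412 BTU/hr of heat
cooling_tons = btu_per_hr / 12000  # 1 ton of cooling = 12,000 BTU/hr

print(f"{it_load_kw} kW of gear -> ~{btu_per_hr:,} BTU/hr -> ~{cooling_tons:.1f} tons of cooling")
# Add headroom for UPS/PDU losses, lighting, people, and growth before sizing HVAC.
```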
There’s also connectivity. If you have customers, you probably need Internet connectivity for web services. You’ll probably also need WAN or MAN connectivity from the new DR site back to the main building, or to the WAN/MAN linking your other sites. Since “service provider provisioning” is synonymous with “slow” (if not “very slow”), that’s more lead time. You might be able to simplify things by being prepared to use VPN over the Internet for your DR WAN, assuming your remote sites have Internet routers or firewalls with adequate VPN capability and capacity.
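Before committing to Internet VPN as the DR WAN, it’s worth a back-of-envelope check that the uplinks and VPN devices can actually carry the replication or backup traffic. The change rate and window below are made-up placeholders.

```python
# Back-of-envelope check: can an Internet VPN carry your DR replication traffic?
# The numbers below are illustrative placeholders -- substitute your own.

daily_change_gb = 500       # data per day that must reach the DR/backup site (GB)
replication_window_hr = 8   # hours per day available to move that data
overhead_factor = 1.15      # rough allowance for IPsec/TCP overhead (~15%)

required_mbps = daily_change_gb * 8000 / (replication_window_hr * 3600) * overhead_factor
print(f"Sustained VPN throughput needed: ~{required_mbps:.0f} Mbps")
# Compare that (~160 Mbps here) against the remote site's uplink speed and the
# VPN throughput rating of its router or firewall.
```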
I recommend getting the long lead-time tasks knocked out in advance, even if you aren’t going any further. If you choose your DR site well, you might be able to use it for offsite backup as well – or in the case of a hospital, archival of old PACS images, etc.
For small-ish businesses, “duplicate what we had” might be the simplest way forward if you actually have to activate the DR site. That does require a good inventory of installed gear, captured before the DR event! Having that inventory stored only on a server in the datacenter that just went away – that would be a major oops!
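One low-effort way to avoid that oops is to snapshot the inventory periodically and copy it somewhere outside the datacenter it describes. Here’s a minimal sketch; the device entries and the offsite path are hypothetical examples, and in practice the copy target might be an offsite share, object storage, or wherever your DR runbook lives.

```python
import json
import shutil
from datetime import date
from pathlib import Path

# Minimal sketch: snapshot the gear inventory and copy it somewhere that is NOT
# the datacenter it describes. Entries and paths here are hypothetical examples.
inventory = [
    {"device": "core-sw-1", "model": "example-48-port-switch", "role": "core switch"},
    {"device": "esxi-01", "model": "example-2U-server", "role": "virtualization host"},
]

snapshot = Path(f"gear-inventory-{date.today()}.json")
snapshot.write_text(json.dumps(inventory, indent=2))

offsite_dir = Path("/mnt/offsite-backups")  # stand-in for your real offsite target
if offsite_dir.exists():
    shutil.copy(snapshot, offsite_dir / snapshot.name)
```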
Should there then be a need for DR, your “lukewarm” site could be activated. That assumes you think your business can afford the remaining downtime while racks, servers, and network gear are ordered, shipped, and installed. At least this way, the lead time would be a couple of weeks, rather than a month-plus.
Is There a Better Way?
There is, of course, a tantalizing alternative: cloud. The main lead-time items there are setting up billing and doing the familiarization and planning. If your key apps are latency-intolerant, or require bare-metal servers, cloud might not be workable. If you want to replicate your datacenter rapidly, well, there are virtual routers and firewalls you could leverage. Switches, not so much; you’d need to think in terms of replicating VLANs and VRFs. The UnderArmour talk at CiscoLive covered a couple of the workarounds you may need to consider.
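To make the VLAN/VRF point a bit more concrete, here is a rough sketch of drafting that translation ahead of time: each VRF becomes an isolated VPC/VNet, and each VLAN becomes a subnet inside it. The VLAN IDs and prefixes are invented, and you’d apply the resulting plan with Terraform or your cloud provider’s SDK.

```python
# Rough sketch: there are no switches or VLANs to rebuild in the cloud, so plan
# the translation up front. The VLAN IDs and prefixes below are invented examples.

vlans = [
    {"vlan": 10, "name": "servers", "vrf": "prod", "prefix": "10.10.10.0/24"},
    {"vlan": 20, "name": "backup",  "vrf": "prod", "prefix": "10.10.20.0/24"},
    {"vlan": 30, "name": "pci",     "vrf": "pci",  "prefix": "10.10.30.0/24"},
]

plan = {}
for v in vlans:
    # One VPC/VNet per VRF preserves the routing separation the VRFs provided on-prem.
    plan.setdefault(v["vrf"], []).append((v["prefix"], v["name"]))

for vrf, subnets in plan.items():
    print(f"VPC/VNet for VRF '{vrf}':")
    for prefix, purpose in subnets:
        print(f"  subnet {prefix}  ({purpose})")
```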
Using a regional colocation facility is another possibility. If you have colocation space arranged, and the space provides robust Internet access, you’re a good part of the way there, without spending the time on finding and leasing space, arranging power, etc. (as above).
Another consideration is distance to the DR site, and connectivity. You might be in an area squeezed between mountains and ocean (e.g., parts of the West Coast), where the fiber mostly runs parallel to the coast, or subject to other geographic connectivity limitations. How close to your current site should your DR site be? Too close, and it is subject to the effects of the same DR event. Too far, and will you be able to communicate from your main site to the DR site (assuming your main site is still functioning, staff can get to it, etc.), or use it for your Continuity of Operations remote access plan? Having your DR site in a well-connected colocation facility would likely give you robust Internet connectivity and a robust datacenter – but could it talk to your main site in a disaster?
I’ve had that discussion in urban areas as well, e.g., Washington and New York. Is Brooklyn too close to Manhattan for DR, factoring in Hurricane Sandy’s impact? My answer is a strong yes; there are too many failure modes in common between the two places. What about, say, I-95 and I-270 in the DC area? I lean towards “a bit close,” but would you be able to communicate with a DR site that’s farther away? I can’t really answer that for you. One argument is that anything big enough to take out both such sites leaves you or your customer base with bigger problems to deal with – or with nobody left to serve. I find that a rather gloomy perspective.
My suggestion is that you want independent power grids, road networks, and communication/fiber paths for the main and DR sites, if possible. Latency and synchronous replication often constrain how much distance is workable.
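As a rough guide to that constraint: light in fiber covers roughly 200 km per millisecond one way, and the fiber path is usually longer than the straight-line distance. The sketch below does the arithmetic; the route factor and the quoted replication tolerance are typical ballpark figures, not vendor specifications.

```python
# Rough latency-vs-distance check for synchronous replication.

def fiber_rtt_ms(straight_line_km: float, route_factor: float = 1.5) -> float:
    """Approximate round-trip propagation delay over fiber, ignoring device latency."""
    fiber_km = straight_line_km * route_factor   # fiber rarely runs in a straight line
    return 2 * fiber_km / 200.0                  # ~200 km per ms, one way, in fiber

for km in (30, 100, 300):
    print(f"{km:>4} km apart -> ~{fiber_rtt_ms(km):.1f} ms RTT")
# Many synchronous-replication products are only rated for a few ms of RTT
# (figures around 5-10 ms are common), which caps workable separation at a few
# hundred km at most -- check your storage vendor's actual limit.
```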
Widespread vs. Localized Disasters
I’ll conclude by noting that I’ve had some warm discussions with “BCP-certified professionals” as I have tried to explore the completeness of their site’s DR plans. I’ve seen cases where the DR plan looked like it would work only if several large assumptions were all valid. Personally, I’ve noticed that when the least thing goes wrong in most major urban areas, traffic on the highways stops moving. Add a real disaster with some bridges down, and you’ve got major gridlock. That’s one reason a lot of the DR and continuity of operations plans that exist today are pretty much non-starters. Nobody is going to be going anywhere.
Similarly, if you assume your firm is the only one with a DR event, well, that means you’re prepared only for a very localized event such as a building issue, fire, water, power, or external fiber cut – and not for a more widespread problem (flood, bomb, tornado, earthquake, etc.). That’s fine as long as your planning recognizes that limitation. Should we refer to the two different situations as “limited-DR” and “large-scale-DR” then?
There’s a lesson there – DR really needs to cover several different scenarios, and have a plan for each. Not just “Planning for Bad Things Only Affecting Our Company.”
To sum up, DR requires you to be honest and thorough in examining your requirements, which means considering the various types of DR events and deciding which ones you will protect against. Looking at various DR scenarios and weighing candidate DR designs against them might be part of the process. If you’re a hospital or public utility, that’s a tough job, since people are going to be counting on you most when a disaster happens.
If you’d like an independent review of your own disaster recovery plan, or want to talk about developing your plan, just reach out.
Comments
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussions via comments. Thanks in advance!