Is Your Disaster Recovery Plan Actually a Disaster?

Author
Peter Welcher
Architect, Operations Technical Advisor

A recent blog about Take Home Our Computer (THOC) policies talked about Continuity Of Operations (COOP) planning and the choice between supplying a COOP site for people to work at and facilitating work from home via corporate laptops.

While my brain is on the topic of Disaster Recovery (DR), I have a few other observations to make. I end up wondering whether most DR plans will turn into a shambles (a disaster of their own) if and when something bad happens.

This type of planning always requires thinking through what is critical (people, assets, applications) for continued operation of the organization. It also requires weighing relative risks and drawing a line between the risks you will prepare for and those you will not, with management sign-off accepting the latter. So far, so good.

The problem I see starts with inadequate planning. Nobody means for it to happen. DR is important. But staff is overloaded, and DR is less urgent than day-to-day concerns, so the planning just… slides out into the future. Stephen Covey wrote about a 2 x 2 matrix, with axes being “urgency” versus “importance.” When the phone rings, that’s urgent but might not be important. DR is important but never very urgent. Long-term important things like DR tend not to happen.

Pete’s DR Lesson #1: Unless you dedicate staff to DR and COOP, perhaps one day a week, this planning won’t happen. If your staff are constantly running behind on projects, you might or might not be short-staffed, but DR is highly unlikely to happen.

The second observation I have — and everyone I mention this to agrees it is a problem — is that whenever the annual or semi-annual DR drill takes place, more critical dependencies show up. They get added to the list of critical apps or data, and the planning documents get updated. Repeat for 10 years.

Pete’s DR Lesson #2: Stop the insanity of doing the same thing over and over again and expecting different results.

The problem here is basically Lesson #1: people don’t have the time to plan thoroughly, if at all, and/or key pieces won’t show up until you (a) activate full DR, (b) actually try using apps, and (c) actually try executing various business functions.

My personal suspicion is that human error plays a role here. Full preparedness is difficult. So what’s really happening is that organizations try to shave costs by “shoe-horning”: running only the critical apps on a smaller computer footprint and squeezing them in. The assumption is that planning will fill in the gaps. Except that planning is also where costs get cut, so it never gets done, or is done poorly and hastily. And since identifying critical application and data dependencies is hard, things can easily be overlooked.
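To make that last point concrete: even when dependencies do get written down, the list is only as good as what was declared. Below is a minimal sketch (the app names and the dependency map are invented for illustration) of walking declared dependencies to find everything a “critical” app transitively needs; anything missing from the map is exactly the kind of gap that surfaces mid-drill.

    # Walk a declared dependency map to find everything a "critical" app
    # depends on, directly or indirectly. Names below are hypothetical.
    from collections import deque

    DECLARED_DEPS = {
        "order-entry": ["erp", "auth"],
        "erp": ["erp-db", "license-server"],
        "erp-db": ["san-replica"],
        "auth": ["directory"],
    }

    def transitive_deps(app):
        """Return every component 'app' needs, per the declared map."""
        seen, queue = set(), deque([app])
        while queue:
            for dep in DECLARED_DEPS.get(queue.popleft(), []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

    print(sorted(transitive_deps("order-entry")))
    # ['auth', 'directory', 'erp', 'erp-db', 'license-server', 'san-replica']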

One alternative is to mirror the main datacenter at the DR or active-active site. Yes, that costs money. Then plan to re-activate everything at the DR site in priority order. Doing so greatly reduces the amount of planning needed (not that it is really getting done), and it avoids building a one-off DR site and then having to deal with its differences during a DR/COOP event. This approach vastly increases the likelihood the business will actually be able to continue operating. It also means you don’t need to do capacity planning and load testing, since all VMs and apps will be running on the same hardware capacity before and after DR activation. Additionally, you don’t have to do business process drills to identify missing data, because you have all the data and apps (barring things people did on paper only).

You could call this strategy “No Application or Data Left Behind.”
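To illustrate what “re-activate everything in priority order” can look like as something executable rather than a paragraph in a plan, here is a hedged sketch: bring up each app in order and do not move on until its health check passes. The app names, health-check URLs, and timeouts are placeholders, not anyone’s real activation plan.

    # Activate apps in priority order; block on a health check before
    # moving to the next one. URLs and timings are illustrative only.
    import time
    import urllib.request

    ACTIVATION_ORDER = [
        ("directory",   "http://dr-directory.example.internal/health"),
        ("erp-db",      "http://dr-erp-db.example.internal/health"),
        ("erp",         "http://dr-erp.example.internal/health"),
        ("order-entry", "http://dr-orders.example.internal/health"),
    ]

    def wait_until_healthy(url, timeout_s=600, poll_s=15):
        """Poll a health endpoint until it returns HTTP 200 or we give up."""
        deadline = time.time() + timeout_s
        while time.time() < deadline:
            try:
                if urllib.request.urlopen(url, timeout=10).status == 200:
                    return True
            except OSError:
                pass  # not up yet: connection refused, DNS failure, timeout
            time.sleep(poll_s)
        return False

    for name, health_url in ACTIVATION_ORDER:
        print(f"Activating {name} ...")  # real step: start the VMs/services here
        if not wait_until_healthy(health_url):
            raise SystemExit(f"{name} never became healthy; stop and troubleshoot")
        print(f"{name} is up; moving on")

The point is less the code than the discipline: the activation order and the “done” criteria live somewhere that can be exercised during drills, not only in someone’s head.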

By the way, if you have plans for how to activate your DR, and those plans only live on the corporate fileshare… will you have them accessible when you need them? The DR fileserver will need to be up and running and accessible. How will it get that way, if the plans for doing so aren’t available? Winging it?

Pete’s DR Lesson #3: Your DR planning documents and particularly your step-by-step plans for activating DR, bringing up the network, and priority sequencing of bringing up apps all need to be available when the main fileserver is down/inaccessible.

My thoughts: keep copies on laptops or thumb drives, and update them regularly. I’d say use a secure cloud share, but that might not be accessible in the early hours of a regional disaster or event.
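Keeping those offline copies fresh can be as simple as a scheduled script plus a calendar reminder. A minimal sketch, assuming placeholder paths for the authoritative fileshare and the thumb drive:

    # Snapshot the DR plan folder to offline media with a date-stamped name,
    # so stale copies are obvious at a glance. Paths are placeholders.
    import shutil
    from datetime import date
    from pathlib import Path

    SOURCE = Path("/shares/dr-plans")                # authoritative copy
    DEST_ROOT = Path("/media/thumbdrive/dr-plans")   # offline copy location

    def snapshot_plans():
        target = DEST_ROOT / f"dr-plans-{date.today():%Y-%m-%d}"
        shutil.copytree(SOURCE, target)  # fails loudly if today's snapshot exists
        print(f"Copied DR plans to {target}")

    if __name__ == "__main__":
        snapshot_plans()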

You do have planning documents, don’t you? And you are doing continual improvement every time you do a DR drill, aren’t you? Maybe not… many IT departments, especially networking groups, seem to be operating in a documentation-free mode lately. Good documentation is critical in case a key person quits, is hit by a truck, or becomes unavailable or unreachable. Good basic documentation of what you built, how it works, and why you built it that way is also very useful for bringing new hires or consultants up to speed quickly. I rarely see it. Yes, I get it: you and everyone else are massively short-handed, and (if you’re in management) you can’t get the budget.

The thing is, unless you document how to execute on DR, your process will never improve. “Winging it” is not a good answer.

Relating to this is how thoroughly you test. “The application came up” is not a very useful test — that’s the starting point. Can you actually log in and exercise a full set of key application operations? Have you done load testing to determine if the application can handle a full set of local and remote users — while all applications are running on the shared small DR hardware footprint? Have you actually put a solid subset of critical staff onsite and worked from the site for one to two days, to catch any gaps in the list of critical applications? Do you have licenses and capacity for all staff to use Remote Access/VPN, etc.?
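For what it’s worth, the gap between “the application came up” and “the application works under load” can be probed with a fairly small script. The sketch below simulates many users logging in and then exercising a couple of key operations; the endpoints, credentials, and user count are stand-ins, and a real test would script your actual critical transactions.

    # Simulate N concurrent users, each logging in and exercising a few key
    # operations against the DR instance. Endpoints and counts are placeholders.
    import json
    import urllib.request
    from concurrent.futures import ThreadPoolExecutor

    BASE = "https://dr-app.example.internal"
    USERS = 200  # a realistic concurrent user count, not one tester

    def one_user_session(user_id):
        """Return True only if login and all key operations succeed."""
        try:
            login = urllib.request.urlopen(
                urllib.request.Request(
                    f"{BASE}/api/login",
                    data=json.dumps({"user": f"drill{user_id}", "pw": "..."}).encode(),
                    headers={"Content-Type": "application/json"},
                ),
                timeout=30,
            )
            if login.status != 200:
                return False
            # Exercise real business operations, not just the login page.
            for op in ("/api/orders?limit=10", "/api/reports/daily"):
                if urllib.request.urlopen(f"{BASE}{op}", timeout=30).status != 200:
                    return False
            return True
        except OSError:
            return False

    with ThreadPoolExecutor(max_workers=USERS) as pool:
        results = list(pool.map(one_user_session, range(USERS)))
    print(f"{sum(results)}/{USERS} simulated users completed all key operations")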

I have a reference situation for this. I participated in an under-planned Dev/Test lab migration for a large lab. It was supposed to take a week. It ended up taking four weeks and causing a lot of stress and unhappiness.

The problem was basically that when many things aren’t working, staff can get into something like a computer thrashing situation, jumping from fire to fire and not getting problems solved. My untested hypothesis is that in a DR event, you’ll be able to handle one or two unplanned problems. If you have 10, you’re going to be scrambling for quite a while, and won’t come close to hitting deadlines.

Pete’s DR Lesson #4: Undocumented DR execution plans almost guarantee your DR activation will be a disaster. If your testing isn’t very thorough, ditto. If you only do DR testing one or a few applications at a time, ditto: Your hardware may not be able to run all the critical apps under load when needed.

Finally, as a reformed mathematician, I’m very aware of Boolean Logic. Those readers with EE degrees likely are, too.

I’ve run into situations where the Business Continuity/Disaster Recovery team makes a bunch of assumptions. Unfortunately, they may be providing coverage only if all the assumptions are true.

For example, I’ve seen these assumptions:

  • The DR or COOP event will only last a couple of days to a week.
  • It will only affect our company — we’ll be able to use the full DR site set of seats, mainframe capacity, etc.
  • Critical staff will be able to drive to the COOP site to work.
  • Staff considered non-critical will not be needed during the duration of the short event.
  • If the event is longer, we’ll be able to activate more COOP worksites, order computers and other equipment, add remote access/VPN licenses for non-critical staff, etc.

What gets tested is the narrow scenario of short-term COOP, full access to the COOP site, no mobility issues, etc.: the logical AND where all those assumptions are true.

OK, if the event is short but mobility is a problem, a company can survive a couple of days of downtime. If it goes longer, that’s not so good. There’s a confidence aspect — if something bad happens, and if you’re “off the air,” customers may be very concerned.

Suppose something happens that affects all companies in some office-heavy area. Then multiple companies may be intending to use the same COOP facilities. So you may get shifted to one that is farther away. That in turn may make the commute longer for shifts of staff expected to go to the alternate site.

Where I’m going with this is…

Pete’s DR Lesson #5: Identify your key assumptions in your planning. Then consider every true/false combination, and how well your plans will hold up. Document the proposed workarounds, and evaluate them for likelihood of working.
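As a minimal illustration of Lesson #5, here is a sketch that enumerates every true/false combination of the example assumptions from the list above. The coverage rule mirrors the “logical AND” scenario that typically gets tested; everything else is a row that needs a documented, evaluated workaround.

    # Enumerate every true/false combination of the planning assumptions and
    # flag the combinations the current plan doesn't cover. The assumption
    # names and the coverage rule are illustrative only.
    from itertools import product

    ASSUMPTIONS = [
        "event lasts a week or less",
        "only our company is affected",
        "critical staff can reach the COOP site",
        "non-critical staff not needed for the duration",
    ]

    def plan_covers(combo):
        # Today's tested scenario: the logical AND of all assumptions true.
        return all(combo)

    uncovered = []
    for combo in product([True, False], repeat=len(ASSUMPTIONS)):
        if not plan_covers(combo):
            uncovered.append([a for a, ok in zip(ASSUMPTIONS, combo) if not ok])

    print(f"{len(uncovered)} of {2 ** len(ASSUMPTIONS)} combinations lack coverage")
    for failed in uncovered[:3]:
        print("e.g., not true:", "; ".join(failed))

With four assumptions, fifteen of the sixteen combinations fall outside the scenario that actually gets drilled, and each one deserves at least a documented workaround and an honest assessment of whether it would hold up.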

I haven’t seen anybody do this. Should people be doing this level of planning?

Comments

Comments are welcome, whether in agreement or informative disagreement with the above! Thanks in advance!

Twitter: @pjwelcher

Disclosure Statement
Cisco Certified 15 Years | Cisco Champion 2014
