All right, who applied the curse “may you live in interesting times”?
COVID, climate change, floods, fires, and so on have us thinking about disaster planning, among other things.
Nationally, in the U.S., there’s a lot of talk, but not much deep climate change planning or action seems to be happening yet.
- Where are the starts on desalination plants in California (other than San Diego) to address the lack of water there and in nearby states?
- Where’s the power for them to come from (especially since Californian nuclear plants are being decommissioned)?
- What does the water situation do to the national and local food supply? International food demand will likely increase. Prices?
- What about creating pre-built cities/solutions for the homeless and the soon-to-be homeless to relocate into?
Some of this may also come down to failing to think big enough or far enough ahead. It takes years to build new water or power supplies, or housing for, say, hundreds of thousands of displaced people. But without solid data indicating “big” is happening, who is likely to invest time and money in planning and building? The politics just don’t work.
There’s also mental “discounting” of risks with long lead times. Yet building a million homes or desalination plants etc., can take multiple years to complete. Like, ten or more?
Enough doom and gloom. Not anything you or I can solve.
How about the corporate world?
For corporate disaster recovery (DR) / continuity of operations (COOP) planning, there are likely some of the same sorts of issues. And the above is relevant since, in our corporate thinking, we tend to take utilities like our water supply and power for granted. The same applies to road access, clean air, and the supply chain, for that matter. Or just committing the money to have a true DR site and plans, rather than a low priority half-**d effort?
Thought: Considering all of this, has your organization revised its risk/disaster/etc. planning? And yes, this shades into COOP, but it’s the network/security side of that.
I’ve occasionally had glimpses of IT DR plans. They’re not usually why I’m consulting at a site.
The DR plans I have seen were sometimes pretty good, and sometimes a bit thin or even nearly non-existent.
Malware problems also suggest that many organizations have been doing only the bare minimum re security, like vulnerability scans. That is clearly not enough either, but it is a subject for a different blog. (I’m thinking of the audit-type things, like tracking vulnerabilities and whether they were addressed, software “bill of materials” tracking re exposure via app components, etc.)
One take-away is that “oh, we have good backups” isn’t always true: backup validity is often taken for granted and rarely tested. Or new servers/VMs creep in, and nobody notices they aren’t getting backed up.
On the network front, new devices get deployed but perhaps not added to the net management software, so there is no recent copy of their configuration(s).
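One way to catch that kind of drift is to periodically compare an inventory against what the backup or network management system actually covers. Here is a minimal sketch, assuming you can export both lists as plain hostnames. The function name and the sample hostnames are hypothetical and not tied to any particular product.

```python
def find_coverage_gaps(inventory, covered):
    """Return inventory hosts that the backup/management system does not cover.

    Both arguments are iterables of hostnames. Names are normalized
    (trimmed, lower-cased) so "SW-Core-02" and "sw-core-02 " match.
    """
    def normalize(name):
        return name.strip().lower()

    known = {normalize(host) for host in covered}
    return sorted({normalize(host) for host in inventory} - known)


# Hypothetical exports: what exists vs. what is actually being backed up.
inventory = ["sw-core-01", "SW-Core-02", "fw-edge-01", "vm-app-17"]
backed_up = ["sw-core-01", "sw-core-02 ", "fw-edge-01"]

missing = find_coverage_gaps(inventory, backed_up)
print(missing)
```

Anything that shows up in the result is a device or VM that crept in without being picked up. The same comparison works for device configuration archives.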
Getting back to DR plans… you can only plan so much, and at some point, high costs pre-empt being pro-active unless the threat is obvious enough and close enough (in time or geography). So you have to prioritize.
You and I can’t solve those problems. Let’s narrow the scope some more:
This blog will cover some thoughts about Network DR/COOP. That’s maybe where most of my blog readers and I can do something. I by no means specialize in this, so my thoughts may be wrong and certainly are incomplete (and don’t sue NetCraftsmen because of that!).
The intent here is to get you thinking and possibly challenging any assumptions you’ve been making.
As I’ve already noted, I have only ever had the chance to participate in wide-ranging DR/COOP discussions a couple of times. It’s not a service NetCraftsmen explicitly sells or gets asked to consult on, although we certainly can assist with DR/COOP. I do hope every business is doing good DR/COOP planning and at least identifying risks. That hope is likely overly optimistic.
I do recall majorly irritating a couple of DR/COOP-certified pros by poking holes in their contingency planning. There may be a useful point in that story. (Fallback excuse: or else I am getting old and prone to rambling stories, so please bear with me.)
The DR/COOP plan assumed that “key” people could drive to an alternative worksite, and the others use VPN. They did not have nearly enough remote access VPN licenses nor VPN head-end capacity for everyone to WFH.
However, the disaster recovery work site was a shared facility. The company planned on two shifts of, say, 200 or so people onsite (and driving or staying nearby). But the reality is that if the outage affected other customers of the DR provider, then the company might only get a fraction of the seats and have to send some staff elsewhere or several elsewheres. Fun stuff!
So, my question was: what if the area is in total gridlock? FWIW, I think that is actually the most likely form of “big problems,” and just one or two traffic accidents might be all that is required.
What if nearby hotels are all booked solid? What if food delivery trucks can’t get into the area? Etc.
Here’s why I focus on this:
From what I’ve seen in the DC area or most urban areas, if a large truck takes out an overpass or key bridge and shuts down the Beltway or key intersections, you’d get total gridlock. Recently that happened in central Atlanta. Kudos, they got it sufficiently cleared up rather quickly, overnight. But is that an exception? What if cleanup takes weeks (collapsed bridge over river or ravine)? Couple that with a truck that was carrying chemicals and a spill? Any spill might have to be tested for bio, chemical, or other risks. Longer recovery time?
Snow is another factor, as we’ve just seen with the 2022 snow jam on I-95. Or in the late ’70s, it took seven days to clear up the mess on Route 128: 10,000 or more stuck vehicles that got snowed into place by 3 feet or more of snow. Suppose heavy snow causes your data center roof to fail and also impairs mobility. What then?
To me, in fact, that somewhat calls into question the business model of DR office relocation providers. If their customers do not have reserved sole-use space, then they are only protecting against small-scale (single company/site?) problems. Isn’t Work From Home a better model? Oh, but what if the Internet is out, and road issues hinder repairs?
Getting off the road focus, let’s get back to the topic of your organization being the only one doing DR.
Really, what kind of disaster only affects your company? Maybe building collapse or electric/gas/water/HVAC issues for the building. But most weather events and other disasters tend to have a wider impact.
And that was why the DR pros got mad at me: they’d pretty much assumed a single-company problem only.
What are some of the things we may be taking for granted? The above is just the start.
- I’ll note electric/gas/HVAC are somewhat coupled. The winter of 2020-2021 in Texas showed that a lack of electricity or gas, or the cost of them, matters.
- The prior sections suggest that roads/travel, lodging, and food/restaurants can be factors. For long-term issues, laundry and other services might be factors.
- Communications/connectivity services are yet another factor. If you have redundant circuits or ISPs, do they share common failure modes? That takes a lot of work to verify!
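For the shared-failure-mode question in the last bullet, the checking itself is simple once you have the path data. Getting the data is the hard part. A minimal sketch, assuming you have managed to list what each circuit depends on (building entrance, conduit, central office, etc.). The circuit names and path elements below are hypothetical.

```python
from itertools import combinations


def shared_failure_points(circuits):
    """Map each pair of circuits to the path elements they have in common.

    `circuits` maps a circuit name to the set of things it depends on.
    Only pairs that share at least one element are returned.
    """
    overlaps = {}
    for name_a, name_b in combinations(sorted(circuits), 2):
        common = set(circuits[name_a]) & set(circuits[name_b])
        if common:
            overlaps[(name_a, name_b)] = common
    return overlaps


# Hypothetical path data for two "diverse" ISPs and a wireless backup.
circuits = {
    "isp-a": {"entrance-north", "conduit-12", "co-downtown"},
    "isp-b": {"entrance-south", "conduit-40", "co-downtown"},
    "lte-backup": {"tower-7"},
}

for pair, common in shared_failure_points(circuits).items():
    print(pair, "share", sorted(common))
```

In this made-up data, the two ISPs enter the building separately but land in the same central office, which is exactly the kind of thing that is easy to miss.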
There’s also the quarantine version of all this. Contact-free pizza delivery, cots and laundry facilities in the building, etc. Is that on your radar?
Lesson Learned: Check your assumptions and think about what might blind-side you. Prioritize.
Some Good News
There’s some good news here. COVID has helped businesses, to the extent that most have now exercised the maximum-WFH distributed workforce model. Should a new virus require near-instant drastic quarantining, we mostly know how to deal with that now. (Severe spread by surface contamination and food/supply issues aside.) I’m also assuming lack of maintenance due to long-term duration or workforce illness doesn’t disrupt remote access.
Having an even more widely dispersed staff (i.e., beyond commuting-to-the-office range) might actually be a good idea to allow a company to continue operating if their HQ city/region is badly messed up. (Earthquake takes out a region? Earthquake where they are scarce? Wildfire causes isolation?) That does assume HQ is not also the sole data center… hence, remote WFH staff and cloud and diverse services are important as well.
As we are continuing to discover with COVID, logistics and other service businesses do not support WFH. And might have problems with reduced staff due to illness. If your business does or uses logistics services, what’s your plan for delays or gaps?
Thinking in terms of disaster scale, whatever the cause, may be one way to plan effectively. Building scale, one block, one-mile radius, five miles, city, region. Yes, you might factor in the probability.
One conclusion lurks there: geographic dispersion probably enhances disaster survivability as long as necessary services connect the affected region to elsewhere.
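To make the scale idea concrete, here is a rough sketch that asks, for each scale, which sites would sit outside the affected area if the event were centered on HQ. The radii assigned to “building,” “one block,” “city,” and “region” are my own placeholder assumptions, as are the site names and distances. Substitute your own.

```python
# Hypothetical radii (in miles) for the disaster scales listed above.
SCALES = [
    ("building", 0.05),
    ("one block", 0.1),
    ("one-mile radius", 1),
    ("five miles", 5),
    ("city", 25),
    ("region", 250),
]


def surviving_sites(site_distances, radius_miles):
    """Return the sites that lie outside the affected radius.

    `site_distances` maps a site name to its distance in miles from
    the center of the event (here, HQ itself is at distance 0).
    """
    return sorted(
        name for name, miles in site_distances.items() if miles > radius_miles
    )


# Hypothetical sites and their distance from HQ, in miles.
sites = {"hq": 0, "annex": 0.5, "colo-east": 12, "dr-site": 40, "cloud-region": 900}

for label, radius in SCALES:
    print(f"{label:>16}: {surviving_sites(sites, radius)}")
```

This ignores the “necessary services” caveat entirely: a site outside the radius only helps if power, connectivity, and staff access to it survive as well.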
Lesson Learned: Be explicit about your DR/COOP assumptions (including what has NOT failed).
There’s another good news story on this topic. I used to worry about Cisco having such a big concentration of office space in/around San Jose. One big earthquake or flood could be catastrophic. Cisco’s auditor apparently saw this issue, and it began appearing in their financial reports. The Cisco IT team then reported on how they did risk analysis before building data centers inland in Texas.
However, Texas power might not have been weighted properly, in retrospect.
Pursuing the diversity thread:
- If you apply that to things like warehouses, smaller, more numerous, and dispersed might be better for continuity of operations, albeit probably at somewhat higher cost (loss of economies of scale, need for good software and processes to track dispersed inventory).
- If you learn from what Cisco did for a new data center location — when there is a new build opportunity, relocating key corporate buildings and HQ to meet some minimal criteria is probably a good thing. See their writeup for the risk factors they considered, and add your own.
Tackling the Problem
So how does one even begin thinking about corporate (etc.) DR/COOP?
I started thinking about causes. There are a lot of possible disaster causes. That gets messy fast!
It may sound simplistic, but rather than considering causes, it might be better to think in terms of what “resource” breaks or becomes unavailable or unusable. And the scope: how wide the breakage is (and how widespread a problem your risk management can afford to protect against).
Potential causes can separately be mapped to breakage and scope as a way to validate your coverage. As in: “what did I miss?”
More important probably: mapping lost/failed resources against alternatives and impact.
So, for example, floods remove or limit staff mobility and take out equipment, circuits, and sites. Tornadoes: more limited scope, but ditto. Hurricanes: broader scope, with power, telecom, roads/bridges, etc., outages. Also fuel pipelines, gas station supply, etc.
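The “what did I miss?” check can be sketched as a small cause-to-resource table compared against what the plan already covers. The cause entries below follow the examples just given. The set of resources the plan covers is a hypothetical placeholder, and the function name is mine.

```python
# Which resources each cause can break, following the examples in the text.
CAUSE_TO_RESOURCES = {
    "flood": {"staff mobility", "equipment", "circuits", "sites"},
    "tornado": {"staff mobility", "equipment", "circuits", "sites"},
    "hurricane": {"power", "telecom", "roads/bridges", "fuel supply"},
}


def uncovered_resources(cause_map, planned_for):
    """Return {resource: [causes]} for resources the plan has no answer for."""
    gaps = {}
    for cause, resources in cause_map.items():
        for resource in resources - set(planned_for):
            gaps.setdefault(resource, []).append(cause)
    return {resource: sorted(causes) for resource, causes in sorted(gaps.items())}


# Hypothetical: resources the current DR plan already has an alternative for.
plan_covers = {"equipment", "circuits", "sites", "power"}

gaps = uncovered_resources(CAUSE_TO_RESOURCES, plan_covers)
for resource, causes in gaps.items():
    print(f"{resource}: no alternative planned (exposed by {', '.join(causes)})")
```

The table is organized by resource on purpose: adding a new cause only validates coverage, while the planning effort goes into alternatives for each resource.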
So, what key resources does corporate networking require to operate and continue doing so?
- Existing infrastructure (or sufficient surviving infrastructure).
- Probably not worth considering a national-scale EMP: kiss your network and former life goodbye. DR will be the least of our problems in that case. (Prioritize!)
- This is why we usually do geodiversity. Except that many companies are still in “single big HQ” mode. Face time with the CEO, CIO, etc.!
- WAN/MAN services (remote access, site to site, cloud connectivity, etc.). This is where we hope our service providers have been doing their DR planning well.
- We saw recently (Nashville bombing) that fiber entry to a major provider central office was a massive SPOF affecting a fairly large region. Good luck getting a service provider to tell you about their HA shortcomings!
- Wildfires and flooding may impact long-haul fiber or other interconnections. (And degrade microwave circuits.) That seems likely to reduce carrier path diversity, particularly in the middle of the US and where there are mountains. Are there chokepoints where there is little diversity?
- Fiber that runs along pipeline right of ways concerns me. Are there possible common failure modes for carrier long-haul circuits/fiber?
- Access circuits are generally the less diversely routed aspect. Between planning and taking path diversity (even inside buildings) seriously, we can mitigate a lot of this risk. This does take significant effort to do it right!
- Cloud and CoLo and data centers. Ditto: diverse delivery of app services. Many organizations are not really ready on this front. Are ALL your key apps deliverable quickly out of at least two locations? Have you checked “key apps” lately? Have you actually fully tested failover? DNS and DHCP are some of the gotchas that might not be thought of as “key.”
- Cloud SPOFs: Do you have any service SPOFs (cloud or main data center)? For example, Facebook’s building access, etc., depending on their own services, which became inaccessible.
- Do write a solid DR playbook-type plan, with tests. If you must execute in the middle of the night on 3 hours’ sleep (or post partying), you will really need such plans. And if you have such plans, schedule a review every N months (N = 3, 6, or 12).
- In one case I heard about during an assessment, the network staff’s “plan” was three hand-written bullet points on the back of an envelope. When they had to execute a test DR event on short notice, they had a really nasty 48-hour fail, and presumably, management had some words for them.
- Very simple is good. “Plug in/unplug this cable” or “shut these links down and bring these up” is very simple.
- Staff access. WFH works pretty well. As noted earlier, a bigger geographic spread for staff locations may reduce risk (but add other risks). Providers coped with COVID by isolating staff and limiting access to CoLo facilities, etc., to their own staff only; quarantining might suffice there. Food, lodging, etc., for longer-term quarantining are something some providers have apparently considered.
- Knowledge is a key asset. How many companies can survive something that keeps all or most of their leadership offline? Or most of the network team?
- Documentation is a key risk hedge that is currently missing AT MOST ORGANIZATIONS—particularly “big picture” documentation.
- Related: what’s the big picture when DR plan X is in effect?
- We can more or less figure out details from device configurations if we know the big picture. Those of you who have been paying attention know this is one of my pet peeves. Everyone at a site tends to take the big picture for granted. So it is rarely documented. When I do an assessment, it’s the first thing I do to make sure I know how all the pieces connect, etc.!
- Cooling, power, and water. We probably take these for granted. As climate change continues, they may well be affected. They are likely to become very scarce and costly resources soon in some parts of the US and elsewhere. What if grid power is only available 6 or 8 of 24 hours? What if water is in very short supply? Could that impact the cooling systems? I have no idea! What if water is dedicated to people, not companies, so your company only gets a small water ration?
- Supply chain. We’ve recently seen how that matters. That affects networking in terms of getting gear for upgrades and new sites or replacement gear. And you have perhaps reduced provider diversity, as replacement equipment might take longer to arrive and get deployed to fix outages, either for you or for a WAN provider. So, do you stock more spare devices? In more extreme cases, will FedEx (or UPS, etc.) be available to get a spare from your storage location to where it is needed? How long to deliver? Do you stock regional spares?
That’s a starter list, at least.
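As one small example of turning an item on that list into a repeatable check, here is a sketch of the “at least two locations” question for key services, with DNS and DHCP included on purpose. The service names, location names, and function name are hypothetical. It only looks at an inventory, so it is no substitute for actually testing failover.

```python
def single_location_services(service_locations, minimum=2):
    """Return services deliverable from fewer than `minimum` distinct locations."""
    return sorted(
        name
        for name, locations in service_locations.items()
        if len(set(locations)) < minimum
    )


# Hypothetical inventory of where each key service can be delivered from.
services = {
    "erp": ["dc-east", "cloud-west"],
    "email": ["cloud-west", "cloud-east"],
    "dns": ["dc-east"],
    "dhcp": ["dc-east", "dc-east"],  # two servers, but only one location
}

at_risk = single_location_services(services)
print(at_risk)
```

Counting distinct locations rather than servers is the point of the design: two servers in one building still amount to a single location.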
Challenge: What’s missing from the above list?
Challenge2: Does your organization have a Chief Risk Officer? (Shout-out to my sister Barbara, who did that for a major insurance company for a while.)
DR/COOP is hard. I hope the above was helpful.
It looks like my upcoming retirement years (at some point) may be eventful, as in “fun times.” Hopefully a lot less eventful than anything described above.