I’d like to share some thoughts about Disaster Recovery / Continuity of Operations (DR / COOP). This was triggered by the storm-related outages we just had in the MD/VA area, and by reading that Amazon AWS and other big datacenters had outages. We would all agree that DR / COOP is an important aspect of network and overall enterprise operations. Yet when you look at what actually happens, it ends up with a sort of second-class status. Reasons include budget, ongoing rollouts of new applications and services, equipment upgrades, and so on. After all, Directors of IT don’t achieve success by halting progress for some period of time to attend to DR/COOP. Yet when something hits the fan, there’s always plenty of blame to go around, fun lessons-learned meetings, and so on. With the onset of occasional storms with higher wind strength and massive rainfall, should we be upping our game?
Start with Datacenter Best Practices
A lot of what’s needed for DR/COOP relates to datacenter best practices, i.e. things it would be good to be doing anyway. In other words, some basic datacenter practices are first steps that contribute directly to DR/COOP.
I’m amazed by repeatedly seeing sites with no discipline around basic dual-homing and failover planning. Why should large amounts of money be spent putting redundancy into the network if the server team can’t be bothered to take advantage of it?
I personally would prefer to see all but legacy servers dual-homed to dual datacenter access switches, with one or two pre-defined schemes for coordinating and validating what form of teaming / failover the servers are doing, and that the switch port configurations align with that. The point being to make sure at connection time that everything is set up properly in one of a small number of known configurations so that good performance and failover are highly likely to occur. My motto lately: if you don’t test it, it won’t work when you need it to.
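To make "a small number of known configurations" concrete, here is a minimal sketch of the kind of check I have in mind: each server's teaming mode is validated against the configuration of the switch ports it connects to, using a short table of approved pairings. All data, field names, and the approved combinations are hypothetical; real input would come from your CMDB export plus parsed switch configs.

```python
# Sketch: validate that each dual-homed server's NIC teaming mode matches
# the switch-port configuration, using a short list of approved pairings.
# Data and field names are hypothetical placeholders.

APPROVED = {
    # (server teaming mode, switch port mode) pairs we allow
    ("lacp", "channel-group mode active"),
    ("active-standby", "access"),  # failover-only teaming needs no port-channel
}

SERVERS = [
    {"name": "web01", "teaming": "lacp", "ports": [
        ("dc1-sw1 Gi1/0/10", "channel-group mode active"),
        ("dc1-sw2 Gi1/0/10", "channel-group mode active"),
    ]},
    {"name": "db01", "teaming": "active-standby", "ports": [
        ("dc1-sw1 Gi1/0/20", "access"),
        ("dc1-sw2 Gi1/0/20", "channel-group mode on"),  # mismatch
    ]},
]

def validate(servers):
    """Return a list of findings: not dual-homed, or teaming/port mismatch."""
    problems = []
    for s in servers:
        if len(s["ports"]) < 2:
            problems.append(f"{s['name']}: not dual-homed")
        for port, mode in s["ports"]:
            if (s["teaming"], mode) not in APPROVED:
                problems.append(f"{s['name']}: {port} mode '{mode}' "
                                f"does not match teaming '{s['teaming']}'")
    return problems

for p in validate(SERVERS):
    print(p)
```

Run at connection time (or nightly), this is exactly the "if you don't test it, it won't work" discipline: a mismatch like db01's gets caught before a failover event does it for you.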
Some of that theme arose in a prior blog, Cloud and Latency, at https://netcraftsmen.com/blogs/entry/cloud-and-latency.html. Maybe that material should have been a separate blog titled “Getting the Datacenter Right” or something like that. If you’re not there yet, it might make sense to classify applications by criticality, and start with the most critical servers and the services they depend on. Identify those, verify dual-homing, and verify proper teaming/switch configuration. That means you need to know which servers deliver each critical application, how they connect to the network, etc. And if you’re revisiting this, make sure they’re not on oversubscribed ports unless they have very low network IO rates (e.g. 3550, 3560, 3750 switches, older 4500 models, or non-6748 cards in a 6500).
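Once you have even a rough inventory, the "start with the most critical" step can be mechanized. A minimal sketch, assuming a made-up inventory schema and a hypothetical list of oversubscribed access-switch platforms:

```python
# Sketch: flag the most critical servers that sit on oversubscribed
# switch platforms, skipping ones with negligible network IO.
# Inventory rows and the platform list are hypothetical placeholders.

OVERSUBSCRIBED_PLATFORMS = {"3550", "3560", "3750", "4506", "6500-non6748"}

INVENTORY = [
    {"server": "erp-app1", "criticality": 1, "switch_model": "3750",
     "avg_mbps": 400},
    {"server": "erp-db1", "criticality": 1, "switch_model": "6500-6748",
     "avg_mbps": 900},
    {"server": "wiki1", "criticality": 3, "switch_model": "3560",
     "avg_mbps": 2},
]

def flag_misplaced(inventory, low_io_mbps=10):
    """Critical servers on oversubscribed ports, unless their IO is tiny."""
    return [row["server"] for row in inventory
            if row["criticality"] == 1
            and row["switch_model"] in OVERSUBSCRIBED_PLATFORMS
            and row["avg_mbps"] > low_io_mbps]

print(flag_misplaced(INVENTORY))
```

Here erp-app1 would get flagged for a move to a better-connected port; wiki1 stays put because it is low criticality and barely uses the wire anyway.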
Note the synergy: if you do App Mapping (for cloud, or to check use of appropriate switch ports, verify switch port performance, or to expedite troubleshooting), you’re also doing something that you need to do anyway to prepare for DR/COOP.
Another important step is duplication of critical services, including DNS, Active Directory, NTP, AAA (RADIUS/TACACS+). And don’t forget that it’s not enough to have a fallback DNS or Active Directory server at a second location, you need to ensure it has current data. I’ve heard a few stories about people finding out the hard way that they didn’t replicate automatically. So that’s another item for the basic DR/COOP checklist: for each service or critical app: document (and verify) how the data is replicated, backed up, when it was last verified, etc.
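That checklist item (document and verify how each service's data is replicated, and when it was last verified) can also be checked mechanically. A minimal sketch with made-up services, sites, and verification dates:

```python
# Sketch: a per-service replication checklist, checked mechanically.
# Each entry records how data reaches the DR site and when replication
# was last *verified* (not just assumed). All entries are hypothetical.

from datetime import date, timedelta

SERVICES = [
    {"service": "DNS", "replica_site": "DC2",
     "method": "zone transfer", "last_verified": date(2012, 6, 20)},
    {"service": "Active Directory", "replica_site": "DC2",
     "method": "AD replication", "last_verified": date(2011, 12, 1)},
    {"service": "RADIUS", "replica_site": None,
     "method": None, "last_verified": None},
]

def stale_or_missing(services, today, max_age_days=90):
    """Flag services with no second-site replica, or stale verification."""
    findings = []
    for s in services:
        if s["replica_site"] is None:
            findings.append(f"{s['service']}: no replica at second site")
        elif today - s["last_verified"] > timedelta(days=max_age_days):
            findings.append(f"{s['service']}: replication not verified "
                            f"since {s['last_verified']}")
    return findings

for f in stale_or_missing(SERVICES, today=date(2012, 7, 5)):
    print(f)
```

The point is less the code than the habit: "last_verified" is a field someone has to update by actually testing the replica, which is exactly what the hard-way stories above were missing.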
There’s another way to slice this. Do you have accurate information about every server (and VM) in your datacenter (owner, purpose, application(s) that rely on it, switch ports, IP addresses, VLANs, what form of teaming, etc.)? How many servers are sitting in racks sucking up power but with no network connections, or with essentially no traffic on their LAN connections? If you can’t answer those questions, you’re missing some pretty basic information.
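A completeness check over that inventory is a few lines of code. In this sketch the required fields and sample records are hypothetical; substitute your own schema:

```python
# Sketch: sanity-check a server inventory for missing basics.
# Required fields and sample records are hypothetical placeholders.

REQUIRED = ("owner", "purpose", "applications", "switch_ports",
            "ip_addresses", "teaming")

RECORDS = [
    {"server": "app7", "owner": "finance", "purpose": "ERP middleware",
     "applications": ["ERP"], "switch_ports": ["dc1-sw3 Gi1/0/5"],
     "ip_addresses": ["10.1.2.7"], "teaming": "active-standby"},
    {"server": "mystery9", "owner": None, "purpose": None,
     "applications": [], "switch_ports": [],  # powered on, role unknown
     "ip_addresses": [], "teaming": None},
]

def incomplete(records):
    """Map each server to the list of required fields it is missing."""
    out = {}
    for r in records:
        missing = [f for f in REQUIRED if not r.get(f)]
        if missing:
            out[r["server"]] = missing
    return out

print(incomplete(RECORDS))
```

Servers like mystery9, with everything missing, are exactly the power-sucking unknowns worth investigating (and possibly decommissioning) before any DR drill.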
The DR Dance of Due Diligence
I focus on this because of something one might call the DR Dance of Due Diligence. That’s where every year you have some sort of DR drill (usually in early January), you find some critical pieces are missing, some yelling may ensue, and then energy goes into planning for the next year. Repeat annually for at least ten years. I suspect a contributing factor is lack of tools for App Mapping, so that people are making a best effort to identify dependencies, but with no technical information backing up what people know or what is documented by application developers. While I have yet to experience this personally, many people have told me about their company’s version of it over the years. (Comment if you’ve done the dance — or even better, if you’ve done the dance and not found missing pieces.)
So we’ve indirectly identified some of what makes DR/COOP planning hard: tracking down everything that has to be replicated in a second datacenter, where to focus efforts, etc. What needs to happen is making this easier somehow. Some of that strikes me as process.
That’s where the list (spreadsheet?) of servers comes in. If you can’t tie a server or VM back to one or more applications, then you don’t know whether it needs to be replicated at the DR site, and you can’t track planning for how its data gets backed up or replicated, verifying that is in fact happening, etc.
I’ve had the dubious pleasure of getting invited to meet with an application architecture team in a couple of large organizations. I like the idea of application architectures: defining a limited set of application design approaches, e.g. web front end, middleware toolset, programming tools, database back end. The point is that if you limit the number of choices for each, you reduce the amount of new code release tracking and testing and one-off troubleshooting situations that can arise.
This seems like it is at best another moderate to low priority item in most organizations. There is also not much point to having an architecture if it is running 1-3 years behind present development efforts, or if it becomes too rigid and precludes experimentation and change — both of which seem to happen. At which point the thick architecture document becomes moot or ignored.
What I was hoping to hear was that network devices and design might also be part of that. Wouldn’t it be a good thing to have a standardized approach (or two: the old and the new) for how you do firewalling and load balancing? Maybe back-end SAN replication as well?
The reality is more that some consultant gets brought in to integrate components when installing a complex application. And they go with what they know, including their own choice of load balancer (and historically, often no firewall, just as servers historically often did not get hardened if there was a firewall in front of them). I see this when I’m doing an assessment, ask about load balancers, and hear something like “we have some clusters doing Microsoft Server Load Balancing, then these old Cisco CSSs over here, a NetScaler there, some Enterprise F5s that somebody bought over here…”. Unless you have a plan in place, as in “this is how we do load balancing”, you’re going to get steam-rolled (project costs, tying up the costly consultant’s time, etc.). Even if you do have a documented way to deploy application services, good luck. Big project consulting teams tend to get their own way.
The result is that most sites have a hodge-podge of network components, software and server architectures, all cooperating to deliver the various applications or services. Some sites are seeing VMware as a way to change this, particularly if they can run VMware on standard hardware supporting a smaller number of Operating Systems going forward. Others are stuck in IBM / Linux (or Solaris) / Windows stovepipes with no strategy yet.
It seems clear that having fewer variants of everything reduces costs and makes support and troubleshooting easier. And DR/COOP! I suspect many would agree with that, while asking “but how do we get there from here?” Sorry, I can’t answer that for you: each site has its own unique needs, level of management support, budget, timeline, etc. It all starts with recognizing the need and planning.
So what can be done to make DR/COOP better? So far, we’ve got:
- Execute well on datacenter basics (structure your day-to-day documentation and practices for better performance, faster troubleshooting, and for providing the input data you need to do DR/COOP well).
- Attempt to have a small number of architectures, both for network devices supporting application delivery, and for application designs. That means less management of device types, OS variants, drivers, etc. hence more time for more important things.
- Identify critical applications and services — or better yet, classify all applications into say 3 levels of criticality.
- Identify servers so you can manage them at the “AppPod” level (groups of servers used to deliver an application or service).
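The last two items on that list combine naturally: given a server-to-application mapping, you can group servers into AppPods and roll application criticality up to the pod level. A minimal sketch, again with made-up data:

```python
# Sketch: group servers into "AppPods" (all servers delivering one
# application) and inherit each application's criticality, so DR planning
# can work at the pod level. Mapping and criticality data are hypothetical.

SERVER_TO_APPS = {
    "web1": ["OrderEntry"], "web2": ["OrderEntry"],
    "db1": ["OrderEntry", "Reporting"],  # one server, two apps
    "rpt1": ["Reporting"],
}
APP_CRITICALITY = {"OrderEntry": 1, "Reporting": 3}  # 1 = most critical

def app_pods(server_to_apps):
    """Invert the mapping: application -> set of servers (its AppPod)."""
    pods = {}
    for server, apps in server_to_apps.items():
        for app in apps:
            pods.setdefault(app, set()).add(server)
    return pods

def pod_report(server_to_apps, criticality):
    """AppPods sorted most-critical first, members sorted for readability."""
    pods = app_pods(server_to_apps)
    return sorted((criticality[app], app, sorted(members))
                  for app, members in pods.items())

for level, app, members in pod_report(SERVER_TO_APPS, APP_CRITICALITY):
    print(f"level {level}: {app}: {members}")
```

Note that db1 shows up in both pods: shared servers are exactly the dependencies that the DR Dance of Due Diligence keeps rediscovering the hard way, and this view surfaces them up front.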
I plan a subsequent blog to talk about how new technology might (and might not) be changing this.