Internet Edge: Application Resiliency

This is the 13th blog in an Internet Edge series.

Links to prior blogs in the series:

Internet Edge: Simple Sites
Internet Edge: Fitting in SD-WAN
Internet Edge:Things to Not Do (Part 1)
Internet Edge: Things to Not Do (Part 2)
Internet Edge: Two Data Centers
Internet Edge: Double Don’t Do This
Internet Edge: Cloudy Internet Edge
Internet Edge: Special Cases and Maintainability
Internet Edge: Security Tool Insertion
Internet Edge: Traffic Steering Part 1
Internet Edge: Traffic Steering Part 2
Internet Edge: Traffic Steering Part 3
Internet Edge: The Big Picture

Thanks to those who have followed this series of blogs this far.

My intent has been to lay out what I think and what I see people doing, with a sprinkling of opinions and thoughts about best practices.

My challenge to you, the reader, is to use these blogs to think about what you’re doing, consider if it might be improved, and how to improve it.

This is a huge topic, and I’ve only been able to touch upon parts of it. This blog represents one more part of this large picture.

In the previous Internet Edge blogs, I’ve explicitly noted that data centers / Internet Edges do contain other types of devices. Those devices generally relate to security, WAN Optimization (a rapidly shrinking market niche), and load balancing or DDoS functions. Yeah, I probably missed something there. There are a lot of specialized functions. I’m sure you’ll let me know what I omitted.

Let’s home in on load balancers for a moment. Why would we have them? In general, they perform two essential functions:

Distribute a heavy workload across a bunch of servers / web front ends / whatever.
Provide application resiliency by detecting down servers / applications and distributing the workload across only those servers for that service that are up.
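
To make those two functions concrete, here is a minimal Python sketch of a round-robin distributor that skips servers marked down. It is only an illustration under stated assumptions: the class name, the server names, and the `mark` method are made up for this example, and a real load balancer would set health from actual probes (TCP, HTTP, etc.) rather than a manual call.

```python
class HealthAwareRoundRobin:
    """Toy load balancer: round-robin across servers currently marked up."""

    def __init__(self, servers):
        self.order = list(servers)
        self.health = {name: True for name in self.order}  # assume all up at start
        self._next = 0

    def mark(self, server, up):
        """Record a health check result (stand-in for a real probe)."""
        self.health[server] = up

    def pick(self):
        """Return the next healthy server, skipping any that are marked down."""
        for _ in range(len(self.order)):
            server = self.order[self._next]
            self._next = (self._next + 1) % len(self.order)
            if self.health[server]:
                return server
        raise RuntimeError("no healthy servers for this service")


lb = HealthAwareRoundRobin(["web1", "web2", "web3"])  # hypothetical server names
lb.mark("web2", up=False)              # pretend the health check for web2 failed
print([lb.pick() for _ in range(4)])   # web2 does not appear in the picks
```

The point of the sketch is that distribution and resiliency are the same loop: the only difference is whether the health table is consulted before handing out a server.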

Over time, those functions evolved into doing more exotic things like URL rewrite and some other “fancy functionality.” No doubt some sites use fancy functionality. I have yet to encounter anything other than reasonably basic functionality in production use, but then again, load balancer users may well get their consulting from firms that specialize in that, perhaps from the vendor. And now NetCraftsmen has other folks that specialize in load balancers, among other things.

From a certain point of view, load balancers are tactical, one way to solve a problem.

The broader need is application robustness (single or multiple sites) and perhaps workload scaling. Cloud and/or Kubernetes (or Docker), on-prem or in the cloud, have become popular for workload auto-scaling. They’re different enough to fall outside the scope of this blog series. (It’s my scope so I get to make those tough calls.)

Application Robustness (why do I imagine that being read aloud in a deep bass voice?) has a fair-sized application design aspect to it. But robust application design often also has elements impacting the network in significant ways. So let’s take a look at what the network might best be able to contribute.

This is an area where I consider it a Good Idea for networking staff to provide some informed feedback if asked. And perhaps even if not requested (but politely). There are some not-so-good design choices that application designers can and do make. Either you (metaphorically) throw your body in front of the looming mess to stop it, or you get to support it ever after.

There’s a fair amount to be said on this general topic. I have no intention of writing a book in blog form. So for this blog, I’ll list the available design approaches that come to mind and then point to some deeper material: specifically, Ivan Pepelnjak’s blogs and great courseware at ipspace.net (some of which requires paid access).

This topic has become particularly relevant since having a good resiliency strategy for two- or three-tier applications makes it easier to move them to the cloud or place a copy in the cloud as the first part of a move. That could also be the first step in a digital transformation, perhaps by splitting some functions as micro- or macro-services!

Let Us Count The Ways

So, what are some of the ways that might be used to make an application more robust?

DR (Disaster Recovery): Bring the app(s) up at another site. Or have them running there already with no external access.
- If your DR plan entails anything other than running small configuration scripts on one or two routers to make small changes, like bringing interfaces up, you don’t have a plan. Or rather, you plan to fail. Murphy’s Law says having to activate DR is bad enough. Best: have it all ready to go but for some shutdown interfaces or routing adjacencies. Automatic failover is best, but pasting in one or two pre-documented router commands is workable. Making it up as you go, at 3 AM on a weekend after some heavy partying the evening before, is unlikely to be pleasant or successful.
- Or have the backup copy of the app running but not processing transactions (i.e., no changes, so DB consistency is not an issue).
- One of the best approaches is to duplicate IPs / subnets, at least for the front end. Turn the primary site subnet interfaces off, turn up the DR site ones, and routing kicks in and finds them. In effect, “presto, it’s over here now.” Preventing accidental activation of this should be part of your plan.
- If you do something like this, you will need to back up servers / replicate data without using the production addresses. Hence, a backup interface on a different subnet / VLAN, perhaps. Servers with two interfaces are “complex”. Put some thought into this.

Summary: some form of DR is table stakes. It may be enough if your business requirements will tolerate some modest downtime.

It’s much better if you can run application(s) in active-active form. Then there is no DR. (The Zen of networking: “true DR is no DR.”)

Ultimately, that will depend on your backend database and how it replicates, and the CAP theorem applies. Having the app running out of two or more locations, with the worst case being having to shut one of them down, is much stronger resiliency! The other side of that is the potential for problems caused by the replication solution.

Here are some ways the application might be deployed to get you there:

DNS-based techniques. Have front-end web servers running at different IP addresses.
- Neat idea I’ve seen in the field: if all the middleware servers and DBs are only reached via the front end, then you can duplicate IPs and subnets at the DR site, except for the web front end. And use the same private addresses at both sites for the win! This approach may be limiting in some ways as the applications evolve. But it has the virtue of extreme simplicity!
- VMware, full clone, copy the cold image over, change the front-end IPs. But nobody likes doing the change-the-address part: licensing etc. concerns. You could also set up the 2nd site to do NAT from a different IP to the shared IP of the clone. Plus take some care about where the subnet with the duplicated IP gets advertised in routing. (Which might rule out OSPF?)
DCI / stretched L2.
- For some reason, this is popular not only with customers but also with vendors. It may, however, NOT be the greatest idea. (I consider stretching Layer 2 to be a Really, Really Bad Idea, but this is not the place for that debate.)
- If the L2 DCI uses OTV or VXLAN, with BUM controls (rate-limiting defenses) and not using broadcast for unknown unicast, it is likely better than wide-open L2.
- L2 DCI still has risks. I’ve heard of Cisco OTV (back when Nexus 7Ks and 10 Gbps were new) flooding enough BUM traffic to lock up the older 6500 Supervisors attached temporarily to the other end of that. In general, L2, even with VXLAN, means some degree of fate and risk sharing. I.e., the L2 DCI means you could have 2, 3, or more data centers all impacted by some event. L3 won’t do that to you. (Well, it generally won’t, unless you scatter services around and screw up the routing.)
- VMware doing VXLAN – what happens if the CPU gets spun up by a L2 event? And stretching a VMware cluster strikes me as a really good way to have extremely exciting failure modes. As in, VMware itself going down, not just the application(s).
One reason DCI via stretched L2 might be on the table is firewalls (FWs) …
- FW vendors that recommend doing that.
- FW clusters at a site.
- FW clusters across sites.
- I discussed this in a prior blog. Just remember, the word people use after “cluster” is usually not “success”.
Stretched VMware or other cluster:
- Also not a good idea.
- The math doesn’t particularly work out for Long Distance vMotion at scale, possibly not for shorter distances either.
- If doing it with a split cluster, practice failure recovery. When I reviewed what VMware had to say about that, it sure didn’t look like anything you’d want to be doing for the first time in a crisis!
SLB-based: if your app has or can have active / active servers (front ends), sure, load balance (“ADC”) all you want.
But what about DB / backends?
- You do have to deal with DB consistency if using two backend DBs. That can get complex and costly, but that problem does lurk in any solution, including Cloud-based variants.
- There are vendors that will happily sell your app developers / architects either one-way or bi-directional replication software, sometimes for amazingly high prices. Hopefully, they pick something that works and meets well-considered requirements. Things like not losing data (think bank), and time to regain consistency between two DB copies do matter.
- CAP theorem: Consistency, Availability, Partition Tolerance. Need I say more?
Summary:
- The best solution is to design an application to use “swim lanes” and work out of multiple locations, if possible.
- The idea of “swim lanes” is to have an isolated, possibly internally redundant, stack of components capable of delivering the application’s services. It is best to isolate each swim lane from the others, so you do not get complex inter-dependencies. Note that you can still have capacity-induced problems if one swim lane fails and its traffic fails over to another swim lane and brings it down. (Repeat until everything is crushed and/or down.)
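
The capacity-induced cascade is easy to check with a little arithmetic. The Python sketch below is a deliberately simplified model, not a sizing tool: it assumes the total load is spread evenly across whichever lanes are still up, and that a lane fails outright as soon as its share exceeds its capacity. The function name, lane names, and numbers are all hypothetical.

```python
def surviving_lanes(load, capacity, first_failure):
    """Return the set of lanes still up after one lane fails.

    load:     dict of lane name -> current load (any consistent unit)
    capacity: dict of lane name -> maximum load that lane can carry
    Assumption: total load is shared evenly by the lanes that remain up.
    """
    up = {lane for lane in load if lane != first_failure}
    total = sum(load.values())
    while up:
        share = total / len(up)                      # even redistribution
        overloaded = {lane for lane in up if share > capacity[lane]}
        if not overloaded:
            return up                                # stable: everyone fits
        up -= overloaded                             # crushed lanes drop out
    return set()                                     # everything is down


capacity = {"A": 100, "B": 100, "C": 100}

# Each lane at 60% of capacity: losing one leaves 90 per survivor, which fits.
print(surviving_lanes({"A": 60, "B": 60, "C": 60}, capacity, "A"))

# Each lane at 70%: losing one puts 105 on each survivor, and the rest go too.
print(surviving_lanes({"A": 70, "B": 70, "C": 70}, capacity, "A"))
```

Under these assumptions, N equal lanes survive the loss of one only if each normally runs at or below (N-1)/N of its capacity, which is one way to decide how much headroom each swim lane needs.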

Draining Traffic

There is one more consideration for application HA design (or any form of HA design). Well, there are probably many more considerations, but there’s only one more I want to mention here …

If you need to do maintenance on redundant entities, the optimal user experience is one that avoids hiccups or brief non-responsiveness during failover. One way to do this is by “draining” one resource, in the sense of marking it for no more new user connections or traffic, gently shifting the work to the other resource(s). When the last user is off the targeted resource, you then take it offline / mark it unavailable / whatever fits your situation and do the maintenance.
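
As a rough illustration of draining, here is a small Python sketch. It is not how any particular product does it: the `Pool` class, its method names, and the least-connections choice for new sessions are assumptions made for the example.

```python
class Pool:
    """Toy resource pool that supports draining one member for maintenance."""

    def __init__(self, names):
        self.active = {name: 0 for name in names}  # open connections per resource
        self.draining = set()

    def connect(self):
        """Place a new connection on the least-loaded resource not being drained."""
        candidates = [n for n in self.active if n not in self.draining]
        if not candidates:
            raise RuntimeError("every resource is draining")
        target = min(candidates, key=lambda n: self.active[n])
        self.active[target] += 1
        return target

    def disconnect(self, name):
        """An existing user finishes; existing sessions are never cut off."""
        self.active[name] -= 1

    def drain(self, name):
        """Stop sending NEW connections to this resource."""
        self.draining.add(name)

    def safe_to_remove(self, name):
        """True once the resource is draining and its last user has left."""
        return name in self.draining and self.active[name] == 0


pool = Pool(["lb-a", "lb-b"])        # hypothetical resource names
first = pool.connect()               # lands on one of the two resources
pool.drain(first)                    # maintenance planned for that one
print(pool.connect() != first)       # new work goes elsewhere
print(pool.safe_to_remove(first))    # still has a user, so not yet
pool.disconnect(first)
print(pool.safe_to_remove(first))    # last user gone, maintenance can start
```

The design choice worth noticing is that draining only affects new connections. Existing sessions finish on their own schedule, which is what keeps the user experience smooth.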

Without going into specific HA techniques and details, not much more can be said about this. So, we’ll just move on …

Where To Learn More

I’ve learned a lot by listening and reading. I appreciate Ivan Pepelnjak as an intelligent source of independent thinking, and a good cross-check for my review. His (for a fee) training and (free) blog materials are invaluable! They are also the only resources I can think of that cover the alternatives for active-active (A/A) applications. (Excepting vendor materials of the single-solution, “this is the best” type.)

Conclusion

Application High Availability / Resiliency is not simple!

I’ve tried to lay out the major alternatives and at least some high-level pros/cons.

Some real-world experiences:

If your organization writes its own apps, getting a network person into the Agile or DevOps loop for design may be quite a challenge. Even for, say, “monthly catchup and review.” Yet, it might be essential to keep application designs from going down a bad path, and to prevent app/dev teams from shifting hard things into being your (networking) problems. Good luck with that: managers may not appreciate the need for it.
If your organization buys apps and consulting services to deploy them, then the issue is getting into the purchase / due diligence loop. Which might be considerably harder – team rivalries or turf plus executive egos involved. But getting involved might be necessary, because by the time the app is purchased, you’ll be in the “network janitor” role, following dictates, and doing your best to “clean up the mess on aisle 3”.
My experience leads to a fair degree of concern about how deeply the average corporate app purchase cycle gets into the app design and HA aspects prior to purchase.
