Cloud and Latency

Author
Peter Welcher
Architect, Operations Technical Advisor

In the last blog I posted, I wrote about the Cisco CSR and its usefulness in the cloud — and deferred including some thoughts about cloud and latency, since that’s logically a separate topic. For the CSR blog, see https://netcraftsmen.com/blogs/entry/cisco-cloud-services-router-csr.html. Now let’s revisit the idea of OTV to the cloud and “simplify”, and look at a few related matters: OTV, application groupings or containers, and latency.

Is OTV Required for Cloud Migration?

One point to consider is whether you actually need OTV to migrate virtual machines (VMs) to the cloud. The answer is NO. First, you can do “cold dead VMotion” (my term for it), i.e. halt the VM and move the virtual disk file, the VMDK. Second, you may in fact have to do that, since distance, latency, and timeout constraints limit “live” VMotion. So OTV might be convenient if the cloud location is close enough — you do have control over that, don’t you? (Not really — it depends on your cloud provider, or is it more of an equipment rental service?)
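
As a rough sanity check before planning live VMotion, a quick latency measurement to the candidate cloud or DR site can tell you which camp you’re in. Here’s a minimal Python sketch that uses TCP connect time as a crude round-trip proxy and compares it to a latency budget; the hostname and the 10 ms figure are placeholder assumptions, not official VMware numbers, so check your own vendor and licensing constraints.

```python
# Sketch: is this site close enough for live migration, or do we plan on
# "cold dead VMotion" (power off, copy the VMDK)? The target host and the
# latency budget below are assumptions for illustration only.
import socket
import time

LATENCY_BUDGET_MS = 10.0   # assumed budget; substitute your real limit
TARGET = ("vcenter.remote-site.example.com", 443)   # hypothetical remote host

def tcp_rtt_ms(host, port, samples=5):
    """Average TCP connect time in milliseconds (a crude RTT proxy)."""
    total = 0.0
    for _ in range(samples):
        start = time.perf_counter()
        with socket.create_connection((host, port), timeout=3):
            pass
        total += (time.perf_counter() - start) * 1000.0
    return total / samples

if __name__ == "__main__":
    rtt = tcp_rtt_ms(*TARGET)
    if rtt <= LATENCY_BUDGET_MS:
        print(f"~{rtt:.1f} ms: live migration may be feasible")
    else:
        print(f"~{rtt:.1f} ms: plan on halting the VM and moving the VMDK")
```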

So until VMware comes up with “super-long distance VMotion” (i.e. some variant that works around the globe, maybe by putting the VM into a sort of zombie sluggish state…), that’s one latency consideration we have to bear in mind. [And from the terminology, probably an indication I’ve been reading too many paperback “supernatural fantasy” fiction books lately.]

Applications and Latency

I have latency on my mind lately, from doing application performance troubleshooting. There are just too many applications that are oblivious to the network, written with the assumption of high-speed LAN connectivity. Split the application server front end from the database across even a local dark fiber link, or split the web front end from the application server, add some latency, and the application becomes sluggish or worse. For that matter, DNS, MS Kerberos, and LDAP all matter too: if every certificate or credential check is slow, that slows down the application (or at least logins and clicks on related links, depending on application structure). And if you use 10.x.x.x addressing and don’t resolve all reverse lookups for network 10 internally, congestion on your Internet link can show up as … puzzling application or login slowness.
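
If you suspect the DNS angle, timing a forward and a reverse lookup from the affected server is a quick way to confirm or rule it out. A minimal sketch, with a hypothetical hostname and 10.x address:

```python
# Time forward and reverse DNS lookups; slow or failing reverse lookups for
# 10.x.x.x addresses are one source of "puzzling" application slowness.
# The hostname and address are placeholders.
import socket
import time

def timed(fn, *args):
    """Return (elapsed_ms, succeeded) for a single lookup call."""
    start = time.perf_counter()
    try:
        fn(*args)
        ok = True
    except OSError:
        ok = False
    return (time.perf_counter() - start) * 1000.0, ok

fwd_ms, fwd_ok = timed(socket.gethostbyname, "appserver01.example.com")
rev_ms, rev_ok = timed(socket.gethostbyaddr, "10.1.2.3")

print(f"forward lookup: {fwd_ms:.1f} ms ({'ok' if fwd_ok else 'FAILED'})")
print(f"reverse lookup: {rev_ms:.1f} ms ({'ok' if rev_ok else 'FAILED'})")
```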

Ok, you get the picture: latency BAD for most applications. DNS and authentication server delays (plus latency) also BAD.

AppPod or vApp Server VM Containers

Latency is a potential issue with the cloud. Suppose you have an application that formerly had all its servers in one datacenter, and you move some but not all of them to the cloud. The latency could be a show-stopper, depending on how the application was written. If you group the VMs that make up the application into an “AppPod” or vApp (VMware’s technology for doing so), then consider moving the AppPod as a unit, not just individual components. When you can do that, latency is less likely to be a problem, especially if key authentication and database services are available in the cloud as well. What may interfere with that is SOA architectures with inter-application RPC calls. It appears that leveraging SOA is tying apps and servers together into larger and larger random mazes of queries. Is that a form of entropy? Does some structure need to be imposed on inter-application RPC calls?
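
One way to make “move the AppPod as a unit” concrete is to describe the grouping as data, so the migration decision covers the whole pod and its dependencies rather than individual VMs. The sketch below is illustrative only; it is not VMware’s vApp format, and the names are made up.

```python
# Illustrative AppPod description: the VMs that must stay together, plus the
# low-latency services the application assumes are nearby.
from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class AppPod:
    name: str
    vms: List[str]                                        # move these together
    depends_on: List[str] = field(default_factory=list)   # services assumed local

ordering = AppPod(
    name="order-entry",
    vms=["web01", "app01", "app02", "db01"],
    depends_on=["dns", "ldap", "kerberos"],
)

def safe_to_move(pod: AppPod, services_at_target: Set[str]) -> bool:
    """Moving the pod only makes sense if its low-latency dependencies
    are also available at the target site (cloud, other datacenter, ...)."""
    return all(dep in services_at_target for dep in pod.depends_on)

print(safe_to_move(ordering, {"dns", "ldap", "kerberos", "smtp"}))   # True
```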

There’s a delicious irony, or perhaps tension, here. The whole point of “any VLAN anywhere” and “lots of East-West bandwidth in your datacenter” is not having to know where your VMs and servers are located within the datacenter. And if you think about it, that’s pretty much the assumption behind SOA architectures, with applications or components offering services to each other across former application boundaries.

When it comes to the WAN, you do need to know where your servers are. So which is going to win, awareness of app components with latency sensitivity, or ignorance? I think I know which one I’m betting on, due to the “I don’t have enough time” factor. And entropy, as in “increased disorder or randomness”.

The Game of Hunt the Server and Its Switch Port(s)

This already impacts network personnel in some nasty ways. We’re at a stage where the app/server folks usually can’t tell us which servers make up an application. When the application breaks, someone (a network person?) gets to spend 1-2 days figuring out which servers constitute the application, after first discovering that nobody can tell them, or that the information they do have is wrong or out of date. Yes, OPNET AppMapper can help if you can afford it, but it still takes time to do the mapping. If you have the money to buy a good mapping tool, and/or the staff time, it definitely helps MTTR to nail down the servers and major flows, document them (internal wiki, anyone?), and keep the information current.

Let’s take a short sidetrack here to talk about physical cabling, and the impact of not knowing which servers make up an application service.

Not knowing the servers involved makes tracking down physical cable or port misconfiguration errors challenging. I’d personally prefer not to have to do that. But most sites don’t actually monitor all active datacenter switch ports. You either track all ports and fix the problem ones (easily spotted) when they show up, or you end up having to hunt them down (laborious!) and then fix them anyway. The problem can be physical layer (bad patch cable, bad port), or it can be a duplex or port-channel mismatch, i.e. inconsistent configuration between server and switch.
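
For what it’s worth, “track all ports” doesn’t have to mean a big product. Here’s a rough sketch that polls Cisco-style switches for nonzero interface error counters; it assumes the netmiko library, a Cisco IOS-style CLI, and deliberately naive parsing of “show interfaces”. Device names and credentials are placeholders.

```python
# Rough sketch: flag any active interface reporting input/output errors,
# so problem ports get spotted before the application breaks.
import re
from netmiko import ConnectHandler   # assumes netmiko is installed

SWITCHES = ["dc-access-sw1.example.com", "dc-access-sw2.example.com"]

def ports_with_errors(host):
    conn = ConnectHandler(device_type="cisco_ios", host=host,
                          username="netops", password="********")
    output = conn.send_command("show interfaces")
    conn.disconnect()
    problems = []
    current = None
    for line in output.splitlines():
        name = re.match(r"^(\S+) is .*, line protocol", line)
        if name:
            current = name.group(1)
        errors = re.search(r"(\d+) (input|output) errors", line)
        if errors and int(errors.group(1)) > 0 and current:
            problems.append((current, line.strip()))
    return problems

for switch in SWITCHES:
    for port, detail in ports_with_errors(switch):
        print(f"{switch} {port}: {detail}")
```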

A recent consulting engagement led to the thought of actually validating server connections. You know, not just two tickets, server guys and network guys doing their thing independently, link light green, must be good to go.

Instead, how about having port profiles (as in a small number of pre-tested server NIC setups, driver versions, teaming settings, etc.), and validating the patching afterwards for no errors, no MAC flapping, and so on? You could even coordinate on good switch port descriptions, complete with server name, the major application it belongs to, and a freshness timestamp (since port descriptions frequently do not get updated or re-validated). I’d sure prefer to have trusted port descriptions … Looking at the ARP table and then playing “follow the MAC address” gets old, fast. Especially if you’re trying to fix the application problem in hours, not days, and don’t have a MAC mapping tool (CiscoWorks / Prime LMS, NetMRI, etc.).
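
To show what a “trusted port description” check might look like, here’s a small sketch built around a made-up convention of server-name|application|YYYY-MM-DD in the description field; it flags anything that doesn’t follow the convention or has gone stale. The convention and the 180-day window are my own assumptions, not a standard.

```python
# Validate switch port descriptions against an assumed
# "server|application|YYYY-MM-DD" convention with a freshness window.
from datetime import date, timedelta

MAX_AGE = timedelta(days=180)   # assumed freshness window

def check_description(port: str, desc: str) -> str:
    parts = [p.strip() for p in desc.split("|")]
    if len(parts) != 3:
        return f"{port}: description does not follow server|app|date convention"
    server, app, stamp = parts
    try:
        when = date.fromisoformat(stamp)
    except ValueError:
        return f"{port}: bad or missing freshness date"
    if date.today() - when > MAX_AGE:
        return f"{port}: description for {server}/{app} is stale ({stamp})"
    return f"{port}: ok ({server}, {app})"

print(check_description("Gi1/0/10", "web01|order-entry|2013-02-01"))
print(check_description("Gi1/0/11", "uplink to core"))
```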

Clouds and Latency

I do think cloud migrations will take some knowledge and planning — or be subject to surprises. That was the subject of my prior article about cloud and latency, Pondering Clouds, at https://netcraftsmen.com/blogs/entry/pondering-clouds.html. Looking back, I talked about a “service delivery aggregate” or “AppPod”. VMware’s term vApp works for me as well. It’s a container bundling together a bunch of VMs.

Back to my main theme of applications and not knowing … as VMotion triggered by humans or VMware DRS increases, we may see intermittent performance issues due to cabling or configuration. The Cisco Nexus 1000v at least takes most of the switch-to-server configuration mismatch problems off the table. But “catching the VM” while it’s on a hypervisor host connected via the problem port … might be fun, as in challenging.

Add in OTV or VXLAN (which I intend to blog about soon) and inter-datacenter VMotion, and you have a broader range of possible woes: some placements of the AppPod’s VMs lead to slowness, yet when the VMs are spread across the two datacenters in other patterns, the app works fine.

Consider what happens when you migrate some components to the cloud and the app slows or stops working. How are you going to troubleshoot that? Is it a physical or configuration issue? Or is it latency? How can you determine that? How long will it take? What VMs work together to deliver the application service? Where do they live?
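
One first-pass way to separate “the network added latency” from “the application itself is slow” is to compare raw TCP connect time (a rough round-trip proxy) with the time for a complete request to the same server. A minimal sketch, with a hypothetical host, port, and URL path:

```python
# Compare connection setup time with full request/response time to get a
# feel for whether the network or the server is adding the delay.
import http.client
import socket
import time

HOST, PORT, PATH = "app01.example.com", 8080, "/healthcheck"

start = time.perf_counter()
with socket.create_connection((HOST, PORT), timeout=5):
    pass
connect_ms = (time.perf_counter() - start) * 1000.0

start = time.perf_counter()
conn = http.client.HTTPConnection(HOST, PORT, timeout=10)
conn.request("GET", PATH)
conn.getresponse().read()
conn.close()
request_ms = (time.perf_counter() - start) * 1000.0

print(f"TCP connect: {connect_ms:.1f} ms, full request: {request_ms:.1f} ms")
print("network latency dominates" if request_ms < 2 * connect_ms
      else "server-side processing dominates")
```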

Heck, I’m seeing those problems now, with some components that went into production in the DR/dev datacenter. In one case, with an address in a subnet that was supposed to exist only in the production datacenter. (If you think you know where the subnet is, would you use traceroute?)

Hmm, perhaps I should buy stock in firms that provide “application mapping” tools.

Summing Up

Any VLAN anywhere means you can get by with little knowledge of app structure. Cloud deployment strikes me as the other end of the spectrum — you need pretty good knowledge of your application’s behavior patterns, major traffic flows, etc.

I don’t have a complete set of answers, but I do have some ideas.

As the complexity of troubleshooting increases, I personally would prefer to document every critical application. Yes, I know, subject to a reality check. Here are the things I’d like to know before I have to troubleshoot with senior management breathing down my neck (a minimal inventory sketch follows the list):

  • For each key application, what are the 10 or 20 servers that work together to deliver the application?
  • What critical services (other AppPods or vApps) does the application depend on? (And can you check that somehow? I’ve heard of slowness caused by a decommissioned DNS server listed first for DNS resolution on a key server.)
  • Roughly what does each server in the application do?
  • What are the IP addresses and DNS names of the servers?
  • If VMs are in use, I like the idea of having informal groups of hypervisor hosts, so that I at least know which set of physical servers and ports a given VM is to be found on.
  • I also like the idea of pro-active management of ports and applications — meaning you don’t buy one of those products that costs $1M+ just to “manage” one or two applications. Manage everything with alerting, including for slow responses! Turn on server logging where it doesn’t adversely affect performance or disk space.
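
To make that concrete, here is the minimal inventory sketch promised above: one record per critical application, answering the questions in the list. The field names and values are illustrative only.

```python
# Illustrative application inventory: the servers, dependencies, hypervisor
# group, and a review date, per key application.
application_inventory = {
    "order-entry": {
        "servers": {
            "web01": {"ip": "10.1.10.11", "role": "web front end"},
            "app01": {"ip": "10.1.20.11", "role": "application server"},
            "db01":  {"ip": "10.1.30.11", "role": "database"},
        },
        "depends_on": ["dns", "kerberos", "ldap"],   # other AppPods / services
        "hypervisor_group": "cluster-A",             # which physical hosts/ports
        "last_reviewed": "2013-02-01",               # keep it current!
    },
}

# Example use: list the servers someone would otherwise spend 1-2 days finding.
for app, record in application_inventory.items():
    servers = ", ".join(record["servers"])
    print(f"{app}: {servers} (reviewed {record['last_reviewed']})")
```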

My other tentative conclusion is that AppPods or vApps are the simplest way to reduce what you have to know about your applications.

Note we haven’t talked about firewalls, load balancers, and how they factor into the mix. Ivan Pepelnjak has some blogs about “tromboning”. I see this as a key design consideration as well when doing L2 over L3 to split subnets across datacenters (Data Center Interconnect, or DCI), whether between your own datacenters or out to the cloud. Some recent Cisco announcements provide alternatives on this front, and are something I intend to blog about soon!
