DevOps, Networking, and Clouds, Oh My!

Author
Peter Welcher
Architect, Operations Technical Advisor

TL;DR: I’m a technical omnivore, reading content across typical boundaries. The following is a tale of caution: developing apps for the cloud is different from developing them for the data center.

Let me also note I’ve had a few experiences being brought in because “the network is slow” turned into either “which device is slow” or “what about the app makes it slow”. So, I’ve been watching cloud performance discussions.

I’ve also ranted (well, blogged) before: it seems that DevOps teams frequently do not include network persons. Nor security people, but that’s a separate topic.

After all, the perception is that any good developer can work with a GUI, Terraform, or an API to stand up things in the cloud, upload tools and code, and start agile sprints.

Whoa, not so fast there.

In our consulting, some colleagues and I have found that developers who think like networking people are scarce. By that I mean people who know and document the network flows, and who think about latency and the like. Ditto security.

Not having that knowledge can be expensive!

Apropos of that, I once did some work to “prove” a costly app upgrade ($1M-plus) significantly improved performance with far less network traffic. Testing left us puzzled, since as we scaled up the number of simulated users, the amount of traffic plateaued. We never did get it resolved, in part because the developers were only able to give me a crude description of the traffic flows, and my centralized Wireshark captures weren’t helping. Maybe we were missing some servers, server interfaces, or an app component. Or there was some caching somewhere due to scaling up via replicated test scripts. Or the test script wasn’t really doing concurrency. We tried to hunt the gap down, but time and funding ran out. The overall cost was a lot of time expended for everyone involved, with somewhat suspect results.

Networking (and security) skills and thought processes may not be needed all the time in DevOps but are best applied up front in the design stages to avoid costly rework.

Initially, all seems simple. Carve up the app you’re building, hopefully in minimal viable form initially. Code away, hopefully with good modularization or decomposition into services. Catch problems and improve as you go.

Some data points hint at the problem I think I see:

  • Increasing numbers of documents about cloud optimization and reducing cost or improving performance.
  • Observations from some consulting and troubleshooting I, co-workers, and other colleagues have done.
  • Edge is slower taking off than pundits expected. (Just saw an article about it.) That may instead point at a different problem, however. See below.

By the way, I don’t claim to know it all here. I’m an observer, with a few data points, trying to make some educated guesses. If that gets you thinking or gets a conversation going, great! (Hint: you can find me on LinkedIn, Twitter (for now), or Mastodon. Post coming soon, but I intend to use it only for non-technical / political stuff.)

What’s Going On?

My guess is that “Lift and Shift” is still going on in a lot of shops, due to “use the cloud” mandates. Shifting VMs or large containers or large complex services code bundles to one cloud (“rented servers”) is easier and faster than native cloud re-development, especially if the team is short on native cloud experience and skills.

What’s wrong with that? Maybe nothing.

You do what you have to do, and part of it is just building cloud management and operations skills and getting real data on costs. The biggest win may be that the cloud is more reliable than your data center (or server room) was. And another win is not having to deal with buying and provisioning and managing physical equipment, etc. in the data center. That matters these days with long equipment lead times.

What’s the Problem?

The potential problem is the cloud has different physical characteristics than a data center.

The laws of physics, plus economics, should be factored into cloud design choices.

But they often only get considered when an application is unusably slow, or a huge monthly traffic bill comes in at month’s end. At which point “optimization” kicks in.

CLAIM: A degree of pre-optimization might be wise, saving time and money. That is, develop with cloud characteristics in mind.

So, what are these cloud characteristics?

  • Egress traffic costs money in most cases. E.g., backup in the cloud might not be great, because restore may take forever and cost a fortune in egress traffic fees.
  • Network latency and packet loss matter.
  • So, application “turns” matter.

Let’s take those one at a time.

Egress Traffic Volume Costs Money

Suppose you have application components developed by different teams in different clouds, A and B.

If you run say a naïve search on a local database, the application may examine a lot of records. If that entails moving all that data from A to B, you can run up a big bill.
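A back-of-the-envelope sketch of how such a bill adds up follows. The per-GB rate, row size, and query volumes are all placeholder assumptions chosen for illustration, not any provider's actual pricing; substitute your CSP's published egress rate.

```python
# Rough egress cost estimate. PRICE_PER_GB is a hypothetical placeholder,
# not a quoted rate from any cloud provider.
PRICE_PER_GB = 0.09  # assumed USD per GB leaving cloud A


def monthly_egress_cost(rows_per_query: int, bytes_per_row: int,
                        queries_per_day: int,
                        price_per_gb: float = PRICE_PER_GB,
                        days: int = 30) -> float:
    """Estimated monthly cost (USD) of shipping query data out of a cloud."""
    bytes_per_month = rows_per_query * bytes_per_row * queries_per_day * days
    return bytes_per_month / 1e9 * price_per_gb


# Naive search: ship 1,000,000 rows of 500 bytes to cloud B, 10,000 times a day.
naive = monthly_egress_cost(1_000_000, 500, 10_000)
# Filter in cloud A first: ship only 50 matching rows per query.
filtered = monthly_egress_cost(50, 500, 10_000)
print(f"naive: ${naive:,.2f}/month, filtered: ${filtered:,.2f}/month")
```

With these made-up numbers the naive approach moves about 150 TB a month, while the filtered one moves about 7.5 GB. The exact dollar figures matter less than the ratio between them.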

From quite a while ago I recall an amazing speed-up (1000x faster) in a small app, by using a stored procedure, rather than fetching and testing one DB row at a time. The point being that matches were found on the server/DB side and only the matches returned (in bulk) to the application. Rather than the application fetching a row at a time and checking if it matched.

Well, with clouds it is similar but potentially even more impactful.

So one winning principle is: do your processing with local data, and just return results, if possible.
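As a minimal sketch of that principle, the following uses Python's built-in sqlite3 module with an in-memory table standing in for the remote database. SQLite runs in-process, so nothing actually crosses a network here; the row counts simply show how much data each approach would have to move. The table name, columns, and threshold are invented for the example.

```python
import sqlite3


def build_db(n_rows: int = 10_000) -> sqlite3.Connection:
    """Create an in-memory table standing in for the remote database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER)")
    conn.executemany("INSERT INTO orders (id, amount) VALUES (?, ?)",
                     ((i, i % 1000) for i in range(n_rows)))
    return conn


def client_side_filter(conn: sqlite3.Connection, threshold: int):
    """Pull every row to the application, then test each one there."""
    rows = conn.execute("SELECT id, amount FROM orders").fetchall()
    matches = [r for r in rows if r[1] >= threshold]
    return matches, len(rows)  # second value: rows that had to be moved


def server_side_filter(conn: sqlite3.Connection, threshold: int):
    """Let the database do the matching and return only the hits."""
    rows = conn.execute(
        "SELECT id, amount FROM orders WHERE amount >= ?", (threshold,)
    ).fetchall()
    return rows, len(rows)


conn = build_db()
_, moved_client = client_side_filter(conn, 990)
_, moved_server = server_side_filter(conn, 990)
print(f"rows moved: client-side {moved_client}, server-side {moved_server}")
```

Both functions find the same 100 matches, but the client-side version has to move all 10,000 rows to get them.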

A related principle is that calls to get data from outside a given CSP or data center should ideally not transfer a lot of data. As an example, if someone just bought something, and you have a third-party cloud app that sends email, SMS etc. receipts, one would hope you only have to invoke their API with a small number of calls and small amount of data. If so, if that service runs in another cloud, no big deal. (Presuming the clouds are in the same global region, perhaps.)

I’ll note that you can’t/shouldn’t just have copies of the data all over the place. Storage and synchronization also cost. And synchronization can get very complex or fragile. People are talking about edge and putting data near where it is created or used. My point here is that this is not a no-brainer!

Conclusion: a strategy is needed for which data is kept where, so that there is one copy (or a small number of copies) and large data transfers are minimized. Basically, maybe wrap common queries in some API and resolve the query locally, returning a relatively small answer (a few bytes)?

Caching may also be part of an answer. If a lot of the data fetched is re-used over and over, caching might help. Unique per-customer data, probably not so much.
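A small sketch of the idea, using Python's standard functools.lru_cache. The fetch_from_remote function is a hypothetical stand-in for an expensive cross-cloud lookup, and the counter exists only to show how many trips the cache saves.

```python
from functools import lru_cache

stats = {"remote_calls": 0}  # counts simulated trips to the other cloud


def fetch_from_remote(product_id: int) -> str:
    """Hypothetical stand-in for an expensive cross-cloud lookup."""
    stats["remote_calls"] += 1
    return f"product-{product_id}"


@lru_cache(maxsize=1024)
def get_product_name(product_id: int) -> str:
    """Cached lookup: repeated ids are answered locally."""
    return fetch_from_remote(product_id)


# Shared catalog data is requested over and over, so most lookups hit the cache.
for pid in [1, 2, 1, 1, 2, 3]:
    get_product_name(pid)
print(stats["remote_calls"], get_product_name.cache_info())
```

Six lookups cost three remote calls here because only three distinct ids were requested. If every id were unique, as with per-customer data, the cache would save nothing and would still consume memory. A real cache would also need an expiry or invalidation policy, which this sketch omits.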

I have a sneaking suspicion that to prevent “data fragmentation” leading to replicated database copies or partial copies (a data integrity nightmare?), choosing a single CSP to hold the data and functions interacting with the data might be useful. But that’s getting off topic, and I lack hard data/experience seeing what sites are doing around that.

Network Latency and Packet Loss Slow Things, Times N

Companies with a presence in ASIAPAC have been seeing this for years. Apps that work fine in the U.S. have been sluggish at best for “follow-the-sun” staff in India, China, etc. Various organizations have dealt with this in different ways. Some of which unfortunately involve deafness to the discontent of the remote staff.

The cloud version of this is that a web page or app becomes slow because the main item is running in Cloud A but getting data from Cloud B.

A given query might go back and forth many times. For example, a listing of financial transactions might fetch them one at a time, rather than in far fewer, larger packets.

For that matter, a web app can easily pull in data from 100, 200, or more other URLs. Multiply each such data request times the latency and you’ll see why your web page takes 12 seconds to paint.
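To put rough numbers on that, here is a deliberately crude model. It assumes every request costs exactly one round trip and ignores DNS, TLS setup, transfer time, and server think time. The 60 ms round-trip time and the limit of six parallel requests are assumptions picked for illustration.

```python
import math


def page_fetch_time(n_requests: int, rtt_ms: float, parallel: int = 1) -> float:
    """Seconds spent waiting on latency alone.

    `parallel` is how many requests are in flight at once, so the
    requests complete in ceil(n_requests / parallel) waves.
    """
    waves = math.ceil(n_requests / parallel)
    return waves * rtt_ms / 1000.0


# 200 sub-requests at an assumed 60 ms round trip, strictly one after another.
print(page_fetch_time(200, 60))              # 12.0
# Same page with six requests in flight at a time.
print(page_fetch_time(200, 60, parallel=6))
```

Parallel fetching helps a great deal, but the number of waves still multiplies the latency, so trimming the request count or moving the data closer both pay off.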

(The other URLs can be surprising, by the way. I once found a hospital system contacting MLB.com (baseball). It turns out MLB did early streaming video and licenses code etc.)

Application Turns Hurt

My analogy here is getting an answer from maybe a four-year-old.

  • “How was your day?”
  • “Fine.”
  • “What did you do?”
  • “I played.”
  • “Who did you play with?”
  • “Jimmie.”
  • (etc. – conversation starts to feel like you’re pulling teeth.)

Compare to:

  • “How was your day?”
  • “Great! Jimmie and I grabbed a ball and played catch, rode on the swings, and then we chased some frogs in the puddle in the back yard.”

In application/DB terms, the first is like fetching one matching row at a time, versus triggering a dump of all the matching rows as a reply.

The technical version of that: what could be 1000 turns x 10 msec of latency per turn (acknowledgements in TCP) = 10 seconds, versus a lot less: one round-trip time plus the time it takes to transmit the 1000 rows, which depends on the bandwidth and the server/DB speed. That might be less than 1 second.
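The arithmetic can be sketched as a tiny model, treating the 10 msec as the latency cost of each turn. The row size (200 bytes) and link speed (100 Mbps) are assumptions added for illustration, and server/DB processing time is ignored.

```python
def transfer_time(turns: int, rtt_s: float,
                  payload_bytes: int, bandwidth_bps: float) -> float:
    """Seconds for an exchange: latency per turn plus time to push the bytes."""
    latency = turns * rtt_s
    serialization = payload_bytes * 8 / bandwidth_bps
    return latency + serialization


ROWS = 1000
ROW_BYTES = 200          # assumed average row size
LINK_BPS = 100e6         # assumed 100 Mbps path
payload = ROWS * ROW_BYTES

chatty = transfer_time(turns=ROWS, rtt_s=0.010,
                       payload_bytes=payload, bandwidth_bps=LINK_BPS)
bulk = transfer_time(turns=1, rtt_s=0.010,
                     payload_bytes=payload, bandwidth_bps=LINK_BPS)
print(f"row at a time: {chatty:.3f} s, single bulk reply: {bulk:.3f} s")
```

Under these assumptions the chatty version takes about 10 seconds and the bulk version a few hundredths of a second. The bytes moved are identical; only the number of turns differs.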

By the way, one indication this is happening is when you bump up the bandwidth and the speed of getting a full response doesn’t change much. Latency is based on speed of light plus other delays, so is relatively bandwidth neutral. Ok, server turnaround, etc. also contribute.

Note this is not a no-brainer. If you design to do one turn but spew a lot of data in the reply, then you might be winning on latency and losing on cost (bytes out).
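One way to see the tradeoff is to score each design on both time and bytes out. The sketch below does that for three hypothetical plans. The egress rate, link speed, and payload sizes are made-up assumptions, not measurements or real pricing.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    name: str
    turns: int       # request/response round trips
    bytes_out: int   # bytes leaving the cloud that holds the data


def evaluate(plan: Plan, rtt_s: float, bandwidth_bps: float,
             price_per_gb: float):
    """Return (seconds, dollars) for one execution of the plan."""
    seconds = plan.turns * rtt_s + plan.bytes_out * 8 / bandwidth_bps
    dollars = plan.bytes_out / 1e9 * price_per_gb
    return seconds, dollars


plans = [
    Plan("row at a time, matches only", turns=1000, bytes_out=200_000),
    Plan("one turn, matches only", turns=1, bytes_out=200_000),
    Plan("one turn, dump everything", turns=1, bytes_out=200_000_000),
]
for p in plans:
    # Assumed: 10 ms round trip, 1 Gbps path, hypothetical $0.09/GB egress.
    s, d = evaluate(p, rtt_s=0.010, bandwidth_bps=1e9, price_per_gb=0.09)
    print(f"{p.name:30s} {s:8.3f} s   ${d:.6f} per call")
```

With these numbers, dumping everything in one turn beats the row-at-a-time plan on time but moves a thousand times the bytes. The middle plan, one turn returning only the matches, wins on both axes, which is the design to aim for when the filtering can be done where the data lives.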

Optimizing an App

I have limited experience with this, since I’m not a production coder. Quite a while back, I did write a full Lotus 1-2-3 clone single-handedly (in C, with smaller executable code than Lotus had at the time). Along the way I ended up realizing screen IO was killing performance, wrote maybe 1-2 pages of assembler code, and solved the problem.

The same may apply to cloud code.

That is, you may not have to optimize the whole app (although it can’t hurt). Fixing the one or two top performance killers may suffice. That seems to fit the theme of this blog. If your developers are aware of the above, then you may only have a few remaining performance hogs to cure along the way.

Tools

Which leads to tools (he says, rapidly changing the subject…)

Some of the tools I’ve recently blogged about, such as CatchPoint, ThousandEyes, etc., can help you detect the problems above. Basically, anything that shows you the flows and turns, and summarizes the data.

“RUM”-like tools for cloud, in other words.

And what can be very helpful for a web-based app is seeing all the calls a given web page makes, and which take the longest.
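Even without a commercial tool, a browser's developer tools can export a page load as a HAR (HTTP Archive) file, which is JSON listing each request with its total elapsed time in milliseconds. A minimal sketch for ranking the slowest calls follows; the sample data and URLs are invented for the example.

```python
import json


def slowest_requests(har: dict, top_n: int = 5):
    """Return [(url, milliseconds)] for the slowest entries in a HAR capture."""
    entries = har["log"]["entries"]
    timed = [(e["request"]["url"], e["time"]) for e in entries]
    return sorted(timed, key=lambda item: item[1], reverse=True)[:top_n]


# Invented sample in HAR shape; a real capture would be loaded with
# json.load(open("page.har")) where "page.har" is a hypothetical file name.
sample_har = json.loads("""
{"log": {"entries": [
  {"request": {"url": "https://app.example.com/index.html"}, "time": 120.5},
  {"request": {"url": "https://video.example.net/snippet.mp4"}, "time": 840.0},
  {"request": {"url": "https://cdn.example.org/logo.png"}, "time": 35.2}
]}}
""")

for url, ms in slowest_requests(sample_har, top_n=2):
    print(f"{ms:8.1f} ms  {url}")
```

This only ranks individual requests by elapsed time. It does not show which requests block others, which is where a waterfall view in the tools mentioned above earns its keep.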

One example of that provided by CatchPoint is a downloaded streaming video snippet, not normally a problem but slow when the local regional anycast source is down. Some possible solutions: omit the video or reduce the resolution so as to transfer fewer bytes.

What’s With Edge

My guess is there’s been a lot of “lift and shift” versus cloud native app development, especially among large slow-moving or smaller organizations. Looking for a quick win (cost or robustness or outsource some of the support/maintenance burden). Also looking to build cloud experience and test its usefulness.

That would imply that edge is going to draw organizations looking for a competitive edge, or those with (probably new) products requiring the low latency of edge and edge-based apps.

I see a skills-building spectrum here. First, get into the cloud, as managed servers, databases, and services. Then do cloud native. THEN maybe do edge, where arguably advanced cloud app knowledge is required, plus perhaps broader deployment. (I.e. management has to really believe in cloud and commit to the potential win.)

So there are steps of management decision and then human skills building plus development, each of which might be measured in years. Hence, edge is coming, just not that fast yet.

Another factor might be economy: expecting recession, firms are not emphasizing bold new ventures.

Other Links

The following Kentik blogs were timely. Other synthetic/RUM web testing tools such as CatchPoint, ThousandEyes, AppNeta, and others may provide similar network data and even selected cloud data. Kentik has some interesting features re cloud and Kubernetes, but with all that functionality it can be very pricey if you don’t need all of those features.

https://www.kentik.com/blog/maximizing-application-performance-extract-practical-data-from-your-network/

https://www.kentik.com/blog/a-guide-to-cloud-monitoring-through-synthetic-testing/

https://www.kentik.com/blog/data-gravity-in-cloud-networks-massive-data/

And more recently:

https://www.kentik.com/blog/data-gravity-in-cloud-networks-achieving-escape-velocity/

As I was drafting this blog, Tom Nolle (Network World) wrote two articles about this broad topic:

https://www.networkworld.com/article/3686096/a-new-role-for-network-pros-application-flow-architect.html

https://www.networkworld.com/article/3687736/why-network-pros-need-a-seat-at-the-application-planning-table.html#tk.rss_all

I’ve been doing some paid writing for CatchPoint, and their “RUM”-like tools look like they’d be darn useful for catching some if not all of the cloud app performance issues mentioned above.

Conclusion

There are things a developer can get away with in a data center that just won’t work in the cloud. Considering the message flows back and forth and looking out for the items above can help avoid having to rework slow app components, possibly taking a completely different approach.

I also claim that unless you have a plan for which data (and services / APIs to access that data) live in which cloud, you’re going to end up with duplication of data, redundant code, and a big headache.

When designing a layout for data and access query function calls, maybe one key criterion would be carving things up so the function calls return relatively small amounts of data, in a stream.

What this (overly long?) article did NOT cover was high availability, right-sizing how you spin up extra compute etc. instances, and that form of cloud optimization. All topics for another blog or two!
