How Will Cloud Impact Network Engineers?

Author
Peter Welcher
Architect, Operations Technical Advisor

A reader commented on a prior blog, suggesting I write about the title question. I liked the idea, and the resulting words (thoughts) follow.

Suggestions for other topics are always welcome (best: via Twitter or LinkedIn comments). The best are ones where I smack my forehead and say, “gee, I wish I’d thought of that.”

So: How will the Cloud impact network engineers? What changes?

I’ll do my best to prognosticate. The following will be Best Effort, and my crystal ball may be cloudy on some of this. I’m a bit surprised at the length of this blog; it turned out I had a lot more to say on this topic than I thought I did.

This blog ties closely to a prior blog, The Changing Network World. That previous blog was more equipment, and network design focused. This one may repeat or ignore some of that. The focus here is more on the impact on staff.

TL;DR

  • More cloud, less data center
  • Documentation, what documentation?
  • New security challenges
  • Some low-end skills or tasks are mostly automated away or made more efficient, but there’s lots more for such staff to do. This includes the documentation that never seems to get done.
  • More advanced skills are still valuable.
  • Trade-off: Will need skills driving the automation tools (DNAC, ACI, ISE, etc.)
  • NaaS and outsourcing some more-specialized skills a possibility

Data center

The data center is clearly impacted by Cloud in a big way. Probably the most impacted, at least initially.

Cloud diverts some hardware sales and deployment work to setting up cloud objects via GUI or other means. This affects VARs, consulting work, and related revenue sources. But it also affects design, skills, and awareness. And security.

I’m seeing many (perhaps most) sites where the networking and security teams were not involved in cloud design. Not a good idea! Organizations need to somehow fit those teams into Agile/DevOps work, or they may get unpleasant surprises (starting with duplicate IPs or subnets).
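A quick pre-deployment check can catch the duplicate-subnet problem before it ships. Here is a minimal sketch in Python (standard library only); the address lists are invented examples standing in for whatever your IPAM and the CSP actually report:

```python
# Minimal sketch: flag proposed cloud CIDRs that overlap existing allocations.
# The address lists below are invented examples; in practice you would pull
# them from your IPAM and from the CSP (CLI/SDK/export).
import ipaddress

existing_allocations = ["10.10.0.0/16", "10.20.0.0/16", "172.16.0.0/20"]
proposed_cloud_cidrs = ["10.20.8.0/21", "10.30.0.0/16"]

for proposed in proposed_cloud_cidrs:
    proposed_net = ipaddress.ip_network(proposed)
    for existing in existing_allocations:
        if proposed_net.overlaps(ipaddress.ip_network(existing)):
            print(f"CONFLICT: proposed {proposed} overlaps existing {existing}")
```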

This ties to something I’ve seen over the years: server/app folks sometimes learn a bit of networking and think that’s all one needs. That’s not intended as criticism, just a statement that until you’ve experienced the various complexities of networking and scaling networks, they are not at all obvious.

So, the server/app folks can subnet and set up static routes, maybe basic BGP or basic routing. What’s the problem?

Well, that can miss the point about design, network best practices, scalable practices, etc. It can also leave the network team stuck without knowledge of what got deployed, and probably without documentation of what got built (perhaps because it is always in flux, and the cloud state is self-documenting, from one perspective).

I highly recommend not having to reverse-engineer the Cloud when you get roped in to troubleshoot an application problem. Having to reverse-engineer raises the bar for “slow” and “lengthy Mean Time to Repair,” particularly if the network engineer doesn’t get hands-on with the Cloud very often.

And hey, that’s not just me: I was helping a customer network team build a list of current and expected/desirable tasks to run by their VP, and they brought it up. Involvement matters, both to build skills and to provide input into, and awareness of, app designs early in the process, and on an ongoing basis as well. Especially if the as-built network is all the documentation there is!
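One low-friction way to get that involvement and at least some documentation: periodically snapshot what actually got built. Below is a rough sketch using boto3 (it assumes the library is installed and AWS credentials/region are configured; a real document would add intent, diagrams, and the reasoning behind the design):

```python
# Rough sketch: record point-in-time VPC "as built" state to a JSON file.
# Assumes boto3 and AWS credentials are already set up; error handling omitted.
import json
import boto3

ec2 = boto3.client("ec2")

snapshot = {
    "vpcs": ec2.describe_vpcs()["Vpcs"],
    "subnets": ec2.describe_subnets()["Subnets"],
    "route_tables": ec2.describe_route_tables()["RouteTables"],
}

with open("vpc-as-built.json", "w") as f:
    json.dump(snapshot, f, indent=2, default=str)

print(f"Recorded {len(snapshot['vpcs'])} VPCs and {len(snapshot['subnets'])} subnets")
```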

When you factor in DNS, load balancing, etc., the gap here gets deeper. IPAM for containers can turn into a scaling concern, for instance.
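Here is a back-of-envelope example of why container IPAM needs planning rather than being an afterthought (illustrative numbers only; your platform’s allocation scheme will differ):

```python
# Back-of-envelope: how quickly per-node pod CIDRs consume a VPC.
# Illustrative numbers only; real platforms vary in how they carve up space.
import ipaddress

vpc = ipaddress.ip_network("10.100.0.0/16")
pod_prefix_per_node = 24  # each node gets a /24 of pod addresses

max_nodes = 2 ** (pod_prefix_per_node - vpc.prefixlen)
print(f"A {vpc} supports at most {max_nodes} nodes at a /{pod_prefix_per_node} per node")
# -> 256 nodes, before reserving anything for services, load balancers, or growth
```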

This isn’t (or shouldn’t be) a job security/team thing (“my task, not yours”); it’s a design skills thing. And documentation. And awareness and context.

Coping with Complex HA / DR / Failover

Something that came up recently is that apps may fail over in different ways. When we used physical data centers, we generally had a primary and a backup, or dual data centers with load balanced front end, database replication, and mutual fail over.

But now, there are likely more locations in the mix! And different ones per application, unless an effort is made to be consistent.

One application might be primary in-house with DR in Amazon. Or Amazon-East backed up by Amazon-West. Etc. Doing one in Amazon and one in Azure is likely to be painful and complicated, but who knows, people might surmount that and figure “less shared fate.”

Conclusion: the failover / HA / DR plans for apps may have to be documented per-app.

The brave new world will likely no longer be “primary data center / backup DR data center”.
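Even a simple, machine-readable record per application beats tribal knowledge here. A hypothetical sketch follows (app names, sites, and fields are all invented for illustration):

```python
# Hypothetical per-app failover inventory; every value below is invented.
failover_plans = [
    {"app": "orders",    "primary": "on-prem DC1",   "dr": "aws us-east-1",  "method": "DB replication + DNS failover"},
    {"app": "analytics", "primary": "aws us-east-1", "dr": "aws us-west-2",  "method": "active/passive, multi-region"},
    {"app": "hr-saas",   "primary": "vendor SaaS",   "dr": "vendor-managed", "method": "contractual SLA"},
]

for plan in failover_plans:
    print(f"{plan['app']:<10} {plan['primary']:<14} -> {plan['dr']:<15} ({plan['method']})")
```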

Cost Structures

I’ll note the CSPs win on costs due to economies of scale. It may not be worth automating much if you are an average enterprise: too time-consuming. For the CSP, repeatable and auditable automated deployment pays off in stability and requiring lower staff skills to deploy and do basic support. In other words, they get much better ROI on automation than the average enterprise. For the enterprise, reliability/consistency (modulo code bugs) is probably the bigger win.

Turning that around, Cloud and automation can be a win for enterprises – potential cost savings – but other higher costs may more than offset that.

SaaS means no longer having consultants standing up in-house instances with per-app idiosyncrasies. So maintenance and support for the big apps is gone, at least the server and database admin side of it. Local customizations and user support remain. What is left beyond that is internally built apps.

From a zero-trust point of view, SaaS moves user-to-SaaS traffic onto the Internet, with some form of federated or other identity mechanism tying back to in-house control.

Other existing apps may or may not be suitable to be moved to the Cloud. If they are internal facing only, putting them in the Cloud is effectively about the same as running them out of a new data center, except perhaps for costs and time to deploy. Failover still needs to be considered.

New builds with cloud-based DNS and/or load balancing: now there’s some real room for cloud-focused network design.

Skills-wise, Cloud means that ideally network staff should know how to do the “plumbing” of connecting from internal or WFH users to anything like VPC or S3 (AWS terms) in the Cloud.

If you’re not involved in building it, you won’t gain and improve those skills. The same goes for other cloud services like cloud firewalls, load balancers, routing quirks, access lists, etc.

Plumbing-wise, cloud VPN connections more or less provide you with the configuration for your physical network device. I say more or less, having been involved in the fun of resolving bugs in the configuration provided by the CSP (usually tunnel authentication settings or other deep VPN attribute mismatches).
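When those CSP-generated configs misbehave, the fix usually comes down to lining up tunnel attributes on both ends. A toy sketch of that comparison (parameter names and values are illustrative, not any vendor’s actual syntax):

```python
# Toy sketch: diff the tunnel parameters the CSP-generated config assumes
# against what the on-prem device is running. Names and values are illustrative.
csp_side = {"ike_version": "ikev2", "encryption": "aes256",
            "integrity": "sha256", "dh_group": 14, "psk_configured": True}
on_prem  = {"ike_version": "ikev2", "encryption": "aes256",
            "integrity": "sha1",   "dh_group": 14, "psk_configured": True}

for attr, expected in csp_side.items():
    actual = on_prem.get(attr)
    if actual != expected:
        print(f"Mismatch on {attr}: CSP config expects {expected}, device has {actual}")
```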

I’ll note in passing all this impacts the security team heavily as well, including how routing and ACL controls in the cloud work. Some of cloud security may be in the form of static “side routes,” which are sort of like service chaining in on-prem products. Selectively forcing some traffic to go through a firewall, for instance. I prefer the “no choice” alternative: traffic goes out one way, and that cable goes directly through the firewall.

A bigger concern, to me anyway, is permissions to access cloud objects, how to audit that, etc. ID / authentication / authorization skills will also be needed across CSPs, tying back to the enterprise authentication etc. scheme. This seems to be the most common form of cloud breach, especially with AWS S3.
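As a small example of what that auditing might look like, here is a hedged boto3 sketch that flags S3 buckets lacking a full “block public access” configuration (it assumes boto3 and credentials are set up; a real audit would also check bucket policies, ACLs, and logging):

```python
# Minimal audit sketch: flag S3 buckets that do not fully block public access.
# Assumes boto3 and AWS credentials are configured; not a complete audit.
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    try:
        cfg = s3.get_public_access_block(Bucket=name)["PublicAccessBlockConfiguration"]
        fully_blocked = all(cfg.values())
    except ClientError:
        fully_blocked = False  # no public-access-block configuration at all
    if not fully_blocked:
        print(f"Review needed: bucket {name} does not fully block public access")
```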

Monitoring CSP bills and usage efficiency, and cutting costs, is a new job role and skill set. How to cut costs will be a huge topic, particularly during the initial learning curve or the first couple of new apps deployed into the Cloud. Note that DevOps / Agile consultants on a deadline may be more focused on getting the app up and working, and that cost efficiency may be follow-on work, by them or by the in-house team. (Ditto security!)

The world has been discovering that it can be cheaper to run in your own data center. Sometimes a lot cheaper. As Russ White says frequently: understand the trade-offs.

The Cloud is better for fast ramp-up, without hardware acquisition and deployment costs. “Time-to-market.” The big lock-in hook is cloud-provider-specific toolsets. The toolsets (ML/AI, etc.) provide faster time to market and shorter application development cycles. But one cost is vendor lock-in.

A new factor could be the order backlog/supply chain. Cloud providers are huge customers. Do their orders get somewhat prioritized?

Keeping only part of an app on-premises while it interacts with unique cloud services is possible, but incurs latency-based performance impacts.

That topic shades into a data placement plan, which may be something nobody has. I wrote a blog about “data gravity” a while back. I am going to NOT talk about data and Cloud here, as that prior blog covered my concerns – and I suspect sites may have to learn the hard way. Consider the thrills of having different subsets of your data scattered across multiple cloud instances. Also consider maintaining that data, backing it up, and securing it. Or worse, consider one source of truth becoming several. How do they get synchronized?

Impact on network engineers: you may have to clean up after a new app is deployed in the Cloud, and also write the documentation if it failed to get delivered or was signed off by another team without considering the networking aspects. Re data gravity, that affects networking people in terms of “why is my cloud app so slow”. Of course, the network is the first suspect. Are you prepared to discuss the speed of light or electrons, and why the distance to a CSP matters for some apps?

(Hint: we’ve already had one discussion about authentication: it slows response down massively if a web page does it repeatedly at a distance.)
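To put rough numbers on that, here is a back-of-envelope sketch (propagation delay only; the distance and round-trip count are illustrative):

```python
# Back-of-envelope: propagation delay versus distance and serial round trips.
# Ignores queuing, server time, and TCP/TLS handshakes, so real numbers are worse.
SPEED_IN_FIBER_KM_PER_S = 200_000  # roughly two-thirds of the speed of light

def rtt_ms(distance_km: float) -> float:
    return 2 * distance_km / SPEED_IN_FIBER_KM_PER_S * 1000

distance_km = 4000   # e.g., user to a far-away cloud region (illustrative)
round_trips = 20     # serial round trips for one page load (illustrative)
print(f"One RTT: {rtt_ms(distance_km):.0f} ms")
print(f"{round_trips} serial round trips: {round_trips * rtt_ms(distance_km):.0f} ms before any server time")
```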

On a related note, I’m now seeing edge compute discussions about local copies of data for lowest possible latency. That means synching them. Nasty problem! Central push versus large cluster consistency?

Data Center Strategy / Network Design

Each organization is going to have to decide how they intend to deploy apps and do HA/DR/failover going forward. That brings us almost full circle, back to the original questions.

Does Cloud (mostly) replace data centers? Or, put another way: when do you put apps in the Cloud, and in which Cloud?

Some of the design / decision alternatives with trade-offs:

  • One or two physical data centers plus AWS and Azure (or other) CSP presence for certain apps (expensive)
  • Shift apps to either AWS or Azure (or 3rd CSP). Back each app up using diverse regions or availability zones within the same provider. (But then what if that provider has a Very Bad Day?).
  • For the most diverse but hard to admin setup, deploy apps to AWS with DR in Azure, or vice versa. Good luck with that! Probably not a good idea (complexity, less capable tools).
  • Do you use CoLo sites (which vendor?) as points of rapid deployment? That is, connect sites within a region to a CoLo where NaaS can provide agile WAN connectivity? If you’re doing that, is the CoLo presence just one or two L3 switches or routers? Or if you have a cage, do you start moving local services and other non-cloud apps to the CoLo as well?
  • Going the other way: let AWS operate your WAN: AWS WAN service?

Application Development

SaaS takes a lot off the plate. Local Salesforce (or whatever) admins may no longer be needed! (So, if you are such a person, maybe build skills in ServiceNow or Splunk instead?)

One concern I have: restoring a single enterprise’s SaaS app database is bad enough. If the SaaS provider’s database ever gets badly messed up, how many days/weeks/months will recovery take? Hopefully they’ve got it partitioned so it doesn’t ALL get messed up when something goes wrong?

The big thing I think I’m seeing here is that the app team or the hired app consultants may tend to take care of cloud (and other) networking. And if “agile” or “devops” is in their name, they may not communicate what they’re doing to others, or not do so well, let alone engage in design discussions with them. (Sorry, pain point where I or people I was working with got burned.)

As noted above, app developers (internal or contracted) may do quickie network “design” and come up with addressing, routing, or something else that is not very scalable or doesn’t work well with what you already have deployed.

Murphy’s Law says this will happen late in the deployment, likely when the proof-of-concept build has covertly turned into the end build because time has gotten tight. In that case, fixing it by doing it the right way from scratch won’t be an option. How do you mitigate the negative impact?

Reality check here: this is some of the hidden value that networking people (especially designers / architects) provide. Others in the organization may have to learn that the hard way. (Security and hidden value, ditto.)

I’ll note in passing here that the networking team may be managed by a networking person, or not, but from there on up it is almost always former server/app people. This may reflect relative valuations by senior management, or upwards visibility, or just the fact that there are usually a lot more server/app people than networking people. It may also reflect that many/most networking people tend to be more deeply technical than big picture folks, more introverted, less interested in interacting with senior management, etc.

Security / Resiliency

I noted above that some new skills will be required for securing cloud functions.

I’ll also note that we’re developing more and more critical dependency on the Internet. This is a good topic for another blog, so I’ll try to be brief here.

I’ve just been reading about a vendor with cloud-based NAC for letting devices onto the network and for switch port configuration and activation. What if that goes down? Does that mean you’ve been “Facebooked” – lost the ability to open doors, get into the building, fix the problem, etc.? (And should we henceforth refer to this as “F’d up”, where the capital F clearly refers to Facebook?)

For that matter, if your encrypted password storage tool is in the Cloud and you lose your credentials for accessing it, or that Cloud is down, you’re down hard. E.g., the password database is in the Cloud and your device loses access to it. Local copies that synch are probably better. Although then if synch messes up, all your copies may become garbage?

What if the Internet or a major CSP fails in a region for days? (Foreign nation or hacker attack, environmental issue like Texas power grid and cold weather, etc.)

  • SD-WAN down
  • ID / authentication services down
  • MS AD in the Cloud for instance?
  • DNS down
  • Need physical access to restore???

That’s the short version. But clearly Cloud adds a whole bunch of things to think about into the security / resiliency domain.

Security Teaming

Security may need to work together more with other teams than at present. And vice versa.

WAN

I suspect everyone has outsourced their local Internet and WAN circuit management, and that staff is glad of that. Although working with the support staff at some last mile providers can be excruciating (due to their lack of skills).

Short term, the new impact is SD-WAN and DIA. I’ve speculated elsewhere about remote access possibly displacing or changing that. More recently, remote AP may be part of a vendor’s SD-WAN strategy.

And what about Network as a Service (NaaS)? In the near term, as my prior blog noted, it may affect the inter-region, inter-CSP, inter-national, etc. core network. Network teams will likely be involved in the routing logic and responding to outages, failover testing, etc. But NaaS potentially gets you (us) out of managing boxes. No routers to deploy, upgrade, replace. Perhaps firewalls ditto. Given what I’ve heard about some firewalls (e.g., taking all day to get a working build and apply patches when standing up a new firewall), that might also be a Good Thing.

Tentative conclusion: perhaps most “hands” type work will be outsourced (if not already). Higher-level skills will be needed, along with skills in configuring cloud virtual routers, firewalls, load balancers, etc.

Recommendation: Beginners should learn basic networking first, then the cloud versions of it. More advanced users should at least read up on cloud networking, hands-on time and certification a plus. Ivan Pepelnjak’s cloud courses and a subscription might be a good cost-effective place to start. Concise, and he shares a networking perspective.

What doesn’t change due to Cloud

  • Circuits: local and small sites still need to connect to Internet, well-connected CoLo, or regional hub, etc.
  • Campus networking, as in on-premises gear, e.g., switches, OT/IOT switches or WLAN connectivity, APs, desk phones (if not converted to softphones that can be used for WFH, or just use of personal cell phones with a corporate directory).
  • Non-cloud topic: I have speculated elsewhere that enterprises may start emulating colleges, in shifting to mostly WLAN rather than wired ports. However, as more “IOT/OT” sensors arrive, they may be wired (for robustness) or wireless (mobile or smaller, or for convenient inexpensive deployment). Perhaps 90% or more wireless eventually. This affects the type and number of LAN switches you deploy and may well require a more crisply designed and deployed WLAN infrastructure, possibly outdoors WLAN and cellular 5G+ hand-offs as well.
  • With 5G cellular and federated ID, conceivably smooth outdoor and home 5G to indoor WLAN transitions, perhaps with automatic remote access VPN, might become the norm. At that point, in small offices especially, perhaps WLAN to 5G uplink provides Internet. Eventually. Or just 5G built-into phones, laptops, etc.?
  • Or eventually, perhaps site or home WLAN is mostly managed by the 5G provider (their dream!) and it is all Internet from the user’s perspective.

The Maybes

There are some extreme possibilities. If your company has a lot of call agents, perhaps they work from home, connecting to app(s) in the Cloud resulting in much less on-prem infrastructure! But how do you make sure call quality is robust? Can’t irritate already irritated customers!

It has been 20+ years since I had an office phone. Back when I had one, that pre-dated cell phones, so there was no good alternative. Now I have Jabber and Webex. I’m kind of mixed re the Jabber use case…

Do we really need office phones? Or do we need call forwarding from an “office number” to our cell phone / desktop, with a “business call” flag and central directory? For those in the office (management and hoteling?) are desk phones needed or would softphones be simpler to support? (Jabber, Webex find me wherever I am, in a simple user-friendly way. Well, except that Jabber can have issues when you’re on remote access VPN without split tunneling for Jabber.)

Using Webex instead of Jabber does provide that sort of functionality.

The biggest gap with a cell-phone-only strategy is perhaps the phone directory. If the corporate directory could be replicated and kept updated on cell phones … One thing my cell phone does nicely is block spam calls. I can’t do that with Cisco Jabber (admin privileges apparently needed to add/edit a directory entry). So, I’m still receiving 1-3 calls a day about extended auto warranty on my Jabber office phone number. Rolling them over to voice mail helps.

The exception to that: perhaps executives. Rumor says they like handsets or now, big desktop phone/video platforms. Execu-perks?

And the obvious next question: Do we really need offices now that we’ve learned from WFH / COVID-19? I touched on this above re WLAN, and elsewhere. Offices do provide a sense of “belonging.” I’ve done hoteling, it wasn’t the same as “my desk.” It did provide some of the same socialization value of being on site with other people. Perhaps hoteling workplaces should randomize onsite days, to ensure social mixing?

I can conceive of climate change causing a shift to a strong WFH ethic, and “no commute” jobs, except where physical presence is needed. Saving commute time and cost is also a big positive.

As a data point, I drove through New York City on a workday late morning while COVID was prevalent (PTO, family visit), and the Cross-Bronx was a lot less congested than prior trips. So WFH may be having an impact!

Other possible staffing changes:

  • Working with managed services and/or spending more time troubleshooting requires different skills, including diplomacy and patience. Good documentation helps too!
  • Design/build/document may be totally outsourced.
  • Rent-an-expert: Perhaps small to medium firms put an expert or fairly-skilled tech on retainer for say 2 days/week. This saves money if you don’t need a full-time person on staff or can’t afford their full-time salary. NetCraftsmen is already providing this service to some customers, especially for skills like ISE, Gigamon, StealthWatch, Cisco FTD/FMC, Palo Alto firewall, F5, etc.

Links

Just for fun, and to see if I’d missed anything, I googled the topic. Here are some links that I found that you might look at:

That’s about it, as far as what I (google) found. There were some “cloud will take over, who needs data center network engineers?” blogs. I think they over-stated things.

Conclusions

It is a good idea for network engineers to learn the networking aspects of Cloud. Also, be aware that only the “legacy network replacement” portion of Cloud actually requires networking. By that I mean the following sorts of things (using Amazon terms): VPC, DirectConnect, Transit Gateway, public S3 connection. And virtualized firewalls, load balancers, etc.

By way of contrast, consider AWS’s offerings around event-driven and Lambda processing – basically, “serverless” applications. The virtual servers and back-end networking for those are completely hidden from the AWS customer; AWS just provides services. So how much of a role would the network engineer have, other than connectivity? And likely troubleshooting slowness, as might happen when serverless function calls go between CSPs or geographic regions.

How much network engineer involvement there is in the “legacy network replacement” side of Cloud varies; right now it is often low.

I claim that the designs / deployments will work better (together, and more scalably in the long-term) with a networking person involved in the design.

It is not clear to me that DevOps and management would agree about the scalability part above. I’ve seen some awful IP address assignment schemes, etc., in traditional networks, and life still went on, for years. Hidden costs, perhaps, but management may not even be aware of them. I’m a fan of “build it right in the first place.” You may not think of everything, but the results are likely to be a lot better than “figuring it out on the fly.”

So it might not hurt to keep telling management that. Otherwise, say cost-cutting on compensation reduces skills 10%. A couple of rounds of that might reduce in-house skills to an inadequate level.

From what I’ve seen already, I expect we networking people will be hauled in to troubleshoot, so we do need to be aware of what’s being built and (I hope) documented.

Doing the documentation could be a great way to get involved and start interacting with the teams building stuff in the Cloud. Screen captures of VPC, etc. creation screens could also be useful, if sites are not yet using Terraform or something that (partly) self-documents the Cloud build.

By the way, NetCraftsmen has done ACI, sometimes reverse-engineering or “cleaning up” after someone built it. You can get lots of details from that, but the intent and the big picture, not so much. Documenting the intent and big picture is something I still strongly recommend!
