Recently, I’ve come across a number of blogs about networking skills. I have some thoughts on the subject, which may be a good bit different from what you’ve been reading. I’m curious whether people agree with me or I’m just being contrary — so leave a polite comment either way!
Programming/Scripting for Networks
Let’s start with programming and scripting. Cisco was pushing hard for learning coding, then eased up a bit. Several people I highly respect are enthusiastic about programming, or at least scripting. For that matter, I like coding. Getting someone to pay me to do it is the problem!
Concerning tools like Python, Puppet, Chef, Ansible, etc. — knock yourself out. Having some idea what they can do and how to use them might be helpful. I agree with Cisco’s toned-down version of “all networkers must learn programming,” namely, “must be able to credibly talk to programmers” — although I might quibble about which programmers, i.e., not just tool developers. (How many firms will be building their own tools, instead of DevOps teams? More on this below.)
What concerns me about lots of people scripting is the amount of bad coding that might happen. I’ve programmed in 15-20 different languages, developed one big GUI-based program, and had my share of humbling bugs in all of them. Perl is especially handy if you like obscure bugs. Regular expressions, ditto. Another thing I’ve noticed over the years is the amount of bad, sloppy, under-documented code out there, including my own. I’ve worked at it, and I now write carefully indented, self-documenting C or Perl code — just because I know I’m going to have to fix it at some point.
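To make the “obscure bugs” point concrete, here’s one classic flavor in Python (a contrived example, not from any real script): a greedy regex quietly grabbing more than intended.

```python
import re

line = 'desc "a" and "b"'  # pull out the first quoted string... or so we think

print(re.search(r'"(.*)"', line).group(1))   # a" and "b  -- greedy: too much
print(re.search(r'"(.*?)"', line).group(1))  # a          -- non-greedy fix
```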
Think about it: Someone succeeds in automating some aspect of your network. They leave. Who maintains it? Or your attempt at scripting breaks in some obscure way. Worst case: it breaks routing on numerous routers due to mildly different CLI syntax across platforms. From experience, the testing one does is seldom as thorough as the testing that should have been done, and bugs do hit production.
In summary: One’s odds of avoiding a CLM (career-limiting move) might be better with supported code from a vendor, be it Cisco, Apstra, or whomever.
For reporting, OK, that’s less dangerous. An API is useful while a product is still maturing, in case the vendor’s canned reports don’t do what’s needed. Having said that, in the last year or two I’ve used Python to probe at some APIs. My conclusion: badly under-documented. Telling me the syntax and a table schema does not give me context. For instance, one network management product gives me performance data. I couldn’t readily determine over what time period the data was rolled up, in part because I had something wrong with my query and no good example of accessing that particular API element.
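For the record, here’s roughly what that probing looks like. A minimal sketch; the endpoint, auth token, and field names are hypothetical placeholders, not any particular vendor’s API:

```python
import requests

BASE = "https://nms.example.com/api/v1"        # hypothetical management system
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credential

resp = requests.get(
    f"{BASE}/interfaces/performance",
    headers=HEADERS,
    params={"device": "core-sw-1", "metric": "utilization"},
    timeout=10,
)
resp.raise_for_status()

# The schema promised a "timestamp" and a "value"; it said nothing about
# whether each value is a 5-minute average or an hourly rollup. Eyeball it:
for row in resp.json().get("data", []):
    print(row.get("timestamp"), row.get("value"))
```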
Is scripted automation and reporting really all there is? It’s useful, and bigger shops might pay you to do it. Smaller ones, I doubt it.
Yes, we probably need to be able to script to get more from network management and automation tools. Network management is a relatively small market, and it just doesn’t seem to generate the R&D needed to automatically pull data together into a network model and provide correlated results, let alone root-cause analysis. Just getting good graphs out of present tools can be a bit of a hassle.
In fact, that’s one of my gripes. Without naming vendors (and there are several), some seem to think I should use their API to make up for their immature product’s lack of a good set of canned reports. That just doesn’t work for me. If I sit through a sales pitch and find out the product is rather incomplete, I “enable product dampening” — as in you lose points with me for years.
While we’re at it, too many APIs, inconsistent across products from a single vendor — that redefines inefficient use of my coding time. Put differently, APIs are an enabler. Too many APIs: a disabler. Inconsistent semantics: a disabler. Sure, I can write code that does different things based on device model (a sketch of that follows below). But having to do that is a waste of my time.
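Here’s a hedged sketch of what that per-model code ends up looking like; a lookup table at least keeps the platform differences in one place. Platform names and commands are illustrative:

```python
# Map each platform to its "save config" command instead of scattering
# if/else checks through the script. Entries here are illustrative.
SAVE_CONFIG = {
    "cisco_ios":  "write memory",
    "cisco_nxos": "copy running-config startup-config",
    "arista_eos": "write memory",
}

def save_command(platform: str) -> str:
    try:
        return SAVE_CONFIG[platform]
    except KeyError:
        # Fail loudly rather than send a guessed command to a production box
        raise ValueError(f"no save command known for platform {platform!r}")

print(save_command("cisco_nxos"))  # copy running-config startup-config
```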
Catching Bigger Fish
Maybe we should be widening our skills perspective.
To catch bigger fish, you might use a bigger net.
It is useful to understand how ACI, VMware NSX, and containers do networking. Try to figure out or learn the best practices around hierarchy of deployment, manageable addressing and routing, manageable and secure networking, etc.
Example: ACI can control NSX security. I’m not sure I like the approach Cisco recommends. For one mostly NSX deployment (80-90 percent virtualized), I plan to use ACI as a fabric automation and management tool, and use NSX management natively. That way we won’t be adding troubleshooting complexity due to the control plane interaction between the two. Learning one tool well: good. Learning the two or three competing tools in an area: better. Learning how to make them interoperate: small market for the skill, high risk/complexity?
Another example might be more storage-related. I’ve been wondering where hyper-converged systems are appropriate, and where not. As you scale them up, you add network traffic and latency. From a Google search, I see that “scaling hyperconverged systems” is a thing — but no clear answers. Detecting that applications are slow due to storage IOPS is something we’ve run into a few times lately, as part of proving it’s not the network.
The implication here: perhaps some server/virtualization/storage skills are relevant for in-house datacenter networking teams.
The meta-skills point here is: Don’t just learn one way of doing things; learn to be able to discuss the alternatives, their pros/cons, and where they fit or don’t fit.
DevOps and Cloud
I’d like to suggest that another place where network people can really have impact is DevOps and Cloud.
Here’s my logic:
- Network people are the ones who solve problems across stovepipes. We get the blame for an outage, so we learn enough about the application and the related servers, storage, or security tools, and we identify the problem cause, working with others. We’re the IT people who connect things, who work across boundaries. We just need to broaden the associated skillset!
- Network people are highly aware of traffic flows. Or should be. Analyzing the application flows that deliver a service is key to successful cloud deployment, be it VM instances, containers, or whatever. I have yet to see an application team that documents their flows and the high-level way their application works. Sometimes they sort of document things like database schemas. If you’re lucky, you get the ports a firewall must allow (and sometimes, in which direction). Key app flows? I’ve never seen them documented. When teams are forced to do so (for security), the result is usually a pathetic afterthought, even for major commercial apps. For what it’s worth, I learned a while ago, when doing QoS, to look for security appendices, because security keeps the app from working, whereas identifying ports for QoS is less compelling.
- Network people understand latency in their bones. Application developers — good ones get it; the other 90 percent don’t (and 90 may be optimistic!). That’s why we need to be flow-aware. Suppose an application has a chatty component or micro-service. Let’s say it queries the database one row at a time, rather than using a stored query to do all the work in the database. Even though the per-request latency LAT might not be all that bad, N x LAT can be a show-stopping problem, especially if N is, say, 1,000 or more (the first sketch after this list works the numbers). Recent case: web authentication may have to happen many times when viewing a single web page, not just once.
- This leads me to what I call “app pods”: a group of consenting or cooperating VMs or containers that deliver an end-user service. If you take a chatty member of an app pod and stick it elsewhere, latency becomes a significant consideration. Developing cloud-ready applications or leveraging hybrid cloud successfully requires being aware of “flow clusters”: groups of intense traffic flows that define the VMs or micro-service containers that are tightly coupled and need to remain co-located (the second sketch after this list shows one way to find them). Using hybrid cloud instead of a single cloud creates “friction” in the form of added latency.
- If you doubt me, consider Tetration, or why Cisco spent a fortune to acquire AppDynamics. They’re needed pre-cloud, to figure out the app pods, and post-cloud-migration, to manage and resolve performance issues. Firewall logs don’t cut it if the firewall only enforces at Layer 3: they miss same-subnet interactions.
- With cloud networks (AWS VPCs or Azure Virtual Networks), summarizable addressing, manageable addressing (for routing and security), and avoiding NAT are all advisable, especially when there’s static routing in play. Containers, even more so. Cf. Cisco’s open source Contiv project. I’ve been reading various resources in this area, to contrast what various tools do for convenience (NAT with containers, for example) versus what might be a more operations-friendly design (the third sketch after this list illustrates summarizable addressing). Why? I think that’s a valuable skillset.
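First, the chatty-component arithmetic. A back-of-the-envelope sketch in Python; the latency and row counts are assumptions for illustration, not measurements from any real application:

```python
# Illustrative numbers only: fetching 1,000 rows one request at a time
# versus one bulk/stored query that does the work server-side.
lat_ms = 2        # assumed round-trip latency per request (LAT)
n = 1_000         # number of chatty requests (N)

chatty_total_ms = n * lat_ms    # N x LAT: the round trips dominate
bulk_total_ms = lat_ms + 50     # one round trip plus assumed server-side work

print(f"chatty: {chatty_total_ms / 1000:.1f} s")  # chatty: 2.0 s
print(f"bulk:   {bulk_total_ms / 1000:.3f} s")    # bulk:   0.052 s
```

Two milliseconds per query sounds harmless; a thousand of them in series is a two-second page load, before any server-side work.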
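Second, a minimal sketch of finding “flow clusters”: treat VMs/containers as graph nodes and intense flows as edges, and each connected component becomes a candidate app pod to keep co-located. The flow records and threshold below are hypothetical placeholders; real input might come from NetFlow/IPFIX or Tetration exports:

```python
from collections import defaultdict

# (src, dst, avg_mbps) tuples -- invented data standing in for flow exports
flows = [
    ("web1", "auth1", 40), ("web1", "db1", 120),
    ("batch1", "db2", 5), ("db1", "cache1", 200),
]
THRESHOLD_MBPS = 10  # cutoff for an "intense" flow; tune per environment

adj = defaultdict(set)
for src, dst, mbps in flows:
    if mbps >= THRESHOLD_MBPS:
        adj[src].add(dst)
        adj[dst].add(src)

seen, clusters = set(), []
for node in adj:
    if node in seen:
        continue
    stack, cluster = [node], set()
    while stack:  # depth-first walk of one connected component
        n = stack.pop()
        if n not in cluster:
            cluster.add(n)
            stack.extend(adj[n] - cluster)
    seen |= cluster
    clusters.append(cluster)

print(clusters)  # e.g. [{'web1', 'auth1', 'db1', 'cache1'}] -- one app pod
```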
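Third, summarizable addressing. A small illustration using Python’s standard ipaddress module, with made-up subnets: contiguous allocations collapse to a single prefix; scattered ones never will:

```python
import ipaddress

# Four contiguous VPC subnets (invented) collapse to one /22:
vpc_subnets = [ipaddress.ip_network(f"10.20.{i}.0/24") for i in range(4)]
print(list(ipaddress.collapse_addresses(vpc_subnets)))
# [IPv4Network('10.20.0.0/22')] -- one route, one firewall object

# Scattered allocations have no clean summary:
scattered = [ipaddress.ip_network(n) for n in ("10.20.0.0/24", "10.31.7.0/24")]
print(list(ipaddress.collapse_addresses(scattered)))
# [IPv4Network('10.20.0.0/24'), IPv4Network('10.31.7.0/24')] -- two, forever
```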
DevOps and Networking
Sure, there’s DevOps for networking. Except we rarely have Dev/QA and test environments. If you have an actual networking lab, and not just VIRL, raise your hand. Yeah, I thought not.
Consider this: DevOps is supposed to be collaborative. (Although code before documentation and collaboration seems to be the approach in some places.)
Is there a network person on a local DevOps team? Security person? Operations person?
There’s a useful liaison role for someone who can talk to all three of those concerns, perhaps only periodically coordinating with any one DevOps team.
The need (job role) I think I see here: tying network architecture to application and cloud architecture, defining standards, and keeping it all manageable, while not slowing development down. If nobody does it, well, take your present “accrued technical debt,” throw in DevOps teams making it up (or figuring it out) as they go, and you’ve got chaos x 10! (Feel free to interpret that as 10 factorial, if you want.)
There’s also an up-front role, when a given DevOps project is kicking off. I’ve seen one large project need to radically change direction, at substantial cost (I suspect $1M+). The DevOps team apparently did not realize their “micro-services” (actually SOA) code was chatty, and compounded that with perhaps 5-10 milliseconds of latency in their hybrid cloud approach. CoLo providers may cross-connect you with a high-speed link to a cloud vendor, but the cloud servers are quite likely not physically located in that same CoLo.
I’ve seen a lack of planning around application or service addressing result in awkward routing (eight random /32 host routes needed to selectively override a /24, with corresponding firewall exceptions) or awkwardness in summarizing security zone addresses. That’s something a network/security person might be able to help with up front, or as the project evolves. Assuming you don’t like accruing technical debt. (A small illustration follows below.)
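For the curious, here’s a sketch of why those scattered host routes hurt, again in Python with invented addresses. Longest prefix wins, so every /32 override is one more route, and one more firewall exception, to track:

```python
import ipaddress

routes = [ipaddress.ip_network("192.0.2.0/24")]         # the summary route
overrides = ["192.0.2.7/32", "192.0.2.42/32"]           # "random" host routes
routes += [ipaddress.ip_network(n) for n in overrides]

def best_match(addr: str, table):
    """Return the longest-prefix route matching addr (how routers choose)."""
    candidates = [n for n in table if ipaddress.ip_address(addr) in n]
    return max(candidates, key=lambda n: n.prefixlen)

print(best_match("192.0.2.7", routes))   # 192.0.2.7/32 -- the override wins
print(best_match("192.0.2.8", routes))   # 192.0.2.0/24 -- falls to the summary
```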
How Soon?
The time for new skills may be sooner than you think. Some small firms are moving fast, some not. Large firms generally move much slower. Anyway, I’d like to share a hint of what future networks might look like.
I’ve been dealing with some small “hollowed out” (my term) networks lately. They have user sites WAN- or VPN-connected to a rack or half-rack in one or two CoLo sites. Their computing is already in the cloud or multiple clouds. Office 365 as SaaS. Salesforce. Time sheets and HR as a Service. Other business functions (accounting, etc.) either in the cloud or in a converged chassis in the CoLo.
The next step for such firms may be virtual routers running in the cloud as WAN/VPN hubs, probably via IWAN or SD-WAN. Or physical routers, where more performance is needed.
The networking person for such a firm likely maintains the network, orchestrates the purchase of hardware, circuits, and CoLo and cloud footprints, and coordinates implementation services from consultants.
This blog is getting long, so I’ll expand on this last topic in a subsequent blog.
Comments
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussions via comments. Thanks in advance!