NetCraftsmen Showcase: Deploying QoS

Author: Peter Welcher, Architect, Operations Technical Advisor

This post is the first in what may become an occasional series. I thought it might be fun (and good marketing, of course) to share some of the many things NetCraftsmen consultants are up to. NetCraftsmen is doing a lot of managed service and design/deployment work for a variety of large and small customers.

Thanks to Steve Meyer and Carl King for the info provided and for reviewing this blog.

Steve and I have a long history of designing and deploying QoS for customers.

Recently I’ve been providing some support to a team working on a QoS project in a large hospital system with over 1,000 switches and routers. The project stemmed from the fact that the existing QoS deployment had suffered configuration drift over time (missing elements, gaps, incorrect commands, etc.). This happens in most shops: new devices get deployed, staff gets distracted, changes miss some devices, and so on.

Medically important VoIP apps needed proper support.

The fun was enhanced by the critical VoIP app using IP Multicast (“IPmc”) and a similarly inconsistent set of IPmc configs.

TL;DR: Success factors and some lessons learned regarding DNAC and QoS, plus automation using the tools at hand.

Background

At the time the project kicked off, I had been working on a script to parse collected show output. I had extended it to do sanity checks of QoS and IPmc configurations, and to extract the relevant configuration commands to files to simplify manual review.

I’ll note the script is not something I can share without more effort. My emphasis was on writing code and doing rapid prototyping to see what worked and what didn’t work. I tried to use good style and comments, partly to reduce my pain in fixing bugs, but some of the code is … hasty.

It does at least crudely parse every CLI command I’ve seen in a large collection of IOS, IOS-XE, and Nexus configurations, although sometimes just enough to ignore an entire CLI command sub-tree.

I also did a lot of manual checking, but Cisco coders may have done things differently for almost any model/sub-model of hardware, so there are probably gaps and bugs. The point was fast working code, as correct as reasonably possible, fixing problems or parsing more carefully when issues turn up – and they did, and they will.

The script checks things like “Was QoS or IPmc globally enabled?” (on by default in some devices, not in others – Cisco cross-platform consistency of defaults is just not there). That factors in some solid guesses as to devices that default to enabled based on model number. And if enabled globally, is it enabled on at least one interface? Are other mandatory commands present? Etc. The same is true for multicast: is it globally enabled and enabled on at least one interface? Is there something covering PIM RP? Etc.
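
To make that concrete, here is a minimal sketch (in Python) of the kind of check involved – a simplified, hypothetical version, not the actual script. The command patterns are IOS/IOS-XE-flavored examples and would need per-platform tuning:

import re

def check_ipmc(config_text):
    """Return a list of warnings about the IP multicast configuration."""
    warnings = []
    if not re.search(r'^ip multicast-routing', config_text, re.M):
        warnings.append("IPmc: 'ip multicast-routing' not found (may be a platform default)")
    if not re.search(r'^\s*ip pim (sparse-mode|sparse-dense-mode)', config_text, re.M):
        warnings.append("IPmc: no interface has PIM enabled")
    # Only checks static RP; Auto-RP, BSR, and anycast RP would need their own patterns.
    if not re.search(r'^ip pim rp-address', config_text, re.M):
        warnings.append("IPmc: no PIM RP address configured")
    return warnings

def check_qos(config_text):
    """Return a list of warnings about the QoS configuration."""
    warnings = []
    if not re.search(r'^class-map ', config_text, re.M):
        warnings.append("QoS: no class-maps defined")
    if not re.search(r'^policy-map ', config_text, re.M):
        warnings.append("QoS: no policy-maps defined")
    if not re.search(r'^\s*service-policy (input|output) ', config_text, re.M):
        warnings.append("QoS: no interface has a service-policy applied")
    return warnings

if __name__ == "__main__":
    import sys
    for filename in sys.argv[1:]:                 # one saved config file per device
        with open(filename) as f:
            cfg = f.read()
        for warning in check_ipmc(cfg) + check_qos(cfg):
            print(f"{filename}: {warning}")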

Anyway, the script was useful for getting a quick read on how big the IPmc and QoS discrepancy problem was. The result: many devices had gaps.

Fixing IPmc

For IPmc, the issue is generally just missing global or Layer 3 interface commands. And the fix is generally additive: pasting in commands that are already present isn’t a problem. In addition, per-platform variations in syntax are few, so a couple of base configurations were all that was really needed. Scripted paste-in, verified, done.
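
As a rough illustration of the scripted paste-in (not the project’s actual tooling), a Netmiko-based push might look like the sketch below. The RP address, interface, and device details are placeholders, and the syntax shown is IOS/IOS-XE:

from netmiko import ConnectHandler

BASE_IPMC_GLOBAL = [
    "ip multicast-routing",           # some platforms want the "distributed" keyword
    "ip pim rp-address 10.0.0.1",     # placeholder static RP; anycast RP is a separate design topic
]

BASE_IPMC_INTERFACE = [
    "interface Vlan100",              # placeholder Layer 3 interface
    " ip pim sparse-mode",
]

def push_base_ipmc(device):
    """Paste the base IPmc commands onto one device and return the output for review."""
    with ConnectHandler(**device) as conn:
        output = conn.send_config_set(BASE_IPMC_GLOBAL + BASE_IPMC_INTERFACE)
        output += conn.send_command("show ip pim interface")   # quick verification
    return output

if __name__ == "__main__":
    device = {
        "device_type": "cisco_ios",
        "host": "192.0.2.10",         # placeholder management address
        "username": "admin",          # placeholder credentials
        "password": "change-me",
    }
    print(push_base_ipmc(device))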

And yeah, there might have been some snags I haven’t heard about.

PIM RP and anycast RP in a large network are other considerations.

QoS is a PITB

QoS, on the other hand, is painful to fix manually. All too often, you have to back out commands, and you can’t just replace them.

In the extreme case, if you have an ACL referenced by a class-map used in a policy that is applied to one or more interfaces, and you want to change the ACL, you may have to remove the policy from the interface, delete the policy, delete the class-map, fix the ACL, then put it all back. Or variants of that rigamarole. Painful!
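
To illustrate the rigamarole, here is a hypothetical sequence expressed as the kind of ordered command list a script would push. The names (VOICE-ACL, VOICE-CLASS, EDGE-QOS) and the interface are made up, the syntax is IOS/IOS-XE, and some platforms are more forgiving about in-place ACL edits:

# Ordered rework of a QoS policy whose ACL needs to change.
REWORK_QOS = [
    # 1. Remove the policy from the interface(s) that reference it.
    "interface GigabitEthernet1/0/1",
    " no service-policy input EDGE-QOS",
    # 2. Delete the policy-map, then the class-map that references the ACL.
    "no policy-map EDGE-QOS",
    "no class-map VOICE-CLASS",
    # 3. Now the ACL can be replaced.
    "no ip access-list extended VOICE-ACL",
    "ip access-list extended VOICE-ACL",
    " permit udp any any range 16384 32767",
    # 4. Rebuild the class-map and policy-map, then reapply to the interface.
    "class-map match-any VOICE-CLASS",
    " match access-group name VOICE-ACL",
    "policy-map EDGE-QOS",
    " class VOICE-CLASS",
    "  set dscp ef",
    "interface GigabitEthernet1/0/1",
    " service-policy input EDGE-QOS",
]

if __name__ == "__main__":
    # Print the sequence; it could be pushed with send_config_set() as in the IPmc example.
    print("\n".join(REWORK_QOS))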

Based on internal company feedback, DNAC had been very helpful in greatly simplifying initial QoS deployment at one site. So the team decided to use it where possible, with manual/scripted fixes elsewhere. I’m told DNAC has also gotten pretty good at backing out commands, apparently including QoS. On the other hand, for QoS it seemed to play it safe, removing old service-policies from interfaces but leaving the ACL, class-map, and policy-map in place.

DNAC QoS Pros and Cons

If you attempt this, you’ll discover the pros and cons. First, you need to get your DNAC up to a recent non-buggy release level, which can be time-consuming. Then when you consult the DNAC support matrix, you have to upgrade a bunch of switches to supported code – the chicken-and-egg problem. As in, you have to upgrade them so that DNAC will support them, and THEN DNAC will be able to automate future upgrades and manage QoS.

And by the way, going forward, I would want to try DNAC automated upgrades on one device of each type, just in case of bugs and gotchas.

There’s also a learning curve if you haven’t had DNAC before, or if you migrated from Prime to DNAC but used it only for AP and Wi-Fi management.

The good news is that DNAC then reportedly pushed QoS configurations out nicely. Some custom rules were added, and it handled them.

That cut down the amount of legacy/manual work, making the whole project go more quickly.

Since I’d recommended DNAC in the first place, that was a “whew! Glad it worked well.” Yes, there’s the initial management startup time cost mentioned above, but from then on, you’ll have automated device upgrades and automated changes or additions to your QoS, etc., as well as other management and assurance reporting. A net win!

Lessons Learned

Allow time for unexpected tool overhead (e.g., large-scale inventory population and device upgrades). That probably falls under “initial setup to use DNAC automation,” which is maybe a separate task from QoS (or IPmc) deployment. We encountered some issues with device access. And heck, we’ve found at most sites that getting a 100% reliable device inventory can be a challenge, especially if equipment replacement is constantly going on.

So if legacy devices or whatever have console-only or local password-only access, yes, cleaning that up is necessarily going to be part of any automation and management project.

Manual QoS is still painful. NetCraftsmen has a large document of best practice config snippets for older devices, which is still handy, e.g., for Nexus switches and 4000-series switches that DNAC does not (yet) support. That saved a good bit of time.

Conclusion

In production networks, configuration drift is a real thing. When staff deploys new switches and routers, they may forget to paste in parts of configurations or not have them prepped in the first place. After-hours work can be conducive to such oversights. Do YOU believe your QoS and IPmc configurations are correct everywhere in your network?

There’s an automation story lurking here. The above work used what I might call “just in time” automation. Scripts were used to detect deviations from the standard and extract just the relevant QoS or IPmc commands to simplify viewing.
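
For example, a crude extractor along these lines (again, a simplified sketch rather than the actual script, with an illustrative rather than exhaustive keyword list) pulls out just the QoS- and IPmc-related lines:

QOS_IPMC_KEYWORDS = (
    "class-map", "policy-map", "service-policy", "mls qos", "auto qos",
    "ip multicast-routing", "ip pim", "ip igmp",
)

def extract_relevant(config_text):
    """Return only the lines related to QoS or IPmc, keeping each block's children.

    Limitation: interface sub-commands are kept without their parent
    "interface ..." line, so some context is lost.
    """
    keep = []
    in_block = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if any(stripped.startswith(k) for k in QOS_IPMC_KEYWORDS):
            keep.append(line)
            in_block = not line.startswith(" ")   # a top-level match starts a block
        elif in_block and line.startswith(" "):
            keep.append(line)                     # indented child of a relevant block
        else:
            in_block = False
    return "\n".join(keep)

if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f:
        print(extract_relevant(f.read()))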

There was a fair amount of checking to ensure that what got deployed was what was intended.

So no overall automation, but spot use of various tools that were on hand to get the job done. This is probably how automation needs to start in any organization: find a good workflow and automate parts of it. If the task is a repeated one, improve the automation incrementally. Generally, focus on automating labor-intensive tasks, as well as ones (like pushing configlets) that humans tend to screw up.

And isn’t that how most of us operate?
