Some recent QoS discussions got me thinking. I want to share some of my perspectives on QoS in the hope that you find them useful.
The starting point here is perhaps that many sites do not seem to be doing QoS. Or maybe that’s just among the sites I’ve been at recently. Understanding what QoS can do for you might help with that.
TL;DR: Understand the basics of where QoS helps and what it can/cannot do.
Know Your Flows
QoS matters in a few situations that are common in networks: anywhere you cannot over-provision bandwidth, so there might be contention for it.
Here are some of the common situations where QoS can help:
- Upstream merge: Suppose you have a bunch of switch ports sending traffic upstream. If 48 x 1 Gbps ports are operating at 50%, you might have 24 Gbps of traffic trying to exit on 1 or 2 10 Gbps links. Something has to give! QoS lets you pick winners and losers – more precisely, which traffic has priority and which gets dropped or delayed slightly when there is congestion.
- Downstream (or other) slowdown: If you have traffic blasting down a 10 Gbps link that has to exit via a 1 Gbps link, e.g. to a device, any excess traffic will have to be queued or dropped. TCP will help adjust TCP flows based on drops. But you probably want to protect your VoIP and video from drops or queuing delays. (I’d use the super-highway exit ramp analogy here, but excess cars don’t just vanish!)
- Shaping: If traffic comes blasting into your WAN/MAN router at 1 Gbps and has to go out a 1 Gbps port where you’re paying for, say, 500 Mbps capacity, that’s similar to the downstream slowdown. You can pump out more than 500 Mbps, but the provider is likely enforcing the contracted bandwidth, so they’ll be picking what to drop. If they support QoS, they will at least pay attention to your priorities as signaled via the QoS DSCP markings. QoS shaping lets you slow down your output to the contracted rate, queuing and dropping per your priorities.
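To put rough numbers on the situations above, here is a minimal Python sketch. The function names are made up for illustration, and it only does the arithmetic: how much traffic is offered, and how much of it has to be queued or dropped at the exit.

```python
def offered_load_gbps(ports, port_speed_gbps, utilization):
    """Aggregate traffic (Gbps) the access ports are trying to send upstream."""
    return ports * port_speed_gbps * utilization


def excess_gbps(offered, exit_capacity_gbps):
    """Traffic (Gbps) that must be queued or dropped; 0 if the exit can carry it all."""
    return max(0.0, offered - exit_capacity_gbps)


# Upstream merge: 48 x 1 Gbps ports at 50%, exiting on one or two 10 Gbps uplinks.
offered = offered_load_gbps(ports=48, port_speed_gbps=1.0, utilization=0.5)
one_uplink = excess_gbps(offered, 1 * 10.0)
two_uplinks = excess_gbps(offered, 2 * 10.0)

# Shaping: 1 Gbps arriving, 500 Mbps contracted on the way out.
shaping = excess_gbps(1.0, 0.5)

print(offered, one_uplink, two_uplinks, shaping)
```

The excess is what QoS gets to make decisions about. Without QoS, the same amount still gets dropped or delayed; you just don't get a say in which traffic it is.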
And here’s one where QoS cannot help:
- No-QoS Provider: Suppose you have an (inexpensive) L2 or L3 WAN/MAN provider connecting your remote sites to the main site. At many points in the provider network, the various customers’ flows will necessarily merge. If the total outbound traffic on some provider internal network interface exceeds the bandwidth available, random drops of traffic will occur, ignoring any QoS DSCP bits you set. That is, your priorities are not their priorities; they have none.
- In other words, your priority traffic may get trashed because some other customer is pumping out a lot of traffic.
The good news there is that L2 switch-based providers may have 40 or 100 Gbps switch interconnections and be running at aggregated customer traffic utilization levels where drops are rare, even if one or two customers are putting out an unusually large amount of traffic. But even with that, you might have good days and bad days. How much over-provisioning is a small or cheap provider likely to have, given that it may well affect their profitability? How fast can they add bandwidth when their reporting system (if any) tells them a link is constantly running “hot”?
That brings up a key thought about QoS: QoS is application quality insurance against having bad days. If you contract with a provider that does not offer QoS, you may be saving money, and things may work fine most of the time. But you and they have less control of what happens under the three situations (bullet points) above.
Another way to think about QoS is in terms of drop-tolerant traffic, namely TCP. Dropped TCP packets go unacknowledged, which signals the sender to slow down. So, part of what we do with QoS is protect the “fragile” traffic (my term for it) like VoIP and video, and divide up the remaining bandwidth among drop-tolerant classes.
Yes, there is TCP-based video. I would expect that any retransmissions would cause a short-lived video display “freeze.” TCP video apps may buffer for a few seconds before display to allow time for a retransmission. Multiple dropped packets may still cause problems. This may explain some hospital ultrasound issues along those lines that we spent some time troubleshooting. (We discontinued that effort because it was taking a long time, with many devices in the path. We found that the video app seemed to work better without QoS as a quick fix, and the switches are to be replaced soon.)
When designing QoS, I recommend dividing up the outgoing bandwidth via percentages, which really specify the ratio of bandwidth the various classes get. That way, when there is some spare bandwidth, other classes get to use that bandwidth. Putting in shaping and policing commands per class caps classes’ traffic, which can result in unused (wasted!) bandwidth. I prefer to shape only when there is a contracted rate below the line rate, e.g., 2 Gbps on a 10 Gbps link.
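Here is a rough Python sketch of the difference between percentage shares and per-class caps. It is a simplified model, not how any particular switch or router schedules packets: it ignores strict-priority queuing for voice, and the class names, weights, and demand figures are invented for illustration.

```python
def share_by_weight(capacity, weights, demand):
    """Work-conserving split: share a class does not use is re-offered to the others."""
    alloc = {name: 0.0 for name in weights}
    active = {name for name in weights if demand[name] > 0}
    remaining = float(capacity)
    while active:
        total_weight = sum(weights[name] for name in active)
        fair = {name: remaining * weights[name] / total_weight for name in active}
        satisfied = {name for name in active if demand[name] <= fair[name]}
        if not satisfied:
            # Everyone left wants more than its share: split by weight and stop.
            for name in active:
                alloc[name] = fair[name]
            break
        for name in satisfied:
            alloc[name] = demand[name]
            remaining -= demand[name]
        active -= satisfied
    return alloc


def hard_caps(capacity, weights, demand):
    """Per-class shaping/policing: each class is capped at its share, busy link or not."""
    total_weight = sum(weights.values())
    return {n: min(demand[n], capacity * weights[n] / total_weight) for n in weights}


# Hypothetical 1000 Mbps link; weights are percentages.
weights = {"VOICE": 10, "VIDEO": 30, "DATA": 59, "BULK": 1}
quiet_night = {"VOICE": 50, "VIDEO": 0, "DATA": 200, "BULK": 2000}
busy_day = {"VOICE": 200, "VIDEO": 500, "DATA": 1000, "BULK": 2000}

night_shared = share_by_weight(1000, weights, quiet_night)
night_capped = hard_caps(1000, weights, quiet_night)
day_shared = share_by_weight(1000, weights, busy_day)

print(night_shared["BULK"], night_capped["BULK"], day_shared["BULK"])
```

In this model, BULK gets 750 Mbps on the quiet night with percentage shares but only 10 Mbps with hard caps, leaving 740 Mbps idle. On the busy day both approaches hold BULK to about 10 Mbps, which is the point: percentages only bite when there is contention.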
QoS deployments often have a “BULK” class. (To use the older name for it.) The idea is that some traffic like file transfers can be allocated, say 1% of the capacity. That means that if any other application is sending, its traffic gets priority. The bulk traffic gets to transmit when there’s spare bandwidth. You might for example treat backups as BULK, but allocate, say, 10% of the bandwidth, so they complete within 24 hours (based on experimentation, and realizing that backup traffic generally increases over time, so will take longer to complete). And be especially careful about replication traffic. (The server team adding an unplanned DB replication can ruin your day! – Planning is needed when any single flow can take up a significant fraction of the link.)
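A back-of-the-envelope way to sanity-check a backup allocation like that, sketched in Python. The backup size and link speed are hypothetical, the function names are mine, and the math ignores protocol overhead and any spare bandwidth the backup picks up beyond its guaranteed share, so treat the result as a worst case.

```python
def worst_case_hours(backup_gbytes, link_mbps, share):
    """Hours to move the backup if it only ever gets its guaranteed share of the link."""
    bits = backup_gbytes * 1e9 * 8
    guaranteed_bps = link_mbps * 1e6 * share
    return bits / guaranteed_bps / 3600


def min_share_for_window(backup_gbytes, link_mbps, window_hours):
    """Smallest share (0..1) that still finishes inside the window in the worst case."""
    bits = backup_gbytes * 1e9 * 8
    return bits / (link_mbps * 1e6 * window_hours * 3600)


# Hypothetical: 500 GB nightly backup, 1 Gbps link, BULK allocated 10%.
hours = worst_case_hours(backup_gbytes=500, link_mbps=1000, share=0.10)
needed = min_share_for_window(backup_gbytes=500, link_mbps=1000, window_hours=24)

print(round(hours, 1), round(needed * 100, 1))
```

Re-running the numbers as the backup grows tells you when the allocation needs revisiting, before the backup starts spilling past its window.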
Real World Use Case
Medical system. Remote clinics have sub-1 Gbps MAN/WAN links (cost, availability). When radiology images are transferred back to the main site, VoIP (IP phone and Vocera WiFi badges) gets trashed. QoS can prioritize the voice apps over image transfer so that when there is contention, radiology packets get dropped, slowing down the image file transfers. So QoS can get the traffic out onto the WAN in good shape.
However, if the WAN provider doesn’t do QoS, then the radiology traffic may need to be shaped or policed, capping its use of bandwidth in the provider network. This is anticipating the merging flows scenario above.
Limiting the radiology traffic isn’t great because how do you predict and control what happens in the provider network if multiple radiology uploads from different sites are in progress? If you cap the bandwidth, you may be forcing the radiology upload(s) from a given site to be slow when they don’t have to be. Doctors can become highly unhappy and vocal in such situations!
Conclusion: There’s not much you can do to compensate for a carrier that does not provide QoS.
Coping With QoS Complexity
I frequently hear that configuring Cisco QoS is painful. I’m not inclined to argue: deploying QoS can take a lot of time and attention. The commands also vary across Cisco devices, although it has gotten better in the last few years. (Except for Nexus QoS, which I consider to be just a very weird QoS CLI – tied closely to the hardware capabilities, apparently.)
I know of two answers to that:
- A couple of us at NetCraftsmen have built up a set of QoS design documents for a standardized approach, using common class names and reducing classification to building access lists (to the extent possible). The idea is you can configure the framework of about 10 QoS classes, leaving some classes unused, and enable a new QoS class by adding classification ACLs. This also includes tested configuration templates for various switch types and for IOS routers. This lets us come in and deploy QoS at a lower cost. (No, I can’t share those documents. But we can customize them for a customer.)
- License Cisco Prime or DNAC QoS and use it to automate deploying QoS. This is much simpler.
Re DNAC: one of our staff just used it with a Cat9K hospital site switch replacement and was very positively impressed. I’ve heard that from others too.
Most sites don’t seem to be using DNAC yet, perhaps because their switches aren’t due for replacement yet, or due to COVID. The other possibility might be the failure to appreciate the value of QoS, and the labor aspect of deploying and supporting it.
QoS Design and Deployment
Another part of QoS is a systematic design approach because we need to “classify and mark” (C&M) traffic inbound so that we can leverage the markings upstream.
The best place for C&M is the access switch on the campus, so we can leverage VoIP prioritization going upstream. Where we usually most need “fancier” QoS is at the WAN edge. So one option is to deploy WAN router QoS first (C&M inbound, fancy QoS out to WAN), then retro-fit the campus, or assume the campus has tons of bandwidth (which I don’t recommend).
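Conceptually, C&M is a lookup: match the packet against ACL-like rules, assign a class, and write the class's DSCP value. The Python sketch below illustrates that idea only. The class names, port ranges, and rule table are assumptions for illustration (your voice and bulk traffic may use different ports); the DSCP numbers themselves (EF = 46, CS3 = 24, CS1 = 8) are the standard code points.

```python
from dataclasses import dataclass

# Standard DSCP code points (decimal).
DSCP = {"EF": 46, "CS3": 24, "CS1": 8, "DEFAULT": 0}


@dataclass(frozen=True)
class Rule:
    traffic_class: str  # hypothetical class name
    protocol: str       # "tcp" or "udp"
    port_low: int
    port_high: int
    marking: str        # key into DSCP


# Hypothetical classification rules, checked in order (first match wins).
RULES = [
    Rule("VOICE", "udp", 16384, 32767, "EF"),    # assumed RTP port range
    Rule("SIGNALING", "tcp", 5060, 5061, "CS3"),  # SIP
    Rule("BULK", "tcp", 873, 873, "CS1"),         # rsync, as an example of bulk transfer
]


def classify_and_mark(protocol, dst_port):
    """Return (class name, DSCP value) for a packet; unmatched traffic is best effort."""
    for rule in RULES:
        if rule.protocol == protocol and rule.port_low <= dst_port <= rule.port_high:
            return rule.traffic_class, DSCP[rule.marking]
    return "DEFAULT", DSCP["DEFAULT"]


def tos_byte(dscp):
    """DSCP occupies the top six bits of the IP ToS / Traffic Class byte."""
    return dscp << 2


print(classify_and_mark("udp", 20000), classify_and_mark("tcp", 443), tos_byte(46))
```

Doing this once, inbound at the edge, is what lets every device upstream act on the DSCP value instead of re-inspecting the traffic.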
Data center QoS is another discussion. There’s usually lots of bandwidth there. But some big flows too.
WiFi, VPN, etc. – all separate domains and topics. With wireless, having the on-wire CAPWAP or other tunnel header capped at some low DSCP value can really trash e.g., guest wireless video. I’m very much Not A Fan of that approach, but it’s the standard, and you don’t want to read my rant on that subject.
And lately, there’s QoS and VMware. The biggest thing I know there is that you want to make sure your call manager, ISE, etc., get plenty of CPU cycles (via shares) and interface bandwidth out of the server chassis. (The caution here is that the VMware admin might see your VM isn’t using its full share and reduce the share amount, to allow putting more VMs onto the VMware host. The symptom of that is slow response by the call manager, ISE, or whatever – NOT what you need!)
NSX has some other QoS aspects I’m going to ignore here. QoS DSCP markings by the VM, great if you can configure that. But I’m fine with doing server-side C&M on the access data center switch, which also provides network-side verifiability and consistency. Simpler!
I’ll keep this short. From an Ops perspective, I/you need to know that the configuration you designed got deployed correctly. The engineers deploying it WILL get bored and miss buffer over-runs or operator errors. Also, when troubleshooting QoS, my first question is usually, “Are we sure that the QoS config hasn’t changed?” That covers both config drift and compliance. Neither is fun.
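If you don't have a tool doing this for you, even a crude drift check beats eyeballing configs. Below is a minimal Python sketch using the standard library's `difflib`. The policy and class names in the sample text are hypothetical, and it assumes you have already pulled the intended template and the running QoS section as plain text by whatever means you use.

```python
import difflib


def config_drift(intended, running):
    """Return lines removed ('- ') or added ('+ ') in running relative to intended."""
    want = [line.rstrip() for line in intended.strip().splitlines()]
    have = [line.rstrip() for line in running.strip().splitlines()]
    return [d for d in difflib.ndiff(want, have) if d.startswith(("- ", "+ "))]


# Hypothetical template and running config excerpts.
INTENDED = """
policy-map WAN-EDGE-OUT
 class VOICE
  priority percent 10
 class BULK
  bandwidth percent 10
"""

RUNNING = """
policy-map WAN-EDGE-OUT
 class VOICE
  priority percent 10
 class BULK
  bandwidth percent 5
"""

drift = config_drift(INTENDED, RUNNING)
for line in drift:
    print(line)
```

An empty result means the device still matches the template. Anything else is either drift or an undocumented change, and either way it is the first thing to rule out when troubleshooting.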
Lately, these two factors have me strongly recommending DNAC to customers doing campus QoS with Cat 9K switches. It greatly cuts QoS deployment cost, assures correctness, and ensures ongoing compliance. It also provides QoS reporting and troubleshooting assistance.
I’ll admit my hands-on time with DNAC QoS is currently low, but my peers report favorably on it. The win is potentially so big I am going ahead with the statement above. It potentially saves a lot of time and money, means you can do QoS without deep expertise, etc.
I hope the above gives you some tools to think about QoS and where it is most critical.
To me, the main thing is to strongly prefer using percentages, shares of the bandwidth, and setting relative priorities.
As soon as you start putting in Mbps or Gbps numbers for policing or shaping, you’re creating a Not To Exceed situation, where even at night with no competing traffic, that class of traffic will not be able to use the spare bandwidth.
And do check out Cisco DNA Center’s QoS support!