Some QoS Gotchas

Peter Welcher
Architect, Operations Technical Advisor

Recently, I’ve been consulting in some situations with subtle QoS / user experience issues. The following blog describes some things that you might be overlooking.

Internet Links Matter

First, there is growing use of the Internet for SaaS applications. We no longer just have to provide enough bandwidth for internal applications.

However, the quality of Internet service varies. The old saying applies that at best you get two out of three from {fast, good, cheap}. What discard SLA does your ISP offer (if any)?

If your users are complaining:

  • Check your Internet link: does it have enough bandwidth?
  • Note that classifying Internet traffic for QoS is hard, and there is likely no good way to differentiate handling of inbound traffic anyway. Traffic reaches your Internet edge router before you have a chance to do anything with it. Applying shaping where the inbound traffic exits the edge router (on the LAN-facing interface) may at least help you throttle inbound TCP traffic to a degree: the delay and drops the shaper introduces cause TCP senders to slow down. If you’ve got aggressive file transfers, replication, etc. going on, throttling them some might help. A later blog will look at this topic in more detail.
  • Beyond that, about all you can do is manage the congestion level, making sure you’re providing enough bandwidth.
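To make the throttling idea concrete, here is a toy token-bucket shaper in Python. This is purely my illustration of the mechanism, not anything a router actually runs: packets beyond the configured rate get delayed rather than sent immediately, and that added delay is what makes TCP senders back off.

```python
import time


class TokenBucketShaper:
    """Toy token-bucket shaper: release bytes no faster than rate_bps.

    Illustrative only -- real shaping happens in the router's egress
    queueing, not in Python.
    """

    def __init__(self, rate_bps, burst_bytes):
        self.rate = rate_bps / 8.0      # bytes per second
        self.burst = burst_bytes        # bucket depth in bytes
        self.tokens = burst_bytes       # start with a full bucket
        self.last = time.monotonic()

    def _refill(self):
        now = time.monotonic()
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now

    def send(self, packet_bytes):
        """Return 0.0 if the packet may go now, else seconds to delay it."""
        self._refill()
        if self.tokens >= packet_bytes:
            self.tokens -= packet_bytes
            return 0.0
        return (packet_bytes - self.tokens) / self.rate
```

A shaper configured at 8 kbps with a 1000-byte bucket passes the first 1000 bytes immediately, then starts imposing delay, which is exactly the back-pressure you want on greedy TCP flows.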

Quantify User Experience

Another thing you can do is get a “UX tool” (User Experience tool). By that I mean some network management tool that will help you quantify UX, and measure UX from different locations in your network. ThousandEyes, AppNeta, and NetBeez are three tools that can help in that regard.

Put one measurement device right by your Internet edge, another in a user VLAN (or connected via a WLAN AP).

See if the readings are approximately the same for both devices. If not, that likely tells you something useful.

Having ongoing automated measurements you can graph helps you determine, when there is a UX problem, whether something measurable (and that you are in fact measuring) has changed.
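If you want to roll a crude version of this yourself before buying a tool, the sketch below is one way to do it in Python. It is my own illustration, not how the commercial tools work: the probe callable is a hypothetical stand-in for whatever transaction you care about (an HTTP GET against a SaaS URL, a DNS lookup, etc.).

```python
import time
import statistics


def collect_latency(probe, samples=5, interval=0.0):
    """Run probe() repeatedly and record how long each call takes.

    probe: any callable that exercises the path you care about.
    Returns per-sample latencies in milliseconds.
    """
    latencies = []
    for _ in range(samples):
        start = time.monotonic()
        probe()
        latencies.append((time.monotonic() - start) * 1000.0)
        time.sleep(interval)
    return latencies


def summarize(latencies):
    """Boil a run down to the numbers worth alerting on."""
    return {"min": min(latencies),
            "median": statistics.median(latencies),
            "max": max(latencies)}
```

Run one copy next to the Internet edge and one in a user VLAN; if the medians diverge, the path between the two measurement points is where to start looking.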

Direct Internet Access

You also might consider giving sites direct Internet access. If your organization is back-hauling Internet traffic through a centralized security stack, the added latency may be degrading remote-site user experience. You could instead use a regionalized approach (see for example my prior blogs about Performance Hubs), or have sites directly access the Internet via SD-WAN devices and / or a security service like Zscaler or Cisco Umbrella.


LAN QoS

Here’s another thing to consider: LAN QoS. WAN and LAN QoS aren’t completely separate things. Providing WAN QoS is likely your first priority, but LAN QoS is also part of delivering good user experience.

People keep telling me “we have 10 Gbps uplinks, lots of bandwidth, we don’t need LAN QoS, etc.”.

I disagree. QoS is about short-lived events. When you aggregate many downstream 1 Gbps ports, they can temporarily congest a 10 Gbps uplink, causing tail drops. When a 10 Gbps downlink pumps data into a switch that has to send it out a 1 Gbps link to a user or server, the queue(s) for that port can fill up and drops occur. LAN QoS lets you protect your more sensitive voice, video, application sharing, or other critical application traffic. Think of it as “microburst insurance”.
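A toy simulation makes the point. The numbers below are invented for illustration, but the mechanism is the real one: arrivals briefly exceed the uplink's drain rate, the queue fills, and the excess is tail-dropped.

```python
def microburst_drops(arrival_mbits, service_mbits, buffer_mbits):
    """Slot-by-slot FIFO queue simulation.

    arrival_mbits: offered traffic per time slot (e.g. per millisecond)
    service_mbits: uplink capacity drained per slot
    buffer_mbits:  queue depth available before tail drop
    Returns (total_dropped_mbits, max_queue_depth_mbits).
    """
    queue = dropped = max_depth = 0.0
    for arriving in arrival_mbits:
        queue += arriving
        # more than buffer + this slot's drain capacity: tail-drop the excess
        if queue > buffer_mbits + service_mbits:
            dropped += queue - (buffer_mbits + service_mbits)
            queue = buffer_mbits + service_mbits
        queue = max(0.0, queue - service_mbits)
        max_depth = max(max_depth, queue)
    return dropped, max_depth
```

For example, 24 access ports each bursting at 1 Gbps for 5 ms offer 24 Mbit per millisecond to a 10 Gbps uplink (10 Mbit/ms) with roughly 20 Mbit of buffer: the simulation drops 50 Mbit, even though that burst would round to zero on a 5-minute utilization graph.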

As a metaphor, if you’ve ever had the experience of sitting at a railroad crossing waiting for a long freight train to go by, well, that’s your VoIP packets trying to get into a switch queue along with all the packets in a file transfer. If you prefer, picture a car on an interstate entrance ramp with wall-to-wall trucks on the main highway; that gives new meaning to “dropped packet”.


Visibility

Yet another factor to consider is what tool(s) you’re using for visibility. If you don’t have tools providing good visibility, you’re flying blind. Get a good tool.

To me, “good tool” requires, at a minimum, the ability to monitor all active interfaces, frequently. Part of a user’s experience is their wired connection. I’ve seen enough sites with undetected duplex mismatches or bad cables, and, worse, users who learned to live with terrible network performance. I firmly believe Network Operations teams need that access link visibility. Without it, you can spend hours trying to diagnose poor user experience.
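As a sketch of the kind of check I mean, the Python below flags access links whose error / discard percentage looks suspicious. The interface names and counter values are invented, and the SNMP polling itself is left out; the counters you would feed it are deltas of the standard IF-MIB ones (ifHCInUcastPkts, ifInErrors, ifInDiscards) between two polls.

```python
def suspect_access_links(stats, error_pct_threshold=0.01):
    """Flag interfaces whose error rate suggests a bad cable or
    duplex mismatch.

    stats: dict of interface name -> (packets, errors_plus_discards),
    both deltas over the same polling interval.
    Returns [(name, error_pct), ...] sorted worst-first.
    """
    flagged = []
    for name, (packets, errors) in stats.items():
        if packets and 100.0 * errors / packets >= error_pct_threshold:
            flagged.append((name, 100.0 * errors / packets))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Even a 0.01% error rate on an access port is worth a look; healthy links sit at or near zero.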

There’s a visibility issue lurking there: workstations connected via IP phones, rather than directly to switch ports, are invisible to network management tools. The PC-to-phone link might have duplex or other problems (bad-quality cabling with lots of drops), but your tool is likely going to be unable to help you spot that!

The new challenge is that more and more users are on WLAN, which more easily ends up operating in a degraded condition. What’s more, I have yet to see tools that report, across all users, on things like per-user utilization, drops, and errors over WLAN. Doing that well might require some sort of user agent, deployed on at least a sampling of user devices.

To me, the absolute starting requirement with WLAN is a high-quality site survey, plus careful deployment, since we keep coming across APs deployed with internal antennas in the wrong orientation. But that’s a topic for a different blog, all in itself.

What to Monitor

Reporting on interface utilization (and errors and discards) is a good start. Reporting on percentage utilization is better (less brain activity / recall of interface speeds required, more direct information about problems). TopN data can help.
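The percentage calculation itself is simple. Here is a minimal Python version, assuming you are polling a 64-bit octet counter such as IF-MIB’s ifHCInOctets (the polling itself is left out):

```python
def utilization_pct(octets_t0, octets_t1, interval_s, speed_bps,
                    counter_max=2 ** 64):
    """Percent utilization from two readings of an SNMP octet counter.

    Handles a single counter wrap between polls via modular subtraction.
    """
    delta = (octets_t1 - octets_t0) % counter_max   # octets in the interval
    bits = delta * 8
    return 100.0 * bits / (interval_s * speed_bps)
```

For example, 375 MB transferred in a 5-minute poll of a 10 Mbps link works out to 100% utilization. The same calculation with per-queue byte and drop counters (e.g. from the CISCO-CLASS-BASED-QOS-MIB) gives you utilization and drop percentage per queue.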

Unfortunately, that doesn’t tell you about QoS. What’s going on with the QoS queues? Are the applications users are complaining about even marked and getting specialized QoS handling? One answer is percentage utilization and drops or drop percentage per queue.

You do have to bear in mind that QoS is about picking winners and losers. So, you should be expecting drops in your “loser” queues.

What you have to watch out for, however, is averaging. If you’re looking at TopN numbers with 24-hour averages, that’s not very useful for telling you about conditions that might have lasted for seconds or minutes.

Do It By The Numbers

I like 95th percentile numbers to solve that. They tell you how bad the worst measurements in a group were. The 95th percentile tells you the level the worst 5% of the measurements were at or above. That amounts to a bit over an hour (72 minutes) of “badness” out of 24 hours of measurements.

The percentile number tells you the level your hour of badness was at or exceeded — how bad the badness was, in effect. For example, if your 95th percentile utilization is 30%, you’ve got a pretty uncongested link. If your 95th percentile utilization is 99%, your link accumulated over an hour of measurement intervals where utilization was at or above 99% — i.e. a cumulative hour-plus of being badly congested.

90th percentile translates to approximately the level of the worst 2.4 of 24 hours.

99th percentile translates to approximately the level of the worst 15 minutes. Etc.

Reporting on TopN percentiles across all interfaces tells you your worst interfaces, and how bad they were.
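To show why the percentile beats the average, here is a minimal nearest-rank percentile plus a TopN ranking in Python (my sketch; the sample values are invented). A link that idles at 5% all day except for 75 busy minutes averages under 10%, but its 95th percentile comes out at 99%.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the value that pct percent of the
    samples fall at or below."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100.0 * len(ordered)))
    return ordered[rank - 1]


def worst_interfaces(utilization_by_iface, pct=95, top_n=5):
    """Rank interfaces by their pct-th percentile utilization,
    worst first."""
    scored = [(name, percentile(vals, pct))
              for name, vals in utilization_by_iface.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]
```

With 288 five-minute samples per day, 273 at 5% utilization and 15 (75 minutes) at 99%, the mean is about 9.9% while percentile(day, 95) returns 99. The same ranking works unchanged on per-queue drop percentages.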

You may be wondering, “OK, Pete likes math, but that’s obscure stuff.” Well, tools can (or should be able to) do the calculations for you.

There’s a method to my math-ness here (pun intended).

Lesson learned repeatedly: graphs are useful, but graphs can be misleading. If you graph 24 hours’ or several days’ worth of utilization data, you will likely see lots of very thin peaks or “spikes”. If you zoom in, some of them may actually represent one to two hours of high utilization.

There are really three problems with graphs of utilization or QoS data:

  • Skimming graphs across 1,000 or 10,000 interfaces is very slow; it’s not going to happen. The Mark 1 eyeball is not efficient.
  • With several QoS queues per interface, looking at, say, six times as many QoS graphs is really not going to happen.
  • And as above, graphs can be misleading: peaks that look non-threatening.

The percentile reporting approach has two benefits:

  • TopN percentiles (and sorting) call your attention to the interfaces or queues that matter, the ones with problems.
  • The numeric data is less likely to be misleading than graphs plus eyeballs.

Having said that, I’ll freely admit I like looking at the graphs too — once I know which graphs matter. That’s how you spot patterns, like daily surges of traffic (Example: oh, someone is doing backups during business hours; they probably got 12 AM and 12 PM confused).

What To Look For

What we’re looking for here are two things:

  • Is there enough bandwidth? If you don’t like the drop rates or saturation levels on your low priority queues, that’s one indication you’re short on bandwidth and need more.
  • Are any of your classes of traffic exceeding their guaranteed bandwidth, and dropping traffic because there’s no spare bandwidth left over from the other classes’ guarantees?

In either case, the answer is likely to be: add more bandwidth. I do not recommend fiddling with QoS class percentage guarantees on a per-interface basis. Managing per-interface variations in percentages strikes me as unwieldy, time-consuming, and ultimately not feasible. To use a slightly dated term, no interface should be a unique snowflake; there are too many of them!

Unless you have vast bandwidth or application differences, if a queue apparently isn’t guaranteeing enough bandwidth even after taking “leftover” bandwidth (unused guarantees from the other classes) into account, then you need more bandwidth overall.


If you’re grappling with QoS, I hope the above gives you some things to think about or try out.

NetCraftsmen would love to consult with you on your QoS and help you try to improve things. The above are things you can check / implement for yourself. If you’ve already got good WAN QoS, then the above items may be what’s holding your QoS back from doing the complete job for you.


Comments are welcome, both in agreement or constructive disagreement about the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!


Hashtags: #CiscoChampion #TechFieldDay #TheNetCraftsmenWay #Routing #Switching

Twitter: @pjwelcher

Disclosure Statement
Cisco Certified 20 Years

NetCraftsmen Services

Did you know that NetCraftsmen does network /datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at
