Diagnosing a QoS Deployment

Terry Slattery
Principal Architect

Several of us at NetCraftsmen were involved in an interesting network QoS configuration case a few weeks ago. The customer had requested assistance with the first phase of a QoS deployment, in which we were going to add QoS to some overloaded WAN links. These links ran at low utilization at night, but at 8am the utilization started increasing, and by 9am the link was typically saturated. At about 4pm the load started dropping, and by 7pm it was back to the night-time level. Users at the WAN site complained regularly about slow application performance.

Since the link was saturated, we knew that drops would affect all applications, but that TCP performance would be especially poor as TCP backed off. (See my blog on TCP Performance and the Mathis Equation for calculating TCP performance due to packet loss.) UDP applications wouldn’t have any feedback and would continue to monopolize the link.
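For reference, the Mathis model approximates the sustained throughput of a TCP connection as a function of the maximum segment size (MSS), round-trip time (RTT), and packet loss probability p, roughly:

  Throughput < (MSS / RTT) * (1 / sqrt(p))

So even a modest increase in the drop rate on a saturated link sharply reduces the throughput that each TCP flow can achieve.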

Application monitoring showed that a significant portion of the daytime traffic was to Akamai, Limelight Networks, and Pandora.com. In other words, much of the traffic was streaming entertainment. Since the link was saturated for most of the day, a lot of business application traffic was being dropped, and the customer wanted to do something about it.

The QoS design used three traffic classes:

  1. Low latency data, which was time-critical business applications.
  2. Bulk data, which was the majority of the business application traffic.
  3. Low priority data, which was the “entertainment traffic” that we had identified.
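A sketch of the corresponding class maps, with illustrative class and ACL names (the actual match criteria were based on the applications and servers identified at this site):

class-map match-any LOW-LATENCY-DATA
  match access-group name BUSINESS-CRITICAL-ACL
class-map match-any LOW-PRIORITY-DATA
  match access-group name ENTERTAINMENT-ACL

Traffic matching neither class falls into class-default, which carried the bulk data.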

We built the traffic classes and the policy (Cisco configs) and applied them to one of the WAN links early in the morning. We monitored the link utilization and QoS queueing, using ‘show policy-map interface Serial 1/0’. As the load started to build, we saw a drop profile that looked like this:

  1. Low latency data: 240
  2. Best effort: 5125
  3. Low priority data: 2791

That wasn’t exactly what we wanted to happen. We wanted no drops in the low latency data queue and most of the drops in the low priority data queue. After some research, we determined that the low latency data applications were sending a lot of very small packets. The default queue depth for each traffic class was 64 packets, so a burst of data (a screen update) would overrun the queue, and any packets beyond the first 64 in a burst would be dropped.

We decided to incrementally increase the buffer pool in the low latency data traffic class. We increased it to 128 buffers using the command ‘queue-limit 128’, applied to the low latency traffic class in the policy map.

The configuration of the policy map now looked like this (the names of the low latency and low priority classes shown here are representative):

policy-map WAN-QOS-POLICY
  description WAN outbound queuing and scheduling policy
  class LOW-LATENCY-DATA
     bandwidth percent 3
     queue-limit 128
  class LOW-PRIORITY-DATA
     bandwidth percent 1
  class class-default
     bandwidth percent 26

The result was a move in the right direction, but not quite enough. Drops in the low latency queue decreased but were not eliminated. So we increased the queue depth to 256, and drops in the low latency queue stopped. Success! We suspect that a queue depth of about 150 packets would have handled the bursts, but we have not performed any further tuning.
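The change was a one-line edit to the low latency class in the policy map (class name illustrative):

policy-map WAN-QOS-POLICY
  class LOW-LATENCY-DATA
     queue-limit 256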

We now turned our attention to the best effort queue. Our rationale was that in order to force drops into the low priority data queue, we would need to buffer more of the best effort data. Too much buffering isn’t good; it can fool TCP, which retransmits packets that time out. (See Wikipedia and Jim Gettys’ blog to learn about the effects of too much buffering.) So we increased the buffer pool on the best effort queue (class class-default) to 128. The majority of the drops shifted to the low priority queue, which is exactly what we wanted.
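The corresponding change for the best effort queue was:

policy-map WAN-QOS-POLICY
  class class-default
     queue-limit 128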

After this exercise, we started thinking about how to monitor all the queues on all the interfaces that have QoS implemented, so that we can identify where a high priority queue is dropping packets. Very few NMS products provide functionality to monitor QoS (via the CBQoS MIB), and those that do have very simplistic interfaces. Customers want a way to monitor the queue depth, traffic volume, and drops per traffic class. The NMS needs to provide alerting thresholds on drops and traffic volume. The system should also allow plots of both parameters, with the queue depth reported in the chart header.
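As a starting point, an NMS can poll the per-class counters in the CISCO-CLASS-BASED-QOS-MIB (OID subtree 1.3.6.1.4.1.9.9.166) once SNMP read access is configured on the router; a minimal sketch, with an illustrative community string:

snmp-server community QOS-MONITOR RO
! The NMS then polls the CBQoS class-map statistics for
! per-class byte counts, drop counts, and queue depth.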

The alerting function would identify links where a high-priority queue is dropping an excessive number of packets. A different alert could be generated when a big change occurs in a traffic class’s utilization, alerting the network staff that a significant network change caused a shift in traffic or that a new application was deployed. When utilization changes significantly due to new applications or permanent network changes, the network baseline needs to be updated.



Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html.

