Diagnosing a QoS Deployment

Author
Terry Slattery
Principal Architect

Several of us at NetCraftsmen were involved in an interesting network QoS configuration case a few weeks ago. The customer had requested assistance with the first phase of a QoS deployment, in which we were going to add QoS to some overloaded WAN links. These links ran at low utilization at night, but at 8am utilization started climbing and by 9am the link was typically saturated. At about 4pm the load started dropping, and by 7pm it was back to the night-time level. Users at the WAN site complained regularly about slow application performance.

Since the link was saturated, we knew that drops would affect all applications, but that TCP performance would be especially poor as TCP backed off. (See my blog post on TCP Performance and the Mathis Equation for calculating TCP throughput as a function of packet loss.) UDP applications wouldn’t have any feedback and would continue to monopolize the link.
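For reference, the Mathis et al. approximation bounds steady-state TCP throughput at roughly

  Throughput ≈ (MSS / RTT) × (1 / √p)

where MSS is the maximum segment size, RTT is the round-trip time, and p is the packet loss rate. Even a modest loss rate on a saturated link therefore throttles every TCP flow, while UDP flows keep sending at full rate.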

Application monitoring showed that a significant portion of the daytime traffic was going to Akamai, Limelight Networks, and Pandora.com; in other words, much of it was streaming entertainment. Since the link was saturated for most of the day, a lot of business application traffic was being dropped, and the customer wanted to do something about it.

The QoS design used three traffic classes:

  1. Low latency data, which was time-critical business applications.
  2. Bulk data, which was the majority of the business application traffic.
  3. Low priority data, which was the “entertainment traffic” that we had identified.
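
The original match criteria weren’t published, so here is a minimal, hypothetical sketch of what the class maps might look like, assuming a DSCP marking identifies the time-critical applications and an access list (the ACL name is made up) identifies the entertainment traffic; bulk business traffic simply falls into class-default:

! Hypothetical class maps; the actual match criteria were not published
class-map match-any OUT-LOW-LATENCY-DATA
  match ip dscp af31
class-map match-any OUT-LOW-PRIORITY-DATA
  match access-group name ENTERTAINMENT-TRAFFIC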

We built the traffic classes and the policy (Cisco configs) and applied them to one of the WAN links early in the morning. We monitored the link utilization and QoS queueing, using ‘show policy-map interface Serial 1/0’. As the load started to build, we saw a drop profile that looked like this:

  1. Low latency data: 240
  2. Best effort: 5125
  3. Low priority data: 2791

That wasn’t exactly what we wanted to happen. We wanted no drops in the low latency data queue and most of the drops in the low priority data queue. After some research, we determined that the low latency data applications were sending a lot of very small packets. The default queue limit on each traffic class was 64 packets, so a burst of data (a screen update, for example) would overrun the queue and any packets beyond the 64-packet limit were dropped.
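
Both the queue limit and the per-class drop counters appear in the ‘show policy-map interface’ output we were watching. A rough sketch of the relevant section for one class follows; the counters, rates, and match line are illustrative only, and the exact fields vary by IOS release:

Serial1/0

  Service-policy output: WAN-QOS-POLICY

    Class-map: OUT-LOW-LATENCY-DATA (match-any)
      12744 packets, 1630848 bytes
      5 minute offered rate 52000 bps, drop rate 2000 bps
      Match: ip dscp af31
      Queueing
      queue limit 64 packets
      (queue depth/total drops/no-buffer drops) 0/240/0
      (pkts output/bytes output) 12504/1600512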

We decided to incrementally increase the queue depth for the low latency data traffic class. We started by doubling it to 128 packets, using the command ‘queue-limit 128’ applied to the low latency traffic class in the policy map.

The configuration of the policy map now looked like this:

policy-map WAN-QOS-POLICY
  description WAN outbound queuing and scheduling policy
  class OUT-LOW-LATENCY-DATA
     bandwidth percent 3
     queue-limit 128
 
  class OUT-LOW-PRIORITY-DATA
     bandwidth percent 1

  class class-default
     bandwidth percent 26

The result was a step in the right direction, but not quite enough. Drops in the low latency queue decreased, but were not eliminated. So we increased the queue depth to 256 and drops in the low latency queue stopped. Success! We suspect that a queue depth of about 150 packets would have handled the bursts, but we have not performed any further tuning.
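
For completeness, the incremental change was just an updated queue-limit under the low latency class in the existing policy map:

policy-map WAN-QOS-POLICY
  class OUT-LOW-LATENCY-DATA
     queue-limit 256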

We now turned our attention to the best effort queue. Our rationale was that in order to force drops into the low priority data queue, we would need to buffer more of the best effort data. Too much buffering isn’t good either; it can fool TCP, which retransmits packets that have merely been delayed past their timeout. (See Wikipedia and Jim Gettys’ blog to learn about the effects of too much buffering.) So we increased the queue limit on the best effort queue (class class-default) to 128 packets. The majority of the drops shifted to the low priority queue, which is exactly what we wanted.
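
As with the low latency class, this was a single queue-limit statement, this time under class-default:

policy-map WAN-QOS-POLICY
  class class-default
     queue-limit 128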

After this exercise, we started thinking about how to monitor all the queues on all the interfaces that have QoS applied, so that we could identify where a high priority queue is dropping packets. Very few NMS products provide functionality to monitor QoS (via the CISCO-CLASS-BASED-QOS-MIB, or CBQoS MIB), and those that do offer only simplistic interfaces. Customers want a way to monitor the queue depth, traffic volume, and drops per traffic class. The NMS needs to provide alerting thresholds on drops and traffic volume. The system should also allow plots of both parameters, with the queue depth reported in the chart header.

The alerting function would identify links where a high-priority queue is dropping an excessive number of packets. A different alert could be generated when a big change occurs in a traffic class’s utilization, alerting the network staff that a significant network change has caused a shift in traffic or that a new application has been deployed. When utilization changes significantly due to new applications or permanent network changes, the network baseline needs to be updated.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html


