How’s the network running? More on SLAs

Author
Terry Slattery
Principal Architect

A few weeks ago, in “Network SLAs – Which one to use?”, I described SLAs based on device reachability or uptime.  Today’s topic is SLAs based on network QoS.

Many apps, such as VoIP and SAP, are intolerant of large variations in network performance characteristics.  Even when all devices are up and connectivity is good, the network's operational characteristics can prevent some applications from providing acceptable service.  For VoIP, the key quality factors are jitter, latency, and packet loss.

How should you factor a quality metric into the network SLA?  I think that multiple SLAs are needed – some for network availability and reachability and others for quality.  Collecting and interpreting quality metrics can be an interesting challenge.  Do you run separate tests, perhaps using IPSLA, which generates additional network traffic?  Or do you collect and measure the quality characteristics of real user traffic?

I prefer to use real user traffic when possible because it lets me know about user problems as they occur.  Synthetic tests are then useful for collecting more detailed evidence on the same paths, using the same protocols as the user traffic.  You could ignore user traffic, instrument the network with a set of synthetic tests, and monitor only the results of those tests, but the volume of tests and network traffic needed to collect good evidence becomes a new problem to manage.  And a new problem to manage is the last thing that we need.

One of the things that I like to do is collect delay, jitter, and loss data from the VoIP systems and search the collected data for phone calls that have high levels of any of these factors.  The logs show the source and destination addresses, so I can determine the path through the network.  I know that the call traffic is UDP and the call setup is TCP, so when I find something that’s not right, I can configure IPSLA to run a test to determine whether the problem is continuous, periodic, or intermittent.  I can also set up tests that run to intermediate nodes so that I can determine which element in a path may be causing the problem.
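The search step can be sketched in a few lines of Python. This is a minimal sketch, not the export format of any particular VoIP system: the `CallRecord` fields, the units (milliseconds and percent), the addresses, and the threshold values are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One call's quality data (hypothetical record layout)."""
    src: str
    dst: str
    delay_ms: float
    jitter_ms: float
    loss_pct: float


# Illustrative thresholds only; pick values that suit your own network.
THRESHOLDS = {"delay_ms": 150.0, "jitter_ms": 30.0, "loss_pct": 1.0}


def find_bad_calls(records, thresholds=THRESHOLDS):
    """Return (record, exceeded_metric_names) for calls over any threshold."""
    bad = []
    for rec in records:
        exceeded = [name for name, limit in thresholds.items()
                    if getattr(rec, name) > limit]
        if exceeded:
            bad.append((rec, exceeded))
    return bad


# Made-up sample data.
calls = [
    CallRecord("10.1.1.10", "10.2.1.20", 40.0, 5.0, 0.0),
    CallRecord("10.1.1.11", "10.3.1.30", 45.0, 120.0, 0.5),
    CallRecord("10.1.2.12", "10.3.1.31", 210.0, 8.0, 2.5),
]

for rec, exceeded in find_bad_calls(calls):
    print(f"{rec.src} -> {rec.dst}: {', '.join(exceeded)}")
```

The source and destination addresses of each flagged call are the starting point for working out the path and deciding where to aim a synthetic test.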

Now that I have data, either from user traffic or from synthetic tests, how do I use it for an SLA?  Averaging samples is seldom the right thing to do because it will often hide a few really bad data points within the volume of data from all users.   A weighting scheme, or perhaps looking for max values on critical factors, seems like a more useful mechanism because it increases visibility into important problem symptoms.
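A quick illustration of how an average hides bad samples, using made-up jitter values (97 healthy calls and 3 very bad ones):

```python
from statistics import mean

# Made-up jitter samples: 97 healthy calls and 3 very bad ones.
samples = [5.0] * 97 + [200.0] * 3

print(mean(samples))  # 10.85, which looks healthy
print(max(samples))   # 200.0, which does not
```

Three calls had a terrible experience, yet the average barely moves.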

The purpose of an SLA is to measure how the overall network is operating.  With this in mind, I think that I’d like a multi-valued SLA that shows the overall average, the max value out of the collected data, and the average of the Top-10 values.  The Top-10 Average would tell me roughly how far the worst data points are from the average.  Let’s look at an example.  I have 498 data points of jitter that have the following characteristics:

  • Max = 314
  • Average = 7.6
  • Std Deviation = 31
  • Top 10 Average = 178

The max tells me that there is one phone call that had very bad jitter.  The average tells me that the overall network jitter is within acceptable limits.  But the Top-10 average tells me that there is a bad problem somewhere in the network that is causing exceptionally high jitter and that it is affecting multiple calls.  The standard deviation tells me the same thing in a slightly different way. If both the Top-10 Average and Standard Deviation were much closer to the average, then that would indicate a smaller problem.  Without looking at the individual data samples, I can’t tell from the SLA if it is one station or multiple stations, but I do get an indication of something that is worth investigating.
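The multi-valued calculation is simple enough to sketch in Python. The function name `sla_summary` and the sample data are hypothetical, and the use of the population standard deviation is an assumption; the sample standard deviation would serve equally well for this purpose.

```python
from statistics import mean, pstdev


def sla_summary(samples, top_n=10):
    """Summarize a list of measurements as a multi-valued SLA.

    Returns the overall average, the max, the standard deviation,
    and the average of the top_n worst (largest) values.
    """
    if not samples:
        raise ValueError("no samples to summarize")
    worst = sorted(samples, reverse=True)[:top_n]
    return {
        "count": len(samples),
        "average": mean(samples),
        "max": worst[0],
        "std_dev": pstdev(samples),  # population std deviation (assumption)
        "top_n_average": mean(worst),
    }


# Made-up data: 90 good samples and 10 bad ones.
jitter = [2.0] * 90 + [100.0] * 10
summary = sla_summary(jitter)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because the function only needs a list of numbers, the same calculation applies unchanged to delay, loss, or any other metric where larger values are worse.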

While thinking through this problem, I considered using analysis specific to jitter, but that would give up the benefit of a simple SLA algorithm that can be applied to a variety of data.  By performing this type of SLA calculation on a number of critical network characteristics, then applying a notification filter, we can build a pretty interesting alerting system that tells us there is a network problem without our having to look through vast volumes of data.
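One possible shape for such a notification filter is sketched below. The rule itself is an assumption for illustration: flag a metric when its max exceeds a hard limit, or when its Top-10 average is more than some multiple of the overall average. The function name `needs_alert`, the limits, and the `spread_ratio` default are hypothetical. The jitter figures are the ones from the example above; the latency figures are made up.

```python
def needs_alert(summary, max_limit, spread_ratio=5.0):
    """Return a list of reasons to alert; an empty list means no alert.

    summary must contain "average", "max", and "top_n_average".
    """
    reasons = []
    if summary["max"] > max_limit:
        reasons.append("max over limit")
    if (summary["average"] > 0
            and summary["top_n_average"] > spread_ratio * summary["average"]):
        reasons.append("top-N average far above overall average")
    return reasons


summaries = {
    "jitter": {"average": 7.6, "max": 314, "top_n_average": 178},
    "latency": {"average": 40.0, "max": 55.0, "top_n_average": 50.0},
}
limits = {"jitter": 100, "latency": 150}  # illustrative limits only

for metric, summary in summaries.items():
    reasons = needs_alert(summary, limits[metric])
    if reasons:
        print(f"{metric}: {'; '.join(reasons)}")
```

With these inputs only jitter is reported, which is the point of the filter: the staff see the metric that needs attention and nothing else.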

Combine the above methodology with the network uptime and reachability SLAs I previously described and there’s the basis for a nice network SLA dashboard.  Use charts instead of numbers to make trends visible and you have a dashboard that the execs will love and that makes sense to the networking staff.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for its permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html
