How’s the network running? More on SLAs

Author
Terry Slattery
Principal Architect

A few weeks ago, in “Network SLAs – Which one to use?”, I described SLAs based on device reachability or uptime.  Today’s topic is SLAs based on network QoS.

Many applications, such as VoIP and SAP, are intolerant of large variations in network performance.  Even though all devices are up and connectivity is good, the network’s operational characteristics can prevent some applications from providing acceptable service.  For VoIP, jitter, latency, and packet loss are the key quality factors.

How should you factor a quality metric into the network SLA?  I think that multiple SLAs are needed – some for network availability and reachability and others for quality.  Collecting and interpreting quality metrics can be an interesting challenge.  Do you run separate tests, perhaps using IPSLA, which generates additional network traffic?  Or do you collect and measure the quality characteristics of real user traffic?

I prefer to use real user traffic when possible because it lets me know about user problems as they occur.  Synthetic tests are then useful for collecting more detailed evidence on the same paths and using the same protocols as the user traffic.  While you could ignore user traffic and just instrument the network with a set of synthetic tests and monitor their results, the volume of tests and network traffic needed to collect good evidence becomes a new problem to manage.  And a new problem to manage is the last thing that we need.

One of the things that I like to do is collect delay, jitter, and loss data from the VoIP systems and search the collected data for phone calls that have high levels of any of these factors.  The logs show the source and destination addresses, so I can determine the path through the network.  I know that the call traffic is UDP and the call setup is TCP, so when I find something that’s not right, I can configure IPSLA to run a test to determine whether the problem is continuous, periodic, or intermittent.  I can also set up tests that run to intermediate nodes so that I can determine which element in a path may be causing the problem.
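Searching the collected call data for bad calls can be sketched roughly as follows. This is only an illustration: the CallRecord fields, the addresses, and the threshold values are all assumptions of mine, not the format of any particular VoIP system’s logs, and you would substitute limits that suit your own environment.

```python
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One call's quality data (hypothetical field names)."""
    src: str
    dst: str
    delay_ms: float
    jitter_ms: float
    loss_pct: float


# Illustrative limits only; tune them for your own network.
THRESHOLDS = {"delay_ms": 150.0, "jitter_ms": 30.0, "loss_pct": 1.0}


def find_bad_calls(records, thresholds=THRESHOLDS):
    """Return (record, exceeded_metrics) for calls over any limit."""
    bad = []
    for rec in records:
        exceeded = [name for name, limit in thresholds.items()
                    if getattr(rec, name) > limit]
        if exceeded:
            bad.append((rec, exceeded))
    return bad


# Made-up sample records, purely for demonstration.
calls = [
    CallRecord("10.1.1.10", "10.2.2.20", 40.0, 4.0, 0.0),
    CallRecord("10.1.1.11", "10.3.3.30", 45.0, 95.0, 2.5),
]
bad_calls = find_bad_calls(calls)
for rec, exceeded in bad_calls:
    print(f"{rec.src} -> {rec.dst}: high {', '.join(exceeded)}")
```

The source and destination addresses of each flagged call are what I would then use to pick the path for a follow-up synthetic test.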

Now that I have data, either from user traffic or from synthetic tests, how do I use it for an SLA?  Averaging samples is seldom the right thing to do because it will often hide a few really bad data points within the volume of data from all users.   A weighting scheme, or perhaps looking for max values on critical factors, seems like a more useful mechanism because it increases visibility into important problem symptoms.

The purpose of an SLA is to measure how the overall network is operating.  With this in mind, I think that I’d like a multi-valued SLA that shows the overall average, the max value out of the collected data, and the average of the Top-10 values.  The Top-10 Average would tell me roughly how far the worst data points are from the average.  Let’s look at an example.  I have 498 data points of jitter that have the following characteristics:

  • Max = 314
  • Average = 7.6
  • Std Deviation = 31
  • Top-10 Average = 178

The max tells me that there is one phone call that had very bad jitter.  The average tells me that the overall network jitter is within acceptable limits.  But the Top-10 average tells me that there is a bad problem somewhere in the network that is causing exceptionally high jitter and that it is affecting multiple calls.  The standard deviation tells me the same thing in a slightly different way. If both the Top-10 Average and Standard Deviation were much closer to the average, then that would indicate a smaller problem.  Without looking at the individual data samples, I can’t tell from the SLA if it is one station or multiple stations, but I do get an indication of something that is worth investigating.
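A minimal sketch of this multi-valued SLA calculation is shown here. The function name, the dictionary keys, and the choice of a population standard deviation are my assumptions, and the sample data is made up for the demonstration; it is not the 498-point data set described above.

```python
from statistics import mean, pstdev


def sla_summary(samples, top_n=10):
    """Summarize one metric (for example jitter) as a multi-valued SLA."""
    if not samples:
        raise ValueError("no samples to summarize")
    # The worst (largest) top_n values, used for the Top-N average.
    worst = sorted(samples, reverse=True)[:top_n]
    return {
        "count": len(samples),
        "average": mean(samples),
        "max": max(samples),
        "std_dev": pstdev(samples),  # population std deviation (assumption)
        "top_n_average": mean(worst),
    }


# Made-up data: 95 quiet calls and 5 calls with heavy jitter.
jitter = [5] * 95 + [200, 220, 250, 280, 300]
summary = sla_summary(jitter)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Because the function only needs a list of numbers, the same calculation can be applied unchanged to delay, loss, or any other metric.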

While thinking through this problem, I considered using analysis specific to jitter, but that would give up the benefit of a simple SLA algorithm that can be applied to a variety of data.  By performing this type of SLA calculation on a number of critical network characteristics, then applying a notification filter, we can build a pretty interesting alerting system that lets us know there is a network problem without having to look through vast volumes of data.
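One way a notification filter might look is sketched here. The function name, the dictionary layout, and the limit values are hypothetical choices of mine; the jitter figures are the ones from the example above.

```python
def sla_alerts(summaries, limits):
    """Return alert messages for metrics whose SLA values cross a limit."""
    alerts = []
    for metric, summary in summaries.items():
        limit = limits.get(metric)
        if limit is None:
            continue  # no limits configured for this metric
        if summary["top_n_average"] > limit["top_n_average"]:
            # Several bad samples: likely a problem affecting multiple calls.
            alerts.append(f"{metric}: top-N average "
                          f"{summary['top_n_average']} exceeds "
                          f"{limit['top_n_average']}")
        elif summary["max"] > limit["max"]:
            # Only the single worst sample is over its limit.
            alerts.append(f"{metric}: max {summary['max']} exceeds "
                          f"{limit['max']}")
    return alerts


# Jitter figures from the example above; the limits are illustrative.
summaries = {"jitter": {"average": 7.6, "max": 314, "top_n_average": 178}}
limits = {"jitter": {"max": 150, "top_n_average": 50}}

alerts = sla_alerts(summaries, limits)
for message in alerts:
    print(message)
```

Checking the Top-N average before the max is a deliberate choice in this sketch: a high Top-N average suggests a problem that touches multiple calls, which I would want reported ahead of a single outlier.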

Combine the above methodology with the network uptime and reachability SLAs I previously described and there’s the basis for a nice network SLA dashboard.  Use charts instead of numbers to make trends visible and you have a dashboard that the execs will love and that makes sense to the networking staff.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html
