Network SLAs – Which one to use?

Terry Slattery
Principal Architect

Scott Hogg just did a post on his Network World Blog that mentioned SLAs (Service Level Agreements).  It is a very timely article for me because I’m working with a customer who wants to define network SLAs.  The task becomes one of selecting an appropriate SLA for the organization.

Good SLAs share several basic characteristics:

  1. They measure key network parameters that are important to the organization.
  2. The data needed to compute the SLA can be collected automatically with the NMS tools already in place.
  3. They are understood by the people who manage the network.

Let’s look at an example.

Scott suggested an SLA that measures network downtime.  An SLA of five nines (99.999% availability) allows at most about 5.26 minutes of downtime per year.  But how is such an SLA generated?  I can think of several measurement methodologies that produce very different figures.  Let’s use a sample network that has two data centers and twenty remote sites.
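The arithmetic behind such figures is straightforward: a year contains 525,600 minutes, and the permitted downtime is the period length times the unavailability.  A quick sketch in Python (the function name is mine, not a standard tool):

```python
def allowed_downtime_minutes(availability_pct, period_minutes=365 * 24 * 60):
    """Minutes of downtime permitted per period at a given availability."""
    return period_minutes * (1 - availability_pct / 100)

# Five nines allows roughly 5.26 minutes of downtime per year,
# while three nines allows nearly nine hours.
print(round(allowed_downtime_minutes(99.999), 2))
print(round(allowed_downtime_minutes(99.9), 1))
```

Running the same calculation against three or four nines quickly shows how steep the climb to each additional nine really is.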

In the simple case, all network downtime is accumulated, even if it affects only a portion of the network.  A small remote branch outage is counted the same as an outage that takes out one of the data centers.  Another methodology averages downtime across sites, with the result that the failure of one remote branch has less of an impact on the overall SLA value.  Suddenly, the SLA metric is vastly improved simply by modifying the calculation that’s used.  Finally, a third methodology would measure average device downtime.  Since the failure of a remote site is typically due to one device or link failure, averaging downtime across all devices produces an even smaller downtime figure – and therefore a higher availability number – than the previous two methods.
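To make the difference concrete, here is a rough sketch of the three calculations for the sample network.  The outage log, site count, and device count below are invented for illustration; each outage is assumed to be caused by a single failed device or link:

```python
MINUTES_PER_YEAR = 365 * 24 * 60
SITES = 22       # 2 data centers + 20 remote sites
DEVICES = 50     # assumed total device count, for illustration only

# Hypothetical outage log for one year: (site, outage_minutes)
outages = [("branch-07", 120),   # one remote branch down for two hours
           ("dc-1", 10)]         # a brief data-center outage

total_down = sum(minutes for _, minutes in outages)

# Method 1: any outage anywhere counts as network downtime.
sla_network = 1 - total_down / MINUTES_PER_YEAR

# Method 2: average availability across sites; the 20 sites with
# no outages pull the average up.
sla_site_avg = 1 - total_down / (SITES * MINUTES_PER_YEAR)

# Method 3: average availability across devices; with one failed
# device per outage, the divisor grows again.
sla_device_avg = 1 - total_down / (DEVICES * MINUTES_PER_YEAR)

for name, value in [("network", sla_network),
                    ("site average", sla_site_avg),
                    ("device average", sla_device_avg)]:
    print(f"{name}: {value:.5%}")
```

The same 130 minutes of raw downtime yields three progressively better-looking availability numbers, which is exactly the effect described above.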

Which methodology is best for an organization?  It depends on the business.  You want the measurement to reflect the factors that have the greatest impact on the business.  If all the remote sites have to be connected to a data center at all times, the first methodology is best.  The second methodology is good if the overall average remote site availability is more important than having every site up all the time.  Organizations that use site availability averages often have the capability for a remote site to run in detached mode for short periods of time – until the link to the data centers can be repaired.  The third calculation might be used by an organization that has critical end systems that need high availability of their attached network devices.

Of course, there are many more SLA calculation methods and sources of data.  In a VoIP network, it would be useful to incorporate delay, jitter, and packet loss statistics into an SLA.  The result might have to be a multi-faceted SLA in which there are several reported figures, each of which focuses on how the network supports a specific part of the business.
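As a sketch of what one facet of such a multi-faceted SLA report could look like, the thresholds below (150 ms one-way delay, 30 ms jitter, 1% loss) are common VoIP rules of thumb rather than figures from this article, and the sample measurements are invented:

```python
# Assumed VoIP targets; tune these to the organization's requirements.
VOIP_TARGETS = {"delay_ms": 150, "jitter_ms": 30, "loss_pct": 1.0}

def voip_sla_report(samples):
    """Percentage of measurement samples meeting each target."""
    report = {}
    for metric, limit in VOIP_TARGETS.items():
        ok = sum(1 for s in samples if s[metric] <= limit)
        report[metric] = 100.0 * ok / len(samples)
    return report

# Hypothetical per-interval measurements, e.g. from IP SLA probes.
samples = [
    {"delay_ms": 40,  "jitter_ms": 5,  "loss_pct": 0.1},
    {"delay_ms": 180, "jitter_ms": 12, "loss_pct": 0.2},
    {"delay_ms": 60,  "jitter_ms": 35, "loss_pct": 0.0},
    {"delay_ms": 55,  "jitter_ms": 8,  "loss_pct": 2.5},
]
print(voip_sla_report(samples))
```

Reporting each metric separately, rather than collapsing them into a single number, keeps the figure that matters to a given part of the business visible.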



Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog.
