Rolling Up Performance Data

Author
Terry Slattery
Principal Architect

Almost all network management products that collect performance data incorporate some type of rollup as the data ages. In most cases, data over longer time intervals is averaged, and the resulting average value is used in the performance graphs. I find that the average value is seldom what I want. Particularly bad examples are where the data is averaged over several hours or over a day. Almost all links have long periods of low utilization, which drags the average down and hides the busy periods in the resulting data (and in the graphs we tend to use to show it).
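
To make the problem concrete, here is a minimal sketch (plain Python, with made-up numbers) of a day of 5-minute utilization samples for a link that runs hot for two hours and sits nearly idle the rest of the day. The daily average looks harmless; only the peak shows the busy period.

```python
# Hypothetical day of 5-minute utilization samples (288 per day):
# two busy hours near capacity, the other 22 hours nearly idle.
busy = [95.0] * 24    # 2 hours of samples at ~95% utilization
idle = [5.0] * 264    # 22 hours of samples at ~5% utilization
samples = busy + idle

daily_average = sum(samples) / len(samples)
daily_peak = max(samples)

print(f"daily average: {daily_average:.1f}%")  # 12.5% -- looks fine
print(f"daily peak:    {daily_peak:.1f}%")     # 95.0% -- the real story
```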

What do I want? My main interest in link performance data is typically the maximum values. A link running at less than its capacity is seldom a problem, so I'm not likely to be interested in its average value; I'm more concerned with utilization, congestion, and errors. Seeing the peak values lets me tell when a link needs to be upgraded, spot the potential for congestion, and catch bursts of errors. When I'm looking at long-term data, I need to see whether a link has been running near capacity during the busy period and how long it continues to run at that rate. A short burst at link capacity is acceptable, because TCP wants to use the full path bandwidth. But if I see utilization running at near capacity for long periods, I know the link is a candidate for an upgrade to higher bandwidth. At a minimum, I know the link needs QoS enabled so that important traffic gets priority over less important traffic.
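
One rough way to separate an acceptable burst from a sustained near-capacity period is to look at how long utilization stays above a threshold. The sketch below assumes 5-minute samples and a 90% threshold; both numbers are my own arbitrary choices, not anything a particular NMS uses.

```python
def longest_high_run(samples, threshold=90.0):
    """Longest run of consecutive samples at or above the threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value >= threshold else 0
        longest = max(longest, current)
    return longest

def is_upgrade_candidate(samples, threshold=90.0, min_samples=12):
    """With 5-minute samples, 12 consecutive high samples is an hour near capacity."""
    return longest_high_run(samples, threshold) >= min_samples
```

A short TCP burst trips only a sample or two; a link that stays above the threshold for an hour or more shows up as an upgrade (or QoS) candidate.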

Errors are particularly important not to average, because the absolute numbers tend to be small and averaging quickly drives them toward zero unless the link is particularly bad. I recently saw a good example of this at a customer with a set of fiber links, one of which was showing about 50-100 errors per day. That's not a large number and wouldn't have a detrimental effect on flows taking that path, but it stood out because all the comparable links were running with zero errors. Replacing the patch cable on the affected interface eliminated the errors. (We're not sure whether it was the patch cable or dust/dirt on the ends of the fiber.)
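
With made-up numbers, the flattening looks like this: roughly 75 errors in one 15-minute window all but vanish when averaged across a day of 5-minute samples, but stay obvious as a daily total or peak.

```python
# Hypothetical day of 5-minute error-counter deltas: one short burst,
# zero errors the rest of the day.
errors = [25, 30, 20] + [0] * 285   # ~75 errors in a 15-minute window

avg_per_sample = sum(errors) / len(errors)
print(f"average per sample: {avg_per_sample:.2f}")  # ~0.26 -- graphs as zero
print(f"daily total:        {sum(errors)}")         # 75 -- worth a look
print(f"daily peak:         {max(errors)}")         # 30 in one interval
```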

My recommendation to NMS vendors is to store peak or 95th percentile values for utilization data, and peak values for discards and errors. It should be easy to add an option to historical displays that selects either average or peak data.
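
As a minimal sketch of that rollup option (the bucket size and nearest-rank percentile method here are my own assumptions, not any vendor's implementation), the function below reduces raw samples to average, peak, and 95th percentile values per rollup interval.

```python
import math

def rollup(samples, bucket_size):
    """Roll raw samples up into (average, peak, 95th percentile) per bucket."""
    buckets = []
    for i in range(0, len(samples), bucket_size):
        bucket = sorted(samples[i:i + bucket_size])
        avg = sum(bucket) / len(bucket)
        peak = bucket[-1]
        p95 = bucket[max(0, math.ceil(0.95 * len(bucket)) - 1)]  # nearest-rank
        buckets.append((avg, peak, p95))
    return buckets

# Example: roll 5-minute samples into hourly buckets (12 samples per bucket).
print(rollup([5.0] * 11 + [95.0], bucket_size=12))
# [(12.5, 95.0, 95.0)] -- peak and 95th percentile keep the burst visible
```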

Regarding the storage of performance data, the ideal is to keep the raw data for historical analysis. Yes, the volume can be quite large, but the network operations team sometimes needs that data for trend analysis or to see whether a particular pattern has existed for a long time. Storing the data on a separate NAS makes a lot of sense, allowing me to archive as much as I like. I can then go back to the data to answer the following questions:

  • How long has the link utilization been exhibiting some characteristic? (like nearly 100% load, or perhaps nearly 0% load)
  • What has been the typical network utilization during March Madness Basketball season? (or pick some other reason for people at work to be watching streaming video)
  • When did errors or discards start happening on a link?
  • How frequently are errors or discards occurring on a link? Do the times correlate with other data?
  • When will a link need to be upgraded (trend analysis)?
  • Did the QoS implementation reduce drops in high priority traffic classes?

The NMS should provide a mechanism that lets me answer these questions easily.
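
For the trend-analysis question in the list above, here is a rough sketch of the kind of answer I want to pull out of archived raw data: fit a straight line to busy-hour peak utilization and project when it crosses a capacity threshold. The weekly peaks and the 90% threshold below are made up for illustration.

```python
# Made-up weekly busy-hour peak utilization (%); a real analysis would read
# these from the archived raw data.
weeks = list(range(10))
peaks = [52, 55, 57, 61, 63, 66, 70, 72, 76, 79]

n = len(weeks)
mean_x = sum(weeks) / n
mean_y = sum(peaks) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, peaks))
         / sum((x - mean_x) ** 2 for x in weeks))
intercept = mean_y - slope * mean_x

threshold = 90.0                                  # plan the upgrade before this
crossing_week = (threshold - intercept) / slope   # week at which the trend hits it
weeks_remaining = crossing_week - weeks[-1]

print(f"growth: {slope:.1f}% per week; about {weeks_remaining:.0f} weeks "
      f"until busy-hour peaks reach {threshold}%")
```
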
-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

