Network Management’s Unfulfilled Promise

Author
Terry Slattery
Principal Architect

I’ve been participating on several network management product forums for the past six months and have some observations that I’d like to share.

First, most organizations are focused on network performance monitoring.  The forum questions tend to be about monitoring network link performance and server parameters (CPU, disk, memory).  While these can be important metrics, they seem to get an overwhelming share of the attention.  Interestingly, it is seldom performance that causes a total network outage.

I think that performance is easy to understand at one level, so that’s where we start trying to use an NMS product.  However, properly reporting performance metrics is not simple.  Managers ask about link utilization and typically don’t understand metrics such as the 95th percentile.  They want to know “What’s the link utilization?”  Well, while a packet is being transmitted, the link is at 100% utilization; when no packet is being transmitted, it is at 0%.  Data transmission is bursty, so the key is picking an appropriate averaging period and reporting the average utilization for that period.

If I pick a 5-minute period, I get a very different utilization figure than if I pick a 10- or 15-minute period.  Consider a 5-minute transfer at 97% utilization, followed by 10 minutes of 0% utilization.  A single 15-minute sample averages the burst down to about 32%, while 5-minute samples preserve the 97% reading, so the 95th percentile for the day climbs accordingly.  Averaging these samples over a day nearly always produces useless data (I wouldn’t even call it “information”), because the long stretches of low utilization swamp the results.  A lot of NMS products roll up the periodic samples to hourly, and eventually daily, figures, essentially turning useful data into useless data.
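
To make the arithmetic concrete, here is a quick Python sketch of that example (the numbers are just the ones from the paragraph above, not data from any particular NMS):

```python
# A 5-minute transfer at 97% utilization followed by 10 idle minutes,
# seen as three 5-minute samples versus one 15-minute average.
five_min_samples = [97.0, 0.0, 0.0]                      # percent utilization
fifteen_min_avg = sum(five_min_samples) / len(five_min_samples)

print(f"5-minute samples : {five_min_samples}")
print(f"15-minute average: {fifteen_min_avg:.1f}%")      # ~32% -- the burst is diluted
```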

We should get smart enough to demand that vendors provide 95th percentile figures for each day instead of averages.  Keeping those daily figures would allow us to do better forecasting and trending.  If this happens, we’ll have to educate the managers about what the new figures represent.  I like to explain the daily 95th percentile figure as a rough approximation of the minimum “busy hour” utilization of a day (the top 5% of a 1,440-minute day works out to about 72 minutes).  Stated another way, if the link had a 95th percentile utilization figure of 65% for a given day, then the link was at least 65% busy for about 72 minutes of the day.  The accuracy of the calculation varies slightly as the polling period changes.
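
Here is a minimal sketch of the calculation, using the nearest-rank method on a made-up day of 5-minute samples (the sample values are invented for illustration; an NMS would supply real ones):

```python
import math

def percentile_95(samples):
    """Nearest-rank 95th percentile of a list of utilization samples."""
    ordered = sorted(samples)
    rank = math.ceil(0.95 * len(ordered))   # 95% of the samples fall at or below this rank
    return ordered[rank - 1]

# A made-up day of 288 five-minute samples: light load, a long busy period,
# and a short spike.  The top 5% of a day is about 14 samples, or ~72 minutes.
day = [10.0] * 250 + [65.0] * 30 + [90.0] * 8
print(f"95th percentile: {percentile_95(day):.0f}%")   # 65% -- busy at least ~72 minutes
```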

Try to get the 95th percentile figure into a dashboard or daily report.  You’ll find that it is not very accessible in most network management platforms.  The average and peak figures are always available, but since they get rolled up to longer intervals after a few days, their utility is greatly diminished once the rollup occurs.  I recently spent some time trying to get a rolling 95th percentile chart and finally gave up.  I wanted to chart the 95th percentile figure every hour for the trailing 24 hours: does the figure drop during the quiet parts of the day and rise during the busier parts?  It would be interesting to try different trailing periods too, perhaps the trailing 12 or 8 hours.
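
For what it’s worth, the trailing-24-hour chart is easy to prototype outside the NMS once you can export the raw 5-minute samples.  A rough sketch with pandas (the synthetic series below just stands in for whatever the poller collects):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for three days of 5-minute utilization samples with a
# daily busy period; in practice this series would come from the NMS poller.
idx = pd.date_range("2011-01-01", periods=3 * 288, freq="5min")
busy = (np.sin(np.linspace(0, 6 * np.pi, len(idx))) > 0.6).astype(float)
five_min_util = pd.Series(10 + 55 * busy + 5 * np.random.rand(len(idx)), index=idx)

# 95th percentile over the trailing 24 hours (288 five-minute samples),
# thinned to one point per hour for charting.  The figure should sag overnight
# and climb as the busy period moves into the trailing window.
rolling_p95 = (
    five_min_util.rolling(window=288)
                 .quantile(0.95)
                 .resample("1h")
                 .last()
)
print(rolling_p95.dropna().tail(8))
```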

So we’re focused on performance, but we haven’t really spent the time to understand which performance figures to trust, which to distrust, and why.

But performance is only part of the story.  Most network outages are due to poor network design. For example, I’ve done a number of network assessments where a layer 2 spanning tree was extended between two data centers.  As I’ve written before (Spanning Layer 2 Between Data Centers), this isn’t a good idea.  A spanning-tree loop will take out a major piece of an organization’s network, which is much more serious than poor performance on a link.  Because spanning-tree loop outages are rarer, we tend to ignore the bad design and focus instead on link performance.  The result is that network management tools don’t tend to let us look at the data that would tell us about design problems.

A simpler example, and one that some products are now addressing, is duplex mismatch. A product I’ve been using simply looks at overall errors.  However, a duplex mismatch produces some very specific errors, depending on the interface setting.  If the interface is running half duplex, it will occasionally have a collision.  Some vendors treat this as an error; it is an error of sorts, but it is expected on a half-duplex interface.  What isn’t expected are late collisions, which are often seen on a duplex-mismatched link when the local interface is running half duplex (note: full-duplex links never experience collisions and should never increment this counter).  Late collisions on a half-duplex interface are most frequently an indicator of a duplex mismatch.  (The other causes are a cable that’s too long or a collision domain with too many repeater hops, but these are much less frequent.)

The full-duplex side of a mismatch will experience FCS and CRC errors, because the half-duplex remote end stops transmitting a frame when it detects a collision.  Poor cabling, bad connectors, or dirty optics can also cause these errors, though much less frequently than a duplex mismatch.
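
Expressed as a rule, the signature is easy to check once you can read the interface counters.  A sketch (the counter names follow the usual EtherLike-MIB terminology, but the function and its inputs are my own illustration, not any product’s API):

```python
def duplex_mismatch_check(duplex, late_collisions, fcs_errors, collisions):
    """Return a short diagnosis for one interface based on its error counters."""
    if duplex == "half" and late_collisions > 0:
        # Ordinary collisions are normal on half duplex; late collisions are not.
        return "late collisions on a half-duplex port: probable duplex mismatch"
    if duplex == "full" and (collisions > 0 or fcs_errors > 0):
        # A full-duplex port should never see collisions; FCS/CRC errors here
        # often mean the far end is half duplex and aborting frames mid-transmit.
        return "FCS/CRC errors on a full-duplex port: check the far end's duplex"
    return "no duplex-mismatch signature"

print(duplex_mismatch_check("half", late_collisions=42, fcs_errors=0, collisions=310))
print(duplex_mismatch_check("full", late_collisions=0, fcs_errors=17, collisions=0))
```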

Switching to a more complex routing design problem, the NMS should be able to show me the devices that have a default route statically configured.  I maintain that having too many sources of a default route is a recipe for disaster, such as the one I described last week (Network Configuration Management – Know when it is wrong), in which a default route was accidentally injected into the core of a network, creating a black-hole route.
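
That particular check is simple enough to do yourself while waiting for the vendors.  A sketch that scans saved IOS-style configurations for statically configured default routes (the configs/ directory and .cfg naming are just assumptions for the example):

```python
import re
from pathlib import Path

# Match "ip route 0.0.0.0 0.0.0.0 <next-hop or interface>" lines in each config.
DEFAULT_ROUTE = re.compile(r"^ip route 0\.0\.0\.0 0\.0\.0\.0\s+\S+", re.MULTILINE)

for cfg in sorted(Path("configs").glob("*.cfg")):
    routes = DEFAULT_ROUTE.findall(cfg.read_text())
    if routes:
        print(f"{cfg.stem}: {len(routes)} static default route(s)")
        for line in routes:
            print(f"    {line}")
```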

Similarly, in an OSPF network, show me the areas and the ABRs that interconnect them.  Do I have the desired level of redundancy in the network?  If I’m running an MPLS network, where do the MPLS VPNs extend, and are they carrying only the routes that I want them to have?  These are tough problems to solve in large networks and require some pretty deep thinking to design a useful solution, but the payoff is being able to significantly improve our network designs and network uptime.
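
As a small first step toward that kind of visibility, the OSPF area layout can at least be pulled out of the configurations.  A toy sketch that lists areas per router and flags likely ABRs (it only reads network-statement areas, ignores interface-level “ip ospf … area” commands, and assumes the same configs/ directory as above):

```python
import re
from collections import defaultdict
from pathlib import Path

# Match "network <address> <wildcard> area <area-id>" under "router ospf".
AREA = re.compile(r"^\s*network \S+ \S+ area (\S+)", re.MULTILINE)

areas_by_router = defaultdict(set)
for cfg in sorted(Path("configs").glob("*.cfg")):
    areas_by_router[cfg.stem].update(AREA.findall(cfg.read_text()))

for router, areas in sorted(areas_by_router.items()):
    role = "ABR" if len(areas) > 1 else "internal"
    print(f"{router:20} areas {sorted(areas)} ({role})")
```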

NMS vendors need to start using the data that they collect to identify network design deficiencies.  They won’t have an incentive to take this step until we network engineers begin asking for such features and voting with our purchasing dollars for products that provide visibility into network design problems.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html


