When using NetMRI in consulting engagements, we are often asked which of the NetMRI issues are the most important to track. That’s a relatively easy question to answer, and the answer doesn’t really depend on whether NetMRI is in use. Regardless of the tools, we want to track the same things. Many of the issues in the lists below are obvious, so I’ll skip the explanation for those and elaborate on the items that may not be obvious.
Most networks today support business functions that are critical to the ongoing operation of the business. The first issues to track are environmental, because environmental components are the ones that fail most frequently.
- Power supply failure, including loss of input power on redundant supplies.
- Fan failure, causing a device to overheat.
- High temperature, possibly due to a fan failure, or perhaps due to an HVAC system failure.
- Power supply voltage out of range, which might not cause an overall power supply failure and may itself be due to high temperatures.
We rarely see networks that don’t have some level of redundancy, so it is important to look for failures within the redundant systems. Router redundancy failures and key interface failures are at the top of the list here. (I mentioned these in last week’s blog post, “New Year Resolution: Run a Clean Network,” and include them here for completeness.)
- HSRP, GLBP, VRRP where there is only one router in the redundancy group. An interface could have failed, or the redundant device could have failed. Or the redundant device may not have been installed or properly configured. In any case, your intended redundancy doesn’t exist.
- Router interface down – every router interface should be either up/up or admin down; any other state means a failure has occurred. This implies that any interface that isn’t in use should be shut down.
- Switch trunk ports down – as with router interfaces, trunk ports are often infrastructure interconnections and should be admin down if they are not in use.
- Config not saved. While not specifically a redundancy issue, an unsaved config will be a problem if the device dies or is rebooted, and the result could be an outage until the config is rebuilt. Archiving the current running configuration, and raising a notification that it was not saved to NVRAM, warns you that the device won’t come back up in its current operating state after a reboot.
The router interface and switch trunk port down issues mentioned above are particularly important because they are easy to overlook. Most organizations don’t take the time to shut down an unused interface or to remove an old description, which makes it difficult to tell whether a down interface is simply unused or is the result of a link failure. A key interface failure is then easy to miss, and the outage occurs later (often much later) when the redundant interface also goes down. The best way to manage the network is to shut down every unused router interface and switch trunk port. Then any interface or trunk port found in the up/down state indicates a failure that should be corrected.
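Once unused interfaces are admin down, the check itself is simple. Here is a minimal Python sketch of the idea. The `Interface` class, the `find_failed_interfaces` function, and the sample interface names are all hypothetical, and the two booleans are a simplification of what a device or a management tool actually reports.

```python
from dataclasses import dataclass


@dataclass
class Interface:
    name: str
    admin_up: bool  # False means the interface is administratively shut down
    oper_up: bool   # True means the interface is operationally up


def find_failed_interfaces(interfaces):
    """Return names of interfaces that are admin up but operationally down."""
    return [i.name for i in interfaces if i.admin_up and not i.oper_up]


# Hypothetical sample data, not output from any real device.
sample = [
    Interface("Gi0/0", admin_up=True, oper_up=True),    # up/up: healthy
    Interface("Gi0/1", admin_up=False, oper_up=False),  # admin down: unused
    Interface("Gi0/2", admin_up=True, oper_up=False),   # up/down: failure
]

print(find_failed_interfaces(sample))
```

The filter only works if the admin state is trustworthy. If unused interfaces are left enabled, they show up in the same list as real failures.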
Then we start looking at performance-related issues. Performance is typically where most people start looking at networks, because the tools for measuring network performance have existed for a long time. What’s often not obvious is how to identify high utilization during business hours.
- High 95th percentile utilization, when calculated over a daily period, identifies interfaces that are running at the reported utilization or greater for 72 minutes of the day (5% of the 1,440 minutes in 24 hours). If most of the high utilization time is during prime business hours, it may be affecting business productivity.
- High errors or discards identify interfaces that are having problems, either due to a poorly operating link (errors) or due to network congestion (discards). The impact on the business is lower productivity.
- Duplex mismatch, which typically results in a specific set of network errors. A mismatched interface sees more errors as its utilization rises: errors start to appear at about 10% of link capacity and increase as link utilization approaches 30%. As with other network errors, the result is lower business productivity.
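To make the 95th percentile figure concrete, here is a small Python sketch. It assumes one utilization sample per minute and uses the nearest-rank method; real tools may poll at other intervals and may interpolate differently. The function name `percentile_95` and the sample data are hypothetical.

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile of utilization samples (percent)."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer form of ceil(0.95 * n), which avoids floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical days of 1-minute samples (1,440 per day).
busy_100 = [20.0] * 1340 + [85.0] * 100  # 100 busy minutes at 85%
busy_60 = [20.0] * 1380 + [85.0] * 60    # 60 busy minutes at 85%

print(24 * 60 * 5 // 100)       # minutes in the top 5% of a day
print(percentile_95(busy_100))  # busy period is longer than 72 minutes
print(percentile_95(busy_60))   # busy period is shorter than 72 minutes
```

With 100 busy minutes the busy-period value becomes the day's 95th percentile. With only 60 busy minutes it does not, because the top 72 minutes of the day are discarded.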
Once the above issues are being addressed, configuration consistency is next on the list. To check configuration consistency, the network management system needs tools that let you identify configurations that don’t match your configuration templates. This is more than checking that a config has certain statements. The tools need to handle statements that must appear in a certain order (think ACLs here). They must also be able to identify configurations that contain some statements but not others (e.g., make sure the ACL hasn’t been extended, or make sure that an undesirable routing protocol is not configured).
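The three kinds of checks described above can be sketched in a few lines of Python. This is an illustration of the idea, not how NetMRI or any other product implements it. The function names, the template, and the sample device config are all hypothetical, and the matching is plain string comparison on whole lines.

```python
def normalize(config_lines):
    """Strip whitespace and drop blank lines so comparisons are stable."""
    return [line.strip() for line in config_lines if line.strip()]


def contains_in_order(config_lines, required):
    """True if every required statement appears, in the given order."""
    remaining = iter(normalize(config_lines))
    # `stmt in remaining` consumes the iterator up to the match, so each
    # later statement must be found after the previous one.
    return all(stmt in remaining for stmt in required)


def block_matches(config_lines, prefix, template_block):
    """True if the lines starting with `prefix` equal the template exactly."""
    block = [l for l in normalize(config_lines) if l.startswith(prefix)]
    return block == list(template_block)


def forbidden_statements(config_lines, forbidden_prefixes):
    """Return configured lines that start with any forbidden prefix."""
    prefixes = tuple(forbidden_prefixes)
    return [l for l in normalize(config_lines) if l.startswith(prefixes)]


# Hypothetical template and device config, for illustration only.
template_acl = [
    "access-list 10 permit 10.1.1.0 0.0.0.255",
    "access-list 10 permit 10.1.2.0 0.0.0.255",
    "access-list 10 deny any",
]

device_config = """
hostname edge-1
access-list 10 permit 10.1.1.0 0.0.0.255
access-list 10 permit 10.1.2.0 0.0.0.255
access-list 10 permit 10.9.9.0 0.0.0.255
access-list 10 deny any
router rip
""".splitlines()

print(contains_in_order(device_config, template_acl))
print(block_matches(device_config, "access-list 10 ", template_acl))
print(forbidden_statements(device_config, ["router rip"]))
```

In this sample, the ordered "contains" check passes even though someone has added an extra permit line to the ACL. Only the exact block comparison catches the extension, which is why checking for the presence of statements is not enough on its own.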
With the above checks and alerts, you are well on your way to handling the majority of common network problems, making your network much more stable. In a redundant network, you’ll have the ability to correct most network problems before they cause an outage, and that’s what’s important in a smoothly operating network.
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html