Do you run a network with a high level of redundancy? If so, how susceptible is your network to the “second failure” syndrome? This is where a first failure occurs but goes unnoticed until a second failure takes out the redundant link or node. Basic checks can often detect that an initial failure has occurred, but these checks usually require a bit of “network hygiene” before they work.
Start by checking for router interfaces that are administratively up but operationally down (i.e., in the up/down state). If the network is not maintained in a consistent manner, you may be facing hundreds of interfaces in this state, and a list of 300 or 400 of them looks like a daunting task. You have to check each interface to determine whether it should be administratively down, then go through the change management process (you *do* have a change management process, don’t you?) to configure it down. I’ll bet that you identify one or two interfaces in the up/down state that should be in the up/up state.
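The up/down check itself is simple once you have the interface states in hand. Here is a minimal sketch in Python, assuming the states have already been collected (for example, by polling the IF-MIB `ifAdminStatus` and `ifOperStatus` objects, where 1 means up and 2 means down). The `Interface` record, the `up_down` helper, and the sample device names are hypothetical, made up for illustration; this is not how NetMRI implements its Router Interface Down issue.

```python
from dataclasses import dataclass

# IF-MIB (RFC 2863) integer codes shared by ifAdminStatus and ifOperStatus.
UP, DOWN = 1, 2


@dataclass(frozen=True)
class Interface:
    """Hypothetical record for one polled interface."""
    device: str
    name: str
    admin_status: int
    oper_status: int


def up_down(interfaces):
    """Return interfaces that are administratively up but operationally down.

    Only oper status 2 (down) is matched here; other IF-MIB values such as
    lowerLayerDown(7) are ignored in this sketch.
    """
    return [i for i in interfaces
            if i.admin_status == UP and i.oper_status == DOWN]


# Made-up sample data.
sample = [
    Interface("core-rtr-1", "Gi0/0", UP, UP),
    Interface("core-rtr-1", "Gi0/1", UP, DOWN),
    Interface("edge-rtr-7", "Gi0/2", DOWN, DOWN),
]

for intf in up_down(sample):
    print(f"{intf.device} {intf.name}: up/down")
```

Each interface this returns needs the decision described above: either it should be configured administratively down, or it is a real failure that nobody noticed.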
You can tackle the long list by prioritizing the interfaces into three groups, organized by importance. The critical interface list will likely be much smaller than the overall list, probably by a factor of 10 or more. I recently saw a site where NetMRI was reporting nearly 400 interfaces in the up/down state (the Router Interface Down issue). I was able to identify the critical interfaces by using the Quick Search box (see example using 10.9.10 below) and entering a common device name, reducing the list to around 40 interfaces. That’s a much more manageable list, one that can be tackled in a couple of weeks.
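If your device names follow a convention, the same kind of prioritization can be sketched outside the GUI. The example below is only an illustration: the `prioritize` function, the three group names, and the `core`/`dist` name fragments are assumptions, and matching on a substring of the device name is my stand-in for what the Quick Search box did at that site.

```python
def prioritize(interfaces, critical_tag, important_tag):
    """Split (device, interface) pairs into three groups by device name.

    A device whose name contains critical_tag goes in "critical", one that
    contains important_tag goes in "important", everything else in "other".
    """
    groups = {"critical": [], "important": [], "other": []}
    for device, name in interfaces:
        if critical_tag in device:
            groups["critical"].append((device, name))
        elif important_tag in device:
            groups["important"].append((device, name))
        else:
            groups["other"].append((device, name))
    return groups


# Made-up list of interfaces reported in the up/down state.
reported = [
    ("core-rtr-1", "Gi0/1"),
    ("dist-sw-3", "Te1/1"),
    ("branch-rtr-22", "Se0/0"),
    ("core-rtr-2", "Gi0/4"),
]

groups = prioritize(reported, critical_tag="core", important_tag="dist")
for level in ("critical", "important", "other"):
    print(level, len(groups[level]))
```

Work the critical group first, then the other two.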
But once the critical interfaces are handled, don’t stop there. If you take care of all the interfaces, the analysis of up/down interfaces can be the quick test for whether your redundant network is really redundant, because the most common source of redundancy failures is not noticing the first failure.
Don’t rely on identifying up/down interfaces alone. You may have heard that the most common source of network failures is configuration errors, and this source of errors hit an organization that had a redundant network. An interface was intentionally shut down to aid in troubleshooting a problem. There’s nothing wrong with this action. But the interface was overlooked when the original problem was corrected, so a part of the network was running on a single connection. Some time later, the redundant connection also failed, and a network outage occurred. This configuration error would not have been identified by the NetMRI Router Interface Down issue, because an administratively down interface is not in the up/down state.
A NetMRI job script and a related issue were subsequently created to check all critical interfaces within a device group and report any interfaces in the down/down state. The check really identifies any interface that is not in the ‘up/up’ state, so an interface in the up/down state will appear in two issues.
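The logic of such a check is easy to express. The sketch below is not the NetMRI job script, which isn't shown here; it is a plain Python illustration of the idea, with a hypothetical `not_up_up` function, a hand-maintained set of critical interfaces, and made-up device names. It flags anything on the critical list that is not up/up, including interfaces that are missing from the collected data.

```python
def state_label(admin_up, oper_up):
    """Format a pair of booleans as an admin/oper label such as 'up/down'."""
    return f"{'up' if admin_up else 'down'}/{'up' if oper_up else 'down'}"


def not_up_up(states, critical):
    """Report every critical interface that is not in the up/up state.

    states:   {(device, ifname): (admin_up, oper_up)} from the latest poll
    critical: set of (device, ifname) keys that are expected to be up/up
    Returns {(device, ifname): label}; label is 'missing' if never polled.
    """
    problems = {}
    for key in sorted(critical):
        if key not in states:
            problems[key] = "missing"
            continue
        admin_up, oper_up = states[key]
        if not (admin_up and oper_up):
            problems[key] = state_label(admin_up, oper_up)
    return problems


# Made-up poll results and critical list.
states = {
    ("core-rtr-1", "Gi0/0"): (True, True),
    ("core-rtr-1", "Gi0/1"): (False, False),  # shut down and forgotten
    ("core-rtr-2", "Gi0/0"): (True, False),   # a real first failure
}
critical = set(states) | {("core-rtr-2", "Gi0/1")}

problems = not_up_up(states, critical)
for (device, ifname), label in problems.items():
    print(f"{device} {ifname}: {label}")
```

Because the test is "not up/up" rather than "down/down", the forgotten shutdown and the ordinary up/down failure both land in the same report.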
The key lesson was that there are some configurations that need to be identified as problems soon after they are deployed. These faulty configurations are specific to the organization’s network, so some customization is needed. In this case, the script could easily be applied to other devices and interfaces, but it would need to be customized to select only those devices and interfaces to which it should be applied.
NetMRI is now protecting this customer’s network from accidental interface shutdown on key interfaces and is freeing their staff from having to periodically check the interfaces using manual procedures, which was the alternative approach.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html