I was at VoiceCon two weeks ago, participating in a panel where I talked about network resiliency and presented my VoIP Troubleshooting and Monitoring tutorial. Both presentations included examples of how you should be prepared for network failures. I’m a proponent of understanding the causes of network problems and being able to quickly diagnose failures by looking at the problems that they cause.
Let’s say that you want to be prepared to identify and react to a spanning tree loop. First, you need to be able to quickly identify that a forwarding loop has formed. Your NMS should show a CPU spike on switches in the STP domain in which the loop exists, due to processing BPDUs that are circulating. Ports that are forwarding looping traffic will report high utilization. A list of typical symptoms exists in the Cisco document “Troubleshooting STP on Catalyst Switches Running Cisco IOS System Software”, Document ID: 28943. Unidirectional links and similar problems are described in “Spanning Tree Protocol Problems and Related Design Considerations”, Document ID: 10556.
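The two symptoms described above, a CPU spike and high port utilization on the same switch, can be checked together against whatever data your NMS exports. The following is a minimal sketch, not a feature of any particular NMS: the `SwitchSample` record, the threshold values, and the sample data are all hypothetical and would need to be adapted to your own baseline.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your own network baseline.
CPU_THRESHOLD = 80.0   # percent CPU
UTIL_THRESHOLD = 90.0  # percent of link capacity


@dataclass
class SwitchSample:
    """One polling sample for a switch (hypothetical record layout)."""
    name: str
    cpu_percent: float
    port_utilization: dict  # port name -> percent utilization


def suspect_loop(samples):
    """Return {switch name: [hot ports]} for switches showing both symptoms."""
    suspects = {}
    for s in samples:
        hot_ports = sorted(
            port for port, util in s.port_utilization.items()
            if util >= UTIL_THRESHOLD
        )
        # Flag a switch only when high CPU and saturated ports coincide.
        if s.cpu_percent >= CPU_THRESHOLD and hot_ports:
            suspects[s.name] = hot_ports
    return suspects


# Made-up sample data for illustration only.
samples = [
    SwitchSample("A", 95.0, {"Gi0/1": 98.0, "Gi0/2": 97.0, "Gi0/3": 12.0}),
    SwitchSample("B", 92.0, {"Gi0/1": 99.0, "Gi0/2": 5.0}),
    SwitchSample("F", 15.0, {"Gi0/1": 20.0}),
]
result = suspect_loop(samples)
print(result)
```

The list of hot ports on each flagged switch gives you a starting point for deciding which links to examine first.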
Links must be shut down or disconnected in order to break the loop. This is where planning will pay off. Examine the image below, taken from the Cisco “Troubleshooting STP” document referenced above. A loop between the ADB switches, the ACB switches, or the AEB switches is easily broken by disconnecting any link in the loop. I would plan to take out the AB link because that would break any of the three loops that I identified. If that doesn’t take care of the loop, then the problem is likely due to a loop induced between VLANs or between two ports in one VLAN. It could be due to a cabling mistake or a dual-homed server with bridging enabled between two interfaces. In this case, you have to be prepared to isolate each switch until you find the combination that contains the loop (it may involve more than one switch).
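The reasoning behind choosing the AB link can be worked out ahead of time for any topology: list the potential loops and find the link that appears in the most of them. Here is a minimal sketch that assumes each loop is written as an ordered list of switch names, using the three loops named above (ADB, ACB, AEB). The function names are hypothetical, and the loop list is taken from the text rather than discovered from a live network.

```python
from collections import Counter


def loop_links(loop):
    """Return the links (unordered switch pairs) that make up one loop."""
    return {
        frozenset((loop[i], loop[(i + 1) % len(loop)]))
        for i in range(len(loop))
    }


def best_links_to_break(loops):
    """Return the link(s) shared by the largest number of loops."""
    counts = Counter()
    for loop in loops:
        counts.update(loop_links(loop))
    if not counts:
        return []
    top = max(counts.values())
    return sorted("".join(sorted(link)) for link, n in counts.items() if n == top)


# The three loops identified in the text.
loops = [["A", "D", "B"], ["A", "C", "B"], ["A", "E", "B"]]
candidates = best_links_to_break(loops)
print(candidates)  # ['AB']
```

In this example AB is the only link common to all three loops, which matches the plan of disconnecting it first.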
Now imagine an STP domain that spans ten or more switches and you have the potential for a time-consuming troubleshooting task if you’re not well prepared. This is one of the reasons why we at NetCraftsmen recommend that failure domains be limited in size.
If the STP loop you’re troubleshooting is serious enough, you’ll not be able to use the network to access the switches. Someone will need to physically unplug the network connections. Having the connections clearly identified, through cable colors, labels, and interface descriptions, will make your troubleshooting go faster. And be prepared to properly reconnect the links if you’ve had to physically disconnect the cables. It doesn’t help if you quickly unplug three infrastructure links and then puzzle over which cables connect to which ports on the switch.
Now think about other common problems and how you’ll tackle the troubleshooting tasks to quickly identify the source of the problem. If your network uses a large number of static routes, be prepared to handle a routing loop where the interaction between a static route and the dynamic routing protocol creates a loop.
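One way to rehearse the routing-loop case is to trace the next hop for a single destination from router to router and see whether the path ever returns to a router it has already visited. The sketch below assumes you have already collected, for one destination prefix, the next hop chosen by each router. The router names and the `next_hop` table are hypothetical, and the function does not query real devices.

```python
def find_routing_loop(next_hop, start):
    """Follow next hops from start; return the looping path, or None."""
    path = []
    seen = {}  # router -> position in path
    node = start
    while node is not None:
        if node in seen:
            # Revisited a router: the loop runs from its first visit to here.
            return path[seen[node]:] + [node]
        seen[node] = len(path)
        path.append(node)
        node = next_hop.get(node)  # None means the trace ends here
    return None


# Hypothetical example: R1 has a static route pointing to R2, while the
# dynamic routing protocol leads R2 to R3 and R3 back to R1.
next_hop = {"R1": "R2", "R2": "R3", "R3": "R1"}
loop = find_routing_loop(next_hop, "R1")
print(loop)  # ['R1', 'R2', 'R3', 'R1']
```

The same walk is what you do by hand with traceroute and the routing table on each hop, so having the per-router next hops documented in advance shortens the search.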
In a network supporting VoIP, you should understand the process used by phones to power up, register, and operate. You can use the OSI model to segregate problems into the physical, data link, network, and application layers. Knowing the types of problems at each layer allows you to quickly identify a few troubleshooting tasks to perform to identify the source of a problem. An example is one-way audio; think about how you would diagnose its cause and how you might fix it.
How can you be prepared? You need to know where and how you’ll tackle specific problems. What diagnostic tools do you need and are they in the appropriate locations? Do you know the actions that you need to take to isolate problems or the diagnosis that you need to perform to gather enough information to characterize and identify the source of a problem?
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html