I recently did a Packet Pushers Podcast with Greg Ferro and Josh O’Brien, talking about the common problems that we see in operational networks. The list of topics comes from a poster, "The Top 25 Network Problems and Their Business Impact," that I developed while at Netcordia (now Infoblox).
- Duplex mismatch. (See my prior article titled Auto-negotiate Duplex or not?) I’m seeing this over and over again at customer sites. Someone even left a comment that he has seen the problem with Tandberg video conferencing gear connected to Cisco equipment. So figure out the specific cases that don’t work and document them. Give each of those switch ports a good description that says why the port is hard-coded. Then use auto-negotiation everywhere else. These days, hard-coded ports should be the exception, not the driving factor in the configuration of all ports. At sites that have decided to hard-code all ports, there are constant problems with duplex mismatches on edge ports. One site had about 15 ports that were logging more than 1,000,000 errors per day (yes, that’s 1M errors per day), and about 200 ports logging over 100,000 errors per day. The systems connected to those ports aren’t getting very good service from the network.
- Large VLANs (Spanning Tree Protocol (STP) domains). (See my prior article titled Spanning Tree Protocol and Failure Domains.) Building out a large Layer 2 network is pretty simple. Troubleshooting it when a loop forms isn’t. When you have an STP loop and the CPU in your switches is running at 100%, it is impossible to log in to those devices to disable ports. So you have to physically go to the switches and start disabling ports or disconnecting cables. You need a good network diagram in order to know which connections have the greatest likelihood of breaking the loop. Once you break the loop, you need to find the connection that created it, remove that connection, then reconnect the parts of the network that you had to disconnect. It doesn’t make for a very resilient network. Dividing the network into smaller STP domains, each with its own IP subnet, limits the impact of a spanning tree loop to the domain in which it forms. Because there are fewer switches involved, troubleshooting is easier. Ideally, your STP domains are so small that loops are either rare or impossible (because there are no loops to protect against).
- Root bridge undefined or not stable. Within each STP domain, a root bridge is elected. If the bridge priority is left at the default value on all switches, then the lowest MAC address is used to select the root bridge. That’s not a good selection criterion, because the switch with the lowest MAC address tends to be the oldest switch in the STP domain (assuming that you have a single vendor with a common MAC address range). What you should do is identify the key devices in your STP domain and force a root bridge and backup root bridge by setting the bridge priority on each. Using values of 8192 for the root and 16384 for the backup root is common.
- Routers (and L3 switches) that have the default route defined. Here, you need to watch out for a system that has a default route defined and suddenly becomes a source of routing information for the rest of the network. If that router isn’t the right default for forwarding traffic to the Internet (or to the rest of your network), you’ll wind up with a routing black hole for packets destined for the Internet.
- Lack of route summarization. If the Layer 3 topology is big, then route summarization can make the entire network more stable. It hides topology changes inside a summarized region from the routers outside that region. It has the additional benefits of reducing the size of the routing tables, making it easier for network management systems to collect the routes, and making troubleshooting easier because there are fewer routes for the network administrators to check.
- Failure of first-hop redundancy protocols. These are protocols like HSRP (Cisco’s Hot Standby Router Protocol), in which two routers are configured to act together to provide forwarding services to one subnet. If the primary router fails, the backup router takes over the shared virtual IP address and MAC address and continues to forward packets. The workstations and servers on the subnet do not know that a network device failed. But what happens if only a single router is actually working? When it fails, an outage occurs. Knowing that your HSRP configuration is incorrect, or that one router (or its HSRP interface) has failed, allows you to correct the problem before the failure of the second router (or interface), thus avoiding an outage.
- Poor configuration change processes. Back up the network device configurations. Check them for correctness. Do the configurations match your corporate (or network) policies? Do you know for certain that all the routers and switches have their login security properly configured? Have you verified the access lists that restrict SNMP access to your network equipment? These checks are all easily done with good network management tools.
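To make the root bridge item in the list above concrete, here is a minimal Python sketch of the election rule: the bridge with the numerically lowest (priority, MAC address) pair wins. The switch names and MAC addresses are made up for illustration, 32768 is the standard default priority, and the sketch ignores per-VLAN details such as the extended system ID.

```python
from dataclasses import dataclass

DEFAULT_PRIORITY = 32768  # standard default bridge priority


@dataclass(frozen=True)
class Bridge:
    name: str                        # hypothetical switch name
    mac: str                         # hypothetical MAC address
    priority: int = DEFAULT_PRIORITY

    def bridge_id(self):
        # Lower priority wins; the MAC address only breaks ties.
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges):
    """Return the bridge with the numerically lowest bridge ID."""
    return min(bridges, key=lambda b: b.bridge_id())


# All priorities left at the default: the lowest MAC address wins,
# regardless of where that switch sits in the topology.
defaults = [
    Bridge("core-1", "00:1e:aa:00:00:01"),
    Bridge("core-2", "00:1e:aa:00:00:02"),
    Bridge("old-closet-switch", "00:0a:aa:00:00:09"),
]
root_by_default = elect_root(defaults)

# Priorities set explicitly: the root and backup root are deliberate choices.
tuned = [
    Bridge("core-1", "00:1e:aa:00:00:01", priority=8192),
    Bridge("core-2", "00:1e:aa:00:00:02", priority=16384),
    Bridge("old-closet-switch", "00:0a:aa:00:00:09"),
]
root_tuned = elect_root(tuned)
backup_root = elect_root([b for b in tuned if b != root_tuned])

print(root_by_default.name, root_tuned.name, backup_root.name)
```

With defaults everywhere, the old closet switch becomes the root simply because its MAC address is lowest. Once the priorities are set, core-1 is the root and core-2 takes over if core-1 disappears.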
The above are just a few of the problems that we see on a regular basis in real networks. NetMRI identifies these problems and more. I use NetMRI to automate the collection of network configurations and to check them for correctness against an organization’s configuration policies. It is a real time-saver.
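The idea of checking configurations against a policy can be sketched in a few lines of Python. This is only an illustration of the concept, not how NetMRI works: the policy rules, the regular expressions, and the IOS-style sample configuration are all assumptions that I made up for the example.

```python
import re

# Hypothetical policy: (description, regex that at least one config line must match).
POLICY = [
    ("passwords are encrypted", r"^service password-encryption$"),
    ("SNMP community is restricted by an access list",
     r"^snmp-server community \S+ (RO|RW) \S+$"),
    ("VTY lines allow SSH only", r"^\s*transport input ssh$"),
]


def check_config(config_text, policy=POLICY):
    """Return the descriptions of policy rules that no config line satisfies."""
    lines = config_text.splitlines()
    return [
        description
        for description, pattern in policy
        if not any(re.search(pattern, line) for line in lines)
    ]


# Illustrative configuration fragment: the SNMP community has no access list.
SAMPLE = """\
service password-encryption
snmp-server community example RO
line vty 0 4
 transport input ssh
"""

violations = check_config(SAMPLE)
print(violations)
```

A real policy would need many more rules and would have to understand configuration context (which interface or line a command belongs to), but even simple pattern checks run against every backed-up configuration catch a surprising number of problems.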
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html