Have you ever used the OSI model to aid your troubleshooting? I’ve been able to use it to help me isolate the causes of problems and focus my troubleshooting to solve problems quickly.
Many years ago I encountered a problem that has become a good interview question. I was at a prospect’s site and they were having connectivity problems at site A (see the network diagram below). A CSU/DSU had died at site C on the 1.1.1.1 interface a few days earlier, and after replacing it they had not been able to get the link working. We were at site A (1.1.1.2).
I had the engineer working on it do a ‘show interfaces’, which showed the interface as up/up. Since the link was showing an operational up status, I knew that it was passing HDLC keepalives. So both the physical and data link layers were working, and the problem had to be at Layer 3. Pings to the next-hop router failed, which fit that conclusion. The next step was to determine why.
I had the customer enable ‘debug ip packet’ on the interface. Sure enough, a packet soon arrived and the debug output showed that the source address was 2.2.2.1. No wonder pings didn’t work. The other end of the link was in a different subnet! How did the packet originate from 2.2.2.1? Well, in the haste of replacing the CSU/DSU, the technician had unplugged the links for both site A and site B, probably because they were not well labeled. The CSU/DSU was replaced, but then the technician had to reconnect both CSU/DSUs to the right interface connectors. Without labeling, he had a 50% chance of getting it wrong. Sure enough, he connected them backwards, so neither site A nor site B was able to pass traffic.
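The Layer 3 check I did in my head can be sketched in a few lines of Python with the standard ipaddress module. This is only an illustration: the function name is made up, and the /30 prefix length is an assumption, since the actual mask on the link is not recorded here.

```python
import ipaddress

def same_subnet(local_ip, remote_ip, prefix_len):
    """Return True if remote_ip falls inside the subnet of the local interface.

    prefix_len is an assumption for this example; use the mask that is
    actually configured on the interface.
    """
    network = ipaddress.ip_interface(f"{local_ip}/{prefix_len}").network
    return ipaddress.ip_address(remote_ip) in network

# Site A is 1.1.1.2; the far end should have been 1.1.1.1,
# but the debug output showed packets arriving from 2.2.2.1.
print(same_subnet("1.1.1.2", "1.1.1.1", 30))
print(same_subnet("1.1.1.2", "2.2.2.1", 30))
```

The first check passes and the second fails, which is exactly the mismatch the debug output exposed: the neighbor on the wire was not in the subnet the interface expected.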
Using OSI layering allowed me to quickly identify that the problem was at Layer 3 and focus my troubleshooting at that layer.
This brings me to a more recent example, which has yet to be finally resolved. I was reviewing the issues that NetMRI identified at a customer site and found a router that was reporting over 50,000 TTL Exceeded messages in a day. A few hundred TTL Exceeded messages might be the result of traceroute tests, particularly if automated traceroutes are being run. But tens of thousands is certainly a sign that something has created a routing loop. Strangely, no users had complained.
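To see why a routing loop turns into TTL Exceeded messages, here is a toy model in Python. It is a sketch, not a router implementation: the router names R1 and R2, the next_hop table, and the forward function are all hypothetical, and a next hop of None simply stands for "this router delivers the packet".

```python
def forward(next_hop, start, ttl):
    """Follow next hops until the packet is delivered or its TTL expires.

    next_hop maps a router name to the next router, or to None when that
    router can deliver the packet itself (a simplification).
    Returns (outcome, router, hops_taken).
    """
    router = start
    hops = 0
    while True:
        if next_hop[router] is None:
            return ("delivered", router, hops)
        # A router decrements the TTL before forwarding. When it reaches
        # zero the packet is dropped and ICMP TTL Exceeded is sent back.
        ttl -= 1
        if ttl == 0:
            return ("ttl_exceeded", router, hops)
        router = next_hop[router]
        hops += 1

healthy = {"R1": "R2", "R2": None}
looped = {"R1": "R2", "R2": "R1"}   # each router points at the other

print(forward(healthy, "R1", 64))
print(forward(looped, "R1", 64))
```

In the looped case every packet for the affected destination bounces between the two routers until its TTL runs out, so each one produces a TTL Exceeded message. That is how the counter can climb into the tens of thousands.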
I enabled ‘debug ip icmp’ to see what IP addresses were involved. Debugging ICMP is typically not a big deal because these messages tend to be low volume. I also minimized the logging load (‘no logging console’ and ‘logging buffered’) to reduce the impact of debugging on the production router.
Note to MIB developers: if you add a counter for packet errors like TTL Exceeded or Destination Unreachable or port unreachable, please also create a set of variables to keep the values of important information like addresses, protocols, and port numbers so that the management station can report the systems that have been causing the problem. Just keeping the last value would be very valuable. Even better would be a small round-robin cache of the values from the last 4 or 8 packets.
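The idea is simple enough to sketch in Python. This is not a real MIB definition or an SNMP agent; the class name, field names, and the addresses in the example are hypothetical and only show a counter paired with a small round-robin cache.

```python
from collections import deque

class ErrorCounter:
    """A packet-error counter that also remembers the last few offenders."""

    def __init__(self, cache_size=8):
        self.count = 0
        # A deque with maxlen drops the oldest entry automatically,
        # which gives round-robin behaviour with no extra bookkeeping.
        self.recent = deque(maxlen=cache_size)

    def record(self, src, dst, protocol, port=None):
        """Count one error and keep the packet details for later reporting."""
        self.count += 1
        self.recent.append((src, dst, protocol, port))

# Example addresses below are made up for illustration.
ttl_exceeded = ErrorCounter(cache_size=4)
for i in range(1, 7):
    ttl_exceeded.record(f"10.0.0.{i}", "10.9.9.9", "udp", 33434)

print(ttl_exceeded.count)         # total errors seen
print(list(ttl_exceeded.recent))  # details of the 4 most recent packets
```

With something like this exposed alongside the counter, a management station could report which systems were involved without anyone having to turn on debugging on a production router.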
One of the network engineers looked at the problem and found that the root bridge of the spanning tree for that subnet was not properly specified. Once the root bridge was corrected, the volume of TTL Exceeded messages went down. Hmmm. Fix a problem at Layer 2 and the Layer 3 problem disappears. That means that a future change in Layer 2 may cause the Layer 3 problem to resurface. So the real Layer 3 problem has not been identified and corrected. I eventually want to go back to this problem and determine the real cause and the correct fix. I’m sure it will be interesting and I’ll learn something from it.
Good luck with your next troubleshooting task. Use the OSI model to help solve it faster.
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html