Bruce Enders of Chesapeake NetCraftsmen recounts a story that demonstrates why configuration repositories and configuration change monitoring are valuable. A NetCraftsmen customer called on a Saturday morning to ask about a network problem. Bruce gave them some free advice, and the customer tried to resolve the problem on their own. On Sunday afternoon they called back, asking for onsite assistance. The problem was that a dual SAN implementation was no longer reachable from its main server. The two SAN units (I’ll refer to them as SAN-A and SAN-B) were at different locations (perhaps to facilitate off-site data storage and disaster recovery). Upon arriving at the customer site, Bruce performed basic troubleshooting with ‘ping’ on SAN-A. With no response, he checked the configuration of the attached switch and found that the interface descriptions didn’t match the cabling. The cables had been reconnected incorrectly during Saturday’s troubleshooting.
Lesson #1: Part of configuration is connectivity information. Knowing which devices are connected via which interfaces is as important as the individual device configurations.
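One way to act on Lesson #1 is to keep the intended cabling as data and compare it with what the switch actually sees. The following is a minimal sketch, not a description of any particular tool: the port names, device names, and the function find_cabling_mismatches are all hypothetical, and it assumes you already have some way to learn the neighbor on each port (for example, from CDP or LLDP output).

```python
def find_cabling_mismatches(documented, observed):
    """Return {port: (expected, actual)} for every port whose neighbor differs.

    Both arguments map a switch port name to the device attached to it.
    A port missing from `observed` is reported with an actual value of None.
    """
    mismatches = {}
    for port, expected in documented.items():
        actual = observed.get(port)
        if actual != expected:
            mismatches[port] = (expected, actual)
    return mismatches


# Hypothetical data: two cables swapped during troubleshooting.
documented = {"Gi0/1": "SAN-A", "Gi0/2": "server-1"}
observed = {"Gi0/1": "server-1", "Gi0/2": "SAN-A"}

for port, (expected, actual) in find_cabling_mismatches(documented, observed).items():
    print(f"{port}: documented {expected}, found {actual}")
```

Run against the documented connections after any physical work, a check like this would flag swapped cables before anyone starts changing device configurations.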
With the cabling corrected, Bruce found that the customer had reset SAN-A’s configuration to factory defaults, hoping that would give them something they could work from. The customer had no record of the configuration and didn’t know how to reconfigure it. A configuration repository would have at least provided the text of the configuration, which could then have been reloaded, even if it was one line at a time.
Lesson #2: Keep configurations of all important or critical equipment and make sure you grab a copy before any reset operation or modification of the currently loaded configuration.
Switching to SAN-B, Bruce found that it couldn’t communicate with the server either. In reviewing the configuration of SAN-B’s switch, he discovered a VLAN ACL that contained one entry: ‘deny any any’! Of course, the customer had no idea how or when that entry was created and applied to the switch ports for SAN-B. Two questions needed answers here. First, when was the entry made? Was it the reason SAN connectivity stopped, or had it been there a long time, meaning SAN redundancy never really existed? Second, who created this configuration mistake (so they can learn not to repeat it)? Maybe the ACL was supposed to be applied to a different set of interfaces and another problem now also exists. I don’t know all the details, but I know that having an automated configuration repository would have clarified the origin of the incorrect ACL.
Lesson #3: Some lessons are important enough to repeat, so I’ll say it again: keep important equipment configurations in a repository. An automated repository that grabs configurations after a change (or periodically if there’s no alerting on a config change) is even better.
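The repository idea in Lessons #2 and #3 can be sketched in a few lines. This is an illustration under stated assumptions, not a real product: the function save_if_changed, the directory layout, and the device name are hypothetical, and fetching the configuration text from the device is left out entirely. The sketch stores a new snapshot only when the text differs from the most recent one, so it can be run periodically or after a change alert.

```python
import tempfile
import time
from pathlib import Path


def save_if_changed(repo_dir, device, config_text):
    """Store config_text as a new snapshot unless it matches the latest one.

    Returns the Path of the new snapshot, or None if nothing changed.
    """
    device_dir = Path(repo_dir) / device
    device_dir.mkdir(parents=True, exist_ok=True)
    # File names start with a zero-padded sequence number, so a plain
    # sort puts the snapshots in the order they were written.
    snapshots = sorted(device_dir.glob("*.cfg"))
    if snapshots and snapshots[-1].read_text(encoding="utf-8") == config_text:
        return None
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime())
    path = device_dir / f"{len(snapshots):06d}-{stamp}.cfg"
    path.write_text(config_text, encoding="utf-8")
    return path


# Demonstration in a throwaway directory with made-up configuration text.
with tempfile.TemporaryDirectory() as repo:
    first = save_if_changed(repo, "san-b-switch", "vlan 10\n")
    again = save_if_changed(repo, "san-b-switch", "vlan 10\n")
    changed = save_if_changed(repo, "san-b-switch", "vlan 10\nvlan 20\n")
    print(first.name, again, changed.name)
```

Because every snapshot carries a timestamp in its name, the sequence of files also gives a rough answer to “when did this change?”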
Lesson #4: Keep configuration change records. When a configuration changes, record who made the change and when. Because over 60% of network problems are due to errors in configurations, a record of configuration changes will likely allow you to quickly back out an incorrect configuration change and minimize network downtime.
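A change record along the lines of Lesson #4 needs only three things: who, when, and what changed. The sketch below uses Python’s standard difflib module to capture the “what”. The ChangeRecord class, the record_change function, the user name, and the sample configuration lines are all hypothetical and chosen only for illustration.

```python
import difflib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class ChangeRecord:
    device: str
    author: str
    timestamp: datetime
    diff: str
    previous_config: str  # kept so the change can be backed out


def record_change(device, author, old_config, new_config):
    """Build a record of who changed a device's configuration, when, and how."""
    diff = "".join(
        difflib.unified_diff(
            old_config.splitlines(keepends=True),
            new_config.splitlines(keepends=True),
            fromfile="before",
            tofile="after",
        )
    )
    return ChangeRecord(device, author, datetime.now(timezone.utc), diff, old_config)


# Made-up example: someone adds a deny-everything ACL entry.
old = "interface Gi0/5\n description SAN-B\n"
new = old + "access-list 101 deny ip any any\n"
rec = record_change("san-b-switch", "jdoe", old, new)
print(rec.author, rec.timestamp.isoformat())
print(rec.diff)
```

With a record like this on file, backing out a bad change means reloading previous_config, and the diff shows exactly which lines to look at first.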
I know what you’re thinking: “That wouldn’t happen to me.” But when you’re in the midst of a critical network outage and people are yelling at you, I’ll bet that you make some silly mistakes. If you have to manually save the configs, you’re less likely to do it during a crisis (call it the “I know what I’m doing” syndrome). An automated method of keeping track of configuration changes can save the day.
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html