I recently read a blog post by Joel Spolsky titled “Making Wrong Code Look Wrong,” in which he described ways that software developers can make errors easier to spot and reduce the potential for bugs. Since I do some development, it was an interesting read. One of the major points was that when you have to modify code, you shouldn’t have to examine the code in a lot of other places to make sure that you’re not creating a bug with your modification.
But then an interesting thing happened that tied Joel’s article into network design. I learned of a network that had a major outage that lasted 30 minutes. Most of the network appeared to be down. The post-mortem analysis showed that it wasn’t really down; a default route had been injected that created a routing black hole. Let’s see, 30 minutes of downtime makes the network availability roughly 99.9943% for the year.
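The availability arithmetic can be checked with a short sketch. It assumes a 365-day year and that the 30-minute outage was the only downtime in that year; the function name is made up for illustration.

```python
# Minutes in a 365-day year (assumption: non-leap year).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability_percent(outage_minutes, period_minutes=MINUTES_PER_YEAR):
    """Return availability as a percentage of the measurement period."""
    return 100.0 * (1.0 - outage_minutes / period_minutes)


# A single 30-minute outage in one year.
print(f"{availability_percent(30):.3f}%")  # prints 99.994%
```

A single half-hour event is already enough to miss a “four and a half nines” target for the year.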
A simple configuration change was the culprit. There were no routes in the configuration change, so what happened? How could the configuration change cause such a massive network failure?
The configuration change extended an MPLS VPN into a new part of the network. The proposed change looked benign. It created a new VLAN and tied the SVI for that VLAN to the MPLS VPN via BGP, extending it to the new location. The problem was that the MPLS VPN contained a default route that wasn’t apparent by inspection of the proposed change. When the VPN was extended, the newly injected default route was preferred over the default route that was in the core of the network. Instant black hole.
Having just read Joel’s article, I noted that the proposed configuration change would have required the network engineer and any reviewer to carefully examine the routes carried in the extended VPN to make sure that there would be no problems. Similarly, the routes in the core network would have to be examined to make sure that they wouldn’t cause a problem in the VPN that was being extended. Both of these actions violate Joel’s premise that you shouldn’t have to look very far from the source of the change to determine if the change is safe to make. Having to do a lot of work to validate a change all but guarantees that the validation won’t be done very often.
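One way to make that examination cheaper is to automate it. The following is a minimal sketch, not the procedure used in the incident: it assumes the prefixes carried in the VPN and in the core have already been collected as lists of strings, and it only flags prefixes that appear in both. The function name and the sample prefixes are hypothetical.

```python
import ipaddress


def conflicting_prefixes(vpn_routes, core_routes):
    """Return the prefixes that exist in both routing domains.

    An identical prefix in both places (for example a default route)
    means the routers must choose between two sources for it, which is
    the situation a reviewer would want to look at before the change.
    """
    vpn = {ipaddress.ip_network(p) for p in vpn_routes}
    core = {ipaddress.ip_network(p) for p in core_routes}
    shared = vpn & core
    # Sort by IP version first so IPv4 and IPv6 prefixes never get compared.
    return sorted(shared, key=lambda n: (n.version, n.network_address, n.prefixlen))


# Hypothetical route lists, for illustration only.
vpn_routes = ["0.0.0.0/0", "10.20.0.0/16"]
core_routes = ["0.0.0.0/0", "10.0.0.0/8", "192.168.5.0/24"]

for prefix in conflicting_prefixes(vpn_routes, core_routes):
    print(f"Review before extending the VPN: {prefix} exists in both domains")
```

This check only catches exact duplicates. Overlapping prefixes of different lengths, such as 10.20.0.0/16 inside 10.0.0.0/8, are resolved by longest-prefix match and would need a separate review.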
Back to the network configuration.
I’ve always been a proponent of very limited use of static default routes, and static routes in general. Default routes should be originated at the Internet borders. The only exception might be where your network is large enough to be segmented into several major routing domains. Originating default into each domain from the junction with other domains would be appropriate. The key is that only a few routers should originate default routes. And those defaults should be tied to outgoing interfaces, so that if the interface dies, the static default route is withdrawn.
A well-designed routing system will propagate the default routes to the rest of the network. It makes troubleshooting much simpler. If Internet connectivity is lost, you don’t have to wonder where the traffic will flow. It will die as soon as it reaches a router that doesn’t have the default. Go track down where the dynamic routing is failing and you’ve fixed the problem. It’s nicely deterministic.
But how do you determine which routers are originating default routes?
NetMRI retrieves the routing tables of routers (as long as the tables contain fewer than 3000 routes; if you have more routes than that, you should consider breaking the network into multiple private autonomous systems). The Network Explorer/Summaries tab (see image below) lists the routes in the network and the routers that are originating each route. It excludes the routers that are simply forwarding the routes. Because NetMRI is obtaining the routes directly from the routers, it is able to report on default routes within summarized parts of the network.
The example network shown below has hundreds of routers originating the default route. This is because each edge router is configured with a static default that points to its upstream neighbor. Instead of a static configuration, the default route should be learned from the upstream router. Propagating the default via the routing protocol also makes the device configurations more consistent in that they don’t need a device-specific default route in the configuration. The same dynamic routing template could then be used across hundreds of devices, which simplifies configuration management and allows the template to be used to verify proper routing configuration on most of the routers.
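The same kind of originator report can be approximated with a few lines of code if you have routing data from another source. This sketch does not use any NetMRI API or export format; it assumes you have already reduced the data to hypothetical (router, prefix, is_originator) records, and the threshold of two originators is an arbitrary example.

```python
from collections import defaultdict


def originators_by_prefix(records):
    """Group originating routers by prefix, ignoring pure forwarders."""
    result = defaultdict(set)
    for router, prefix, is_originator in records:
        if is_originator:
            result[prefix].add(router)
    return dict(result)


def excessive_default_originators(records, max_allowed=2):
    """Return the sorted default-route originators if there are too many."""
    originators = originators_by_prefix(records).get("0.0.0.0/0", set())
    return sorted(originators) if len(originators) > max_allowed else []


# Hypothetical records: (router name, prefix, originates the route?)
records = [
    ("edge-1", "0.0.0.0/0", True),    # static default on an edge router
    ("edge-2", "0.0.0.0/0", True),
    ("edge-3", "0.0.0.0/0", True),
    ("border-1", "0.0.0.0/0", True),  # the Internet border, as intended
    ("core-1", "0.0.0.0/0", False),   # learned dynamically, only forwarding
    ("core-1", "10.0.0.0/8", True),
]

print(excessive_default_originators(records))
```

With the sample data, the three edge routers show up alongside the border router, which is the pattern to look for: defaults originated anywhere other than the few places you intended.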
Of course, when you use a default route, you’ll need to configure classless routing so that the default will get used. And make sure that you have a summary route to Null0 for the address space you use in your network so that when an internal destination isn’t reachable, the packet gets dropped instead of potentially looping within the network or being forwarded to your ISP, who will (or should) drop it.
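The effect of the Null0 summary follows from longest-prefix matching, which the sketch below illustrates. It is a toy lookup, not a router implementation; the routing table contents and next-hop names are hypothetical, and 10.0.0.0/8 stands in for whatever address space you use internally.

```python
import ipaddress


def lookup(routing_table, destination):
    """Return the next hop of the longest matching prefix, or None."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routing_table.items():
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network.prefixlen, next_hop))
    if not matches:
        return None
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda m: m[0])[1]


# Hypothetical routing table with a default, a discard summary, and one subnet.
table = {
    "0.0.0.0/0": "isp-uplink",
    "10.0.0.0/8": "Null0",
    "10.1.2.0/24": "core-1",
}

print(lookup(table, "10.1.2.7"))      # known internal subnet
print(lookup(table, "10.9.9.9"))      # unknown internal subnet, dropped
print(lookup(table, "198.51.100.5"))  # external destination, follows default
```

Without the 10.0.0.0/8 entry, the lookup for 10.9.9.9 would fall through to the default and the packet would head toward the ISP.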
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html