I’ve been reading a lot of documents, looking for clarity about Non-Stop Forwarding / High Availability / Resiliency and how it interacts with Fast Re-Routing. To some extent, the more I read, the more puzzled I got. The explanations of each topic are pretty good. The explanations of how they interact are good too, but apparently a little incomplete.
Let me explain the problem. Non-Stop Forwarding (NSF) and Stateful Switchover (SSO) are designed so a dual-Supervisor 6500 (or other device) can minimize packet loss during a Supervisor failure. They do this by keeping L2 and L3 state synchronized between the Supervisors. When a Supervisor fails, several things happen:
- SSO maintains L2 forwarding; NSF maintains L3 forwarding
- The spare Supervisor keeps links up (except of course for those on a failed Supervisor module blade)
- The switch’s neighbors perform Graceful Restart to restore adjacency to the new Supervisor (same address, different brain) and synch up, rather than bouncing the routing adjacency and restarting things
- If the neighbors don’t hear from the spare Supervisor within a time window (“I’m here, I’m alive”), the adjacencies bounce — protection against a completely failed / de-powered chassis
That’s all well and good. And we’ve seen or heard about Cisco doing demos where a ping was running through a switch, failover triggered from the CLI, and nary a ping lost.
NSF clearly wins if there is no other routing alternative.
I started thinking about this in the context of wiring closet switches with dual uplinks to the building distribution layer. We’ve been using dual Supervisors in them for IP Telephony readiness or IPT-using sites. This is a Best Practice, recommended so as to not drop calls. One could counter-argue that losing a Supervisor is fairly rare, so is taking a hit of a few seconds on connectivity by not using NSF all that bad a problem? Probably not, but if you’ve paid for all those second Supervisors, why not get the most out of them?
NSF and Fast Hello Timers Are Somewhat at Odds
NSF is solving a somewhat different problem than Fast Re-Route. NSF is about not letting go of a routing adjacency, not experiencing two adjacency flaps (down, then up), preserving routing stability, ignoring a Supervisor failure, “steady as you go”, that sort of thing. Perhaps “riding out the outage” is the best way to put it. NSF in effect denies the peer is gone for a short period, to allow the standby Supervisor time to speak up and re-establish the adjacency.
Fast Re-Routing is about quick reaction based on quickly detecting link failure, then rapid routing re-convergence due to summarization, stubby areas, OSPF incremental SPF (iSPF) (mostly for bigger networks), etc. This leads very naturally to thinking about sub-second Hello timers, at least for OSPF or IS-IS. (EIGRP is best left at 2 second Hellos and a 6 second Hold-time, per one such document; several others, including a Cisco Validated Design document, say 1 and 3 respectively.) There are various Cisco documents discussing sub-second timers, their benefits, the warning not to push too far or you’ll create instability, the impact of SPF throttle timers, etc. Reacting quickly is more or less the opposite of denying anything is down.
There is plenty of documentation available online, also Networkers presentations, discussing the interaction between the two. The basic conclusion appears to be that it can take about 2 seconds for a spare Supervisor to get out its first Hellos and start the Graceful Restart process, so you don’t want to set the dead-timer too low and defeat the whole process. To allow some safety margin, a dead time of at least 3 and preferably 4 seconds appears advisable, per the Cisco testing documentation.
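To make the timer arithmetic concrete, here is a rough sketch in Python. The 2 second first-Hello figure and the 3 to 4 second dead-time guidance come from the paragraph above. The function name, and the worst-case assumption that a full Hello interval has already elapsed since the peer last heard from the failed Supervisor, are mine and purely illustrative; this is a back-of-the-envelope model, not anything from Cisco's testing documentation.

```python
# Back-of-the-envelope model of the NSF vs. dead-timer interaction.
# Assumption (from the text): the standby Supervisor needs about 2 s
# to get its first Hello out after a switchover.
FIRST_HELLO_AFTER_SWITCHOVER_S = 2.0


def worst_case_margin(dead_interval_s, hello_interval_s,
                      first_hello_s=FIRST_HELLO_AFTER_SWITCHOVER_S):
    """Seconds left on the peer's dead timer when the standby's first Hello arrives.

    Worst case (illustrative assumption): the Supervisor fails just before
    it would have sent its next Hello, so one full Hello interval has
    already run off the peer's dead timer.
    A result <= 0 means the peer may drop the adjacency before
    Graceful Restart can start, which defeats NSF.
    """
    return dead_interval_s - hello_interval_s - first_hello_s


# (dead interval, hello interval) pairs, in seconds
for dead, hello in ((1, 0.25), (3, 1), (4, 1)):
    margin = worst_case_margin(dead, hello)
    verdict = "ok" if margin > 0 else "NSF at risk"
    print(f"dead={dead}s hello={hello}s margin={margin:+.2f}s {verdict}")
```

Under these assumptions, a 3 second dead time with 1 second Hellos leaves no slack at all in the worst case, which is one way to read the "at least 3 and preferably 4" advice.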
In the context of dual-uplink wiring closets, my question is then: which is better, doing NSF to ride out a Supervisor failure smoothly but possibly reacting slowly to link failure, or giving up on NSF and reacting quickly? None of the Cisco documents appeared to discuss this, or maybe they did and it just wasn’t emphasized enough to reach my brain.
I’m not going to get into BGP fast convergence here. Separate topic, last time I researched it the quick answer was “lots of cool features in the newer code”. Yet another related topic: IETF and Cisco work on Loop-Free Alternative routing (“feasible successor on steroids”) and Not-Via calculations. (Thanks to Russ White for putting me onto those — see the Fast Re-routing talk from 2009 Networkers for more.)
Which Wins, NSF or Link Failure?
Two highly relevant points have finally surfaced. I suspect the various Cisco authors just take them for granted, hence don’t emphasize them. I will, since not knowing them seems to have skewed my thinking about NSF.
The key points:
- On point-to-point Ethernet cabling, link failure detection is instant (with carrier delay set to 0), so fast Hello timers aren’t needed or all that useful on an L3 routed link.
- Link failure takes precedence over NSF.
The latter point is the one I really had a hard time finding in print. See the NSF Deployment Guide from 2006 (URL provided below). It says:
Operationally, a major consequence and benefit of SSO is that adjacent devices do not see a link failure when the Route Processor switches from the primary to the hot standby Route Processor. This applies to Route Processor switchovers only. If the entire chassis lost power or failed, or a line card failure occurred, the link(s) would fail, and the peer would detect such an event. Of course, this assumes point-to-point Gigabit Ethernet interfaces, packet over SONET (POS) interfaces, etc. where link failure is detectable. Even with NSF enabled, physical link failures are still detectable by a peer and override NSF awareness.
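The peer's behavior described in the quote can be sketched as a tiny decision function. This is only my reading of the documentation, written as Python for clarity; the `Event` names, the `peer_reaction` function and its return strings are hypothetical and do not correspond to any Cisco API or debug output.

```python
from enum import Enum


class Event(Enum):
    LINK_DOWN = "physical link to the neighbor went down"
    HELLOS_STOP = "link stays up, but Hellos stop arriving"


def peer_reaction(event, nsf_aware=True, restart_heard_in_time=True):
    """What an adjacent router does, per the behavior described above.

    nsf_aware: the peer supports Graceful Restart (NSF awareness).
    restart_heard_in_time: the standby Supervisor announced itself
        before the peer's timers ran out.
    """
    # Link failure takes precedence over NSF: no waiting, no grace period.
    if event is Event.LINK_DOWN:
        return "tear down adjacency and reroute immediately"
    # Supervisor switchover with links held up: ride out the outage.
    if nsf_aware and restart_heard_in_time:
        return "hold adjacency and perform Graceful Restart"
    # Dead chassis, or a peer that cannot do Graceful Restart.
    return "tear down adjacency when the dead timer expires"


print(peer_reaction(Event.LINK_DOWN))
print(peer_reaction(Event.HELLOS_STOP))
print(peer_reaction(Event.HELLOS_STOP, restart_heard_in_time=False))
```

The point of the first branch is that NSF awareness is never consulted when the link itself drops, which is exactly the override the Deployment Guide describes.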
Before I re-read the above document and registered its final sentence, the one about physical link failures overriding NSF awareness, I tried some testing. Unfortunately testing NSF and fast re-route is hard: everything happens rather quickly, and you can’t get debug out of the chassis when your console connection goes dead.
Other things to watch out for:
- You really need uplinks NOT on the Supervisor, or you’re mixing both link failure and Supervisor failover for sure (shared fate, design Best Practice for NSF).
- Popping out the Supervisor blade may not be a good idea: there’s some small chance of damage, and it probably causes a fabric stall (thanks to James Ventre for pointing this out)
- You need ping running from something (a PC) through the chassis, preferably to a loopback at the Distribution Layer.
By the way, if a remote link failure occurs during the fairly short NSF failover window, NSF will ignore LSAs or routing advertisements until it completes. The documentation indicates this can cause a short-lived routing loop. What are the odds a Supervisor fails and within a few seconds a link somewhere else fails as well?
Mixing BFD and NSF
The other related topic here is mixing Bidirectional Forwarding Detection, or BFD (which detects neighbor loss quickly and more efficiently than routing protocol Hellos) with NSF. I’ll save that for another blog. For now, let me just state that it appears that BFD is in the process of becoming “NSF housebroken”, but that depending on your code version, the two may not play that well together. See also the link below.
Similarly, if you’re doing MPLS, you have to consider how all this works with MPLS Traffic Engineering Fast Re-route, also directed LDP between tunnel endpoints or for EoMPLS, etc.
Here are the best links I’ve found on the topics discussed above.
(Note 1) This document provides the crucial info that even with NSF enabled, physical link failures are still detectable by a peer and override NSF awareness.
(Note 2) Initially, it appears that Cisco NSF and OSPF/ISIS/EIGRP timer manipulation have complementary objectives. Each feature is dedicated to achieving the fastest possible convergence in the event of a failure on a router. However, more careful analysis reveals that these technologies also have conflicting goals.
(Note 3) In Cisco IOS Release 12.2(33)SB, BFD is not stateful switchover (SSO) aware, and it is not supported with NSF/SSO and these features should not be used together. Enabling BFD along with NSF/SSO causes the non stop forwarding capability to break during failover since BFD adjacencies are not maintained and the routing clients are forced to mark down adjacencies and reconverge.
My conclusions / summary:
- If you use routed point-to-point links, failure detection is nearly instantaneous, particularly if you set the carrier delay to 0 (with interface event dampening). The latest Campus Routed Access Design Guide suggests faster hello timers, etc. for L3 access layer, but reading between the lines, they will rarely if ever be useful, due to link loss triggering first. Since I view decreased timers as mildly risky, why go there for little or no gain?
- As noted, link failure overrides NSF.
- Sub-second OSPF hellos, iSPF, etc. can be ok, but if you make the timer too short, you create instability and you may defeat NSF. The dead timer needs to be around 4 seconds and the NSF timers need to be around 25 seconds. It takes time for the 2nd Sup to send the NSF initial hello-and-we’re-restarting-keep-the-adjacency-up. And you have to allow it enough time to be fully up before the peer gives up on it (NSF wait timer, see the timer link above). OSPF throttle tuning and other factors can also help, but are somewhat secondary to the NSF versus fast timers discussion.
- If you’re planning on detecting failover and converging quickly, you’d best have a good summarizable design with as few routes as possible, well-chosen OSPF areas, etc. Otherwise, SPF runs or EIGRP queries will slow things down significantly.
- You don’t want EIGRP hellos faster than 1-2 seconds, per some Cisco documents I’ve read. And Cisco IOS won’t let you set them to sub-second times.
- If you have a connection to an L2 switch or other multi-access Ethernet link (typically L2, but could be a routed SVI), the router may need hellos or some other method to detect neighbor loss if the link status remains up. This is where fast hellos or BFD come in.
- BFD is kinder to the router CPU, but doesn’t necessarily play well with NSF, even in recent code (haven’t exhaustively checked). NSF-friendly BFD is coming, where the BFD is put into an artificial state to keep the peer router from noticing the link dropped until the 2nd Sup can come up. This requires that the second Sup have backup BFD state and be, in effect, a “hot spare”, I presume to get a message out fast enough that BFD won’t trigger.
- For OSPF and EIGRP, NSF awareness (which is not NSF) is on by default in recent code releases. BGP, off by default. You still have to enable NSF for the routing protocol. You don’t have to enable it in the neighbors supporting graceful restart.
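A rough way to see why fast hellos buy little on point-to-point routed links, but matter on multi-access ones, is to compare upper bounds on detection time. The sketch below is illustrative only: the function and parameter names are made up, it assumes the failure actually drops carrier on a point-to-point link, and it uses the general rule that BFD declares a neighbor down after roughly interval times multiplier.

```python
def detection_time_ms(point_to_point, dead_interval_ms,
                      carrier_delay_ms=0, bfd=None):
    """Upper bound on how long a router takes to notice a lost neighbor.

    point_to_point: True if loss of the neighbor also drops the link
        (illustrative assumption: the failure is visible as carrier loss).
    dead_interval_ms: routing protocol dead / hold time.
    carrier_delay_ms: configured carrier delay on the interface.
    bfd: optional (interval_ms, multiplier) tuple if BFD is running.
    Whichever mechanism fires first wins.
    """
    candidates = [dead_interval_ms]
    if point_to_point:
        candidates.append(carrier_delay_ms)
    if bfd is not None:
        interval_ms, multiplier = bfd
        candidates.append(interval_ms * multiplier)
    return min(candidates)


# Point-to-point: shrinking the dead timer from 40 s to 1 s changes nothing.
print(detection_time_ms(True, 40_000), detection_time_ms(True, 1_000))
# Via an L2 switch: only the dead timer (or BFD) notices the loss.
print(detection_time_ms(False, 4_000), detection_time_ms(False, 4_000, bfd=(50, 3)))
```

The 50 ms by 3 BFD setting is just an example value, and any BFD use is subject to the NSF caveats noted in the list above.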
Thanks to all my peers at Chesapeake Netcraftsmen; we’ve had some wonderful discussions and debates about the above and other topics. The above reflects my take on how NSF works, mostly based on documentation. Any mistakes are mine and mine alone. Thanks also to James Ventre for some good discussions and insights, and challenges to me when I wasn’t thinking deeply enough. I have the suspicion he may blog about some lab testing of NSF, one of these days.