Clarifying BFD and BFD Echo

Author
Peter Welcher
Architect, Operations Technical Advisor

What is the difference between “plain” BFD and BFD echo, and what is the “BFD slow timer” for? That’s BFD as in Bidirectional Forwarding Detection (what did you think I meant?). I have been reading the Cisco documentation and Googling on and off for a while now to try to figure this out, and for most of that time the Cisco documentation I encountered was not very helpful on this topic. It tells me how to configure BFD echo and the slow timer, but not why. Wikipedia was winning my “most lucid documentation” contest for Cisco BFD. However, I have now found that the Cisco documentation has apparently been updated (and/or I noticed the updates and found better Cisco documents on BFD). In particular, the tech writer / engineer who improved the prose on this in the Nexus documentation deserves a bonus (or at least a mention in print).

There is also RFC 5880, Bidirectional Forwarding Detection (BFD), which has been available since June 2010, and like most RFCs, it is fairly readable. I guess I should have RTFRFC (Read The Fine RFC), instead of looking for the Cliff’s Notes (so to speak). 

Let’s back up and look at why we might care about BFD, then look at how best to do so. 

Why BFD?

BFD originated with Juniper Networks. It provides a fast way for routing neighbors to detect that their peer is down. BFD can use millisecond timers for communicating with routing neighbors. It is advisable to use interface dampening with that, to minimize the impact of a flapping interface on routing and CPU. 

Compared with e.g. sub-second OSPF hellos, BFD is at a lower level in the protocol stack and is lighter weight for the CPU. That allows BFD to be used on more interfaces (physical or logical). 

BFD also provides a common interface-down detection mechanism that can be shared across routing protocols (including static routes and FHRPs like HSRP). BFD can be useful with EIGRP, because various Cisco documents recommend not setting the EIGRP Hello timer below 2-4 seconds; EIGRP can reportedly become unstable if you do. EIGRP should, however, work well with e.g. 50 msec BFD timers, as that decouples interface-down detection from the protocol hello / adjacency maintenance mechanisms.

Where BFD is especially useful is when there is a Layer 1 or Layer 2 device between your edge router and your carrier, especially if the device does not reliably pass along link status to your router. The device might act like a media converter, copper to optical. Or it might be some form of Ethernet to SONET or other carrier-grade edge device. Or it might act like a L2 hub. 

In such cases, you might have to wait for your routing protocol hellos (EIGRP, OSPF, or BGP probably) to time out. Since that takes tens of seconds, some time will go by before your routing can reconverge and use an alternative link or carrier that is still up. BFD allows your edge router to very quickly learn about the loss of its neighbor and react. 
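To put rough numbers on that difference, here is a minimal Python sketch. It assumes the usual rule of thumb that a neighbor is declared down once a multiplier's worth of intervals pass with nothing heard. The function name and the timer values are illustrative assumptions (the protocol values are commonly cited defaults, so check your own platform), and a BFD multiplier of 3 is assumed.

```python
def detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """Approximate time to declare a neighbor down: interval times multiplier."""
    if interval_ms <= 0 or multiplier <= 0:
        raise ValueError("interval and multiplier must be positive")
    return interval_ms * multiplier


# Illustrative values only (assumptions, not taken from any one platform).
timers = {
    "BFD 50 ms x 3": detection_time_ms(50, 3),
    "EIGRP hold (5 s hello x 3)": detection_time_ms(5_000, 3),
    "OSPF dead (10 s hello x 4)": detection_time_ms(10_000, 4),
    "BGP hold (60 s keepalive x 3)": detection_time_ms(60_000, 3),
}

for name, ms in timers.items():
    print(f"{name:32s} {ms / 1000:8.2f} s")
```

The point is the order of magnitude: a fraction of a second with BFD versus tens of seconds (or minutes) when relying on protocol hellos alone.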

Note that such rapid reaction is a bit at odds with SSO/NSF graceful restart behavior, which is more about a calm, measured (delayed) approach to allow a second supervisor to take over from the first. With SSO/NSF, the idea is to “ride out” the transition. If you like analogies, SSO/NSF is like a router tranquilizer, calming it down, whereas BFD is like a double-shot espresso, making the router more edgy and hyper. The two are pulling in opposite directions. See also one of my prior blogs titled Non-Stop Forwarding and Fast Re-Routing, at https://netcraftsmen.com/blogs/entry/non-stop-forwarding-and-fast-re-routing.html

By the way, if there is no intermediate device, and the media is Ethernet, you can set the carrier delay. See for example http://www.cisco.com/en/US/docs/ios-xml/ios/interface/command/ir-c1.html#GUID-7ED1B93D-93F7-425A-8628-D48EC51679EC.

Carrier delay is the delay before considering an Ethernet interface to be up or down, sort of simple dampening. It is useful with direct point-to-point Ethernet links – it can be set to a low value to speed failover. It is not useful when there is a L1 or L2 device in between the router peers.

BFD History

BFD apparently started out based on a polling (“asynchronous”) approach using control packets. One router polls the other and gets a quick response back. The challenge with this is the delay in waking up the BFD process to send a reply, which causes variable jitter in the response. If the other end is slow to respond and BFD triggers a link down, that’s not good. Backing off on aggressive timers to prevent that from being a problem somewhat defeats the intent of BFD.

BFD echo solves that, and provides a clever way to take some delay out of the above process. The newer Cisco code defaults to using BFD Echo mode to verify bidirectional connectivity, to take advantage of this. 

Achieving Clarity

The Nexus documentation now says:

“The BFD echo function sends echo packets from the forwarding engine to the remote BFD neighbor. The BFD neighbor forwards the echo packet back along the same path in order to perform detection; the BFD neighbor does not participate in the actual forwarding of the echo packets.”

The RFC says something similar. The key point (as I understand it) is that the BFD echo leverages the fast / hardware forwarding path on the neighbor to get the echo packet returned to itself without waiting for an interrupt and special handling by the CPU. 

The documentation goes on with: “Also, the forwarding engine tests the forwarding path on the remote (neighbor) system without involving the remote system, so there is less interpacket delay variability and faster failure detection times.”

Yup, fast / hardware forwarding path. And in other words, you can have tighter timers because you don’t have to wait as much for the neighbor to respond. 

“Finally, BFD can use the slow timer to slow down the asynchronous session when the echo function is enabled and reduce the number of BFD control packets that are sent between two BFD neighbors.”

That is, BFD echo can go fast without interrupting the CPU, and since that will detect an outage, you don’t need BFD control packets running as often, since the control packets aren’t being used for the rapid detection function. That in turn lightens the CPU load and allows more use of BFD. Clever!

The Cisco implementation of BFD echo negotiates the appropriate timers, making it more administrator-proof (and lower maintenance). Details can be found at http://www.cisco.com/en/US/technologies/tk648/tk365/tk480/technologies_white_paper0900aecd80244005.html

Notice that the above also explains the role of the slow timer. It is the timer driving the BFD control interaction, not the pacing of the echo packets.
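A back-of-the-envelope Python sketch shows why that matters for CPU load. This is a deliberately simplified model, not a description of any Cisco implementation: it counts only the control packets one device sends, and it assumes echo packets are handled entirely in the forwarding path. The function name, the 100-session count, the 50 msec interval and the 2000 msec slow timer are all illustrative assumptions.

```python
def control_packets_per_second(sessions: int, echo_enabled: bool,
                               fast_interval_ms: float = 50.0,
                               slow_timer_ms: float = 2000.0) -> float:
    """Control packets per second sent for `sessions` BFD neighbors.

    With echo enabled, control packets are paced by the slow timer;
    the echo packets themselves are not counted here because the
    model assumes they never touch the CPU.
    """
    interval_ms = slow_timer_ms if echo_enabled else fast_interval_ms
    return sessions * (1000.0 / interval_ms)


without_echo = control_packets_per_second(100, echo_enabled=False)
with_echo = control_packets_per_second(100, echo_enabled=True)
print(f"control packets/s without echo: {without_echo:.0f}")
print(f"control packets/s with echo:    {with_echo:.0f}")
```

Under these assumptions the control-plane packet rate drops by the ratio of the slow timer to the fast interval, while failure detection still runs at the fast interval via echo.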

By the way, both ends can send BFD echo, or you can have only one end sending the BFD echo. The latter approach is referred to as BFD asymmetry.

If you look closely at RFC 5880, it does not specify the actual encapsulation for BFD. For single-hop situations, RFC 5881 applies:

“BFD Control packets MUST be transmitted in UDP packets with destination port 3784, within an IPv4 or IPv6 packet. The source port MUST be in the range 49152 through 65535.”

Cisco BFD follows that specification, per various Cisco documents.
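For the curious, here is a hedged Python sketch of what such a control packet looks like on the wire. It packs the 24-byte mandatory section as I read RFC 5880 section 4.1 (no authentication, all flag bits zero) and picks a source port from the RFC 5881 range. It only builds bytes and does not send anything. The helper names and the discriminator and timer values are hypothetical, chosen for illustration.

```python
import random
import struct

BFD_CONTROL_PORT = 3784          # RFC 5881 single-hop destination port
SRC_PORT_RANGE = (49152, 65535)  # RFC 5881 source port range

# Session states as defined in RFC 5880
STATE_ADMIN_DOWN, STATE_DOWN, STATE_INIT, STATE_UP = range(4)


def build_control_packet(state: int, detect_mult: int,
                         my_disc: int, your_disc: int,
                         desired_min_tx_us: int,
                         required_min_rx_us: int,
                         required_min_echo_rx_us: int,
                         diag: int = 0) -> bytes:
    """Pack the mandatory section of a BFD control packet (version 1)."""
    version = 1
    length = 24                                  # no authentication section
    byte0 = (version << 5) | (diag & 0x1F)       # Vers (3 bits) + Diag (5 bits)
    byte1 = (state & 0x03) << 6                  # Sta (2 bits), P/F/C/A/D/M = 0
    return struct.pack("!BBBBIIIII", byte0, byte1, detect_mult, length,
                       my_disc, your_disc, desired_min_tx_us,
                       required_min_rx_us, required_min_echo_rx_us)


def pick_source_port() -> int:
    """Choose a UDP source port inside the range RFC 5881 requires."""
    return random.randint(*SRC_PORT_RANGE)


# Hypothetical session: Up state, multiplier 3, 50 msec timers (in microseconds)
pkt = build_control_packet(STATE_UP, 3, my_disc=1, your_disc=2,
                           desired_min_tx_us=50_000,
                           required_min_rx_us=50_000,
                           required_min_echo_rx_us=50_000)
print(len(pkt), pkt.hex(), pick_source_port(), BFD_CONTROL_PORT)
```

Note that the intervals in the packet are expressed in microseconds, even though the CLI and most documentation talk in milliseconds.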

BFD Best Practices

I haven’t found any Cisco document on this yet, so this section will be short! Here are my thoughts about BFD best practices:

Do use BFD echo if you can. 

Do back off asynchronous polling with the slow timer command. 

Do use interface event dampening. The default timers look pretty good. The idea is it is best to defer having routing consider an interface to be up if the interface has bounced down/up/down in a rather short period of time. If you don’t do that, routing, particularly OSPF, may be doing a lot of reconverging and flooding, and you may be forwarding packets 50% of the time and black-holing them the rest of the time while the routing is churning. 

It is a good idea when attempting fast convergence to also be doing significant amounts of route summarization. The fewer routes, the faster all routing related scans and calculations can be performed.
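To make the dampening recommendation above a bit more concrete, here is a small Python sketch of the general penalty-and-decay idea behind interface event dampening: each flap adds a penalty, the penalty decays exponentially with a half-life, the interface is suppressed above one threshold and reused below another. The class and the parameter values (5 second half-life, reuse 1000, suppress 2000, penalty 1000 per flap) are assumptions for illustration, and the sketch omits any maximum suppress time. Check your platform documentation for the real defaults and behavior.

```python
from dataclasses import dataclass


@dataclass
class Dampener:
    half_life_s: float = 5.0      # assumed half-life of the penalty
    reuse: float = 1000.0         # assumed reuse threshold
    suppress: float = 2000.0      # assumed suppress threshold
    flap_penalty: float = 1000.0  # assumed penalty added per flap
    penalty: float = 0.0
    suppressed: bool = False
    last_update_s: float = 0.0

    def _decay(self, now_s: float) -> None:
        """Decay the penalty exponentially since the last update."""
        elapsed = now_s - self.last_update_s
        self.penalty *= 0.5 ** (elapsed / self.half_life_s)
        self.last_update_s = now_s
        if self.suppressed and self.penalty < self.reuse:
            self.suppressed = False

    def flap(self, now_s: float) -> None:
        """Record one interface flap at time now_s (seconds)."""
        self._decay(now_s)
        self.penalty += self.flap_penalty
        if self.penalty > self.suppress:
            self.suppressed = True

    def is_suppressed(self, now_s: float) -> bool:
        """Report whether routing should still ignore the interface."""
        self._decay(now_s)
        return self.suppressed


d = Dampener()
for t in (0, 1, 2):               # three flaps, one second apart
    d.flap(t)
for t in (2, 5, 10):
    print(f"t={t:2d}s suppressed={d.is_suppressed(t)} penalty={d.penalty:.0f}")
```

With these assumed values, a single flap does not suppress the interface, but three flaps in quick succession do, and routing only gets the interface back after it has stayed quiet long enough for the penalty to decay below the reuse threshold.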

References

See also RFCs 5881-5884 for various BFD settings. The BFD working group document links page at http://tools.ietf.org/wg/bfd/ is useful.

Two somewhat useful Cisco documents about BFD:

Denise Fishburne’s blog on BFD at  http://www.networkworld.com/community/blog/bidirectional-forwarding-detection-bfd-–-little-about-timers-0

Cisco interface dampening:

16 responses to “Clarifying BFD and BFD Echo”

  1. I was just revisiting the docs. A number of the google hits were unclear as to what "bfd slow-timers" does.

    The bfd interval command’s first argument is the frequency of the probe, e.g. bfd echo. It is in milliseconds.

    If you are doing BFD echo, you can use the "bfd slow-timers" command to slow down how often the demand control connection packets get sent, e.g. to 15000 msec. If you are not doing BFD echo (e.g. older platform), then your control connection is the heartbeat and needs to run much more often. If you do that, you are limited as to how much BFD the device can handle.

  2. Interesting question. I see IP SLA as more sophisticated in a number of ways, but also more work for the router. BFD is much lighter weight so can run as frequently as every 50 msec. BFD also just logically brings the interface down, with routing then reacting appropriately — so the interaction with routing is a bit simpler, no special configuration required there either. I see EOT (IP SLA-based Enhanced Object Tracking) as more for static route withdrawal and FHRP reactions, i.e. anything that uses the track command.

  3. What is BFD offload, which only works when "no bfd echo" is configured? Please explain.

  4. BFD offload uses the FPGA rather than the CPU to do control-plane BFD. My guess is that since BFD echo is handled efficiently, they didn’t need to offload the work from the CPU.

  5. Let me clarify a bit more. The BFD echo sender sends a packet from its data plane, addressed with ITS OWN address, to the remote side. So the remote side doesn’t generate a BFD reply; it simply forwards this packet back to the sender.

  6. AndrewX: I agree! And that’s lighter weight because the forwarding plane can handle it, no CPU / processor interrupt involved.

  7. Hi,

    For Cisco routers, BFD packets are tagged as CS6. Any idea whether this applies to control as well as echo packets?

    Regards # mahesh

  8. Different purposes. Set carrier delay low or to 0 for instant response to link down. BFD deals with the situation where you have a failure but some intermediate device keeps Ethernet link status up.

  9. No idea, seems likely. Many router protocols use CS6. Why not do a WireShark capture and see what you can see? And then post a comment with what you’ve found? Thanks!

  10. So BFD and BFD echo are nice when a L1/L2 device is in between two L3 devices, but for a point-to-point Ethernet link between routers (or L3 switch interfaces), BFD is useless and carrier-delay should be used (set to 0). Is this correct?

  11. Great article! It’s always good to have an alternative to the dry documentation where a feature or function is described in a practical sense, rather than a strictly technical one.
    I do have one comment regarding your explanation of BFD async vs echo mode. In async mode, there is no poll-and-response. Async mode operates via unidirectional control packets whereby the sending end expects no response from the receiving end. A BFD async control packet is sent from source to receiver merely for the source to let the receiver know it is alive and well for BFD. On the receiving end, once a BFD session is established and timers are agreed upon, the receiver merely expects to receive x amount of async control packets within y amount of time, or the session goes down. It is this independence between transmittal and receipt that makes this an “asynchronous” process.

  12. Great post, here is the RFC and my understanding.

    “In BFD Asynchronous mode, the systems periodically send BFD Control packets to one another, and if a number of those packets in a row are not received by the other system, the session is declared to be down.”

    “When the Echo function is active, a stream of BFD Echo packets is transmitted in such a way as to have the other system loop them back through its forwarding path. If a number of packets of the echoed data stream are not received, the session is declared to be down.”

    You can use Echo function with Asynchronous mode. When Echo function is used, the echo packets (not the control packets) are used for fail detection. As a result, the Control packets can be sent in a much slower interval.

    The quoted parts are from RFC5880.

    Thanks Peter for the great posts; they led me to the RFC and cleared up my questions about Cisco’s statement on “bfd slow-timer”.
