I’ve seen gradually increasing interest in Cisco’s FabricPath technology, so it seems time to talk about designing for FabricPath. I’m going to provide some opinions and an overview, and then point you at some CiscoLive 2012 presentations. I see little point in re-hashing details that have been covered well elsewhere, and I hope it helps to point out resources people might not be aware of.
What is FabricPath?
FabricPath is a routed alternative to Spanning Tree Protocol (STP) and Cisco virtual port-channel (vPC) technology. The reasons for using FabricPath are that it provides routed protection against spanning-tree loops, keeps all links forwarding (unlike STP), and is a bit easier to configure than vPC, especially as your datacenter grows larger. The arguments against FabricPath: it might not yet be mature, and it is Cisco-proprietary, whereas TRILL will be the standard for FabricPath-like behavior.
Scott Lowe has a nice basic writeup about FabricPath at http://blog.scottlowe.org/2011/07/12/brkdct-2081-cisco-fabricpath-technology-and-design/
What Does FabricPath Do?
FabricPath does MAC-in-MAC encapsulation to transport Layer 2 frames across a FabricPath network. The transport is based on routed forwarding to another FabricPath switch. As with FSPF routing in a SAN, FabricPath routing is a link-state protocol that tracks how to reach each of the participating switches.
When an L2 frame arrives on a “classic Ethernet” port, a normal MAC switching lookup occurs. If the lookup indicates that the destination MAC is reached via FabricPath, it also indicates which FabricPath edge switch to send the frame to. That switch ID can then be looked up in the FabricPath routing table. A path is chosen, the frame is MAC-in-MAC encapsulated, and it is routed over to the destination FabricPath switch. That switch decapsulates the frame and forwards it in normal L2 fashion.
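If you want to see both lookups on a Nexus, a few show commands make the two-stage forwarding visible. (A sketch only; the VLAN number is just an example, and the exact output varies by platform and NX-OS release.)

! On the ingress FabricPath edge switch:
show mac address-table dynamic vlan 100   ! remote MACs point at a FabricPath switch ID rather than a local port
show fabricpath switch-id                 ! switch-ID-to-system-ID mappings in the domain
show fabricpath route                     ! the IS-IS-derived routes to each switch ID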
FabricPath allows for multiple “topologies”, i.e. separate layers of FabricPath operation. It also does multipathing across up to 16 paths, each of which can be a 16-member 10 Gbps port-channel. FabricPath uses a time to live (TTL) to protect against short-lived or other routing problems (bugs?) that might somehow cause a routing loop. The underlying routing is based on IS-IS, as is TRILL’s. (Brocade reused program code it already had, so its TRILL implementation is based on FSPF instead.)
Why FabricPath not TRILL?
FabricPath appears to have scaling benefits compared to TRILL. One is conversational learning: an edge device learns MAC address / switch mappings only for the MAC addresses that some locally attached system behind that edge device is actually conversing with. The edge devices do not learn every source MAC seen via ARP flooding. Per the article at http://lamejournal.com/2011/05/16/layer-2-routing-sort-of-and-trill/, it sounds like TRILL can optionally learn all MAC addresses at edge devices, which seems rather undesirable to me. The article compares Cisco OTV, which tracks reachability of MAC addresses. Fair enough, that may be a limiting factor for OTV. Which raises the question: if I’m criticizing TRILL for promiscuous MAC learning, shouldn’t I do the same for OTV? Probably.
FabricPath allows for vPC+, which enables dual-active FHRP behavior at the edge. This is useful for scaling up routing off of FabricPath VLANs.
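As a rough sketch of what vPC+ adds to a vPC pair (the domain number, switch ID, and port-channel number below are illustrative, not from any particular design; a Nexus 7000 also needs install feature-set fabricpath first):

feature vpc
feature-set fabricpath
vpc domain 10
  fabricpath switch-id 1000    ! the emulated switch ID shared by the pair; this is what makes vPC into vPC+
interface port-channel 1
  switchport mode fabricpath   ! vPC+ wants the peer link on FabricPath-capable (F-series) ports
  vpc peer-link

With that in place, HSRP on the SVIs behaves dual-active: either vPC+ peer will route traffic addressed to the HSRP virtual MAC, instead of shipping it across the peer link to the Active router.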
FabricPath peers only on point-to-point links. To me, that’s a distinct plus for bandwidth tracking and for preserving the routing model end-to-end. I see only risk in having intermediate switches between FabricPath routing peers.
Other than that, the web seems to have a lot of noise but little signal on the FabricPath versus TRILL topic.
Designing FabricPath
I’m amused by what I’m seeing in print. Most FabricPath designs show a spine-edge approach, as in the following diagram.
Note: the heavier links are dual-link vPC peer-link port-channels, drawn this way to reduce visual clutter.
I like this design. It is a Clos fabric, an optimal structure for maximizing bandwidth between arbitrary (or selected) endpoints. If you want more bandwidth, you can either add links or add spine nodes. If you start exceeding the 16-way multipathing limit, you can port-channel the links between the same switch pairs to add bandwidth without pushing beyond 16 paths.
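For example, bundling parallel links between the same two switches adds bandwidth while counting as a single FabricPath path (interface and channel numbers here are illustrative):

interface ethernet 1/1-2
  channel-group 20 mode active   ! LACP bundle of the parallel 10 Gbps links
interface port-channel 20
  switchport mode fabricpath     ! the bundle shows up as one FabricPath path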
What we do is turn the middle into a FabricPath routed domain. We do that by configuring the interfaces shown in red to be fabricpath links.
In case you’re wondering, the top and bottom pairs don’t need to be interconnected directly, since the purpose of most datacenter networks is to support either North-South (user to server, top to bottom) traffic, or server to server (East-West, left to right) traffic. You go from left to right across the top or bottom by taking one intermediate routed hop in the diagram.
The other configuration steps are to specify which VLANs are carried across the FabricPath “red zone” above, and to configure a low root bridge priority on the FabricPath switches, making them all equal as root bridge. In effect, the switches and red links above form one giant root bridge switch, interconnecting whatever edge switches are not shown at the bottom of the diagram (the next diagram may suggest that visually).
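Here’s a minimal sketch of that configuration, including the fabricpath links from the previous step (the switch ID, VLAN range, and priority value are illustrative; a Nexus 7000 needs install feature-set fabricpath and F-series ports first):

feature-set fabricpath
fabricpath switch-id 11        ! unique per FabricPath switch (otherwise auto-assigned)

interface ethernet 1/1-4
  switchport mode fabricpath   ! the "red" core-facing links

vlan 100-199
  mode fabricpath              ! only FabricPath-mode VLANs are carried across the domain

spanning-tree vlan 100-199 priority 8192   ! low, equal priority so the FabricPath edge looks like the root bridge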
Concerning TRILL design, a small percentage of what I’ve seen has diagrams like the above. The rest seem to be thinking in terms of Radia Perlman’s RBridge concept, which I would describe as “oatmeal with raisins”: a gluey blob of Layer 2 oatmeal with RBridge “raisins” scattered throughout. For various flows, different RBridges forward between VLANs. What puzzles me is how you troubleshoot that sort of design: you have a mix of Layer 2 forwarding and encapsulated routing in which it might be challenging even to identify which device encapsulates a given flow, and doing so requires lucid thinking and a good understanding of both Layer 2 forwarding and TRILL.
So maybe those lumpy diagrams are just conceptual and nobody really intends to do TRILL that way? Brocade does have pictures that look mighty familiar (and structured): http://www.brocade.com/company/news-events/newsletters/BA1209/0912_technology_showcase.html. Juniper doesn’t like TRILL, but shows a structured diagram as well, in http://www.juniper.net/us/en/local/pdf/whitepapers/2000408-en.pdf.
Congestion is easily managed in the above design, in the sense of monitoring a relatively small set of links between spine and edge and adding bandwidth where needed. Load balancing should take care of any unevenness, unless there are small numbers of flows of vastly different magnitude.
Migrating to FabricPath
One of the drawbacks to Juniper’s QFabric is that it is apparently all-or-nothing: you can start with a small QFabric and then expand, but it is all QFabric. If you buy it and don’t like it, what’s your alternative?
I see FabricPath as being incremental. You can migrate vPC edge pairs to FabricPath one pair at a time. So you might try something like running FabricPath to a pod with two Nexus 5500s and some servers, and then gradually dial up the size of the FabricPath domain.
There was a good talk at CiscoLive 2012 on this topic. It has a lot of diagrams, includes a couple of things I hadn’t thought about (not that I’d worked through a FabricPath migration in detail), and includes cutover timing information so you can plan how long each step should take. The presentation can be found at https://ciscolive365.com/connect/search.ww#loadSearch%searchPhrase=fabricpath&searchType=session&tc=0&value(profileItem_10017)=10173. It includes topics like moving your vPC peer link from M1 ports to F1 ports to support FabricPath and vPC+. (That was session BRKDCT-2202. Also, session BRKDCT-2081 may be of interest for more fundamentals, e.g. how FabricPath works.)
In general, CiscoLive 365 (Virtual) sessions are at https://ciscolive365.com/connect/search.ww#loadSearch%searchPhrase=&searchType=session&tc=0. Registration is free this year, as far as I know. And the San Diego CiscoLive presentations do seem to already be posted!
Is there any reason not to use FabricPath for a DCI?
I can’t see any particular reason why not if the fibre infrastructure is sufficient between data centers.
I have a couple of thoughts in response:
(1) FabricPath assumes L2 connectivity between datacenters, which is not necessarily a great idea. Over DWDM between nearby sites, hmm, gray area. It’s probably better than just flat L2 trunking, or than vPC or VSS with port-channeling.
(2) OTV is less chatty on the WAN and runs over L3 services. It is less chatty due to ARP proxying, and there is no unknown unicast flooding (with selective flooding promised in the future). How much bandwidth does that save? Tough to say; some, but not a lot? It depends on subnet/VLAN sizes and the number of hosts. OTV also offers some protection against accidental looping, and I like the idea that a loop will probably ultimately be throttled by the encapsulation overhead, although I now know of one case where a Nexus 10 Gbps OTV plus STP loop overwhelmed a 6500 Sup2 at the other end.
(3) OTV probably scales a bit better as you add datacenters with shared VLANs, if you use WAN IP multicast to replicate frames. On the other hand, FabricPath might do that replication in switch hardware… (A minimal OTV sketch follows this list.)
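For comparison, the core of an OTV configuration is just an overlay interface riding on a routed join interface, with a WAN IP multicast control group. A minimal sketch, with illustrative interface, group, and VLAN values:

feature otv
otv site-vlan 99                   ! internal site VLAN used to detect other local OTV edge devices
otv site-identifier 0x101          ! required on later NX-OS releases

interface Overlay1
  otv join-interface Ethernet2/1   ! the routed uplink toward the L3 core / WAN
  otv control-group 239.1.1.1      ! WAN multicast group carrying OTV control-plane traffic
  otv data-group 232.1.1.0/28      ! SSM range used to replicate multicast data frames
  otv extend-vlan 100-110          ! the VLANs stretched between datacenters
  no shutdown

(There is also a unicast-only adjacency-server mode for WANs without IP multicast, but the multicast mode is what lets the WAN do the frame replication.)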
I don’t claim those are definitive; they are what I have come up with so far when someone has asked me about this in a consulting or class setting.
One of the considerations if you are looking at FabricPath for DCI is how your multi-destination trees will be built. You will probably find that your broadcast / multicast / unknown unicast traffic is hairpinned between your sites due to the fact that the MDT root will reside on one site or the other (see http://adamraffe.com/2013/03/12/fabricpath-for-layer-2-dc-interconnect/).
It’s also a bit more difficult to do FHRP localisation with FabricPath compared to OTV.
Good point re the MDT root. Yuck, hairpinning is ugly.
I’m a fan of FHRP localization, so that’s also a good point. Why is it more difficult? At a quick glance, the same sort of VACL would seem to work. Your blog (URL above) is a nice info tidbit I hadn’t run across. Thanks!
I’ve become a bit mixed in my opinion of FHRP localization. It seems to help for internal corporate / datacenter routing. Throw in a stateful device like a firewall and maybe it’s not so good; I haven’t seen a good solution to that. Do we await SDN nirvana, with virtual firewalls tied to the application VMs, to decouple the location of a VM (or group of VMs) from the stateful devices? Stateful firewall clustering across a DCI sounds risky.
I commented on the other blog above: FHRP localization is possible today with vPC+ on the edges, as vPC+ will forward at L3 anything with the HSRP MAC as its destination. Ingress needs GSLB or LISP or something like that.
Also, it is absolutely possible to cause an unknown unicast storm over FabricPath, which is very bad for DCI. You can’t do that with OTV.
This is great; I love a good debate! And I’m glad when I can learn from readers and share that with the rest of the blog’s readers!
I’ve always felt that FabricPath was not a good idea for DCI, but what I put earlier were the best arguments I had come up with or seen so far. That’s where the MDT root has some relevance. Note that it should be a non-issue for known unicast (think routing to the egress FabricPath edge switch).
Now I have a strong urge to try this in a lab. Not that I have one handy with the topology Adam showed. I think Adam’s point was that with HSRP localization, the vMAC lives on both sides and confuses FabricPath about where it is, as far as MAC learning goes. Maybe using different group numbers, and hence different vMACs (with the same VIP), could be made to work?
Good point re LISP. I used to think that solved things, but it doesn’t help with the stateful device issue. So is it better for an internal WAN-to-DCI solution, where there might not be stateful devices in the path?
I’ve written previously about LISP and other solutions for ingress. Since then, I’ve stopped buying as completely into the Cisco LISP diagrams, for the reasons noted.
I’m now wondering if LISP for the Internet will take off, or if it has a chicken-and-egg problem. What’s the economic incentive for an ISP to invest in a bunch of very big routers to act as LISP PITR and PETR devices? Are corporations going to do edge LISP to help their provider out? To steer their own cloud traffic? I’ve seen claims about shrinking routing tables and so on. What’s the benefit-to-cost comparison? I’ll admit I’m not tracking the whole LISP deployment rationale closely; I’ve reasoned that unless one has a long ISP background, one is highly unlikely to get consulting calls from ISPs. See also, for example, http://www.networkcomputing.com/next-gen-network-tech-center/lisps-future-is-not-in-the-data-center/229501403?pgno=1
Normally you are right: FabricPath would be confused by the vMAC. vPC+ changes this a little. The outbound vMAC still lives only on one side, but vPC+ will intercept local traffic destined for a remote HSRP vMAC (it installs a hardware rule, similar to what happens with regular vPC when the HSRP Active router is on the other peer). Anycast HSRP (coming soon!) is another solution to this if you don’t want vPC+.
LISP is just one solution to the ingress problem. Really, you just need anything that can track server placement on one side or the other of the DCI and instantiate a /32 route for wherever that server is, so that inbound traffic takes the closest route. LISP accomplishes this by tracking the IP on a local router, but lots of IP mobility solutions would work.
Internet LISP is different than LISP in the enterprise, IMO. You can do a lot of cool things that don’t require provider cooperation. Different blog post. 🙂
More good info! Hadn’t run across that tidbit about vPC+ intercepting. Neat trick.
Yes, my prior blog noted the F5 and (former) ACNS/ACE solution, where you’d "drain" flows to be optimal based on vMotion.
Re LISP, one of the other questions I have is how much vMotion, and how fast (think BIG datacenter), it can handle before it chokes. I don’t think VMware can go all that fast right now; you’d need a mess of vSphere instances.
For site to site across the Internet or internal WAN, I can believe you that LISP might be cool. As you say, that’s a different blog, to be written.
I’ve been amused by the idea I heard a while back of using LISP with GETVPN, to get it to tunnel instead of transport.
I had a chat with Craig about this earlier – I’ve made a small change to the post on my blog to make it slightly more accurate 🙂 So to summarise:
– You can do FHRP localisation on the 5K using mismatched passwords (you can’t use port or VLAN ACLs for this purpose on the 5K); a rough sketch follows this list.
– You can’t do FHRP localisation with FabricPath if you have other FP switches in the domain which aren’t running L3 (such as at a third site). This is due to the MAC learning confusion mentioned above.
– FHRP localisation with FabricPath on the 7K isn’t currently supported.
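For anyone who wants to see it, the mismatched-password trick looks roughly like this (a sketch only, with made-up VLAN, group, address, and password values):

! Site A Nexus 5500 SVI
feature hsrp
feature interface-vlan
interface Vlan100
  no shutdown
  ip address 10.1.100.2/24
  hsrp 100
    authentication text SITE-A   ! deliberately different from site B
    ip 10.1.100.1

! Site B Nexus 5500 SVI: same group and virtual IP, different password
interface Vlan100
  no shutdown
  ip address 10.1.100.3/24
  hsrp 100
    authentication text SITE-B   ! hellos from site A fail authentication, so both
    ip 10.1.100.1                ! sites end up with their own local Active router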
Great, glad you two could reconcile the discrepancies. I appreciate having two technically great engineers providing this info, and thanks for sharing it with any readers! Good stuff!
Thanks for all the input from everyone. The benefits of using OTV as opposed to FabricPath for a DCI are pretty clear to me now. 🙂