Practical SDN: L2 Forwarding in NSX, DFA, and ACI

Author
Peter Welcher
Architect, Operations Technical Advisor

How does L2 forwarding work in NSX, DFA, and ACI? That’s the topic for this blog, the second in a series contrasting the behaviors of NSX, DFA, and ACI. We’ll look at L2 forwarding (bridging) and some scenarios where the behavior might not be what you expect.

Prior blogs in this series:

Yeah, short list so far. I’ve got an ambitious list but am not posting it, since it may not all happen. Time, work, life, constraints like that.

[Bridge photo. Source: Wikipedia, Creative Commons license. Gratuitous blog graphic based on the slight relevance of bridges.]

Common Elements (NSX, DFA, ACI)

There’s a key element to all these SDN technologies, really a requirement: bare-metal servers and VMs are unaware of any changes; they think they’re doing normal L2 forwarding via a switch. Little do they know what’s really going on under the hood!

Local forwarding (same Leaf switch or same hypervisor host) is done locally. And should be. No surprises there. Two VMs on the same hypervisor host attached to the same VLAN / portgroup / VXLAN are locally switched by the hypervisor code. Two physical servers on the same Leaf (edge) switch are locally switched. Although later we will slightly qualify the “no surprises” statement for DFA and ACI.

This reflects the current trend: pushing L2 and L3 forwarding onto the Top of Rack (ToR) or Leaf switch, then further down to the hypervisor or chassis where possible. The intense interest reflects the demand for ever-lower latency, plus distributing the workload provides better total performance. “Scale out versus scale up.” I’ve been wondering about that (well, the L3 version of it) since Cisco announced FabricPath. “Do I route on the N7K or distribute it across N5Ks in pods?” Then along came Anycast HSRP for FabricPath. Well, the logical next step is to spread the workload across the Leaf switches rather than the Spine switches.

For each of the SDN / Virtualization Management technologies {NSX, DFA, ACI}, there are several L2 forwarding cases to consider:

  • L2 forwarding between VMs on the same hypervisor host
  • L2 forwarding between servers on the same Leaf switch
  • L2 forwarding between VMs on different hypervisor hosts
  • L2 forwarding between servers on different Leaf switches
  • L2 forwarding between a server and a VM on a host on the same Leaf switch (or vice versa)
  • L2 forwarding between a server and a VM on a host on a different Leaf switch (2 cases: varied VXLAN gateway locations)

The first of the above cases is simple. The hypervisor handles it via local switching. There is no external traffic, no tunneling. Done. 

The second case above is similar, just the physical world equivalent. 

For all of NSX, DFA, and ACI, the remaining cases rely on being clever about getting L2 frames or L3 packets to the right hypervisor or Leaf switch, which then does the usual L2 forwarding. 

The System Knows All

Let’s review what “The System” knows for each of our SDN 2.0 schemes. (Thanks Tom Hollingsworth, @networkingnerd, for the right term at the right time!)

In the prior blog in this series, I mentioned that NSX tracks MAC and IP addresses, which hypervisor they live on, and which physical and virtual LAN segments they connect to. It distributes this information to the hypervisors via the OVSDB protocol. I also mentioned that an L2 “default route” is used to send unknown-MAC and BUM (Broadcast, Unknown unicast, Multicast) traffic to the active L2 gateway. OK, so NSX punts on MAC addresses outside its domain of control.
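
To make that concrete, here’s a minimal Python sketch of the kind of lookup the hypervisor ends up doing, as I read the NSX behavior described above. All names, addresses, and table contents are mine, purely illustrative; the real NSX data structures will differ.

```python
# Hypothetical sketch (my names, not VMware's) of an NSX-style hypervisor
# forwarding decision: known MACs map to the VTEP IP of the hosting hypervisor
# (or to "local"), and unknown/BUM traffic falls back to the active L2 gateway.

LOCAL = "local"
L2_GATEWAY_VTEP = "10.1.1.250"            # active L2 gateway (assumed address)

# MAC table pushed down by the controller (via OVSDB in NSX-MH).
mac_table = {
    "00:50:56:aa:00:01": LOCAL,           # VM on this hypervisor
    "00:50:56:aa:00:02": "10.1.1.12",     # VM on another hypervisor (its VTEP IP)
}

def next_hop(dst_mac: str) -> str:
    """Return where to send a frame: switch locally, tunnel to a VTEP,
    or punt to the L2 gateway (the L2 'default route') for unknown/BUM."""
    first_octet = int(dst_mac.split(":")[0], 16)
    if first_octet & 0x01:                # broadcast/multicast bit set -> BUM
        return L2_GATEWAY_VTEP
    return mac_table.get(dst_mac, L2_GATEWAY_VTEP)

print(next_hop("00:50:56:aa:00:02"))      # -> 10.1.1.12 (VXLAN tunnel)
print(next_hop("00:50:56:bb:99:99"))      # -> 10.1.1.250 (unknown MAC, punted)
```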

I also mentioned that DFA uses MP-BGP and VRFs to track /32 host routes. When a Leaf switch learns new ARP information, it apparently adds a /32 route to MP-BGP. It needs to know which VRF the port belongs to in order to do that. MP-BGP then knows which edge switch (= BGP next hop) every known IP address is connected to. Cool re-use of technology! How fast does it converge with 1,000,000 routes in it? Remains to be seen. (I’ve got a Perl script to create static routes if you want to try that … using onePK might be a lot faster than CLI interpretation of that many commands for testing…)
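
For the curious, here’s a rough Python stand-in for that Perl idea: generate a pile of /32 static routes for scale testing. The VRF name, addressing, and IOS-style syntax are my assumptions; NX-OS wants static routes under “vrf context” with slightly different syntax, so adjust for your platform.

```python
# Illustrative route-generator sketch (addresses and VRF name are made up).
# Uses IOS-style "ip route vrf" syntax; NX-OS configures static routes under
# "vrf context <name>" instead, so adjust the template as needed.
import ipaddress

def static_route_lines(count, vrf="tenant-1", next_hop="10.0.0.2",
                       base="172.16.0.1"):
    start = ipaddress.IPv4Address(base)
    for i in range(count):
        yield f"ip route vrf {vrf} {start + i} 255.255.255.255 {next_hop}"

# Write a million host routes to a file for copy/paste or a scripted push.
with open("routes.cfg", "w") as f:
    for line in static_route_lines(1000000):
        f.write(line + "\n")
```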

ACI apparently learns similar information. Some sort of central distributed database (living on the Spine switches?) tracks that information plus policy information. See also the Eric Flores (@nerdofnetworks) blog “Cisco ACI Speculation” at http://packetpushers.net/cisco-aci-speculation/ for … more speculation. He says “LISP” is used for lookups of where to send traffic. I think “LISP” is more detail than we need; it could just fit the term “distributed database”. While I’m curious, and have also speculated, I keep reminding myself: all we care about is (a) does it work, and (b) does it converge / handle changes quickly. I like Eric’s picking up on what I think I’m seeing (hey, he must be right, he agrees with me!). Cisco is quite possibly recycling code they’ve already built, like FabricPath but using VXLAN tunnels for encapsulation.

One might compare this (and what NSX does as well) to LANE, in that the system tracks which MAC addresses live behind Leaf (edge) devices. Since LANE has negative connotations, let’s not say “LANE” any further! One might equally well say “LISP”, since it too is about location (in this case, the Leaf the endpoint is attached to, i.e. the IP address to tunnel to) and identity (the IP / MAC / VRF info).

Note that DFA and ACI localize address significance by tying the information to a VRF or tenant ID.

For DFA and ACI, a VM running on a hypervisor host (“h-host” for short?) is reached via a VLAN (DFA) or VLAN/VXLAN (ACI doesn’t care; the Leaf handles either). So L2 to a physical server or a VM is pretty much the same situation as far as the Leaf switch is concerned: it just forwards traffic at L2 to a MAC address. In DFA, the external connection is strictly a Leaf node doing VLAN forwarding to a MAC address. In ACI, the Leaf doesn’t even care whether the destination is reached via a VXLAN tunnel or a VLAN.

Layer 2 Forwarding and NSX

I covered NSX L2 forwarding between VMs in the previous blog in this series. The NSX controller knows which hypervisor host each VM is on. It populates the virtual switch’s forwarding table. Local L2 traffic is forwarded within the hypervisor, as noted above. If L2 traffic has to be sent to another hypervisor host, it is tunneled using VXLAN. That is, it is encapsulated in IP and routed by the underlying hardware infrastructure to the other hypervisor host. The prior blog included a picture of normal NSX forwarding between VMs on different hypervisor hosts.
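
As a reminder of what “tunneled using VXLAN” means on the wire, here’s a tiny Python sketch of the 8-byte VXLAN header defined in RFC 7348. Real VTEPs do this in the vSwitch kernel module or in hardware; the function and values here are just for illustration.

```python
# Sketch of VXLAN encapsulation per RFC 7348: an 8-byte header (flags + 24-bit
# VNI) in front of the original Ethernet frame, then outer UDP (port 4789) and
# an outer IP header addressed VTEP-to-VTEP, which the underlay routes normally.
import struct

VXLAN_UDP_PORT = 4789

def vxlan_encap(inner_frame: bytes, vni: int) -> bytes:
    flags = 0x08000000                     # "I" bit set: VNI field is valid
    header = struct.pack("!II", flags, vni << 8)   # upper 24 bits carry the VNI
    return header + inner_frame            # caller still adds UDP/IP/Ethernet

encapsulated = vxlan_encap(b"\x00" * 64, vni=5001)
print(len(encapsulated))                   # 8-byte header + 64-byte dummy frame = 72
```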

For details and good encapsulation graphics, see the Ivan Pepelnjak (@ioshints) video or PDF presentation “Overlay Virtual Networks Data Plane” at http://demo.ipspace.net/get/2.3%20-%20Overlay%20Virtual%20Networks%20Data%20Plane.mp4 (which includes a link to the PDF of “NSX Architectures” at the bottom of the page).

Think about that for a moment. That means you can have VMware and vMotion L2 adjacency based on VXLAN, despite having a (semi-robust?) L2 switched or robust L3 routed datacenter network underneath. If you want.  

VMware NSX (generically, ignoring that it is really two products right now) can also use tunneling or tunneling + IPsec between datacenters. That is, it is a possible Data Center Interconnect (DCI) technology. Whether L2 DCI or a specific DCI technology is wise, that’s a debate for another time. The classic networking answer applies: “it depends”. As a designer, I have some reservations (shared with Ivan and some other bloggers) about doing things like L2 or server/app/firewall clustering between datacenters. The Cisco DCI testing around Long Distance vMotion is highly relevant to using VXLAN for DCI as well. (vMotion latency, throughput constraints.)

Concerning forwarding between physical servers (which are not doing VXLAN), that’s not something NSX deals with. 

That leaves the two cases of physical server (“PHY”) to VM L2 forwarding in NSX. Doing so requires bridging between the VXLAN and the physical VLAN. As far as I know, that L2 forwarding is done via an L2 gateway in either version of NSX, at best an Active / Passive pair.

Thus the NSX hypervisor just punts to the active L2 gateway, which learns MAC addresses for the physical world, and handles L2 forwarding. Each VXLAN can be gatewayed to exactly one VLAN, and that VLAN must be contiguous.
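
A toy sketch of the gateway’s bookkeeping as I understand that constraint (class and names are mine, not VMware’s): a strict one-to-one VXLAN-to-VLAN binding, plus source-MAC learning on the physical side.

```python
# Hypothetical L2 gateway bookkeeping: each VNI bridges to exactly one VLAN
# (and vice versa), and physical-world MACs are learned dynamically.

class L2Gateway:
    def __init__(self):
        self.vni_to_vlan = {}
        self.vlan_to_vni = {}
        self.phys_macs = {}                # MAC -> VLAN, learned from traffic

    def bind(self, vni, vlan):
        if vni in self.vni_to_vlan or vlan in self.vlan_to_vni:
            raise ValueError("VXLAN/VLAN bindings must be one-to-one")
        self.vni_to_vlan[vni] = vlan
        self.vlan_to_vni[vlan] = vni

    def learn(self, src_mac, vlan):
        self.phys_macs[src_mac] = vlan     # standard source-MAC learning

gw = L2Gateway()
gw.bind(vni=5001, vlan=100)
gw.learn("00:1b:21:aa:bb:cc", vlan=100)
```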

See Ivan Pepelnjak’s blog “VMware NSX Gateway Questions”, at http://blog.ipspace.net/2014/01/vmware-nsx-gateway-questions.html, for rules about L2 and L3 gateways in the two NSX versions. What he’s written there seems to match notes I have, modulo gaps in my notes. (And thanks to Dmitri Kalintsev, @dkalintsev, of VMware.) Note that Ivan’s slide deck (VMware NSX Architecture) carefully distinguishes between NSX for Multi-Hypervisor and NSX for vSphere. I tend to automatically ignore slide titles, and missed that on early readings.

See also Ivan’s blog “VMware Virtual Switch Has No Need for STP”, http://blog.ipspace.net/2010/11/vmware-virtual-switch-no-need-for-stp.html, and thanks to Ivan for the term “split horizon forwarding” (which I’ve been lumping into the term “end host mode”).

If you’re migrating a datacenter to NSX, you might want to think about throughput of the L2 gateway, and spread the workload around if necessary. If you think about it, traffic might have to go from a physical server across the datacenter at L2 to get to the L2 gateway, then get VXLAN tunneled back to a hypervisor on the very same Leaf switch the physical server was on. That can be one cost of abstraction: sub-optimal forwarding. The win is the ability to simplify and automate. Like most of networking, it’s a trade-off.

Exercise for the reader: What does NSX do if VM A is on h-host A, VM B is on h-host B, and both h-hosts are on the same VLAN? 

Hint: Wireshark. For full credit, post a comment with your findings. 

I strongly suspect it still uses a VXLAN tunnel. Simpler to code, cleaner, faster. Do we care? Well, I like knowing. But we probably only need to know when we have to troubleshoot it.

Layer 2 Forwarding and DFA

DFA associates each edge VLAN with a unique “segment ID”. The various segment IDs tie back to the tenant, which usually corresponds to one VRF. The VRF provides routing between the various VLANs belonging to a datacenter tenant. Each Leaf port is then associated with a segment ID, hence a tenant and VRF.

Doing this provides the same functionality VRFs provide for VRF-Lite and MPLS. It localizes VLAN, MAC, and IP information. They only need to be unique within the VRF or segment.  
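
A quick sketch of that indirection, with invented numbers and names: port VLAN to segment ID to tenant/VRF, and host entries keyed by (VRF, IP) so addresses only have to be unique per VRF.

```python
# Illustrative DFA-style mapping (all values invented): the VRF key is what
# lets two tenants reuse the same IP address without conflict.

vlan_to_segment = {100: 30001, 200: 30002}
segment_to_vrf = {30001: "tenant-A", 30002: "tenant-B"}

host_table = {}                            # (vrf, ip) -> leaf holding the host

def learn_host(vlan, ip, leaf):
    vrf = segment_to_vrf[vlan_to_segment[vlan]]
    host_table[(vrf, ip)] = leaf

learn_host(100, "10.1.1.10", "leaf-3")     # tenant-A's 10.1.1.10
learn_host(200, "10.1.1.10", "leaf-7")     # tenant-B's 10.1.1.10 -- no clash
print(host_table)
```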

L2 forwarding on the same Leaf switch happens normally.

Here’s the surprise. When ARP occurs, if the same-subnet destination is not local, the Leaf switch responds with a gateway MAC and creates an ARP entry capturing this. The source then sends traffic to that MAC. The Leaf switch looks at the IP address, not the MAC. Based on the IP and VRF (segment), it can then forward the frame via enhanced FabricPath tunneling to the correct Leaf switch (assuming the destination IP is known).
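
Here’s that behavior in hedged pseudo-code form (my structures, not Cisco’s): answer ARP for non-local same-subnet hosts with a gateway MAC, then forward on (VRF, destination IP) rather than on the destination MAC.

```python
# Sketch of the leaf behavior described above; all tables and addresses invented.

GATEWAY_MAC = "0000.1111.2222"             # made-up gateway MAC
local_macs = {("tenant-A", "10.1.1.20"): "0050.56aa.0020"}   # hosts on this leaf
host_to_leaf = {("tenant-A", "10.1.1.30"): "leaf-5"}         # from MP-BGP /32s

def arp_reply_mac(vrf, target_ip):
    """Local target: its real MAC. Non-local same-subnet target: the gateway MAC."""
    return local_macs.get((vrf, target_ip), GATEWAY_MAC)

def forward(vrf, dst_ip):
    """Forward on (VRF, IP): deliver locally, or tunnel to the owning leaf."""
    if (vrf, dst_ip) in local_macs:
        return "switch locally"
    leaf = host_to_leaf.get((vrf, dst_ip))
    return f"FabricPath-tunnel to {leaf}" if leaf else "unknown: flood or drop"

print(arp_reply_mac("tenant-A", "10.1.1.30"))   # -> gateway MAC
print(forward("tenant-A", "10.1.1.30"))         # -> FabricPath-tunnel to leaf-5
```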

Note that enhanced FabricPath is still based on IS-IS tracking reachability between the Leaf switches. IS-IS routing tracks how to reach the participating switches, much as SAN FSPF tracks how to reach a given Fibre Channel Domain (switch ID). (Which is probably why the Brocade LAN fabric routing / TRILL variant uses FSPF: they already had the programming code. Everything in networking gets recycled.)

Minor detail: non-IP L2 traffic has to be forwarded via FabricPath. The slides I have don’t discuss the details. My guess is unknown MAC flooding and conversational learning.

Concerning traffic between two VMs on different hypervisors: the VXLAN tunnel is routed over the DFA switch fabric. There is no mutual visibility; the tunnel is opaque to the fabric. From DFA’s perspective, VMs are similar to physical hosts: switch or route to the MAC address (or to the L2 gateway device).

Layer 2 Forwarding and ACI

ACI maps the MAC, IP, and presumably tenant information to a Leaf or VTEP endpoint using the “distributed mapping database.” That is, all forwarding appears to be based on policy permitting connectivity together with the global awareness of which endpoints are reached via which Leaf switch. Enhanced VXLAN tunneling is then used to encapsulate and deliver the frame (L2) or packet (L3) to the destination. If the destination is on the same Leaf switch, I’d think no tunneling is needed. 

The unexpected aspect here is that it seems same-subnet versus different-subnet may be somewhat irrelevant, since policy determines connectivity. If your policy is set up to connect on a per-VLAN or per-subnet basis, then the system mimics L2 switching.
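
Continuing the speculation, here is a sketch of that two-step decision (every name and value here is mine, not Cisco’s): does policy permit the two endpoint groups to talk, and which Leaf/VTEP owns the destination endpoint?

```python
# Speculative ACI-style lookup: policy (contracts between endpoint groups) plus
# a mapping database of which leaf/VTEP owns each endpoint. All values invented.

mapping_db = {("tenant-A", "10.1.1.30"): "leaf-5"}     # endpoint -> owning leaf
endpoint_group = {("tenant-A", "10.1.1.10"): "web",
                  ("tenant-A", "10.1.1.30"): "db"}
contracts = {("web", "db")}                            # permitted EPG pairs

def aci_forward(tenant, src_ip, dst_ip):
    src_epg = endpoint_group[(tenant, src_ip)]
    dst_epg = endpoint_group[(tenant, dst_ip)]
    if (src_epg, dst_epg) not in contracts:
        return "dropped by policy"
    leaf = mapping_db.get((tenant, dst_ip))
    return f"VXLAN tunnel to {leaf}" if leaf else "punt to spine mapping lookup"

print(aci_forward("tenant-A", "10.1.1.10", "10.1.1.30"))  # -> VXLAN tunnel to leaf-5
```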

I have not seen a discussion of what ACI uses for routing across the Leaf-to-Spine Clos tree. IS-IS would be rather natural, and it would seem the whole body of IS-IS and FabricPath code might lend itself well to this, changing what needs changing.

 

Concerning traffic between two VMs on different hypervisors: the VXLAN tunnel is terminated in hardware in an ACI Leaf. That means VMs doing VXLAN and physical hosts are on a par. Since there is no encapsulating VXLAN tunnel that is opaque to the switches, ACI can identify and provide useful information about applications. Furthermore, the knowledge of both the physical connectivity and the VXLAN overlay between Leaf switches means that ACI can tie the two together for troubleshooting purposes. 

 

After-Thoughts

CiscoLive Milan will have about 10 ACI presentations and some DFA sessions as well. I plan to pull the slides as soon as I can and comb through them for additional technical details. 

In reviewing the slideware I have, I realized there were a couple of topics from the prior blog that deserve more detail. But not yet! The intent of this series is to provide a cross-wise overview. Hence, some details will be left for “later”.

Details given short shrift so far:

  • How NSX ARP works with VTEP gateways, and whether return traffic (physical to virtual) funnels back through a single gateway.
  • The role of the 1000v in DFA; the VLAN to “segment ID” mapping process: 1000v profile or DCNM profile for a new server. That is, how does DFA know which company/tenant/VRF a new device / VLAN belongs to?
  • VXLAN and VLANs and STP: but Ivan beat me to it — see the link above!

Twitter: @pjwelcher
