Practical SDN: L3 Forwarding in NSX, DFA, and ACI

How does L3 forwarding work in NSX, DFA, and ACI? That’s the topic for this blog, blog #3 in a series. The series attempts to contrast the various behaviors of NSX, DFA, and ACI. So let’s jump right in and see what’s happening at Layer 3. (And if you’re still reading, the last two blogs took a while to write and ended up rather long — I’m trying to avoid those pitfalls here.)

(Source: Wikipedia. Creative Commons license) (Graphic by Tosaka)

Prior blogs in this series:

Practical SDN: L2 Forwarding in NSX, DFA, and ACI
Practical SDN: NSX, DFA, and ACI, The All-Seeing Eye. See this first blog for a general disclaimer.

Common Elements (NSX, DFA, ACI)

As noted in the prior blog, all of these SDN 2.0 technologies attempt to distribute the routing, to minimize latency and distribute workload.

Each of NSX, DFA, and ACI virtualizes or distributes the default gateway, so that the local hypervisor or the Leaf switch can act as a virtual L3 default gateway. No more HSRP, VRRP, or GLBP! Anycast FHRP was an interesting idea by Cisco, and this approach ought to scale even better!

Exercise for the reader: It seems the default gateway ARP response would have to have a common virtual MAC address (VMAC), to support vMotion between hypervisors on different Leaf switches. Verify this and post a comment with your findings.
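To make the reasoning behind that exercise concrete, here is a minimal Python sketch. It assumes (and this is the unverified guess from the exercise, not documented vendor behavior) that every Leaf or hypervisor answers ARP for the gateway IP with one shared VMAC. The names Leaf, GATEWAY_IP, and GATEWAY_VMAC, and the address values, are made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical values for illustration only; not from any vendor documentation.
GATEWAY_IP = "10.1.1.1"
GATEWAY_VMAC = "00:00:5e:00:53:01"


@dataclass
class Leaf:
    """A Leaf switch (or hypervisor) acting as the local default gateway."""
    name: str

    def arp_reply(self, target_ip: str) -> Optional[str]:
        # Assumption: every Leaf answers for the gateway IP with the same VMAC.
        if target_ip == GATEWAY_IP:
            return GATEWAY_VMAC
        return None  # ordinary ARP handling is not modeled here


def gateway_mac_survives_move(old_leaf: Leaf, new_leaf: Leaf) -> bool:
    """True if a VM's cached gateway MAC is still valid after it moves."""
    cached = old_leaf.arp_reply(GATEWAY_IP)
    return cached == new_leaf.arp_reply(GATEWAY_IP)
```

If the two Leaf switches returned different MACs, the moved VM would keep sending to a stale gateway MAC until its ARP cache timed out, which is why a common VMAC seems necessary.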

Layer 3 Forwarding and NSX

NSX has both distributed and edge routing functionality.

NSX for Multi-Hypervisor supports routing between VXLANs via L3 gateway nodes (active/passive for high availability). Each logical router is like a VRF, its own routing domain. It can be connected to any VXLAN. One uplink to the physical network is supported. NSX for Multi-Hypervisor also supports distributed logical routers with connected routes, static routes, and a default route to the L3 gateway.
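As a rough illustration of "each logical router is like a VRF," here is a small Python sketch of a per-router table holding connected routes, static routes, and a default route, with longest-prefix-match lookup. The class name LogicalRouter and the next-hop labels are hypothetical; this is not the NSX API or its internal data structure.

```python
import ipaddress


class LogicalRouter:
    """One routing domain (VRF-like); its table is independent of any other."""

    def __init__(self, name):
        self.name = name
        self.routes = []  # list of (network, next_hop_label)

    def add_route(self, prefix, next_hop):
        self.routes.append((ipaddress.ip_network(prefix), next_hop))

    def lookup(self, dst):
        """Return the next hop of the longest matching prefix, or None."""
        addr = ipaddress.ip_address(dst)
        matches = [(net, nh) for net, nh in self.routes if addr in net]
        if not matches:
            return None
        return max(matches, key=lambda m: m[0].prefixlen)[1]


# Hypothetical example: two tenants reuse the same prefix without conflict.
tenant_a = LogicalRouter("tenant-a")
tenant_a.add_route("10.1.1.0/24", "connected:vxlan-5001")
tenant_a.add_route("0.0.0.0/0", "l3-gateway")

tenant_b = LogicalRouter("tenant-b")
tenant_b.add_route("10.1.1.0/24", "connected:vxlan-6001")
```

Because each router object owns its own table, the same 10.1.1.0/24 prefix resolves to a different VXLAN depending on which routing domain does the lookup.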

NSX for vSphere supports distributed routing between VXLANs and VLANs, i.e. the physical world too. Outbound traffic to the physical world is distributed. Inbound goes via the Designated Instance (DI).

The following diagram shows how L3 forwarding between VMs works.

20140124-fig01

Ivan Pepelnjak (@ioshints) covers this in a video and in his NSX Architecture slides; see his “Layer 3 Gateways” at http://demo.ipspace.net/get/4.2%20-%20Layer-3%20Gateways.mp4. I’m relying on his presentation, as the other published sources I have are less clear. It has a great walkthrough of the L2 tunneling encapsulation process. L3 forwarding is more or less the same, except that the distributed router does a forwarding lookup and header rewrite before tunneling, as he describes.

Brad Hedlund (@bradhedlund) walks through the Designated Instance (DI) ARP and routing behavior for NSX for vSphere distributed routing in his blog at http://bradhedlund.com/2013/11/20/distributed-virtual-and-physical-routing-in-vmware-nsx-for-vsphere/. With a diagram, even! (Thanks, Brad!) That covers distributed routing between a VM and a physical device.

There are restrictions on running L2 and L3 on the same gateway (at least for NSX for Multi-Hypervisor). I’ve heard that rule compared to the rules for Cisco OTV: a VDC with a VLAN’s SVI in it can’t do OTV, so we use a dedicated OTV VDC. I imagine the rule is to keep things simpler. I recall the logic of Cisco IOS IRB (Integrated Routing and Bridging) always bothered people.

Anyway, see Ivan’s VMware NSX Gateway Questions. I’m going to duck the topic of other rules and constraints, since they may evolve over time, and would be Too Much Information. What I would like to see: more discussion around how one can and cannot use NSX for vSphere L2 and L3 gateways, and typical use cases. VMware will likely be publishing that over time as they document NSX for a broader audience.

NSX for vSphere does or will run BGP and OSPF, using the controller as proxy for the distributed virtual routing functionality (i.e. one OSPF neighbor to the DI edge routing function, not many). The typical use case may be BGP to the physical world, and OSPF internally.

Summary: in NSX, you can assemble somewhat basic logical switches and routers. Most of the evolution appears to be happening in NSX for vSphere. The distributed logical router is neat stuff for the virtual world. L2 gatewaying appears useful for P2V migration. L3 edge gatewaying appears simpler for creating a virtual application pod or pods and routing from the physical to the virtual side of things.

Routing from a VM to an external device via the edge gateway tunnels to the edge router, which then acts like a physical router, forwarding to the routing next hop.

Layer 3 Forwarding in DFA

With DFA and ACI, L3 forwarding isn’t that much different from the L2 forwarding in the prior blog.

When a host ARPs for a VLAN default gateway, the Leaf switch intercepts it and returns a virtual MAC address for the VLAN default gateway.
The host L2 encapsulates with DMAC = vMAC, and sends the frame out. The Leaf switch sees it.
If the destination IP is known, the Leaf switch tunnels the packet using FabricPath to the Leaf switch the destination IP is connected to. The MAC header inside uses the MAC addresses of the two Leaf switches.
Just as with L2 FabricPath, the receiving switch de-tunnels the frame. It also removes the L2 header inside.
It then does L3 lookup and forwarding with SMAC = virtual default gateway MAC and DMAC = end system MAC.

All of this is VRF-aware.
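The steps above can be sketched in Python as a toy model. This only mirrors the sequence as described in this blog; it is not Cisco's implementation, it does not model FabricPath headers, and every name in it (Frame, Tunneled, HOSTS, LEAF_MAC, the MAC labels, the "red" VRF) is made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional

GATEWAY_VMAC = "gw-vmac"  # hypothetical label for the shared gateway MAC


@dataclass(frozen=True)
class Frame:
    smac: str
    dmac: str
    src_ip: str
    dst_ip: str
    vrf: str


@dataclass(frozen=True)
class Tunneled:
    egress_leaf: str  # where the fabric delivers the tunneled frame
    inner: Frame      # inner MAC header carries the two Leaf MACs


# Hypothetical host table: (vrf, ip) -> (attached leaf, host MAC)
HOSTS = {
    ("red", "10.1.1.10"): ("leaf1", "mac-a"),
    ("red", "10.1.2.20"): ("leaf2", "mac-b"),
}
LEAF_MAC = {"leaf1": "mac-leaf1", "leaf2": "mac-leaf2"}


def ingress(leaf: str, frame: Frame) -> Optional[Tunneled]:
    """Ingress Leaf: frame sent to the gateway VMAC, destination IP known."""
    if frame.dmac != GATEWAY_VMAC:
        return None  # plain L2 forwarding is not modeled here
    entry = HOSTS.get((frame.vrf, frame.dst_ip))
    if entry is None:
        return None  # unknown destination is not modeled here
    egress_leaf, _ = entry
    inner = Frame(LEAF_MAC[leaf], LEAF_MAC[egress_leaf],
                  frame.src_ip, frame.dst_ip, frame.vrf)
    return Tunneled(egress_leaf, inner)


def egress(tunneled: Tunneled) -> Frame:
    """Egress Leaf: de-tunnel, drop the inner L2 header, route, rewrite MACs."""
    pkt = tunneled.inner
    _, host_mac = HOSTS[(pkt.vrf, pkt.dst_ip)]
    return Frame(GATEWAY_VMAC, host_mac, pkt.src_ip, pkt.dst_ip, pkt.vrf)
```

Keying the host table on (vrf, ip) rather than on the IP alone is what makes the lookup VRF-aware in this sketch.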

Diagram:

20140124-fig02

Layer 3 Forwarding in ACI

The ACI materials make a point that an IP address has no location requirement; it can be anywhere. I suspect that ACI is really tracking the location of {IP, tenant} type information, to allow for different tenants with duplicated IP addresses. It’s not clear what the capabilities and constraints are when {IP1, tenant1} wants to talk to {IP2, tenant2}. How might such routing work? I would imagine that a global interconnect “tenant” would be needed, just as we currently use the Internet to talk between entities. Some private back-office interconnects between stock firms are moving to assigned public address space used privately, because the interconnections have become twisty little mazes of static routes and NAT.

If a host in tenant1 wishes to reach, say, 10.1.1.1, which is in use by an attached host in tenant1 but also by one in tenant2, there would need to be some way to disambiguate. In the physical world, we would probably NAT to public address space, at least for connecting two entities over the Internet, and sometimes for VPN connections. I’ll have to chalk that up as a detail TBL (to be learned).
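Here is a small Python sketch of what tracking {IP, tenant} locations might look like. To be clear, this illustrates my guess above, not ACI's actual data structure; EndpointTable and the leaf names are hypothetical.

```python
class EndpointTable:
    """Maps (tenant, ip) to the Leaf where that endpoint currently lives."""

    def __init__(self):
        self._location = {}

    def learn(self, tenant, ip, leaf):
        # Re-learning the same (tenant, ip) on a new Leaf models a move.
        self._location[(tenant, ip)] = leaf

    def locate(self, tenant, ip):
        return self._location.get((tenant, ip))


table = EndpointTable()
table.learn("tenant1", "10.1.1.1", "leaf1")
table.learn("tenant2", "10.1.1.1", "leaf7")  # same IP, different tenant
table.learn("tenant1", "10.1.1.1", "leaf3")  # tenant1's host moved
```

Within a tenant the lookup is unambiguous. What the sketch cannot answer is the cross-tenant case: a host in tenant1 that asks for 10.1.1.1 will always find tenant1's own endpoint, so reaching tenant2's would need something extra, such as NAT or a shared interconnect.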

Diagram:

20140124-fig03