Extended VLAN Mitigation

This article is triggered by some recent work I’ve been doing. We are migrating a customer with an old Nortel switching infrastructure to a high-speed Cisco 6509-based infrastructure with a lot of 10 Gbps links. Our design calls for Layer 3 (routing) to the closet or server zone. There are some nifty Spanning Tree mitigation techniques coming (or replacements that allow use of many paths in a L2 infrastructure). That technology is not quite here yet.

There is perhaps a theme there. Design wisdom sometimes lies in not using the latest and greatest technology or new features, at least not until “ready for prime time” (mature). It can also consist of “appropriate technology”, i.e. something the customer can sustain. Chesapeake Netcraftsmen personnel not only try to do the Right Thing, we often work to reduce maintenance work, be it for ourselves or for our customers. (Not intended as a shameless plug per se, it is just the way we do things.)

Sidenote: The slightly ironic factor here is that the customer's Nortel network (7 years old) uses a Nortel split-chassis EtherChannel technique (“SMLT”), which is most naturally replaced using the Cisco VSS technology. We jointly made a decision to be a bit conservative about using VSS, since it is new Cisco technology. We are using it in the data center for servers that appear to need both of 2 Gbps connections, but only in server zones with a heavy requirement in that regard. The decision reflects our use of Layer 3 routing at all levels of the hierarchy, to minimize Spanning Tree failure domains and provide a highly stable infrastructure. Replicating the Nortel SMLT design would have carried a legacy design into a modern set of equipment, where there really is no need to do so.

The Challenge

There was one potential problem. The present network has some client workstations and printers or servers “isolated” (segmented) by VLANs. These VLANs run all over the campus and into the Data Center, connecting to various DMZ servers and firewalls at Layer 2. I’m told the network has not had Spanning Tree events, and that these “Extended VLANs” were deployed mainly due to not having the ability to use ACLs in Nortel gear, and/or subsequent concerns about performance. There also is the sentiment that ACLs are complex to maintain, and the site has had some problems with botched ACLs or ACLs that quietly got removed from a key interface (due to troubleshooting?).

The problem? How do you carry a VLAN across a routed network core?

Variants of this problem occur within data centers (VMotion or L2 heartbeat across routed server zones, backup VLAN size control). They also occur between data centers (“Data Center Interconnect”). Cisco has some great writeups on the latter topic, definitely a bit of an advanced topic but when you need it, you need it bad!

Sidenote: We’re in the position of trying to hold the Layer 3 line between data centers at some sites — there’s a difference between necessary L2 interconnect and completely unnecessary L2 caused by a server admin just assuming that clustering between data centers is as good an idea as local clustering. But I’ll save that rant for another blog.

The Approach

Our approach was to first buy some time. The operations contractor was tasked with moving the point of connectivity out of the data center, so we could proceed with data center migration knowing that the VLANs went from buildings more directly into the relevant DMZs. This in effect “pushed” the relevant VLANs to not transit the data center.

Some of the extended VLANs were legacy situations, and were removed. Two more connected contractor offices across the infrastructure to the contractor DMZ, and are being replaced with direct patching (which is even easier to support), replacing MMF GBICs to allow use of SMF due to distance considerations.

The remaining VLANs are mostly sparse: a couple of dedicated workstations here and there. The one exception is contractor / guest, where a requirement for a wide-scale secure approach has become clearer over time, due to the large number of both contractors and guests on site.

It appears that the client desktops for the sparse VLANs can be routed to the DMZ VLAN, if an appropriate segmentation technique isolates that traffic. This can be thought of as “squeezing the VLAN all the way back to the relevant DMZ”.

The Alternatives

We considered a large number of “segmentation” techniques. These included Layer 3 Techniques:

Access lists (in & outbound at both ends, the client-side VLAN, and the DMZ VLAN ingress point)
VRF Lite
VRF Lite with GRE Tunnels
MPLS VPN

And Layer 2 over Layer 3 techniques:

EoMPLS
EoL2TPv3
Fancier Ethernet-Over Techniques (VPLS, EoMPLSoGRE, etc.)

And other approaches:

Use Cisco switches and just do extended VLANs (yecch — ‘nuf said?)
NAT the clients into the VLAN subnet so DMZ devices think they’re local (turned out to not be necessary)
Local Area Mobility (at the risk of /32 route pollution, plus it is rather a kluge)
Use dedicated smaller switches ($$$)
Bridge over GRE (not supported anymore, assuming it ever was really supported)
DLSW+ (limited performance, would require routers)

The Short List

We know anyone reading this is obviously intelligent (after all, you’re reading this), so the “other approaches” are pretty clearly not great ideas. I was amused in trying to come up with a Really Complete List of approaches, however half-baked. Policy-Based Routing also came to mind; there’s probably some way to warp it for this use, not that doing so would be a good idea. Readers are invited to add comments if I missed some other semi-plausible way to solve the problem.

From experience, MPLS VPN is something I can only recommend where there is a clear need or where a site’s staff has solid BGP skills. That was not the case here, and after discussion, it appeared best to avoid MPLS VPN.

VRF Lite is very efficient, but has the drawback of requiring “per-VRF VLAN plumbing” between switches. That means turning the uplinks into trunks rather than routed links, which doesn’t help L3 convergence. It also means a lot of configuration.
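To put a rough number on “a lot of configuration”, here is a back-of-the-envelope sketch in Python. The campus size (5 VRFs, 60 routed links, 3 closets per VRF, dual head ends) is made up for illustration, and the hop-by-hop count assumes every link carries every VRF, which is an upper bound.

```python
def vrf_lite_hop_by_hop(num_vrfs, num_links):
    """Per-VRF interface stanzas for plain VRF Lite.

    Assumes every inter-switch link carries one VLAN/subinterface per VRF,
    configured on both ends of the link (an upper bound).
    """
    return num_vrfs * num_links * 2


def vrf_lite_gre_star(sites_per_vrf, head_ends=1):
    """Tunnel interface stanzas for VRF Lite into a star of GRE tunnels.

    sites_per_vrf lists how many closets actually have clients in each VRF.
    One tunnel per (VRF, closet, head end), configured on both ends.
    """
    return sum(sites_per_vrf) * head_ends * 2


# Hypothetical campus, not the customer's real numbers.
hop_by_hop = vrf_lite_hop_by_hop(num_vrfs=5, num_links=60)
gre_star = vrf_lite_gre_star(sites_per_vrf=[3, 3, 3, 3, 3], head_ends=2)

print(hop_by_hop)  # 600
print(gre_star)    # 60
```

The point of the comparison is that with sparse VLANs the tunnel count scales with the handful of closets that actually have clients, not with the number of links in the campus.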

The problem with EoL2TPv3 and GRE-based approaches is encapsulation performance. The encapsulation is not done in hardware in 6500 switches unless you use specialized modules (SIP module, or Ethernet Services module on 7600), which cost quite a bit. I’m told the 3750 does GRE encapsulation as process switching, not really supported, not a good idea.

On the other hand, I’ve seen some testing results that say EoMPLS is great on a 6500, and VRF Lite into GRE is perhaps capable of 6.75 Gbps on a 6509 Sup720 (about 10 Mpps of 64 B packets). With distributed forwarding, e.g. on 6708 line cards, perhaps double that. That made our short list:

Access lists (in & outbound at both ends, the client-side VLAN, and the DMZ VLAN ingress point)
VRF Lite with GRE Tunnels
EoMPLS

Of these, ACLs and EoMPLS are pretty much wire-speed. VRF Lite into a star of GRE tunnels to one or dual head ends has performance limitations, but is fairly simple to configure and manage — and the performance looks like it far exceeds what is needed.
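For anyone wondering how 10 Mpps of 64 B packets relates to 6.75 Gbps, the numbers line up if the Gbps figure counts the 20 bytes of preamble, start-of-frame delimiter and inter-frame gap that accompany every Ethernet frame on the wire. That is my assumption; the test results I saw did not say how the rate was measured. A quick Python check:

```python
# Bytes on the wire per frame that are not part of the frame itself:
# 7 B preamble + 1 B start-of-frame delimiter + 12 B inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def pps_to_gbps(pps, frame_bytes, include_wire_overhead=True):
    """Convert a packet rate to a bit rate in Gbps."""
    per_frame = frame_bytes + (WIRE_OVERHEAD_BYTES if include_wire_overhead else 0)
    return pps * per_frame * 8 / 1e9


def gbps_to_mpps(gbps, frame_bytes, include_wire_overhead=True):
    """Convert a bit rate in Gbps to a packet rate in Mpps."""
    per_frame = frame_bytes + (WIRE_OVERHEAD_BYTES if include_wire_overhead else 0)
    return gbps * 1e9 / (per_frame * 8) / 1e6


print(round(pps_to_gbps(10e6, 64), 2))                               # 6.72
print(round(pps_to_gbps(10e6, 64, include_wire_overhead=False), 2))  # 5.12
print(round(gbps_to_mpps(6.75, 64), 2))                              # 10.04
```

Counting only the 64 bytes of each frame, the same 10 Mpps works out to 5.12 Gbps, so it matters which convention a test report uses.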

EoMPLS is so easy to configure, it is seductive. It would perpetuate the present situation, at low risk and low hassle. The risk is almost that it is too easy. Layer 2 “spaghetti” over a L3 infrastructure kind of defeats (or at least, end runs) the whole intent.

If you’re wondering “why EoMPLS but not MPLS VPN”, the point is that EoMPLS requires enabling MPLS labels, but no BGP. The 6509s have the limitation of not doing local switching on EoMPLS xconnect ports, but there are workarounds. (I hope to blog soon about “loopback cables” and show a small EoMPLS xconnect configuration that was lab tested.)

The large guest / contractor solution? We’ve proposed a building VLAN and addressing scheme to support centralized NAC well. One aspect is a guest / contractor VLAN for every closet switch. A simple ACL can deny traffic inbound from that VLAN to any internal address. It gets a bit messier since DHCP and DNS services are needed, and web proxy is required (can’t have headlines saying somebody at the site was viewing bad web pages or doing something bad). And the web proxy has to be controlled to prevent “hairpinning” via the proxy to access internal web sites or services that guests should not be accessing. But those are all solvable problems.
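The logic of that guest / contractor ACL can be sketched as a first-match rule list. This is a Python model, not switch configuration, and every address and port in it is hypothetical (the internal range, the DNS, DHCP and proxy servers are placeholders, not the site's real addressing).

```python
import ipaddress

# Hypothetical internal address space, for illustration only.
INTERNAL_NETS = [ipaddress.ip_network("10.0.0.0/8")]

# Hypothetical services guests must still reach: (server, protocol, port).
PERMITTED_SERVICES = {
    (ipaddress.ip_address("10.1.1.10"), "udp", 53),    # DNS
    (ipaddress.ip_address("10.1.1.11"), "udp", 67),    # DHCP server
    (ipaddress.ip_address("10.1.1.20"), "tcp", 8080),  # web proxy
}


def guest_inbound_permitted(dst, proto, port):
    """Model of an ACL applied inbound on a guest / contractor VLAN.

    Rules are checked in order and the first match wins:
      1. permit the specific DNS, DHCP and proxy services
      2. deny anything else destined to an internal address
      3. permit the rest
    """
    dst = ipaddress.ip_address(dst)
    if (dst, proto, port) in PERMITTED_SERVICES:
        return True
    if any(dst in net for net in INTERNAL_NETS):
        return False
    return True


print(guest_inbound_permitted("10.1.1.10", "udp", 53))      # True
print(guest_inbound_permitted("10.2.3.4", "tcp", 443))      # False
print(guest_inbound_permitted("198.51.100.7", "tcp", 443))  # True
```

Note that this only covers the ACL on the closet VLAN. Hairpinning through the proxy to internal sites cannot be stopped here, since that traffic is addressed to the proxy; it has to be controlled on the proxy itself.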
