Understanding Layer 2 over Layer 3 (Part 1)

Author
Peter Welcher
Architect, Operations Technical Advisor

I’ve had Layer 2 on the brain for a while. Or rather, mitigating Layer 2. Several prior blog articles reflect aspects of this:

I’ve got some additional thoughts to share. I’d like to recap the situation as I see it, with lots of useful links. (When I started writing this, I optimistically thought I could deliver some diagrams and conclusions, but framing the setting took enough time and space that the best part will now come in a second blog article.)

Why Layer 2 (L2) is Evil

In a few words: Spanning Tree Protocol (STP) meltdown. I’ve now seen an entire data center go down twice, with UDLD helping spread the joy. In one case, a misconfigured port channel hard-coded “on” in two new access switches drove the data center core switch CPUs way up. The resulting lack of UDLD responses then caused 16 or 18 access 6500s to errdisable their uplinks, and the site didn’t have errdisable timeout configured. At the other site, a high-priority server was built and dual-homed at 10 Gbps directly to the 6500 Sup720-10G core switches, pending the arrival of two 6708 10G blades. Something went wrong (the story gets a bit fuzzy here; there’s no evidence, and no obvious way a bare-bones, unconfigured Windows server install could have bridged two ports together). In both cases, the result was a Spanning Tree loop, UDLD errdisable and/or heavy flooding to servers on 100 Mbps ports, and a data center down for hours.
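As an aside, part of what made the first outage so long was that the errdisabled uplinks stayed down until someone intervened. Purely as an illustration (the interface names and timer value below are my assumptions, not either site’s configuration), pairing aggressive UDLD with errdisable recovery on IOS looks something like this:

    ! IOS sketch: aggressive UDLD plus automatic errdisable recovery.
    ! Interface names and the 10-minute recovery interval are illustrative.
    udld aggressive
    !
    errdisable recovery cause udld
    errdisable recovery interval 600
    !
    interface range TenGigabitEthernet1/1 - 2
     description Uplinks to data center core
     udld port aggressive

With recovery enabled, an errdisabled uplink retries on its own once the interval expires, instead of staying down until someone bounces the port.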

Such STP meltdowns wouldn’t be quite so bad except for a few things:

  • Spanning Tree problems tend to spread across entire VLAN extents.
  • They cause a vast increase in traffic levels (broadcasts and other frames looping).
  • The ports and switches causing the problem can be hard to track down, which takes time.

Routing problems, by contrast, tend to affect only the lost prefix(es), and they tend to reduce traffic rather than increase it.

The other problem I’ve seen in a large L2 campus is that your VLAN numbers become global. You generally end up with a large, many-colored wall chart showing which VLANs go where. When VLANs aren’t localized, modularity breaks down, including modularity of diagrams (core, building A, data center B, etc.). When your network diagram starts requiring an advertising billboard due to its size (half-joking), your network isn’t modular. (Or maybe you or your boss just like big diagrams?) I like 8.5 x 11 or 11 x 17; I can read those in Visio on my PC without mega-zooming.

Why Layer 2 (L2) is Necessary

Actually, it isn’t, most of the time. Closets get along fine with Layer 3 routing, either routed from the closet (access layer) on up, or from the distribution layer on up.
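For the record, routed access is not much configuration per closet. A minimal IOS sketch of a routed closet uplink follows; the addresses, interface name, and choice of EIGRP are illustrative assumptions on my part, not a recommendation for any particular site:

    ! IOS sketch of a routed (Layer 3) access-layer uplink; names and addresses are illustrative.
    interface TenGigabitEthernet1/1
     description Routed uplink to distribution switch
     no switchport
     ip address 10.1.255.1 255.255.255.252
    !
    router eigrp 100
     network 10.0.0.0
     passive-interface default
     no passive-interface TenGigabitEthernet1/1

No trunks, no Spanning Tree on the uplink, and a failure in the closet stays in the closet.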

Data centers are where we need increasing amounts of Layer 2. That’s because Microsoft clusters, Oracle RAC clusters, and VMware VMotion all require Layer 2 adjacency. Cisco is now recommending containing L2 within the access layer if possible, or failing that within an access / distribution pod, with no L2 across the data center core. (And how much of the data center do YOU want to put at risk of STP loops?)

In part due to this, and in part due to the inefficiency of having L2 links that never get used, we have the IETF TRILL (Transparent Interconnection of Lots of Links) effort, based on the RBridge concept from Radia Perlman. The basic idea is to make all of the L2 links usable, rather than having Spanning Tree block the redundant ones. When you’re paying for a 10 Gbps link, you definitely want to be able to use it all of the time!

TRILL links:

Cisco’s short-term answer to that seems to be taking EtherChannel / LACP bundles from 8 links to 16, so you can have Really Big Uplinks. The VSS and vPC technologies allow such EtherChannels to be split across two chassis, increasing their survivability.
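For those who haven’t seen it yet, here is a rough NX-OS sketch of the pieces involved in a vPC-based cross-chassis EtherChannel; the domain number, keepalive addresses, and interface names are illustrative choices of mine:

    ! NX-OS sketch of vPC (virtual Port Channel); all numbers and names are illustrative.
    ! On each of the two vPC peer switches:
    feature vpc
    feature lacp
    !
    vpc domain 10
      peer-keepalive destination 192.168.1.2 source 192.168.1.1
    !
    ! Required peer-link between the two vPC peers:
    interface port-channel 1
      switchport mode trunk
      vpc peer-link
    !
    ! The bundle facing the downstream switch; "vpc 10" ties the two chassis' port-channels together:
    interface port-channel 10
      switchport mode trunk
      vpc 10
    interface Ethernet1/1
      switchport mode trunk
      channel-group 10 mode active

The downstream switch just configures an ordinary LACP port channel across its two uplinks; it has no idea it is talking to two chassis.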

For that matter, VSS plus EtherChannel takes Spanning Tree off the table (mostly, except when your 6500 switches are having a bad day). That’s Yet Another Answer to Spanning Tree woes. Logically, your two switches with bowtie uplinks to both upstream switches look like one switch, one connection, and another switch: no loop, no Spanning Tree blocking.

The Cisco Bridge Assurance feature can also be viewed as carrying on the theme of making Spanning Tree more robust. Since one of my colleagues and friends has already written about it (for Netcordia), let me refer you to Terry Slattery’s blog on the topic, at http://www.netcordia.com/community/blogs/terrys_blog/archive/2010/01/06/what-is-bridge-assurance.aspx.
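The short version, for NX-OS at least: Bridge Assurance rides on the spanning-tree port type. Switch-to-switch links marked as “network” ports must see BPDUs in both directions, or they get blocked. A minimal sketch (the interface name is an illustrative assumption, and both ends of the link need to be configured this way):

    ! NX-OS sketch: Bridge Assurance applies to "network" type ports,
    ! which are blocked if BPDUs stop arriving (interface name is illustrative).
    spanning-tree port type network default
    !
    interface Ethernet1/1
      description Inter-switch link running Bridge Assurance
      spanning-tree port type network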

Design Implications, Data Center and Data Center Interconnect

Where is this technology headed, in terms of design? It looks to me like “small” or “moderate” amounts of L2 at the data center access layer, possibly extended through the distribution layer where needed for scaling or migration. (As a physicist might say, “for various values of ‘small’”.) That is, “small” may become larger as time goes on and the technology matures.

There are two situations I know of where the L2 need can be more severe:

  1. Data center migrations, where servers or Virtual Machines (VMs) need to move to servers in another part of the data center (across the L3 core).
  2. Data Center Interconnect (DCI), where you need the same VLAN / subnet at two or more data centers, for cluster heartbeat or VMotion, typically.

DCI is sometimes used for “geocluster” applications. I love the term! (And Cisco has a couple of slightly older but good documents tying SAN into the discussion as well; google “geocluster site:cisco.com”. I liked the Design Guides.)

The above are situations where you’ve carefully bounded “failure domains” with L3, but you need controlled, safe L2 connectivity across the L3 in the middle, preferably in such a way that a Spanning Tree meltdown in one data center doesn’t take out the Business Continuity / Disaster Recovery (BC/DR) data center.

Cisco has published a number of ways to tackle the DCI setting (various documents in the SRND / Design Zone series; see the top hits when you Google “dci site:cisco.com”). The technology choices include optical technologies, QinQ, VPLS, EoMPLS, EoMPLS with semaphores, and so on. For a good summary document, see Data Center Interconnect (DCI): Layer 2 Extension Between Remote Data Centers, at http://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps708/white_paper_c11_493718.html. One consideration leading to complexity is High Availability / redundancy, compounded by the recommendation not to run Spanning Tree Protocol between data centers.

The latest addition (which looks pretty clever, powerful, and well-thought out) is OTV (Overlay Transport Virtualization), currently only available on the Nexus 7000 series. Rumor has it that OTV really stands for “Over the Top Virtualization”. In any case, OTV looks like the cleanest and lowest user complexity among the solutions I’ve seen in print. I’m not sure I’d want to scale it to 6 or 12 data centers. At least not for the next week or two. Or 6-12 months.
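To give a flavor of the “low user complexity” claim: as I understand what has been described so far, the OTV configuration on a Nexus 7000 amounts to just a few lines per site, along the lines of the sketch below. The interface names, multicast groups, site VLAN, and extended VLAN range are illustrative assumptions on my part, and details may well change as the feature matures:

    ! NX-OS sketch of OTV at one site; all names, groups, and VLANs are illustrative.
    feature otv
    otv site-vlan 99
    !
    interface Overlay1
      otv join-interface Ethernet1/1
      otv control-group 239.1.1.1
      otv data-group 232.1.1.0/28
      otv extend-vlan 100-110
      no shutdown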

For what information is presently available about OTV, see:

Within the data center, EoMPLS is a fairly simple and workable solution, as long as you either don’t insist on redundant pseudowires or are prepared to deal with the ensuing complexity. (Using VSS chassis on both ends might help.)
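To show just how easy: a basic, non-redundant port-mode EoMPLS pseudowire is essentially one command on the attachment circuit, assuming an IGP and MPLS/LDP are already running between the loopbacks. The peer address, VC ID, and interface in this IOS sketch are illustrative assumptions:

    ! IOS sketch: port-mode EoMPLS pseudowire; peer loopback and VC ID are illustrative.
    interface GigabitEthernet1/1
     description Attachment circuit to be extended at Layer 2
     xconnect 10.0.0.2 100 encapsulation mpls

Everything received on that port is tunneled to the matching xconnect on the far-end device, which is exactly why it is so easy to overdo.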

Let me also throw in “Long Distance VMotion”, which is another highly desirable capability that DCI opens up (at least within the distances tested). Reference: Virtual Machine Mobility with VMware VMotion and Cisco Data Center Interconnect Technologies, at http://www.cisco.com/en/US/solutions/collateral/ns340/ns517/ns224/ns836/white_paper_c11-557822.pdf.

Conclusion 

Putting together my first two section headings, I think we can safely conclude that L2 is a Necessary Evil. (Mostly joking!)

What I propose to examine in my next blog on this topic (complete with diagrams) is the implications in terms of traffic flows. There are some definite performance implications that you will want to understand. The technologies mentioned above strike me as a classic case of “just because you can do it doesn’t mean you should do it.” EoMPLS is so easy that I worry about the “beer effect” (too much leads to a headache). With L2 DCI, I see the potential for lots of good consulting work diagnosing mysterious performance issues. Well, maybe mysterious only to those who didn’t take the time to understand the implications of the technology (or read the next blog).

Stay tuned!
