I hope you are not thinking, “What’s this about OTV and DCI needing defenses?” But if this question puzzles you, this blog is for you. The purpose of this blog is to make sure everyone who reads it is aware that Data Center Interconnect (DCI) techniques, and OTV (Overlay Transport Virtualization) in particular, do not protect your network from cross-data center STP (Spanning Tree Protocol) problems.
The older DCI techniques and recommended designs go to some (complicated) lengths to prevent a STP loop when there are two or more DCI links or virtual links. OTV goes further, in that the AED (Authoritative Edge Device) solves the potential loop issue simply, and OTV inherently does not extend the STP (BPDU) domain between data centers. STP isolation is good since the bigger the STP domain, the less stable it tends to be. (See also “root bridge war”.)
BUT: Just because you are running OTV still does not mean you’re safe from STP impacts!!!
Besides DCI/OTV design, you also still need to think about safety measures: defenses. If (when?) a STP loop happens in one datacenter, what protects the other one?
I’ve run into some people who think you need to be part of the loop to experience the ill effects of a STP loop. Not so! The looped links generate the major torrent of BUM (Broadcast, Unknown unicast, Multicast) traffic. But that traffic floods everywhere within its VLAN. If that VLAN extends to your other datacenter, say via OTV, whammo! Your other data center also experiences massive traffic.
The following figure illustrates this “spillover” effect.
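The spillover can also be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the switch names (A, B, C, OTV1, OTV2, D) and the `flood` function are hypothetical, every link carries the same VLAN, nothing is blocked by STP, and each switch simply copies a broadcast out of every port except the one it arrived on. It is not a model of how OTV itself forwards traffic.

```python
from collections import Counter


def flood(links, source, rounds):
    """Flood one broadcast frame through a VLAN with no STP blocking.

    Returns a list with one Counter per round, mapping each switch to
    the number of frame copies it received in that round.
    """
    neighbors = {}
    for a, b in links:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Each entry is (switch holding a copy, neighbor it arrived from).
    in_flight = [(source, None)]
    history = []
    for _ in range(rounds):
        arrivals = []
        for switch, ingress in in_flight:
            for peer in sorted(neighbors[switch]):
                if peer != ingress:  # never send back out the ingress port
                    arrivals.append((peer, switch))
        history.append(Counter(sw for sw, _ in arrivals))
        in_flight = arrivals
    return history


# Hypothetical topology: A-B-C is a physical loop in datacenter #1,
# C reaches datacenter #2 over the extended VLAN, D is a switch there.
LOOPED = [("A", "B"), ("B", "C"), ("A", "C"),
          ("C", "OTV1"), ("OTV1", "OTV2"), ("OTV2", "D")]
LOOP_FREE = [link for link in LOOPED if link != ("A", "B")]

if __name__ == "__main__":
    for name, links in (("looped", LOOPED), ("loop-free", LOOP_FREE)):
        copies_at_d = sum(r["D"] for r in flood(links, "A", 30))
        print(f"{name}: copies of one broadcast seen by D in 30 rounds = {copies_at_d}")
```

In the loop-free case D sees the broadcast once and the flood dies out. With the loop in place, D is not part of the loop at all, yet it keeps receiving copies of that single frame for as long as the loop persists. A real storm is worse, since hosts keep adding new broadcasts.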
If datacenter #1 has shiny new Nexus gear with 10 G NICs, and you have a 10 G dark fiber to datacenter #2, any old Sup2-based Cisco 6500 switches there are not going to like it, in a major bad way. This is something to look out for, especially in old to new migration scenarios.
So don’t think that OTV “contains STP to one datacenter” suffices. Not so! Yes, that is an advantage of OTV. But it means STP BPDUs and topology, NOT the spillover effects of traffic. Large scale STP is nasty, with timing effects, so confining the STP tree topology and BPDUs in particular to a single datacenter makes it more robust, less likely to “lose it” or have “root bridge wars”. But the semantics (meaning, expected behavior) of a VLAN require BUM flooding.
Yes, with OTV Cisco proxies ARP to cut the BUM traffic some. That might help contain any looping ARP traffic. Which is a lot of what I’ve seen in packet captures from STP loops. But even the rest of the BUM traffic can be enough to be a real problem. So protect yourself!
STP Defensive Measures
Now that I’ve got your attention, what’s the solution? The issue isn’t STP, so tools like BPDU Guard etc. aren’t relevant. The problem is the flooding.
Tools for dealing with that: hardware and software rate limiting, particularly on the more powerful switches. Control Plane Policing (CoPP). And yes, the Sup2 does rate limiting in software, and I’m told that by the time it kicks in the CPU is already toast (rendered useless).
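To make the idea of rate limiting concrete, here is a minimal token-bucket policer sketch. It is a generic textbook model, not a description of how any particular Cisco platform implements storm control or CoPP. The class name, the function name, and the rate and burst figures are all hypothetical, chosen only for illustration.

```python
class TokenBucket:
    """Generic token-bucket policer: forwards a frame only if a token is available."""

    def __init__(self, rate_pps, burst):
        self.rate = rate_pps        # tokens added per second (assumed value)
        self.burst = burst          # bucket depth, i.e. largest tolerated burst
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous arrival, in seconds

    def allow(self, now):
        """Return True if a frame arriving at time `now` is forwarded."""
        elapsed = now - self.last
        self.last = now
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                # out of tokens: the frame is dropped


def police(arrival_times, rate_pps, burst):
    """Count how many frames from a stream of arrival times get forwarded."""
    bucket = TokenBucket(rate_pps, burst)
    return sum(bucket.allow(t) for t in arrival_times)


if __name__ == "__main__":
    # One second of storm traffic at 100,000 frames/s versus a normal
    # trickle at 500 frames/s, both policed to 1,000 frames/s.
    storm = [i / 100_000 for i in range(100_000)]
    normal = [i / 500 for i in range(500)]
    print("storm frames forwarded: ", police(storm, rate_pps=1000, burst=100))
    print("normal frames forwarded:", police(normal, rate_pps=1000, burst=100))
```

The point of the sketch is the asymmetry: traffic below the configured rate passes untouched, while a storm is clipped to roughly the configured rate plus the burst allowance. Where the policing runs matters as much as the rate, which is why a software limiter on a CPU that is already overwhelmed helps so little.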
It turns out one of my Chesapeake NetCraftsmen colleagues, Augustine Traore, did some interesting lab testing to see how effective various STP defensive measures are. For the results, and also some ideas as to what you can do to protect your network, see his blog.
I and others have written about risk, and how L3 separation is a bit more robust than L2. Fewer, simpler failure modes. For more about this, see Ivan Pepelnjak’s blogs at ipspace.net. If you’re still contemplating L2 DCI (or your boss is), presumably you have a business reason to do so. Meaning clusters, DR via VMware, VMotion, or datacenter migration are causing you to require L2 between datacenters.
If you’re doing a Data Center Interconnect design, do go ahead and think about what technique to use, which devices and code releases support it, is it mature enough, how do I configure it, etc. And with most of the DCI techniques other than OTV, you’ll want to think about how to provide redundancy while not creating a STP loop or having STP block one of the redundant links.
But also think defensively! And do take some precautions along the lines outlined above or in Augustine’s blog.
I googled a bit; here are some interesting articles, about either prevention or finding the cause of a STP loop.
Spanning Tree Loop Troubleshooting and Safeguards, at https://supportforums.cisco.com/docs/DOC-14223
The Case of the Spanning Tree Problem by Fred Baker (Cisco), at http://tcpmag.com/archives/article.asp?editorialsid=20