After I wrote about the dangers of spanning Layer 2 between data centers in my blog post titled Spanning Tree Protocol and Failure Domains, Shivlu Jain (www.mplsvpn.info) asked about handling requests to span Layer 2 between data centers. One of his concerns was how to handle HA (High Availability) server configurations that require network connectivity within the same broadcast domain (subnet). Another question he had was about the deployment of FCIP (Fibre Channel over IP), where the data center staff says that they need Layer 2 connectivity to implement it across data centers. Shivlu asks “How to avoid it?”
The data center staff and managers typically don’t understand the risks associated with various technologies and the impact that those risks can have on the business. The blog I posted a few weeks ago described one organization whose business was down for several hours during business hours because of a spanning tree loop. So, how do we network engineers handle the request that we know puts the organization in danger of a major outage?
My reply to Shivlu was to educate the non-network staff on the risk of an L2 spanning tree loop. Point them to my blog post. Bring in the SEs from your network equipment vendors (Cisco, Juniper, etc.), who should have examples of other businesses that had damaging outages due to spanning tree loops between data centers. You may even be able to get someone from an organization that has experienced such an outage to talk with your data center and management team to help them understand the risk, how long an outage might last, and what it took to troubleshoot and remediate it.
There are a number of analyst reports on the cost of downtime. It varies from industry to industry, so you may need to use a downtime calculator (they are also available on the web) to estimate the cost of downtime to your business. Calculate the cost per hour and then determine the number of hours that it might take to diagnose and correct the problem. Four to eight hours of downtime per event are reasonable figures, especially when you realize that you’ll have to physically go to the data centers and take down links to isolate the source of the problem.
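The arithmetic above can be sketched in a few lines of Python. This is only an illustration: the function name and the $100,000 per hour figure are assumptions for the example, not numbers from any analyst report. The four and eight hour defaults are the range suggested above.

```python
from __future__ import annotations


def downtime_cost(cost_per_hour: float,
                  hours_low: float = 4.0,
                  hours_high: float = 8.0) -> tuple[float, float]:
    """Return the (low, high) estimated cost of a single outage event.

    cost_per_hour comes from your own downtime calculator; the default
    duration range is the four to eight hours discussed in the text.
    """
    if cost_per_hour < 0 or hours_low < 0 or hours_high < hours_low:
        raise ValueError("costs and durations must be non-negative and ordered")
    return cost_per_hour * hours_low, cost_per_hour * hours_high


if __name__ == "__main__":
    # Hypothetical business losing $100,000 per hour of downtime.
    low, high = downtime_cost(100_000)
    print(f"Estimated cost per event: ${low:,.0f} to ${high:,.0f}")
```

Putting a dollar range like this in front of management is usually more persuasive than a description of the failure mode.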
Having a pre-planned set of interfaces that you take down to isolate sections of the network can significantly reduce your troubleshooting time. An out-of-band network connection may allow you some remote access. But beware: the CLI is unusable in switches that are within a spanning tree loop failure domain. The CPU in the switches will be running at 100% load, processing the BPDUs that are cycling through the loop, so the CLI doesn’t get any CPU and you can’t enter commands via a VTY or even the console. So you will probably need to unplug certain interfaces in order to regain control of parts of the network. That’s where having a pre-identified set of interfaces to take down will allow you to quickly isolate the problem.
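One way to build that pre-identified list ahead of time is to derive it from your link inventory. The sketch below assumes you keep a simple list of links and a mapping of each switch to its site; the switch names, port names, and data layout are all hypothetical. It returns the interfaces at both ends of every inter-data-center link, which are the ones to unplug first to split the failure domain.

```python
from __future__ import annotations


def isolation_plan(links: list[tuple[str, str, str, str]],
                   site_of: dict[str, str]) -> list[tuple[str, str]]:
    """List the (switch, port) pairs to take down to separate the sites.

    Each link is (switch_a, port_a, switch_b, port_b). A link is part of
    the plan when its two switches sit in different sites.
    """
    plan: list[tuple[str, str]] = []
    for sw_a, port_a, sw_b, port_b in links:
        if site_of[sw_a] != site_of[sw_b]:
            plan.append((sw_a, port_a))
            plan.append((sw_b, port_b))
    return plan


if __name__ == "__main__":
    # Hypothetical inventory: one inter-site link, one intra-site link.
    links = [
        ("dc1-core", "Eth1/1", "dc2-core", "Eth1/1"),
        ("dc1-core", "Eth1/2", "dc1-access", "Eth1/1"),
    ]
    site_of = {"dc1-core": "DC1", "dc1-access": "DC1", "dc2-core": "DC2"}
    for switch, port in isolation_plan(links, site_of):
        print(f"unplug {switch} {port}")
```

Print the result and keep it with the physical runbook, since you should not count on reaching the switches remotely during a loop.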
Back to the education of the non-network staff. If you have a high-availability requirement, you should be testing it periodically. You should have a way to automatically fail over to the backup data center and should test this capability at least quarterly so that you know that it works and can resolve any problems that prevent it from working properly. If you’re not testing it, then you don’t know if you really have a backup data center; you just have two data centers and neither may be a true backup for the other.
Take one of the planned testing windows and create a test of L2 spanning between data centers to see whether it is resilient against spanning tree loops. During the outage window, have the applications team monitor access to the key business applications. Span one or more VLANs between the data centers. Then create a spanning tree loop as a test of data center resiliency. Make sure that the backup data center is still accessible and that the applications are still operating correctly. Your NMS should be monitoring what happens and you should see a known set of alarms created. If the number of alarms is different from what you expect, spend time to determine why. Understand the signature that the NMS provided during your testing so that you can quickly identify one in the future.
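Comparing the alarms you expected with the alarms you actually received is easy to automate. The following is a minimal sketch, assuming you can export the alarms from your NMS as a list of type names; the alarm names here are made up for illustration and do not correspond to any particular product.

```python
from __future__ import annotations

from collections import Counter


def compare_alarms(expected: list[str],
                   observed: list[str]) -> tuple[dict[str, int], dict[str, int]]:
    """Return (missing, unexpected) alarm counts keyed by alarm type.

    missing: alarms you expected but did not see (or saw fewer of).
    unexpected: alarms you saw but did not expect (or saw more of).
    """
    exp, obs = Counter(expected), Counter(observed)
    # Counter subtraction keeps only the positive differences.
    return dict(exp - obs), dict(obs - exp)


if __name__ == "__main__":
    # Hypothetical alarm types recorded during a loop test.
    expected = ["link-down", "high-cpu", "high-cpu"]
    observed = ["high-cpu", "mac-flap"]
    missing, unexpected = compare_alarms(expected, observed)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

Save the observed list from each test; it becomes the signature to match against when a real loop occurs.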
Of course, if the test takes out both data centers, you have done your education. To counter the argument that the specific test case you used will not occur in production, document the various ways that an STP loop can be created. Of course, adding a new switch in the wrong place and plugging it into the network incorrectly will do the job. Another way, that may not be immediately obvious, is to configure a dual-homed server such that its two NICs are bridging.
And don’t forget to check the root bridge placement. If you don’t have a root bridge explicitly selected, insert a low-powered switch with a low MAC address and see how your data center runs. Again, do this only during an approved outage window or you’ll have experienced a “career limiting move.” Better yet, have a lab that emulates your production environment (yes, it is expensive, until you look at the cost of downtime) and do your testing there. Or you may be able to configure a similar network scenario at your vendor’s testing labs.
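To see why a low MAC address matters, recall that spanning tree elects as root the bridge with the lowest bridge ID, which is the configured priority followed by the MAC address. The sketch below is a simplified model of that comparison only: it ignores per-VLAN extended system IDs, and the switch names and MAC addresses are invented for the example. The default priority of 32768 is the standard 802.1D default.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Bridge:
    name: str
    mac: str               # e.g. "00:1b:54:aa:00:01"
    priority: int = 32768  # 802.1D default bridge priority

    def bridge_id(self) -> tuple[int, int]:
        """Priority first, then MAC as an integer; lower wins."""
        return (self.priority, int(self.mac.replace(":", ""), 16))


def elect_root(bridges: list[Bridge]) -> Bridge:
    """Return the bridge that would win the root election."""
    return min(bridges, key=Bridge.bridge_id)


if __name__ == "__main__":
    # With default priorities everywhere, the lowest MAC wins,
    # even if it belongs to an old closet switch.
    core = Bridge("dc1-core", "00:1b:54:aa:00:01")
    closet = Bridge("old-closet-switch", "00:00:0c:00:00:01")
    print(elect_root([core, closet]).name)

    # Setting a lower priority on the core pins the root where you want it.
    core_pinned = Bridge("dc1-core", "00:1b:54:aa:00:01", priority=4096)
    print(elect_root([core_pinned, closet]).name)
```

The second case is the point of the exercise: an explicitly configured priority on the intended root keeps a newly attached switch from taking over.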
Finally, vendors are doing something about allowing L2 to span between data centers, because of the demand by server operations teams to do things like vMotion between data centers. However, the features are still relatively new. Cisco has announced OTV (http://blogs.cisco.com/datacenter/comments/under_the_covers_with_otv/) for the Nexus 7000 product line. Cisco also has whitepapers, such as “Data Center Interconnect (DCI): Layer 2 Extension Between Remote Data Centers” (https://www.cisco.com/en/US/prod/collateral/switches/ps5718/ps708/white_paper_c11_493718.html).
The data center teams can build high availability systems, but they need to know the failure modes and avoid designs that make their systems vulnerable to common problems. It is best to assemble a cross-discipline team of networking, server, and applications experts to design good systems that are resilient in the face of one or two failures. The big financial firms do this and it works for them. Check out the High Availability Networks talk at Cisco Live for additional tips.
If all else fails, put your risk analysis and recommendation in writing. This documents the risk and makes people more averse to taking it on, especially if they don’t understand it.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html