Spanning Tree Protocol and Failure Domains

Author
Terry Slattery
Principal Architect

Co-worker Pete Welcher recently helped a customer whose network experienced a spanning tree loop (i.e., a meltdown).  Several lessons can be drawn from the experience and from thinking about how to avoid a repeat.

Lesson #1:  Adequately plan the task.
Rush jobs carry a higher risk of problems than well-planned tasks.  In this case, the server operations team needed to bring up a new server for a project and decided not to wait for an access-layer switch to provide connectivity.  Instead, a couple of ports on a core switch were used.  Something (not sure what) was misconfigured and a spanning tree loop was created.  The new server had high-speed interfaces, so nothing limited the volume of traffic that was forwarded around the loop.

Lesson #2: Limit failure domain size.
The new server was connected to the data center core switch, and the resulting forwarding loop took out the entire data center network, affecting all business operations.  Implementing smaller spanning tree domains would limit the scope of potential failures, allowing unaffected business operations to keep running.  Such separation may need to be pushed into the distribution or access layer to prevent a potential spanning tree loop from touching the core switches.
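To make the idea of a failure domain concrete, here is a minimal sketch that groups switches into layer-2 domains from a link inventory.  The switch names, the "l2"/"l3" link labels, and the function name are all hypothetical; the only assumption is that a routed (layer-3) link acts as a boundary that a spanning tree loop cannot cross.

```python
from collections import defaultdict


def failure_domains(links):
    """Group switches into layer-2 failure domains.

    links: iterable of (switch_a, switch_b, kind), where kind is "l2" or "l3".
    Only "l2" links extend a spanning tree domain; "l3" links are boundaries.
    Returns a list of sorted switch-name lists, one per domain.
    """
    adjacency = defaultdict(set)
    nodes = set()
    for a, b, kind in links:
        nodes.update((a, b))
        if kind == "l2":
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    domains = []
    for start in sorted(nodes):
        if start in seen:
            continue
        # Depth-first walk over layer-2 links only.
        domain = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in domain:
                continue
            domain.add(node)
            stack.extend(adjacency[node] - domain)
        seen |= domain
        domains.append(sorted(domain))
    return domains


# Hypothetical flat design: layer 2 everywhere, including the core.
FLAT = [
    ("core1", "core2", "l2"),
    ("core1", "acc1", "l2"),
    ("core2", "acc2", "l2"),
]

# Hypothetical routed design: layer 2 stops at the distribution layer.
ROUTED = [
    ("core1", "core2", "l3"),
    ("core1", "dist1", "l3"),
    ("core2", "dist2", "l3"),
    ("dist1", "acc1", "l2"),
    ("dist2", "acc2", "l2"),
]

print(failure_domains(FLAT))
print(failure_domains(ROUTED))
```

In the flat inventory every switch lands in one domain, so a loop anywhere can reach the core.  In the routed inventory each access block is its own domain and the core switches stand alone.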

Lesson #3: Implement safety mechanisms.
Take advantage of safety mechanisms like UDLD, Loop Guard, Root Guard, and BPDU Guard to prevent the formation of STP loops.  These mechanisms reduce the chance of a loop forming, but they are not a replacement for limiting the size of an STP domain.  By limiting the size of an STP domain, you limit the number of systems affected by a failure.
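One way to check that a safety mechanism is actually deployed is to audit saved switch configurations.  The sketch below assumes Cisco IOS-style configuration text, treats any interface with "switchport mode access" as an access port, and reports those lacking "spanning-tree bpduguard enable".  The function names and the sample configuration are hypothetical, and the check ignores global defaults (for example, BPDU Guard enabled globally for PortFast ports), so treat its output as a starting point rather than a verdict.

```python
def parse_interfaces(config_text):
    """Return {interface_name: [stripped config lines]} from IOS-style text.

    Assumes interface stanzas start with "interface <name>" in column 0
    and that their settings are indented on the following lines.
    """
    interfaces = {}
    current = None
    for raw in config_text.splitlines():
        if raw.startswith("interface "):
            current = raw.split(None, 1)[1].strip()
            interfaces[current] = []
        elif raw.startswith((" ", "\t")) and current is not None:
            interfaces[current].append(raw.strip())
        else:
            # Any unindented line (such as "!") ends the stanza.
            current = None
    return interfaces


def access_ports_missing_bpduguard(config_text):
    """List access ports with no per-interface BPDU Guard statement."""
    missing = []
    for name, lines in parse_interfaces(config_text).items():
        is_access = "switchport mode access" in lines
        has_guard = "spanning-tree bpduguard enable" in lines
        if is_access and not has_guard:
            missing.append(name)
    return missing


# Hypothetical configuration excerpt for illustration only.
SAMPLE_CONFIG = """\
interface GigabitEthernet1/0/1
 switchport mode access
 spanning-tree portfast
 spanning-tree bpduguard enable
!
interface GigabitEthernet1/0/2
 switchport mode access
 spanning-tree portfast
!
interface TenGigabitEthernet1/1/1
 switchport mode trunk
"""

print(access_ports_missing_bpduguard(SAMPLE_CONFIG))
```

The same pattern could be extended to look for Loop Guard, Root Guard, or UDLD statements on the ports where each one applies.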

Lesson #4: Don’t put servers for a single function in a single subnet.
When one broadcast domain is affected by an STP problem or by a denial-of-service attack, the backup servers should remain reachable in a separate subnet, ideally in a backup data center.  Minimizing common infrastructure reduces the opportunity for complete system failure due to the failure of one or two key components.  A key example is the DNS servers, which are required these days for the proper function of many applications (hopefully the apps don’t use hard-coded IP addresses).
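A quick inventory check can flag functions whose servers all share one subnet.  This is a minimal sketch using Python's standard ipaddress module; the inventory format (function name plus address with prefix length), the addresses, and the function name are assumptions made for illustration.

```python
import ipaddress
from collections import defaultdict


def single_subnet_functions(servers):
    """Return the functions whose servers all sit in one subnet.

    servers: iterable of (function, "address/prefix") pairs.
    A function with only one server is reported too, since it has
    no backup in a separate subnet.
    """
    subnets = defaultdict(set)
    for function, cidr in servers:
        # ip_interface keeps the host address and derives its network.
        subnets[function].add(ipaddress.ip_interface(cidr).network)
    return sorted(name for name, nets in subnets.items() if len(nets) < 2)


# Hypothetical inventory for illustration only.
INVENTORY = [
    ("dns", "10.1.10.5/24"),
    ("dns", "10.1.10.6/24"),
    ("ntp", "10.1.20.5/24"),
    ("ntp", "10.2.20.5/24"),
]

print(single_subnet_functions(INVENTORY))
```

Here both DNS servers share 10.1.10.0/24, so "dns" is flagged, while the NTP servers are spread across two subnets.  Separate subnets do not guarantee separate failure domains, so a result like this still needs to be compared against the layer-2 topology.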

Lesson #5: Know your STP topology.
Know your STP topology and how to disconnect sections of it quickly, so that you can identify the part of the network that contains the source of the STP loop.  You can then return the rest of the network to production while you fix the cause of the loop.
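Part of knowing the topology is knowing which links are redundant, because those are the links STP must keep blocked and the places where a loop can form if blocking fails.  The sketch below is a simplification: it builds a loop-free tree with a breadth-first search from a chosen root, whereas real STP elects the root and picks paths using bridge IDs and port costs.  It also assumes the network is connected and has at most one link per switch pair.  The switch names and function name are hypothetical.

```python
from collections import deque


def redundant_links(links, root):
    """Return the links left out of a breadth-first tree rooted at root.

    links: list of (switch_a, switch_b) layer-2 links.
    The returned links are the ones that close a cycle, so each one is a
    candidate to disconnect when isolating a section of the network.
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, []).append(b)
        adjacency.setdefault(b, []).append(a)

    visited = {root}
    tree = set()
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Sort neighbors so the result is deterministic.
        for neighbor in sorted(adjacency.get(node, [])):
            if neighbor not in visited:
                visited.add(neighbor)
                tree.add(frozenset((node, neighbor)))
                queue.append(neighbor)

    return [link for link in links if frozenset(link) not in tree]


# Hypothetical topology: acc1 is dual-homed, acc2 is single-homed.
LINKS = [
    ("core1", "core2"),
    ("core1", "acc1"),
    ("core2", "acc1"),
    ("core1", "acc2"),
]

print(redundant_links(LINKS, root="core1"))
```

With core1 as the root, the core2-to-acc1 link is the one left out of the tree.  Comparing a list like this with the ports that the switches actually report as blocking is one way to confirm that your picture of the topology matches reality before an outage forces the question.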

In summary, STP loops will quickly congest a network and drive the switch CPUs to 100% utilization.  Implement safety mechanisms and topologies to minimize the impact when they occur and be prepared to act quickly to diagnose them when they happen.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html
