Introduction
Should you be concerned about a Layer 2 loop in your campus network? My answer is absolutely. In this article, I explain why you should take steps to protect your campus network, and I provide practical steps for reducing the effects of a Layer 2 loop on a Cisco Catalyst 6500.
Before delving into details, let’s first define what a Layer 2 loop is. A Layer 2 loop occurs in a campus network when more than one active Layer 2 forwarding path exists between two given switches. A switch that receives a broadcast frame floods it out of all its trunk ports and all access ports in the same VLAN, except the port on which the frame arrived. Because Ethernet frames carry no time-to-live field, broadcast frames caught in a loop circulate indefinitely, and each pass through a switch can create additional copies. This amplification is known as a broadcast storm. The large volume of broadcast frames exhausts bandwidth and drives up CPU utilization. A broadcast storm can bring a network to an unusable state, and in certain cases network administrators may even lose the ability to access devices through the console.
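The flooding behavior described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the three four-switch topologies (chain, ring, full mesh) and the function name flood are hypothetical, every switch forwards a broadcast out of all ports except the one it arrived on, and no spanning tree is running.

```python
from collections import defaultdict

def flood(links, origin, rounds):
    """Count broadcast copies in flight after each forwarding round when
    every switch floods out of all ports except the ingress port."""
    neighbors = defaultdict(list)
    for a, b in links:
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Each in-flight copy is (switch holding it, switch it came from).
    in_flight = [(origin, None)]
    history = []
    for _ in range(rounds):
        next_hop = []
        for switch, came_from in in_flight:
            for peer in neighbors[switch]:
                if peer != came_from:
                    next_hop.append((peer, switch))
        in_flight = next_hop
        history.append(len(in_flight))
    return history

# Hypothetical topologies, four switches each.
CHAIN = [("A", "B"), ("B", "C"), ("C", "D")]              # loop-free
RING = CHAIN + [("D", "A")]                               # one loop
MESH = RING + [("A", "C"), ("B", "D")]                    # several loops

print(flood(CHAIN, "A", 5))  # [1, 1, 1, 0, 0]  the frame dies out
print(flood(RING, "A", 5))   # [2, 2, 2, 2, 2]  copies circulate forever
print(flood(MESH, "A", 4))   # [3, 6, 12, 24]   copies multiply each round
```

In the loop-free chain the frame disappears once it reaches the edge. In the ring it never disappears, and with more than one redundant path the number of copies doubles every round, which is the amplification that exhausts bandwidth and CPU.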
The Spanning Tree Protocol (STP) was designed to ensure loop-free Layer 2 topologies. Despite the use of STP, Layer 2 loops can still result from wiring mistakes, misconfigured hosts (bridged interfaces), switch configuration mistakes, and loss of BPDU keepalives.
You may have a good campus network design, but human errors are still possible. It is critical to look beyond what can be prevented and ask yourself the following question: if a Layer 2 loop were to occur, would my campus network be able to sustain it? If the answer is no, or if you are not sure, I encourage you to explore the solution presented in this article to mitigate the impact of a Layer 2 loop.
I got involved in exploring solutions to protect the Catalyst 6500 because one of our customers had experienced Layer 2 issues and wanted to leverage existing Cisco tools to alleviate the effects of Layer 2 loops on their Cisco Catalyst 6500 distribution switches. Their campus network follows a three-layer model with Layer 2 connections between access and distribution switches, and Layer 3 connections between distribution and core switches.
This article presents the steps I took to develop a solution to protect the Catalyst 6500 in the presence of a Layer 2 loop. These steps include the following:
- How to simulate a Layer 2 loop
- How to monitor the control plane traffic
- Control plane policing
- Storm control
- Hardware rate limiting
- Design recommendations
How to Simulate a Layer 2 Loop
Four switches were used in a lab environment to simulate a Layer 2 loop, as illustrated in the figure below. VLAN 10 was created as a user VLAN to simulate communication between two users’ PCs.
The configuration shown below was implemented on all switches to create a Layer 2 loop, specifically a spanning tree loop on VLAN 10.
Distribution-Sw1(config)#no spanning-tree vlan 10
Within one minute after issuing the command above, the CPU utilization had risen to 99% for all four switches. To display the current CPU utilization, the command “show processes cpu” was used, as shown below.
Distribution-Sw1# show processes cpu sorted | exclude 0.00%__0.00%__0.00%
CPU utilization for five seconds: 99%/92%; one minute: 99%; five minutes: 99%
PID Runtime(ms) Invoked uSecs 5Sec 1Min 5Min TTY Process
------------------------ Output omitted ------------------------
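In the summary line of this output, the first five-second figure (99%) is the total CPU utilization and the figure after the slash (92%) is the share spent at interrupt level, which typically reflects traffic being handled by the CPU rather than busy processes. A minimal sketch for extracting these values when polling a switch from a script follows; the function name parse_cpu and the returned key names are assumptions made for illustration.

```python
import re

# Matches the summary line printed by "show processes cpu".
CPU_LINE = re.compile(
    r"CPU utilization for five seconds: (\d+)%/(\d+)%; "
    r"one minute: (\d+)%; five minutes: (\d+)%"
)

def parse_cpu(text):
    """Return the utilization figures from the summary line as integers."""
    match = CPU_LINE.search(text)
    if match is None:
        raise ValueError("no CPU utilization summary found")
    total, interrupt, one_min, five_min = map(int, match.groups())
    return {
        "total_5s": total,                 # all CPU activity
        "interrupt_5s": interrupt,         # time spent at interrupt level
        "process_5s": total - interrupt,   # remainder used by processes
        "one_min": one_min,
        "five_min": five_min,
    }

sample = ("CPU utilization for five seconds: 99%/92%; "
          "one minute: 99%; five minutes: 99%")
print(parse_cpu(sample)["process_5s"])  # 7
```

For the output captured during the loop, only 7 of the 99 percentage points are attributable to processes, which is consistent with a CPU that is busy handling punted traffic.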
With the CPU reaching 99%, routing adjacencies were flapping, and communication within the user VLAN 10 between PC1 and PC2 was no longer possible. A set of measurements was taken, and the results are shown in the figure below.
To develop a good understanding of the type of traffic responsible for high CPU utilization, it was essential to monitor the traffic processed by the control plane.
How to Monitor the Control Plane Traffic
The figure below illustrates the monitoring setup for analyzing traffic that is processed by the Catalyst 6500 control plane. It requires a Switched Port Analyzer (SPAN) session, a host running a network protocol analyzer, and an Ethernet cable.
Control Plane Traffic Monitoring using SPAN
The configuration shown below defines port Gigabit 5/24 as the SPAN destination port for CPU traffic. The traffic processed by the Route Processor (RP) and the Switch Processor (SP) is duplicated on Gigabit 5/24.
Distribution-Sw1(config)#monitor session 2 type local
Distribution-Sw1(config-mon-local)#source cpu rp
Distribution-Sw1(config-mon-local)#source cpu sp
Distribution-Sw1(config-mon-local)#destination interface gigabit 5/24
The command shown below checks the status of the newly created SPAN session.
Distribution-Sw1# show monitor session 2
Session 2
---------
Type : Local Session
Status : Admin Disabled
Egress SPAN Replication State:
Operational mode : -
Configured mode : -
It is important to note that after the CPU SPAN session is created, it defaults to “Admin Disabled”. To make it operational, a “no shut” command is needed on the session.
Distribution-Sw1(config)#monitor session 2
Distribution-Sw1(config-mon-local)#no shut
After the “no shut” command is issued, the SPAN is put in Admin Enabled mode as shown in the output below.
Distribution-Sw1# show monitor session 2
Session 2
---------
Type : Local Session
Status : Admin Enabled
Source Ports :
Both : rp,sp
Destination Ports : Gi5/24
Egress SPAN Replication State:
Operational mode : Centralized
Configured mode : Centralized (default)
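Once frames are being captured on the SPAN destination port, a quick tally by protocol family helps show what is loading the control plane. The sketch below is only an illustration: it assumes the analyzer can export each frame's destination MAC address and EtherType as a pair, and the names classify and tally are hypothetical. The multicast addresses used are the well-known ones for IEEE spanning tree, Cisco PVST+, and the Cisco protocols that share 01:00:0c:cc:cc:cc.

```python
from collections import Counter

# Well-known Layer 2 destination addresses.
WELL_KNOWN_DST = {
    "01:80:c2:00:00:00": "IEEE STP BPDU",
    "01:00:0c:cc:cc:cd": "Cisco PVST+ BPDU",
    "01:00:0c:cc:cc:cc": "CDP/VTP/DTP/PAgP/UDLD",
    "ff:ff:ff:ff:ff:ff": "broadcast",
}
ETHERTYPE_ARP = 0x0806

def classify(dst_mac, ethertype):
    """Map one captured frame to a coarse protocol family."""
    if ethertype == ETHERTYPE_ARP:
        return "ARP"
    return WELL_KNOWN_DST.get(dst_mac.lower(), "other")

def tally(frames):
    """frames: iterable of (destination MAC, EtherType) pairs."""
    return Counter(classify(dst, ethertype) for dst, ethertype in frames)

# Hypothetical capture excerpt.
capture = [
    ("01:80:C2:00:00:00", 0x0026),   # 802.3 length field, not an EtherType
    ("ff:ff:ff:ff:ff:ff", 0x0806),
    ("ff:ff:ff:ff:ff:ff", 0x0800),
    ("01:00:0c:cc:cc:cc", 0x0100),
    ("00:11:22:33:44:55", 0x0800),
]
for family, count in tally(capture).most_common():
    print(family, count)
```

The grouping is deliberately coarse. Separating CDP from VTP or DTP would require inspecting the SNAP protocol ID, which this sketch does not attempt.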
Now that the Layer 2 loop had been simulated, and that the traffic processed by the control plane was being monitored, the next step was to select what tools, or combination of tools could be used to mitigate the impact of the Layer 2 loop. The possible tools considered for the mitigation were:
- Control plane policing
- Storm control
- Hardware rate limiting
Control Plane Policing
Control Plane Policing is a feature in Cisco routers and switches that enables administrators to configure QoS policies to protect the control plane against reconnaissance, denial-of-service (DoS) attacks, and other scenarios that can lead to exhaustion of CPU resources.
To limit the impact on the switch CPUs, control plane policing was applied to limit traffic that could potentially affect CPU utilization and to protect other traffic, such as routing, against CPU resource starvation. Control plane policing uses the same class-map and policy-map commands that you may be familiar with from configuring Quality of Service. The process of creating and applying control plane policing consists of the following three steps:
- Use class-map to classify traffic processed by the control plane
- Use policy-map to apply policing on classified traffic
- Apply policy-map to the control plane
The traffic targeted for classification included EIGRP, HSRP, SSH, SNMP, TACACS, DHCP, IGMP, and PIM. I created the following class maps and the access lists they reference:
class-map match-all class-eigrp
 match access-group name EIGRP
class-map match-all class-mgmt
 match access-group name MGMT
class-map match-all class-hsrp
 match access-group name HSRP
class-map match-all class-pim
 match access-group name PIM
class-map match-all class-igmp
 match access-group name IGMP
class-map match-all class-dhcp
 match access-group name DHCP
ip access-list extended EIGRP
 permit eigrp any host 224.0.0.10
ip access-list extended HSRP
 permit udp any host 224.0.0.2 eq 1985
 permit udp any host 224.0.0.102 eq 1985
ip access-list extended PIM
 permit pim any 224.0.0.0 0.0.0.255
ip access-list extended IGMP
 permit igmp any 224.0.0.0 31.255.255.255
ip access-list extended MGMT
 permit tcp any any eq tacacs
 permit tcp any any eq 22
 permit udp any any eq snmp
 permit icmp any any
ip access-list extended DHCP
 permit udp any eq bootpc any eq bootps
 permit udp any eq bootps any eq bootpc
 permit udp any eq bootps any eq bootps
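The access lists above rely on wildcard masks, where a 1 bit means "ignore this bit" and a 0 bit means "must match". A minimal sketch of that matching rule, useful for checking which destinations an entry really covers, is shown below. The function name wildcard_match is an assumption for illustration, and the test addresses are examples only.

```python
import ipaddress

def wildcard_match(address, base, wildcard):
    """Return True if address matches an IOS-style "base wildcard" pair.

    Bits set to 1 in the wildcard are ignored; bits set to 0 must match.
    """
    addr = int(ipaddress.IPv4Address(address))
    net = int(ipaddress.IPv4Address(base))
    care = ~int(ipaddress.IPv4Address(wildcard)) & 0xFFFFFFFF
    return (addr & care) == (net & care)

# PIM entry: "permit pim any 224.0.0.0 0.0.0.255"
print(wildcard_match("224.0.0.13", "224.0.0.0", "0.0.0.255"))       # True
print(wildcard_match("224.0.1.40", "224.0.0.0", "0.0.0.255"))       # False

# IGMP entry: "permit igmp any 224.0.0.0 31.255.255.255"
print(wildcard_match("239.1.1.1", "224.0.0.0", "31.255.255.255"))   # True
print(wildcard_match("240.0.0.1", "224.0.0.0", "31.255.255.255"))   # True
```

The last result shows that the 31.255.255.255 wildcard only checks the top three bits, so the IGMP entry also covers destinations from 240.0.0.0 upward. A wildcard of 15.255.255.255 would restrict the match to the multicast range 224.0.0.0 through 239.255.255.255, if that tighter scope is what is intended.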
I created a policy-map named copp-policy to apply policing to traffic processed by the control plane. As an example, for DHCP traffic I created the following policy.
class class-dhcp
 police 32000 conform-action transmit exceed-action drop
In the policy statement above, the switch processes DHCP traffic up to a threshold of 32000 bits/sec, and any traffic in excess of this threshold is dropped by the switch.
The values used for policing were based on measurement of actual traffic in the operational network (using the SPAN port). Below is a complete list of policing statements applied to classified traffic.
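The behavior of a policer with "conform-action transmit exceed-action drop" can be approximated with a token bucket. The following is a simplified sketch, not the switch's actual implementation: the burst size of 1500 bytes and the 350-byte packet size are assumptions chosen for illustration (the policy statement above does not specify a burst), and the function name police is hypothetical.

```python
def police(packets, rate_bps, burst_bytes):
    """Single-rate token-bucket policer: transmit on conform, drop on exceed.

    packets: list of (arrival_time_seconds, size_bytes), sorted by time.
    Returns one "transmit" or "drop" decision per packet.
    """
    tokens = float(burst_bytes)   # the bucket starts full
    last = 0.0
    decisions = []
    for arrival, size in packets:
        # Refill at rate_bps / 8 bytes per second, capped at the burst size.
        tokens = min(burst_bytes, tokens + (arrival - last) * rate_bps / 8)
        last = arrival
        if size <= tokens:
            tokens -= size
            decisions.append("transmit")
        else:
            decisions.append("drop")
    return decisions

# Ten back-to-back 350-byte packets, then one more a second later.
burst = [(0.0, 350)] * 10 + [(1.0, 350)]
result = police(burst, rate_bps=32000, burst_bytes=1500)
print(result.count("transmit"), result.count("drop"))  # 5 6
```

The first four packets fit within the assumed burst allowance, the next six are dropped, and the packet arriving one second later is transmitted because the bucket has refilled at 4000 bytes per second.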
policy-map copp-policy
 ! protection of EIGRP traffic
 class class-eigrp
  police 32000 conform-action transmit exceed-action transmit
 ! protection of HSRP traffic
 class class-hsrp
  police 32000 conform-action transmit exceed-action transmit
 class class-mgmt
  police 512000 conform-action transmit exceed-action drop
 class class-pim
  police 32000 conform-action transmit exceed-action drop
 class class-igmp
  police 100000 conform-action transmit exceed-action drop
 class class-dhcp
  police 32000 conform-action transmit exceed-action drop
 class class-default
  police 2000000 conform-action transmit exceed-action drop
After the creation of the policy map, I applied it to the control plane as shown below.
control-plane
 service-policy input copp-policy
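One way to read the policy is to separate the classes that are actually capped (exceed-action drop) from those that are only measured (exceed-action transmit). The sketch below restates the rates from the policy map as a Python dictionary and sums the sustained rate of the capped classes; burst allowances are ignored, and the names POLICY and summarize are introduced here for illustration.

```python
# (rate in bits per second, exceed-action) per class, as in copp-policy.
POLICY = {
    "class-eigrp":   (32_000, "transmit"),
    "class-hsrp":    (32_000, "transmit"),
    "class-mgmt":    (512_000, "drop"),
    "class-pim":     (32_000, "drop"),
    "class-igmp":    (100_000, "drop"),
    "class-dhcp":    (32_000, "drop"),
    "class-default": (2_000_000, "drop"),
}

def summarize(policy):
    """Return (sum of capped rates in bps, sorted names of uncapped classes)."""
    capped_total = sum(rate for rate, exceed in policy.values()
                       if exceed == "drop")
    uncapped = sorted(name for name, (_, exceed) in policy.items()
                      if exceed == "transmit")
    return capped_total, uncapped

total, uncapped = summarize(POLICY)
print(total)     # 2676000
print(uncapped)  # ['class-eigrp', 'class-hsrp']
```

The capped classes can present at most about 2.7 Mb/s of sustained traffic to the CPU, while EIGRP and HSRP are never dropped by this policy, which matches the intent of protecting routing and first-hop redundancy traffic.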
After applying the policy map to distribution switch control planes, the CPU utilization was reduced from 99% to an average of 92% as shown in the figure below.
With control plane policing applied, communication was now possible within user VLAN 10 between PC1 and PC2. Despite the CPU utilization reduction, intermittent packet losses were observed.
To further alleviate the impact of the Layer 2 loop, the next step was to use an additional tool. Because a Layer 2 loop amplifies large volumes of broadcast and multicast traffic, as explained earlier, storm control was the logical choice.
Storm Control
Traffic storm control is a feature in Cisco switches that can be used to monitor broadcast, multicast, and unicast traffic levels entering a given interface over a 1-second interval. Traffic gets dropped during the monitoring interval when configured thresholds are exceeded.
After observing the amount of multicast and broadcast traffic in the operational network (using the SPAN port), it was estimated that 5% of the total bandwidth was sufficient to accommodate all broadcast and multicast traffic. The following configuration was applied to all switch trunk ports to set the storm control threshold to 5%.
Distribution-Sw1(config)#int range f0/21 - 24
Distribution-Sw1(config-if-range)#storm-control multicast level 5
Distribution-Sw1(config-if-range)#storm-control broadcast level 5
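To make the 5% level concrete, the sketch below converts it to a bit rate and applies it to a sequence of 1-second intervals. It is a simplified model under stated assumptions: the trunk ports are taken to be 100 Mb/s FastEthernet links, a single traffic type is considered in isolation, and traffic above the threshold within an interval is simply discarded. The function names are hypothetical.

```python
def storm_threshold_bps(link_bps, level_percent):
    """Convert a storm-control level (percent of bandwidth) to bits per second."""
    return link_bps * level_percent / 100

def forwarded_per_interval(offered_bits, link_bps, level_percent):
    """offered_bits: bits of the monitored traffic type offered in each
    1-second interval. Returns the bits forwarded in each interval."""
    limit = storm_threshold_bps(link_bps, level_percent)
    return [min(bits, limit) for bits in offered_bits]

FAST_ETHERNET = 100_000_000  # assumed link speed in bits per second

print(storm_threshold_bps(FAST_ETHERNET, 5))  # 5000000.0
# Normal second, storm second, normal second:
print(forwarded_per_interval([1_000_000, 80_000_000, 3_000_000],
                             FAST_ETHERNET, 5))
# [1000000, 5000000.0, 3000000]
```

Under these assumptions, normal broadcast levels pass untouched, while a storm offering 80 Mb/s is cut to 5 Mb/s before it can reach the rest of the VLAN or the CPU.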
Combining control plane policing with storm control resulted in a significant improvement in CPU utilization, as shown in the figure below.
Communication within the user VLAN 10 between PC1 and PC2 became normal with no packet losses. Although CPU utilization was below 40% on average, spikes of up to 95% were observed for short periods. An examination of the traffic processed by the distribution switch control planes revealed large amounts of BPDU, VTP, STP, and ARP traffic. To reduce the volume of this traffic, the hardware rate-limiting tool available on the Catalyst 6500 was selected as the next improvement step.
Hardware Rate Limiting
The Cisco Catalyst 6500 switches with a Supervisor 720 or Supervisor 32 engine provide hardware rate limiters that can be used to limit specific Layer 2 and Layer 3 traffic that is processed by the control plane. The advantage of hardware rate limiters is that their use does not impact CPU utilization.
The following configuration was implemented on distribution switches. The selected parameters were based on measurement of actual traffic in the operational network (using the SPAN port).
Distribution-Sw1(config)#mls rate-limit layer2 pdu 620 10
Distribution-Sw1(config)#mls qos protocol ARP police 64000 2000
The command “mls rate-limit layer2 pdu 620 10” rate limits Layer 2 PDU protocol packets (including BPDUs, DTP, PAgP, CDP, STP, and VTP packets) to a maximum of 620 packets per second with a burst of 10 packets.
The command “mls qos protocol ARP police 64000 2000” rate limits ARP packets to a maximum of 64000 bps with a burst of 2000 bytes.
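A quick calculation helps relate these two limits to each other, since one is expressed in packets per second and the other in bits per second. The sketch below is back-of-the-envelope arithmetic only: it assumes each ARP packet is counted as a 64-byte minimum-size Ethernet frame, which may differ from how the hardware accounts for frame length, and it uses the default STP hello time of 2 seconds. The function names are hypothetical.

```python
def packets_per_second(rate_bps, frame_bytes):
    """Packets per second that fit in a bit-rate limit for a fixed frame size."""
    return rate_bps / (frame_bytes * 8)

def bpdu_streams_supported(pps_limit, hello_seconds):
    """Number of periodic PDU senders (one PDU every hello_seconds) that fit
    within a packets-per-second limit."""
    return pps_limit * hello_seconds

# ARP policer: 64000 bps with an assumed 64-byte frame.
print(packets_per_second(64_000, 64))   # 125.0

# Layer 2 PDU limiter: 620 pps against PDUs sent every 2 seconds.
print(bpdu_streams_supported(620, 2))   # 1240
```

Under these assumptions the ARP policer admits roughly 125 ARP packets per second, and the PDU limiter leaves room for about 1240 senders each emitting one PDU every 2 seconds. Comparing such figures against measured baselines is one way to check that a limit does not cut into legitimate traffic.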
By combining control plane policing, storm control, and hardware rate limiting, the CPU utilization was reduced to an average of 5% with a maximum peak of 28%, as shown in the figure below.
With such a reduction of CPU utilization, no perceptible impact was observed on the performance of switches and the user VLAN 10 despite the presence of the Layer 2 loop.
Results from All Scenarios
The figure below depicts the results from all scenarios and allows a visual comparison of what each combination of tools achieved.
Design Recommendations
The following are some basic recommendations that can help reduce the risk of Layer 2 loops:
- Limit VLANs to a single wiring closet, whenever possible.
- Use UniDirectional Link Detection (UDLD) aggressive mode to prevent spanning tree loops caused by unidirectional links.
- As applicable, use the BPDU guard, loop guard, and root guard switch features to protect against undesirable changes in the spanning tree topology.
- Disable the Dynamic Trunking Protocol (DTP) by using the “switchport nonegotiate” command on switch ports to prevent automatic negotiation of trunks and access ports.
- Assign unused ports to an unused VLAN, and set their administrative mode to shutdown to prevent unauthorized users from connecting devices to them.
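Two of these recommendations, disabling DTP and shutting down unused ports, can be checked mechanically against a saved configuration. The sketch below is a minimal illustration, not a complete audit tool: the sample configuration, the convention of marking unused ports with a description containing "UNUSED", VLAN 999, and the function names are all assumptions.

```python
def interface_blocks(config_text):
    """Split an IOS-style configuration into {interface name: [commands]}."""
    blocks, name = {}, None
    for raw in config_text.splitlines():
        line = raw.rstrip()
        if line.startswith("interface "):
            name = line.split(None, 1)[1]
            blocks[name] = []
        elif name and line.startswith(" "):
            blocks[name].append(line.strip())
        else:
            name = None   # any unindented line ends the interface block
    return blocks

def audit(config_text):
    """Return (interface, finding) pairs for the two checks described above."""
    findings = []
    for name, lines in interface_blocks(config_text).items():
        has_mode = any(l.startswith("switchport mode") for l in lines)
        if has_mode and "switchport nonegotiate" not in lines:
            findings.append((name, "DTP still enabled"))
        unused = any(l.startswith("description") and "UNUSED" in l
                     for l in lines)
        if unused and "shutdown" not in lines:
            findings.append((name, "unused port not shut down"))
    return findings

SAMPLE = """
interface GigabitEthernet1/1
 switchport
 switchport mode trunk
 switchport nonegotiate
!
interface GigabitEthernet1/2
 switchport
 switchport mode access
!
interface GigabitEthernet1/3
 description UNUSED
 switchport access vlan 999
 shutdown
!
interface GigabitEthernet1/4
 description UNUSED
 switchport access vlan 999
"""

for finding in audit(SAMPLE):
    print(finding)
```

On the sample configuration the audit flags GigabitEthernet1/2 for leaving DTP enabled and GigabitEthernet1/4 for being marked unused but not shut down.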
Conclusion
The results that I have presented show that by combining control plane policing, storm control, and hardware rate limiters, significant and consistent reduction of CPU utilization can be achieved in the presence of Layer 2 loops. Configuration parameters used to define control plane policing, storm control, and hardware rate limiters were based on a 72-hour measurement of traffic processed by switch control planes in the operational network using the SPAN port. These results are based on tests performed in lab settings, but the conclusions reached are representative of what can be achieved in operational environments.
Even in a well-designed campus network, Layer 2 loops may occur due to wiring mistakes, misconfigured hosts (bridged interfaces), switch configuration mistakes, and loss of BPDU keepalives. By implementing the mitigation tools presented in this article, you can prevent the Catalyst 6500 from being overwhelmed in the presence of a Layer 2 loop.
You should follow Cisco design guidelines in campus network design to minimize the risk of spanning tree loops. One important goal in campus network design is to minimize, as much as possible, the span of broadcast domains by restricting VLANs to a single wiring closet and by using high-speed Layer 3 switching for the core layer instead of Layer 2 switching. Reducing the scope of broadcast domains within a campus yields a shorter spanning-tree diameter, which inherently reduces the risk and scope of Layer 2 loops.