We recently had an interface wedge on a customer router, with some interesting repercussions. The network topology is shown below.
Every day from about 8am until 6pm, a traffic load of 250Mbps to 300Mbps runs between the Major Facility and the core network. NetMRI’s interface utilization graphs show that this load occurs daily. The normal path for the load is from the facility, through 3550-01, to 7301-01, and on to the core network. The load is also bi-directional (not shown), possibly hinting at BitTorrent traffic.
One evening, just before 6pm, the G0/0 interface on 7301-01 wedged. It stayed in the up/up state, but stopped passing traffic. The routing protocol acted properly and re-routed to the path via 3550-01, 3550-02, 7301-02, and on to the core network.
Why did the interface wedge? What is an interface wedge? Searching for “interface wedge” shows that there are a number of bugs that cause interfaces to stop forwarding traffic and yet remain in the up/up state. Chris Rose, a senior NetCraftsmen consultant, has seen it in load testing of 7301 routers using a mix of traffic types and sizes. Paul Borghese, another NetCraftsmen consultant, has seen it on MPLS routers. Our testing of the 7301 router shows that while it has gigabit interfaces, the traffic forwarding capacity is highly dependent on VLAN tagging, MPLS, and QoS. With all three features in use, the throughput is on the order of 150Mbps. The 7301 routers and core network in this case are running MPLS, making it likely that a bug caused the interface to wedge.
Since the offered load was 250Mbps to 300Mbps, the backup path via the etherchannel between 3550-01 and 3550-02 was significantly overloaded. The load on the two etherchannel links was also poorly balanced. Fa0/0 was running at full capacity, more than 90Mbps, while Fa0/1 was loafing along at 2.5Mbps. The configuration showed that the default load balancing mechanism was in use.
The default load balancing on the 3550 etherchannel is dst-mac, which distributes the load based on destination MAC address. In this case, there is a router on each end of the flow: the router at the Major Facility and the 7301-02 router. So the destination MAC address on every packet going over the etherchannel toward the core was that of the 7301-02 router, while the destination MAC address on every packet going the other direction was that of the Major Facility router. With only one destination MAC address in each direction, every packet hashed to the same link. No wonder the load was not balanced. Almost 300Mbps was trying to make it through a single 100Mbps link.
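To see why a single destination MAC address defeats the load balancing, here is a minimal Python sketch. It is not the 3550's actual hash algorithm, which we have not reproduced here; the modulo and XOR functions, the MAC address, and the IP addresses are all made-up stand-ins chosen only to illustrate the effect.

```python
import ipaddress

def pick_link_dst_mac(dst_mac: str, n_links: int = 2) -> int:
    """Pick a member link from the destination MAC (illustrative hash only)."""
    return int(dst_mac.replace(":", ""), 16) % n_links

def pick_link_ip_pair(src_ip: str, dst_ip: str, n_links: int = 2) -> int:
    """Pick a member link from source XOR destination IP (illustrative hash only)."""
    s = int(ipaddress.ip_address(src_ip))
    d = int(ipaddress.ip_address(dst_ip))
    return (s ^ d) % n_links

# Hypothetical MAC standing in for the 7301-02 router: every packet headed
# toward the core carries this same destination MAC.
ROUTER_MAC = "00:11:22:33:44:02"

# Hypothetical flows: 40 facility hosts talking to a range of core hosts.
flows = [(f"10.1.0.{h}", f"10.2.0.{h // 2 + 1}") for h in range(1, 41)]

dst_mac_counts = [0, 0]
ip_pair_counts = [0, 0]
for src, dst in flows:
    dst_mac_counts[pick_link_dst_mac(ROUTER_MAC)] += 1
    ip_pair_counts[pick_link_ip_pair(src, dst)] += 1

print("flows per link, dst-mac hash:", dst_mac_counts)
print("flows per link, IP-pair hash:", ip_pair_counts)
```

With one destination MAC, every flow picks the same link no matter how many hosts are talking. A hash that includes the IP addresses has many distinct inputs to work with, so the flows spread across both links.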
Once the failover happened, the applications became very sluggish. LDP, running between 7301-01 and 7301-02, dropped. The users at the Major Facility were unhappy with network performance. While a properly configured etherchannel would help, it would still not have enough capacity to handle the offered load. This is where the link needed to have QoS configured to prioritize important traffic like voice, voice signaling, routing and switching protocols, and premium data.
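The capacity shortfall is simple arithmetic, sketched below in Python. The figures come from this incident: two 100Mbps member links and an offered load of 250Mbps to 300Mbps. The sketch assumes perfect load balancing and treats the offered load as a per-direction figure, which is a simplification.

```python
LINK_MBPS = 100             # each FastEthernet member link
N_LINKS = 2                 # Fa0/0 and Fa0/1
OFFERED_MBPS = (250, 300)   # low and high end of the daily load

def oversubscription(offered_mbps: float, capacity_mbps: float) -> float:
    """Ratio of offered load to available capacity (above 1.0 means overload)."""
    return offered_mbps / capacity_mbps

bundle_mbps = LINK_MBPS * N_LINKS
for offered in OFFERED_MBPS:
    ratio = oversubscription(offered, bundle_mbps)
    excess = max(0, offered - bundle_mbps)
    print(f"{offered}Mbps offered vs {bundle_mbps}Mbps bundle: "
          f"{ratio:.2f}x, {excess}Mbps over capacity")
```

Even in the best case the bundle is oversubscribed, so something has to be queued or dropped. QoS decides which traffic that is.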
The technical staff was reluctant to enable netflow on the 7301s because of the load that netflow would add to their CPU, possibly causing them to become unresponsive to the CLI (similar to enabling debug on a busy router). However, netflow data would have allowed us to identify the source and destination of the major flows and properly prioritize the traffic.
The wedged interface was finally addressed by ‘shutdown’, quickly followed by ‘no shutdown’. The traffic load switched back to the 7301-01 path and service to the Major Facility returned to normal.
What did we learn from all this? An etherchannel that may be oversubscribed needs to be properly configured:
- Configure ‘port-channel load-balance src-mac’ in global configuration mode on the 3550 switches. According to the 3550 configuration guide, the command option src-mac incorporates source and destination IP address into its hash algorithm for selecting the etherchannel link to use. Note: check the documentation for the model of router or switch you’re configuring; the commands and their operation are specific to the device model.
- Configure QoS on the port channel interfaces to prioritize important traffic over less important traffic.
- Check NetMRI’s performance data regularly, maybe once a month, to identify other sites that may have the same potential problem, based on the traffic volume at the site.
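The monthly check in the last item amounts to comparing each site's peak load with the capacity of its backup path. A rough Python sketch of that screening follows. It does not use NetMRI's API or data format; the `Site` record, the `at_risk` helper, and the branch names and figures are hypothetical, and only the Major Facility numbers come from this incident.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    peak_mbps: float             # busiest observed load on the primary path
    backup_capacity_mbps: float  # aggregate capacity of the backup path

def at_risk(sites, headroom: float = 1.0):
    """Return sites whose peak load exceeds backup capacity times headroom."""
    return [s for s in sites if s.peak_mbps > s.backup_capacity_mbps * headroom]

# Hypothetical inventory; in practice the numbers would be read from the
# interface utilization reports.
sites = [
    Site("Major Facility", 300, 200),
    Site("Example Branch A", 40, 200),
    Site("Example Branch B", 180, 100),
]

for site in at_risk(sites):
    print(f"{site.name}: peak {site.peak_mbps}Mbps exceeds "
          f"backup capacity {site.backup_capacity_mbps}Mbps")
```

Setting `headroom` below 1.0, for example 0.8, flags sites before they reach the limit rather than after.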
In the short term, we will be using NetMRI’s scripting capability to add the necessary commands to the switches that are used in sites with similar designs.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html