Cisco Router Interface Wedged

Author
Terry Slattery
Principal Architect

We recently had an interface wedge on a customer router, with some interesting repercussions.  The network topology is shown below.

[Figure: network topology for the wedged-interface incident]

Every day from about 8am until 6pm, a 250Mbps – 300Mbps traffic load flows between the Major Facility and the core network. NetMRI’s interface utilization graphs show that this load has been occurring daily. The normal path for the load was from the facility, through 3550-01, to 7301-01, and on to the core network. In addition, the load was bi-directional (not shown), possibly hinting at a BitTorrent-style application running.

One evening, just before 6pm, the G0/0 interface on 7301-01 wedged.  It stayed in the up/up state, but stopped passing traffic.  The routing protocol acted properly and re-routed to the path via 3550-01, 3550-02, 7301-02, and on to the core network.

Why did the interface wedge?  What is an interface wedge?  Searching for “interface wedge” shows that there are a number of bugs that cause an interface to stop forwarding traffic while remaining in the up/up state.  Chris Rose, a senior NetCraftsmen consultant, has seen it in load testing of 7301 routers using a mix of traffic types and sizes.  Paul Borghese, another NetCraftsmen consultant, has seen it on MPLS routers.  Our testing of the 7301 router shows that while it has gigabit interfaces, its traffic forwarding capacity is highly dependent on VLAN tagging, MPLS, and QoS.  With all three features in use, the throughput is on the order of 150Mbps.  The 7301 routers and core network in this case are running MPLS, making it likely that a bug caused the interface to wedge.
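One quick way to check for this condition, sketched below with the interface from this incident, is to watch whether the output counters keep moving while the interface still reports up/up (the exact output wording varies by IOS release):

    7301-01# show interfaces GigabitEthernet0/0 | include line protocol|output rate|packets output
    7301-01# clear counters GigabitEthernet0/0
    (wait 30-60 seconds, then repeat the first command)

An up/up interface whose output rate and output packet counters stay near zero under a known offered load is a good candidate for a wedged interface.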

Since the offered load was 250Mbps to 300Mbps, the backup path via the etherchannel between 3550-01 and 3550-02 was significantly overloaded.  The load on the two etherchannel links was not well balanced: Fa0/0 was running at full capacity, more than 90Mbps, while Fa0/1 was loafing along at 2.5Mbps.  The configuration showed that the default load-balancing mechanism was in use.

The default load balancing on the 3550 etherchannel is dst-mac, which distributes the load based on destination MAC address.  In this case, there is a router on each end of the flow: the router at the Major Facility and the 7301-02 router.  So the destination MAC address on every packet going over the etherchannel toward the core was that of the 7301-02 router, while the destination MAC address on every packet going the other direction was that of the Major Facility router.  No wonder the load was not balanced: almost 300Mbps was trying to make it through a single 100Mbps link.
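On the 3550, the hash in use and the resulting distribution across the members can be confirmed with a few show commands (a sketch; the channel-group number is illustrative, and the member interface names follow the ones above):

    3550-01# show etherchannel load-balance
    3550-01# show etherchannel 1 summary
    3550-01# show interfaces FastEthernet0/0 | include rate
    3550-01# show interfaces FastEthernet0/1 | include rate

Comparing the rates on the two members makes the imbalance described above obvious.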

Once the failover happened, the applications became very sluggish, and the LDP session running between 7301-01 and 7301-02 dropped.  The users at the Major Facility were unhappy with network performance.  While a properly configured etherchannel would have helped, it still would not have had enough capacity to handle the offered load.  This is where the link needed QoS configured to prioritize important traffic like voice, voice signaling, routing and switching protocols, and premium data.
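As a rough illustration, a minimal MQC policy of the kind that would have protected the important traffic is sketched below.  It is written for a router interface such as G0/0 on the 7301 (the 3550 itself uses a different egress QoS mechanism based on ‘mls qos’ and egress queueing); the class names, DSCP matches, and percentages are illustrative only and would need to be tuned to the actual traffic:

    class-map match-any VOICE
     match ip dscp ef
    class-map match-any SIGNALING
     match ip dscp cs3
    class-map match-any CONTROL
     match ip dscp cs6
    class-map match-any PREMIUM-DATA
     match ip dscp af31
    !
    policy-map PROTECT-BACKUP-PATH
     class VOICE
      priority percent 20
     class SIGNALING
      bandwidth percent 5
     class CONTROL
      bandwidth percent 5
     class PREMIUM-DATA
      bandwidth percent 30
     class class-default
      fair-queue
    !
    interface GigabitEthernet0/0
     service-policy output PROTECT-BACKUP-PATH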

The technical staff was reluctant to enable NetFlow on the 7301s because of the load that NetFlow would add to their CPUs, possibly causing them to become unresponsive to the CLI (similar to enabling debug on a busy router).  However, NetFlow data would have allowed us to identify the source and destination of the major flows and properly prioritize the traffic.
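Had the CPU headroom allowed it, classic NetFlow on the 7301 would have been only a few lines (a sketch; the collector address and port are placeholders, and ‘show ip cache flow’ gives a quick view of the top flows even without an external collector):

    ip flow-export version 5
    ip flow-export destination 192.0.2.10 2055
    !
    interface GigabitEthernet0/0
     ip flow ingress
    !
    7301-01# show ip cache flow

Given the CPU concern, it could also have been enabled briefly, just long enough to identify the top talkers, and then removed.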

The wedged interface was finally addressed by ‘shutdown’, quickly followed by ‘no shutdown’.  The traffic load switched back to the 7301-01 path and service to the Major Facility returned to normal.
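For completeness, the recovery amounted to the following, with a final check that the output rate starts climbing again:

    7301-01# configure terminal
    7301-01(config)# interface GigabitEthernet0/0
    7301-01(config-if)# shutdown
    7301-01(config-if)# no shutdown
    7301-01(config-if)# end
    7301-01# show interfaces GigabitEthernet0/0 | include output rate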

What did we learn from all this?  Properly configure any etherchannel that may become oversubscribed:

  1. Configure ‘port-channel load-balance src-mac’ on the 3550 switches (see the sketch after this list).  According to the 3550 configuration guide, the src-mac option incorporates source and destination IP addresses into the hash algorithm used to select the etherchannel link.  Note: check the documentation for the model of router or switch you’re configuring; the commands and their operation are specific to the device model.
  2. Configure QoS on the port channel interfaces to prioritize important traffic over less important traffic.
  3. Check NetMRI’s performance data regularly, maybe once a month, to identify other sites that may have the same potential problem, based on the traffic volume at the site.
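A sketch of the load-balancing change from item 1, with verification (the prompt is illustrative; as noted in the list, confirm the exact syntax for your platform, and apply the same change on both 3550s):

    3550-01(config)# port-channel load-balance src-mac
    3550-01(config)# end
    3550-01# show etherchannel load-balance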

In the short term, we will be using NetMRI’s scripting capability to add the necessary commands to the switches that are used in sites with similar designs.

-Terry

_____________________________________________________________________________________________

Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html.

