Some Nexus Issues

Author
Peter Welcher
Architect, Operations Technical Advisor

This blog is a quick note about some Nexus issues I’ve encountered or heard about recently.

The saga starts with a Nexus 7010, with M1 and F1 cards, experiencing many EIGRP resets.

Problem #1

A Nexus 7K connected to a 6500 switch. The configuration on both the Nexus and the connected 6500 looked approximately like this:

interface ethernet 1/1-2

switchport
switchport mode trunk
switchport trunk allowed vlan 1,10-20
channel-group 30

and similarly on the port-channel interface. Two 1 Gbps links for the port-channel. Obviously, the 6500 interfaces were “gigabit” not “ethernet”. And the channel-group command on the 6500 also had either “mode auto” or “mode desirable” in it. [Added, 8/28/13] No vPC involved, single chassis to single chassis.

Do you see the problem? (I’d like to think it’s not really obvious…).

It’s something I would mention when teaching a Nexus class, but didn’t think of quickly in a troubleshooting setting. (Our Carole Reece gets the credit!)

Enough hints…

The Nexus only speaks LACP, and without the keyword “active”, you’ve hard-coded an un-negotiated (“on” mode) port-channel on the Nexus end.

The 6500 is doing PAgP, defaulting to “auto” mode, which is passive. Since it is never asked to negotiate, the port-channel never comes up on the 6500, and you end up with two un-channeled Gig links on that side. I’ve seen that before with links hard-coded on one end and coded differently on the other. Result: Spanning Tree loop! One switch thinks it has bundled / channeled ports, and the other doesn’t.
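For the record, here’s roughly what the working configuration looks like. This is a sketch, reusing the interface, VLAN, and channel-group numbers from the approximate config above as placeholders; the key change is “mode active” on the channel-group at both ends, so both sides run LACP and negotiate.

On the Nexus 7K:

interface ethernet 1/1-2
switchport
switchport mode trunk
switchport trunk allowed vlan 1,10-20
channel-group 30 mode active

On the 6500:

interface range gigabitethernet 1/1 - 2
switchport
switchport mode trunk
switchport trunk allowed vlan 1,10-20
channel-group 30 mode active

On the 6500, “mode active” also selects LACP rather than PAgP, which matters since the N7K doesn’t speak PAgP at all. (Depending on supervisor and IOS version, the 6500 may also want “switchport trunk encapsulation dot1q” before the trunk mode command.)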

Reference: http://www.cisco.com/en/US/tech/tk389/tk213/technologies_tech_note09186a008009448d.shtml

Well, if you have an STP loop, EIGRP isn’t going to get through all the traffic … no wonder it kept resetting and showed an RTO of 5000 in “show ip eigrp neighbors”. (I do think we saw discards and the STP loop at the time, but that’s a bit of a sidetrack from the main theme here. Finding the cause of an STP loop is rarely easy or fun.)
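For what it’s worth, these are the sorts of commands that point at this kind of problem (the port-channel and VLAN numbers are again just placeholders from the example above):

show ip eigrp neighbors  (an RTO pegged at 5000 ms and a non-zero Q count suggest hellos and retransmissions are getting lost)
show interface port-channel 30  (output discards hint that something is eating the link)
show spanning-tree vlan 10 detail  (topology change counters that keep climbing help confirm a loop)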

After seeing this, I feel like a better default should have been chosen on the Nexus. Because of this sort of behavior, best practice is to always negotiate port-channeling. So the Nexus should default to negotiating, and use a keyword like “nonegotiate” to cover the rare case where you actually want to hard-code the port-channel to the “on” state. That would reduce human error in exactly this odd situation, where charging ahead and configuring the two ends similarly leads to big problems because nobody thought to add the word “active” on the Nexus side.

Problem #2

Well, telling both ends to use LACP and negotiate, or hard-coding them both on, fixed most of the EIGRP neighbors. One peering, between the N7K and a second N7K across a single 1 Gbps physical link, was exhibiting the same problems, although maybe not as frequently. All sorts of things got checked. For example, this one link went to ports on an F1 card, while the others that now worked were on an M1 card.

It turned out that one end had an LRM transceiver and the other an LR. (“show interface ethernet x/y transceiver” is the command that shows this.)

The odd thing was that lots of traffic was passing on the link, and the error counters were all 0. The symptoms seemed limited to EIGRP and HSRP problems. The link was perhaps 2 km, and LRM is only supposed to work to 300 m, so the LRM optics must have been a lot better than minimum spec. When I’ve seen this before, the transmissions from the end with the weaker optics arrived at the far end with a weak signal, resulting in CRC and other errors. So I’m puzzled that the error counters all showed zero. The site is still taking EIGRP bounces, but maybe a couple per hour rather than continuously. I’ve suggested cleaning the fiber terminations and checking for proper insertion. The error counters are still zero, all of them.
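If you want to run the same checks, the commands are roughly these (x/y being the port in question, and the power readings assume your optics support DOM):

show interface ethernet x/y transceiver details  (transceiver type plus Tx/Rx power, so an LRM/LR mismatch or a marginal light level stands out)
show interface ethernet x/y counters errors  (the per-port error counters that, in this case, stubbornly stay at zero)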

By the way, noted along the way: F1 ports do not show the non-default “switchport” command (“all Nexus 7K ports default to L3 ports unless you change the global default”). That’s because the F1 is of course L2-only, so its ports default to “switchport”. I’d prefer a little consistency so I don’t have to think about or know which modules are which, i.e. show “switchport” on F1 module ports even though it’s the default there. Yes, it’s a cosmetic bug at best, but excellent programming consists of attention to details like this. Anyway, I’m mentioning it in case you hadn’t noticed this little quirk of F1 ports.
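One workaround for the inconsistency, if I recall the syntax correctly: appending “all” to the running-config command displays default commands as well, so “switchport” shows up even on F1 ports:

show running-config interface ethernet x/y all

That sidesteps having to remember which module type defaults to L2 and which to L3.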

I’m conjecturing that the error counters always being zero is a bug in the N7K 5.2(1) code. Or else there’s something else going on that I haven’t spotted yet. I really don’t like the Bug Tools; they usually don’t work very well for me, and the cryptic descriptions just frustrate. But just for you, I’ll try to check this one … Well, “counter”, “error”, “show interface”, and “zero” are returning no relevant-looking matches. If you’ve run into this, please add a comment to this blog with info about what you saw, etc.!

Problem #3

Carole Reece found an interesting bug in Nexus 6 code. It’s probably not related (different code version, and OSPF rather than EIGRP), but still worth noting. As I understood it, inbound OSPF traffic on an F2E card somehow loses its CoS and hence is subjected to default-class CoPP internally in the Nexus 7K. This happens with F2E to M1 (not M2) linecard routing proxying, which means congestion might clobber your OSPF hellos and/or LSAs. The symptom would be random OSPF neighbor loss. The related thought is: might this be happening to EIGRP as well? The cited best practice is that when doing F2E to M1 proxying, you should not form routing adjacencies over F2E card ports.
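If you suspect you’re hitting something like this, one thing to check (a sketch, not an official troubleshooting procedure) is whether CoPP is dropping control-plane traffic:

show policy-map interface control-plane

Non-zero drop / violate counters under the default class, climbing while your adjacencies flap, would fit the behavior described above.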

Something similar occurred to me concerning Problem #1 above: might there be advice somewhere not to form routing adjacencies on F1 card ports when proxying with M1 cards? I haven’t seen anything in print saying that, however (and I did a moderate Google search to double-check).

Link: http://www.cisco.com/en/US/docs/switches/datacenter/sw/6_x/nx-os/release/notes/62_nx-os_release_note.html#wp595975. [Revised 8/28/13]

Life Log

We’ve been really busy lately, which is mostly a Good Thing for consultants. (It definitely beats the alternative!) Not so good for writing blogs. Hence my recent silence on the blog front. My list of ideas is steadily increasing, so there may be a torrent of blogs one day when things slow down a little. This one got written since I could pretty much just dash it off, although, like most of my writing, there are more words here than I anticipated.

Disclosure

The vendors for Network Field Day 5 (#NFD5) paid for my travel expenses and small gift items, so I wish to disclose that in my blogs now. The vendors in question are: Cisco, Brocade, Juniper, Plexxi, Ruckus, and SolarWinds. I’d like to think that my blogs aren’t influenced by that. Yes, the time spent in presentations and discussion gets me and the other attendees looking at and thinking about the various vendors’ products, marketing spin, and their points of view. I try to remain as objective as possible in my blogs. I’ll concede that cool technology gets my attention.

Stay tuned!

Twitter: @pjwelcher

 
