This blog post is a quick note about some Nexus issues I’ve encountered or heard about recently.
The saga starts with a Nexus 7010 with M1 and F1 cards experiencing many EIGRP resets. The Nexus 7K was connected to a Catalyst 6500 switch. The configuration on both the Nexus and the connected 6500 looked approximately like:
switchport mode trunk
switchport trunk allowed vlan 1,10-20
and similarly on the port-channel interface. The port-channel consisted of two 1 Gbps links. Obviously, the 6500 interfaces were “gigabit” not “ethernet”. And the channel-group command on the 6500 also had either “mode auto” or “mode desirable” in it. [Added, 8/28/13] No vPC involved, single chassis to single chassis.
Do you see the problem? (I’d like to think it’s not really obvious…).
It’s something I would mention teaching Nexus class, but didn’t think of quickly in a troubleshooting setting. (Our Carole Reece gets the credit!)
The Nexus is doing LACP, and without the keyword “active”, you’ve hard-coded an un-negotiated (“on”) port-channel on the Nexus end.
The 6500 is doing PAgP, defaulting to “auto” mode, which is passive. Since it is never asked to negotiate PAgP (which the Nexus doesn’t speak), the 6500’s port-channel never comes up, leaving two un-channeled Gig links on that side. I’ve seen this before with links hard-coded on one end and coded differently on the other. Result: Spanning Tree loop! One switch thinks it has bundled / channeled ports, and the other doesn’t.
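As a sketch of the mismatch (interface and channel-group numbers hypothetical), the two ends looked something like:

```
! 6500 end: PAgP negotiation, waiting for a PAgP partner that never answers
interface GigabitEthernet1/1
 switchport mode trunk
 switchport trunk allowed vlan 1,10-20
 channel-group 10 mode desirable   ! or "mode auto"
!
! Nexus 7K end: no "active" keyword, so the channel is hard-coded "on"
interface ethernet 1/1
  switchport mode trunk
  switchport trunk allowed vlan 1,10-20
  channel-group 10
```

The Nexus bundles unconditionally; the 6500 never bundles at all, so spanning tree sees one logical link on one side and two parallel physical links on the other.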
Well, if you have an STP loop, EIGRP isn’t going to get through all the traffic … no wonder it kept resetting and showed an RTO of 5000 in “show ip eigrp neighbors”. (I do think we saw discards and the STP loop at the time … but that’s a bit of a sidetrack from the main theme here. Finding the cause of an STP loop is rarely easy or fun.)
After seeing this, I feel a better default should have been chosen on the Nexus. Because of this sort of behavior, best practice is to always negotiate port-channeling. So the Nexus should default to negotiating, with a keyword like “nonegotiate” for the rare case where you actually want to hard-code the port-channel to an “on” state. That would reduce human error: as it stands, charging ahead and configuring the two ends similarly leads to big problems, because it’s easy not to think of adding the word “active” on the Nexus side of things.
Well, telling both ends to use LACP and negotiate, or hard-coding them both on, fixed most of the EIGRP neighbors. A peering between the N7K and a second one, across just a single 1 Gbps physical link, exhibited the same problems, although perhaps not as frequently. All sorts of things got checked. This one link went to ports on an F1 card, for example, while the others that now worked were on an M1 card.
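For reference, the LACP fix is a one-liner on each end (interface and channel-group numbers hypothetical); “mode active” makes both sides negotiate:

```
! Nexus 7K end
interface ethernet 1/1
  channel-group 10 mode active
!
! 6500 end: LACP instead of the default PAgP
interface GigabitEthernet1/1
 channel-group 10 mode active
```

With active/active, the bundle only forms when both sides agree, which is exactly the safety net that was missing above.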
It turned out that one end had an LRM transceiver and the other an LR. (“show interface ethernet x/y transceiver” works.)
The odd thing was that lots of traffic was passing on the link, and the error counters were all 0. The symptoms seemed limited to EIGRP and HSRP problems. The link was perhaps 2 km, and LRM is only supposed to work to 300 m, so the LRM optics must have been a lot better than min spec. When I’ve seen this before, the transmission from the weaker-optics end arrives as a weak signal at the far end, resulting in CRC and other errors. So I’m puzzled that the error counters all showed zero. The site is still taking EIGRP bounces, but maybe a couple an hour rather than continuously. I’ve suggested cleaning the fiber terminations and checking for proper insertion. The error counters are still zero, all of them.
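If you want to check for this sort of optics mismatch yourself, something like the following helps (interface numbers hypothetical); the “details” form adds the DOM readings, which might reveal a marginal light level even when the error counters read zero:

```
! Identify the transceiver type (LR, LRM, SR, ...) on each end
N7K# show interface ethernet 1/1 transceiver
!
! DOM details: tx/rx power, temperature, alarm thresholds
! (only if the optic supports digital optical monitoring)
N7K# show interface ethernet 1/1 transceiver details
```

Comparing the rx power against the alarm thresholds on both ends is a quick sanity check before sending someone out with a light meter.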
By the way, noted along the way: F1 ports do not show the non-default “switchport” command (“all Nexus 7K ports default to L3 ports unless you change the global default”). That’s because the F1 is of course L2-only, so its ports default to “switchport”. I’d prefer a little consistency so I don’t have to think about or know which modules are which; i.e., show “switchport” on F1 module ports even though it’s the default there. Yes, a cosmetic bug at best, but excellent programming consists of attention to details like this. Anyway, I’m mentioning it in case you hadn’t noticed this little quirk of F1 ports.
I’m conjecturing that the error counters always being zero is a bug in the N7K 5.2(1) code. Or else there’s something else going on that I haven’t spotted yet. I really don’t like the Bug Tools; they usually don’t work very well for me, and the cryptic descriptions just frustrate. But just for you, I’ll try to check this one … Well, “counter”, “error”, “show interface”, and “zero” are returning no relevant-looking matches. If you’ve run into this, please add a comment to this blog with info about what you saw!
Carole Reece found an interesting bug in Nexus 6.x code, probably not related (different code version, and OSPF not EIGRP) but interesting. As I understood it, somehow inbound OSPF traffic on an F2E card loses its CoS marking, and hence is subjected to default-class CoPP internally in the Nexus 7K. This happens with F2E-to-M1 (not M2) linecard routing proxying, which means congestion might clobber your OSPF hellos and / or LSAs. The symptom would be random neighbor loss for OSPF. The related thought: might this be happening to EIGRP as well? The cited best practice is, when doing F2E-to-M1 proxying, do not form routing adjacencies over F2E card ports.
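If you suspect CoPP is eating your routing-protocol hellos, one place to look is the per-class drop counters; “show policy-map interface control-plane” is where I’d start (the class names depend on which CoPP policy is applied):

```
! Per-class CoPP statistics: conformed vs. violated (dropped) packets
N7K# show policy-map interface control-plane
!
! Narrow the output to just the drop lines
N7K# show policy-map interface control-plane | include violate
```

A steadily incrementing violated count in the default class while neighbors are bouncing would fit the behavior described above.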
Something similar occurred to me concerning Problem #1 above: might there be advice somewhere to not form routing adjacencies on F1 card ports when proxying with M1 cards? I haven’t seen anything in print saying that, however (and I did a moderate Google search to double-check).
We’ve been really busy lately, which is mostly a Good Thing for consultants. (It definitely beats the alternative!) Not so good for writing blogs. Hence my recent silence on the blog front. My list of ideas is steadily growing, so there may be a torrent of blogs one day when things slow down a little. This one got written since I could pretty much just dash it off. Although, like most of my writing, there are more words here than I anticipated.
The vendors for Network Field Day 5 (#NFD5) paid for my travel expenses and small gift items, so I wish to disclose that in my blogs now. The vendors in question are: Cisco, Brocade, Juniper, Plexxi, Ruckus, and SolarWinds. I’d like to think that my blogs aren’t influenced by that. Yes, the time spent in presentations and discussion gets me and the other attendees looking at and thinking about the various vendors’ products, marketing spin, and their points of view. I try to remain as objective as possible in my blogs. I’ll concede that cool technology gets my attention.