BGP Traffic Engineering
I’ve been seeing some network problems lately, at sites where the problem was designing the VPC and routing mix correctly. Generally, there’s plenty of room to make a mistake, the situation is a bit confusing to most people. So I’m going to try to explain how to separate out routing and Layer 2 (L2) forwarding with VPC’s, so the routing will work correctly. I’m hoping to help by explaining the problem situation you need to avoid as simply as I can, and showing some simple examples, with lots of diagrams. For a simple description of how basic VPC works, see my prior posting, How VPC Works.
Cisco has put out some pretty good slideware on the topic, but there are an awful lot (too many?) diagrams. Either that’s confusing folks, or people just aren’t aware that VPC port channels have some design limitations, you can’t just use them any which way as with normal port channels (or port channels to a VSS’d 6500 pair).
The short version of the problem: routing peering across VPC links is not supported. (Adjacency will be established but forwarding will not work as desired.) The “vpc peer-gateway” command does not fix this, and is intended for another purpose entirely (EMC and NetApp end systems that learn the router MAC address as the source MAC in frames, rather than using ARP and learning the default gateway MAC address).
Let’s start by repeating the basic VPC forwarding rule from the prior blog:
VPC peers are expected to forward a frame received on a member link out any other member link that needs to be used. Only if they cannot do so due to a link failure, is forwarding across the VPC peer link and then out a member link allowed, and even then, the cross-peer-link traffic can only go out the member link that is paired with the member link that is down.
The same rules apply to routed traffic. Since VPC does no spoofing of the two peers being one L3 device, packets can get black-holed.
Here’s the basic situation where we might be thinking of doing VPC and can get into trouble. Note I’ve been using dots for routed SVI’s, just as a graphical way to indicate where the routing hops are. (No connection with the black spot in the novel Treasure Island.)
This is where we have a L3-capable switch and we wish to do L2 LACP port-channeling across two Nexus chassis. If the bottom switch is L2-only, no problem. Well, we do have to think about singly-homed servers, orphan (singly-homed) devices, non-VPC VLANs, failure modes, etc., but that is much more straight-forward.
All is fine if you’re operating at Layer 2 only.
Let’s walk through what VPC does with L3 peering over a L2 VPC port-channel. Suppose a packet arrives at the bottom switch C (shown by the green box and arrow in the diagram above or below). The switch has two routing peers. Let’s say the routing logic decides to forward the packet to Nexus A on the top left. The same behavior could happen if it chooses to forward to B. The router C at the bottom has a (VPC) port channel. It has to decide which uplink to forward the packet over to get it to the MAC address of the Nexus A at the top left.
Approximately 50% of the time, based on L2 port channel hashing, the bottom L3 switch C will use the left link to get to Nexus A. That works fine. Nexus A can forward the frame and do what is needed, i.e. forward out another member link.
The other 50% or so of the time, port channel hashing will cause router C to L2 forward the frame up the link to the right, to Nexus B. Since the destination MAC address is not that of Nexus B, Nexus B will L2 forward the frame across the VPC peer link to get it to A. But then the problem arises because of the basic VPC forwarding rule. A is only allowed to forward the frame out a VPC member link if the paired link on Nexus B is down. Forwarding out a non-member link is fine.
So the problem is in-on-member-link, cross-peer-link, out-another-member-link: no go unless paired member link is down. Routing does not alter this behavior.
Yes, if there is only one pair of member links, you cannot have problems, until you add another member link. If you add a 2nd VLAN that is trunked on the same member links, inter-VLAN routing may be a problem. If you just do FHRP routing at the Nexus pair, no, the L2 spoofing handles MAC addresses just fine (using the FRHP MAC so no transit of the peer link is necessary). It’s when your inter-VLAN routing is via an SVI on one of the bottom switches routing to a peer SVI on the Nexus pair that you will probably have problems.
You can have similar problems even if only one of the two Nexus switches is operating at L3, or has a L3 SVI in a VLAN that crosses the VPC trunks to the switch at the bottom. We will see an example of this later.
Conclusion: it is up to us to avoid getting into this situation! That is, VPC is not a no-brainer, if you want to mix it with routing you must design for that.
You can also do this sort of thing with two switches at the bottom of the picture, e.g. pair of N5K to pair of N7K’s. Or even VSS 6500 pair to VPC Nexus pair. See also our Carole Reece’s blog about it, Configuring Back-to-Back vPCs on Cisco Nexus Switches, and the Cisco whitepaper with details. VPC is allowed and works, but we need to design it to operate at L2 only.
We are also OK if we use a FHRP with a VPC to get traffic from a VPC’d server to a pair of Nexii, and then route across non-VPC point-to-point links, e.g. into the campus core or WAN. VPC does very well at spoofing L2, and the virtual MACs used with the three FHRP’s allow direct forwarding out VPC member links by VPC peers. Routing to the core uses non-VPC non-member links, so no problem.
The problem in the L3 story above is that the frame is being forwarded at L2 to the real MAC not virtual MAC of A, and B is not allowed to do the routing on behalf of A.
The next diagram shows how this typically bites us. If we’re migrating from 6500’s (bottom) to Nexus (top) and we are inconsistent, we can get in trouble. If our packet hits an SVI, is routed to Nexus B but sent via Nexus A, then Nexus B will not be able to route the frame again out the member link marked with the red X, to get to a L3 SVI on the bottom right switch D.
This might happen from datacenter to user closet, if you have L2 to a collapsed core/distribution Nexus pair, with some SVI’s between old 6500 C and new Nexus switches A and B in the datacenter, and closet switches with SVI’s on the same switches as the datacenter SVI’s (switch D in the diagram). It might also happen if you have some VLANs with SVI’s on datacenter access switches like C, and other VLANs on other datacenter access switches like switch D (perhaps even with all SVI’s migrated to live only on the Nexus pair). It can even happen on one switch, where C and D are the same switch, and you’re routing between VLANs via an SVI on C. (Same picture, just a little more cluttered because the green arrow and red X are on the link back to C.)
Here’s the Cisco-recommended design approach, using my drawing and words. The black links are L2 VPC member links. The red links are additional point-to-point routed links.
The simple design solution is to only allow L2 VLANs with SVI’s at the Nexus level across the VPC member links. If you must have some SVI’s on the bottom switch(es) and some others on the Nexus switches, block those VLANs on the L2 trunks that are VPC members, and route them instead across separate L3 point-to-point links, shown in red in the above diagram. Of course, if you’re routing say VLAN 20, there would be no point to having a routed SVI for VLAN 20 on the bottom switch and on the Nexus switches as well.
The point to point routed interfaces do not belong to VLANs, so they cannot possibly accidentally be trunked over the member links, which are usuallly trunks.
When you have SVI’s rather than routed interfaces or dot1q subinterfaces, you have to be aware of which VLANs you do and do not allow on the VPC member links. If you have many VLANs that need routing, use dot1q subinterfaces on the routed point-to-point links to prevent “VPC routing accidents”. Or use SVI’s and trunking over the point-to-point non-VPC links, just be very careful to block those VLANs on the VPC trunk member links.
As you will have noticed in my recent blog, Simplicity and Layer 2, I like L3 closets. That generally means your L2 is mostly confined to the datacenter. No L2 problems out in the closets!
Our present discussion is highly relevant if you are migrating from L2 to L3 closets. Several hospitals we are working with have had spanning tree problems (or risk). They wish to reduce their L2 domains size and any risk by moving to L3 closets. One way to tackle this is to drop Nexus switches in at the core or distribution layer (they are sometimes combined layers), and start out running VPC to all the L2 closets. That “stabilizes the patient” to buy time and stability for the cure, L3 closets.
If you whittle away at sprawling VLANs spanning closets, buildings and campuses, you can generally manage to clean up one closet at a time. Iterate for the next year or two. Painful, but much more robust!
Consider a single closet switch that you’re working on. You can get yourself to a situation where the SVI’s are in the distribution layer Nexuses, say, and you have L2 VPC member trunks to the closet switch (now represented by the bottom switch in our above diagrams). When all the VLANs are single-closet-only VLANs, you can then un-VPC the uplinks to that one closet, turn them into point-to-point routed links, put the SVI’s on the closet switch instead, and be done. If you want a slower transition, add separate L3 routed point-to-point links like the above red lines, and control which VLANs are trunked across the VPC member links. All it takes is organization and being clear about where you’re doing L2 and where you’re doing L3 — which I’d say should be part of the design document / planning.
One more real world example shows how it is easy to not see the potential problem. Suppose you have a router, e.g. an MPLS WAN router, and for some reason you have to attach it to a legacy switch at the bottom of the picture, as shown in the following diagram:
Why would you do this? In one case we’ve seen, the vendor router had a FastEthernet port, and the Nexus switch had no 100 mbps capable ports. Another is copper versus fiber ports and locations of the devices in question.
Suppose the uplinks are VPC members, and because of the VPC routing problems, the site is trying to make this work with just a VPC on the right Nexus switch, switch B. In the case in question,; C and D were actually the same switch, but I’m presenting it this way since the diagram is more clear when I show two switches.
At the left we see a packet hitting a SVI in the leftover 6500 (which some sites would shift to being a L2-only access switch, and other sites would discard or recycle elsewhere in the network.).
The bottom left switch SVI can route to other SVI’s that are local. To get to the WAN router, the left switch needs to somehow route the packet via the top right Nexus, Nexus B somehow. It turns out there is exactly one VLAN with SVI’s on all three switches, which gives switch C a way to route to the rest of the network. Switch C therefore follows the dynamic EIGRP routing, by routing into the shared VLAN with next hop Nexus B.
In 50% of the flows, the packet goes via the left Nexus A, across the peer link, and thus B cannot forward it out the VPC member link to get to the router.
Exercise for the reader; Consider traffic going the other way, from the WAN back to the datacenter. See the following diagram:
Does it work? If not, what goes wrong? Can you explain it? [Hint: there’s a red X in the above diagram for a reason!]
(1) Attach the MPLS VPN WAN router to one or both Nexii directly. Note that dual-homing via the 6500 (bottom right) is a Single Point of Failure (SPoF), so connecting to only one N7K is no worse (or better).
(2) Put the SVI for the router’s VLAN on the bottom right switch, and convert the uplinks to L3 point-to-point. Or use dedicated point-to-point links for all routed traffic from bottom right to the two Nexii. Since point to point routed interfaces don’t belong to VLANs, they can’t accidentally be trunked over VPC member links.
(3) Have no SVI’s on the Nexii — do all routing on the bottom switches. That actually works — but doesn’t help in terms of getting the routing onto the much more powerful Nexus switches, which is where you probably want it.
Please don’t draw the conclusion that you can’t do routing with VPC. You can in certain ways. What you do not want is a router or L3 switch interacting with routing on VPC peers over a VPC port-channel link. You can route to VPC peers as long as you’re not using a VPC port-channel, e.g. just a plain point-to-point link or a L3 port-channel to a single Nexus. If there is an SVI at the bottom (that is, not on the Nexus pair) for a given VLAN, block it from the member links and thereby force it to route over the dedicated routed links. In that case, don’t allow the VLAN across the VPC peer link either: that link should only carry the VLANs that are allowed on the VPC member links, and no others, no routing, nothing else.
You can also route over a VPC port-channel, as long as your routing peers are reached at L2 across the VPC but are not the VPC peers your VPC connects to. That is, routing peering across a L2-only VPC Nexus pair in the middle is OK.
In the datacenter, stick to pure L2 when doing VPC, up to some sort of L3 boundary. When doing L3, use non-VPC L3 point-to-point links. If you have a pod running off a pair of L3-capable Nexus 55xx’s and you feel the need to VPC some L2-ness through your Nexus 7K core, fine, just use dedicated links for the L3 routing. And when doing so, don’t use SVI’s, use honest to goodness L3 ports, that is, “no switchport” type ports. That way you cannot goof and forget to disallow any relevant VLANs across VPC member links that are trunks.
Upcoming design consideration: don’t VPC multi-hop FCoE. It’s OK to VPC FCoE at the access layer, just don’t do it beyond there. Why not VPC multi-hop FCoE? Among other reasons, it makes it far too easy to merge fabrics accidentally. That’s a Bad Thing, definitely something you do not want to do! Also, you do have to be careful about FCoE with a 2 x 2 VPC — that’s covered in the Nexus course (now named “DCUFI”). Which I’m teaching about once a month for FireFly (www.fireflycom.net)
I think the engineers expected everyone doing L3 to put it on separate links. It’s not clear to me why they thought people would WANT to do that. Nor the confusion about SVI’s and where you were doing routing that a lot of people seem to have (i.e. understanding it too complex for real world). It might also have something to do with the datacenter switch positioning of the Nexus products.
Quote from that thread: “We don’t support running routing protocols over VPC enabled VLANs.”
BGP Traffic Engineering
Design: Is It One Site or Two?
What Business Leaders Should Know About Network Monitoring
Nick has over 20 years of experience in Security Operations and Security Sales. He is an avid student of cybersecurity and regularly engages with the Infosec community at events like BSides, RVASec, Derbycon and more. The son of an FBI forensics director, Nick holds a B.S. in Criminal Justice and is one of Cisco’s Fire Jumper Elite members. When he’s not working, he writes cyberpunk and punches aliens on his Playstation.
Virgilio “Bong” has sixteen years of professional experience in IT industry from academe, technical and customer support, pre-sales, post sales, project management, training and enablement. He has worked in Cisco Technical Assistance Center (TAC) as a member of the WAN and LAN Switching team. Bong now works for Tech Data as the Field Solutions Architect with a focus on Cisco Security and holds a few Cisco certifications including Fire Jumper Elite.
John is our CTO and the practice lead for a talented team of consultants focused on designing and delivering scalable and secure infrastructure solutions to customers across multiple industry verticals and technologies. Previously he has held several positions including Executive Director/Chief Architect for Global Network Services at JPMorgan Chase. In that capacity, he led a team managing network architecture and services. Prior to his role at JPMorgan Chase, John was a Distinguished Engineer at Cisco working across a number of verticals including Higher Education, Finance, Retail, Government, and Health Care.
He is an expert in working with groups to identify business needs, and align technology strategies to enable business strategies, building in agility and scalability to allow for future changes. John is experienced in the architecture and design of highly available, secure, network infrastructure and data centers, and has worked on projects worldwide. He has worked in both the business and regulatory environments for the design and deployment of complex IT infrastructures.