Knowing and following standard network design principles is a Good Thing. Below, we’ll go briefly into the rationale for that statement. One reason is that clarity is needed to be able to properly secure a design, and to validate that security.
There are lots of sources for good design principles and patterns. For instance, Cisco Validated Design (CVD) and Design Guide documents, and the Cisco Press ARCH book. If you stick with the advice in those documents, you’re less likely to run into bugs and other problems, and more likely to be able to get good support from Cisco TAC.
Yes, some of the CVDs are light, or I disagree with parts of them, but overall they’re very helpful. Let’s not get bogged down on that here.
This blog goes into a couple of design patterns I’ve noticed over the years, ones that are not in any books or articles I’ve seen. We’ll then extrapolate them to the Security domains with some discussion of how Cisco ACI can help or hinder along the way. We’ll go into what looks advisable to me and why. And if you disagree, I’ll be glad to see your comments, or hear from you (e.g. Twitter). Please do provide discussion of how and why you disagree.
Why Design Principles and Patterns?
Following key design principles and using familiar design patterns produces more reliable networks.
Common patterns are easily recognized.
Known expectations, behavior.
Easily described (e.g. “rectangle vs bow-tie connections” for say, core pair of switches to firewall pair).
Servers Doing Routing
Principle #1: Server-specific host tables and routes are bad.
Let the network do the networking.
Having each server doing its own thing leads to inconsistency, is hard to maintain correctly, and a major nuisance to troubleshoot — highly non-scalable. Re host tables, that’s why we have DNS. Concerning host-configured routes, that’s what the network is for.
Principle #2: Avoid servers with multiple data interfaces. Dedicated FCoE storage interface(s), backup interface, and management interface are OK, as long as they comply with Principle #1 above.
Leveraging directly connected subnets to avoid host-based routes is tolerable, as long as an outage cannot cause the server to think its default gateway is reachable via such an interface.
VMware kernel, vMotion, etc. are same-subnet, so host-based routing shouldn’t be in play there.
What I’m specifically recommending here is avoiding things like “front door” or “untrusted” or “back end network” interfaces on servers.
I have seen abuse of this, e.g. sites where the backup network is also carrying vMotion or Production traffic, which it is not intended to do. That can sometimes be a different problem, when the backup or management networks have way too many (e.g. 1000) hosts on them. For backup, having the backup front end with many connected interfaces, or several front ends, is one possibility. Using different subnets off a “management network router” and far less L2 switching solves that tendency.
IP-based storage is a bit of a problem here, where either L2 adjacency or at least one static route might be needed, if you insist on traffic engineering the data traffic to be on a different interface than the IP storage traffic. More bandwidth and / or port channeling (bonding or teaming) as one possible answer?
That’s the what. Now the why: scalability / ease of troubleshooting / fewest touch points, and predictability. Let the network do its job.
Potential Problem: Servers with differentiated interfaces for security zoning reasons, e.g. an “outside” and an “inside” interface. Or spanning, say, Dev and Prod.
I stick with Principle #2, don’t do that. It doesn’t add security, it just moves the problem, making networking and security more complex. Doing this means you now have two IP addresses to think about in ACL’s, not just one. And likely different paths to the server, so the server may be back in the routing business.
A server with interfaces in both Dev and Prod environments is an accident waiting to happen.
Add to that: doing this usually means you are providing a way to get to the server that bypasses the firewall. Unless you have a strong reason for doing so, that’s a bad idea.
My justification for this: simple, clear, auditable security. You can tell at a glance how Production traffic gets to and from a server with one data interface, and no routing to other subnets on any other interfaces.
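To make the "one data interface" rule checkable, here is a minimal sketch of an inventory audit. It assumes you already have some per-server list of interfaces; the `Interface` fields and the role and zone labels ("data", "backup", "prod", "dev") are hypothetical names chosen for illustration, not anything from a real inventory tool.

```python
from dataclasses import dataclass

# Non-data roles that Principle #2 tolerates (labels are hypothetical).
ALLOWED_EXTRA_ROLES = {"storage", "backup", "management"}


@dataclass(frozen=True)
class Interface:
    name: str
    role: str  # e.g. "data", "storage", "backup", "management"
    zone: str  # e.g. "prod", "dev"


def audit_server(interfaces):
    """Return a list of findings for one server's interfaces."""
    findings = []
    data = [i for i in interfaces if i.role == "data"]
    if len(data) > 1:
        findings.append(
            "multiple data interfaces: " + ", ".join(i.name for i in data)
        )
    zones = sorted({i.zone for i in data})
    if len(zones) > 1:
        findings.append("data interfaces span zones: " + ", ".join(zones))
    unknown = [
        i.name
        for i in interfaces
        if i.role != "data" and i.role not in ALLOWED_EXTRA_ROLES
    ]
    if unknown:
        findings.append("unclassified interfaces: " + ", ".join(unknown))
    return findings
```

A server with one data interface plus a backup interface returns an empty list; a server with data interfaces in both Dev and Prod returns two findings.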
Principle #3: Security enclaves or zones should have precisely one entrance / exit point, namely the associated firewall.
Basically ditto. Any device that does forwarding (routing) between an “inside” and an “outside” interface adds complexity. Some such devices are necessary. Limiting how many routing objects there are reduces complexity. Servers should be computing resources, not routing resources.
Note that with NSX or other network micro-segmentation (ACI, if done properly), there is logically no way to escape the L4 security rules, unless you configure a bypass. If there is a physical firewall (usually for L4-7 / ALG and inspection functions), the same should likely be true in a good design. Security people are not going to want complexity where some sort of “back door” or “firewall bypass” might occur, especially if covert or just hidden by complexity. Perhaps the “Keep It Simple and Secure” principle? Umm, let’s not make “KISaS” a well-known acronym!
Overall Recommendation: Hunt down and fix all instances of servers with host tables or local routing. This is a “fix it now or spend your time mid-crisis troubleshooting it later” item. And get to know your app team better, to avoid app / server designs or implementations violating the above principles.
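As a starting point for that hunt, here is a rough sketch that scans text collected from servers. It assumes Linux hosts, with input in `/etc/hosts` format and in the format printed by `ip route show`; how you collect that text is up to you, and the function names are my own. Some distributions ship default IPv6 entries in the host table that this would flag, so an allow-list may be needed.

```python
import ipaddress


def extra_host_entries(hosts_text):
    """Return /etc/hosts-style entries whose address is not loopback."""
    extras = []
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        try:
            addr = ipaddress.ip_address(fields[0])
        except ValueError:
            continue  # not an address line; ignore it
        if not addr.is_loopback:
            extras.append((fields[0], fields[1:]))
    return extras


def non_default_gateway_routes(route_text):
    """Return `ip route show`-style lines that use a next hop ("via")
    but are not the default route, i.e. host-configured routes."""
    routes = []
    for line in route_text.splitlines():
        fields = line.split()
        if not fields or "via" not in fields:
            continue  # connected routes have no next hop
        if fields[0] == "default":
            continue
        routes.append(line.strip())
    return routes
```

Anything either function returns is a candidate for a conversation with the server owner: why is it there, and can DNS or the network do that job instead?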
Potential Problem: Using switch VLANs as “cabling”, e.g. firewall on a stick, and L2-only VLANs.
- Easy to accidentally mess it up
- Unless clearly documented, the L2 “cabling” can become confusing.
I’m torn on this one. I actually think “firewall on a stick” can be a Good Thing, since you can then use VLANs as sort of patch cables, virtualizing connectivity. That is much more agile than having to visit the datacenter or schedule “hands” just to patch cables. On the other hand, your security folks may prefer physical connections precisely because they can’t change as quickly, and changes might be more obvious.
To me, the best practice there is to put a very good description on all the L2-only switch interfaces in question. It should say something like:
interface vlan 123
 description *** L2 transit from FW outside to Internet Core Switches inside. DO NOT ADD AN IP ADDRESS ***
Documenting this sort of thing also helps. The convention that “outside” refers to the (indirectly) Internet-facing interface on a forwarding device may be helpful.
Principle #4: When using VLANs as “patch cabling”, document it well, both in a design overview document and/or diagram, and via descriptions such as the above.
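Descriptions only help if somebody checks them. Below is a small sketch that reads IOS-style configuration text and reports interfaces that carry the "do not add an IP address" warning but have an address anyway. The marker string and the simple indentation-based parsing are my assumptions; real configs may need a proper parser.

```python
# Marker text assumed to appear in the description of L2-only interfaces.
MARKER = "DO NOT ADD AN IP ADDRESS"


def parse_interfaces(config_text):
    """Split IOS-style config text into {interface name: [sub-commands]}."""
    blocks = {}
    current = None
    for raw in config_text.splitlines():
        if raw.lower().startswith("interface "):
            current = raw[len("interface "):].strip()
            blocks[current] = []
        elif raw.startswith(" ") and current is not None:
            blocks[current].append(raw.strip())
        else:
            current = None  # any unindented line ends the block
    return blocks


def l2_transit_violations(config_text):
    """Return interfaces marked as L2 transit that have an IP address."""
    bad = []
    for name, lines in parse_interfaces(config_text).items():
        marked = any(
            l.lower().startswith("description") and MARKER in l for l in lines
        )
        has_ip = any(l.lower().startswith("ip address ") for l in lines)
        if marked and has_ip:
            bad.append(name)
    return bad
```

A line such as `no ip address` is not flagged, since it does not start with `ip address`.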
A somewhat related principle: don’t use VLANs where you don’t have to, use dot1q routed interfaces instead. OSPF with multi-point interfaces converges slowly, since you usually don’t get a link down event. In general, when links are routed point-to-point, align your cabling topology with the routing.
Principle #5: Don’t use SVI’s when a dot1q sub-interface fits your design and likely future needs.
My reasoning for this: it is fairly easy to do ECMP and routed failover with routed dot1q subinterfaces. With SVI’s, you’re looking at STP for failover, and no ECMP. In a recent design review, the setting was a partly L2 / partly L3 datacenter interconnect (DCI). STP over a moderately long-distance WAN circuit does not strike me as a good idea.
Potential Problem: Multiple security enclaves with an “untrusted common” subnet between them. Or multiple “security zone” interfaces on one firewall.
The good thing about this pattern is that there is one and only one clear way in and out of each enclave or security zone.
The problem with this pattern is that most sites that do this end up with inbound and outbound rules on the firewall(s) or interfaces front-ending the enclaves. Every time you make an ACL change, you have to do it in 2 or 4 places. Easy to make a mistake, hard to troubleshoot. So, if you can, doing ACLs in one direction only might lighten the maintenance burden.
Is there a way to do this better?
In the SAN world, single-initiator zoning is recommended. For each possible source, list what it is allowed to talk to. That cuts down on duplication and confusing overlap between security rules.
Thought: following some such organizing principle in enclave or ACI rules might be useful?
In the server world, I personally think single-destination might work better. I like inbound ACL’s near servers, and outbound perhaps limited to “permit outbound to my private network only” or something equally broad and low maintenance. The TCP keyword “established” can also help.
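To illustrate the single-destination idea, here is a minimal sketch that takes a flat list of permitted flows and regroups them so each destination owns exactly one inbound rule list. The flow tuples and zone names are made up for the example; this is an organizing aid, not a rule generator for any particular firewall.

```python
from collections import defaultdict


def inbound_rules_by_destination(flows):
    """Group permitted (source, destination, tcp_port) flows by destination,
    so each destination has one de-duplicated, sorted rule list."""
    grouped = defaultdict(set)
    for src, dst, port in flows:
        grouped[dst].add((src, port))
    return {dst: sorted(entries) for dst, entries in grouped.items()}


flows = [
    ("web", "app", 8443),
    ("app", "db", 5432),
    ("batch", "db", 5432),
    ("web", "app", 8443),  # duplicate request, collapsed by the set
]
rules = inbound_rules_by_destination(flows)
```

Here `rules["db"]` lists everything allowed to reach the database in one place, which is the list you would review or change when the database's exposure changes.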
This is one place where micro-segmentation can help, with two-way policies allowing traffic and replies.
Tentative Principle: If possible, organize your security zones carefully, to minimize cross-zone traffic and resulting ACL maintenance. Or organize things so that access control rules are zone-based.
I’ve seen a lot of sites with what looked like ad hoc server / zone placement. I wasn’t involved in the planning, so there might have been good reasons for that. Or perhaps there was a lack of planning, and entropy happened. For some reason, I’m thinking of the 5 P’s (or whatever number) principle: Proper Planning Prevents Poor Performance.
ACI and Transparency
I’ve had the dubious pleasure of reverse engineering a rather undocumented third party’s ACI implementation. It was probably a “migrate fast now, clean it up later” project (or perhaps never clean it up?).
I’ve come to think of ACI as effectively one big switch, router, and L4 firewall. I’m fine with that. And the dubious pleasure aspect may just be lack of practice at and comfort with reverse engineering ACI.
My near-term reaction is the old saying “just because you can doesn’t mean you should”.
The challenge I encountered was that the usual “breadcrumbs” were missing. Routed networks funnel traffic through control points: routers, firewalls, load balancers, etc., with VLANs or physical patch cables constraining the connectivity.
In ACI, you can pretty much connect anything to anything. But if you let that happen, then troubleshooting or security review means digging through route leaks, L3OUTs, and contracts pretty much globally, to figure out what is allowed to talk to what. In two layers: routing layer, and then contract layer.
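The two-layer point can be shown with a toy model. This is a deliberate simplification and not the ACI object model or API: endpoint groups are plain strings, route leaking is a set of VRF pairs, and contracts are one-way (source, destination) pairs. All names are hypothetical.

```python
def explain(src, dst, vrf_of, leaked_vrf_pairs, contracts):
    """Say whether src may reach dst, and if not, which layer blocks it."""
    src_vrf, dst_vrf = vrf_of[src], vrf_of[dst]
    # Layer 1: is there any routing between the two VRFs?
    if src_vrf != dst_vrf and frozenset((src_vrf, dst_vrf)) not in leaked_vrf_pairs:
        return "blocked at routing layer"
    # Layer 2: does a contract permit this direction?
    if (src, dst) not in contracts:
        return "blocked at contract layer"
    return "allowed"


vrf_of = {"web": "prod", "app": "prod", "dev-app": "dev"}
leaked_vrf_pairs = {frozenset(("prod", "shared"))}
contracts = {("web", "app"), ("web", "dev-app")}
```

With this data, web to app is allowed, app to web stops at the contract layer, and web to dev-app stops at the routing layer even though a contract exists. Answering "what can talk to what" means evaluating both layers for every pair, which is why a fabric without an organizing pattern is slow to review.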
I’ve mulled over what I saw a good bit. Without trying to explain what I’ve seen (and can’t un-see) …
My initial reaction was probably biased. I personally prefer interface-based ACL’s, rather than CheckPoint style overall policy. Ok, that’s what I grew up with, comfort zone. But also, interface-based leads to divide-and-conquer; you don’t have to wrap your brain around everything at once. Yes, you still need some overall vision (design principle) for what gets enforced where. Hopefully documented.
That leads tentatively to the following:
Principle #6: In ACI, leverage tenants and VRFs as you would enclaves or security zones. Make sure you use something like Common or an external firewall as transit between tenants, or a selected VRF or firewall as transit between VRFs in one tenant, to provide a point of control. The point being that many to one is better than many to many.
Doing so sub-divides the connectivity policy, making it easier for someone to wrap their brain around it.
In short, creating a pattern and having a modular approach is good!
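A quick bit of arithmetic shows why many to one beats many to many. The sketch below counts policy relationships for N zones, assuming every zone needs to reach every other: a full mesh needs one relationship per pair, while a shared transit needs one per zone. The function name is mine.

```python
def policy_relationships(n_zones):
    """Return (full_mesh, via_transit) counts of zone-to-zone policy
    relationships when every zone must reach every other zone."""
    full_mesh = n_zones * (n_zones - 1) // 2  # one per unordered pair
    via_transit = n_zones                     # one per zone, to the transit
    return full_mesh, via_transit
```

For 6 zones that is 15 pairwise relationships versus 6. For 2 or 3 zones the mesh is no worse, so the transit approach pays off as the number of zones grows.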
My other tentative conclusion is:
Principle #7: If you have contracts governing routing a bunch of things to each other, that had best be the connectivity between a bunch of tenants or VRFs, possibly to the world outside ACI as well.
Pick a style and stick to it.
If you’re going to use an ACI Bridge Domain like a “VLAN patch cable”, i.e. to privately connect two devices together, fine, but do that consistently. E.g. external firewall to load balancer, or load balancer to web front end farm. Don’t then bypass that approach elsewhere via ACI contracts. And document it either way: Bridge Domains used like this provide good “bread crumb trails”, but only if people realize that’s what you’ve intentionally done.
Pair-wise contracts are a lot easier to understand than “fur-ball” ones (many to many).
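If you export contract relationships into some simple structure, spotting the fur-balls is easy. The sketch below assumes a hypothetical dictionary of contract name to (providers, consumers) sets; it is not tied to any ACI export format. It flags contracts with more than one party on both sides and reports how many pairs each one implies.

```python
def furball_contracts(contracts):
    """Return {contract name: implied pair count} for contracts that have
    more than one provider AND more than one consumer."""
    flagged = {}
    for name, (providers, consumers) in contracts.items():
        if len(providers) > 1 and len(consumers) > 1:
            flagged[name] = len(providers) * len(consumers)
    return flagged


contracts = {
    "web-to-app": ({"app"}, {"web"}),                      # pairwise
    "dns": ({"dns"}, {"web", "app", "db"}),                # many to one
    "shared-svc": ({"dns", "ntp"}, {"web", "app", "db"}),  # many to many
}
```

Only `shared-svc` is flagged here, with 6 implied pairs. Pairwise and many-to-one contracts pass, which matches the preference stated above.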
As noted, I much prefer distributed policy to the CheckPoint “everything in one lump” approach: divide and conquer is easier to understand and to reverse engineer. My conclusion: if you must route between two enclaves, doing so pairwise, with an associated security contract, should improve clarity. Yes, with more than a couple of enclaves, you’re stuck with the N:1 shared common approach.
Please do document your design and intent. I’m referring to the overall layout, high-level design, organizing principles, where key stuff like redistribution happens, where security policy is enforced (here but not there), that sort of thing.
What I’ve found over the years is that engineers are great at details — and to some extent the configuration IS the details. The problem is that nobody explains the big picture to consultants, new hires, etc. and over time the big picture gets lost. Then someone starts doing things differently and you end up, in effect, fighting the system. Not good!
Concluding thought: If you can’t briefly document your design, chances are you’re down in the weeds. Pull up and find some organizing principles!
Comments are welcome, whether in agreement or in constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!