I’ve recently been discussing Security in the Datacenter with a consulting customer. Their security folks are getting a lot of management support in one of the more stringent security pushes I’ve seen to date. The security team fell in love with Palo Alto Networks firewalls (UTMs), bought some big ones, and want to stick them in the middle of the datacenter, controlling not only all traffic between users and servers, but also all traffic within the datacenter. They also feel they should become the default gateway for all servers, and while they’re at it, might as well take over doing the routing for the datacenter. (None of this is necessarily a Palo Alto Networks problem; I’ve run into similar things with other brands, including Cisco.)
You can probably imagine how stunned all that left the network team feeling.
Lesser forms of this have shown up, some driven by “it’s a Cisco security best practice”. Some of the Cisco SRND / Design Zone diagrams look like there’s a firewall between the campus / users and the datacenter. There’s a Nexus conceptual diagram showing Agg and Sub-Agg VDCs on the two sides of a firewall doing contexts. The fine print (afterthought) suggests that non-firewalled server VLANs live on the Agg VDC, i.e. on the Core and user-facing side of the firewall(s). Unless you catch that fine print, you might design to put everything behind the firewall.
Well, what’s wrong with that? Isn’t it more secure?
My answer: yes and no. (Your answer might vary. Security’s answer is generally that if it has a firewall in it, it MUST be more secure — or at least, that’s what I sometimes think I’m hearing. No offense intended here to security staff that do not have an adversarial relationship with the network group.)
My biggest concern here would be that the firewall has much tighter bandwidth capacity limits than that expensive switch set you bought for your datacenter. If you place a firewall in the user-to-datacenter path, that might not be too big a problem (or cost). If all you’re doing is dual 10 Gbps links to the datacenter, no problem. For the last few years, I’ve been figuring every closet has 10 Gbps to the Distribution switch, and with Nexus that might well be the same chassis as the building Core switch and the datacenter Aggregation switch, possibly all in the same VDC, possibly not.
Ok, so that works if the user to datacenter connection is slim. Perhaps your building switches aggregate into a building distribution and core, and then all you have is 2 x 10 Gbps links to the datacenter core. That might be do-able. What happens when you outgrow it?
The situation becomes a bit more extreme when your closets and your datacenter all come together into the same switch. Now if you force your user VLANs to run through the firewall to get to the server VLANs, you’re essentially replacing the backplane or fabric in the switch (massive forwarding capability) with the forwarding performance of the firewall (dare I say “puny”?).
Do you really want to replace a giant N x 1 Tbps fabric with a firewall, one that with a tailwind might achieve perhaps 30 Gbps or so of throughput? [I know, this article will instantly become out of date when I put numbers like that in it.]
I don’t think that’s a particularly good idea.
The same applies to routing amenities (code features, quality of routing implementation).
L3 switches are good at routing (well, Cisco’s are). Firewalls generally have some semblance of routing, but at best it’s RIP (aka “network malpractice”) and OSPF. If you do EIGRP, good luck, unless you bought an ASA. I’ve heard of all sorts of oddities in firewall OSPF implementations, like not summarizing or not doing OSPF ABR particularly well. So I try to keep it simple as far as how much routing complexity I put into the firewall.
I have a second concern with The Firewall That Rules Them All (or perhaps “It Who Must Be Obeyed”). My concern is the access list (ACL) rules for it. Let’s do a little reality check here.
How big and complex is that ruleset going to be? Especially if it has to deal with traffic between any pair of VLANs. Hmm, if you have 100 VLANs, that’s about 100 x 99 = 9,900 VLAN-pair rulesets to maintain. Yup, that’ll be fun. Zones might work a bit better. I like numbers such as 3, 4, or 5. Five zones, that might be do-able.
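To put rough numbers on that, here’s a quick back-of-the-envelope sketch. The VLAN and zone counts are just the illustrative figures from above, not anyone’s real design:

```python
# Back-of-the-envelope: how many inter-segment policies a firewall
# admin has to reason about, per-VLAN vs. zone-based.

def ordered_pairs(n):
    """Number of ordered (source, destination) segment pairs."""
    return n * (n - 1)

vlans = 100   # per-VLAN filtering, as in the example above
zones = 5     # a small zone-based design

print(ordered_pairs(vlans))  # 9900 VLAN-to-VLAN combinations
print(ordered_pairs(zones))  # 20 zone-to-zone combinations
```

Collapsing 100 VLANs into 5 zones takes the policy matrix from thousands of combinations down to a couple dozen, which is the difference between a ruleset somebody might actually maintain and one nobody will.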
Oh, and if you mess up the Ultimate ACL, you just broke the datacenter. That’s rapidly going to become a CLM (Career Limiting Move). Two or three strikes and you’re out?
The reality check? The firewall goes in with “permit ip any any” configured. Some time later, somebody sticks their neck out and starts building rules. First time they make a mistake, all rules efforts cease. The firewall then becomes an expensive speed bump / bottleneck and dust collector.
This is why I like the Cisco services architecture and firewall contexts. Firewall contexts modularize the rules. I’ve had a discussion about this with a former co-worker: yes, Eric, you might have to replicate some common ACL rules across contexts, but each context is much more narrowly scoped and breaks less of the datacenter when you get it wrong. That seems like a win to me!
The big thing I like about the Cisco approach and contexts is modularity and incrementality. [Assuming the latter is actually a word.] You can incrementally firewall one or a few VLANs of servers, and incrementally deploy protection where you need it most — without major impact on the rest of the datacenter. Modularity in the sense that if you run out of capacity, you can just add another firewall pair, and use its contexts to service more VLANs worth of servers. That scales a whole lot better!
What about the other point of view, the Security one? I’d like to think I’m not shabby at understanding how others think. This one just puzzles me; it comes down to the reality check versus, what, extreme optimism?
If you think about it, tightly specifying allowed and denied traffic is a massive task, especially considering how poorly documented most apps are. Yeah, they’ll list a few ports. If you’re lucky, they’ll even tell you if the port is a server port or a client-side port. (Yes, it makes a difference.) Good luck finding out which other servers (IP addresses) a given server needs to legitimately talk to. So the best you can achieve is to allow all the traffic the app needs and not much else. Oh, but what about functionality that runs once a year? Good luck with that!
I’ve been down this path in my relative youth, with QoS and trying to specify the heck out of my QoS traffic classes. To make sure no improper traffic could masquerade as voice. You know what? I think I went through a lot of work making life harder for myself. Now I try to balance gain versus cost. What are the odds somebody is going to want to, and be able to, transmit traffic marked with DSCP setting EF on a voice VLAN? Slim to none? So do I really need to be super-careful, verify it’s a Cisco phone out there, check for unexpected DSCP markings, check source IP, check for the payload type with deep inspection?
Programming / networking lesson learned: too much error checking creates more problems than it solves. I’ve been seeing that lately in networking too: IPS false positives blocking some Cisco guest WLAN by triggering on some CAPWAP tunneled traffic from AP to DMZ controller. Ditto with application traffic.
A topic for another time: do you really want to use your Server Load Balancer (or F5 uber-box?) as a firewall? Which is best exposed to hackers, a purpose-built firewall or a SLB? Or do both have their own strengths and weaknesses?
[And thanks to Ivan P for his hand-sketches in blogs, it got me playing with HTML editors and drawing tools on the iPad. Let’s just say informal hand-sketches on a PC aren’t necessarily easy?]
One response to “Security in the Datacenter”
Thanks for your comments.
I agree about switch ACLs being high-speed (wire speed) and have made that argument at times. It seems to mostly irritate security people (hence my crack about firewall = security in the article).
I’ve been talking to people about "security vaults" to emphasize that you secure the valuables (in, say, firewall contexts), not everything. So to me, switch ACLs are how you block the usual suspects, like Microsoft ports that shouldn’t be getting used, at least not from the outside.
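For concreteness, blocking "the usual suspects" at the switch might look something like the following IOS-style fragment. This is a hypothetical sketch — the ACL name, VLAN number, and exact port list are made up for illustration, not a recommendation for any particular environment:

```
! Hypothetical example: deny well-known Microsoft / NetBIOS ports
! at the server VLAN edge, permit everything else at wire speed.
ip access-list extended BLOCK-USUAL-SUSPECTS
 deny   tcp any any eq 135
 deny   udp any any eq netbios-ns
 deny   udp any any eq netbios-dgm
 deny   tcp any any eq 139
 deny   tcp any any eq 445
 permit ip any any
!
interface Vlan100
 description server VLAN (hypothetical)
 ip access-group BLOCK-USUAL-SUSPECTS in
```

The point of the trailing "permit ip any any" is that this is coarse vault-door filtering, not a stateful per-application policy — that finer-grained work stays on the firewall contexts in front of the valuables.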
Re state, people use the argument "it’s easier for ensuring legitimate return traffic". Which has some validity. There’s also the "protects the host from SYN attack" sort of argument. (And then the firewall takes the hit in place of some server?) What I keep noticing is that state in various forms (firewall, NAT) gets in the way of failover, e.g. between diverse Internet links. I’m not sure I see a good answer there, even if your exposed servers have public addresses on them (or the VIP on the SLB does).
I probably should have made the point that some similar arguments apply also to IPS and anything else in front of the entire datacenter, or the Internet link, although capacity is much less of a problem on the Internet side of things. If I have a message, it’s that you have to watch the capacity, and structure things to give yourself some flexibility if you’re close to maxing out the box.
Someone suggested I check out some of the Palo Alto Networks literature to get their perspective, and I plan to do that. Right now, where I’m coming from is that I’ve seen app problems that turned out to be firewall or IPS or web proxy AV capacity problems — and some of those boxes really don’t make it too easy to figure out that they’re where you’re dropping packets. It’s not good when you’re troubleshooting by guessing that since you’re doing more than, say, 80% of the published capacity of a box, that’s probably the cause of the problems.