I’ve recently been discussing Security in the Datacenter with a consulting customer. Their security folks are getting a lot of management support in one of the more stringent security pushes I’ve seen to date. The security team fell in love with Palo Alto Networks firewalls (UTMs), bought some big ones, and wants to stick them in the middle of the datacenter, controlling not only all traffic between users and servers, but also all traffic within the datacenter. They also feel the firewalls should become the default gateway for all servers, and while they’re at it, might as well take over routing for the datacenter. (None of this is necessarily a Palo Alto Networks problem; I’ve run into similar things with other brands, including Cisco.)
You can probably imagine how stunned all that left the network team feeling.
Lesser forms of this have shown up, some driven by “it’s a Cisco security best practice”. Some of the Cisco SRND / Design Zone diagrams look like there’s a firewall between the campus / users and the datacenter. There’s a Nexus conceptual diagram showing Agg and Sub-Agg VDCs on the two sides of a firewall running contexts. The fine print (an afterthought) suggests that non-firewalled server VLANs live on the Agg VDC, i.e. on the Core and user-facing side of the firewall(s). Unless you catch that fine print, you might design to put everything behind the firewall.
Well, what’s wrong with that? Isn’t it more secure?
My answer: yes and no. (Your answer might vary. Security’s answer is generally that if it has a firewall in it, it MUST be more secure — or at least, that’s what I sometimes think I’m hearing. No offense intended here to security staff that do not have an adversarial relationship with the network group.)
My biggest concern here would be that the firewall has much tighter bandwidth capacity limits than that expensive switch set you bought for your datacenter. If you place a firewall in the user-to-datacenter path, that might not be too big a problem (or cost). If all you’re doing is dual 10 Gbps links to the datacenter, no problem. For the last few years, I’ve been figuring every closet has 10 Gbps to the Distribution switch, and with Nexus that might well be the same chassis as the building Core switch and the datacenter Aggregation switch, possibly all in the same VDC, possibly not.
Ok, so that works if the user to datacenter connection is slim. Perhaps your building switches aggregate into a building distribution and core, and then all you have is 2 x 10 Gbps links to the datacenter core. That might be do-able. What happens when you outgrow it?
The situation becomes a bit more extreme when your closets and your datacenter all come together into the same switch. Now if you force your user VLANs to run through the firewall to get to the server VLANs, you’re essentially replacing the backplane or fabric in the switch (massive forwarding capability) with the forwarding performance of the firewall (dare I say “puny”?).
Do you really want to replace a giant N x 1 Tbps fabric with a firewall, one that with a tailwind might achieve perhaps 30 Gbps or so of throughput? [I know, this article will instantly become out of date when I put numbers like that in it.]
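To make the mismatch concrete, here’s a back-of-the-envelope sketch (the numbers are my illustrative figures from above, not vendor specs):

```python
# Rough comparison: how much of the switch fabric you actually get to use
# when inter-VLAN traffic is forced through the firewall. Numbers are
# illustrative assumptions, not datasheet values.
fabric_gbps = 1000     # a ~1 Tbps class fabric (N x 1 Tbps in bigger boxes)
firewall_gbps = 30     # optimistic firewall throughput, "with a tailwind"

utilization = firewall_gbps / fabric_gbps
print(f"Firewalled path delivers {utilization:.0%} of the fabric capacity")
```

Roughly 3%: the other 97% of the forwarding capacity you paid for sits idle for any traffic that has to cross the firewall.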
I don’t think that’s a particularly good idea.
The same applies to routing amenities (code features, quality of routing implementation).
L3 switches are good at routing (well, Cisco’s are). Firewalls generally have some semblance of routing, but at best it’s RIP (aka “network malpractice”) and OSPF. If you do EIGRP, good luck, unless you bought an ASA. I’ve heard of all sorts of oddities in firewall OSPF implementations, like not summarizing or not doing OSPF ABR particularly well. So I try to keep it simple as far as how much routing complexity I put into the firewall.
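For what it’s worth, “keeping it simple” on an ASA looks roughly like the sketch below: advertise only what the firewall must, and leave summarization and the heavy routing lifting to the adjacent L3 switches. (Interface subnet and process number are made up for illustration.)

```
! Sketch only -- hypothetical addressing, minimal OSPF on an ASA
router ospf 1
 network 10.1.1.0 255.255.255.0 area 0
! Advertise just the firewall transit subnet; let the L3 switches
! handle summarization, ABR duties, and everything else.
```

The less the firewall participates in the routing design, the fewer of those OSPF implementation oddities you get to discover in production.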
I have a second concern with The Firewall That Rules Them All (or perhaps “It Who Must Be Obeyed”). My concern is the access list (ACL) ruleset for it. Let’s do a little reality check here.
How big and complex is that ruleset going to be? Especially if it has to deal with traffic between any pair of VLANs. Hmm, if you have 100 VLANs, that’s 100 x 99 = 9,900 directional VLAN pairs to maintain rules for. Yup, that’ll be fun. Zones might work a bit better. I like numbers such as 3, 4, or 5. Five zones, that might be do-able.
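The combinatorics are easy to sketch, assuming one ruleset per ordered (directional) pair of segments:

```python
def rulesets(n):
    """Ordered pairs: one ruleset per direction between n segments."""
    return n * (n - 1)

print(rulesets(100))  # 100 VLANs -> 9900 directional pairs
print(rulesets(5))    # 5 zones   ->   20 directional pairs
```

Twenty rulesets is something a human team can actually audit; nine thousand, not so much.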
Oh, and if you mess up the Ultimate ACL, you just broke the datacenter. That’s rapidly going to become a CLM (Career Limiting Move). Two or three strikes and you’re out?
The reality check? The firewall goes in with “permit ip any any” configured. Some time later, somebody sticks their neck out and starts building rules. First time they make a mistake, all rules efforts cease. The firewall then becomes an expensive speed bump / bottleneck and dust collector.
This is why I like the Cisco services architecture and firewall contexts. Firewall contexts modularize the configuration. I’ve had this discussion with a former co-worker: yes, Eric, you might have to replicate some common ACL rules across contexts, but each context is much more narrowly scoped, so a mistake breaks much less of the datacenter. That seems like a win to me!
The big thing I like about the Cisco approach and contexts is modularity and incrementality. [Assuming the latter is actually a word.] You can incrementally firewall one or a few VLANs of servers, deploying protection where you need it most, without major impact on the rest of the datacenter. Modularity in the sense that if you run out of capacity, you can just add another firewall pair and use its contexts to service more VLANs’ worth of servers. That scales a whole lot better!
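As a sketch of what that incremental approach looks like on an ASA in multiple-context mode (context names, interfaces, and file names below are all hypothetical), each context owns only the server VLANs it protects:

```
! Sketch only -- hypothetical names, ASA multiple-context mode
mode multiple
!
context web-tier
 allocate-interface GigabitEthernet0/1.110
 config-url disk0:/web-tier.cfg
!
context db-tier
 allocate-interface GigabitEthernet0/1.120
 config-url disk0:/db-tier.cfg
```

Non-firewalled server VLANs stay out front on the Agg VDC; when you’re ready to protect another tier, you add a context (or eventually another firewall pair) rather than re-plumbing the whole datacenter.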
What about the other point of view, the Security one? I’d like to think I’m not shabby at understanding how others think, but this one just puzzles me. It comes down to the reality check above versus, what, extreme optimism?
If you think about it, tightly specifying allowed and denied traffic is a massive task, especially considering how poorly documented most apps are. Yeah, they’ll list a few ports. If you’re lucky, they’ll even tell you if the port is a server port or a client-side port. (Yes, it makes a difference.) Good luck finding out which other servers (IP addresses) a given server needs to legitimately talk to. So the best you can achieve is to allow all the traffic the app needs and not much else. Oh, but what about functionality that runs once a year? Good luck with that!
I’ve been down this path in my relative youth, with QoS and trying to specify the heck out of my QoS traffic classes, to make sure no improper traffic could masquerade as voice. You know what? I think I went through a lot of work making life harder for myself. Now I try to balance gain versus cost. What are the odds somebody is going to want to, and be able to, transmit traffic marked with DSCP EF on a voice VLAN? Slim to none? So do I really need to be super-careful: verify it’s a Cisco phone out there, check for unexpected DSCP markings, check the source IP, check the payload type with deep inspection?
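In config terms, the contrast looks something like this IOS-style sketch (class and ACL names invented for illustration):

```
! The belt-and-suspenders version I used to write (sketch only):
class-map match-all VOICE-PARANOID
 match dscp ef
 match access-group name VOICE-SUBNETS
!
! versus what usually suffices on a trusted voice VLAN:
class-map match-any VOICE
 match dscp ef
```

The first version buys marginal protection at the cost of an ACL that must track every voice subnet forever; the second matches on trust boundaries you’ve already established.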
Programming / networking lesson learned: too much error checking creates more problems than it solves. I’ve been seeing that lately in networking too: IPS false positives breaking a Cisco guest WLAN by triggering on CAPWAP-tunneled traffic from AP to the DMZ controller. Ditto with application traffic.
A topic for another time: do you really want to use your Server Load Balancer (or F5 uber-box?) as a firewall? Which is best exposed to hackers, a purpose-built firewall or a SLB? Or do both have their own strengths and weaknesses?
[And thanks to Ivan P for the hand-sketches in his blogs; they got me playing with HTML editors and drawing tools on the iPad. Let’s just say informal hand-sketches on a PC aren’t necessarily easy?]