This blog discusses a problem that several of my NetCraftsmen peers and I are seeing.
Once I explain, I’ll leave it up to you: Is this phenomenon real? Is it a problem at your site?
TL;DR: Organizations may need one or more dedicated persons who own NetFlow tools and understand traffic flows, especially application flows, on the network. The “flowmeister”?
There’s a related job, maintaining access lists, ACI contracts, etc. We’ll look at that briefly here and some more in a later blog.
I’m going to use “access-list rules” as a vague term, including ACI contracts and other access-list rule-like policies.
The following are observed data points. They may or may not apply to your organization:
- Server/VM admins in many organizations know little about the applications running on their platforms. In many cases, they’re the 2nd or 3rd generation “owner,” and with each handoff between owners, knowledge has been lost. Or they just followed instructions to install the app but know little about it. The admin sees that software patches and updates are applied, and backups are done, but lacks visibility into whether app behavior and traffic flows have changed.
- New apps or patches and upgrades are often installed by consultants or the app vendor. Ports and services they consume or provide are hastily documented at best. What happens to any documentation like that may vary widely. Documentation of flows is not required before the consultants or vendors get paid.
- In short, you probably shouldn’t trust what’s documented. It’s a starting point. Trust but verify?
- Security staff is busy monitoring alerts from their various tools, managing NAC and roles, etc. They may or may not be involved in generating access list rules for new apps. My impression is: usually NOT very involved.
- Networking doesn’t really want to own access list rules but ends up getting stuck with it.
- Whatever team ends up owning access list rules, they are usually understaffed, so they have little time or inclination to alter the rules, or even to check whether they still make sense.
There’s one major principle that applies to access lists. I’ll call it “Blame’s Law of Inertia.”
- Nobody likes changing access list rules. If you break something, then everyone hates you and blames you. Doing nothing may not improve security but doesn’t break anything.
- In addition, reviewing the ACL rules app by app is very time-consuming. So inertia definitely applies there.
So, what’s wrong with that?
- Nobody owns getting rid of old access-list rules; they just accumulate. (And possibly fill up ACI TCAM, eventually.)
- Access list rules may not be well-organized, e.g., by source, destination, and port. Random rules are hard to understand and maintain.
- New apps often start out with “permit any any” to get them up and running. The intent is to observe the app flows and then tighten things down, but the “tighten” part never happens. Or only partly happens. Yes, inertia!
There is a related phenomenon that can be seen with what I’ll call flow tools.
Flow tools: any kind of NetFlow, sFlow, etc. collector, e.g., Cisco StealthWatch, or many others. Or tools (agents and central platform) like Cisco Tetration (ahem, “Cisco Secure Workload” = “CSW” – I have to go look it up every time, “Tetration” was cool and memorable!), Illumio, etc.
What seems to happen is that someone buys such a tool. But starting to use it takes a lot of work. You might have to install agents on “all” servers and VMs. Or enable NetFlow etc., on a lot of network devices. That may require code upgrades. And all the change process for that takes time too.
The follow-on stage with such a tool is analyzing the flow data. What is talking to what, using which protocols and ports? Someone has to analyze all that. Or, at the very least, after some suitable period of time, tell the flow app to generate an access list entry for you, or write one manually. Then put that in place and move on to the next case, batch of servers and apps, whatever. With Tetration (aka “CSW”), Cisco was up-front about that, emphasizing picking a service or app and tackling it, working up groups of servers/ports (DNS servers, MS AD servers, NFS storage, etc.) and giving the groups meaningful names. Etc.
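The “what is talking to what” step above can be sketched in a few lines. This is a minimal, hypothetical illustration (the flow records and IPs are made up, not from any real collector export): tally flows by source, destination, port, and protocol, then print the heaviest talkers as candidate ACL entries for a human to review.

```python
from collections import Counter

# Hypothetical flow records as you might export from a NetFlow/sFlow
# collector: (src_ip, dst_ip, dst_port, protocol). All values are made up.
flows = [
    ("10.1.1.10", "10.2.2.20", 53, "udp"),   # DNS lookups
    ("10.1.1.10", "10.2.2.20", 53, "udp"),
    ("10.1.1.11", "10.2.2.30", 445, "tcp"),  # SMB to a file server
    ("10.1.1.10", "10.2.2.40", 443, "tcp"),  # HTTPS to an app server
]

# Tally who talks to whom, on which port, to surface candidate ACL entries.
talkers = Counter(flows)
for (src, dst, port, proto), count in talkers.most_common():
    print(f"permit {proto} host {src} host {dst} eq {port}  ! seen {count}x")
```

A real deployment would feed this from the flow tool’s export and group hosts into named clusters (DNS servers, AD servers, etc.) rather than emitting one line per host pair, but the aggregation idea is the same.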
And here are the KEY POINTS:
- You have to budget the time if you want flow analysis/ACL review to happen (upfront, periodically afterward, and for new servers/apps).
- You also need comprehensive coverage.
The latter may entail licensing and other costs: a flow-volume license, storage for all the flow data over a longer time period so you can do trending if desired, and perhaps compute resources for faster reporting to make better use of staff time. On that last point: pulling up data in some flow tools can be SLOW. Painful enough that staff won’t want to use the tool.
Oh, and quarterly/annual/whatever reviews (app audits?) also need to have time allocated. Apps change. If new ports or resources are used, you’ll get gripes and go fix the rules. But if they stop using ports or other resources, will you ever notice and adjust the access lists?
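That last question, noticing permits that no longer match any traffic, is essentially a set difference. Here’s a minimal sketch (the rule and flow data are invented for illustration): compare the (destination, port) pairs your rules permit against the pairs actually seen in recent flow data, and flag the permits with no matching traffic as candidates for removal.

```python
# Hypothetical audit: (dst_ip, port) pairs currently permitted by ACL rules.
permitted = {
    ("10.2.2.20", 53),    # DNS
    ("10.2.2.30", 445),   # SMB
    ("10.2.2.50", 8080),  # old app port, possibly no longer used
}

# (dst_ip, port) pairs actually observed in the review period's flow data.
observed = {
    ("10.2.2.20", 53),
    ("10.2.2.30", 445),
}

# Permits with no matching traffic are candidates for removal (after review!).
stale = permitted - observed
print("candidate stale permits:", sorted(stale))
```

The output is a to-investigate list, not an auto-delete list: the flow data window may simply have missed an annual batch job, which is exactly why a human review step matters.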
One Key Thing
If you’re trying to analyze flows, your best friend could be IP reverse lookup (IP to DNS name). Except reverse DNS is often not implemented well, in some cases because the DNS server requires double entry of essentially the same data (Microsoft DNS used to; I can’t speak for current MS code).
And I should note that a security person might want to disable reverse DNS to hinder a hacker from finding pots-o-money.bigcorp.com. A better answer might be to limit outside DNS views of server names, or to use names that do not reflect the app or its purpose. There are probably lots of other views on this subject.
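Annotating flow data with reverse lookups is easy to automate. A small sketch: wrap the lookup so hosts without PTR records just keep their raw IPs. The demo below injects a stand-in resolver with made-up data so it runs anywhere; a real run would use `socket.gethostbyaddr` directly, as the function does by default.

```python
import socket

def name_for(ip, resolver=socket.gethostbyaddr):
    """Best-effort reverse lookup: return the PTR name, else the raw IP."""
    try:
        return resolver(ip)[0]
    except (socket.herror, socket.gaierror, OSError):
        return ip

# Stand-in resolver with invented data, so the demo needs no live DNS.
fake_ptr = {"10.2.2.20": "dns01.example.internal"}

def demo_resolver(ip):
    if ip in fake_ptr:
        return (fake_ptr[ip], [], [ip])
    raise socket.herror(1, "no PTR record")

src, dst, port = "10.1.1.10", "10.2.2.20", 53
print(f"{name_for(src, demo_resolver)} -> {name_for(dst, demo_resolver)} port {port}")
# -> 10.1.1.10 -> dns01.example.internal port 53
```

Even a partial PTR zone turns a wall of IPs into something an app owner can recognize, which is most of the battle in flow review.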
Hey, I’m not a full-time security person. What I’ve seen in firewall rules tends to look like a mess that evolved over time.
And recent vicarious experience (thanks, Carole) indicates that deciphering large ACI rule deployments can be challenging and time-consuming. (Especially the large matrix of EPG source to EPG destination pairs doing “allow all.”)
There is perhaps a lesson to be learned from Fibre Channel best practices, where you typically organize your zones (storage ACLs) around the source server and the storage it needs to access. Doing so localizes all the rules involving any given source, easing maintenance.
For access lists, it probably makes sense to do something similar while recognizing that groups of similar servers might be the source? Or group by service/consuming app? Or both, using a naming convention.
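To make the source-grouping idea concrete, here’s a small sketch (group names and rules are invented): sort a flat rule list by source group so that, Fibre-Channel-zone style, everything involving a given source sits together for review.

```python
from itertools import groupby

# Hypothetical flat rule list: (source_group, dest_group, port, protocol).
rules = [
    ("web-tier", "db-tier", 5432, "tcp"),
    ("backup-srv", "db-tier", 22, "tcp"),
    ("web-tier", "dns-servers", 53, "udp"),
    ("backup-srv", "web-tier", 22, "tcp"),
]

# Sort and group by source so all rules for a given source group are
# adjacent, easing review and maintenance.
rules.sort(key=lambda r: (r[0], r[1], r[2]))
for src, group in groupby(rules, key=lambda r: r[0]):
    print(f"! rules for source group {src}")
    for _, dst, port, proto in group:
        print(f"  permit {proto} {src} -> {dst} eq {port}")
```

The same data could just as easily be grouped by destination service instead; the point is picking one convention and sticking to it, so a reviewer always knows where to look.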
In the case of a micro-segmentation (host by host enforcement!) tool such as Illumio or Tetration (“CSW”) agent, your deployed enforcement will actually be just the rules applying to a given host. Possibly extracted from higher-level groups.
We’ve discussed starting with ACL rules and trying to understand, organize, and verify them, and starting with flow tools and trying to obtain flow data for verifying ACL rules. Either way, there’s a real problem of getting your arms around the actual flows and the permitted flows, and documenting policy clearly, in an easily maintained form. Well, “easy” is a word that likely should never be applied to either topic.
I’m a big believer in ownership, as everyone should know what tasks they are responsible for and have realistic time allotted for them to do the tasks they own.
With access lists, someone often gets the job “dumped” on them, but not the time. The same applies to flow data – potentially, it can detect malicious activity, security gaps, what ate your Internet bandwidth, and so on. But ramp-up time is a major obstacle.
Need I note that this is NOT good for overall security?
Above, I suggested some simple ways to organize ACL rules to make them possibly a bit easier to maintain (and understand?).
Food for thought (I’m all out of answers today):
- How could vendors make this easier?
- How could you make this easier?