Is NAT evil? To be avoided at all costs? I’ve been mulling over this topic for a while. I wish I’d invented the title concept, but a Google search shows it has clearly been around for a while.
When is NAT appropriate? Can (and should) it be done away with?
Recent Blogs About NAT
Recently, two good blog posts about NAT have appeared and jump-started my thinking.
Geoff Huston, chief scientist at APNIC, blogged about NAT — in part NAT versus IPv6 in terms of address space and where the marginal costs fall (IPv6 being longer-term with little short-term benefit).
In my words, his blog ends up concluding:
- We can squeeze more out of NAT to extend the lifetime of IPv4
- Names are essential glue (I’m not sure that ties to NAT, however — services consist of IP + port but names only map to IP — and NAT usually doesn’t tie to names)
- There may be some future IPv6 middleware value to NAT for indirection and dynamic binding
Part of what he says (the bit-counting part) reminds me of some of the service provider techniques for NAT46 to transport IPv4 over an IPv6 network (e.g., where an IPv4 address plus port range statelessly maps to a unique IPv6 address). The always-impressive Scott Hogg has a great three-part introductory blog series about this topic, IPv4aaS. This is a deep topic, but let’s just say NAT46 and NAT64 can be extra problematic.
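To make the "IPv4 + port range maps statelessly to a unique IPv6 address" idea concrete, here is a minimal sketch. It is loosely in the spirit of the MAP family of techniques but is not a faithful implementation of any of them: the prefix, the bit layout, the `PSID_BITS` value, and the contiguous port blocks are all assumptions chosen for readability.

```python
import ipaddress

# Hypothetical values for illustration only; not from any real deployment.
RULE_V6_PREFIX = ipaddress.IPv6Network("2001:db8::/40")
PSID_BITS = 6  # 2**6 = 64 customers share each public IPv4 address


def port_range(psid):
    """Contiguous block of ports owned by one port-set ID (a simplification)."""
    size = 65536 >> PSID_BITS  # ports per customer
    return range(psid * size, (psid + 1) * size)


def to_ipv6(ipv4, port):
    """Derive a customer's IPv6 address from shared IPv4 + port, with no state table."""
    v4 = int(ipaddress.IPv4Address(ipv4))
    psid = port >> (16 - PSID_BITS)  # top PSID_BITS bits of the port
    # Assumed layout: /40 prefix | 32-bit IPv4 address | PSID | zero padding
    suffix = (v4 << PSID_BITS) | psid
    shift = 128 - RULE_V6_PREFIX.prefixlen - 32 - PSID_BITS
    base = int(RULE_V6_PREFIX.network_address)
    return ipaddress.IPv6Address(base | (suffix << shift))
```

Any two ports in the same block on the same IPv4 address yield the same IPv6 address, and a port in the next block yields a different one. That is the whole trick: the mapping is pure arithmetic, so the provider edge needs no per-flow state.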
Tom Hollingsworth commented on Geoff’s blog with a response blog containing some good thoughts. I’m not sure I viewed Geoff as saying NAT is good. It’s more like, “Hey, NAT is acting in a way that might turn out to be useful going forward in some ways.”
Stirring the Pot
All that is interesting and NAT-related, but a little bit tangential.
Let me throw out some use cases I’ve been contemplating.
Use Case 1. Yes, I’m guilty of committing LISP and OTV DCI (datacenter interconnect). Friends don’t let friends …
Anyway, first hop localization (HSRP filters in OTV) and stateful devices are the bane of mobility (think vMotion) in a DCI setting. LISP is somewhat of a bystander; the real problem is tracking which datacenter a VM is in and trying to use that to steer traffic optimally. Doing so is pointless if you can’t move the firewall or SLB state with the VM. Yet splitting HA pairs of devices across DCI means you have one big failure domain (aka datacenter) with added complexity, not two HA datacenters. Such HA pairs, or worse, split clusters of several firewalls, are effectively one big firewall with a backplane subject to external events. That is something I consider highly undesirable, as in, “Let’s build it so it can have massive failures.” NSX clustering across WAN/MAN, ditto.
Meta-considerations aside, maybe you’re stuck doing DCI and wish to avoid state problems. What can you do?
One way to handle stateful devices in such a setting is NAT. Source NAT at the stateful device forces return traffic back through that same device, so flows stay symmetric. It solves a problem!
Unfortunately, if you have, say, two tiers of firewall + SLB, you’d end up with four NAT points. That’s getting ugly!
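The return-symmetry effect can be sketched as a toy model. Everything here is hypothetical: the `Firewall` class, the addresses, and the assumption that the server's default route points at the "wrong" firewall (say, because the VM moved). It is a thought experiment in code, not how any particular product behaves.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Firewall:
    name: str
    snat_ip: Optional[str] = None  # None means no source NAT
    flows: set = field(default_factory=set)

    def forward(self, client, server):
        """Record flow state and return the source IP the server will see."""
        seen_src = self.snat_ip or client
        self.flows.add((seen_src, server))
        return seen_src

    def accepts_reply(self, server, reply_dst):
        """A stateful device only passes replies that match a known flow."""
        return (reply_dst, server) in self.flows


def reply_path(reply_dst, firewalls, default_fw):
    """Replies to a firewall's SNAT address go to that firewall; all else follows the default route."""
    for fw in firewalls:
        if fw.snat_ip == reply_dst:
            return fw
    return default_fw


def demo(use_snat):
    fw_a = Firewall("fw-a", "10.1.0.1" if use_snat else None)
    fw_b = Firewall("fw-b", "10.2.0.1" if use_snat else None)
    # The client enters via fw_a, but the server's default route points at fw_b.
    seen = fw_a.forward("198.51.100.7", "10.9.9.9")
    back = reply_path(seen, [fw_a, fw_b], default_fw=fw_b)
    return back.accepts_reply("10.9.9.9", seen)
```

With `demo(True)` the reply is addressed to fw-a's NAT address, lands on fw-a, and matches its state. With `demo(False)` the reply is addressed to the client, follows the default route to fw-b, and is dropped because fw-b never saw the flow.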
Conclusion: There might be some occasional value to NAT in a stateful device setting. Like many good things, if you overdo it, you’ll get a headache. I’ve been calling this “the beer principle.” Because about two NAT points is the practical limit, in this case it might be more like some sort of “whiskey principle.”
Use Case 1.5. In some recent work, I was discussing incremental migration from Internet to Equinix Cloud Exchange™-based direct connections to various Microsoft cloud entities. There are firewalls in the path, as there should be. One wants to avoid asymmetric flows. The migration concept is to use BGP communities and filters to advertise certain MS prefixes selectively. By default, they are reached via 0/0 to the Internet. Leaking more-specific prefixes steers traffic to the direct connection, because the longest match wins over the default route. And NAT is what ensures symmetric return traffic! So, this is a case of NAT for the win!
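The steering half of that migration is just longest-prefix match, which a few lines can illustrate. The prefixes below are documentation ranges standing in for the real cloud prefixes, and the next-hop labels are made up for the example.

```python
import ipaddress


def next_hop(dst, routes):
    """Pick the route with the longest matching prefix (assumes a default route exists)."""
    addr = ipaddress.ip_address(dst)
    matches = [(net, hop) for net, hop in routes.items() if addr in net]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


# Before migration: everything follows the default route to the Internet.
before = {ipaddress.ip_network("0.0.0.0/0"): "internet"}

# After leaking one more-specific prefix (placeholder, not a real cloud prefix).
after = dict(before)
after[ipaddress.ip_network("203.0.113.0/24")] = "direct-connect"
```

Here `next_hop("203.0.113.10", before)` follows the default route, while the same lookup against `after` picks the direct connection. Destinations outside the leaked prefix are unaffected, which is what makes the migration incremental.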
Use Case 2. If you’ve dealt with stock quote firms (New York, New York), those that have private WAN connections to customers can end up with three, four, or more NAT points along the way. In at least one case, a firm started moving to use a public block as the WAN interconnection mechanism, to try to de-conflict all the NAT (i.e., everybody NATs, hopefully once, to their assigned public IP). Definitely ugly. Yet when everybody is using network 10 (10.0.0.0/8), what else can you do? Yes, IPv6 solves that. Not holding my breath. Conclusion: Multiple NAT here is ugly. But is it inevitable?
Use Case 3. Docker containers. Let’s stick with Docker here, for simplicity — something similar may apply to VMware and the way some use Zerto “bubbles,” but I can’t find any good links on those topics.
If I have a container running a service built with, say, 192.168.1.0/24 and the container networking layer NATs that to a unique address (public or otherwise), is that good or bad?
The good news is I can copy the image and reuse it without modifications, using Docker networking to NAT it to a different IP that reflects the location.
Insight/Question: If the outside world can’t tell there is NAT, do we care? I’m coming to the conclusion that for most purposes, we don’t care. But if we have to troubleshoot the application or service and don’t know the NAT address mapping, that could get interesting. Basically, it is potentially one more layer of confusion and missing documentation.
Note also that the NAT state is very localized. That’s why I feel this NAT situation is less “evil.”
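For the troubleshooting worry, even a tiny recorded mapping goes a long way. The sketch below is purely illustrative: the site names, addresses, and ports are invented, and in practice the data would come from whatever the container platform reports rather than a hand-maintained table.

```python
# (site, inside IP, inside port) -> (outside IP, outside port)
# The same image, with the same inside address, deployed at two hypothetical sites.
NAT_MAP = {
    ("site-a", "192.168.1.10", 8080): ("10.20.0.5", 80),
    ("site-b", "192.168.1.10", 8080): ("10.30.0.5", 80),
}


def inside_of(outside_ip, outside_port):
    """Reverse lookup: which container instance sits behind this outside address?"""
    for (site, ip, port), outside in NAT_MAP.items():
        if outside == (outside_ip, outside_port):
            return site, ip, port
    return None  # no documented mapping
```

The point is that the inside address alone is ambiguous (both copies are 192.168.1.10), so the outside address plus the mapping is what identifies an instance when something breaks.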
Two more brief use cases (at least) come to mind:
Use Case 4. Firewalls doing NAT between public IP blocks and internal addressing. This we’re all familiar with. Not much alternative with IPv4.
Use Case 5. A site with no NAT: The web front end is publicly addressed. It uses local private addresses for all its services, both in production and for the DR site. The point here, perhaps, is that if your edge services are re-addressable to represent location, and if they use local back-end services only, you don’t need NAT. Or for that matter, you might NAT the front end but nothing else. The attraction here might be using full clones of VMs for the backend devices; no need for vMotion or DCI to manage copying them if the two sites are active/passive.
Reaching a Conclusion
So, what is it that makes NAT undesirable?
- Complexity of tracking two IP addresses and appropriate context for each
- It rarely gets documented
- Opaqueness for troubleshooting (I rarely have direct access to the firewall doing the NAT when troubleshooting)
And what makes NAT useful, or at least necessary?
- Return traffic to a stateful device (e.g., SNAT for SLB return traffic, symmetric return traffic when a small business has two ISP connections and uses provider-supplied addressing).
- A special case of that: Controlling flow symmetry through firewalls to cloud partners.
- (From Paul Gear’s blog response to Tom H’s blog above): Hiding internal network address structure.
The preceding Docker use case suggests that most of the objections to NAT go away or become minimized in that particular setting.
What do you think?
Concerning designing with NAT:
- NAT seems to have occasional ugly — but necessary — uses, state preservation being a big one.
- I still want to avoid it wherever possible. That includes Docker containers; though maybe there, I’ll settle for good documentation (not that I ever count on having that).
- Doing NAT in one place rather than scattered across several places helps when managing and troubleshooting it. Don’t have some on the border router and other NAT on the firewall, for instance.
- IPv6 might find NAT or something similar useful. I say that despite the end-to-end addressing purists. I’m not holding my breath for an implementation of IPv6 NAT. Admittedly, all the yelling and screaming that would ensue might be interesting/amusing. (See also a Google search for NAT66, or for Cisco and NPTv6. Apparently, there is demand for this!)
Comments are welcome, whether in agreement or in constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!