I just spent some time on an interesting and somewhat obscure ASA troubleshooting problem. It ended up being resolved by a note in some of the Cisco web pages, something I suspect is an often-missed but important little tidbit. I also suspect it is quietly a problem or irritant for everyone who has missed it. It applies to any router or firewall doing NAT.
Background
I recently moved a small university from DS-3 Internet to 155 Mbps fixed WiMax, and their remote branch from T1 to 15 Mbps. The remote branch needed a VPN tunnel from an ISR back to the main campus ASA.
A little while after this, the main campus was noticing slowness on certain web pages — a shared library search and inter-library loan service. The main search was fine, but drilling down for book details often took 10, 20, or even 80 seconds. No other noticeable slowness. The satellite campus was having no problems.
What We Did
We were concerned about the WiMax connection and possible interference or noise bursts, or whether something hadn’t quite carried over well in migrating the ASA configuration. After all, traffic through the router at the remote site was fine, suggesting something about the link or ASA at the main site. So we spent some time looking for problems and not finding any.
To nail down whether the problem was the ASA, we asked the WiMax provider for a 2nd /30. That allowed us to put a PC outside the ASA and test. Performance was great on the problem site! That suggested something in the ASA, perhaps NAT behavior triggered by faster speed, although you’d think that would affect all external traffic. We looked at the ASA some more, then tried putting the address from the /30 on the spare (former outside) interface. We used static routes to force traffic to the problematic website through that interface. When the static route was in place, no problems. Remove it, problems.
That suggested something about the new outside interface configuration (long access list, NAT) … but why was it only selectively affecting one site?
Resolution
At this point we were running out of ideas. We had a good this way / bad that way situation, which is usually a strong position to be in for troubleshooting. We had tried putting in static NAT for the one site, or a first ACL rule to ensure ACL length wasn’t a factor, and neither mattered.
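Maybe it’s a guy thing (not asking for help or directions) … we finally did a Google search, and up came some Cisco items, including several like http://www.cisco.com/application/pdf/paws/22040/pixperformance.pdf. The relevant note: when doing NAT, make sure reverse DNS is set up for the addresses in the NAT block.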
We checked, and sure enough the new /30 had reverse DNS provided by the WiMax carrier. The block assigned for NAT by the campus did not, possibly because it was considered temporary while the campus adjusted its NAT rules. As for the remote campus, yes, reverse DNS worked there.
Thinking it over, we suspect the web site in question had “naive” logging, which did a reverse DNS lookup of the client IP address in order to log the DNS name of the querying site. If the application waits on the DNS reply before writing the log entry and returning the page, then DNS timeouts can cause considerable delays. I’ve seen this before in other forms; it just had not occurred to me when I heard the symptoms.
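To make the suspected behavior concrete, here is a minimal Python sketch of "naive" logging next to IP-only logging. We never saw the web site's code, so everything here is an assumption for illustration: the handler names, the `203.0.113.10` address (a documentation range), and the `slow_failing_resolver` stand-in that imitates a lookup timing out for an address with no PTR record.

```python
import socket
import time


def reverse_name(ip, resolver=socket.gethostbyaddr):
    """Return the PTR name for ip, or the IP itself if the lookup fails."""
    try:
        return resolver(ip)[0]  # blocks until the resolver answers or gives up
    except OSError:  # socket.herror and socket.gaierror are OSError subclasses
        return ip


def serve_naive(client_ip, log, resolver=socket.gethostbyaddr):
    """Hypothetical handler: logs the client's DNS name before replying."""
    log.append(reverse_name(client_ip, resolver))
    return "book details"


def serve_ip_only(client_ip, log):
    """Hypothetical handler: logs the raw IP and replies immediately."""
    log.append(client_ip)
    return "book details"


def slow_failing_resolver(ip, delay=0.2):
    """Stand-in for a resolver timing out on an address with no PTR record."""
    time.sleep(delay)
    raise socket.herror(1, "Unknown host")


if __name__ == "__main__":
    log = []
    start = time.monotonic()
    serve_naive("203.0.113.10", log, resolver=slow_failing_resolver)
    print(f"naive handler took {time.monotonic() - start:.2f}s, logged {log[-1]}")
```

The stand-in delay here is a fraction of a second. A real resolver walking through several DNS servers can take far longer, which would fit the 10 to 80 second waits we saw.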
For what it’s worth, someone we know did work with CNN on a system for reverse DNS focused on quick cache search, the point being to either log the DNS name quickly or initiate lookup and caching but not hold up the main web results display. That’s why I used the term “naive” logging: the problem, if any, was a somewhat poorly written application that didn’t consider the impact of a failed DNS lookup on user experience. But the ASA was taking the blame! (“The network is at fault.”)
From my perspective, the application could have been clever (like CNN), or could have just logged IP addresses for offline post-processing (where the delays would not matter to the end user).
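The "clever" option can be sketched roughly as follows. This is not the CNN system, whose internals I don't know. It is only a minimal Python illustration of the idea: answer from a cache if the name is already known, otherwise log the IP right away and resolve in the background. The class and method names are hypothetical.

```python
import socket
import threading
from concurrent.futures import ThreadPoolExecutor


class ReverseDnsCache:
    """Hypothetical cache: never makes the caller wait on a DNS lookup."""

    def __init__(self, resolver=socket.gethostbyaddr, workers=4):
        self._resolver = resolver
        self._names = {}      # ip -> resolved name (or the ip if lookup failed)
        self._pending = {}    # ip -> Future for a lookup still in flight
        self._lock = threading.Lock()
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def _resolve(self, ip):
        try:
            name = self._resolver(ip)[0]
        except OSError:
            name = ip  # cache the failure too; a real system would expire it
        with self._lock:
            self._names[ip] = name
            self._pending.pop(ip, None)

    def name_for_log(self, ip):
        """Return the cached name, or the IP itself while a lookup runs."""
        with self._lock:
            if ip in self._names:
                return self._names[ip]
            if ip not in self._pending:
                self._pending[ip] = self._pool.submit(self._resolve, ip)
        return ip

    def drain(self):
        """Wait for in-flight lookups (useful in tests or at shutdown)."""
        with self._lock:
            futures = list(self._pending.values())
        for future in futures:
            future.result()
```

With something like this, the first hit from an unknown address is logged by IP, and later hits pick up the name once the background lookup finishes. A missing PTR record then costs a worker thread some time rather than costing the user a page load.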
Practical Deployment Conclusion
We repeat what Cisco said … when doing NAT at a router or firewall, you want reverse DNS to be set up for the addresses in the NAT block. Otherwise FTP, HTTP, and other applications that log the client (source) by DNS name rather than IP address may appear slow to users while they wait for DNS timeouts (possibly across multiple DNS servers).
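A quick way to check this before cutting over is to walk the NAT block and see which addresses lack a reverse entry. The following is a minimal Python sketch, not a polished tool. The function name `missing_ptr` is made up, and `203.0.113.0/29` is a documentation range standing in for your real NAT block. It relies on the system resolver, so each failed lookup may take as long as that resolver's timeout.

```python
import ipaddress
import socket


def missing_ptr(cidr, resolver=socket.gethostbyaddr):
    """Return the usable addresses in cidr that have no reverse DNS entry."""
    missing = []
    # hosts() skips the network and broadcast addresses of the block;
    # ip_network() expects the network address itself (no host bits set).
    for addr in ipaddress.ip_network(cidr).hosts():
        try:
            resolver(str(addr))
        except OSError:  # covers socket.herror ("unknown host") and friends
            missing.append(str(addr))
    return missing


if __name__ == "__main__":
    # Placeholder block; substitute the addresses your NAT rules actually use.
    for ip in missing_ptr("203.0.113.0/29"):
        print(f"no reverse DNS for {ip}")
```

Run it from a host that uses public DNS rather than an internal view, since what matters is what the remote web server's resolver will see.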