If you have slow applications, your DNS may be a contributing factor. This is a networking topic in the sense that the network will, of course, get blamed. I ran across DNS and digital certificates as an issue recently, in a post-event analysis of why an application had been slow. I’ve run into this before, and have seen that people don’t usually think about DNS when they’re having certificate or application authentication issues. It seems like an appropriate topic, especially since Google search is turning up little relevant information about the role of DNS in SSL/HTTPS digital certificate authentication. (Google turns up way too many links, and I can’t find a search string that tunes out all the irrelevant or useless articles.)
DNS and SSL? Yes: reverse DNS is used on the IP address of the other party, to verify that the DNS FQDN (fully qualified domain name) matches the name built into the certificate. In other words, the party that sent the packet is the one the certificate belongs to. The Microsoft article at http://support.microsoft.com/kb/257587 says this fairly clearly (paragraph 4).
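To make the check concrete, here is a minimal sketch in Python using the standard library resolver. The `name_matches_cert` helper and its exact, case-insensitive match rule are my own simplification for illustration; real certificate name validation is considerably more involved (wildcards, subject alternative names, and so on).

```python
import socket

def reverse_lookup(ip):
    """Return the PTR (reverse DNS) name for an IP address, or None if
    reverse resolution fails. Uses the system resolver, so timeouts and
    search behavior follow the OS resolver configuration."""
    try:
        name, _aliases, _addresses = socket.gethostbyaddr(ip)
        return name
    except OSError:
        return None

def name_matches_cert(ptr_name, cert_name):
    """Simplified comparison of a reverse-lookup result against the name
    in a certificate. A None PTR result (failed reverse DNS) never matches,
    which is exactly the slow/failing case discussed in this post."""
    return ptr_name is not None and ptr_name.lower() == cert_name.lower()
```

If `reverse_lookup` returns None, whatever software performs a check like this has already sat through the resolver's full timeout sequence, which is where the application slowness comes from.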
If there is no DNS name match to the certificate, your client might be experiencing a Man In The Middle (MITM) attack, i.e. some third-party computer that wishes to insert itself into your secure conversation. Since the SSL key exchange would be with the MITM, acting as a web proxy between the client and the real server, it would be able to read anything sent either way.
Let’s go through what I encountered and what I learned in a little more detail.
The symptom was slow logins to certain key applications, including scheduling operating rooms at a major hospital. Clicking on some links within the applications was also slow. The data captured while the intermittent problem was occurring was a bit thin, but the application log held some interesting entries, and site personnel had used some SPAN ports and OPNET App Xpert to capture packet data.
Testing the 3700+ IP addresses logged by the application as having slow logins showed that all were network 10 addresses that were not being resolved by internal DNS: queries bounced around the internal DNS servers, ultimately went to a root name server, and then failed reverse DNS resolution. My conjecture is that the intermittent slowness might have been due to congestion along the Internet path to the root name servers. The slow link-click response might have occurred where the link went to a different server, hence required fresh certificate validation.
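A sketch of the kind of batch test I ran, assuming you have the logged addresses in a Python list (the half-second threshold in the usage comment is an arbitrary assumption, not a standard):

```python
import socket
import time

def ptr_report(ips):
    """Reverse-resolve each logged IP and record how long the lookup took.
    Failed or slow entries point at addresses your internal DNS
    is not covering."""
    report = []
    for ip in ips:
        start = time.monotonic()
        try:
            name = socket.gethostbyaddr(ip)[0]
        except OSError:
            name = None  # reverse lookup failed (NXDOMAIN, timeout, etc.)
        report.append({"ip": ip, "name": name,
                       "seconds": round(time.monotonic() - start, 3)})
    return report

# Flag anything that failed outright or took longer than half a second:
# suspects = [r for r in ptr_report(logged_ips)
#             if r["name"] is None or r["seconds"] > 0.5]
```

Running something like this against the logged addresses makes the pattern obvious quickly: the slow entries cluster on exactly the addresses with no working PTR record.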
I believe I also saw clients trying to resolve their own IP address, possibly for some other reason. I’m guessing that the self-lookup might have been to choose from among several certificates based on FQDN, which might be necessary for laptops that connect to different networks at different times. Sort of “what is my current FQDN = identity”. I researched this some (Google) and have not found any discussion of it. (In general, the web seems to include a lot of basic “what is a certificate” and “how do I configure PKI” information, but little about the DNS interactions.)
It turned out the site was using Microsoft Dynamic Update DNS (DU-DNS), where clients’ DHCP triggers DNS updates, and apparently had configuration problems, so that some clients were registering with a random Active Directory server. For some reason, maybe a slow or failed zone transfer, such clients were not getting found by reverse DNS queries. The site had no DNS server in the queried sequence that was authoritative for reverse lookups for the entire 10.0.0.0/8 address space. If you want PCs identified in reverse lookups, I would try the Microsoft DDNS data first and, if that fails, have a local server provide the error message.
Along the way, I also saw queries for IBM and Microsoft addresses that suggested certificate validation, perhaps chain-of-trust certificates for some server functionality. What happens when your Internet connection is down? Does timing out on such certificates take a while? What happens then? I don’t know; I haven’t been in a position to test it. Does deploying local certificate copies help with that? I’d hate to find out that’s a problem the day of the Big One (whatever nasty event takes out the Internet connection for days).
One app server also kept looking up mlb.com, which leads me to guess some embedded code or link is hitting the Major League Baseball site, due to something someone missed in editing some toolkit, sample, or prior programming work. How many people use Wireshark to see what strange things their commercial application servers might be doing or talking to?
A few years back I worked with a site having some issues with their web service when the link to the main datacenter was down. Among the symptoms: application servers couldn’t authenticate to the database. At the time I was guessing DNS and digital certificate verification, which apparently turned out to be correct. Eventual placement of auxiliary DNS servers in the web / application server site resolved the issue and protected against future outages.
It is generally a good idea, if you use addresses out of 10.0.0.0/8 or other private address blocks, to make sure your local DNS is authoritative for reverse lookups on that entire private address space. If a lookup is going to fail, it should fail as fast as possible. For internal applications, I would sure want as little DNS interaction with the Internet as possible, to ensure that they still function when my Internet upstreams are having a Really Bad Day.
Make sure your servers don’t have old DNS server addresses configured in them. DNS timeouts are ssssslllllooooowwwww! I’ve seen that slow down an application at one site. A periodic audit of server configurations for such errors might be a good idea.
Make sure your DNS servers are not old and slow. DNS is critical to modern application security, hence performance. VMware is known to be highly dependent on DNS, so if you’re using VMware you want to get your DNS done right; in particular, check that both forward and reverse lookup entries get configured for every VM or VMware host (and, in general, every server).
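The forward/reverse consistency check is easy to script. A sketch, with the caveat that comparing only the short (first-label) hostname is my simplification; stricter checks would compare full FQDNs:

```python
import socket

def forward_reverse_consistent(hostname):
    """Resolve hostname to an address, then check that the address's
    PTR record maps back to the same short hostname. False means either
    a lookup failed or forward and reverse DNS disagree."""
    try:
        addr = socket.gethostbyname(hostname)
        back = socket.gethostbyaddr(addr)[0]
    except OSError:
        return False  # one of the entries is missing entirely
    return back.split(".")[0].lower() == hostname.split(".")[0].lower()

# Feed this your VM / host inventory:
# bad = [h for h in inventory if not forward_reverse_consistent(h)]
```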
Having a maze of many DNS and Active Directory servers (doing DU-DNS) is probably not wise: slow, complex, hard to troubleshoot.
For that matter, time is critical for SSL and other encryption. Make sure you have a robust NTP time hierarchy, and please have clients and servers ultimately using the same time sources.
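For a rough spot check of client/server clock agreement, a minimal SNTP probe can be scripted. This is a sketch only: it ignores the fractional-seconds field and the full NTP offset math, so it is good enough to spot a badly skewed clock, not a substitute for proper NTP monitoring. The server name and timeout are assumptions.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between the 1900 and 1970 epochs

def sntp_offset(server, timeout=2.0):
    """Rough clock offset versus an NTP server, in whole seconds;
    None if the server does not answer."""
    packet = b"\x1b" + 47 * b"\x00"  # LI=0, VN=3, Mode=3 (client)
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        t_send = time.time()
        s.sendto(packet, (server, 123))
        data, _ = s.recvfrom(48)
        t_recv = time.time()
    except OSError:
        return None
    finally:
        s.close()
    if len(data) < 48:
        return None  # malformed reply
    # Server transmit timestamp: seconds field at bytes 40-43.
    tx_secs = struct.unpack(">I", data[40:44])[0] - NTP_EPOCH_OFFSET
    return float(tx_secs) - (t_send + t_recv) / 2

# offset = sntp_offset("pool.ntp.org")  # hypothetical server choice
```

An offset of more than a few minutes between a client and the time source its servers use is enough to break certificate validity checks.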
Data exported from OPNET App Xpert can include duplicate packets, in which case Wireshark will report duplicate and out-of-order packets, which is not very useful.
If you look at your DNS traffic, you may see many lookups for items in the sophosxl.net domain. Apparently this anti-virus tool does lookups to subdomains to check out web hosts, so sites running Sophos software might need a more robust DNS server.
It is a good idea to have in-house DNS expertise, clear responsibility, and accountability. I have the impression that a lot of DNS admins are in the “got it working” category, i.e. some degree of clue but not expert. Re accountability: does DNS belong to the network team or the server team? “Both” is generally not a good answer.
I’ve experienced adventures in application slowness over the years. Lesson learned: key services such as DNS, LDAP, Active Directory need to be robust and fast. Part of that is not putting them on old junk servers, part is monitoring the server resources. One site I worked with did a data center core upgrade before Christmas. It turned out the new (at the time) version of Lotus Notes pounded the heck out of the LDAP server. Which nobody on the server team looked at until the network team established innocence. The problem could have been solved a lot faster and with a lot lower cost if folks had just been monitoring the key servers and storing historical performance data. (“Gee, the load went up considerably on 11/28, which is when we had that Lotus change window…”).
Comments welcomed, particularly if you can answer any of the open questions above. This is a case where I’m not finding the information I’d like to have, know it might be useful or important, and don’t have the time to lab it up.