There’s a neat story that’s been around the ‘net about a North Carolina university system administrator who faced a problem where email from a particular system could only be delivered if the recipient was less than about 500 miles away (see The Case of the 500-mile Email). The comments that follow the story describe other considerations that could have been incorporated into it. These comments reflect a good understanding of how TCP works and how that impacts applications, which is the point of today’s blog entry.
My experience in designing and running big networks has taught me that a relatively small set of common problems affects most networks. As with the 80/20 rule, that small set accounts for about 80 percent of network problems. The set may depend somewhat on the type of network design you have chosen, so a switched network will have a different set of problems than a design that pushes L3 to the wiring closet. Even so, there will be a subset of problems that come up only occasionally, like the one about an OS ‘upgrade’ in the story above. If you have a good understanding of how networking operates, you can use that understanding to troubleshoot many problems. In the story above, a network engineer who can do a packet capture would see that the connection is taken down using the standard TCP connection close mechanism. This means that the mail servers are taking down the connection and that it isn’t a network problem. A less capable system administrator than Trey Harris (see FAQ with Answers about the 500-mile email) might have said ‘The network is broken.’
I’ve seen similar problems in the past and have included them in network engineer interview questions. One example that tests a candidate’s understanding of TCP and the ability to work through a problem goes like this:
You have two ground stations that communicate over a 10Mbps link via a geostationary satellite, stationed about 22,300 miles above the earth. A file copy operation needs to occur daily over this link, copying a 1GB file. Your boss says that it will take about 1000 seconds (a little over 16 minutes), which allows for some overhead. The systems running TCP are using a standard implementation. Do you agree with your boss? Explain your analysis.
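One way to work through the numbers is sketched below. It is only a rough estimate and rests on several assumptions that are not part of the question: a geostationary altitude of 35,786 km (about 22,236 miles), ground stations directly beneath the satellite, 1GB taken as 10^9 bytes, and a "standard" TCP taken to mean a maximum window of 65,535 bytes with no window scaling. Slow start, packet loss, and processing delays are ignored.

```python
# Rough estimate of a TCP file transfer over a geostationary satellite link.
# All constants below are assumptions chosen for illustration.

SPEED_OF_LIGHT_KM_S = 299_792.458   # propagation speed in free space
GEO_ALTITUDE_KM = 35_786            # assumed geostationary altitude
LINK_RATE_BPS = 10_000_000          # 10Mbps link from the question
FILE_SIZE_BYTES = 1_000_000_000     # 1GB taken as 10^9 bytes
WINDOW_BYTES = 65_535               # assumed max window without window scaling


def round_trip_time_s(altitude_km: float) -> float:
    """Data goes up and down, and the ACK goes up and down: four traversals."""
    return 4 * altitude_km / SPEED_OF_LIGHT_KM_S


def window_limited_rate_bps(window_bytes: int, rtt_s: float) -> float:
    """At most one full window can be in flight per round trip."""
    return window_bytes * 8 / rtt_s


def transfer_time_s(file_bytes: int, link_bps: float,
                    window_bytes: int, rtt_s: float) -> float:
    """The transfer runs at the slower of the link rate and the window limit."""
    rate_bps = min(link_bps, window_limited_rate_bps(window_bytes, rtt_s))
    return file_bytes * 8 / rate_bps


rtt = round_trip_time_s(GEO_ALTITUDE_KM)
rate = window_limited_rate_bps(WINDOW_BYTES, rtt)
seconds = transfer_time_s(FILE_SIZE_BYTES, LINK_RATE_BPS, WINDOW_BYTES, rtt)
bdp_bytes = LINK_RATE_BPS * rtt / 8  # window needed to keep the link full

print(f"Round-trip time:        {rtt:.3f} s")
print(f"Window-limited rate:    {rate / 1e6:.2f} Mbps")
print(f"Estimated copy time:    {seconds / 60:.0f} minutes")
print(f"Bandwidth-delay product: {bdp_bytes / 1000:.0f} KB")
```

Under these assumptions the round trip takes nearly half a second, the window caps throughput at a little over 1Mbps, and the copy takes roughly two hours rather than 16 minutes. The bandwidth-delay product shows why: the link can hold several hundred kilobytes in flight, far more than a 64KB window allows.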
These are the sorts of problems that I find interesting, which is why our products use built-in rules to look for them. As a network engineer, I rarely have time to look around for problems; there’s typically a backlog of problems waiting to be solved. It would be great to be able to identify many of these problems and fix them before they get added to my list of tasks. The end result is a network that runs much more smoothly and efficiently, which means that the business operations that rely on the network are also more efficient. This is how networks become a strategic part of the business. Networking departments that realize this are able to move themselves from being ‘information plumbers’ to sitting at the executive strategy meetings.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html