I’ve been doing data center design recently, and customers want to know what to monitor and why it should be monitored. At first glance, a data center is just another set of edge devices, albeit servers, connected to a set of access switches. But the traffic patterns and usage in the data center are much different from those at user edge ports. Historically, traffic flowed between clients and servers, a pattern often called ‘north-south’. But with applications distributed across web front ends, middleware servers, and back-end databases, there is frequently a lot of server-to-server traffic, called ‘east-west’.
I think it is important to know the most active servers, particularly as VM usage allows multiple applications to run on each server. In this context, a server is a hardware platform that has some number of CPU sockets, each with some number of cores, memory and I/O capability. Each blade in a blade server is what I would call a server (an ESX Host) because the sockets and cores on the blade share I/O, just as they would if it were a standalone chassis.
As VMs move from server to server, the mix of disk I/O and network I/O changes, depending on the application running on each VM that the server hosts. If you don’t monitor each server’s network port, how do you know whether the mix of applications is congesting the network interface? Alternatively, how do you determine that a particular server is hosting applications (VMs) that are no longer in use? Decommissioning unused servers has always been a problem in large data centers, and with VMs, it is becoming more challenging to identify the unused application VMs. Yet these unused VMs consume resources, adding to data center growth and increasing the server management load.
What should be included in data center monitoring?
- Server ports. Know when a server’s link is saturated. Basic interface monitoring is a start, but follow that with flow monitoring to report which traffic is the largest. Be careful: some applications, such as backups, are designed to run at full link speed. This is where you detect a server that is hosting too many VMs: its network interface will be congested, and the traffic will be a mix to and from the various VMs hosted on that server. Monitoring server ports can also be used to inventory unused or underutilized ports. Server ports, unlike most other edge ports, need to be monitored at a higher frequency so that short traffic bursts can be identified, helping you understand the true network utilization profile. Polling for data at 10-minute or longer intervals just doesn’t work for this level of visibility. The minimum that I want for server port monitoring is 5-minute data, and I would prefer 1-minute polling. This is certainly possible, as demonstrated by some vendors.
- Uplinks. Most data centers will be designed with some level of oversubscription of the uplinks, and knowing when an uplink is actually saturated will help you move the VMs around the data center in order to maximize overall system throughput. In the longer term, it can help you determine where you need to add higher-speed links or additional parallel links (EtherChannel or equal-cost multipath).
- Interface errors. All interface errors should be reported. In addition, report on discards, which indicate interface congestion. Other errors may indicate bad cabling or connectors. For fiber connections, look for either bad cables and connectors or dirty connections. This is also where you find duplex mismatches, on interfaces where the server and network teams don’t always agree on the speed/duplex settings.
- Traffic distribution on parallel links. Make sure that the EtherChannels and equal-cost multipath links are running at reasonably even traffic levels. An incorrect traffic distribution algorithm can significantly reduce network throughput, impacting business productivity.
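To make the arithmetic behind the items above concrete, here is a minimal Python sketch. It assumes you already collect 64-bit octet counters (for example, SNMP ifHCInOctets) at a fixed polling interval; the function names and the sample numbers are my own illustrations, not part of any particular tool.

```python
COUNTER64_MAX = 2**64  # 64-bit octet counters wrap at this value


def utilization_percent(octets_start, octets_end, interval_s, link_bps):
    """Average link utilization over one polling interval.

    Takes two samples of an octet counter and tolerates a single counter wrap.
    """
    delta_octets = (octets_end - octets_start) % COUNTER64_MAX
    return delta_octets * 8 / (interval_s * link_bps) * 100


def oversubscription_ratio(server_port_bps, uplink_bps):
    """Total server-facing bandwidth divided by total uplink bandwidth."""
    return sum(server_port_bps) / sum(uplink_bps)


def imbalance(member_utilizations):
    """Busiest member link relative to the average; 1.0 means perfectly even."""
    avg = sum(member_utilizations) / len(member_utilizations)
    return max(member_utilizations) / avg if avg else 1.0


# Hypothetical numbers, for illustration only.
# A 1 Gbps server port that moved 3.75 GB in a 60-second interval:
port_util = utilization_percent(0, 3_750_000_000, 60, 1_000_000_000)

# 48 x 10 Gbps server ports behind 4 x 40 Gbps uplinks:
ratio = oversubscription_ratio([10_000_000_000] * 48, [40_000_000_000] * 4)

# Four parallel member links with one carrying most of the load:
skew = imbalance([60.0, 20.0, 20.0, 20.0])
```

Keep in mind that each utilization figure is an average over the whole polling interval, so a shorter interval is the only way this kind of calculation gets closer to seeing short bursts.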
Get a baseline of the above information so you know what it should look like. Refresh the baseline at least once per quarter (monthly or weekly is even better), allowing you to do trend analysis to predict bottlenecks and act to eliminate them. Add thresholds to generate alerts when traffic levels and errors exceed suitable levels. With regular baselines, you can identify new applications as they are rolled out on the network. You may even be able to identify that an application will not work well in a certain environment (e.g., over the WAN) before it is extended there. And don’t forget that the baseline may allow you to more easily identify malware and security problems and their origins.
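One simple way to turn a baseline into a threshold is to alert when a new sample sits well above the historical average. The sketch below is only one possible approach, assuming a list of past utilization samples for a port; the function names and the choice of three standard deviations are my own assumptions, and a real tool may do something more sophisticated (time-of-day baselines, for example).

```python
from statistics import mean, pstdev


def build_baseline(samples):
    """Summarize historical samples as a mean and a standard deviation."""
    return {"mean": mean(samples), "stdev": pstdev(samples)}


def exceeds_baseline(value, baseline, k=3.0):
    """True when value is more than k standard deviations above the mean."""
    return value > baseline["mean"] + k * baseline["stdev"]


# Hypothetical utilization samples (percent) from the last baseline period.
history = [40, 42, 38, 41, 39]
baseline = build_baseline(history)

alert_on_45 = exceeds_baseline(45, baseline)  # well above the usual range
alert_on_43 = exceeds_baseline(43, baseline)  # within the usual range
```

Rebuilding the baseline on the schedule described above and comparing successive means is a cheap first step toward trend analysis.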
I like the idea of using flow analysis (NetFlow, sFlow, IPFIX) to identify server traffic patterns. Which servers are talking with which other servers, and what is the traffic volume? Are some servers talking with servers in a backup data center, implying that the application is really distributed across multiple data centers? In this case, is there really a backup data center, or is it simply a distributed data center? What happens when (not ‘if’) connectivity is lost to the other data center? Does the application stop running, and if it does, what is its impact on the business?
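As a rough illustration of those questions, the following Python sketch totals bytes per server pair and sorts traffic into east-west, north-south, and inter-data-center buckets. It assumes the flow records have already been exported and reduced to (source, destination, bytes) tuples; the two subnets, the addresses, and the function names are hypothetical.

```python
from collections import Counter
from ipaddress import ip_address, ip_network

# Hypothetical address plan: one subnet per data center.
PRIMARY_DC = ip_network("10.1.0.0/16")
BACKUP_DC = ip_network("10.2.0.0/16")


def site(addr):
    """Map an IP address to the data center it belongs to, if any."""
    ip = ip_address(addr)
    if ip in PRIMARY_DC:
        return "primary"
    if ip in BACKUP_DC:
        return "backup"
    return "outside"


def classify(src, dst):
    """Label a flow by where its two endpoints sit."""
    a, b = site(src), site(dst)
    if "outside" in (a, b):
        return "north-south"
    return "east-west" if a == b else "inter-dc"


def summarize(flows):
    """Total bytes per server pair (direction ignored) and per traffic class."""
    pairs, classes = Counter(), Counter()
    for src, dst, nbytes in flows:
        pairs[tuple(sorted((src, dst)))] += nbytes
        classes[classify(src, dst)] += nbytes
    return pairs, classes


# Hypothetical flow records: (source, destination, bytes).
flows = [
    ("10.1.0.10", "10.1.0.20", 500),
    ("10.1.0.20", "10.1.0.10", 300),
    ("10.1.0.20", "10.2.0.5", 200),
    ("192.0.2.7", "10.1.0.10", 100),
]
pairs, classes = summarize(flows)
top_talkers = pairs.most_common(3)
```

Any non-trivial byte count in the inter-dc bucket is a hint that the "backup" data center is really part of a distributed application.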
Data center monitoring is an important function of network management. While many of the above functions can be accomplished with today’s tools, there are some functions that cannot. The capability to do all of these functions is not far off. If we ask the vendors for these capabilities, all of which are possible with existing technology, then we will all benefit.
-Terry
_____________________________________________________________________________________________
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html