I’ve been pondering writing a series of blogs about the topic, Data Center of the Future (“DCoF” below). The present topic, “How Many Servers?”, may be the first of that projected series. I’ve been tracking, reading, and learning about the various topics for some years now. I don’t claim to know everything about any of the topics, but I do hope I have some thoughts that are original if not challenging. And I hope my discussion points will stir some debate and alternative points of view via comments on the blog pages.
We (Chesapeake NetCraftsmen, aka CNC) are in the midst of doing a number of data center assessment and planning projects for some fairly big data centers. When you’re trying to figure out what next-gen equipment to buy (be it Cisco or HP or whatever vendor / vision you’ve signed on to), there are a few key design factors that we have to jointly determine:
- How many servers do you have now, how many do you project having in N years?
- Are they vSphere blade servers, or what?
- How many NICs, and are they 1 Gbps or 10 Gbps?
- What size Layer 2 domains do you want?
The answers to these questions give you some idea of how much space the servers will need, and how you’re going to connect them all up. Our focus in this article is primarily the space, with some brief comments about the other factors.
As with most budget and design situations, there is a certain amount of uncertainty, and trying to be too precise is a doomed exercise in GIGO. Don’t go there!
How Many Servers?
The “how many servers?” question is emphatically not (NOT!) a no-brainer. You have to factor in the rate at which servers are being virtualized. And then extrapolate based on the assumption that some percentage, perhaps 80 or 90%, of the servers will be VM’s (Virtual Machines). So you’ll have growth in the non-VM server count, plus growth in the VM count.
Well, that’s just the beginning. We have all probably noticed discussions of “VM sprawl” in print. Sometimes it seems like they come mostly from people and companies who want to sell you a product to manage the sprawl. Oops, my cynicism is showing. Don’t get me wrong, sprawl is a real potential problem. If you have VM sprawl, then you have additional VM count growth due to the ease of saying “yes” and spawning all sorts of new VMs for people. So what we’ve got now is (physical server growth) + (VM growth due to the usual server growth rate + growth due to P2V, i.e. physical-to-virtual conversion + sprawl factor). Fine-tune that to your heart’s content.
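The growth formula above can be turned into a rough year-by-year projection. The sketch below is only an illustration: the function name, the rate parameters, and the sample numbers are all assumptions of mine, not measurements from any site, and each rate is treated as a simple annual fraction.

```python
def project_servers(physical, vms, years, growth, p2v, sprawl):
    """Roughly project (physical servers, VMs) after a number of years.

    growth, p2v and sprawl are assumed annual fractions (0.10 = 10%).
    """
    for _ in range(years):
        converted = physical * p2v           # physical servers turned into VMs
        physical_growth = physical * growth  # usual growth landing on hardware
        vm_growth = vms * growth             # usual growth landing on VMs
        vm_sprawl = vms * sprawl             # extra VMs because "yes" is easy
        physical = physical + physical_growth - converted
        vms = vms + vm_growth + converted + vm_sprawl
    return round(physical), round(vms)


# Hypothetical site: 1000 physical servers, 200 VMs, three-year horizon.
print(project_servers(1000, 200, years=3, growth=0.10, p2v=0.20, sprawl=0.05))
```

Given the GIGO warning earlier, the value of something like this is in seeing which rate dominates the result, not in trusting the third digit.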
That doesn’t quite translate into space in the data center. The other factor you need is the number of VM’s per processor (or socket, or socketed CPU chip), and the processor density per rack (or square foot, etc.). If you’re using a Cisco UCS or an HP or IBM blade chassis, you might have 128 processors or more in one rack of space or less. As I may have mentioned elsewhere (you think I re-read my old scribblings?), processors are getting a whole lot more powerful, both in terms of the number of cores and in terms of memory access capabilities (cf. Intel Nehalem-EX chips, or some IBM announcements). I’ve recently read that Intel was sampling 48-core processors. Heck, I have a quad-core laptop, and it screams (except when Windows isn’t paying attention to my mouse, anyway; cooperative multi-tasking has its drawbacks).
This business of VM’s per processor or rack is a very real factor. It factors into not only space (# of racks, power, cooling, etc.) but also touches the whole question of “how big are your Layer 2 domains / VLANs going to be”.
We were discussing DCoF at one site over a year ago. A whole bunch of “architects” had been discussing DCoF (or Next Gen Data Center, “NGDC”), and determined that VMotion was going to drive a corporate requirement that every VLAN go everywhere. Since they had at least 8 rows of servers bookended by Cisco 6509’s, that exceeded my Spanning Tree tolerance by a factor of 8, especially since the site had already had some noteworthy STP loops and data center-wide meltdowns. (It was designed using an older approach, and changing to L3 to the row had never been considered. With VMotion and more L2 clustering now prevalent, L3 to the row is generally not popular.)
In talking to the local VMWare / blade server expert, we learned that the site was then running at about 12 VM’s per processor. We did the math: two racks of HP = 2 x 128 processors, times 12 VM’s each, is about 3000 VM’s (3072, to be exact). Since the site was at 1000 to 1500 servers at the time (roughly 8 rows x about 200 per row?), that meant that after P2V, the whole data center would fit into two racks!!! That incidentally solved the whole “VLANs go everywhere” debate. The expert didn’t even want VLANs to allow VMotion from old ESX chassis to new vSphere chassis: running older VMWare on new Hypervisors is inefficient, so a cold move / VM migration was what he wanted to do.
Let’s pursue the story, factoring in some new hardware capabilities. If you figure that was with dual-core processors, and if 8-core processors support 4 times as many VM’s, that’s roughly 50 VM’s per processor (12 x 4 = 48), or 128 x 50 = 6400 per rack. And climbing (multiply by 4 to 6 for 48-core processors?). Is that a realistic number? Some people tell me IO, especially SAN IO and throughput, is likely to hold those numbers down. I’ve been told memory is a limitation, but Cisco (and I gather IBM) have addressed that with virtualization techniques that allow their server chassis to slice up additional memory, somewhat addressing the RAM per VM concern. I suspect people are also going to be conservative: having everything in one rack is kind of scary. (And might get your data center downsized out from under you?)
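For anyone who wants to replay the rack arithmetic with their own numbers, here is a minimal sketch. The function names and the `max_utilization` headroom knob are my own assumptions; the 128 processors per rack, the 12 and 50 VM’s per processor, and the 1500-server count come from the story above.

```python
import math


def vms_per_rack(processors_per_rack, vms_per_processor):
    """VM capacity of one rack at a given consolidation ratio."""
    return processors_per_rack * vms_per_processor


def racks_needed(total_vms, processors_per_rack, vms_per_processor,
                 max_utilization=1.0):
    """Racks required if each rack is only filled to max_utilization (assumed)."""
    usable = vms_per_rack(processors_per_rack, vms_per_processor) * max_utilization
    return math.ceil(total_vms / usable)


print(2 * vms_per_rack(128, 12))   # two racks at 12 VM's per processor: 3072
print(vms_per_rack(128, 50))       # one rack at 50 VM's per processor: 6400
# 1500 servers, all virtualized, filling racks only half way for headroom
print(racks_needed(1500, 128, 12, max_utilization=0.5))  # 2
```

The headroom parameter is where the IO, memory, and “everything in one rack is scary” concerns would show up: lower it and the rack count climbs quickly.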
That’s all theory. There are some practical factors that come into play. I think I’m seeing human skills and time as being a very limiting factor. You can only P2V and test so many VM’s per day, month, year. So the data center is not going to undergo P2V and re-appear within one rack overnight, despite some of the automation tools that I hear about. Sites are converting the underutilized and old servers to VM’s, especially the ones with non-critical or not-very-demanding apps on them. Critical servers or demanding applications, not so much. As for legacy apps, the problem is that they run with drivers or OS’s so old that you’re stuck with the platform they’re on.
To sum that up, VM’s are a win as far as space (power, cooling, etc.), but touching every server is something most sites are not close to being staffed for. (Most sites seem to stand up servers and apps and then leave them mostly alone for years, except for patches, upgrades, and tweaks. If you don’t change it, it probably won’t break.)
vSphere or Not?
The reason for the vSphere or not question is that VM’s running on a Hypervisor tend to use more of a CPU and do more IO. So the number of such VM platforms impacts what you assume about bandwidth to the access switch, and probably the upstream oversubscription ratio. If you spread the heavy-duty platforms around, then your average oversubscription per row might not balloon on you. If you concentrate the cruncher apps / hardware, then you might need a zone with a very low oversubscription ratio. It all comes down to how you want to try to manage bandwidth.
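To make the oversubscription point concrete, here is a small sketch of the usual ratio: server-facing bandwidth divided by uplink bandwidth. The port counts and speeds are hypothetical, picked only to show how moving from 1 Gbps to 10 Gbps server links changes the picture on the same uplinks.

```python
def oversubscription(server_ports, server_gbps, uplinks, uplink_gbps):
    """Ratio of server-facing bandwidth to uplink bandwidth on an access switch."""
    return (server_ports * server_gbps) / (uplinks * uplink_gbps)


# Hypothetical 48-port access switch with four 10 Gbps uplinks.
print(oversubscription(48, 1, 4, 10))    # servers attached at 1 Gbps
print(oversubscription(48, 10, 4, 10))   # servers attached at 10 Gbps
```

In this made-up case the ratio goes from 1.2:1 to 12:1, which is the kind of jump that would push the busy VM hosts into their own low-oversubscription zone.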
How Many NIC’s?
There seems to be a firm popular belief that you can’t have enough dedicated NIC’s. It is also current VMWare Best Practice. Namely, two NIC’s for Production, one or two for Backup, one or two for Management / VMotion, 2 HBA’s for SAN, etc. I also suspect staff in different stovepipes want to “own the bandwidth”. Production worries about backup traffic stomping on data, and Backup worries about data slowing backups. (And perhaps they could co-exist fine, if backup had bigger pipes to the backup unit(s), and/or multiple back ends. The problem isn’t the volume of backup data, it is efficiency and parallelization.)
OK, perhaps I’ve swallowed the Cisco Kool-Aid on this, but I think shared 10 Gbps connections are the way to go. I heard that it costs $400,000 to build out the copper patch panels and cabling for a row, and one site has confirmed that number. (Cabling, labels, labor, etc. It all adds up.) See my earlier blog about 2009 VMWorld, the Cisco UCS / SAN demo area, and how little network cabling there was, at VMWorld 2009 Impressions #2.
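A quick cable count shows why the per-row cabling bill gets so large. This is a rough sketch: the 200 servers per row echoes the estimate earlier in this article, the dedicated layout takes the upper end of the NIC counts listed above, and the shared layout (two 10 Gbps links plus the two HBA’s left alone) is my own assumption.

```python
def cables_per_row(servers, links_per_server):
    """Total cable runs in a row, given a per-server link breakdown."""
    return servers * sum(links_per_server.values())


# Upper end of the dedicated-NIC practice described above.
dedicated = {"production": 2, "backup": 2, "management_vmotion": 2, "san_hba": 2}
# Assumed alternative: LAN traffic shares two 10 Gbps links, SAN HBA's unchanged.
shared_10g = {"shared_10g": 2, "san_hba": 2}

servers_in_row = 200  # rough per-row figure, not a measurement
print(cables_per_row(servers_in_row, dedicated))   # 1600
print(cables_per_row(servers_in_row, shared_10g))  # 800
```

Halving the cable runs does not automatically halve the cost, but it gives a feel for how much of that copper build-out is driven by the dedicated-NIC habit.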
If your future will contain more / new 1 Gbps NICs, then the Cisco Nexus 2000 series is a way to do that while reducing the horizontal copper cabling. If you have constant cabling churn or need a switch upgrade, it might be worthwhile to retro-fit. You can tune the mix of Nexus 2000’s (N2K) per rack and Nexus 5000’s (N5K) to meet projected port count and bandwidth needs.
Concerning relevance, this goes towards how you use the space you’re allocating. A subsequent article may look at some of the new ways of building out data centers, with some concrete examples. I’m omitting the whole “To FCoE or Not to FCoE?” question: it is too far off topic, and a good subject for another blog. (FCoE = Fibre Channel over Ethernet.)
What Size Layer 2 Domains?
A lot of sites are having problems managing space. Sometimes, an entire zone gets filled before the next one gets built. Or different Business Units “own” space, and as needs change, servers and space become available in a non-adjacent part of the data center.
There are trade-offs. You can do VLANs to connect VMWare clusters (VMotion) or MS and other clusters that require L2 adjacency. As they grow (in number and breadth), your risk of a Spanning Tree loop quietly goes up. Rapid STP, VSS or vPC, and other such techniques can be applied to reduce the risk. We hear that L2MP or TRILL are coming, or vendor-specific many-switches-as-one-big-virtual-switch technologies. Do you want to bet your data center (or multiple data centers) on new technologies? If you do, do you want to hedge your bet, by just doing the new approach in part of the data center?
I personally have the urge to widen the mix. Can you manage the space differently? Does the data center really need to be carved up by “ownership”, or can different Business Units cooperate? (I gather that coordinating priorities for scarce change windows can be challenging.) Can server chassis be physically moved (i.e. labor versus risk)?
The challenge there seems to be that the various stovepipe teams (server, app, SAN, network, BU’s, etc.) each have their own priorities and presumed requirements. There seems to be nobody in a position to really evaluate the trade-offs and look at what’s best for the business, both in terms of cost (ROI) and risk. Or that’s the impression from the outside. I see STP risk and managing the space as alternatives to “spend more, buy more”, but I’m also not in other people’s shoes seeing how hard it is to do that.
Multiple platforms (IBM, HP, Linux versus Windows) is another aspect of how you use space. If you have firm app/server/vendor stovepipes, usually each has their own data center territory. (Do territorial data center people mark their territory? I don’t want to think about how they might do that…) That consumes more space. It usually means each has their own VM farm. Adjust your space calculations accordingly.
Politically, a couple of firms seem to have made the decision to get onto single platforms, e.g. Windows, VMWare, HP hardware, etc. If your shop can do that, it seems to have real benefits. “Cookie cutter” lets you automate more consistently, think “assembly line production of VM’s and deployment of apps”.
If you have an ongoing software architecture committee that just churns through the popular dev tools of the day, then you’re probably not going to see those benefits. I get the challenge of standardizing but not freezing progress. I get that vendors for costly apps always want to tell you what to run those apps on, and deviate at your peril. So achieving cookie cutter is hard.
I also wonder why hardware, like networking approach or server load balancers or cache engines, is rarely part of the architecture — it probably just makes the discussion too wide. Are you tired of supporting too many Server Load Balancer approaches (software and hardware) in your shop? Then you see my point. Going forward, you get the best results by using as many common elements as possible. Variety means a higher ratio of admins to apps, servers, network hardware, etc. Commonality saves money. (Hmm, did I just unintentionally make a pitch that variety of platforms protects jobs?)
I’m not a server guy. I don’t manage large numbers of vSphere processors or clusters, and I have limited interaction with those folks at most sites I consult at. It might be pretty useful to know what other sites are getting, in terms of number of VM’s per processor or per rack. If you have read this far, and could take the time to let us know what your site is doing in these regards via a comment, please do so! Or share any other related considerations you’ve encountered with the other readers as well! (And thanks in advance!)