The New Datacenter

Author
Peter Welcher
Architect, Operations Technical Advisor

Best practice datacenter design architectures have changed — there are different and new datacenter designs now that can save you money. So why am I telling you this? Two reasons. First, I’ve been seeing a fair number of people and sites proceeding as if it’s business as usual, as in “replace my old box with the modern equivalent”. While you can still do that to a fair degree if you really want to, you may be missing the point. The second reason is skills: there’s a lot of new technology coming along, and if you aren’t keeping up, how can you evaluate it, let alone be in a position to design for it?

Ok, the “save you money” part is a bit of Cisco (and other vendor) Kool-Aid — but it happens to be true, at least in some if not most ways. Concerning skills, I’m sub-contract teaching the Nexus class roughly once a month for Firefly, and keeping up on not only Nexus products but the virtualization suite (1000v and related blades, VMWare/vSphere) and other technology (e.g. reading about Juniper QFabric, OpenFlow, VXLAN, and so on). I plan to write about some of it, e.g. VXLAN, shortly. In the meantime, I highly recommend Ivan Pepelnjak’s IOS Hints blog, and also Tony Bourke’s Data Center Overlords (Tony teaches for Firefly). Amusingly, when I searched a bit, I quickly hit upon the specific article link here, which starts out with almost the exact same theme as my starting point: that the data center landscape is changing rapidly.

By the way, if your organization is grappling with this, and would like to bring me in for a couple of days consulting and discussion facilitation, well, that’s my idea of fun!

Changed How?

The biggest reason “it ain’t just box replacement” anymore is the Cisco FEX technology … and I suppose other vendors’ attempts to “flatten the datacenter”. (Which always struck me as a Bad Thing — isn’t that what a tornado might do?) The Nexus 2000 (N2K) products let you do manageable Top of Rack switching. That is, you don’t suffer “death by small switches”, with tens to hundreds of small switches to manage, one or two per rack. At present I’m mildly mixed about the N2232 (4:1 oversubscribed ports) — but imagine the possibility of something like that with 40 or 100 Gbps uplinks. (Another article I intend to write: all the ways Cisco might take the FEX technology.)
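
To put a number on that, the oversubscription arithmetic is trivial. Here’s a quick back-of-envelope in Python, using the nominal N2232 port counts (32 host-facing 10 G ports, 8 x 10 G uplinks):

    # Rough oversubscription math for a ToR fabric extender (nominal N2232 port counts)
    host_ports = 32          # 10 G server-facing ports
    uplinks = 8              # 10 G fabric uplinks to the parent switch
    port_speed_gbps = 10

    downstream_gbps = host_ports * port_speed_gbps   # 320 Gbps of server-facing capacity
    upstream_gbps = uplinks * port_speed_gbps        # 80 Gbps toward the parent N5K/N7K

    print(f"Oversubscription: {downstream_gbps // upstream_gbps}:1")   # prints 4:1

Whether 4:1 actually hurts depends on how busy those server ports really are, which is why the uplink speed roadmap matters.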

But it isn’t just the N2K. The FEX technology is coming to NIC cards with tight VMWare integration. Who currently manages NIC connectivity? Who is managing the VMWare vSwitch or dvSwitch in an ESXi host? How about getting that back into network turf, where the people who understand the network and security implications can control it, and where the people who have to troubleshoot get the visibility they need? Oh, and by the way, let’s offload the switching to hardware, to preserve CPU cycles for applications.

The N7K is a doggone big switch, big enough that for fairly large medium-sized enterprises (up to 10-20,000 people?) the 7010 easily provides enough ports to serve as campus core, distribution, and datacenter core as well. Yes, VDC’s might modularize that, but it’s still a lot of eggs in one basket. As speeds go up, one might end up wanting separate campus and datacenter core N7K’s at that scale.

The N5K plays nicely with the N2K to provide the “pod” approach Cisco’s been talking about. I like the Just In Time provisioning aspect: build a pod of 4, 8, or 16 racks using N5K and N2K to minimize and localize cabling. As time goes on, decrease the racks per pod to increase 10 G port density — or use newer N5K models as they become available. This is one place where “saves money” may or may not apply — you can end up with a bunch of N5K’s, which aren’t that cheap. On the other hand, is the total cost more or less than book-end 6500’s at the ends of a long row? It’s hard to tell. It’s probably cheaper than high-end 6500’s loaded with all the newest tech trying to get close to wire speed on a lot of 10 G ports. Heck, how does a single 5596 stack up against a 6500 performance-wise?
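
If you want to play with the pod sizing yourself, the basic arithmetic looks something like the sketch below. The port counts are illustrative assumptions (a pair of 48-port N5Ks, one N2K per rack), and it ignores the per-N5K FEX count limit:

    # Hypothetical pod-sizing arithmetic; port counts are assumptions, not a Cisco sizing rule
    n5k_ports_per_pair = 2 * 48      # e.g. a pair of 48-port N5Ks dedicated to FEX uplinks
    fex_per_rack = 1                 # one ToR N2K per rack

    # More uplinks per FEX means less oversubscription but fewer racks per pod.
    for uplinks_per_fex in (4, 6, 8):
        racks = n5k_ports_per_pair // (uplinks_per_fex * fex_per_rack)
        print(f"{uplinks_per_fex} uplinks per FEX -> about {racks} racks per pod")

That is the Just In Time knob: spend uplinks to buy down oversubscription, or spread the N5K pair across more racks.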

FCoE has the potential to cut access layer cabling in half — if your SAN team is willing to co-own the FCoE links (which can be a barrier). Less cabling = win!

Another item that is easy to overlook: swapping the 6-8 x 1 G ports to each server for a couple of 10 Gbps ports greatly reduces cabling. Cabling in the first place, labeling, and maintaining the cabling plant are all more costly than you’d think (time consuming!). Think about using VLANs and VRF’s instead of re-cabling to new switches, e.g. when server ownership or security zone changes. Wouldn’t that be a win?
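
Here’s a rough illustration of the cable-count savings; the server count and NIC mix are assumptions, so plug in your own numbers:

    # Illustrative per-rack cable counts (assumed server count and NIC mix)
    servers_per_rack = 20

    # Legacy build: ~7 x 1 G NICs + 2 FC HBAs + 1 iLO/management port per server
    legacy = servers_per_rack * (7 + 2 + 1)

    # Converged build: 2 x 10 G FCoE-capable links + 1 iLO port per server
    converged = servers_per_rack * (2 + 1)

    saved_pct = 100 * (legacy - converged) // legacy
    print(f"{legacy} cables -> {converged} cables per rack ({saved_pct}% less to pull, label, and maintain)")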

I see the UCS as a game-changer too. The memory mapping aspect means more memory per socket, cheaper than HP’s approach, which is limited by standard DIMM slot count to using expensive, denser memory. More memory per socket means more VM’s per blade server socket, hence higher density. Less space, less power overall.
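
A crude way to see why the memory point matters for density (the figures below are made up purely for illustration, not vendor specs):

    # Illustrative VM-density math; memory sizes are invented to show the ratio
    vm_memory_gb = 8                # assumed average VM footprint
    dimm_limited_blade_gb = 96      # a blade constrained by standard DIMM slot count
    memory_expanded_blade_gb = 192  # a blade with memory expansion (UCS-style mapping)

    for label, mem_gb in (("DIMM-limited blade", dimm_limited_blade_gb),
                          ("memory-expanded blade", memory_expanded_blade_gb)):
        print(f"{label}: ~{mem_gb // vm_memory_gb} memory-bound VMs per blade")

Twice the memory per blade means roughly twice the VMs per socket, assuming memory rather than CPU is the constraint, and that is where the space and power savings come from.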

Facilities are changing too. In some recent datacenter tours I’ve been seeing more sites with things like:

  • No raised floor
  • Hot- or cold-aisle containment
  • A lot more attention to cooling air flow, placement of floor tiles in raised floor buildings, etc.
  • All cabling in cable trays
  • All power and localized power distribution overhead, in a 2nd layer of trays (sometimes)
  • Generally less space for servers, more space taken up by storage arrays

The one thing I haven’t noticed much of (yet) is use of twinax for inexpensive 10 G server connections. And by the way, don’t put your N5K’s at the ends of long rows: if you put a pair of N5K’s something like 8 or 9 racks apart, you can use the much less costly twinax. For a row of 16-20 racks, go with N5K’s in, say, racks 4, 5 and 16, 17 — that is, break the row in half, and put the N5K’s in or near the middle of the two pods of 8 racks, so that all server-to-N5K distances are within the 10 meter max for twinax. See the diagram below.
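
A quick sanity check on the twinax reach, assuming roughly 0.6 meter wide cabinets and a couple of meters of slack per run (both assumptions; adjust for your floor):

    # Quick twinax reach check; rack width and slack are assumptions
    RACK_WIDTH_M = 0.6      # assumed cabinet width
    SLACK_M = 2.0           # vertical rise, patching, service loop
    TWINAX_LIMIT_M = 10.0   # passive twinax maximum

    def run_length(racks_away):
        """Approximate cable run from a server rack to an N5K that is racks_away racks over."""
        return racks_away * RACK_WIDTH_M + SLACK_M

    for label, racks_away in (("end-of-row N5K, farthest rack", 15),
                              ("mid-pod N5K, farthest rack", 5)):
        length = run_length(racks_away)
        verdict = "OK for twinax" if length <= TWINAX_LIMIT_M else "too long, needs fiber"
        print(f"{label}: ~{length:.1f} m ({verdict})")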

I see we sort of passed over the dark side of this. It’s not that bad, but it’s hard — I’m talking about operational procedures and shared ownership. The technology stovepipes have to become less rigid for this to work optimally. Personally, I see that as a great career opportunity for people — having combined network / server / SAN skills will make you very employable going forward. 

Save Money

Here are a few thoughts about that:

  • Fewer boxes = less $.
  • Use VLANs and VDCs and other virtualization techniques for security zones, to reduce box count.
  • If your site security people or someone else insists on separate switches in each row for production, DMZ/perimeter, and backup, have a chat with them. That’s expensive and inflexible! And it takes a lot of labor to cable and maintain.
  • 10 G on twinax copper to servers is a lot cheaper than optical transceivers and fiber (see the rough cost sketch after this list).
  • The Cisco bundled FET for N2K to N5K or N7K connectivity is also relatively low cost, works with fiber over quite adequate datacenter type distances (25 – 100 meters).
  • 10 G versus many 1 Gbps connections — less cable to manage, less money on patch cables, less switch real estate, and less aggregate power to drive switch ports.
  • If you want control over / visibility into how the servers connect, the 1000v can replace a physical switch at less cost. The coming VM-FEX NIC and software should allow you to use an adapter that is in effect a small N2K. HP is already selling a Cisco N2K that goes into their blade server chassis, the “Blade Fabric Extender”. I have yet to compare costs, but I like the idea of a remote-controlled FEX in the chassis (in many chassis) rather than having the server team squandering money on HP VirtualConnect (which has never struck me as very useful; it does the processor-to-external-interface plumbing that UCS Manager does intrinsically).
  • Cutting down on internal and external hardware and connections: less heat and power, less to manage = lower cost.
  • FabricPath (when mature) has the potential to mean more VLANs throughout your datacenter without STP risk and without having to exercise as much control. Although, as I’ve written elsewhere, you might still want to try to keep some VLAN discipline going. How to do so, I’m still grappling with — as I suspect most of us are.
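
On the twinax / FET / optics points above, here’s a rough per-link cost sketch. The prices are placeholders rather than quotes; substitute your own discounted numbers and the ratios still tell the story:

    # Placeholder per-10G-link costs (not quotes); swap in your own discounted pricing
    cost_per_link = {
        "twinax (passive, under 10 m)": 100,
        "FET pair (N2K fabric uplink)": 400,
        "SR optics pair + structured fiber": 1500,
    }
    links = 192   # e.g. all the FEX uplinks in a large pod

    for media, cost in cost_per_link.items():
        print(f"{media:34s} ${cost:>5} per link   ${cost * links:>9,} for {links} links")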

This article is getting a bit long on me. I’m going to leave it to you readers to add to this (CHALLENGE!) — please add comments with other ways datacenter changes lead to cost savings.

New Technologies

Well, there is all that Nexus stuff. Are you sharp on what the Nexus line can do for you? How about up on FCoE and FIP? Do you know enough SAN basics to at least talk to your SAN cousins? 

We’ll skip the shameless plug for the Firefly classes, which I think are tops. They really invest in trying to ensure the best possible instruction.

And there are classes on 1000v, VMWare, Cisco UCS, etc. too. A chance to express your server side personality! <grin>

For the more exotic, there’s the somewhat less-Cisco-flavored stuff: VXLAN, EVB, OpenFlow, OpenStack. 

I should add Data Center Interconnect to the list. Although that’s becoming more of a known art, the “optimal routing” aspect (and/or using LISP) is still pretty new stuff. 

So for those who, like me, enjoy the technology and making this stuff work: happy reading!

11 responses to “The New Datacenter”

  1. I just built several dc’s worth of network for a large enterprise, and the biggest problem we had was access density using 5k’s and 2k’s when they’re still deploying a lot of 2-4u boxen loaded with 5-9 interfaces by the time you’re done with ilo and such to appease applications folk. Not to mention FC HBA’s. Server and application people hate sharing bandwidth with backups, san, heartbeat links, and other application nonsense, so I find companies are still going out of their way to separate bandwidth interconnects and provide separate L1/2 purpose built vlan/switch "blocks" for backups, fcoe, etc. Add 9 ports per server, about 3-4 2248’s per rack, add in oversubscription, rack real estate, switch ports, and even a pair of 5596 for a good 10-14 racks of servers is cutting it thin. Good thing fabricpath allows for 16x interconnects, but now you’re pushing heavy oversubscription on the f1 modules. We’re right back to the days of 6148 modules in a 6k with asic drops on giant l2 networks. Can’t wait!

    Oversubscription is another thing Cisco tiptoes around with Nexus. As you said, 4:1 on a 7K is still not optimal, as Cisco doesn’t provide an extensible (SNMP) way to monitor capacity at an ASIC level. At least none I’ve found, and I’ve asked for this for 10 years in a 6K. So much for FCAPS. QoS is still an elusive voodoo to most, and even more complex on Nexus, so it just leaves a lot more possibility for misuse.

    In the end, I think most enterprises simply aren’t ready for ethernet convergence to make the dream of true pods a reality, but then again neither are the vendors.

  2. Hi Michael! Sounds like you’re agreeing with me that a new approach is needed.

    Yes, the "separate interfaces to avoid traffic competing for bandwidth" mindset is a good part of the problem. I’d say "use a management tool and MANAGE it". App people don’t seem to want to do that. You say you’re seeing a lot of 2-4 RU boxen, and my next question is "why not blade servers?" But if that’s what the customer (internal or VAR) wants, there’s nothing you or I can do about it. This all sounds like rigid thinking in the server team(s) you’re working with. Is it an education problem: getting used to new ways of doing things, understanding rather than fearing the network, something like that?

    Re the ASIC’s, see the comment to a prior article about Cacti and customized reporting for port groups.

    Yes, I’m not wild about 4:1 oversubscription on M1 card ports in N7K either. Mix M1 and F1 cards and deal with the quirks of doing so?

  3. Great post, Pete.

    I’ve built 5K/2K pods the way you recommend here (center-of-row 5K with TwinAx) and I’ve built environments with centralized 5K and far-flung 2Ks. I wouldn’t centralize the 5Ks again.

    FET is cost competitive when compared to TwinAx, but doesn’t suffer the length limitations, so why constrain your logical topology by making it look like the physical (row length) topology?

    Centralize the 5Ks. Put FEXes where you need them. You want just-in-time pod/rows? Fine, centralize the 5Ks anyway. By getting the 5Ks out of the server row, you eliminate the requirement to have management copper Ethernet and RS232 serial in every server row. Instead, this equipment can sit right on top of your stack of 5Ks.

    A recent deployment I worked on includes tens of pods with not a single strand of copper data cable running into or out of any server cabinet. We did elect to use copper for power 🙂

    Second point: FEX airflow is a problem. If you don’t do something about it, the FEXes will ingest hot server air, possibly leading to overheating.

    Shameless but relevant plugs:
    5K/2K layout: http://bit.ly/emxisN and http://bit.ly/i3jzpj
    FEX airflow: http://bit.ly/qDh5Hw

  4. Thanks for your comments.

    I’m curious why you wouldn’t do the centralized 5K’s again.

    You can ONLY use FET from 2K to 5K. That works if all your connections go via 2K’s. The central 5K is needed if you want line-rate 10 G connections directly to the 5K pair. 2232’s don’t totally excite me due to the 4:1 oversubscription — so I anticipate needing direct connects via twinax to cover a growing number of servers which will use most of their 10 G connection.

    Ah, I see by central you mean central to the datacenter. I see why you’re attracted to that. On the other hand, it means either putting "access" 5K’s in pods for line-rate 10 G, or home run cabling outside the pod to the central 5K’s. Neither localizes the cabling the way the 5K in pod approach does. Yes, the management ports are a pain. So is ILO to the servers. N2K for that?

    The FEX airflow is "backwards": you can now reverse it when you order. See [url]http://www.cisco.com/en/US/prod/collateral/switches/ps9441/ps10110/product_bulletin_c25-680197.pdf[/url]

  5. Hey Pete,

    Gah, I misspoke in my opening comment. I meant that I wouldn’t do in-row 5Ks again. I’d put them in the center of the DC.

    I’m with ya on the oversubscription question. It hasn’t been an issue for me, but could be in some environments.

    Here’s something else to consider pricing-wise. There are two ways to connect *line*rate* servers to a center-of-DC Nexus 5K:

    1) Connecting up eight 10-gig server ports directly to a centralized 5K costs about $7600 real dollars (Cisco SR optics at 50% discount in the 5K, and $200 Finisar optics in the servers).

    2) Connecting up eight 10-gig server ports to a non-oversubscribed 2232 (8 uplinks and 8 downlinks) costs about $7800 (N2K-C2232PF plus 8 twinax at 50% discount).

    It’s interesting that these options price out almost the same. The FEX count budget (16 FEXes on a 5548?) might drive you to the first option, but option 2 introduces the possibility of mild oversubscription without any cost penalty.

    Connecting to an in-row 5K with TwinAx is nearly free by comparison, but comes with the operational overhead of managing far-flung Nexus 5000s, and the cost of installing copper structured cabling into the server rows. These costs offset the fact that in-row 5Ks gave "free" 10-gig server ports.

    Yes, N2K for the server iLO. Thank goodness for the 2248’s 100Mb/s capability! My 2148T pods included a 2960 with Gig copper vPC uplink to the in-row 5K pair.

    I’ve just worked through this question with a new customer. They had an expectation that "management network be separate from production" (reasonable), and applied that logic to server iLO (silly). We chewed on it for weeks and eventually decided that there was no sense in keeping the iLO off the prod network. What would server admins do with their iLO connections if their access ports are offline? Nothing 🙂

    On the airflow front, I’m familiar with the options. They’re a bit confusing:
    – The default airflow option is referred to as "front-to-back"
    – "back" is the port end of the FEX (what?)
    – The default airflow option makes FEXes flow in the same direction as the servers. This is a good thing.
    – The FEX is short. If its exhaust is even with the back of the servers, the intake is in the *middle* of the cabinet.
    – It’s HOT in the middle of the cabinet. This is a bad thing.

  6. No problem re "misspoke"; I just saw two possible meanings for "central" and asked for clarification. So you like the datacenter-central approach, and I like the mid-row-central flavor.

    I’d never really thought of the N5K in the context of what I’ve been calling "death by small switches" (too many = management nightmare). I guess that could happen if you have say 2 per pod and > 15-20 or so pods. OTOH, your approach ends up using the more-expensive fiber transceivers and more fibers to the datacenter-center boxes. To me, spending some money on fiber optics to save you management time may well make sense. I consciously spent some money to save my time in a former job where I was sliced way too thin. On the other hand, you could also debate that the extra fiber (many pairs versus a couple per each row-centered 5K) is a different form of management time cost. Yes, probably one-shot.

    I’ve been leaning towards in-band management; use an ACL to limit the SSH source if you must lock it down more. I see enough SSH username/password probes on an Internet-exposed server that I really do see some point in controlling that. But of course that’ll then mean you have to make an extra trip to the datacenter the next time you have an outage on the weekend. 🙂

    Interesting point re front/back, I’ve been feeling that way. It seems like "the side with the ports" ought to be considered "back" since servers do it that way. But the port side of the 7010 is considered the front, at least in the courseware.

    Good point re intake in MIDDLE of rack. Hadn’t really thought about that with Nexus 2K.

  7. Pete,

    I believe my approach can actually be less expensive, but to clarify, I’m not using 5Ks for 10G access ports. 2232 FEXes are doing that job, so there are no expensive SR-attached servers. I recognize that oversubscribing the hosts is a luxury that not everybody has.

    My center-of-DC 5K scheme introduces the possibility of using twinax between the 5Ks and 7Ks, eliminating the cost of the 7K-to-5K links. Delhi release was supposed to have supported the OneX converter in the M108 card, and the M132L, F132 and F248 cards support twinax directly. I expect the upcoming M2 cards will as well.

    I haven’t actually *done* this, mind you, but I think it’s compelling. Twinax saves $2800 (list) per link. Multiply this savings by the hundreds of ports you’re likely to use in a loaded 7018 pair, and it gets north of 7 figures quickly!

    Next, when 5Ks are deployed at center-of-row, you need console and management copper connections all over your data center. A couple here, a couple there… This is my main beef with the in-row 5K layout. These two management ports are the only things plugged into the stinking copper panels installed into the cabinets of my first (twinax-based) 5K deployment. Maybe it’s not about the cost. Maybe I just have a chip on my shoulder about installing these turds. 🙂

    Finally, centralizing the 5Ks decouples the 5K/2K/cabinet mapping from the geography of the room. Short/long/inconvenient row length is no longer an issue.

    I did do the inter-cabinet wiring no-no you mentioned, but only for iLO. 10gig server cabinets each had a pair of 2232s at ToR. In addition, there was a 2248 in every 3rd cabinet. Each server had a 2248 for iLO in either its own cabinet, or in the next cabinet over.

    The management headache isn’t about the number of 5Ks. As you note, it’s the same number either way. It’s more about geography: structured copper for management, and mapping FEXes into rows.

    "side with ports" *is* the "back" of a 5K or 2K, and it makes sense when looking at the rack of equipment. …But nobody guesses right when you plunk a FEX on their desk and quiz them about it 🙂

    Google Panduit CDE2 for airflow fix.

  8. Chris thanks for the great comment conversation! (Debate?) I’m enjoying the different points of view and discussion!

    I tend to assume things are going to rapidly head towards full-bore 10 G ports, so I may have been a bit deaf to your plan to bank on Nexus 2232’s. (I could also blame age, I suppose :-)) I take the approach that a pod is some number of racks with "modular connector" being two N5K’s. As density goes up, the number of racks per pod goes down — but my hope is this approach keeps some commonality. With your approach, you’re indirectly betting on a less-oversubscribed future N2K — which probably isn’t a bad bet.

    I felt the need to evaluate what you say by working some numbers. I’ve encountered blade servers with 12 x 10 G ports; 4 of those would fit into a rack. If you dual-home them, that’s 24 ports to a 2232 in the rack and 24 ports to one in the next rack. Ok, those numbers don’t work very well with 32-port FEXes, so let’s say 4 blade servers with 8 of the 12 x 10 Gbps ports in use. That makes 32 ports per rack; do pairs of racks with a single N2232 in each, and you end up with 8 x 10 Gbps uplinks per rack.

    You might connect up to 12 such pairs (24 racks) to a Nexus 5596 pair, for a total of 24 x 8 = 192 x 10 Gbps uplinks, probably FET to keep the cost down. So that’s 192 fiber links patched across to the dc-centered N5K pair in your approach. I’d instead run them locally in the pod, and only have a few 10 G links going from my in-pod N5K pair to the core N7K’s. Add 10 G as needed. More costly, but a lot less to fiber-patch through. Note that I do assume far fewer N5K-to-N7K links than 2K-to-5K links. If you’re doing FabricPath for high East-West bandwidth, you might violate that assumption.
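
    If you want to check my arithmetic, here it is as a few lines of Python:

        # Sanity-checking the numbers above
        blades_per_rack = 4
        ports_per_blade = 8        # using 8 of the 12 x 10 G ports per blade chassis
        racks = 24                 # 12 pairs of racks
        uplinks_per_fex = 8        # one N2232 per rack

        ports_per_fex = blades_per_rack * ports_per_blade   # 32: 16 local + 16 from the paired rack
        total_uplinks = racks * uplinks_per_fex             # 192 links back to the N5K pair

        print(f"{ports_per_fex} server ports per N2232 over {uplinks_per_fex} uplinks "
              f"= {ports_per_fex // uplinks_per_fex}:1 oversubscription")
        print(f"{total_uplinks} x 10 G uplinks to patch across the pod")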

    Re copper, I haven’t seen a great answer. Patching 2 copper mgmt ports per huge pod as above isn’t a big hassle. Run ’em to an in-pod 3560 or something and you only need 1 or 2 uplinks out of the pod, and can use that 3560 to handle 10 Mbps UPS mgmt ports, etc.? There are also darn small but capable Linux terminal servers available for $300-400 for OOB access. (See [url]http://www.opengear.com/products.html[/url] for instance.) That reduces your console port copper count to one connection to the terminal server. In which case all you need the mgmt port for is to have an address to SSH to, SNMP to, PING, etc. to manage the chassis. I’m wondering if plugging the management port into an N2224 off the other N5K might be a strange but valid solution if you complement it with a terminal server for when the whole pod is unreachable. There’s got to be a better simple way. (Comments anyone?) I’d say if you put L3 in the N5K you don’t need the mgmt port, but that’s a darn expensive way to solve that problem.

    Good point re de-coupling the row length.

    For clarity, maybe Cisco needs to document this stuff as "hot side = side with ports" or something like that.

    Yes, Panduit makes barriers, also useful for airflow with Nexus 7018 I’m told. Good tip!

  9. I was just checking the latest stuff at the link in the previous comment and saw [url]http://www.opengear.com/product-acm5000.html[/url], tiny box with a WiFi option. I didn’t check the price, but that’s a creative way to address the "don’t want more copper to patch" objection noted above. Assuming you like/trust/are OK with WiFi for such a purpose.

  10. Hey Pete,

    +1 on the opengear stuff. Great products (Hi Jared!) I think a wireless terminal server would be a tough sell in the data center, but I did once see an enterprise deployment of bluetooth-based console ports! 😮

    I worked up layout and pricing for some of these scenarios, and wonder how closely they relate to what you had in mind. They’re on the blog I linked above.

    Also, I’m curious to hear more about the blade servers you mentioned. The platform I’m most familiar with is the HP C-class. …But I’ve never seen one with so much 10Gb/s connectivity! Last ones I worked on had redundant 2-member 3120X stacks (four switches) with a total of 4 links configured to each enclosure. Even with four of these guys in a cabinet, that’s half of the 10Gb/s density that I’m building for now.

    Also FWIW, the 3120X (if that’s what you’re working on) is actually [i]undersubscribed[/i]. It has 2x10Gb/s uplink, but only 16x1Gb/s server interfaces.

    I haven’t explored it yet, but I’d probably be most interested in the new blade FEX if I were facing more C-class deployments.

  11. Thanks, Chris. I was thinking wireless if you didn’t want to run a copper connection, and small since you might only have a few 5K consoles to connect up here and there (if they were out in the pods). Yeah, that still leaves OOB mgmt — but maybe they hook up in-band somehow.

    Like your blogs.

    Dunno the HP model that’s going into a major hospital Epic deployment, but I’m told they have 12 x 10 Gbps ports each.
