I thought I’d write a quick note about performance problems, and since it’s Halloween as I write this, the title “Performance That Goes Bump in the Night” seems somehow appropriate. The trigger was several performance situations that came up recently and all came together. They affect the Cisco 3550, 3750, and 6500 switches, and the 7301 router.
Terry Slattery’s great blog post titled Application Performance Troubleshooting contains a discussion of overruns. The background story behind it is really what made me want to write this blog.
The basic point of this blog is that you need to know the performance characteristics of the equipment you’re working with. You need to know the limitations stated by the vendor. And you need to do performance testing with your prospective configuration and traffic mix to see whether there are any hidden gotchas behind the stated performance numbers. Packet size is only the most obvious of them.
Caveat: The following contains some moderately careful test results by various people representing their best effort at the time. I have done my best to fairly state the test conditions. I cannot guarantee accuracy. Test in your own setting if you need 100% reliable numbers that represent likely performance in your network.
Oversubscription and Cisco 4500/6500 Model Switches
The first thing you probably want to avoid is oversubscription. You need to be very aware of where you put oversubscription into your network, and monitor (ah yes, those pesky network management tools nobody uses) to make sure you’re not exceeding the capacity of a port or device. The easiest way I know of to oversubscribe right now is to use 6148, 6348, or 6548 cards in a 6500, as these cards are 8:1 oversubscribed. Put your 1 Gbps servers on adjacent ports, and as soon as their combined throughput exceeds 1 Gbps, you’ll be merrily dropping packets. In a closet, not so bad, unless you’ve got a bunch of stockbrokers or power users. IP videoconferencing or other video, maybe not so good.
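The arithmetic behind that 8:1 figure is worth making explicit. This is just a quick sketch of the math, assuming eight 1 Gbps ports sharing a single 1 Gbps connection to the switch fabric (the situation on these line cards):

```python
# Sanity-check the oversubscription ratio for an 8-port group on a
# 6148/6348/6548-style line card: eight 1 Gbps ports share one 1 Gbps
# connection to the fabric.
ports_per_group = 8
port_speed_gbps = 1.0
fabric_gbps = 1.0          # shared fabric connection per 8-port group

worst_case_offered = ports_per_group * port_speed_gbps
oversub_ratio = worst_case_offered / fabric_gbps

# Any sustained aggregate above fabric_gbps across the group gets dropped,
# no matter how lightly loaded each individual port looks.
print(oversub_ratio)  # 8.0
```

The point: each port can look half-idle while the group as a whole is over its shared capacity.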
Conclusion #1: Avoid putting servers on 6148/6348/6548 line cards, unless you put only a couple per 8-port group. Better: upgrade to 6748 cards with a DFC, or to Nexus.
On a related note, I’m not a big fan of putting 4500 / 4500-E switches in datacenters either. The 4500-E, until recently, was 2:1 oversubscribed in terms of backplane performance (24 Gbps per line card). The older 4500 only did 6 Gbps per line card — in other words, a great bottleneck for Gig-attached servers. Now, with the Sup-7, it arguably is a cheaper closet switch with better throughput than a 6500.
I also see a lot of 6513 switches in datacenters. Just don’t do that. Until the 6513-E, they didn’t support full throughput with the Sup720 on all slots.
Cisco 3750 Switches
Some people like 3750 model switches. You can do cool things with them, like build a stack and dual-home it off different stack members. Try that with your 6500! Modular growth, a heck of a lot less costly … all Good Things! However, one of my customers (Keith, you know who you are — thanks for the info!) has been testing aggregate MPLS WAN throughput. He found a couple of mis-configured edge devices where he wasn’t getting the nominal throughput. Reportedly, Verizon had to fix a couple of mis-configured policing commands in edge devices, having not set the edge burst capacity correctly, among other things. (Do YOU know you’re getting the WAN throughput you’re paying for?)
Along the way, throughput was at one point much lower than expected. It turned out the 3750 was apparently capable of only about 100 Mbps Gig port to port, as verified by an Ixia tester sending moderately sized frames. As far as can be recalled, there was minimal configuration on the 3750. The 3750-X tested out at about 300 Mbps. Total throughput went up with bigger frame sizes, as one might expect. The tester did much better on 6500 ports, so the issue was apparently not the Ixia configuration or test parameters. A Cisco partner document I have states the original 3750 is capable of 13 M (no units; I’m guessing packets per second). That’s about 13 Gbps total, or about 260–270 Mbps per port. If those are two-way numbers, then 130 or so one-way is in the same ballpark. The tentative conclusion is that neither switch is stellar for 1 Gbps-attached servers. Your mileage may vary, and the usual cautions apply.
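For the curious, here is how that "13 M" figure turns into per-port numbers. The frame size and port count are my assumptions (the partner document gave neither): roughly 128-byte frames and a 48-port switch get you into the same ballpark as the figures above.

```python
# Back-of-the-envelope: forwarding rate (packets per second) to throughput.
# Frame size and port count are assumptions, not from the partner document.
pps = 13_000_000        # "13 M", assumed to be packets per second
frame_bytes = 128       # assumed "moderately sized" frame
ports = 48              # assumed 48-port 3750

gbps_total = pps * frame_bytes * 8 / 1e9        # total switch throughput
mbps_per_port = gbps_total * 1000 / ports       # evenly spread across ports

print(round(gbps_total, 1), round(mbps_per_port))  # ~13.3 Gbps, ~277 Mbps/port
```

Note how sensitive the result is to frame size: at 64-byte frames the same 13 Mpps is only about 6.7 Gbps, which is why pps numbers without a stated frame size are hard to compare.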
Conclusion #2: The 3750 may not be a great datacenter switch. It also may not be what you want front-ending a 500 Mbps to 1 Gbps Ethernet-based Internet connection. Test and verify for yourself — and do let me know / comment on this blog if you disagree with these test results!
Cisco 7301 Routers
We (mainly a colleague I won’t name) did some 7301 testing at a customer site to verify a QoS policy. This is my recollection of what he found, based on an old-ish email thread. Along the way, it turned out a 7301 running 12.3(11)T ran at 48% CPU for traffic between two Gig ports, bi-directionally, at a packet size of 222 bytes. Double that traffic would have been 100% of Gig port utilization, but it was also 100% CPU load. Adding “trust DSCP” on input and 4 output classes dropped throughput to 150 Mbps.
That testing also showed that a 3550 with 5 input classes, 4 of them ACL-match based, peaked at 20 Mbps on a Gig port. Removing one ACL raised that to 220 Mbps. (Sorry, no data on the number of ACL rules; probably pretty short.)
Conclusion #3: Test with the configuration you plan to use. Just because there’s a Gig port there doesn’t mean you get to use all of that bandwidth. Some features, such as QoS, may adversely impact max throughput in ways you don’t expect.
Oversubscription and Network Management
I recently worked with someone who reacted to something I said by pulling out a wallet card and telling me I’d violated “Man Rule #5”. I’m starting to wonder if a lot of male engineers think something along the lines of “Real Men don’t use network management tools”. Is that also written on one of those wallet cards?
Even using the tools, it’s hard to see oversubscription directly — sometimes overruns are a better indicator. Why can’t you see oversubscription directly? I don’t know of any tool that is going to be all that great at reporting oversubscription of ASIC-based port groups. Is there one that will add up the inbound utilization for ports 1–8, 9–16, and so on, and report it? On a Nexus 7000 32-port 10 Gbps card, add up ports 1, 3, 5, 7 or 2, 4, 6, 8?
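Since no tool I know of does this, here is a hypothetical sketch of the kind of check I have in mind. The function name, group size, and sample data are all invented for illustration; the per-port inbound rates could come from anywhere (SNMP ifHCInOctets deltas, for instance):

```python
# Hypothetical sketch: given per-port inbound rates in Mbps, sum each
# ASIC port group and flag any group exceeding its shared fabric capacity.
def flag_oversubscribed(inbound_mbps, group_size=8, fabric_mbps=1000):
    """Return (first_port, last_port, total_mbps) for each overloaded group."""
    flagged = []
    for start in range(0, len(inbound_mbps), group_size):
        group = inbound_mbps[start:start + group_size]
        total = sum(group)
        if total > fabric_mbps:
            flagged.append((start + 1, start + len(group), total))
    return flagged

# Ports 1-8 total 1350 Mbps (over the shared 1 Gbps); ports 9-16 total 400.
samples = [300, 250, 400, 150, 100, 50, 50, 50,
           100, 100, 50, 50, 25, 25, 25, 25]
print(flag_oversubscribed(samples))  # [(1, 8, 1350)]
```

Note that no single port in the flagged group is anywhere near 1 Gbps — which is exactly why per-port utilization graphs miss this.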
You also have to bear in mind that utilization averages lie. That is, an average reports a bunch of traffic spikes mixed with a lot of low traffic levels. Yet the spikes are what cause the drops — you don’t get “rollover” credits for a lull in traffic! So Terry and I believe in 95th-percentile data, which tells you how bad things are getting while leaving out some of the most extreme behavior. And that’s a good topic for another blog.