I keep seeing use case discussions where SDN magically solves bandwidth problems. I am very interested in them, but I do wish the authors would take the time to explain the network topology and their assumptions. Frankly, this does not add up for me, at least not without some additional information. My plea to authors writing about this: please explain a bit more of the context or the assumptions about the network that your thoughts are based on! And no, I am not going to link to the latest blog I read exhibiting this problem; there is no value in possibly embarrassing the author.
Do not get me wrong. There are probably lots of SDN (usually OpenFlow flavored SDN) situations where Programming Flows can do something nifty that QoS cannot do. And I have seen some examples, and later I will identify some that seem plausible. But some of what I am seeing is broad marketing with no context, triggering skepticism.
I had better explain one of my assumptions: we are talking datacenter and LAN here, not WAN.
For the WAN, I get it. Google is doing it: traffic engineering via SDN. There are paths that routing will not normally take, and flow or MPLS technology lets you traffic engineer onto them.
What I do not get is some of the claims for SDN-based Traffic Engineering in the datacenter. Context is one issue; the other is how rapidly an affordable control platform can collect and analyze data.
I am aware there are a bunch of academic papers out there about different topologies and flow routing algorithms. Personally, I think that battle is over, unless something dramatically changes. As far as I can see, most sites will be building a spine-and-leaf Clos topology in the datacenter. Special-purpose compute modules may use different topologies, but in general I do not think the discussion should be about arbitrary topologies. They do not (should not?) exist in the real world. I say "should not" because there are lots of networks out there built by people with little training.
Yes, I probably just mildly dissed the academic world's work on measuring cross-sectional flows and modeling. I am not convinced they are modeling the right things, or using the right metrics for success. Putting in the time to firm up those impressions is just not on my priority list.
I suspect some of the flow shifting discussions are assuming full mesh or fairly meshy topologies, where ECMP does not apply, leaving more room (and more need) for flow shifting. How likely is it we will be using such physical LAN topologies?
Getting back to spine-and-leaf, I have not yet mentioned FabricPath (or its less capable TRILL cousin), but I am thinking one of them is likely what you will have. If you are doing the flat, all-VLANs-everywhere datacenter that we keep hearing about, then FabricPath or TRILL is probably where you are headed. (My current take on SPB is that it is nested VLANs attempting to make L2 switches behave more like L3; did I just say kluge?) With NSX and all the talk about overlay networks, it appears that datacenter L3 to the access layer might be coming back as a possible trend. Have your vMotion L2 cake while having a robust L3 underlay? If you think 16-way ECMP with hashing, or anycast HSRP, is not good enough for random traffic flows and SDN can do better, please say so, and present some evidence and specifics, or the scenario you have in mind.
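For anyone who has not looked inside ECMP lately, here is a minimal Python sketch of the idea behind hash-based path selection. The hash and the field choices are mine for illustration; real switch ASICs use their own vendor-specific hash functions and field sets.

```python
import hashlib

def ecmp_uplink(src_ip, dst_ip, src_port, dst_port, proto, num_uplinks=16):
    """Pick one of num_uplinks equal-cost paths by hashing the flow 5-tuple.
    Illustrative only: real ASICs use their own hash functions and fields."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port},{proto}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % num_uplinks

# Every packet of a given flow hashes to the same uplink, which preserves
# packet order but also means one elephant flow cannot be spread across uplinks.
print(ecmp_uplink("10.1.1.10", "10.2.2.20", 49152, 50010, "tcp"))
```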
The thing is, in the datacenter most people are going to build out what links and bandwidth they can, based on the equipment they have. SDN is not going to be able to add spine switches and cabling, at least not in the near term. (Yes, Plexxi sort of can, using optics, but…). If you have paid for the port and the optics, why would you not put in the cabling and maximize your throughput?
So it seems there are two cases:
- If you have lots of bandwidth, there is no need to do anything; applications get what they need. So what is the role for SDN?
- If you have fairly full links or congestion, or periodic congestion of all access uplinks (as perhaps with Hadoop data distribution), then you have a zero-sum game at that time. If you expedite, e.g., heavy-weight data distribution for Hadoop because it needs more bandwidth, then you are probably depriving other applications of bandwidth. Which ones are you going to deprive?
The second of these is the one that intrigues me. If SDN (OpenFlow) is to somehow alleviate Hadoop traffic surges, who is the loser? Why is this not something we can accommodate with QoS prioritization? That is, with QoS we can easily provide Hadoop hosts or traffic with first use of some percentage of the bandwidth, plus a proportional share of any otherwise unused bandwidth.
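To make that arithmetic concrete, here is a toy Python sketch of the "guaranteed minimum plus proportional share of the leftovers" model I have in mind. The class names, demands, and percentages are made up for illustration; this is not any vendor's actual scheduler.

```python
def allocate(link_bw, demands, guarantees):
    """Toy model of guaranteed-minimum plus proportional sharing of unused bandwidth.
    demands and guarantees are dicts keyed by traffic class; guarantees are
    fractions of link_bw. Illustration only, not any vendor's actual scheduler."""
    # Pass 1: each class gets the lesser of its demand and its guaranteed share.
    alloc = {c: min(demands[c], guarantees[c] * link_bw) for c in demands}
    leftover = link_bw - sum(alloc.values())
    # Pass 2: split what is left among still-hungry classes, weighted by guarantee.
    # (One pass only; a real scheduler effectively iterates packet by packet.)
    hungry = {c: demands[c] - alloc[c] for c in demands if demands[c] > alloc[c]}
    weight = sum(guarantees[c] for c in hungry) or 1
    for c in hungry:
        alloc[c] += min(hungry[c], leftover * guarantees[c] / weight)
    return alloc

# Hypothetical: 40 Gbps of uplink, Hadoop guaranteed 40%, everything else 60%.
print(allocate(40, {"hadoop": 30, "other": 20}, {"hadoop": 0.4, "other": 0.6}))
# Hadoop ends up with 20 of the 40 Gbps; "other" still gets its full 20.
```

The point is that plain QoS arithmetic already gives Hadoop first claim on its share, without any per-flow steering.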
After “none” and “all” there still is “some”. That is, there really is a third case:
- What if Hadoop (or whatever) traffic occurs only sparsely across the access switch pairs, so that some uplinks are congested and some are not?
That is the situation where shifting some flows might free up bandwidth for Hadoop and improve performance for the shifted flows.
To elaborate: if you randomly have one or two elephant flows to an access switch pair, then OpenFlow traffic shifting might apply, because ECMP yields roughly equal results only when you have enough flows for statistical averaging to kick in. In some ways, though, that seems to me to be as much a VM positioning / addressing problem as a network one, and a fairly limited corner case.
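A tiny simulation makes the statistical-averaging point. The flow sizes and uplink count below are hypothetical; the only claim is the qualitative one, that a few elephants hash unevenly while many mice average out.

```python
import random

def avg_busiest_uplink(flow_sizes_gbps, num_uplinks=4, trials=10000):
    """Randomly place flows on uplinks (a stand-in for ECMP hashing) and
    return the average load on the busiest uplink. Illustrative model only."""
    total = 0.0
    for _ in range(trials):
        loads = [0.0] * num_uplinks
        for size in flow_sizes_gbps:
            loads[random.randrange(num_uplinks)] += size
        total += max(loads)
    return total / trials

many_mice = [0.01] * 400       # 400 small flows, 4 Gbps total
two_elephants = [2.0, 2.0]     # 2 big flows, also 4 Gbps total
print(avg_busiest_uplink(many_mice))      # roughly 1.1 Gbps: nicely averaged
print(avg_busiest_uplink(two_elephants))  # roughly 2.5 Gbps: 1-in-4 chance both land on one uplink
```

That is the narrow case where steering individual flows plausibly beats the hash.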
If you have a shifting Hadoop configuration, or are using reserve capacity on VMs for calculation, or something along those lines, then spot congestion might come into play. Or competing Hadoop runs with different priorities. In that case, one might want SDN or the controller to tie the provisioning of the Hadoop calculation to the provisioning of QoS (or flow handling in whatever form).
If the SDN article I read was about randomly distributed Hadoop hosts, I did not get that out of it. For randomly distributed, sparse elephant flows with underused capacity in an ECMP setting, yes, shifting flow assignments could help.
The second aspect of this where I start to have significant problems with SDN magic is the claim that an SDN controller can track fluctuations in application flows in datacenter networks. Sure, it can do that. But in time to react and do something useful, let alone at acceptable cost?
In the WAN, sure, it is being done now. Particularly if you are dealing with large aggregate flows (site A to site B) so that individual flow fluctuations statistically average out (modulo TCP porpoising and other artifacts). Presumably the measurement can be done frequently enough if you do not have many entities to measure.
I note that most network management products can only poll a few SNMP variables across all interfaces at 3-to-5-minute intervals (except perhaps StatSeeker, which can do more, faster). Others make it too expensive to do so (I am thinking mainly of the high-priced products, but also to some extent even SolarWinds here). Sure, throw a scale-out architecture at it and maybe we can get more data faster. At what cost in server hardware and product licensing?
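Some rough arithmetic shows why the polling rate is the crunch point. The device, interface, and counter numbers below are assumptions for illustration, not measurements of any real product.

```python
def snmp_fetches_per_sec(num_devices, interfaces_per_device, oids_per_interface, interval_sec):
    """Back-of-envelope SNMP poller load: OID fetches per second needed to cover
    every interface once per polling interval. All inputs are assumptions."""
    total_oids = num_devices * interfaces_per_device * oids_per_interface
    return total_oids / interval_sec

# Hypothetical mid-size datacenter: 200 switches x 48 interfaces x 6 counters each.
print(snmp_fetches_per_sec(200, 48, 6, 300))  # 5-minute polling: 192 fetches/sec
print(snmp_fetches_per_sec(200, 48, 6, 5))    # 5-second polling: 11,520 fetches/sec
```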
What comes to mind is the sort of accommodation data centers now make to persistent large flows such as replication. So OK, I can imagine SDN reacting to a small number of long-lived large flows that do not tie up all the uplinks out of an access switch pair, and shifting traffic to reduce congestion. That should be doable now.
For years, I have been calling this situation (by bad analogy) the network management Heisenberg Uncertainty Principle (many devices/interfaces @ slow polling, or few devices/interfaces @ rapid polling).
Here is what I would like some data on: how many flows are there in a datacenter of size X, how fast do they change, how often would you have to poll to get useful information about them, and how fast would you have to update flow tables to do anything with the information? Bear in mind that congestion involves micro-bursts filling up queues, not 5-second or longer utilization averages. Also bear in mind that a delayed feedback loop generally exacerbates oscillations. (Say that three times quickly!)
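Here is the kind of back-of-envelope calculator I am asking for, with every input a placeholder to be replaced by real measurements (the function and the numbers are mine, not from any study):

```python
def flow_visibility(num_hosts, sessions_per_host, avg_flow_lifetime_sec, poll_interval_sec):
    """Rough flow-churn math: concurrent flows, new flows per second, and how many
    flows begin between two polls. Every input is a placeholder, not a measurement."""
    concurrent = num_hosts * sessions_per_host
    new_per_sec = concurrent / avg_flow_lifetime_sec
    started_between_polls = new_per_sec * poll_interval_sec
    return concurrent, new_per_sec, started_between_polls

# Placeholder numbers: 2,000 hosts, 50 sessions each, 30 s average flow lifetime,
# 60 s polling interval.
print(flow_visibility(2000, 50, 30, 60))
# (100000 concurrent flows, ~3,333 new flows/sec, ~200,000 flows started between polls)
```

With these placeholder numbers, roughly twice as many flows start between two polls as exist at any instant, so many come and go without ever being seen; that is exactly the delayed-feedback problem I am worried about.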
Kudos to Ivan Pepelnjak: I really liked his blog post about estimating the number of TCP sessions per host, at http://blog.ipspace.net/2013/10/estimating-number-of-tcp-sessions-per.html. That is the sort of data I would like to have for this topic, for discussions of on-the-fly datacenter Traffic Engineering (via SDN or otherwise).
To sum up, I would like to see some feasibility data and context backing up SDN / flow control discussions, use cases, and claims. Especially broad claims that it will alleviate troublesome congestion in (any) datacenter.
I am looking forward to learning from any (polite) comments and debate this blog inspires!
Twitter: @pjwelcher