I’ve been pondering cloud services / cloud computing for a while, and I’d like to pass along some thoughts. It happens that a reader asked a good question about https://netcraftsmen.com/blogs/entry/data-center-l2-interconnect-and-failover.html. I responded at length with some of what’s been on my mind, and would like to re-present that unexpectedly long answer, with some revisions, as a blog post that is more visible to everyone. (Justification: instructors are allowed to like the sound of their own voice! — and I’m teaching one week a month now, the FireFly Nexus class.)
I like the idea of cloud. Blue sky, puffy white clouds … I like the suggestion someone made of a marketing-free name: swamp. “Let’s put our computers into a swamp.” Makes you think about the impact of the words used. Clouds are warm and comfortable, swamps… [Reminds me of someone I knew in sales and marketing who was very into Psycho NeuroLinguistics, the hidden influence word choices have on how we think about things.]
Anyway, I’ve been sitting on the cloud sidelines, quietly watching and pondering. Below are some thoughts; with luck they’re even somewhat original.
Cloud Security
I whole-heartedly agree with those who say some form of security audit / compliance certification pass-through is needed for cloud security to work for customers. There are legal implications (“due diligence”): it seems that if there’s a breach, you can’t just say “Amazon (or whoever) told me they were secure, go sue them.” My legal perception is that you need to be able to prove you made a suitable effort and weren’t negligent — and had reason to believe the provider is secure to the level needed. That’s quite a mind-set change for classic Service Providers. I’m still repeating the I-need-to-know dance with MPLS carriers: e.g., why should I trust your SIP trunking, and how can I minimize the exposed portion of my network?
Partial Sidetrack: Recently Verizon provided a customer with some pretty good canned SIP security alternatives and discussion. That feels like a rare exception, and I don’t like that it’s rare.
A few years ago I did an MPLS security analysis for a major retail chain. It took a lot of effort to get the carrier to provide broad information about security measures. Ditto for a more recent server colocation farm due diligence effort, where the carrier in question had a pretty good story to tell in almost all regards. The two things I found that I didn’t like: no plan to contain water should sprinkler heads actually go off, and no automated tools that would actually detect and report unauthorized config changes to managed devices. They told me “that couldn’t happen due to their change control process” and that any employee caught going outside the process would be disciplined or fired, so “employees wouldn’t think of doing that”. They didn’t consider that hackers or staff might do something outside the system and hope to go undetected — and I sure couldn’t see what prevented someone from doing just that.
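To be concrete about what I mean by “automated tools”: here is a minimal sketch of the idea, comparing a hash of each managed device’s current config against an approved baseline copy. The device name, directory layout, and the assumption that you already have some way to fetch the current config are all placeholders of mine; real tools (RANCID and various commercial products) do this far more thoroughly.

```python
# Minimal sketch: flag configs that differ from an approved baseline.
# Device names, paths, and fetch_config() are hypothetical placeholders.
import hashlib
from pathlib import Path

def fingerprint(config_text: str) -> str:
    """Hash a config, ignoring trailing whitespace, so comparisons are stable."""
    normalized = "\n".join(line.rstrip() for line in config_text.splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def check_device(name: str, current_config: str, baseline_dir: Path) -> bool:
    """Return True if the device's config matches its approved baseline."""
    baseline_file = baseline_dir / f"{name}.cfg"
    if not baseline_file.exists():
        print(f"{name}: no approved baseline on file -- flag for review")
        return False
    if fingerprint(current_config) != fingerprint(baseline_file.read_text()):
        print(f"{name}: config differs from approved baseline -- ALERT")
        return False
    return True

# check_device("core-sw-1", fetch_config("core-sw-1"), Path("/configs/approved"))
```

The point isn’t the code; it’s that detecting out-of-process changes is cheap enough that “our change control process prevents it” shouldn’t be the whole answer.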
My thought: if you feel you can’t talk about it and learn, then maybe your security isn’t so good after all? Conversely, the discipline of preparing for public scrutiny might in fact shed light into some dark corners and gaps and actually reduce risk — which could have significant financial benefits as well (avoidance of lawsuits?).
I tend to generalize, sometimes on little evidence. What I see here is a small pattern: it might be good if Service Providers (including cloud providers) actually told us, in broad terms, what they’re doing for security, including detection of intrusions and unauthorized configuration changes (one of my pet concerns since the discussion above).
There are some situations where cloud security is perhaps less important. If you’re processing large data sets of numbers, someone who doesn’t understand the data schema is probably not going to be able to do much with your data. As someone told me: “we have trouble understanding our data, and we know where it came from. Good luck to anyone else with that.” (Source unidentified since that might provide too much info.)
I hear that computer animation is moving to the cloud, since the need for compute resources is zero until a movie is funded, and then all of a sudden you want maximum cost-effective capacity until it’s no longer needed. I suppose if someone got rendering data they might somehow steal an animated scene… but what is the real cost if that happens? Is the volume of data so large that data theft would be conspicuous?
My final comment is to note the trend with government and DoD IT: using private clouds, or other government entities requiring USA data location plus some degree of server / hypervisor isolation and other measures. That suggests to me that what’s out there right now is Cloud 1.0, and as the security packaging improves we’ll probably see several variants. Practical conclusion: an internal cloud is a good way to learn, and if you can partner with a provider that has automation tools and Operations procedures, that might really reduce learning curve costs and time. Don’t re-invent the wheel!
Cloud Bandwidth
The cloud discussions I’ve seen have not fully considered WAN/MAN bandwidth’s impact. Cloud computers generally have lots of bandwidth to the Internet, so end user access speed isn’t the concern I have. Instead, there’s the “hotel California” issue: moving your data or VMs to or from the cloud, and the cost of that. That particular aspect of bandwidth has been pretty well discussed. For that matter, I just copied about 1 GB of data across the country to my laptop over a VPN, and it didn’t take all that long. With huge databases, that’s a different story. All that seems to be known and rather thoroughly talked about.
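To put some very rough numbers on the data-movement point (my own back-of-envelope arithmetic; the link speeds and the assumed 70% effective throughput are illustrative, not anyone’s quoted figures):

```python
# Back-of-envelope transfer time: bytes * 8 bits, divided by usable throughput.
# Link speeds and the 70% efficiency factor are illustrative assumptions.
def transfer_hours(gigabytes: float, link_mbps: float, efficiency: float = 0.7) -> float:
    bits = gigabytes * 8e9
    return bits / (link_mbps * 1e6 * efficiency) / 3600

for size_gb in (1, 100, 10_000):             # 1 GB, 100 GB, roughly 10 TB
    for mbps in (50, 1000):                   # a typical VPN vs. a 1 Gbps link
        hours = transfer_hours(size_gb, mbps)
        print(f"{size_gb:>6} GB over {mbps:>4} Mbps: ~{hours:.2f} hours")
```

On those assumptions, a gigabyte over a decent VPN is a few minutes, while 10 TB even over a full 1 Gbps link is more than a day of sustained transfer. That’s the “hotel California” problem in a nutshell.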
Here’s a different angle on this. Juniper is touting the sub-microsecond latency of their new switches (high hundreds of nanoseconds?!). Depending on the algorithm, etc., latency can really slow the speed of computation. Example to consider: the DNS lookups a home computer does for the average web page these days, plus fetching ads, supposedly slow the delivery of web content to users noticeably.
When you move a bunch of servers to the cloud, what you’re doing is effectively sticking part of your datacenter at the end of a long slow pipe, the WAN / Internet connection from your site to the cloud location. In effect, you’ve moved some portion of your datacenter access switch uplinks and/or distribution or core switch backplanes out of your datacenter and into the cloud. Can the MAN/WAN/Internet provide comparable bandwidth and low latency?
- Bandwidth, maybe. While 1 Gbps TLS service is commercially viable in more and more cases, I don’t see 10 to 40 Gbps MAN/WAN links being cheap anytime soon. Can the MAN provide cost-effective capacity comparable to what you have within your datacenter? That depends on how much traffic you’re moving, which is probably related to the industry and the size of the organization.
- For latency, no. The speed of light is something we have to live with. Move servers elsewhere, and you’ve definitely added latency (some rough numbers below).
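Some back-of-envelope numbers on that latency floor: light in fiber covers very roughly 200 km per millisecond, and real circuit paths are longer than the straight-line distance. The 1.3x path-stretch factor and the distances below are my own illustrative guesses:

```python
# Rough propagation delay: light in fiber covers about 200 km per millisecond,
# so distance alone sets a latency floor no switch or router can remove.
# The distances and the 1.3x path-stretch factor are illustrative guesses.
def one_way_ms(km: float, path_stretch: float = 1.3) -> float:
    return km * path_stretch / 200.0

for km in (50, 500, 4000):        # metro, regional, roughly coast-to-coast
    rtt = 2 * one_way_ms(km)
    print(f"{km:>5} km: ~{rtt:.1f} ms round trip, before any queuing or processing")
```

Tens of milliseconds of round-trip time doesn’t sound like much, until an application pays it hundreds or thousands of times per transaction. That’s where the rest of this discussion is headed.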
For lack of a term, refer to the group of servers delivering a service as a “service delivery aggregate” or perhaps an “AppPod” (like a VMware vApp, perhaps?). If you split that group across the cloud boundary, you might be fine. Or not. What matters is whether there are heavy or latency-sensitive data flows between the stuff you moved to the cloud and the stuff you didn’t move.
I strongly suspect (from experience) that most server admins and app developers are unaware of their data flows, let alone where they might have embedded latency sensitivity. The first thing I’d like when solving an outage or slowness issue is a rough indication of who talks to whom, especially DNS/LDAP authentication, major database connections, that sort of thing. I have yet to see anybody really document their mission-critical application, presumably because the vendor doesn’t document the various tools that make the application work (I’m thinking of something as complex as WebSphere here). One would think it might be good to examine packet captures and document this stuff up front, before the first time one engages in troubleshooting — but of course, the crisis du jour takes precedence.
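For what it’s worth, even a crude summary of exported flow records gets you most of that who-talks-to-whom picture. A minimal sketch, assuming a CSV export with src, dst, dst_port, and bytes columns (the file name and column names are placeholders; every NetFlow collector and capture tool exports something slightly different):

```python
# Minimal sketch: summarize "who talks to whom" from exported flow records.
# Assumes a CSV with src, dst, dst_port, bytes columns (placeholder format).
import csv
from collections import Counter

def top_talkers(flow_csv: str, n: int = 20) -> None:
    volume = Counter()
    with open(flow_csv, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["src"], row["dst"], row["dst_port"])
            volume[key] += int(row["bytes"])
    for (src, dst, port), total in volume.most_common(n):
        print(f"{src:>15} -> {dst:<15} port {port:<5} {total:>14} bytes")

# top_talkers("flows.csv")   # e.g., a NetFlow or packet-capture export
```

Even that much, run before a cloud migration rather than during an outage, would answer a lot of the where-does-this-app-really-talk questions.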
Coming back to latency, we certainly see enough of the “I changed my T1 to a T3 to improve app performance and it didn’t help” type of consulting troubleshooting work. Programmers just aren’t taught Networking 101 — a topic I’ve previously expended words upon. What they need to understand is where to be concerned about network latency, how to recognize such a problem even when coding on a high-speed, uncongested dev network, and the programming and database techniques that reduce or totally mitigate a latency-related speed problem.
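The classic case of what I mean is the chatty application that issues one small synchronous query per row instead of batching. The row count and RTT below are made up purely to show the arithmetic; nothing here is a measurement:

```python
# Illustrative arithmetic only: each synchronous round trip pays the network RTT.
# The row count and RTT values are made-up numbers, not measurements.
def chatty_seconds(rows: int, rtt_ms: float) -> float:
    return rows * rtt_ms / 1000.0        # one query (round trip) per row

def batched_seconds(batches: int, rtt_ms: float) -> float:
    return batches * rtt_ms / 1000.0     # one round trip per batch

rows, wan_rtt = 5000, 40.0               # 5000 lookups at 40 ms WAN round trip
print(f"per-row queries:   ~{chatty_seconds(rows, wan_rtt):.0f} s in round trips alone")
print(f"one batched query: ~{batched_seconds(1, wan_rtt):.2f} s plus transfer time")
```

On a dev LAN with sub-millisecond round trips, both versions look instant; put 40 ms of WAN in the middle and the per-row version spends minutes just waiting. That’s why upgrading the T1 to a T3 doesn’t help: the problem is round trips, not bandwidth.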
Conclusion: I expect a booming business analyzing why the application got a lot slower when it was moved to the cloud.
Ultimately I think we need to identify “tight groups” of servers (AppPods) that talk to each other a lot, but not much outside the group, and move them to the cloud as a single entity. Automated tools to do that would be nice. I’m not holding my breath, since for at least 10 years I’ve been looking for network management tools that would identify the flows involved in delivering a service (e.g. the www.my-corporation.com web pages) — the automated answer to the lack of time to document such things manually. Oh, and the net management app must not be extremely or even moderately labor-intensive. I’ve seen a lot of claims in this space — salespersons, please do not contact me unless you have reference customers I can talk to, maybe a cross-sample that I select rather than those who drank the strong Kool-Aid.
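To sketch what such a tool might do with flow data it already has (the flows dictionary, the byte threshold, and the idea of treating connected components of the heavy-traffic graph as candidate AppPods are all simplifications of mine):

```python
# Minimal sketch of the "tight group" idea: link servers whose mutual traffic
# exceeds a threshold, then treat each connected component as a candidate AppPod.
# The flows input and the byte threshold are illustrative assumptions.
from collections import defaultdict

def app_pods(flows: dict, min_bytes: int = 10**8) -> list:
    """flows maps (server_a, server_b) -> bytes exchanged over some period."""
    graph = defaultdict(set)
    for (a, b), byte_count in flows.items():
        if byte_count >= min_bytes:       # keep only the heavy conversations
            graph[a].add(b)
            graph[b].add(a)
    seen, pods = set(), []
    for node in graph:
        if node in seen:
            continue
        pod, stack = set(), [node]
        while stack:                      # walk the heavy-traffic graph
            current = stack.pop()
            if current in seen:
                continue
            seen.add(current)
            pod.add(current)
            stack.extend(graph[current] - seen)
        pods.append(pod)
    return pods
```

A real tool would need to weigh latency sensitivity as well as volume, but even a volume-only grouping would beat guessing.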
Another thought (perhaps driven by security concerns, or the size of the database) might be to shift applications and servers to the cloud while keeping the data at the company data center. That seems rather clearly to run afoul of latency issues unless you have exceptional applications that run stored procedures and/or only need to transfer small amounts of information back to the servers that live in the cloud. I would think the need for cloud relates to how much computing is needed, with web front ends serving lots of users, and lots of data to crunch, being the two most obvious consumers of CPU cycles.
I suspect we’ll also see that services such as DNS, LDAP, Active Directory, etc. need to be replicated to the cloud since apps may make frequent calls to such services — and any latency will drastically affect app performance. I’ve seen this in a classic datacenter with Lotus Notes and an overwhelmed LDAP server, for instance. I’m not actually in the middle of any cloud efforts right now, so if someone can provide some feedback in a comment, that’d be great!
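If you want a quick sense of how much name-service latency an application actually pays, timing the lookups it depends on is a reasonable first step. A rough sketch; the host names are placeholders for whatever your app really calls, and OS-level caching will flatter the numbers:

```python
# Rough check of name-resolution latency for the services an app depends on.
# Host names are placeholders; local caching will make repeat lookups look fast.
import socket
import time

def lookup_ms(name: str, tries: int = 5) -> float:
    start = time.perf_counter()
    for _ in range(tries):
        socket.getaddrinfo(name, None)
    return (time.perf_counter() - start) * 1000 / tries

for host in ("ldap.example.com", "db.example.com"):
    try:
        print(f"{host}: ~{lookup_ms(host):.1f} ms per lookup")
    except socket.gaierror:
        print(f"{host}: lookup failed")
```

Multiply whatever you see by the number of lookups per transaction, and you have a feel for whether those services need to follow the app into the cloud.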
The Human Factor
There’s also the human factor. That’s the one that was partially responsible for ATM failing to replace Ethernet. If you have to change the application, that takes time. If you have to qualify the hardware the application runs on, that takes time and effort. Thus if you’re running on different hardware, etc. in the cloud, and the application is critical, you may need time to validate correct application operation. This is just like P2V (physical-to-virtual conversion, shifting a server to VM form). If people insist on manual testing, it’s going to take time. And since there might be job fears alongside a move to the cloud, there might be a lot of insistence on testing.
This says to me that moves of apps to the cloud are going to be messy and time-consuming, unless they’re new apps that were developed on a given cloud. Strong hardware virtualization (dare I say “Cisco UCS” or something like it) might be needed.
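One way to chip away at the manual-testing burden is a scripted smoke test run before and after the move, so at least the is-it-up and how-slow-is-it questions get answered automatically. A minimal sketch; the application names, health-check URLs, and the 2-second “slow” threshold are all placeholders:

```python
# Minimal post-migration smoke test: hit each app's health URL and flag
# anything broken or slow. Names, URLs, and thresholds are placeholders.
import time
import urllib.request

CHECKS = {
    "intranet portal": "http://portal.example.com/health",
    "order entry":     "http://orders.example.com/health",
}

def smoke_test(checks: dict, slow_seconds: float = 2.0) -> None:
    for name, url in checks.items():
        start = time.perf_counter()
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                elapsed = time.perf_counter() - start
                verdict = "SLOW" if elapsed > slow_seconds else "OK"
                print(f"{name}: HTTP {response.status}, {elapsed:.2f}s [{verdict}]")
        except Exception as exc:
            print(f"{name}: FAILED ({exc})")

# smoke_test(CHECKS)   # run against the old environment, then the new one
```

It won’t replace real application testing, but it gives the migration team something objective to compare before and after.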
The good news to me is that this might provide the impetus for some developers to clean up their act. I’ve been wondering for something like 20-30 years now why every database I’ve ever encountered wanted to stick the server name and possibly the IP address in all sorts of itty-bitty files. Surely the software can figure all that out when the program or service starts up?
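What I’d rather see is the application resolving its dependencies from one well-known name at startup instead of baking host names and IPs into scattered files. A minimal sketch of that idea; the service name and port are placeholders:

```python
# Sketch: resolve the database service at startup from one well-known name,
# rather than reading an IP address out of a config file. Name/port are placeholders.
import socket

def locate_database(service_name: str = "db.example.com", port: int = 5432):
    """Ask DNS where the database lives right now, once, at startup."""
    addr_info = socket.getaddrinfo(service_name, port, type=socket.SOCK_STREAM)
    family, _, _, _, sockaddr = addr_info[0]
    return sockaddr        # e.g. ('10.1.2.3', 5432), whatever DNS says today

# db_addr = locate_database()   # no IP baked into itty-bitty files
```

Combine that with sensible DNS (or a directory / service-location record) and moving the database stops being a hunt through configuration files.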
Conclusion
I worry I’m sounding too negative here, and that’s not my intent. My intent is more that by being aware of the potential snags, we can deal with them and make cloud work for us. There are various success factors for cloud computing, and not all of the practical aspects have been fully resolved yet. The Human Factor side strikes me as something I’ve seen a lot with new technology: people start out with a fear of novelty, and have to kick the tires and check the first pilot apps thoroughly. Then, with the right mindset, a lot of that can be automated and sped up.
The cloud looks great for rapid scaling changes of fairly generic applications (ones that run happily as a VM or can move easily to different platforms). Applications that are tightly bound to hardware, now those may be a problem. One-off obscure legacy applications that some project team built using carefully chosen hardware 10 or more years ago don’t seem like good candidates for the cloud.
Since I’m doing a lot of medical center work lately, that last point resonates. Guess what sort of apps are all too frequent in that space. Along with FDA-approved applications that must run on certified hardware as-is. Not to mention the patient data and the challenge that might represent for cloud security. The opportunity: most medical sites seem to have a bunch of tiny datacenters with lots of inherent fixed-cost inefficiency. Along the lines of the recent publicity about consolidation of government datacenters, it would seem that hospitals could benefit from pulling their datacenters together into two bigger ones, and the economies would be even better if they could pool with other hospitals to share operating and licensing costs. Some of the consolidation I’m seeing in the healthcare industry seems to be about the latter (electronic patient record systems and networking being too costly for small to medium size hospitals to afford on their own). I have yet to really see shared datacenter or similar initiatives. For that matter, I’d also expect the networks to need to become more robust and provide higher availability levels as healthcare becomes more critically dependent on electronic patient records — but that’s a subject for another blog.