An article about the role of a systems engineer hit home the other day. Its key points: the systems engineer has to understand how disparate technology pieces fit together, balance which components perform which functions, trade off costs and other factors, and consider the system’s risks — possible failures, the severity of their impact, and how to mitigate them.
This resonated for me. Maybe it’s just that I think holistically. Sometimes that means too broadly. Other times, it means spotting things other people have missed. I recently presented on Internet of Things (IOT), which necessitated lots of IOT reading. That’s a scarily broad topic, with room for various sets of deep skills — or skills that cut across different IOT technology/solution clusters.
It is clear that IOT needs people with skills in multiple specialties to do cross-specialty planning. But the same applies to what I see of digital transformation, and cloud/DevOps strategies. And even network design. Stovepipes have to go — everyone needs to be an expert at a few things, but also cognizant of other areas.
Network Janitors
One implication of this: For too long networking people have been the quiet people in the background, doing what we’re told. I call this “Network Janitor” syndrome, as in “clean up the mess on aisle three.” After the app is developed with unrealistic or strange networking (or security) assumptions, the network people hear about it at the last minute and are told to deploy it so some manager hits his or her deadline. Result: Network kludge, unhappy network ops ever after, and perhaps some not-so-happy app folks. We’ve got to move beyond that — and become involved MUCH earlier. Which requires management buy-in above the individual stovepipe level.
It also means we can’t ignore things that will be on the network. For some hospitals, all that ugly clinical stuff is someone else’s problem. For some utility companies, all that Smart Grid and IOT is going to be a separate network. Future reality: It’ll be necessary to have one network with appropriate segmentation and security baked in, and that brings added cost and complexity.
Where all this really sunk in, however, was in some recent troubleshooting. Yes, I’ve been doing that a lot lately.
Troubleshooting a Medical System
The setting: hospitals and medical imaging. The medical imaging system at hospital X was slow (for a couple of values of X). We were a small team, and our job was to figure out why the system was slow.
The first time I did this sort of engagement, the symptom was that pulling up medical images was rather slow. Doctors’ time was being wasted, which makes doctors cranky. Must be the network’s fault!
We explored some. Solid info was scarce, so we ended up back at Application/Network Troubleshooting 101: identify the traffic flows, make sure all the servers are identified, check links and routers for issues, measure performance, and see if we could catch degraded performance in the act. As usual, SolarWinds coverage was very spotty, since interfaces were managed manually to conserve licenses.
All the imaging servers were on one switch in the datacenter. We put a medical imaging client on the switch the imaging system servers were on, and it was just as slow. Convincing evidence, and also handy for packet captures. It was Not The Network! MTTI (Mean Time To Innocence) was quick. To be safe, I double-checked everything I could about the network and found nothing visibly wrong. Only 1 Gbps links, but sufficient, at least for the present load and imaging system configuration.
With some Wireshark effort, we identified and confirmed the key servers, flows, and flow sizes.
We did the math, and it turned out the actual average server MB-per-second performance accounted for all the observed delay. Recommendation: Faster disks, perhaps SSD. (Other details and recommendations omitted as not relevant here.)
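To make that concrete, here’s a minimal sketch of the arithmetic (the numbers are illustrative placeholders, not the actual measurements from that engagement):

```python
# Back-of-envelope check: does measured server/disk throughput explain the delay?
# All numbers are illustrative placeholders, not the actual site measurements.

study_size_gb = 10          # typical study size, GB
server_mb_per_sec = 60      # measured average server read throughput, MB/s (assumed)
observed_delay_sec = 170    # roughly what users were waiting (assumed)

expected_sec = (study_size_gb * 1000) / server_mb_per_sec
print(f"Expected time at {server_mb_per_sec} MB/s: {expected_sec:.0f} s")
print(f"Observed delay: {observed_delay_sec} s")
# If expected roughly equals observed, server/disk throughput accounts for the delay,
# and faster disks (e.g. SSD) are the right fix -- not the network.
```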
The second time around, the problem was more that the imaging system overall was really slow, and the relevant servers were pretty much known. The challenge was not only to solve the immediate problem, but also to identify other components needing upgrade, such that mitigating one bottleneck didn’t immediately run into another limiting factor.
My simplified mental model of the system is that medical devices do various scans and send the data to one of several servers, which process, carefully compress, and store the data via database front-end server(s). When doctors retrieve a “study,” they’re in effect pulling files/image data out of a back-end database via those same servers and displaying it locally on dedicated reader systems. All that goes through the datacenter network, extending out to the image scanners and to the viewing “reader” stations.
Local sources told me the image studies for tomography were in the 8–10 GB range. I used 10 GB to do some simple calculations of the time it takes to transfer that amount of data across the network at various network speeds, and similarly for disk I/O.
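Here’s a rough sketch of those calculations, using the 10 GB study size and some round-number link and disk speeds (nominal figures for illustration, not measurements from the site):

```python
# Time to move one 10 GB study at various nominal network and disk speeds.
# Round numbers for illustration; real throughput will be lower due to overhead.

study_bytes = 10 * 10**9  # 10 GB

for gbps in (1, 10, 40):
    seconds = study_bytes * 8 / (gbps * 10**9)
    print(f"{gbps:>2} Gbps network link: {seconds:6.1f} s per study")

for mb_per_sec in (100, 500, 2000):   # rough spinning disk / SATA SSD / NVMe rates
    seconds = study_bytes / (mb_per_sec * 10**6)
    print(f"{mb_per_sec:>4} MB/s disk     : {seconds:6.1f} s per study")
```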
Exercise for the reader: Figure out file transfer times based on 10 GB. Would you recommend 1, 10, or 40 Gbps networking between the relevant locations?
I’ll spare you some of the intermediate details. The main conclusions were:
- An F5 SLB in front of the servers was a bottleneck due to low licensed throughput, about 200 Mbps
- The images were crossing multiple routed hops on switches with all 1 Gbps ports, so the three servers were sharing a 1 Gbps uplink to the F5 and to the image sources
- The server disk drives used for temp space were not all that fast, and the processing servers were a bit low on RAM
- Ditto for the disk drives used to hold the “active” images, which also were very low on free space for recent images (as in, last couple of years’ worth). Overall, 2 TB of storage when tens of terabytes might have been appropriate.
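A quick comparison of the limits above shows why it matters to look at the whole chain: the end-to-end time for a study is set by the slowest stage, so fixing only one component just moves the bottleneck. A rough sketch (approximate numbers):

```python
# Rough per-stage transfer times for one 10 GB study, per the bottlenecks above.
# The slowest stage dominates; upgrading only one of them just moves the bottleneck.

study_bits = 10 * 10**9 * 8  # 10 GB in bits

stages_mbps = {
    "F5 SLB (licensed ~200 Mbps)":           200,
    "Shared 1 Gbps uplink (3 busy servers)": 1000 / 3,
    "Dedicated 1 Gbps port":                 1000,
}

for name, mbps in stages_mbps.items():
    print(f"{name:38s}: {study_bits / (mbps * 10**6):6.0f} s per study")
# The F5 alone imposes roughly 400 s (almost 7 minutes) per 10 GB study;
# disk or RAM upgrades won't show any benefit until that limit is raised.
```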
Let’s get back to the systems engineering topic now.
If you think about this from the systems engineering perspective, medical imaging is a system. It involves:
- Image scanners
- Image processing software
- Servers with RAM and disks
- Back-end database with storage
- Networking to tie them all together
What I suspect: Doctors understand that image scanners are nifty medical tools, highly useful for various medical tasks and also a driver for revenue. Doctors are also a strong presence on the hospital Board. So, the hospital was aggressively buying image scanners. However, budget controls prevented network, server, and storage upgrades, and the hospital in question may have tuned out advice from the vendor. Perhaps there was no one to build a strong business case for such upgrades. The end result was lopsided funding for the input side of the system, and underfunding of the rest of the overall system.
If you approach this one way, all the back-end medical imaging support is a cost item. Viewed differently, it’s an opportunity to make the system more productive: Spend doctors’ time more effectively, get surgery/procedures scheduled and completed faster, move more patients through the system, possibly increase revenue. There might be legal/liability issues involved too.
My experience suggests this sort of problem is common in many hospitals.
Generalizing
This has been popping up in other contexts as well. It helps if the server/storage team talks to the network team and vice versa, since the reported performance problem may lie on either side.
What our consultants have been seeing overall is that you have to pay attention to all of this. Luckily, few business apps outside civil and other engineering work with 10 GB of data at a time.
The worst case I’ve heard of was the company doing geologic sub-layer mapping for oil drilling, where an engineer might open up something like 300 files, each 10 GB in size. “It’s a network problem” turned out to be not enough Windows resources allocated on the server to handle N users x 300 files each.
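The back-of-envelope arithmetic there is sobering. A rough sketch (the 300 files x 10 GB comes from the anecdote; the concurrent user count is a placeholder):

```python
# Aggregate data in play when engineers open many large files at once.
# 300 files x 10 GB is from the anecdote; the user count is a placeholder.

files_per_engineer = 300
file_size_gb = 10
concurrent_engineers = 5   # hypothetical N

per_engineer_tb = files_per_engineer * file_size_gb / 1000
total_tb = per_engineer_tb * concurrent_engineers
print(f"Per engineer: {per_engineer_tb:.0f} TB open at once")
print(f"{concurrent_engineers} engineers: {total_tb:.0f} TB of data the server must juggle")
# At that scale, server memory, handles, and storage throughput give out
# long before a healthy network does.
```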
Conclusion
When it comes to business applications, networks are part of a system, as are servers and storage. Good design balances performance and costs of those components.
The caution and IOT tie-in is that as the number of devices and/or data volumes grow, disk I/O can be significant. It already is, in terms of VM and server performance; the symptoms will just become more conspicuous. Network speeds will also matter, for data transfer times. E.g., a medical facility acquiring 10 TB of data per week!
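For perspective, 10 TB per week is a fairly modest sustained rate on the wire, but a lot of storage to absorb and later serve back. A quick sanity check, assuming the data arrives evenly over the week:

```python
# Average network rate implied by 10 TB/week of acquired data,
# assuming it arrives evenly over the week (a simplifying assumption).

tb_per_week = 10
seconds_per_week = 7 * 24 * 3600

avg_mbps = tb_per_week * 10**12 * 8 / seconds_per_week / 10**6
print(f"Average ingest rate: {avg_mbps:.0f} Mbps")
# Roughly 130 Mbps average: easy for the network, but the storage has to absorb
# (and later serve back) another 10 TB every single week.
```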
Exercise for the reader: Analyze an important application in your network. What’s the key component or factor currently limiting performance? How could you make that component faster? What then becomes the new bottleneck?
References
The following blogs are somewhat related and make some good points in a DevOps context:
- http://www.networkworld.com/article/3197393/it-skills-training/why-you-need-an-enterprise-architect.html
- http://www.cio.com/article/3192531/careers-staffing/why-you-need-a-systems-reliability-engineer.html
Comments
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
—————-
Twitter: @pjwelcher