Here we are, it’s Summer, it must be time for another rant (ahem, “carefully reasoned polemic”) about Network Management (NM). What should NM tools do for us? How might they help? Let’s explore that a little…
Have you ever had to sit through a boring network management product training? If so, that may be a sign of poor training development, or it might indicate something deeper.
When you’re sitting in training, what do you care about? You’re probably thinking, “How do I use this tool, and how does it help me?” Yet a lot of NM training is oriented around the tool itself: installation, administration, and driving its individual components.
Symptom: “Show me the value! Need more coffee so I can stay awake.”
That’s a problem. The vendor hasn’t actually communicated with users of the product, watched their workflows, etc., especially the picky / critical / insightful users. (Those are the adjectives I’d like vendors applying to me, not the four-letter words they may actually be using.)
I had the bright idea of doing use-case oriented Ops training for one site, in one hour “lunch and learn” nuggets. Then, I found it hard to come up with ways the NM products at that company actually enhanced troubleshooting.
That’s not a good sign! And it ties back to the thought that vendors maybe need to do a better job of talking to their customers and their use cases. For that matter, talking to non-customers might be VERY educational (hearing why they think your product is pathetic will certainly indicate what needs fixing — assuming the non-customer is clueful).
All this might also explain why most networking people that I’ve watched go straight into SSH / show command work when troubleshooting.
So What Good are NM Tools?
Detecting outages should be easy, given competent / complete network discovery (and preferably, mapping). Tools have been doing red icons or lines in a log for decades now.
Brown-outs (service slow-downs) are more the problem these days.
Map or path colorization is useful but can be one-dimensional. You can use colors for up / down status or for various performance measures (utilization, error, or discard percentages). But only one of those at a time? So maybe our map needs to let us shift between colorization schemes? Or display multiple link / router metrics at once somehow?
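To make the idea concrete, here is a minimal sketch of switchable colorization schemes. The metric names, thresholds, and sample link data are illustrative assumptions, not taken from any particular NM product:

```python
# Hypothetical warning / critical thresholds per colorization scheme
# (percentages); real products would make these configurable.
THRESHOLDS = {
    "utilization": (70.0, 90.0),
    "errors": (0.1, 1.0),
    "discards": (0.5, 2.0),
}

def link_color(metrics: dict, scheme: str) -> str:
    """Return a map color for one link under the chosen scheme."""
    if scheme == "status":
        return "green" if metrics.get("up", False) else "red"
    warn, crit = THRESHOLDS[scheme]
    value = metrics.get(scheme, 0.0)
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"

# Made-up sample link: up, but running hot on utilization.
link = {"up": True, "utilization": 85.0, "errors": 0.05, "discards": 0.0}
print(link_color(link, "status"))       # green
print(link_color(link, "utilization"))  # yellow
print(link_color(link, "errors"))       # green
```

The point being: the same link looks fine under one scheme and suspicious under another, which is why a single colorization per map loses information.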
This is still coming at it from the “got info” rather than the “help me solve this problem” aspect.
Suppose you have nothing to work with: the trouble ticket just says “App X is slow,” and you know nothing about the application. That’s often the case. You’re then in the awkward position of having to establish that a lot of the network is innocent, as in, not the cause of the problem. We’ve all been there: MTTI (Mean Time to Innocence), after which other teams finally start looking at their piece of the puzzle. It’s far better if everyone checks that their piece is innocent rather than assuming it’s the network.
So “scanning for possible problems” is where maps potentially help us absorb a lot of information quickly (threshold / fault logs less so).
As a result, however, we may end up checking out every glitch that shows up in our network management tool. Not very efficient!
Show Me the Way
If we have application endpoints (CEO to Internet, or user to application front end), we can then do the path trace / path display thing. If we get colors or other indicators of potential problems with easy drill-down, now maybe we’re troubleshooting faster. We’re certainly not writing down the hops in a traceroute in each direction and poking around one device at a time. I’ve seen folks doing both of those, all too often. Slow!
In the last couple of years, some vendors (NetBrain, Riverbed, then SolarWinds) came up with tools that show paths (traceroute, to / from when asymmetric, other wrinkles). This approach is what I call “Google Maps for network management”: colorizing or showing info along the path. Paths are good; they’re easier to code and display flexibly than whole chunks of network.
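The per-hop annotation such a path view provides can be sketched with a few lines of code. Here the hop names and cumulative RTTs are invented sample data standing in for what a traceroute-style probe would return:

```python
def worst_hop(path):
    """path: list of (hop_name, cumulative_rtt_ms) tuples from a
    traceroute-style probe. Returns the hop with the largest per-hop
    latency increase, i.e., where to start drilling down."""
    worst, worst_delta = None, 0.0
    prev = 0.0
    for name, rtt in path:
        delta = rtt - prev
        if delta > worst_delta:
            worst, worst_delta = name, delta
        prev = rtt
    return worst, worst_delta

# Invented path: latency jumps sharply at the WAN hop.
path = [("edge-rtr", 1.2), ("core-1", 2.0), ("wan-1", 42.5), ("dc-edge", 44.0)]
hop, delta = worst_hop(path)
print(f"largest jump at {hop}: +{delta:.1f} ms")  # largest jump at wan-1: +40.5 ms
```

That one computation, drawn on a path map with color, is exactly what saves us from writing down traceroute hops by hand and poking at devices one at a time.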
Maybe the tool could even do that for different metrics, as suggested above.
The Network Field Day presentations by SolarWinds around their path capabilities were interesting, and to some extent made me say, “whoa, it’s not that simple.” Ok, but still, vendors could do something with paths, then do it better.
Licensing and server capacity should be non-issues at this point, for well-written products on up to fairly large networks. StatSeeker and some other Australian products demonstrate great polling and storage speed.
Taking that to the next step, perhaps network neighborhood maps would be good. As in, I think the problem is around this part of the datacenter, or around the CEO… Colorization of maps has done that for a while, but primarily for outages.
Let It Flow
The problem with all this is, it lacks specific enough information. Even if you’ve got fairly good trouble ticket information that “Application X is slow”, the network path approach needs some network endpoints to work with. This is a case of network people or tools taking a network focus… sort of like the joke about “I lost my watch over there, but the light is better over here.”
Seriously, why are we stuck with tools that talk only to the network devices, and not anything else? One reason might be having control and ease of getting access. But if you step back and think about it, that assumption is rather limiting.
When I started writing this blog, I thought I wanted application flow tool (AppDynamics, etc.) integration with classic SNMP tools, all on a path diagram or map. However, that may not be the right answer!
My Crystal Ball Is Cloudy
For Cloud, we’re not going to have SNMP data about interfaces and devices, unless perhaps we’re using virtual devices. The best we can hope for is actual application or agent data telling us about service delays, one-way or round-trip time, latency, jitter, and packet loss. Which, incidentally, are the things good SD-WAN products should be tracking based on the application traffic flowing through them.
We probably need a third-party tool to do that, because the app developers may not consistently instrument their application, either internally or using the cloud provider’s tools for tracking application / service health. And you don’t want to have to write your own code to pull info from 100 different app APIs, something commercial vendors are unlikely to support. It all comes down to who you want to pay for information about what’s going on.
What I’ve seen so far in the app flow information business seems very service-oriented, but not strongly knowledgeable about where the services run or what platform they run on. As network people, we need to know where the various services are located. Will the tools evolve to help us with location info? Or does our naming convention or documentation (which is either missing or out of date) somehow have to do that? In other words, more manual tracking instead of automation?
Even if your organization isn’t that heavily in the cloud yet, we network people need service location information anyway. I’ve worked with a few organizations collecting that information by talking to people with institutional knowledge (for lack of other consistently useful tools).
Doing that takes quite a bit of time. I know of one instance where a major travel website was having issues for three weeks while they tracked down a former employee through several subsequent employers.
Conclusion: Document now or be prepared for whoever has to troubleshoot it later to experience pain and possibly major delays just gathering app flow / server information.
So yes, the tools are costly, but so are application discovery by consultant and downtime. The right tool may have the answers you need when something goes awry, whereas hunting things down manually (NetFlow, etc.) takes time.
Sample Use Case
Imagine that someone moves a service from Cloud A to Cloud B or back to private cloud, increasing latency. That slows the response time of the service and of every service that depends on it, so cloud auto-scaling kicks in to load-balance across more micro-service instances. The invoice at the end of the month shocks management and breaks the bank (or maxes out the credit card, causing the cloud provider to halt all server instances).
How long does it take for someone to notice that latency went up, the load on the relocated service went up as well, and then figure out it’s a service location problem, not an underperforming instance?
How do we troubleshoot something like that, let alone automatically, before high cost ensues?
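One plausible starting point is comparing current service latency against a rolling baseline, so the jump gets flagged before the auto-scaling bill does. The threshold factor and sample values below are illustrative assumptions:

```python
def latency_alarm(baseline_ms, recent_ms, factor=2.0):
    """Alarm if recent average latency exceeds `factor` times the
    baseline average. Returns (alarm, baseline_avg, recent_avg)."""
    base = sum(baseline_ms) / len(baseline_ms)
    recent = sum(recent_ms) / len(recent_ms)
    return recent > factor * base, base, recent

# Invented data: before the move the service sat close by; afterwards
# cross-cloud hops push latency far past 2x the baseline.
alarm, base, recent = latency_alarm([5.0, 6.0, 5.5], [48.0, 52.0, 50.0])
print(alarm)  # True
```

Even a crude check like this answers the “how long until someone notices?” question with “minutes, not weeks,” provided something is actually measuring per-service latency in the first place.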
If you accept the above, then I have some tentative conclusions:
- Network people could start looking at, testing, learning, and using tools that can manage cloud-based apps. You may not get the right one the first time, but you’ll be figuring out what you need.
- App flow type tools may need to evolve (or I haven’t found the right ones yet).
- Network people need to think a bit more like application folks, and vice versa. And perhaps the right tools can help lower the barriers to doing so.
- Broadly cross-functional teams, or at least cross-team contacts, perhaps?
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
Hashtags: #CiscoChampion #NetworkManagement
Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at firstname.lastname@example.org.