The SolarWinds Networking Field Day 13 presentation focused on its new NetPath tool, a supplement to its Network Performance Monitor (NPM) product. If “improved multi-path aware TCP-based traceroute on steroids” grabs your attention, read on!
SolarWinds produces relatively inexpensive network management tools. If you haven’t figured it out from my prior blogs, I’m a picky consumer of network management products, and I am accustomed to customers often failing to put in the maintenance the products need (it gets lower priority than other tasks). I’ve recently been fairly hands-on with several network management tools, including SolarWinds Orion. I just confirmed what many people will tell you: “SolarWinds is better than many of their competitors’ far costlier tools.” Their target appears to be solid data meeting customer needs while keeping their and your costs down. With any tool, I do recommend you buy a full set of licenses and monitor all active interfaces, but that’s a rant for a different blog. SolarWinds’ pricing model (basically, licensed node = interface, not device) makes that a bit more challenging than with most other network management tools (or raises its effective price a bit, depending on your point of view).
Anyway, I wasn’t sure what to expect from the #NFD13 presentation. More refinements? Well, the entire session focused on the new NetPath tool and kept the NFD13 delegates’ interest!
It turns out, I was very impressed with the new NetPath tool!
The fascinating part (for me, anyway) was the learning process that the SolarWinds engineers went through in developing the tool. That includes how they had to overcome various hurdles, among them reducing false positives and identifying the most important problems along a path, filtering out some of the “noise.”
The first problem SolarWinds had to solve was finding the right path, taking TCP and TCP port behavior into account. If you’re thinking traceroute, well, classic traceroute sends UDP probes by default (ICMP on Windows), and UDP is not what most internet applications use. Media (voice, video), yes, UDP. Anyway, probing with TCP on the application’s port means the probes should be treated like the application’s own traffic, which may help avoid any issues with policy-based routing of TCP in the network.
Traceroute and SNMP MIB-II routing table data both can be misleading. That’s because neither handles ECMP routing well — the second challenge in the path space.
For technical reasons, the SNMP MIB-II routing table can only supply one entry per prefix. Vendors that rely on that for displaying paths are displaying only one of possibly many paths — and one that might differ from the path your traffic actually takes. Traceroute sort of shows when there are two paths, but piecing together the ECMP information into paths is not necessarily straightforward. Finding many/most alternative paths when each hop has multiple next hops is not simple. So SolarWinds had to solve the problem of stitching together the TTL-exceeded and other information to learn about paths. You have to know the paths before you can figure out where the problems are along them.
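To make the stitching idea concrete, here is a minimal sketch in Python. It is not SolarWinds’ algorithm (the #NFD13 videos cover what they actually did). It simply assumes you already have per-flow hop lists, for example from repeated traces that vary the source port, and merges them into a next-hop graph. The function names and sample addresses are hypothetical.

```python
from collections import defaultdict


def build_hop_graph(traces):
    """Merge per-flow hop lists into a next-hop graph.

    Each trace is a list of hop labels ordered by TTL, with the source at
    index 0. A hop that never answered is given as None and becomes a
    placeholder such as "?2" (TTL 2). Note this lumps all silent routers
    at the same TTL into one node, which is a simplification.
    """
    graph = defaultdict(set)
    for hops in traces:
        labeled = [h if h is not None else f"?{ttl}"
                   for ttl, h in enumerate(hops)]
        for here, nxt in zip(labeled, labeled[1:]):
            graph[here].add(nxt)
    return graph


def enumerate_paths(graph, src, dst, limit=100):
    """List loop-free paths from src to dst, stopping after `limit` paths."""
    paths = []
    stack = [(src, [src])]
    while stack and len(paths) < limit:
        node, path = stack.pop()
        if node == dst:
            paths.append(path)
            continue
        # Reverse-sorted push so paths come out in a stable, sorted order.
        for nxt in sorted(graph.get(node, ()), reverse=True):
            if nxt not in path:  # skip anything that would form a loop
                stack.append((nxt, path + [nxt]))
    return paths


if __name__ == "__main__":
    # Hypothetical hop lists from three probe flows to the same target.
    traces = [
        ["src", "10.0.0.1", "10.0.1.1", "dst"],
        ["src", "10.0.0.1", "10.0.2.1", "dst"],
        ["src", "10.0.0.2", "10.0.2.1", "dst"],
    ]
    for path in enumerate_paths(build_hop_graph(traces), "src", "dst"):
        print(" -> ".join(path))
```

One caveat with this approach: walking the merged graph can produce hop combinations that no single probe actually observed, so the result is a set of candidate paths, not proof that traffic takes each one. The `limit` parameter is there because the number of combinations grows quickly when several hops in a row each have multiple next hops.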
The third challenge (or group of challenges) was spotting performance problems and weeding out the noise.
Along the way, the development team also had to solve displaying the data in a user-friendly way.
If that interests you, watch the #NFD13 videos for the details.
The value of NetPath
I’ve become a big fan of path-oriented products. It comes down to troubleshooting: I know something between A and B is causing poor performance (a speed or quality “brownout”); how do I rapidly find the problem(s)? I’ve seen people writing down traceroute hops in each direction, then SSHing into the routers (please tell me you’re not still using telnet), doing show commands … it can take hours just to get a single snapshot! If your tool can’t easily provide the path and the performance data, maybe you need another tool, one that’s actually useful? Ditto, historical data along path(s), so you can spot intermittent problems.
That’s what NetPath does: It lets you see paths from virtual agents or other endpoints across your network to internal servers or to cloud-based SaaS or other servers.
You can also see if the path changes (and when), if there are problems (latency, packet loss), or where device configurations have recently changed. (That last item assumes you also have the NCM configuration management component installed.)
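As a rough illustration of the “did the path change, and when” question, here is a small Python sketch. It assumes you have stored timestamped hop lists from periodic probes. The data layout and function names are my own invention, not anything from NetPath.

```python
def find_path_changes(snapshots):
    """Return (timestamp, old_path, new_path) for every change.

    snapshots: iterable of (timestamp, hops) pairs in time order, where
    hops is a sequence of hop labels.
    """
    changes = []
    previous = None
    for timestamp, hops in snapshots:
        current = tuple(hops)
        if previous is not None and current != previous:
            changes.append((timestamp, previous, current))
        previous = current
    return changes


def changed_hops(old_path, new_path):
    """Hops that left the path and hops that joined it, as sorted lists."""
    old, new = set(old_path), set(new_path)
    return sorted(old - new), sorted(new - old)


if __name__ == "__main__":
    # Hypothetical probe history for one source/destination pair.
    history = [
        ("09:00", ["src", "r1", "r2", "dst"]),
        ("09:10", ["src", "r1", "r2", "dst"]),
        ("09:20", ["src", "r1", "r3", "dst"]),
    ]
    for when, old, new in find_path_changes(history):
        left, joined = changed_hops(old, new)
        print(when, "left:", left, "joined:", joined)
```

Lining up those change timestamps against latency, loss, and configuration-change records is where the troubleshooting value comes from.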
As soon as you include the internet/SaaS, any path performance data must be based on probe packets, since SNMP data is not going to be available from ISP or WAN provider routers.
It is (or should be) well-known that ping does not provide good packet loss or latency data, since processing ping is not a priority in Cisco devices or most other devices. (Think busy person, putting off responding to an email or phone survey or salesperson calling….)
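For a feel of what a TCP-based probe looks like, here is a minimal Python sketch that times the TCP handshake to a service port instead of pinging. This is my own simplification, not how NetPath works internally: it measures end-to-end only (no per-hop data), and the target host and port are placeholders.

```python
import socket
import statistics
import time


def tcp_connect_rtt(host, port, timeout=2.0):
    """Time one TCP handshake in seconds; None if it fails or times out."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:  # covers refused connections, timeouts, DNS failures
        return None


def summarize(samples):
    """Failed-probe fraction and median latency (ms) over probe results."""
    answered = [s for s in samples if s is not None]
    loss = 1 - len(answered) / len(samples) if samples else 0.0
    median_ms = statistics.median(answered) * 1000 if answered else None
    return {"loss": loss, "median_ms": median_ms}


if __name__ == "__main__":
    # example.com:443 is a placeholder target, not something from the talk.
    samples = [tcp_connect_rtt("example.com", 443) for _ in range(5)]
    print(summarize(samples))
```

Bear in mind that the “loss” figure here counts failed handshakes, not individual lost packets: a dropped SYN that gets retransmitted shows up as a slow sample, not a lost one. The handshake is answered by the server’s TCP stack on the real service port, so it is treated like application traffic, which is the point of probing this way.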
My understanding is that the probing for each NetPath path consumes one node license, basically for the polling.
As far as I know, SolarWinds does not cluster servers for larger networks. You instead can throw multiple servers at a large network and assign portions of your network to each. Something to bear in mind.
I also suspect SolarWinds is more focused on basic functionality than on performance/scaling optimization. That makes sense given their customer base, and also given their licensing model: per-node/per-interface/per-polling item licensing tends to encourage focused polling. Other products with more efficient polling and database I/O don’t need to limit their users’ resource consumption as much, but then they don’t do useful things with the data gathered either.
I do get the impression that SolarWinds products attempt to provide good responsiveness by limiting the magnitude of what you can ask the system to do. Having said that, that does help eliminate clutter and information overload.
The NetPath focus seems to be at the “flow” level. To track performance for a multi-tier application, one might need various virtual agents polling the various flows. That suggests that NetPath is probably best used for front-end SaaS responsiveness monitoring, although virtual agents in the cloud might track back-end flows where your company controls the cloud-based application.
I wonder how well path tracking would work for, say, a company with 500 “critical” apps. (App owners often get their egos wrapped up in being “important” or “mission critical.” I’ve learned to use the term “fragile” when talking QoS; guys tend not to want inclusion in the fragile class unless their app needs it.) Setting up the polling for 500 apps might be rather — extremely — time consuming. Also, one might want grouping of polling results by application or service.
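To show what I mean by grouping results by application, here is a small Python sketch. It is purely illustrative: the record layout, the thresholds, and the names are all made up, and nothing here reflects a NetPath feature or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class PathResult:
    app: str            # application or service the path belongs to
    path_name: str      # hypothetical label such as "hq->crm"
    latency_ms: float
    loss_pct: float


def rollup_by_app(results, latency_limit_ms=150.0, loss_limit_pct=1.0):
    """Group per-path results by application and flag degraded paths.

    The two limits are arbitrary example thresholds, not recommendations.
    """
    by_app = defaultdict(list)
    for r in results:
        by_app[r.app].append(r)

    summary = {}
    for app, rows in by_app.items():
        # "Worst" means highest loss, with latency as the tie-breaker.
        worst = max(rows, key=lambda r: (r.loss_pct, r.latency_ms))
        degraded = [r.path_name for r in rows
                    if r.latency_ms > latency_limit_ms
                    or r.loss_pct > loss_limit_pct]
        summary[app] = {
            "paths": len(rows),
            "worst_path": worst.path_name,
            "degraded": degraded,
        }
    return summary


if __name__ == "__main__":
    results = [
        PathResult("crm", "hq->crm", 40.0, 0.0),
        PathResult("crm", "branch->crm", 220.0, 0.5),
        PathResult("mail", "hq->mail", 30.0, 0.0),
    ]
    for app, info in rollup_by_app(results).items():
        print(app, info)
```

With 500 apps, a rollup like this is what you would want to look at first, drilling into individual paths only for the applications that show something degraded.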
As a side note, monitoring cloud-based app performance has potential challenges if the DevOps team is much heavier on the dev than the ops side — i.e., testing/monitoring performance gets skimped on, or if stove-piping hinders cross-team visibility (network team visibility into inter-server instance or inter-container flows and their performance). That’s a topic for another blog, however.
Other vendors with some of the same or similar capabilities: Riverbed, NetBrain, ThousandEyes, AppNeta, and NetBeez come to mind. The latter two are somewhat more oriented toward tracking historical end-to-end performance with some pathing data, e.g. when the path changes. The first three provide more details about the path hops.
I’ll note that Riverbed allows you to use NetFlow data to stitch together flows into “service policies.” The point is to (a) alert you when a key app is having performance problems, and (b) allow you to drill down to the component flows and paths to see what might be the cause. That might be a future evolutionary step for the SolarWinds NetPath capability.
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!