AIOps: Using AI for Smarter, More Efficient Operations

Since Artificial Intelligence (AI) has been a hot topic for a while, I should be blogging about it. Specifically, AI’s use in Ops, aka “AIOps.” Hence, this blog.

What Can AIOps Do For Me?

“WIIFM” (“What’s In It For Me?”) is always a good question, sometimes posed as “Why Do I Care?”

OK, this is a big topic. The following are the parts of AIOps that I think are the most relevant or interesting, and the most likely to have real potential.

Standard initial AIOps sales pitch (but still true):

Modern networking (security, etc.) can provide massive logging and telemetry data streams from many platforms across the network, including security, WiFi, server, database, and cloud technology.

AI has the potential to help with the following tasks surrounding that mass of data.

How AIOps might do that:

Filter out noise and bring significant events to our attention.
Conduct prior dependency analysis based on network topology for use in correlating events (e.g., “this device failing causes that application to become unreachable”).
Identify time-correlated events across tech domains so we’re aware of all that was going on at the time of an outage, and identify the possible cause for a cascade of alerts.
Monitor telemetry data and look for statistical outliers, ideally by time of day/week/month/year rather than just across all data.
Identify components with statistically limited lifetimes or high failure rates and alert operators about likely pending failures (this would probably need to be based on vendor-scale data rather than site-scale data).
Correlate problem symptoms with actions that successfully resolved them, and provide this information when similar symptoms occur.
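
To make the dependency-analysis item a bit more concrete, here is a minimal Python sketch of one way a topology map could be used to pick likely root causes out of a cascade of alerts. Everything in it is an assumption for illustration: the topology, the node names, and the function names are made up, and real products presumably do something far more elaborate.

```python
from typing import Dict, List, Set

# Hypothetical topology: each node maps to the nodes it directly depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "app-web": ["lb-1"],
    "lb-1": ["switch-a"],
    "db-1": ["switch-a"],
    "switch-a": ["core-1"],
    "core-1": [],
}


def upstream(node: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every node that `node` depends on, directly or transitively."""
    seen: Set[str] = set()
    stack = list(deps.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(deps.get(current, []))
    return seen


def probable_root_causes(alerting: Set[str], deps: Dict[str, List[str]]) -> Set[str]:
    """Keep only alerting nodes that have no alerting upstream dependency."""
    return {node for node in alerting if not (upstream(node, deps) & alerting)}


# Four alerts arrive, but three of them sit downstream of switch-a.
cascade = {"app-web", "lb-1", "db-1", "switch-a"}
roots = probable_root_causes(cascade, DEPENDS_ON)
```

With this made-up topology, roots contains only "switch-a": the other three alerting nodes all depend on it. The point of the design is that the dependency map is computed ahead of time, so the filtering at alert time is cheap.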

Other possibilities, including some that seem more dubious to me:

LLMs and natural language queries, reporting, etc. – ancillary role but potentially useful.
LLMs in general: meh. The unreliable results and unpredictability seem to me to doom LLMs for anything where wrong answers could be a problem. Queries and reporting may provide enough value where the consumer can “sanity filter” results (maybe) for LLMs to be useful there? Example link showing I’m not an isolated aging curmudgeon stuck in the past: https://cacm.acm.org/blogs/blog-cacm/275660-face-it-self-driving-cars-still-havent-earned-their-stripes/fulltext.
Building configs – maybe missing nuances, tendency to “look good” (BS) or “hallucinate” – better uses are where it is fairly clear that AI did what you intended correctly and completely. (Or completely enough, you can easily fill in what’s missing?)
Identify device configurations that differ from those of other similar devices or device roles.
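
The last item, spotting configurations that differ from those of peer devices, does not need anything exotic. A rough Python sketch follows, assuming each device's config has already been reduced to a set of normalized lines; the device names, config lines, and the simple majority rule are all hypothetical choices for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def config_outliers(
    configs: Dict[str, Set[str]], majority: float = 0.5
) -> Dict[str, Dict[str, List[str]]]:
    """Report, per device, lines most peers have that it lacks, and vice versa."""
    counts = Counter(line for lines in configs.values() for line in lines)
    total = len(configs)
    # A line is "common" if more than `majority` of the peer devices carry it.
    common = {line for line, seen in counts.items() if seen / total > majority}
    report: Dict[str, Dict[str, List[str]]] = {}
    for device, lines in configs.items():
        missing = sorted(common - lines)  # peers have it, this device does not
        extra = sorted(lines - common)    # this device has it, most peers do not
        if missing or extra:
            report[device] = {"missing": missing, "extra": extra}
    return report


# Hypothetical access switches that share a role.
peers = {
    "acc-sw-1": {"ntp server 10.0.0.1", "logging host 10.0.0.5", "snmp-server community ops ro"},
    "acc-sw-2": {"ntp server 10.0.0.1", "logging host 10.0.0.5", "snmp-server community ops ro"},
    "acc-sw-3": {"ntp server 10.0.0.1", "logging host 10.0.0.9"},
}
report = config_outliers(peers)
```

Here only acc-sw-3 shows up in the report: it is missing the logging host and SNMP lines its peers share, and it has a logging host line that nobody else has. Note that this is plain set arithmetic, which fits the theme that the checkable, statistics-style uses are the ones that are easy to trust.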

So What?

Well, knowing the types of things that a product might do can be helpful in reviewing products. There will be a lot of “AI washing” (claims of AI to add “shiny new” glitter to products). So, digging down into concrete use cases, capabilities, and limitations will be essential in selecting between products and choosing where your spending will produce the most value.

Why This Blog?

I’ve long been interested in AI, and dabbled in it (Prolog, LISP programming, reading) during the 1990-ish boom. Now we can do a lot more, a lot faster: huge LLMs, for instance. But some ground truths remain.

The parts of AI that I trust are more the correlation and “advanced statistics” based aspects. Yeah, my background as a mathematician may be why I trust that.

LLMs, it’s amazing they now do what they do, but ultimately there’s the black box aspect: what is going on under the hood (so to speak), and how much can you trust the results? For natural language queries and reporting, especially as a human assist, yeah, that makes sense. Hey, we all find Google search useful. We just discard the “wrong answers.”

Buy or DIY (Do It Yourself)?

Chances are that major platforms like ServiceNow and Splunk have rolled out or will roll out sophisticated AIOps capabilities. (They claim to have them, but to what extent?) They can afford huge development budgets and hence may have a lot of features, fast. On the other hand, big organizations sometimes have problems moving fast or breaking new ground.

In any case, the “Big Tools” will likely come with sophisticated (costly) licensing. If you work for a big shop, that still may be the best answer for you, rather than trying to integrate Yet Another Tool.

Similar logic applies to various network management platforms. Given the apparent tendency of vendors towards selling all-encompassing network configuration and management tools (perhaps by domains or parts of the network), what you get may be limited by the vision of your network vendor. E.g., Cisco DNAC, ACI, or Cisco Secure may each grow their own AIOps functions. Cisco DNAC has had some such functions for a while, in fact. Cisco AppDynamics is another place where AIOps might work nicely with app performance data. Common capabilities or shared code across those products? Maybe not at first, since they come from different dev teams.

For such hardware vendor-driven products, development may be hindered by AIOps being add-on value, not as close to the main functionality as it is with, say, Splunk. (Or ServiceNow?)

Third-party network management vendors may want to play, but affordable development is a limitation: few have deep pockets like the bigger vendors above. Vendor-specific log and now telemetry formats are another barrier for them (well, a speed bump anyway).

There are also third-party AIOps vendors. The good news is, they may move faster (features!) and cost less, especially initially. Potential downside: integration into log and telemetry data streams or repositories.

Sample Third Party AIOps Vendors

I have encountered two interesting startups in the AIOps space over the last couple of years. Discussion of their products and perspectives might be informative. I also included a third firm that I and some peers consider significant.

There are likely others, but I am not aware of them. (Marketing people will undoubtedly flood my email inbox upon seeing this!)

Selector.ai

The company can be found at https://www.selector.ai. I blogged about it recently; see “What Does Selector.AI DO?”

Selector.AI presented at NFD30 (Networking Field Day 30) early in 2023. Recommended! I watched the presentations, readily available at https://techfieldday.com/appearance/selector-ai-presents-at-networking-field-day-30/. Great explanations!

The one-sentence summary: Selector.AI ingests logs, metrics, and events, including telemetry (streamed or aggregated), normalizes the data, puts it into a data lake, auto-baselines it, checks for temporal proximity, and then filters and alerts.

Tim Bertino wrote up a great summary of those presentations, which I won’t try to replicate: https://artofnetworkengineering.com/2023/01/23/nfd30-gaining-intelligent-observability-w-selector-ai/.

The telemetry aspect is one key differentiator from products that are solely log-based. My rationale is that simple threshold events from other tools are not very smart indicators of problems, whereas time-aware “smart statistics” (my term) and other AI telemetry data interpretation can potentially provide much smarter alerts, e.g., exceeded the 90th percentile for a five- or fifteen-minute period at that time of day, day of week, etc.
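
To show what I mean by time-aware statistics, here is a minimal Python sketch that baselines a metric per (day of week, hour) bucket and flags values above that bucket's 90th percentile. It is my own illustration, not any vendor's algorithm. The function names, the nearest-rank percentile, and the hourly bucketing are all assumptions, and it checks single samples rather than a sustained five- or fifteen-minute period.

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

Bucket = Tuple[int, int]  # (weekday, hour), with Monday == 0


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def build_baseline(
    samples: Iterable[Tuple[datetime, float]], pct: float = 90.0
) -> Dict[Bucket, float]:
    """Compute one percentile threshold per (weekday, hour) bucket."""
    buckets: Dict[Bucket, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        buckets[(timestamp.weekday(), timestamp.hour)].append(value)
    return {bucket: percentile(values, pct) for bucket, values in buckets.items()}


def is_outlier(timestamp: datetime, value: float, baseline: Dict[Bucket, float]) -> bool:
    """True if the value exceeds the threshold for its own time bucket.

    Buckets with no history return False rather than guessing.
    """
    threshold = baseline.get((timestamp.weekday(), timestamp.hour))
    return threshold is not None and value > threshold


# Made-up history: ten Mondays, busy at 09:00 and quiet at 03:00.
first_monday = datetime(2023, 1, 2)
history = []
for week in range(10):
    day = first_monday + timedelta(weeks=week)
    history.append((day.replace(hour=9), 70.0 + week))
    history.append((day.replace(hour=3), 5.0 + week))
baseline = build_baseline(history)
```

With that history, a reading of 40 at 03:30 on a Monday is flagged, while the same reading at 09:30 is not. A single global threshold would get at least one of those two cases wrong, which is the whole argument for bucketing by time.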

One of the selling points that resonated with me was using Selector.AI to pre-filter log messages, reducing the volume going to ServiceNow (which charges per event).

Other highlights:

Selector also does event correlation, rolling multiple events into a single alert. It provides drill-down on its events. The claim is that this provides actionable root causes.
It can be used with SlackOps.
It provides network health alerting via consolidated metrics across domains.
Selector does configuration compliance, and correlates config changes with network performance issues.
It can use embedded network agents to determine network vs. application-related performance issues.
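
For readers wondering what rolling multiple events into a single alert might look like at its very simplest, here is a short Python sketch based purely on temporal proximity. To be clear, this is not Selector's algorithm, which I have not seen; the Event and Alert classes, the correlate function, and the 60-second gap are all assumptions made up for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Iterable, List


@dataclass
class Event:
    timestamp: datetime
    source: str
    message: str


@dataclass
class Alert:
    events: List[Event] = field(default_factory=list)

    @property
    def sources(self) -> List[str]:
        """Distinct sources involved, for a quick cross-domain view."""
        return sorted({event.source for event in self.events})


def correlate(events: Iterable[Event], gap: timedelta = timedelta(seconds=60)) -> List[Alert]:
    """Group events into one alert while each follows the previous within `gap`."""
    alerts: List[Alert] = []
    for event in sorted(events, key=lambda e: e.timestamp):
        if alerts and event.timestamp - alerts[-1].events[-1].timestamp <= gap:
            alerts[-1].events.append(event)
        else:
            alerts.append(Alert(events=[event]))
    return alerts


# Made-up events: three close together, one much later.
start = datetime(2023, 9, 1, 12, 0, 0)
raw = [
    Event(start, "switch-a", "interface down"),
    Event(start + timedelta(seconds=20), "db-1", "connection timeout"),
    Event(start + timedelta(seconds=50), "app-web", "HTTP 503"),
    Event(start + timedelta(minutes=10), "wifi-ap-7", "client auth failure"),
]
alerts = correlate(raw)
```

The four raw events collapse into two alerts, the first covering app-web, db-1, and switch-a. Real correlation would also weigh topology and learned patterns, but even this crude grouping shows how alert counts can drop.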

From my notes, Selector operates independently of semantics. That is, it needs little domain knowledge (I presume modulo normalization of logs and events and telemetry data.)

Customization to a site is included as part of the standard service.

As of earlier this year, Selector was trying to anticipate customer use cases, e.g. Kubernetes, multi-cloud. They cede application performance to vendors of APM tools such as Cisco. But they do monitor application interactions.

Like Kentik, they have a “negative feature list,” a list of things they do not intend to expend effort on.

BigPanda

https://www.bigpanda.io

BigPanda appears to be more focused on events and alerts, providing “noise reduction” and “eliminating monitoring silos.” It provides “incident intelligence,” which might include probable root cause or triage that is aware of business context. They also state that they automate known incident responses, and automate ticketing, chat, and page notifications with team awareness.

Their blog page looked like some interesting reading: https://www.bigpanda.io/blog/.

Moogsoft

I had intended to write about Moogsoft, which has been doing AIOps a bit longer than the above two firms, and initially got a lot of attention due to some amazing correlation stories. They may even have pioneered the AIOps space. I and some other experts have been watching them for a while as an interesting company.

While exploring BigPanda’s website, I ran across a blog noting that Dell acquired Moogsoft (July 2023), confirmed elsewhere. BigPanda’s theme was that the acquisition validated the AIOps market.

Other Vendors

Sumo Logic looks interesting: https://www.sumologic.com/.

Their marketing appears to be more about log and security analytics delivered via SaaS. Search revealed that they do some machine learning: https://www.sumologic.com/solutions/machine-learning-powered-analytics/. GigaOm just recognized them as a leader in Cloud Observability: https://www.sumologic.com/brief/gigaom-cloud-observability.

Other vendors not mentioned above that Google search shows as defining and marketing AIOps:

AppDynamics
MicroFocus
ScienceLogic
BMC Software
Dynatrace (“causal AI”)
IBM
Cisco
Broadcom
Aruba
Juniper
Palo Alto Networks
(And others, this list is getting too long!)

As for log analytics, there are a number of vendors in that space, so I’m going to desist, and leave discussion of that for another blog. Or for Gartner to comprehensively review! Security tools in related spaces are more likely to be sold as SIEM tools, yet another topic for separate blogging.

Conclusion

It looks like Artificial Intelligence and Machine Learning can provide great value as part of your event/alert and telemetry processing. Reducing the alert count, and using cross-domain event correlation (e.g., all the things that happened around the trouble ticket time) to identify potential problems or root causes, really seems useful.

Suggestion: try one out and see if it helps!

Let’s start a conversation! Contact us to see how NetCraftsmen experts can help with your complex challenges.