Troubleshooting Cloud Apps with Murphy

Author
Peter Welcher
Architect, Operations Technical Advisor

I have followed Brighten Godfrey (Technical Director at VMware and Professor at the University of Illinois at Urbana-Champaign) for a while. Brighten was the key technical person behind Veriflow, a product similar to Forward Networks. Veriflow was acquired by VMware and embedded into vRealize Network Insight. He and his teams keep showing up with innovative and interesting research projects and results.

For more background, see the Links section below.

Why is this blog relevant to you, the reader? Networks and apps have expanded to the cloud, adding complexity, particularly for troubleshooting. We must be prepared to work with other skill areas across the organization to solve problems. Tools that can help us do that might be valuable!

Normally, I write about products or technology. This blog is different: Murphy represents research that might end up in VMware tools and/or stimulate similar approaches in other tools. Early-stage technology!

This research demonstrates the value of combining broad telemetry collection with AIOps correlation tools for diagnosing performance problems.

So please keep reading for my summary of Murphy below!

Who or What Is Murphy?

Murphy is a “learning-based” approach to cloud network troubleshooting. The goal is to “allow enterprise teams to become more proactive about improving the performance of their distributed cloud applications and network infrastructure.”

Murphy uses telemetry and machine learning. (I’ll even say “AIOps,” although I haven’t seen it in their writeups about Murphy.) It models known “loosely defined” relationships between entities (app and VM, two communicating VMs, etc.).
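
To make the idea of “loosely defined” relationships between entities a bit more concrete, here is a minimal Python sketch. It is not Murphy's code or algorithm: the entity names, relationship labels, and the simple hop-count search are all assumptions made up for illustration. It only shows how relationships such as “app and VM” or “two communicating VMs” might be stored as a graph, and how one might list the entities near a misbehaving app as candidates worth examining.

```python
from collections import defaultdict, deque

# Hypothetical entities and relationship labels, for illustration only.
RELATIONSHIPS = [
    ("app:checkout", "vm:web-1", "runs-on"),
    ("vm:web-1", "vm:db-1", "communicates-with"),
    ("vm:db-1", "host:esx-7", "hosted-on"),
    ("vm:web-1", "host:esx-3", "hosted-on"),
]


def build_graph(relationships):
    """Build an undirected adjacency map (assumes influence can flow both ways)."""
    graph = defaultdict(set)
    for a, b, _label in relationships:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def related_entities(graph, start, max_hops):
    """Breadth-first search: map each entity within max_hops of start to its hop count."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph[node]:
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    del distances[start]  # the starting entity is not its own candidate
    return distances


graph = build_graph(RELATIONSHIPS)
candidates = related_entities(graph, "app:checkout", max_hops=2)
for entity, hops in sorted(candidates.items(), key=lambda kv: (kv[1], kv[0])):
    print(f"{entity}: {hops} hop(s) away")
```

A real diagnosis tool would go well beyond this, combining such relationships with telemetry and a learned model to rank likely causes. The sketch only illustrates the relationship graph that such reasoning could start from.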

The ACM research article states: “Compared to past work, Murphy can reduce diagnosis error by ≈ 1.35× in restrictive environments supported by past work, and by ≥ 4.7× in more general environments.”

VMware reportedly incorporated some of these research approaches into the Network Insights Beta feature (Enterprise Cloud tab) of the VMware Aria Operations for Networks product.

Interesting tidbit from the ACM writeup:

“We implemented Murphy in Python with ∼7K LOC. For reference schemes, we used the author-provided implementation for Sage and our own implementation of NetMedic and ExplainIT as their code wasn’t available publicly. We test the schemes in two environments: (a) a cloud environment of a large enterprise running many production applications and (b) microservice-based applications (from the DeathStarBench suite [17]) running on private servers and a public cloud environment (AWS).”

Links

https://www.linkedin.com/in/brighten/

https://www.google.com/search?q=”Brighten Godfrey”

https://www.google.com/search?q=”veriflow”

https://www.google.com/search?q=”veriflow vRealize”

https://dl.acm.org/doi/10.1145/3603269.3604877 (Abstract)

https://dl.acm.org/doi/pdf/10.1145/3603269.3604877 (Technical Paper)

https://blogs.vmware.com/management/2023/09/improving-cloud-network-troubleshooting-a-research-based-solution-unveiled-at-sigcomm-2023-new-york-city.html

Conclusion

I hope this gave you a little peek into the research behind coming AIOps tools, and the potential power they may provide. It did for me!

I’ll also note that some “networking people” may be working in heavily virtualized environments, particularly as more distributed Edge Computing is deployed. One example is an all-VMware on-prem environment combined with Cloud, perhaps VMware-based Cloud capabilities, perhaps even leveraging VMware’s Tanzu functionality. In such an environment, we should be aware of the troubleshooting capabilities of the VMware tools that are likely present!

Let’s start a conversation! Contact us to see how NetCraftsmen experts can help with your complex challenges.

 
