Written by Terry Slattery and Dan Wade
Every network eventually gets a hardware refresh, and automated testing can verify that a site is fully operational after the new hardware and configuration are installed.
Taking the Initiative
Dan Wade, one of our network engineers, describes one of his approaches to network automation. His words explain it best.
A couple of months ago, I built a “site refresh validation” tool that would help [them] verify the operational state of newly installed devices at their remote offices. In total, there are over 40 checks available (depending on the device type), built as a series of test cases in the pyATS automated testing framework. I built the test cases to mimic the current SOP documentation for verifying a site, so that existing processes were still upheld – it just removed the need to manually run the commands and parse the output. My test cases run the commands, parse the output, and compare it with an expected result. This allows the verification process to take less than five minutes – and that includes documenting the output and producing a pass/fail result based on an engineer’s expectation.
I think this could be a great topic because it focuses on using automation as a testing tool. Too often, engineers relate automation to configuration management/large config pushes. I believe verifying operational state (running “show” commands and parsing the output) is the best way to start in automation, versus pushing out configuration snippets to hundreds of devices (potentially causing a mass outage).
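The run-parse-compare workflow Dan describes can be illustrated with a short sketch in plain Python. This is not his pyATS code – the check names, parsed values, and expected values below are hypothetical stand-ins for what a Genie parser would return from a live “show” command – but it shows the core idea: compare parsed state to an expectation and record a pass/fail result per check.

```python
# Illustrative sketch of comparing parsed device state to expected values.
# In a real pyATS test case, `parsed` would come from parsing live "show"
# command output; here it is a hypothetical stand-in.

def verify_state(parsed, expected):
    """Compare parsed device state against expected values.

    Returns a list of (check_name, passed, detail) tuples, one per check,
    so results can be documented alongside the pass/fail verdict.
    """
    results = []
    for key, want in expected.items():
        got = parsed.get(key)
        results.append((key, got == want, f"expected {want!r}, got {got!r}"))
    return results


# Hypothetical parsed output and engineer-defined expectations.
parsed = {"version": "17.9.4a", "uptime_ok": True, "ospf_neighbors": 2}
expected = {"version": "17.9.4a", "ospf_neighbors": 2}

for name, passed, detail in verify_state(parsed, expected):
    print(f"{'PASS' if passed else 'FAIL'}: {name} ({detail})")
```

Because each check produces a named result with detail text, the same loop that decides pass/fail also produces the documentation trail Dan mentions.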
Dan’s solution exactly matches the approach that we at NetCraftsmen have been recommending to organizations looking for a first step into automation.
Of course, it’s beneficial to know the ROI – is this automation measurably better than the manual process it replaced? Dan provided the following additional information.
How long does it take to do the checks manually?
I guess I should clarify. All the checks (some include ping tests, which take the longest since you’re battling latency) take less than 5 minutes, not 5 minutes per check. To manually verify, I would say engineers need at least 15-20 minutes to run the show commands, manually read through the output to find the necessary values, and record the results.
How long did it take to develop your automation?
It took me about 3-4 months to gather the requirements (with engineers’ input), build out the test cases, and have the engineers QA the test results.
How often is it used?
Since it’s currently only used after site refreshes, I would say it’s used maybe 10-15 times a week, give or take, so maybe close to 40-50 times a month (it really depends how many refreshes the engineers do that particular month). I know engineers have also used it as an ad-hoc tool to verify operational state at existing sites (without a recent refresh), but it’s mostly used for site refreshes.
Doing the math, Dan determined that the savings are 10-17 hours per month (40-50 uses per month, each replacing the 15-20 minutes of manual effort). What that figure doesn’t capture is the elimination of the manual process’s human error. The client’s network engineers are also using the system to verify operational state at existing sites, and the results serve as documentation of each site’s installed state. Subsequent runs at a site validate that the baseline network state is still correct, a valuable troubleshooting step that reduces the time to resolution.
But how long did it take to develop?
I would guess around 120 hours of focused time (120 / 8 = 15 days) over a 3-4 month period.
That’s an excellent ROI: 120 hours to develop a tool that saves 10-17 hours per month, implying a direct cost payback period of roughly 7 to 12 months. But it doesn’t end there.
The nice thing about the tool is that it’s been built with extensibility in mind and provides a framework to continue evolving the testing to fit future use cases. So the next use case may only take 10-20 hours to extend the existing framework vs. building from scratch. In that case, the new feature’s ROI would break even within a couple of months of development.
Dan’s key point is that once the base automation system is developed, additional features are a much smaller incremental effort, reducing their payback time.