I’ve written in this blog about various reasons for using network automation, but it is time to put them together. Counting down…
7. Performance and SLAs
The first thing that network management does is performance monitoring. It’s conceptually easy but surprisingly challenging, primarily due to differences between vendors, changes in standards (e.g., 32-bit vs 64-bit counters and different SNMP versions), and bugs in vendor implementations. Once those hurdles are cleared, the thousands of interfaces need to be sorted by a variety of criteria (e.g., percent utilization, error rates, broadcasts). Alerting thresholds on performance data then need to be defined, and now you have a system that alerts you when utilization is high or errors suddenly appear on a link. Doing this without automation is impossible in any network of more than about 50 routers and switches.
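The core of that task can be sketched in a few lines. The example below is a minimal illustration, not any particular NMS: given two SNMP counter samples per interface, it computes percent utilization (correcting for a single 32-bit counter wrap, one of the quirks mentioned above), sorts busiest-first, and flags interfaces over a threshold. The `Sample` structure and the 80% threshold are assumptions for illustration.

```python
# Illustrative sketch: utilization from two SNMP octet-counter polls,
# with single-wrap correction for 32-bit counters.
from dataclasses import dataclass

COUNTER32_MAX = 2**32  # 32-bit SNMP counters wrap at this value

@dataclass
class Sample:
    name: str            # interface name
    speed_bps: int       # interface speed, bits per second
    octets_t0: int       # in-octets at the first poll
    octets_t1: int       # in-octets at the second poll
    interval_s: int      # seconds between the two polls

def utilization_pct(s: Sample) -> float:
    """Percent utilization, correcting for one 32-bit counter wrap."""
    delta = s.octets_t1 - s.octets_t0
    if delta < 0:                      # counter wrapped between polls
        delta += COUNTER32_MAX
    bits_per_sec = delta * 8 / s.interval_s
    return 100.0 * bits_per_sec / s.speed_bps

samples = [
    Sample("Fa0/1", 100_000_000, 4_000_000_000, 380_032_704, 60),  # wrapped
    Sample("Fa0/2", 100_000_000, 100, 75_000_100, 60),
]

# Sort busiest-first and alert on anything over an 80% threshold.
for s in sorted(samples, key=utilization_pct, reverse=True):
    pct = utilization_pct(s)
    flag = "ALERT" if pct > 80 else "ok"
    print(f"{s.name}: {pct:.1f}% {flag}")
```

Note that a fast interface can wrap a 32-bit counter more than once between polls, which is exactly why 64-bit (`ifHC`) counters exist; the sketch handles only the single-wrap case.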
SLAs are another area where automation is required. How else would you monitor the delay, jitter, and packet loss across a network (to pick three common SLA factors)? An automated system is required to perform SLA tests, process the results, and present the reports.
6. Scaling of processes
There are many processes in managing networks that should be performed regularly to keep a network running smoothly with minimum downtime. But because these processes take a lot of time to perform manually, they are seldom done. With network automation, they can be performed regularly, reducing the risk of an unexpected network failure. Of course, the results should be sent to a network administrator, particularly any alerts or exceptions. These processes include:
- Compliance – Do your configurations adhere to your network policies? Check the security settings of your routers and switches or the network management settings (do all devices send syslog to the right place?)
- Saving configurations – Save all configurations to non-volatile storage on the device and to a backup server.
- Switch port utilization – Identify unused switch ports, allowing you to consolidate connections and perhaps re-deploy existing hardware.
- Improve network resilience – Verify that the first hop redundancy protocol (HSRP/VRRP/GLBP) is configured and operating correctly.
- Consistent deployment – Are all devices running the OS that you have validated? Mixing multiple OS versions is a good way to encounter an unexpected bug.
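The compliance item above, checking that all devices send syslog to the right place, makes a compact example. The sketch below assumes you already have each device's configuration as text; the Cisco IOS-style `logging host` syntax, the device names, and the approved server address are all illustrative.

```python
# Illustrative compliance check: which devices are not logging to the
# approved syslog server? Config text and syntax are IOS-style examples.
import re

APPROVED_SYSLOG = "10.1.1.50"

configs = {
    "core-sw1": "hostname core-sw1\nlogging host 10.1.1.50\n",
    "edge-rtr2": "hostname edge-rtr2\nlogging host 192.0.2.9\n",
}

def syslog_violations(configs):
    bad = []
    for device, text in configs.items():
        hosts = re.findall(r"^logging host (\S+)", text, re.MULTILINE)
        if APPROVED_SYSLOG not in hosts:
            bad.append(device)
    return bad

print(syslog_violations(configs))  # devices not logging to the right place
```

The same shape (pull configs, match a policy pattern, report exceptions) covers most of the compliance checks in the list above.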
5. Inventory
Reduce operating costs by tracking the inventory of your network devices and paying maintenance only on those devices that are actually in your network. Know which devices to upgrade next in a network refresh by tracking the age of each device and the OS loaded on it.
4. Network topology
When troubleshooting, an accurate network topology drawing is valuable. Keeping network drawings up to date is a tedious and often neglected task, and when a problem occurs, I typically see people sketching the network topology before they can proceed with the diagnosis. The NMS collects connectivity information, which can be displayed within the tool or exported to drawing tools (Microsoft has published the Visio XML format).
Topology information is also very valuable for network planning and preventing outages. It allows you to answer questions about uplink oversubscription ratios, verify redundant connections (or the lack thereof), and identify strange topologies that tend to appear in most networks (and that can cause strange behavior or failure modes).
3. Network Analysis
Network analysis is the process of taking all the collected data about a network and performing analysis on that data to identify current and potential problems. The simplest analysis is identifying interfaces running at high utilization. More complex analysis incorporates data from multiple devices, such as determining that a VRRP group only contains one router (the operational data from all routers shows that there is no peer router). The most complex analysis uses multiple sources of data, such as from both configuration files and operational data, exemplified by a duplex mismatch where an interface configuration shows a setting of ‘auto’, the interface’s state is ‘half’, and operational data shows late collisions.
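The duplex-mismatch example in the paragraph above reduces to a three-condition test once the data is collected. In this sketch the field names are illustrative; in practice the values would come from parsed configurations and SNMP/CLI operational data.

```python
# Illustrative sketch of the duplex-mismatch analysis: configured 'auto',
# operating at 'half', and counting late collisions. Field names are examples.

def duplex_mismatch_suspect(iface):
    return (iface["configured_duplex"] == "auto"
            and iface["oper_duplex"] == "half"
            and iface["late_collisions"] > 0)

interfaces = [
    {"name": "Gi0/3", "configured_duplex": "auto",
     "oper_duplex": "half", "late_collisions": 1523},
    {"name": "Gi0/4", "configured_duplex": "auto",
     "oper_duplex": "full", "late_collisions": 0},
]

suspects = [i["name"] for i in interfaces if duplex_mismatch_suspect(i)]
print(suspects)
```

The point is less the code than the pattern: each condition alone is ambiguous, but correlating configuration and operational data together pinpoints the fault.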
Other network analysis incorporates event data (syslog and SNMP traps). Most network management systems collect this data but rely on the network engineer to perform the analysis. Because the network engineer is already busy, this limits what he or she can do, so analysis often defaults to looking at the alerts generated by interface utilization thresholds. Automating the analysis tasks allows easy identification of many problems that network engineers know should be investigated but never have the time to pursue.
2. Correlation of the above items
The next step in automation is to correlate several of the above items. A good example is using topology information to perform higher-frequency performance polling on any interface whose neighbor is another infrastructure device. Edge ports can be polled at a much lower frequency.
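That policy fits in a few lines once the topology data exists. In this sketch the neighbor information would come from CDP/LLDP discovery; the device names and the 60-second and 15-minute intervals are arbitrary examples.

```python
# Illustrative topology-driven polling policy: poll infrastructure-facing
# interfaces often, edge ports rarely. Neighbor comes from CDP/LLDP data.

INFRA_DEVICES = {"core-sw1", "core-sw2", "dist-rtr1"}

def poll_interval_s(neighbor):
    """Neighbor is the device seen on the far end of the link (or None)."""
    return 60 if neighbor in INFRA_DEVICES else 900

print(poll_interval_s("core-sw1"))  # uplink: poll every minute
print(poll_interval_s(None))        # edge port: poll every 15 minutes
```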
Another example is using the topology information to determine whether a subnet has been allocated multiple times. Similarly, it would be good to use topology to tell whether two subnets that overlap but have different masks are on the same segment (due to a typo in the configuration) or are two different subnets in different parts of the network.
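Finding the overlap candidates is straightforward with Python's standard `ipaddress` module; deciding whether a pair is a typo or two legitimate subnets is where the topology data comes in. The subnet list below is a made-up example.

```python
# Illustrative sketch: find pairs of configured subnets that overlap but
# have different masks (e.g., /24 vs /25), candidates for a config typo.
import ipaddress
from itertools import combinations

subnets = ["10.10.1.0/24", "10.10.1.0/25", "10.20.0.0/16"]

def overlapping_pairs(subnets):
    nets = [ipaddress.ip_network(s) for s in subnets]
    return [(str(a), str(b)) for a, b in combinations(nets, 2)
            if a.overlaps(b) and a.prefixlen != b.prefixlen]

print(overlapping_pairs(subnets))
```

Each flagged pair would then be checked against topology: same segment means a likely typo, different parts of the network means a deliberate (if confusing) allocation.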
1. Human error
The biggest and most important reason for network automation is human error. It accounts for at least 40% of network failures (some estimates are as high as 80%). It has been proven that automation helps reduce those errors. Updating the configurations of hundreds of routers and switches is not something that should be done manually. Automated mechanisms that verify a proposed change, combined with a change control process in which changes are validated by other network engineers, are important for reducing or eliminating silly mistakes.
That’s the list. Networks are big. Networks are complex and are increasing in complexity. Automation is the only hope we have of managing the size and complexity while providing high availability.
Re-posted with Permission
NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html