A Network Management Architecture, Part 3

Terry Slattery
Principal Architect

This is the third in a series of posts on the network management architecture used by NetCraftsmen in our network assessments. The architecture consists of seven elements, shown below.

[Note: The two prior posts, A Network Management Architecture, Part 1 and A Network Management Architecture, Part 2, covered Events, NCCM, Performance, and IP Address Management. In this post, we’ll look at Active Path Testing and Application Performance.]

Active Path Testing

Network management systems tell us how the network devices are operating. But they often don’t tell us what the applications experience as their packets transit the network. The applications are what the business relies upon to remain viable, so it is essential to know what the applications are experiencing.

Active path testing generates synthetic transactions to measure a path’s operational characteristics, such as delay, jitter, and packet loss. Some tools can show path capacity (how much bandwidth a path has) and utilization (how much of the bandwidth is being used). In the figure below, we see that a path had continuously high utilization until May 26. Something was consuming bandwidth on the path until that date. Other tests that originated from the same facility also showed high utilization, indicating that the problem was within that facility.
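To make the measurements concrete, here is a minimal sketch of how delay, jitter, and packet loss could be summarized from one run of synthetic probes. The function name, the input format (a list of round-trip times with None for a lost probe), and the jitter definition are assumptions for illustration; commercial path testing tools may compute these values differently.

```python
from statistics import mean

def summarize_probes(rtts_ms):
    """Summarize one test run.

    rtts_ms: one round-trip time in milliseconds per probe sent,
    with None marking a probe that never came back (assumed format).
    """
    received = [r for r in rtts_ms if r is not None]
    sent = len(rtts_ms)
    loss_pct = 100.0 * (sent - len(received)) / sent if sent else 0.0
    # Jitter here is the mean absolute difference between consecutive
    # received RTTs. Other definitions exist; this one is an assumption.
    diffs = [abs(b - a) for a, b in zip(received, received[1:])]
    return {
        "delay_ms": mean(received) if received else None,
        "jitter_ms": mean(diffs) if diffs else 0.0,
        "loss_pct": loss_pct,
    }

# Illustrative sample only: five probes, one of them lost.
sample = [20.0, 22.0, None, 21.0, 25.0]
print(summarize_probes(sample))
```

Path capacity and utilization need more elaborate probing than this and are not covered by the sketch.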

The key to using an active path testing tool is to instrument the network so that you can determine where a problem that is affecting an application is occurring. If a six-hop path is being monitored and a problem is detected, how do you identify the hop that is causing it?

We recommend creating a set of overlapping path tests so that a problem on a given hop can be identified by correlating the results of the tests that share that hop. In the figure below, Path Testing Deployment Design, we recommend a full mesh of tests between the three appliances, shown as green circles. Sites 1, 2, and 3 are the main data centers in which most application servers are located. The full mesh of tests monitors the backbone links and the Site N Distr to Site N Core links. Within each data center site, configure tests from the test appliance at that site to other targets within the site. These tests cover the intra-site paths.
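The deployment described above can be sketched as a small test-plan generator. This is only an illustration: the site and target names are hypothetical, and we assume one test per pair of appliances (some tools run a separate test in each direction).

```python
from itertools import combinations

def build_test_plan(appliance_sites, intra_site_targets):
    """Return a list of (source, target, kind) tuples.

    appliance_sites: sites that host a test appliance.
    intra_site_targets: dict of site -> list of targets inside that site.
    """
    plan = []
    # Full mesh between the test appliances: one test per unordered pair.
    for a, b in combinations(appliance_sites, 2):
        plan.append((a, b, "backbone"))
    # Tests from each appliance to targets within its own site.
    for site in appliance_sites:
        for target in intra_site_targets.get(site, []):
            plan.append((site, target, "intra-site"))
    return plan

# Hypothetical names for illustration.
sites = ["Site1", "Site2", "Site3"]
targets = {"Site1": ["Site1-Access-A"], "Site2": ["Site2-Access-A"]}
for test in build_test_plan(sites, targets):
    print(test)
```

With three appliances the full mesh is only three tests, which is part of why this structured approach keeps the test count small.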

Finally, configure tests from each Data Center Site N to targets at the remote sites. If a site is dual-homed, some policy-based routing may be needed to force test traffic over a specific path. Alternatively, specify a remote target that will always prefer a given path, assuming normal routing and no network failures.

Create alerts to be sent when a path fails or when one of the operational characteristics exceeds normal operating parameters. A single link failure should generate at least one alert. Multiple simultaneous alerts suggest that a link shared by several tests, such as a core link, has failed.
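A minimal sketch of that alerting logic follows. The metric names and threshold values are hypothetical; in practice the thresholds should come from the normal operating parameters observed on each path.

```python
# Hypothetical thresholds; tune them per path from baseline measurements.
THRESHOLDS = {"delay_ms": 80.0, "jitter_ms": 10.0, "loss_pct": 1.0}

def evaluate(test_name, result, thresholds=THRESHOLDS):
    """Return a list of alert strings for one test result.

    result: dict of metric -> value, or None if the path failed entirely.
    """
    if result is None:
        return [f"{test_name}: path failed"]
    alerts = []
    for metric, limit in thresholds.items():
        value = result.get(metric)
        if value is not None and value > limit:
            alerts.append(f"{test_name}: {metric}={value} exceeds {limit}")
    return alerts

print(evaluate("Site1-Site2", None))
print(evaluate("Site1-Site3", {"delay_ms": 95.0, "jitter_ms": 3.0, "loss_pct": 0.0}))
```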

With the structured approach described here, the number of tests is minimized. If you have clearly identified the tests and created a spreadsheet that documents each test and the links it traverses, you will be able to tell from the alerts which link is causing problems.
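The spreadsheet lookup can also be expressed in code. The sketch below assumes a single faulty link and uses hypothetical test and link names: a link is suspect if it is on every failing test and on no passing test.

```python
def suspect_links(test_links, failing_tests):
    """Return the links that every failing test shares and no passing test uses.

    test_links: dict of test name -> set of links the test traverses
                (the same information as the spreadsheet).
    failing_tests: set of test names that are currently alerting.
    """
    failing = [set(test_links[t]) for t in failing_tests]
    if not failing:
        return set()
    common = set.intersection(*failing)
    # Links exercised by a passing test are assumed to be healthy.
    healthy = set()
    for name, links in test_links.items():
        if name not in failing_tests:
            healthy |= set(links)
    return common - healthy

# Hypothetical test-to-link mapping for a three-site full mesh.
test_links = {
    "S1-S2": {"S1-distr-core", "S1-S2-backbone", "S2-distr-core"},
    "S1-S3": {"S1-distr-core", "S1-S3-backbone", "S3-distr-core"},
    "S2-S3": {"S2-distr-core", "S2-S3-backbone", "S3-distr-core"},
}
print(suspect_links(test_links, {"S1-S2", "S1-S3"}))
```

In this example the two failing tests share only the Site 1 distribution-to-core link, so that link is the one to inspect first. With multiple simultaneous failures the result may be empty, and the correlation has to be done by hand.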

Application Performance

While active testing checks the paths, there is still a need to look at the applications themselves. This class of tools will need to capture packets and be able to see part of what is happening within the packets. While this analysis could be done with Wireshark or Sniffer, it would be extremely tedious.

The application performance monitoring tools understand applications running on the network and are able to show the applications that are running (within the limits of packet decoding) and associate clients with servers. A good system will be able to report on network delays, packet loss, jitter, server delays, and client “think time” delays. On several occasions, we’ve found significant server delays, which absolved the network team of any responsibility for the slowness. Extending the packet capture to the server farm allows the system to help identify which server in a multi-tier implementation (application server, middleware server, or database server) is contributing to the delays.
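As a rough illustration of how network delay can be separated from server delay, the sketch below works from four packet timestamps taken at a capture point near the client. The field names and the method are assumptions: the TCP handshake approximates the network round trip, and whatever remains of the request-to-response time is attributed to the server. It is not a description of how any particular product works.

```python
from dataclasses import dataclass

@dataclass
class Transaction:
    # Capture timestamps in seconds, taken near the client (assumed).
    syn: float             # client SYN seen
    syn_ack: float         # server SYN-ACK seen
    request: float         # client request seen
    first_response: float  # first byte of the server response seen

def split_delays(tx):
    """Return (network_rtt, server_delay) in seconds for one transaction."""
    # The handshake is answered by the server's TCP stack, so it
    # approximates the network round-trip time without application delay.
    network_rtt = tx.syn_ack - tx.syn
    response_time = tx.first_response - tx.request
    # Whatever is left after one network round trip is attributed to the server.
    server_delay = max(response_time - network_rtt, 0.0)
    return network_rtt, server_delay

# Illustrative timestamps only.
rtt, server = split_delays(Transaction(0.000, 0.040, 0.045, 0.585))
print(f"network RTT {rtt * 1000:.0f} ms, server delay {server * 1000:.0f} ms")
```

A result like this one, where the server delay dwarfs the network round trip, is the kind of evidence that points away from the network.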

Another characteristic of these tools is the ability to see classes of applications, determined by protocol and/or endpoints. At one site, we found a saturated link on which 50% of the traffic was originating from three sources over HTTP: Pandora (streaming audio), Akamai Networks (audio and video streaming and downloads), and Limelight Networks (audio and video streaming and downloads). We named this traffic “entertainment traffic”. The entertainment traffic was choking out the business traffic. Due to organizational policies, it couldn’t be stopped, so we applied QoS. The Internet entertainment traffic was placed in a low-priority queue that was allowed to use only the bandwidth left over by other traffic. The complaints about slow application performance stopped.
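Classifying traffic by endpoint and computing each class's share of a link can be sketched as follows. The hostnames, byte counts, and the suffix-matching rule are made up for illustration and are not the data from the site described above.

```python
from collections import defaultdict

def classify(host, rules, default="other"):
    """Map a server hostname to a traffic class using suffix rules."""
    for suffix, traffic_class in rules.items():
        if host == suffix or host.endswith("." + suffix):
            return traffic_class
    return default

def share_by_class(flows, rules):
    """flows: iterable of (server_host, byte_count). Returns class -> percent."""
    totals = defaultdict(int)
    for host, nbytes in flows:
        totals[classify(host, rules)] += nbytes
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {c: 100.0 * b / grand_total for c, b in totals.items()}

# Hypothetical rules and flow records.
rules = {"music.example": "entertainment", "video.example": "entertainment"}
flows = [
    ("cdn1.music.example", 300),
    ("video.example", 200),
    ("erp.corp.example", 500),
]
print(share_by_class(flows, rules))
```

Once a class such as entertainment is identified this way, the same endpoint list can feed the QoS classification on the routers.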

The application performance management tools, because they do packet capture, can also see packet loss in some protocols, such as TCP or the RTP streams that carry VoIP. These protocols contain sequence numbers that indicate the order in which the packets were sent, so that packets received out of order can be put back in order. In TCP, there’s another use: identifying packets that need to be retransmitted because the original packet was dropped (typically due to link errors, a lack of buffers, or congestion). Regardless of the cause, it is important to know that key business applications are experiencing packet loss. (See my blog TCP Performance and the Mathis Equation.) If packet loss is detected, check the interfaces along the path for errors and drops. The packets are being lost somewhere in the path, and it is simply a matter of finding where. Note that Wireshark or a similar tool can identify TCP retransmissions without needing to decode the application. Alternatively, the server team could check for TCP retransmissions on the servers and on the clients that are experiencing problems. If SNMP is enabled, the NMS tools may even be able to gather this fundamental data.
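The idea of spotting retransmissions from sequence numbers can be sketched with a simple heuristic: a data segment whose sequence number is below the highest byte already seen on that flow is counted as a retransmission. The input format is an assumption, sequence number wraparound is ignored, and a reordered segment would be miscounted, so real analyzers such as Wireshark apply more careful rules.

```python
def count_retransmissions(segments):
    """Count suspected TCP retransmissions per flow.

    segments: iterable of (flow_id, seq, payload_len) in capture order,
              where flow_id identifies one direction of one connection
              (assumed format).
    Returns a dict of flow_id -> number of suspected retransmissions.
    """
    highest_seen = {}  # flow_id -> sequence number just past the last byte seen
    retrans = {}
    for flow, seq, length in segments:
        if length == 0:
            continue  # pure ACKs carry no data
        end = seq + length
        highest = highest_seen.get(flow)
        if highest is not None and seq < highest:
            # Data that starts before the highest byte already seen.
            retrans[flow] = retrans.get(flow, 0) + 1
        highest_seen[flow] = end if highest is None else max(end, highest)
    return retrans

# Illustrative capture: the third segment repeats the first one.
capture = [
    ("a", 1000, 100),
    ("a", 1100, 100),
    ("a", 1000, 100),
    ("a", 1200, 100),
    ("b", 50, 0),
]
print(count_retransmissions(capture))
```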

In my next post, we’ll cover Topology Mapping and wrap up the series.


Other posts in this series:

Part 1 | Part 2 | Part 4


Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html.
