What Are Critical Network Problems?

Terry Slattery
Principal Architect

What are your critical network problems? Every network has them. They are the problems that cause network outages, or are precursors to network outages. Most networks have similar designs and therefore similar critical network problems. I’ve heard network staff say “our network is unique” and I always think to myself, “if that’s true, you have unique problems that you probably don’t understand”. Using well-known configurations allows network vendor technical staff to quickly understand your network and help diagnose problems in a crisis.

Let’s look at some examples of common critical network problems.

  • Redundancy failures. When a redundant link or device is down, the failure needs to be detected and repaired before the backup connection also dies. Without the proper level of monitoring, you may not know that a redundant connection is running on a single link. More than once, I’ve seen network outages because both halves of a redundant configuration failed. In many cases the first failure occurred days, weeks, or months before the second failure, but it was not detected and reported. In one case, a redundant connection was shut down for troubleshooting and was overlooked when the troubleshooting session ended. The network outage occurred when the other connection also experienced a problem.
    A redundancy failure can be caused by incorrect HSRP/VRRP/GLBP configuration, failure of a redundant link, failure of a redundant device, or the application of an ACL/Firewall that blocks an alternate path. Use the NMS to report HSRP problems, links (router interfaces or switch trunking interfaces) that are in up/down state, devices that are unreachable (down), and configuration changes that may affect connectivity via a backup path.
  • Performance failures. Links that are reporting increasing numbers of errors or dropped packets are a source of network performance slow-downs (failures). I’ve been tracking down duplex mismatches at one site where some of the interfaces are reporting more than 1M packet errors per day. I don’t know what’s on the links, but whatever is there isn’t performing very well. Anyone using the devices on those links has to be very patient at that packet loss rate. I suspect that the cause is a duplex mismatch and that, because the traffic level is high, the link isn’t able to transfer much data.
    Another source of performance failures is on congested links. Routers typically report network congestion as dropped packets, not error packets, so tracking a different interface statistic is required. The impact on applications is just like that of interface errors – the packet is lost. In the case of UDP, a higher level application protocol may retransmit it. TCP will retransmit the lost data. In either case, the time delay while waiting to make sure that the packet was lost is significant, impacting overall network performance, and ultimately the productivity of the people using that application.
  • Subnet Mask Inconsistent. Most networks rely on spreadsheets to manage their IP address and subnet allocations. As a result, it is easy to have the same subnet allocated at the same time to two different purposes. All allocations of a given subnet should have the same mask. (The use of Proxy-ARP for hosts that didn’t understand subnetting is now a very old technology and should probably not be used any further unless providing connectivity to one of these old pieces of technology.) I can’t think of a good exception that isn’t tied to broken or deficient hardware. If the same subnet is allocated to two different parts of the network, connectivity will be to the nearest subnet, which may not be where the desired device is located. I’ve seen one subnet allocated as a /24 and the same subnet divided into /28 subnets that were allocated elsewhere in the network. Successful connectivity to the /24 subnet depended on the particular address in use and whether one of the /28 subnets had a closer routing metric. Imagine trying to troubleshoot this one, where both the address and the physical location determine whether the packets can reach the /24 subnet.
  • Configuration not saved. This is a big one. Devices sometimes die in ways that don’t allow you to retrieve the on-board configuration. In other cases, a simple power outage will wipe out a configuration that’s not saved to on-board backup storage (typically NV-RAM). While working at Netcordia, we had a prospect who had about 10 devices with unsaved configurations. The week following the on-site demonstration (the NetMRI had been removed), a power outage caused all 10 devices to reboot, back to their previously saved configurations. Saving configurations is easy to do and it saves a lot of time when the inevitable power failure or hardware failure occurs. Of course, it makes sense to save the config locally to handle the case of a reboot as well as saving it remotely to handle the case of a hardware failure.
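
The inconsistent-mask problem described above can often be caught by checking the allocation spreadsheet itself. The following is a minimal sketch, assuming the allocations have been exported as (name, CIDR) pairs and are all IPv4; the allocation names and addresses are made up for illustration, and `find_overlaps` is a hypothetical helper, not part of any NMS product.

```python
import ipaddress
from itertools import combinations


def find_overlaps(allocations):
    """Return every pair of allocations whose address ranges overlap.

    `allocations` is a list of (name, cidr) tuples, e.g. rows exported
    from an IP address spreadsheet. All entries are assumed to be IPv4.
    """
    parsed = [(name, ipaddress.ip_network(cidr)) for name, cidr in allocations]
    conflicts = []
    for (name_a, net_a), (name_b, net_b) in combinations(parsed, 2):
        # overlaps() is true when one network contains or equals the other,
        # which covers the "/24 here, /28 pieces of it elsewhere" case.
        if net_a.overlaps(net_b):
            conflicts.append((name_a, str(net_a), name_b, str(net_b)))
    return conflicts


# Hypothetical allocation records for illustration only.
allocations = [
    ("HQ user LAN", "10.1.20.0/24"),
    ("Branch printers", "10.1.20.16/28"),
    ("Branch voice", "10.1.20.32/28"),
    ("Data center", "10.1.30.0/24"),
]

for name_a, net_a, name_b, net_b in find_overlaps(allocations):
    print(f"CONFLICT: {name_a} ({net_a}) overlaps {name_b} ({net_b})")
```

With the sample data, the /24 is reported as conflicting with each of the two /28 allocations, while the two /28 subnets do not conflict with each other. A pairwise comparison is fine for a few thousand rows; a larger address plan would probably want the networks sorted first.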
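
For the performance failures described above, the point that errors and congestion drops are separate interface statistics can be sketched as follows. This is only an illustration, assuming two samples of cumulative interface counters (of the kind an NMS would poll) with no counter wrap between them; the `CounterSample` fields, the `classify` helper, and the 0.1% threshold are all assumptions chosen for the example, not values from the article.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Cumulative interface counters at one polling time (hypothetical layout)."""
    in_packets: int
    in_errors: int
    out_packets: int
    out_discards: int


def interval_rates(prev, curr):
    """Return (error_rate, discard_rate) for the interval between two samples."""
    d_in = curr.in_packets - prev.in_packets
    d_out = curr.out_packets - prev.out_packets
    # Guard against idle interfaces to avoid dividing by zero.
    error_rate = (curr.in_errors - prev.in_errors) / d_in if d_in > 0 else 0.0
    discard_rate = (curr.out_discards - prev.out_discards) / d_out if d_out > 0 else 0.0
    return error_rate, discard_rate


def classify(prev, curr, threshold=0.001):
    """Label an interface by which statistic exceeds the assumed threshold."""
    error_rate, discard_rate = interval_rates(prev, curr)
    findings = []
    if error_rate > threshold:
        findings.append("errors")      # e.g. duplex mismatch or bad cabling
    if discard_rate > threshold:
        findings.append("congestion")  # drops from a congested link
    return findings


# Illustrative samples taken a day apart: about 1M input errors in the interval.
yesterday = CounterSample(1_000_000, 100, 900_000, 50)
today = CounterSample(51_000_000, 1_000_100, 40_900_000, 60)
print(classify(yesterday, today))
```

Working from the difference between two samples rather than the raw totals matters because the counters are cumulative: a large total may reflect an old, already-fixed problem, while a growing delta points to one that is happening now.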

What is on your list of critical network problems that you work hard to proactively identify?



Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog under http://www.infoblox.com/en/communities/blogs.html

