High Availability Networking (>5-nines)

Terry Slattery
Principal Architect

One of the best sessions I attended at CiscoLive this year was titled “BRKRST-3365, Unified HA Network Design: The Evolution of the Next Generation Network” by John Cavanaugh, Chris Cornwall, and a whole team of contributors.  They talked about the High Availability (HA) network designs that they have done over the past ten years.  Some of their network designs have had no application-affecting downtime over a ten-year period.  They identified several key factors that influence high availability.

The first important factor was cross-connected dual-core networks.  They labeled the two cores as Red and Blue with cross-connections so that a single failure would not cause packets to take a much longer path around the failure, potentially impacting application performance.  Why two core networks?  Full redundancy allows one core to be taken out of service for maintenance while production continues on the other core.

Dual-core redundancy is important for companies that can no longer afford maintenance windows for performing network upgrades.  One VP of network engineering at a financial firm told me that he has two maintenance windows: July 4 and Christmas.  Global companies may find that those days are also unavailable because significant parts of the world economy run year-round.  Being able to take out half of the network for software and hardware maintenance while the business runs on the other half allows prompt resolution of relatively minor network problems, as well as prompt patching of security vulnerabilities in the network infrastructure.
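To put rough numbers on why a second core helps, here is a minimal sketch of the standard parallel-availability arithmetic.  It is not from the session: the function names and the 99.9% single-core figure are hypothetical, and it assumes the two cores fail independently, which is optimistic for any design where the cores share power, software versions, or operators.

```python
# Back-of-the-envelope availability arithmetic for redundant cores.
# Assumption: cores fail independently of each other (optimistic).

MINUTES_PER_YEAR = 365.25 * 24 * 60


def parallel_availability(single: float, paths: int = 2) -> float:
    """Availability of `paths` independent redundant cores.

    The combined system is down only when every core is down at once.
    """
    return 1.0 - (1.0 - single) ** paths


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of downtime per year at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    single_core = 0.999  # hypothetical "three nines" for one core
    dual_core = parallel_availability(single_core)
    for label, value in (("single core", single_core), ("dual core", dual_core)):
        minutes = downtime_minutes_per_year(value)
        print(f"{label}: availability {value:.6f}, about {minutes:.2f} min/year down")
```

The same arithmetic shows the cost of maintenance: while one core is out of service, the network is back to single-core availability, so the time spent in that state matters.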

The other major factor that I liked was their recommendation to reduce the size of failure domains.  A simple example is to design relatively small Layer 2 domains so that when a spanning tree loop occurs, it has a smaller range of impact.  I’ve heard of a 900-server data center outage that was due to the insertion of an old switch into a data center-wide spanning tree domain.  The switch was old enough and slow enough that it couldn’t perform the task of the root bridge.  The entire data center’s operation was affected.  A smaller Layer 2 domain would have reduced the negative impact.

Another HA recommendation that I like is putting redundant servers on different subnets.  Equipment on the subnets should not share common failure sources like routers, switches, power feeds, and cooling.  Geographically diverse data centers help, but watch out for latency between them.  Terrestrial latency is roughly 10 ms per 1000 miles, and high-latency paths between data centers may negatively affect applications whose protocols send only one packet per round-trip time.
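The latency concern can be made concrete with a small calculation.  The sketch below is illustrative only: the function names are hypothetical, it treats the 10 ms per 1000 miles rule of thumb as a one-way figure (an assumption), and it ignores serialization, queuing, and server processing time.

```python
import math

# Rule of thumb from the text; treated here as ONE-WAY latency (assumption).
MS_PER_1000_MILES = 10.0


def round_trip_ms(miles: float) -> float:
    """Estimated round-trip time in milliseconds over a terrestrial path."""
    one_way_ms = miles / 1000.0 * MS_PER_1000_MILES
    return 2.0 * one_way_ms


def transfer_seconds(total_bytes: int, bytes_per_round_trip: int, miles: float) -> float:
    """Time to move data with a protocol that sends one packet per round trip.

    Each packet must be acknowledged before the next is sent, so the total
    time is the number of round trips multiplied by the round-trip time.
    """
    round_trips = math.ceil(total_bytes / bytes_per_round_trip)
    return round_trips * round_trip_ms(miles) / 1000.0


if __name__ == "__main__":
    # Hypothetical workload: 1 MB moved in 1000-byte packets.
    for miles in (10, 500, 2500):
        seconds = transfer_seconds(1_000_000, 1000, miles)
        print(f"{miles:>5} miles: RTT {round_trip_ms(miles):.1f} ms, transfer {seconds:.1f} s")
```

Under these assumptions the transfer time grows linearly with distance regardless of link bandwidth, which is why an application that behaves well inside one data center can slow down sharply when its servers are split across distant sites.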

I highly recommend that you take the time to look up the recording for this session.  It was definitely one of the best I attended.



Re-posted with Permission 

NetCraftsmen would like to acknowledge Infoblox for their permission to re-post this article, which originally appeared in the Applied Infrastructure blog at http://www.infoblox.com/en/communities/blogs.html

