CHALLENGE:
The use of diversity as a tool to mitigate risk and improve security has evolved over time. A major nationwide financial firm came to us over 20 years ago for help improving its application and network availability. The firm had experienced several widespread failures over the preceding few years, and they were drawing attention at the highest levels of its management. The outages ranged from small, single-metro events to large-scale issues affecting entire regions and, in a couple of cases, the whole company. They prevented customers from doing business, denied clients access to their funds, damaged the firm’s reputation, and raised the ire of regulatory agencies. The causes were varied, from service provider failures to misconfigurations by managed service providers and the firm’s own staff.
The challenge was to provide a network design capable of 100% availability.
STRATEGY:
Since no guarantee of 100% availability was or is possible, we needed to better understand what they really needed. Rather than a literal 100%, the company was looking for a methodology to prevent the systemic or widespread outages that had become business-impacting. What they needed was a fully fault-tolerant design that could survive the failures that would inevitably occur.
We employed infrastructure modularity and diversity strategies and methodologies. Initially, this focused on infrastructure diversity, i.e., using parallel and/or modular systems. Over the 20-year course of our relationship with this client, diversity has remained a guiding strategy for risk mitigation and has since expanded into Internet Protocol (IP) equipment vendor diversity.
SOLUTION:
Solution 1 – Infrastructure Modularity and Telecom Diversity
When the client relationship started, most corporate networks were designed as a single system, and the rapid growth of these networks had made them fragile. The concept of modularity was limited to campus, data center, and WAN. Diversity was limited to redundancy, e.g., having two circuits connecting a site to one or more corporate data centers (DCs). The major systemic outages also included failures in service provider frame relay and ATM offerings.
We modeled a network built from distinct modular blocks:
- Core
- DCs
- Regional Centers
- WAN and Branch Networks
This Lego-like approach permitted significant risk-based testing on each module, and we also found ways to limit the ability of events to cascade from one module to others. The technical aspect of this was to introduce Border Gateway Protocol (BGP) into the enterprise as the glue connecting the modules. Interior gateway protocols were easy to use but had scalability issues; BGP scales far better. After all, it runs the Internet. Using BGP at the module boundaries let us prevent dynamic changes from propagating across networks.
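As a rough illustration of how a boundary policy can keep changes inside one module from propagating to others, the sketch below models a module that exports only its pre-approved aggregate prefixes. This is an assumption made for illustration: the actual BGP policy used in the design is not described here, and the function name `exportable` and all prefixes are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def exportable(routes: Iterable[str], allowed_aggregates: Iterable[str]) -> List[str]:
    """Return only the routes that exactly match an approved aggregate.

    Hypothetical model of an export filter at a module boundary:
    more-specific routes stay inside the module, so a change to one of
    them does not alter what the other modules see.
    """
    allowed = {ipaddress.ip_network(p) for p in allowed_aggregates}
    return [r for r in routes if ipaddress.ip_network(r) in allowed]


# Hypothetical regional module: one aggregate plus internal more-specifics.
module_routes = ["10.1.0.0/16", "10.1.20.0/24", "10.1.21.0/24"]
approved = ["10.1.0.0/16"]

before = exportable(module_routes, approved)
# A branch subnet inside the module is withdrawn.
after = exportable(["10.1.0.0/16", "10.1.21.0/24"], approved)
```

In this model, `before` and `after` are identical, which is the property the text describes: an event inside the module does not change what is advertised across the boundary.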
Given the target of a fully fault-tolerant design that could operate at or near the client’s targets, we had to model a solution that provided diversity for each of the modules. This led to a design using two separate core networks. Other modules such as the DC structures were also duplicated.
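The arithmetic behind duplicating a module can be sketched as follows. It assumes the two copies fail independently, which is the reason the design insists on no shared infrastructure; the 99.9% figure is a hypothetical input, not a measurement from this network.

```python
def parallel_availability(single: float, copies: int = 2) -> float:
    """Availability of `copies` independent parallel systems.

    The combined system is down only when every copy is down at once,
    so unavailability is (1 - single) ** copies.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_minutes_per_year(availability: float) -> float:
    """Expected downtime in minutes over a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


one_core = 0.999  # hypothetical availability of a single core
two_cores = parallel_availability(one_core)

print(downtime_minutes_per_year(one_core))   # roughly 526 minutes
print(downtime_minutes_per_year(two_cores))  # roughly half a minute
```

Any shared conduit, fiber, or device breaks the independence assumption, and the combined figure then falls back toward that of the shared element.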
To reduce risk, each core, DC, and regional center operated on a diverse, independent infrastructure. There wasn’t any common fiber, path or telecom equipment shared between them. This addressed all the major outages that had been experienced to that date.
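A separation check of this kind can be sketched as a set comparison between the inventories of the two networks. The inventory layout, category names, and identifiers below are all hypothetical; a real audit would draw on provider route maps and asset records.

```python
from typing import Dict, Set


def shared_elements(
    network_a: Dict[str, Set[str]], network_b: Dict[str, Set[str]]
) -> Dict[str, Set[str]]:
    """Return, per category, the elements that appear in both networks."""
    overlaps: Dict[str, Set[str]] = {}
    for category in network_a.keys() & network_b.keys():
        common = network_a[category] & network_b[category]
        if common:
            overlaps[category] = common
    return overlaps


# Hypothetical inventories for the two cores.
core_a = {
    "conduit": {"conduit-101", "conduit-102"},
    "fiber": {"fiber-a1", "fiber-a2"},
    "telecom_equipment": {"mux-east-1"},
}
core_b = {
    "conduit": {"conduit-201", "conduit-102"},
    "fiber": {"fiber-b1"},
    "telecom_equipment": {"mux-west-1"},
}

violations = shared_elements(core_a, core_b)
# Here both cores list "conduit-102", so the check flags it.
```

An empty result means no element is shared in the recorded inventories; it says nothing about elements missing from the records.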
The result was, quite literally, two separate parallel networks. Each vendor operating in the environment was required to work with its counterpart on the other network (for example, the fiber providers for each network had to work with each other to guarantee that no common conduit, fiber, or telecom equipment was being used).
Individual branch sites did not require the same level of fault-tolerance, but the design needed to solve for critical scaling issues that had started when the firm grew to over 2,000 locations. As a result, branch networks went from a single large WAN to a modularized approach. This prevented issues within each module from cascading to others.
In the end, a service level agreement (SLA) for 100% availability was required and provided, but it was written specifically as a guarantee against systemic or widespread outages. In return, the firm was required to maintain the separation with regular audits and to operate the network as a system, preventing changes on one network while the other was undergoing maintenance (and yes, there was a caveat for malicious insiders).
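The operating rule of not changing one network while the other is under maintenance can be sketched as a simple guard. This is a minimal illustration with hypothetical names; it is not the change-control tooling the firm used.

```python
from typing import Set


def change_allowed(target: str, in_maintenance: Set[str]) -> bool:
    """Allow a change on `target` only if no other network is under maintenance."""
    others = in_maintenance - {target}
    return not others


# Hypothetical state: core-b is in a maintenance window.
print(change_allowed("core-a", {"core-b"}))  # False: core-b is being worked on
print(change_allowed("core-b", {"core-b"}))  # True: the work is on core-b itself
print(change_allowed("core-a", set()))       # True: nothing is in maintenance
```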
Solution 2 – Vendor Diversity
In the initial phase of the project, all the routers, switches and security systems were from a single IP equipment vendor (Cisco Systems). To deal with the single-vendor risk, upgraded software was rolled out with an N+1 phased approach:
- In a lab
- In a segment of one network
- Widely deployed on that network
- After a few weeks of stable operation, on the other network
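The phased rollout above can be sketched as a sequence of gated stages, where the upgrade moves on only after the current stage has been healthy for a minimum soak period. The stage names follow the list; the soak durations and the `Stage` structure are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    name: str
    min_soak_days: int      # hypothetical gate: days of stable operation required
    soaked_days: int = 0    # days of stable operation observed so far
    healthy: bool = False   # whether the stage is currently running cleanly


def current_stage(stages: List[Stage]) -> Optional[str]:
    """Return the first stage that has not yet passed its gate, or None if all passed."""
    for stage in stages:
        if not (stage.healthy and stage.soaked_days >= stage.min_soak_days):
            return stage.name
    return None


rollout = [
    Stage("lab", min_soak_days=7, soaked_days=7, healthy=True),
    Stage("segment of network A", min_soak_days=7, soaked_days=7, healthy=True),
    Stage("all of network A", min_soak_days=21, soaked_days=10, healthy=True),
    Stage("network B", min_soak_days=0),
]

print(current_stage(rollout))  # "all of network A": only 10 of 21 soak days so far
```

The second network is the last stage, so a software defect that shows up during the soak period is confined to the first network.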
However, the N+1 approach still left devices susceptible to zero-day and similar vendor-specific attacks, the kind often used by criminal gangs and even nation-state actors.
To offset the single-vendor risk, more recent implementations employ IP equipment vendor diversity, using diverse route/switch, security, and optical vendors on these types of networks. Think of having one module on Cisco and another using Arista devices. In one case, for a network of similar scale, one core was on Cisco and the other on Juniper routers.
What we discovered in testing was that staying homogeneous within a module worked best. Intermingling vendors within a module led to inter-vendor issues and resulted in lowest-common-denominator feature sets. Distributing an application between DCs, irrespective of their vendor bases, led to the highest availability.
RESULTS:
Over the years, conduits, fibers, links, and network devices have failed, but there have been no systemic outages since the original network’s implementation over 20 years ago. The network as a system has survived floods, tornadoes, hurricanes, and even the loss of a major site.
Meanwhile, the design has been extended to additional regions and has gone through several generational upgrades (for example, the branch network moved from frame relay to MPLS to SD-WAN, and the original SONET-based optics were replaced by 10G and, most recently, by 100G over a DWDM system).
What has survived are the modularity and the operational practices stemming from the strategy of employing diversity wherever possible.
Having the right partner to help you navigate all the system and vendor diversity choices is key.
Let’s start a conversation! Contact us to see how NetCraftsmen experts can help with your complex challenges.