How do you learn how to do network design well?
NetCraftsmen does Network Assessments, as well as other types of assessments. We frequently get asked about design skills and sources of best practices. In addition, I’ve been mulling over how to best internally build junior engineers’ skills in design.
Building Design Skills
Here are several ways to build your network design skills:
- Read everything you can get on the topic with a critical eye. Determine what is trustworthy, what is marketing pushing an agenda, what is overly complex, what is good for the customer.
- Read the Cisco Press ARCH book. Now. It ties to the CCDP exam, and certainly doesn’t hurt if you’re doing the CCDE written (from experience). I’ve been involved with the second and third editions, and can vouch that there is lots of good design content. The updated fourth edition came out in December 2016. I now have a copy, but haven’t read/skimmed it, so I can’t comment on it yet. The table of contents indicates some timely updates, including some added case studies at the end. Good idea!
- Read the CCDE suggested reading list — you need to know various approaches, their technical capabilities, limitations, pros, and cons. You may find the CCDE Study Guide useful (good content, many typos). It does include Service Provider and MPLS topics, which might be deferred if you’re primarily interested in enterprise design.
- Read the Cisco Validated Design (CVD) documents. They’re mostly really good content. A few are shallow or have an agenda. In general, I find any given CVD about 95 percent bang on, with minor differences over some of the ancillary details. There are some major exceptions, ones that I consider poor ideas/overly complicated.
- Get into a job position where you can see and analyze many networks, regarding both business requirements and technical design. Consulting (internal or external) and design reviews are one way to do that. Question everything. Think outside the box. Solve business problems for the customer.
- Practice design write-ups and simple clean diagramming. Presenting a design to the customer is important. Did I say “PowerPoint”? Definitely a required skill!
- Practice critical thinking. How could this be done differently? What are the pros and cons of each alternative? Over time, you should build up mental or notes-based lists of alternatives for various design situations (campus, datacenter, WLAN, WAN, small/medium/large sites, etc.). Example: datacenter design with two core switches, or three different flavors of spine-leaf with Nexus 9Ks.
- Practice your PowerPoint and/or Visio skills. They’re essential for presenting a design, and darn useful for showing the stages of a migration or new technology deployment.
Example: Exercising Design Skills
As an example, consider campus network segmentation, a topic that doesn’t get widely discussed. I’m using it for precisely that reason: Thinking through new situations builds skills.
I’ve seen the following alternative approaches. A terse pros/cons analysis is included. I’m not claiming it’s 100 percent perfect; I’m just trying to provide a meaningful example, one where I’ve seen a lot of rigid design thinking (that is, picking one of the approaches below where I’d have picked another, generally because it was simpler or less costly and still met the requirements).
Design Approach | Pros | Cons |
---|---|---|
Firewall-based segmentation | This is common in datacenters, not so much in campuses. | Firewall routing is limited in capability. Firewalls can become traffic bottlenecks. Generally ends up being somewhat clumsy, not agile. |
Central interconnect or core with firewall pairs guarding compartments or zones | | For zone A to talk to B, two pairs of firewalls need to be configured. This scales very poorly. |
Firewall pair with compartments as zones off the firewalls | All the rules are in one firewall pair. | You still have to write two sets of ingress rules for A to B. All the rules are in one firewall pair. Doesn’t scale (scale up, not out). |
Multiple firewall pairs randomly interconnecting security zones | | Routing could get really messy. Avoiding asymmetric flows? Operations and troubleshooting get ugly. Hierarchical designs are generally highly advisable; random meshes are not. |
Network virtualization and segmentation | | Complexity |
VLANs as segmentation | Simple | Large-scale VLANs are not a good idea. |
VLANs + VRFs and VRF Lite/EVN as segmentation | This can work; I have done it at 100+ building hospital and medium-sized government agency scale. Simple, low-tech for operations. | Not elegant. Adding a segment/zone is painful. |
That plus central MPLS | Makes the core more scalable, which in turn simplifies overall scaling. | Cost for MPLS capability (devices supporting it, licensing). Increased tech complexity (MPLS VPNs). |
MPLS-centric campus (e.g. down to access layer using 3850 switches) | Scalable, elegant. | Cost for MPLS capability. Increased tech complexity (MPLS VPNs). |
Campus Fabric: ISE + VLAN/VRF/VXLAN, omitting LISP | If you have a need to extend L2, VXLAN is the way to go. We’re doing it in some datacenters, with BGP (EVPN). It is, in effect, the new FabricPath. | Why would one want to extend L2? VoIP mobility without central CAPWAP and avoiding re-DHCP comes to mind. The DNA Campus Fabric (SD-Access) solution requires LISP. |
(New) Cisco DNA Campus Fabric: ISE + VLAN/VRF/VXLAN + LISP | (See above) | (See above) Increased complexity due to LISP. (I’m not a fan!) LISP is apparently in there for mobility support; I’m not convinced it is needed unless you’re talking mobility across normal L3 routing boundaries. Use cases? |
ISE-centric segmentation | Provides useful security insights into devices, locations, contexts. Doesn’t create “hard boundaries” between user or server groups. Does allow control via PEPs. | Device support in general (SGTs, PEPs). Doesn’t create “hard boundaries” between user or server groups, which is not helpful if you need to keep different types of users segmented from each other. Some requirements might need overly many PEPs. SXP scaling limitations. |
Some comments about a couple of the above items follow.
The Cisco DNA Campus Fabric is fairly new, and I’m still absorbing. I have some design biases that apply (in general):
- I strongly dislike extended L2. VLANs spanning a couple of switches, OK. Across a fair-sized campus, I don’t want to go there.
- I’ve not yet seen any use cases I really believed in for extending L2 at large scale. But if you have to, VXLAN is the way to go.
- LISP is inherently complex. The way it does VRFs is very weird (to my brain, anyway). Counter-intuitive.
- Firewalls tend to be inflexible, rigid and limiting, and generally not great at routing.
I’ve seen firewall-based segmentation mostly when it was designed by security staff. Comfort zone, or “I have a hammer and everything looks like a nail” syndrome?
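To put rough numbers on the scaling concern noted in the table for the "firewall pairs guarding zones" design, here is a back-of-the-envelope sketch in Python. It assumes a worst case where every zone may need to talk to every other zone, and that each zone sits behind its own firewall pair, so permitting A to B means touching two pairs. The function names are made up for illustration.

```python
from itertools import combinations

def zone_pairs(zones):
    """Return every unordered pair of zones that might need to communicate."""
    return list(combinations(zones, 2))

def firewall_touches(zones):
    """Count firewall-pair configurations needed for full zone-to-zone reachability.

    Assumption: each zone sits behind its own firewall pair, so permitting
    A <-> B means configuring both A's pair and B's pair (two touches).
    """
    return 2 * len(zone_pairs(zones))

if __name__ == "__main__":
    for n in (3, 5, 10, 20):
        zones = [f"zone{i}" for i in range(n)]
        print(n, "zones:", len(zone_pairs(zones)), "zone pairs,",
              firewall_touches(zones), "firewall-pair changes")
```

With 10 zones that is 45 zone pairs and 90 firewall-pair changes; the count grows roughly with the square of the number of zones. Real deployments rarely need full any-to-any reachability, so treat this as an upper bound.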
I’ve seen and done some network-based segmentation. In a few cases, ISE might have been more elegant and provided more security information. The trade-off there: low-tech VLANs and VRF for higher-tech ISE, and perhaps difficulties keeping it working. (We have consultants who can help a lot with that, but that defeats do-it-yourself, and has costs.)
Concerning the VXLAN items: I’m on board with VXLAN (and EVPN) for modest-sized datacenters where ACI policy control doesn’t particularly fit, e.g. a 90 percent virtualized NSX environment where ACI as fabric manager isn’t deemed cost-justified.
In the campus, VoIP over WLAN without central CAPWAP might be one reason for extending L2 robustly via VXLAN. Having to re-DHCP while roaming is not great for VoIP. That aspect of WLAN may be evolving. At scale, people seem to want centralized controllers, but doing FlexConnect with them means people roaming between floors or whatever need to re-DHCP, or you need site-wide VLANs. VXLAN is a robust way to do larger-scale L2, via a routed tunnel overlay on top of a robust L3 infrastructure.
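For readers who want to see what "routed tunnel overlay" means at the packet level, here is a minimal Python sketch of the VXLAN header as defined in RFC 7348: an 8-byte header carrying a 24-bit VNI, placed in front of the original L2 frame and carried in UDP (destination port 4789) across the L3 underlay. The function names are illustrative, and the outer Ethernet/IP/UDP headers are deliberately left out.

```python
import struct

VXLAN_UDP_PORT = 4789  # IANA-assigned destination port for VXLAN

def vxlan_header(vni: int) -> bytes:
    """Build the 8-byte VXLAN header for a given 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    flags = 0x08  # 'I' bit set: the VNI field is valid
    # Word 1: flags (8 bits) + reserved (24 bits)
    # Word 2: VNI (24 bits) + reserved (8 bits)
    return struct.pack("!II", flags << 24, vni << 8)

def encapsulate(inner_frame: bytes, vni: int) -> bytes:
    """Prepend the VXLAN header; outer UDP/IP headers are left to the underlay."""
    return vxlan_header(vni) + inner_frame

def decapsulate(payload: bytes):
    """Return (vni, inner_frame) from a VXLAN UDP payload."""
    word1, word2 = struct.unpack("!II", payload[:8])
    if not (word1 >> 24) & 0x08:
        raise ValueError("VNI flag not set")
    return word2 >> 8, payload[8:]
```

The point of the sketch is that the L2 segment identity (the VNI) travels inside an ordinary routed packet, so the underlay only ever sees L3 traffic between tunnel endpoints.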
LISP allows you to track where a mobile device or VM happens to be and route to its location. For datacenters, you may need that capability if you extend L2 (DCI, OTV, or whatever) across locations or normal L3 boundaries. Caution: A lot of complexity and other baggage may accompany this. Design advice highly recommended!
I strongly prefer to not extend L2 in the first place. vMotion is active-active in a sense (VM running in one or the other location). I prefer cloud-scale approaches, where GSLB load balancing or anycast is used to support multiple active instances.
Different people weight factors differently, so some may opt for extended L2. LISP is part of the toolkit available for cases where L2 must be extended. I think of it as useful for /32 routing, in effect.
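The "/32 routing, in effect" idea can be sketched with a toy lookup table in Python. This is not the LISP protocol (no map servers, no encapsulation), just an illustration of how a host-specific entry overrides the covering subnet when a device moves. The class name, prefixes, and site names are all hypothetical.

```python
import ipaddress

class MappingTable:
    """Toy endpoint-to-location map using longest-prefix matching."""

    def __init__(self):
        self._entries = {}  # ip_network -> location name

    def register(self, prefix: str, location: str) -> None:
        """Record that a prefix (subnet or single host) is reachable at a location."""
        self._entries[ipaddress.ip_network(prefix)] = location

    def lookup(self, address: str):
        """Return the location of the most specific matching prefix, or None."""
        addr = ipaddress.ip_address(address)
        matches = [net for net in self._entries if addr in net]
        if not matches:
            return None
        best = max(matches, key=lambda net: net.prefixlen)
        return self._entries[best]

table = MappingTable()
table.register("10.1.0.0/16", "site-A")   # the subnet's home location
table.register("10.1.5.20/32", "site-B")  # one host has moved to another site
```

Here `table.lookup("10.1.5.20")` resolves to `site-B` while its neighbor `10.1.5.21` still resolves to `site-A`. Tracking and distributing those host entries at scale is where the complexity comes in.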
Concerning ISE, I’ve liked it as logical segmentation. The analogy I’ve used is having a corridor getting you wherever you need to go, with a policy/user-group (and other factor) based “badge” that opens some doors for you, but not others. The “door” here is a Policy Enforcement Point (PEP). MPLS, VRF-Lite, etc. are more like building sub-corridors with hard walls within the main corridor: extra structure providing segmentation even where it isn’t really relevant.
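The badge-and-door analogy maps naturally onto a group-to-group permit matrix with a default deny. The Python sketch below is only an illustration of that idea; the group names and the policy itself are invented, and it does not represent how ISE or any PEP is actually configured.

```python
# Hypothetical policy: (source group, destination group) pairs that are permitted.
ALLOWED = {
    ("clinicians", "medical-devices"),
    ("clinicians", "email"),
    ("students", "email"),
}

def pep_permits(src_group: str, dst_group: str) -> bool:
    """Default deny: the 'door' opens only if this badge group is explicitly allowed."""
    return (src_group, dst_group) in ALLOWED

def doors_open_to(src_group: str):
    """List the destination groups a given badge can reach, sorted for readability."""
    return sorted(dst for src, dst in ALLOWED if src == src_group)
```

Note what the sketch does not model: everyone still shares the same corridor. The policy only decides which doors open, which is exactly the "no hard boundaries" trade-off noted in the table.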
The challenge with ISE is that, as with badges, it doesn’t separate different types of users in a typical campus LAN. So if you have a university-affiliated hospital, having mixed user spaces in the hospital might expose some medical staff computers to malware, which might then propagate to FDA-approved non-hardenable medical devices that those medical staff access. (Shades of Stuxnet?)
Another example: A large manufacturing organization used MPLS for core segmentation, with VLANs and VRFs at buildings. They needed to separate admin personnel from manufacturing operations, to protect the assembly line etc. from disruption. Complexity was a bit of a problem and a cost. It was particularly fun when we needed to get IP multicast working on wired and WLAN segments — but that’s another story. The scale was a bit large for VRF Lite, but it could have worked. This type of design I think of as closer to physical segmentation.
By the way, with IoT, segmenting within campuses and sites will likely be a growing problem. We’ll probably want to isolate the different IoT device networks as well, to limit the spread of malware or security exploits as the technology matures. Maintaining separate physical networks for different forms of IoT would be cost-prohibitive.
Challenges
Do you buy into everything I said above? Come up with some areas where your opinion differs, and how you would justify your opinion. What would you do differently?
Comment on this blog. I’d love to hear your thoughts, have a discussion, etc.
Using Design Skills
Our network assessments sample networks and look for major (while noting minor) network design, performance, and configuration issues, all tying back to best practices. We usually also take a high-level look at security (firewalls and segmentation design appropriate, etc.), and, if the data is available, examine syslog records for undetected but serious problems.
NetCraftsmen also performs computer/storage/virtualization assessments, UC assessments, and security assessments. Learn more about our assessments.
Comments
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
Great write up Peter!
What are your thoughts on a design that incorporates the firewall at layer 1? I’ve worked on a design for a big financial firm, where the firewall became part of the core, but it was not participating in any routing. Instead, it allowed us to scale well, because the policy was converted over to a subnet-based type of policy. This design replaced the per-zone firewalling segmentation that you mentioned in your great blog.
Some comments – just to touch the topic:
-be humble – do not assume you are right when it comes to your design – be open to others’ ideas and be willing to change your design (it’s always difficult to see things out-of-the-box, particularly your own design) – in my 25-year career in networking I have experienced this many times
-seek non-technical solutions – read business requirements and discuss them – some of those causing huge complexity can be easily removed as they were added ‘by accident’ or seen as ‘nice to have’ without real reason
-if you are involved in architecture/design activities listen to developers/implementers – some very good ideas for a better design come from their experience. Better often means other aspects like troubleshooting and resilience. You learn about them mostly from experience. It’s easy to throw in more VRFs and fix a particular problem related to traffic engineering – but complexity grows, and of course maintenance costs
-think about upgrade/extension/etc – you need/want to upgrade an old system the easy way – if you offer an upgrade, who wants to risk it when the network service disruption is substantial?
etc, etc, etc
Bogdan
CCIE#10147, Emeritus