Remote Communications During A Crisis Can Be Challenging – Some Things To Think About
This blog is one of several about #NFD16.
Veriflow was kind enough to provide me with a demo and discussion earlier in 2017. I followed that up with the blog post, “Network Verification with Veriflow.” This blog post provides some updates based on what was covered in Veriflow’s #NFD16 presentation. If you like the idea of pre-validating and doing network assurance, or if you’d like more details, be sure to watch the streaming video of the Veriflow presentations and demos!
Veriflow highlights the following capabilities:
Recently, we are hearing a lot about “Intent-Based Networking.” Veriflow has something to say about that (see their Resources page). Their product does verification that the design/implementation (operations) complies with business intent, in vendor-agnostic ways.
The challenge with “intent-based” is how you convey the intent. This was a key question for #NFD16 delegates — several of us realized that the tools helped automated checking of policy, but that to a degree just moves the labor to defining the policy.
One thing Veriflow does is makes it easy to convert items you test to intent rules. This can be specific A-to-B type connectivity (or segmentation), or it could be much broader (tags, groups of devices, etc.).
Veriflow not only collects configuration information to model the network with, but data plane forwarding and Cisco show command output, to include the current state of the network, with time stamping and snapshot capability.
They’ve also built out a long initial list of “auto-intent” items — and it’s possible more might be coming. The sort of thing Veriflow can check: if HSRP is configured, is there a properly configured peer. One example of a challenge with this: maybe you configured HSRP but have yet to deploy router #2. So either take active state into account, or tag “pre-deployment” items.
Veriflow calls this feature “Intent Inference.” It is Spanning Tree and ECMP-aware, so it can do things like catch failover situations where one ECMP alternative will not pass the required traffic (or fails to block disallowed traffic, in a segmentation validation scenario).
This sort of low-level checking is exactly what humans miss. When I do network assessments, lately I’ve been finding a lot of “network entropy,” as in drift or randomness in configurations, resulting in things like trunk allow mismatches, HSRP peer not configured, misconfigured, or just not present for some reason, and many other such things. Design issues tend to be more interesting for all concerned, but it’s that low-level stuff that can really create a lot of hidden problems that will at some point consume your troubleshooting time. Or worse, defeat failover when there is a failure in the network. If we can automate them away, that saves real time for us! For big companies (airlines, online commerce, financials), it potentially saves on embarrassing service outages.
As evidence of the importance of this, I’ve seen two large enterprises do an HA validation, one at the config level (is HSRP or VRRP configured, does the static route to the firewall have the VIP as the next hop, etc.), and the other at the config plausibility level, followed by failover/fallback testing during a weekend change window. Times 900 points of HA. That took a team of several engineers a year to complete using weekend change windows. The cost of doing that might have paid for a Veriflow license several times over!
That sounds like a cost ROI case for having something like Veriflow doing “continuous verification.” It may not be 100 percent perfect the way actual testing failover is, but if it greatly improves the odds that HA will work as intended, or that other things will work as intended, that’s a big win. Catching other sorts of configuration errors (device IP not in right subnet, duplicate IP, mismatched subnet masks, etc., etc.) could be equally important.
The same applies to operations and rapid response. Having a tool spot the path traffic is taking and which ACL rule is blocking it saves time and lowers cost because the Tier 1 and 2 staff can solve more problems.
I recall a long day into evening session with the “brain trust” in a room for quite a while, trying to figure out why there were alerts that the major e-commerce web service was down when outside verification showed it as up. It turned out the internal monitoring tool’s routing was wrong, going inside rather than outside to get to the website, and being blocked. After we caught onto that (with a little help from ThousandEyes doing external monitoring), we spent hours figuring out the path, where it was blocked, how to fix the routing, and getting two firewalls’ static routing and NAT points sorted out.
Veriflow flow checking would likely have rapidly identified the problem, and better, allowed us to test the proposed solution before deploying it. Further, if it had configuration and routing snapshots, we likely could have pinpointed what change triggered the problem in the first place.
The offline troubleshooting aspect of this sort of tool really gets my attention. I’ve seen multiple people tied up for hours trying to solve a problem, which turns out to either be routing or an access list (or the two working against each other). Normally we have to do traceroute to and from the target, then check each hop for routing, ACLs, and other factors. Veriflow speeds that up a lot!
All that of course assumes the model is faithful to the actual network behavior. Veriflow has some sharp people working for it. Having said that, could there be special undocumented, “interesting,” or odd modes of Cisco IOS or NXOS behavior? Darn sure, and I’ve got the scars to prove it! Try redistributing OSPF on a Cisco device with OSPF area 1 configured and no area 0, for example. Does Veriflow catch all such? Maybe a lot of them; probably not all.
Veriflow also has an API, which apparently is quite powerful, letting you automate creation of intent rules. Using the API, you can also query the Veriflow database and correlate the data in ways the product doesn’t already do. The example online (see the API link above) shows code to check for MTU mismatches. I’ve been fiddling around with some Python configuration checking just using captured configurations. Extracting knowledge of neighbors on a link and their state is hard to do without something like the network device and topology models inside Veriflow. Having access to all that lets you extend the capabilities of the tool in powerful ways, without having to wait for the developers to code them up. Leveraging Veriflow could also save you the work of building a topology model from configuration files and/or the live network, allowing you to get what you need faster.
This is something I don’t hear in “networking people should code” discussions: the best use of your time may be supplementing existing tools, not building tools from scratch. YMMV.
Other features discussed:
In summary, Veriflow is doing a lot of interesting and exciting things, and doing them fast. To learn more, check out their website. But first, check out the #NFD16 recorded video of the Veriflow presentations and demos.
Comments are welcome, both in agreement or constructive disagreement about the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
Remote Communications During A Crisis Can Be Challenging – Some Things To Think About
Need for Speed
Container-Based WAN Monitoring
Nick has over 20 years of experience in Security Operations and Security Sales. He is an avid student of cybersecurity and regularly engages with the Infosec community at events like BSides, RVASec, Derbycon and more. The son of an FBI forensics director, Nick holds a B.S. in Criminal Justice and is one of Cisco’s Fire Jumper Elite members. When he’s not working, he writes cyberpunk and punches aliens on his Playstation.
Virgilio “Bong” has sixteen years of professional experience in IT industry from academe, technical and customer support, pre-sales, post sales, project management, training and enablement. He has worked in Cisco Technical Assistance Center (TAC) as a member of the WAN and LAN Switching team. Bong now works for Tech Data as the Field Solutions Architect with a focus on Cisco Security and holds a few Cisco certifications including Fire Jumper Elite.
John is our CTO and the practice lead for a talented team of consultants focused on designing and delivering scalable and secure infrastructure solutions to customers across multiple industry verticals and technologies. Previously he has held several positions including Executive Director/Chief Architect for Global Network Services at JPMorgan Chase. In that capacity, he led a team managing network architecture and services. Prior to his role at JPMorgan Chase, John was a Distinguished Engineer at Cisco working across a number of verticals including Higher Education, Finance, Retail, Government, and Health Care.
He is an expert in working with groups to identify business needs, and align technology strategies to enable business strategies, building in agility and scalability to allow for future changes. John is experienced in the architecture and design of highly available, secure, network infrastructure and data centers, and has worked on projects worldwide. He has worked in both the business and regulatory environments for the design and deployment of complex IT infrastructures.