How Total Data Center Visibility Benefits Planning and Design


Two years ago, Cisco Fellow Navindra Yadav had an idea: create a self-driving data center. He saw how much time, and how many people, went into managing data center networks and applications, and how a lack of complete information makes management and troubleshooting more difficult. He knew how important pervasive security is for the data center, and he realized how hard attacks are to find. How can you be agile if you’re always trying to keep your data center from breaking? Yadav felt that hardware and software could be united to solve these problems, and he had an idea how to do it. So he started working on what would become Cisco’s new analytics engine: Tetration.

Through the Cisco Champions program, I had the privilege of a pre-announcement briefing, and then attended the product launch in New York City on June 15th. There, Terry Slattery, Colin Lynch, and I spent an hour with Navindra and Tetration Product Manager Jothi Prakash. Navindra explained that this wasn’t a spin-out/spin-in type deal, but rather a homegrown solution. It was developed and tested with Cisco IT until it proved its value to Cisco, which then funded a team to work on it.

[Photo: Tetration launch in NYC. From left to right: Lauren Friedman, Denise Donohue, Navindra Yadav, Terry Slattery, Jothi Prakash, Yogesh Kaushik, Colin Lynch]

Fast-forward two years, and Cisco IT was ready to migrate an entire Hadoop environment containing eight petabytes of data over the weekend. Because of the application and traffic-flow knowledge gained through Tetration analytics, the manager was confident he was going to make it to his vacation starting that Monday.

By now, you’ve probably seen many postings on how this magic happens. (See Colin’s and Terry’s blog posts.) Real-time metadata is collected from every packet of every flow, every interaction between every device in your data center.

Or, at least that’s the final goal. Right now, “every device” includes the next-gen Nexus 9300 switches and both virtual and bare-metal servers. It can also ingest logs and configuration data from Layer 4-7 services, such as Infoblox. Data is collected through sensors that send telemetry data to Tetration. Server (host) sensors run inside Virtual Machines (VMs), so they see everything the VM sees. This also works on VMs within cloud services, such as AWS, Google, and Azure, and most versions of Linux and Windows are supported.

For the Nexus switches, network sensors are built into an ASIC, which is connected to the backplane of the switch and sees every packet. The sensors look at the first 160 bytes of the packet to extract the metadata, and then send it to the Tetration analytics engine. There is no CPU involvement on the switches. Cisco has measured the network overhead at less than 1% and CPU usage of about one-quarter of a CPU on servers.
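
To make that concrete, here’s a back-of-the-napkin sketch in Python of what header-only metadata extraction looks like. The offsets are standard Ethernet/IPv4/TCP, but the function name and record layout are invented for illustration; this is nothing like Tetration’s actual line-rate ASIC implementation, and a real sensor would examine up to the first 160 bytes rather than the few this toy needs.

```python
import struct

def extract_flow_metadata(frame: bytes):
    """Pull a flow's 5-tuple from the leading bytes of an Ethernet frame.

    Header fields only -- the payload is never read. A hypothetical
    illustration of metadata extraction, not Tetration's implementation.
    """
    if len(frame) < 34 or struct.unpack("!H", frame[12:14])[0] != 0x0800:
        return None                      # IPv4 only, for brevity
    ihl = (frame[14] & 0x0F) * 4         # IPv4 header length in bytes
    proto = frame[23]                    # 6 = TCP, 17 = UDP
    src_ip = ".".join(str(b) for b in frame[26:30])
    dst_ip = ".".join(str(b) for b in frame[30:34])
    l4 = 14 + ihl                        # start of the TCP/UDP header
    sport = dport = None
    if proto in (6, 17) and len(frame) >= l4 + 4:
        sport, dport = struct.unpack("!HH", frame[l4:l4 + 4])
    return {"src": src_ip, "dst": dst_ip,
            "sport": sport, "dport": dport, "proto": proto}
```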

It sounds like a really powerful information tool, which brings us to two questions: Is this a useful tool, or a solution looking for a problem? And what would drive you to install an entire rack of gear that lists for $3 million to $4 million?

I have a few thoughts.

Data Center Planning and Design

Are you really confident that you know all the applications in your data center, and all the details about those applications? If so, you’d be the first person I’ve heard of who is! Even Cisco found that it could decommission more than 40% of its VMs based on information from Tetration, a savings in both cost and resources. If you’re moving to centralized, software-defined control, you need to know data flows and application dependencies. What talks to what, with what type of traffic? What access is critical to make this application work? Application Centric Infrastructure (ACI), for example, is based on a whitelisting model: if you don’t know all the connections you need to permit, you could end up causing an outage.
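
To see why whitelisting demands complete knowledge, consider a toy sketch of how a permit list could be distilled from observed flows. The flow records and the build_whitelist helper are hypothetical, not any Tetration or ACI API; the point is simply that any dependency you fail to observe becomes an implicit deny.

```python
# Hypothetical observed flow records (the shape a sensor might export).
flows = [
    {"src": "10.1.1.10", "dst": "10.1.2.20", "dport": 443,  "proto": 6},
    {"src": "10.1.2.20", "dst": "10.1.3.30", "dport": 3306, "proto": 6},
    {"src": "10.1.1.10", "dst": "10.1.2.20", "dport": 443,  "proto": 6},
]

def build_whitelist(flows):
    """Collapse observed flows into (consumer, provider, port, proto) rules.

    In a whitelist model, anything absent from this set is implicitly
    denied -- which is why a missed dependency causes an outage.
    """
    return {(f["src"], f["dst"], f["dport"], f["proto"]) for f in flows}

for rule in sorted(build_whitelist(flows)):
    print("permit %s -> %s port %d proto %d" % rule)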

When you’re expanding or refreshing your data center, or building out a new one, accurate application information leads to precise planning and design. How many servers will you need? What are the bandwidth and latency requirements for each application? For some applications, inter-rack latency is too high, so dependent servers must sit in the same rack, and you must be able to identify all of them. Can you imagine how valuable it would be to know that you’ve accounted for every application when building out a new data center? Or to know exactly what application traffic goes through a particular switch before you replace it? Just think of how that would minimize downtime during a migration.
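
As a small illustration of that placement math, here’s a hypothetical sketch that groups servers connected by latency-sensitive dependencies into must-share-a-rack sets. The server names and the co_rack_groups helper are invented; the idea is just that complete dependency data turns rack planning into a solvable graph problem.

```python
from collections import defaultdict

# Hypothetical dependency edges flagged as too latency-sensitive to
# cross racks (server pairs that must be co-located).
sensitive = [("web-1", "cache-1"), ("cache-1", "db-1"), ("etl-1", "hdfs-1")]

def co_rack_groups(edges):
    """Group servers connected by latency-sensitive dependencies.

    Each connected component must land in the same rack, so the largest
    group bounds the minimum rack capacity you can design for.
    """
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    seen, groups = set(), []
    for node in graph:
        if node in seen:
            continue
        stack, comp = [node], set()
        while stack:                     # depth-first walk of one component
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(graph[n] - comp)
        seen |= comp
        groups.append(sorted(comp))
    return groups

print(co_rack_groups(sensitive))
# [['cache-1', 'db-1', 'web-1'], ['etl-1', 'hdfs-1']]
```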

Tetration has a replay function that lets you do a “what-if” analysis on stored data. I can think of several uses for this, such as:

  • Predicting future growth. You can create a much more accurate picture of future needs by analyzing past usage and changes over time. This would help you size the data center appropriately, so it’s neither over- nor underbuilt. Understanding your company’s data patterns would also help you plan future growth, enabling you to predict future performance and “right-time” data center expansions.
  • Testing data center changes. When Cisco was moving its Hadoop cluster to a new data center, it was confident the new system had been architected correctly because the company used Tetration to replay data flows. This allowed it to correct design flaws and ensure that resiliency was working as planned — before moving anything to production. You can test the effects of adding or subtracting switches or servers or making policy changes. Based on historical data, you can see the effect a change would have at various times in your business cycle (e.g., a retailer’s holiday rush).
  • Testing policy changes. Once you collect data and have a baseline for a specific application, you can run a simulation of proposed policy changes. This will tell you the exact effect that policy would have had on traffic, ensuring a policy does what it should before you put it into production (see the sketch after this list).
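
Here’s a minimal sketch of that last idea. It assumes flow records shaped like the earlier examples and a whitelist of permit tuples; simulate_policy is a hypothetical helper, not Tetration’s replay API.

```python
# Hypothetical stored flow records (the same shape a sensor might export).
stored_flows = [
    {"src": "10.1.1.10", "dst": "10.1.2.20", "dport": 443,  "proto": 6},
    {"src": "10.1.2.20", "dst": "10.1.3.30", "dport": 3306, "proto": 6},
]

def simulate_policy(flows, permit_rules):
    """Replay stored flows against a proposed whitelist.

    Returns the flows the policy would have dropped -- the effect the
    change would have had, judged from history rather than production.
    """
    return [f for f in flows
            if (f["src"], f["dst"], f["dport"], f["proto"]) not in permit_rules]

# A proposed policy that forgets the app tier's database dependency:
proposed = {("10.1.1.10", "10.1.2.20", 443, 6)}
for f in simulate_policy(stored_flows, proposed):
    print("would have been denied:", f)
```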

Security and Troubleshooting

I’ve lumped security and troubleshooting together because they use the same capabilities within Tetration. Both benefit from the ability to replay data flows and interactions. On the security side, you could replay an attack to understand precisely what happened and which devices or applications were affected, then use that information to contain the damage and to recognize and shut down future attacks. When troubleshooting, it would be nice to replay the problem traffic, or a specific data flow, to see exactly where the issue occurred. That hard data answers the “is it the network or the application?” question.

Both rely on visibility. As David Goeckeler, Cisco’s Senior Vice President and General Manager of Networking and Security, says, “You can’t stop what you can’t see.” Tetration gives you that visibility because information is collected on all data, not just samples. On the security side, you want to identify an attack as soon as possible, so you need visibility across as many attack vectors as possible.

The gold standard in troubleshooting is to be proactive rather than reactive. Because Tetration sees all traffic, it can create baselines of expected flows and patterns. And because it can search through billions of flows in less than a second, it can quickly recognize anomalies, at which point a human can be brought in to decide how to respond. The system will suggest remediations and learn from the decisions humans make; the eventual goal is for it to remediate problems on its own. I think we all recognize that will require a level of trust from the human operators!
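
To illustrate the baselining idea, here’s a deliberately simple sketch: build an “expected traffic” model from history, then flag conversations that fall outside it. The helpers and data are hypothetical, and a real analytics engine would score deviations statistically rather than by set membership.

```python
from collections import Counter

def build_baseline(history):
    """Count how often each (src, dst, dport) conversation appears in
    historical flows, forming a crude model of 'expected' traffic."""
    return Counter((f["src"], f["dst"], f["dport"]) for f in history)

def find_anomalies(current, baseline):
    """Flag conversations never seen in the baseline -- the simplest
    possible stand-in for real statistical anomaly detection."""
    return [f for f in current
            if (f["src"], f["dst"], f["dport"]) not in baseline]

history = [{"src": "10.1.1.10", "dst": "10.1.2.20", "dport": 443}] * 100
today = [{"src": "10.1.1.10", "dst": "10.1.2.20", "dport": 443},
         {"src": "10.1.1.10", "dst": "198.51.100.7", "dport": 4444}]  # new!

for f in find_anomalies(today, build_baseline(history)):
    print("anomalous flow:", f)
```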

Unless you’re in one of those industries that needs to capture everything, the current version is probably overkill for a data center with fewer than 5,000 endpoints. Note that the endpoints do not have to be in one single data center. So long as you have IP connectivity, you can monitor more than one data center by creating encrypted tunnels back to the Tetration cluster. Cisco’s cluster in San Jose, for example, monitors the company’s Texas and North Carolina data centers. WAN bandwidth usage is generally less than a gigabyte. Cisco foresees smaller systems for smaller data centers in the future.

Bottom line: If your data is critical – if your organization’s services can’t go down – then the information and capabilities provided by this system can help. If you’re rolling out software-defined controllers of any stripe, this will improve your outcome. In my opinion, it’s a large step toward the future of data centers and networking in general. We’d be happy to have a deeper discussion of whether Tetration might be right for you. Just reach out.
