What is NVMe and How Does It Impact My Network?

Among other things, for some time now I’ve been designing data centers with Nexus 93180 or 93240 switches, leveraging their 40 / 100 Gbps ports and inexpensive 100 Gbps copper cabling to interconnect them. That allows building relatively inexpensive data centers with plenty of short-term bandwidth and solid growth headroom. These designs might be based on just vPC / VLANs rather than VXLAN, keeping things simple for smaller organizations.

I’m seeing increasing levels of network-based storage, generally NFS, CIFS, and / or iSCSI, with occasional Fibre Channel (FC) or FCoE. The latter two seem to be more common in very large organizations where performance is (or used to be) a major concern.

Some recent designs I’ve done have had to accommodate NetApp storage attached via up to 8 x 40 Gbps links, and Dell modular chassis / fabric with 16 x 100 Gbps uplinks (customer’s choice). That makes the Nexus 93240 attractive (more high-speed ports for little price difference versus the 93180), as well as the Nexus 9336C or 9364C. Providing the NetApp and Dell connections does consume more of the 40 / 100 Gbps ports and fewer of the 10 or 25 Gbps ports.

Conclusion: The Nexus 9300 line is at a sweet spot for modest data centers with high-speed port needs like those just listed.

Caution: That’s at present; speed is a fast-moving target.

I’ve been happily thinking there is a fair amount of bandwidth available for data centers. However, life experience says that if the bandwidth is there, someone will figure out how to consume it. It may be time to re-adjust one’s definition of “lots of bandwidth”.

The rest of this blog lightly covers NVMe / NVMeoF, which is one factor that may consume those high-speed data center ports.

This blog will present some basic information about NVMe and NVMeoF to get you started: what they are and how they differ. The intent is to help you get up to speed on NVMe and how it impacts your network and design. Reference links provide a way for you to deepen that knowledge.

The PacketPushers video with J. Metz talking about NVMeoF is a good resource covering the topic in significantly more depth. One take-away is that three NVMeoF drives may be able to easily consume 100 Gbps. As this technology gets adopted (and it’s coming in fast), we’re going to need to engineer some very high speed, low latency networks!
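To put rough numbers on that take-away, here is a back-of-the-envelope sketch. The per-drive throughput figures are assumptions chosen for illustration (not measurements or vendor numbers), and protocol overhead is ignored.

```python
import math

def drives_to_fill_link(link_gbps: float, drive_gbytes_per_s: float) -> int:
    """Smallest number of drives whose combined throughput meets or exceeds the link rate."""
    drive_gbps = drive_gbytes_per_s * 8  # convert gigabytes/s to gigabits/s
    return math.ceil(link_gbps / drive_gbps)

# Assumed per-drive sequential throughput in GB/s (illustrative only).
for rate in (3.5, 5.0, 7.0):
    n = drives_to_fill_link(100, rate)
    print(f"{rate} GB/s per drive: {n} drive(s) needed to fill a 100 Gbps link")
```

Under these assumptions, a handful of drives is enough to fill a 100 Gbps port, which is consistent with the point made in the video.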

Acronyms

Let’s start with the acronyms.

NVMe stands for Non-Volatile Memory Express. It is one step in the evolution of solid state disks / flash storage. NVMe is a specification for how a server accesses storage over a PCIe bus, treating the storage more like memory. The point of doing so is speed, plus a standard interface that vendors can write code to.

NVMeoF adds “Over Fabric”, meaning over a network. It represents the disaggregation of NVMe: NVMeoF provides very fast access to networked storage.

What’s Different?

There are some key differences from traditional storage.

With NVMe and PCIe:

A standard storage mechanism with very high throughput.

And with either NVMe or NVMeoF:

Flash / SSD means no spinning media. You can write to it very quickly, with very low latency. Spinning media required queuing read / write operations in a single queue, and had head seek time (moving the physical read-write head to the right position relative to the disks).
With flash, you can do concurrent reads and writes, so you can have roughly 64,000 queues instead of 1.

NVMeoF:

Scaling, disaggregated networked storage!
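The queueing difference noted above can be made concrete with a small sketch. It compares the AHCI / SATA model (one queue of 32 commands) with the upper limits in the NVMe specification (up to 65,535 I/O queues of up to 65,536 commands each). The class and names are hypothetical, and real devices typically expose far fewer queues than the specification allows.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QueueModel:
    name: str
    queues: int   # number of I/O queues
    depth: int    # commands per queue

    @property
    def max_outstanding(self) -> int:
        """Upper bound on commands that can be in flight at once."""
        return self.queues * self.depth

AHCI = QueueModel("AHCI / SATA", queues=1, depth=32)
NVME = QueueModel("NVMe (specification maximum)", queues=65_535, depth=65_536)

for model in (AHCI, NVME):
    print(f"{model.name}: {model.queues} queue(s) x {model.depth} = {model.max_outstanding:,}")
```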

Network Dependencies

Storage over networks is highly dependent on latency. Network / Fibre Channel latency did not matter quite as much with spinning disks, since they had inherent latency (head seek time, etc.). If you think of NVMe storage as more like RAM, it’s a whole new ballgame.
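A quick calculation shows why the network matters more once the media gets fast. All three latency figures below are assumed round numbers for illustration, not measurements.

```python
def network_share(media_latency_us: float, network_rtt_us: float) -> float:
    """Fraction of total I/O latency contributed by the network round trip."""
    return network_rtt_us / (media_latency_us + network_rtt_us)

NETWORK_RTT_US = 50.0   # assumed network round trip
HDD_US = 5000.0         # assumed spinning disk access time (seek + rotation)
FLASH_US = 100.0        # assumed NVMe flash access time

print(f"Spinning disk: network is {network_share(HDD_US, NETWORK_RTT_US):.1%} of total latency")
print(f"NVMe flash:    network is {network_share(FLASH_US, NETWORK_RTT_US):.1%} of total latency")
```

With these assumed numbers, the same network goes from about 1% of the total to about a third of it.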

With storage via the network, you’re going to want extremely low latency, and very high bandwidth so you can move lots of data very quickly. That gets challenging. You also don’t want packet drops.

There are five NVMeoF transports: Fibre Channel (the FC-NVMe standard), InfiniBand, RoCE, iWARP, and TCP. The first two are alternative forms of networking. RoCE version 2 is a form of remote direct memory access (RDMA) over UDP. iWARP is a form of RDMA over TCP. You don’t want to mix those two in a data center, since they don’t interoperate.
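One way to keep the transports straight is as a small lookup table. The sketch below is only an illustration of the list above: the keys, class, and helper function are hypothetical names, and the helper simply flags a design that uses both RoCE and iWARP.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transport:
    name: str
    carried_over: str  # what the transport rides on
    rdma: bool         # True if it is an RDMA-style transport

TRANSPORTS = {
    "fc":         Transport("FC-NVMe", "Fibre Channel", rdma=False),
    "infiniband": Transport("InfiniBand", "InfiniBand fabric", rdma=True),
    "roce_v2":    Transport("RoCE v2", "UDP/IP over Ethernet", rdma=True),
    "iwarp":      Transport("iWARP", "TCP/IP over Ethernet", rdma=True),
    "tcp":        Transport("NVMe over TCP", "TCP/IP over Ethernet", rdma=False),
}

def mixed_ethernet_rdma(host_transports: dict[str, str]) -> bool:
    """True if the hosts use both RoCE and iWARP, which do not interoperate."""
    used = set(host_transports.values())
    return {"roce_v2", "iwarp"} <= used

# Hypothetical host inventory: host name -> transport key.
hosts = {"esx01": "roce_v2", "esx02": "roce_v2", "array01": "iwarp"}
print("Mixed RoCE / iWARP design:", mixed_ethernet_rdma(hosts))
```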

Design Considerations

From a design perspective, the question immediately comes up: do we imitate a SAN, and have a dedicated storage network? Some sites have certainly done that with their Cisco-based FC or FCoE networks. Others have built more unified networks.

Any time you run storage over Ethernet, you’re going to want lossless behavior. That means engineering in enough bandwidth, with low or no oversubscription.
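Oversubscription is easy to check with arithmetic: total host-facing bandwidth divided by total uplink bandwidth. The port counts below are a hypothetical leaf switch layout used only as an example.

```python
def oversubscription_ratio(downlinks, uplinks) -> float:
    """Ratio of host-facing bandwidth to uplink bandwidth.

    Each argument is a list of (port_count, speed_gbps) tuples.
    A result of 1.0 or less means the uplinks are not oversubscribed.
    """
    down = sum(count * speed for count, speed in downlinks)
    up = sum(count * speed for count, speed in uplinks)
    return down / up

# Hypothetical leaf: 48 x 25 Gbps server ports, 6 x 100 Gbps uplinks.
print(oversubscription_ratio([(48, 25)], [(6, 100)]))  # 1200 / 600
# Same leaf with servers attached at 10 Gbps instead.
print(oversubscription_ratio([(48, 10)], [(6, 100)]))  # 480 / 600
```

Note that storage ports attached at 40 or 100 Gbps count on whichever side of the ratio they sit, so the result shifts quickly as those links are added.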

Some of the functionality from the Data Center Bridging (DCB) standards that FCoE leveraged is used for this, e.g. with RoCE and iWARP. Specifically:

PFC (Priority Flow Control)
ECN (Explicit Congestion Notification), carried over IP

NVMeoF (TCP) is a bit different, with its own advantages and disadvantages. To provide smoother flow control (and fewer drops), NVMeoF over TCP can use a TCP variant, DCTCP (Data Center TCP).

The goal of DCTCP is to adapt and prevent drops, since retransmissions really slow data transfer down. The key idea is to use congestion notification (ECN marks) to proactively slow TCP transmission down to the throughput that current network traffic levels will support. That eases concerns about switch buffer depth, with the aim of functioning even on a “zero buffer” network.
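The core of DCTCP, as described in RFC 8257, is a running estimate (alpha) of the fraction of packets that were ECN-marked, and a window reduction proportional to that estimate. The sketch below shows just that piece. It leaves out the normal additive window growth and everything else a real TCP stack does, and the floor of one segment is an assumption made for this example.

```python
def dctcp_update(alpha: float, cwnd: float, marked: int, total: int, g: float = 1 / 16):
    """Apply one DCTCP update for a window of ACKed packets.

    alpha  : running estimate of the fraction of ECN-marked packets
    cwnd   : congestion window, in segments
    marked : packets in this window that carried an ECN mark
    total  : packets ACKed in this window
    g      : estimation gain (1/16 is the value suggested in RFC 8257)
    """
    frac = marked / total if total else 0.0
    alpha = (1 - g) * alpha + g * frac
    if marked:
        # Back off in proportion to the congestion estimate, not by a fixed half.
        cwnd = max(1.0, cwnd * (1 - alpha / 2))
    return alpha, cwnd

alpha, cwnd = 0.0, 100.0
for marked in (0, 8, 16, 16, 0):  # marked packets out of 16 in successive windows
    alpha, cwnd = dctcp_update(alpha, cwnd, marked, 16)
    print(f"marked={marked:2d}/16  alpha={alpha:.4f}  cwnd={cwnd:.2f}")
```

Light marking trims the window slightly, while sustained marking pushes alpha toward 1 and the reduction toward the classic halving.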

A Related Story

I recently had a discussion with a customer and a VMware expert. The customer’s BU wanted Mellanox switches for very low latency to support RoCE storage for a VMware vSAN cluster with intense IO. Cisco’s published Nexus 9K latency numbers are mildly higher than the Mellanox numbers, which raises the question of the precise testing conditions, etc.

To cut to the chase, the VMware person’s comment was along the lines that the difference between 300 nanoseconds and under 1000 nanoseconds is likely to be insignificant compared to other performance factors.
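For a feel of the scale involved, the per-hop difference can be compared against an end-to-end I/O time. The hop count and the 100 microsecond I/O latency below are assumptions picked for illustration. Only the 300 and 1000 nanosecond figures come from the discussion above.

```python
def added_latency_share(per_hop_delta_ns: float, switch_traversals: int, io_latency_us: float) -> float:
    """Extra switch latency as a fraction of an assumed end-to-end I/O latency."""
    added_us = per_hop_delta_ns * switch_traversals / 1000  # ns -> us
    return added_us / io_latency_us

delta_ns = 1000 - 300   # worst-case per-switch difference, in nanoseconds
traversals = 6          # assumed: 3 switch hops each way (leaf, spine, leaf)
io_us = 100.0           # assumed end-to-end I/O latency, in microseconds

share = added_latency_share(delta_ns, traversals, io_us)
print(f"Extra switch latency is about {share:.1%} of a {io_us:.0f} us I/O")
```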

Of course, YMMV.

References

This section contains some references I liked and that you might find useful.

I’ll note in general that CiscoLive OnDemand is a great resource for getting started on any new technology that Cisco touches. Slides from prior events are free to access. Yes, you won’t find as much for older technology or products with relatively small / possibly fading markets, e.g. WAAS — which I had cause to go looking for recently. Hey, the speakers have to draw an audience, or they become non-speakers.

I’ve packaged up a CiscoLive OnDemand search for you in the list below.

Note that one of the CiscoLive presentations covers the impact of all-flash arrays on the (SAN) network. Another discusses the impact on HCI (Hyper Converged Infrastructure).

Comments

Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!


Twitter: @pjwelcher
