Replicating at Speed

Author
Peter Welcher
Architect, Operations Technical Advisor

This blog is abstracted from one small aspect of two different real-world WAN design discussions, each part of a consulting engagement. Both involved a software development company and replication over the WAN. Let’s look at what that entails at a high level.

Both companies produce well-known applications they sell in various forms to end customers. Their development process entails checking in code, which feeds a heavily automated build process, followed by testing (automated and / or human).

Due to growth and acquisitions, both organizations are globally distributed, albeit to different degrees. The locations reflect the acquisition history, so may appear somewhat random to an outsider (including me!).

Coding, testing, and customer support personnel may be scattered across the company sites. Projects get completed, and then there are people in location X who can help location Y get a product release done … I do suspect entropy has something to do with this as well. Testers and post-release customer support need access to binary images, to test for bugs or try to reproduce reported bugs.

To leverage that distributed talent, both companies replicate their code and binary trees. In both cases, the indivisible “atomic” units of replication run up to 100 GB in size, or bigger. In one case, a programmer commits some code (which could be one line of changed code) and that might trigger a build and replication. Multiply that times N programmers, pounding code and triggering builds when near a product or release ship date. That could add up to a lot of replication traffic!

We will use the term “batch traffic” for replication, backup, and other large data transfers like this. We assume they must complete within some time period, per an SLA. Such traffic is (hopefully) mostly insensitive to being throttled back by drops, so it can likely be treated as low-priority batch traffic from a QoS point of view.

The Replication Challenge

How do you get 100 GB out to say 5 to 10 or more sites, within the U.S. or globally, and do so efficiently?

Solving that actually starts with a business problem: how fast does the company need replication to go?

Reworded: What is the business requirement concerning replication completion time? Does this differ for different forms of replication? What about backup and other bulk transfers?

Note that developers or testers may be twiddling their thumbs, waiting for replication to complete so they can fire up the program and test it. They may be stuck testing an old version until replication completes. Or doing other tasks while waiting.

Replication time is tightly tied to bandwidth, which in turn costs money. One way of viewing this situation is as balancing wait time (hours of programmer / tester time wasted or used inefficiently) against the cost of higher-speed networking. Programmer / tester frustration (mean time to quitting?) and other factors might also come into play.
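To make that trade-off concrete, here is a back-of-the-envelope sketch in Python. Every number in it is a made-up placeholder; plug in your own headcount, wait time, and loaded labor cost, then compare the result to the monthly price delta for a faster circuit.

```python
# Back-of-the-envelope wait-cost estimate. Every number below is a made-up
# placeholder; substitute your own headcount, wait time, and labor cost.
engineers_waiting = 20          # people blocked on replication at a given site
hours_waiting_per_day = 1.5     # average wasted hours per engineer per day
loaded_cost_per_hour = 100      # USD, fully loaded
workdays_per_month = 21

monthly_wait_cost = (engineers_waiting * hours_waiting_per_day
                     * loaded_cost_per_hour * workdays_per_month)
print(f"${monthly_wait_cost:,.0f} per month of waiting time")   # $63,000 per month
# Compare that against the monthly price delta for a faster circuit.
```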

The other side of that is, if every little change triggers replication, do you want to gate that somehow? Do you abort prior replications? Do you get overlapping replications, where one may complete but then get rapidly replaced by the next one triggered? But then does replication ever finish?

To maintain some reality about costs and bandwidth, a certain amount of “replication engineering” (my term for it) may be needed. Perhaps commits trigger a build once the last build completes. The same might apply to replication. What else can be done to reduce the volume of replication, limit how many sites a given atomic unit goes to, make atomic units smaller, etc.?
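As a rough illustration of gating the triggers, here is a minimal Python sketch that coalesces rapid-fire replication triggers: at most one replication runs at a time, and any triggers that arrive mid-run collapse into a single follow-up pass. The class and names are mine for illustration, not any particular build or replication tool’s.

```python
import threading
import time

class ReplicationGate:
    """Coalesce rapid-fire triggers: run at most one replication at a time,
    and collapse any triggers that arrive mid-run into one follow-up pass."""

    def __init__(self, replicate_fn):
        self._replicate_fn = replicate_fn   # callable that performs one replication pass
        self._lock = threading.Lock()
        self._running = False
        self._pending = False

    def trigger(self):
        with self._lock:
            if self._running:
                self._pending = True        # remember that we owe one more pass
                return
            self._running = True
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            self._replicate_fn()
            with self._lock:
                if self._pending:
                    self._pending = False   # N queued triggers collapse into one pass
                    continue
                self._running = False
                return

# Toy usage: five triggers in quick succession result in at most two passes.
gate = ReplicationGate(lambda: time.sleep(1))   # stand-in for the real replication job
for _ in range(5):
    gate.trigger()
time.sleep(2.5)   # give the background passes time to finish in this toy example
```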

The business requirement reflects what choice the business makes in the speed versus wait time trade-off (and likely how much energy you put into making replication efficient or motivating the team responsible for build and replication to improve the processes).

Doing Some Math

Here is the theoretical minimum time to transfer 100 GB:

  • On a 100 Mbps link, 8000 seconds = 2.2 hours
  • On a 1 Gbps link, 800 seconds = 13.3 minutes
  • On a 10 Gbps link, 80 seconds = 1.3 minutes

Those are approximate theoretical numbers (and I’m ignoring the whole 1024 versus 1000 thing).
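If you want to redo that arithmetic with your own sizes and link speeds, here is the calculation behind the list above as a few lines of Python (decimal units, no protocol overhead):

```python
# Theoretical minimum transfer time, decimal units, no protocol overhead.
def transfer_time_seconds(size_gigabytes: float, link_gbps: float) -> float:
    bits = size_gigabytes * 8e9            # GB -> bits
    return bits / (link_gbps * 1e9)        # bits / (bits per second)

for gbps in (0.1, 1, 10):
    t = transfer_time_seconds(100, gbps)
    print(f"{gbps:>4} Gbps: {t:6.0f} s = {t / 60:6.1f} min")
# Matches the list above: 8000 s, 800 s, and 80 s respectively.
```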

No actual network will be that fast, for a number of reasons (server speed, TCP slow start, contention for bandwidth, congestion and packet loss, latency and TCP stack, etc.).

Also note that rsync and most replication tools transmit only changes (in some form, to some degree), which might be less than the full 100 GB. Or not. That certainly affects speed calculations like those above. It suggests that knowing the typical replication volume (number of bytes actually transferred) would be useful, even if all that tells you is that the volume varies a lot.

For replication, note that when you create a new code branch and do a build, the result could clog your links or take a while to replicate, since it is handled as a full copy: nothing was there on the far end(s) previously.

Concerning how latency and packet loss affect throughput, see also Mathis’ formula (although newer TCP stacks may do better — I haven’t seen research on that topic, not that I’ve looked that hard for it).
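For a feel for what Mathis’ formula predicts, here is the usual approximation in Python: single-flow TCP throughput is bounded by roughly MSS / (RTT × √p). The RTT and loss numbers below are illustrative, not measurements.

```python
from math import sqrt

# Mathis et al. approximation: single-flow TCP throughput <= MSS / (RTT * sqrt(p)),
# with MSS in bytes, RTT in seconds, and p the packet loss probability.
# Models classic Reno-style congestion control; newer stacks may beat this.
def mathis_throughput_bps(mss_bytes=1460, rtt_s=0.080, loss=0.0001):
    return (mss_bytes * 8 / rtt_s) / sqrt(loss)

# Illustrative values: 80 ms RTT and 0.01% loss cap one flow at roughly 14.6 Mbps,
# no matter how fast the physical link is.
print(f"{mathis_throughput_bps() / 1e6:.1f} Mbps")
```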

The most data you could transport in 24 hours on a 1 Gbps link would be 86,400 Gbits or 10,800 GB, or 10.8 TB. Or 108 replications of a full 100 GB of data.
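That daily ceiling is the same arithmetic, worked out explicitly:

```python
# Daily ceiling on a 1 Gbps link, again ignoring overhead and contention.
seconds_per_day = 24 * 3600                  # 86,400 s
daily_bits = 1e9 * seconds_per_day           # 8.64e13 bits
daily_gb = daily_bits / 8e9                  # 10,800 GB = 10.8 TB
print(daily_gb, daily_gb / 100)              # 10800.0 GB, i.e. 108 full 100 GB replications
```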

Further math would require some idea of the efficiency of the replication tool, as in the typical amount of data actually transmitted. Whether the tool runs parallel threads is another consideration (rsync and some of the popular file-sharing applications do this, to use the link more efficiently and offset some of the effects of latency).

Yet another factor to consider would be the amount of time it takes the replication tool to compare block checksums (or whatever change detection mechanism it uses). Replication tools do not directly compare bytes, since that would be slower than just transmitting the whole file or set of files.
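For the curious, here is a deliberately simplified Python sketch of block-level change detection. It is not any particular tool’s actual algorithm (rsync, for one, uses rolling checksums and handles insertions far more cleverly), but it shows why a small change need not mean re-sending the whole image:

```python
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024    # 4 MB blocks; real tools choose block sizes more cleverly

def block_digests(path):
    """Per-block digests of a local file."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests

def changed_blocks(local_digests, remote_digests):
    """Indexes of blocks whose digests differ or that only exist locally."""
    return [i for i, d in enumerate(local_digests)
            if i >= len(remote_digests) or remote_digests[i] != d]

# Only the changed blocks (plus the digest exchange itself) need to cross the WAN,
# which is why a one-line code change need not re-send the whole 100 GB image.
```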

Delivering the Data

The other factor is how you get the replication data there. Replication is usually TCP-based, so fairly robust. Some might opt for IPsec VPN over the Internet. MPLS can be costly, especially in countries with protected / monopoly telecom companies. However, don’t assume Internet plus VPN is cheaper. Like many things in networking, “it depends”.

The private-WAN versus VPN over Internet comparison has one cost wrinkle. The crypto behind IPsec VPN requires a “real router” or firewall. If you’re considering speeds of up to, say, 1 Gbps, the router cost may not be that different from a few months’ worth of circuit costs. If you’re considering higher speeds, and if you can use an L3 switch instead of a router along with a private circuit, the cost savings may be substantial. See my prior Router Versus Switch blog on this.

Capacity Planning for Replication

Overall, if you mix replication and other traffic on the WAN, you’ll probably want QoS to protect your users’ business and Internet traffic. You will probably have three requirements:

  • Good performance for VoIP, video, and business apps, be they over WAN circuits, MPLS or IPsec VPN (backup or primary link)
  • Good performance for SaaS and Internet-based applications
  • Rapid-enough completion of replication and batch traffic jobs

If you are just using WAN circuits, QoS guaranteeing a small percentage of bandwidth to batch traffic may suffice.
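One way to sanity-check such a guarantee is to feed it back into the earlier math: the batch class’s guaranteed share gives you a worst-case completion time, assuming batch never gets more than its guarantee. A quick sketch (your link speeds and percentages will differ):

```python
# Worst-case completion time if the batch class only ever gets its guaranteed share.
def batch_completion_hours(size_gb, link_gbps, batch_class_pct):
    guaranteed_bps = link_gbps * 1e9 * batch_class_pct / 100.0
    return (size_gb * 8e9) / guaranteed_bps / 3600

# 100 GB over a 1 Gbps link with 10% guaranteed to batch:
print(f"{batch_completion_hours(100, 1, 10):.1f} hours")   # about 2.2 hours, worst case
```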

If you’re mixing Internet and multi-point VPN with replication and / or business traffic, the QoS can get rather complex. I’m not going to attempt blogging about that. I do know a consultant who can help you with that.

How do you tell if you need more bandwidth for replication?

You may want to look at graphs of completion times, or some metric such as Top-N 95th percentile completion times for replication / batch transfers, to spot batch situations that are trending toward needing more bandwidth. If transfers are taking too long, add bandwidth to the link in question. Don’t tweak the batch QoS queue.

Side note 1: One of my doing-QoS-and-preserving-your-sanity recommendations is that you don’t change any per-class percentages; you just increase link bandwidth. Otherwise you have a bazillion one-offs and a management nightmare (What is the current traffic level? What was the percentage allocated times the link bandwidth? The operational mean time to useful results is hours.)

Side note 2: Our Terry Slattery has converted me into a fan of the 95th percentile. See my prior blog on this, which has links to his blogs on the topic. I think of the 95th percentile as the level where the statistic in question is “this big or bigger 5% of the time”. For traffic over 24 hours, the 95th percentile is a measure of how busy a link is during its busiest periods in that 24 hours. If you like that idea, you might consider also using 95th percentiles to get a handle on the various QoS queue traffic levels and which queues are “running hot”.
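Computing that is nearly a one-liner if you can export the samples; Python’s standard library does it directly (statistics.quantiles with n=20 gives the 95th percentile as its last cut point):

```python
from statistics import quantiles

def percentile_95(samples):
    # quantiles(..., n=20) returns 19 cut points; the last one is the 95th percentile
    return quantiles(samples, n=20)[-1]

# e.g. feed it 288 five-minute samples of a queue's bps over 24 hours to see
# whether that queue is "running hot" relative to its allocation.
```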

The overall challenge for capacity planning is in coming up with one or a couple of metrics that measure when you need more bandwidth, and how much. Staring at plots of traffic or QoS is not efficient at scale.

I’m throwing these ideas out there in the hope they help. I haven’t found much discussion of these broad topics: capacity planning with QoS, and sizing for replication at speed. That’s probably because there are so many variables that can be tuned.

As for actual network management (NM) tools that will let you do this sort of math and reporting easily: good luck with that. Data export and some side calculations are starting to look like part of the answer. One approach might be to use an NM tool to gather interface and per-queue data, then use its API (assuming it has one) to access the data and munge it. Reports, as well as feeding Grafana for graphs, might be the output end of that.
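As a sketch of what that could look like, the following Python pulls per-queue samples from a hypothetical NM REST API, computes each queue’s 95th percentile, and writes a CSV that a report or Grafana data source could consume. The URL, field names, and authentication below are placeholders, not any specific product’s API.

```python
import csv
from statistics import quantiles

import requests   # assumes the NM tool exposes an HTTP API

# Placeholder endpoint; the URL, JSON fields, and auth are illustrative, not a real product's API.
NM_API = "https://nm.example.com/api/interfaces/queues"

def queue_95th_percentiles(api_token):
    """Fetch per-queue traffic samples and return each queue's 95th percentile (bps)."""
    resp = requests.get(NM_API, headers={"Authorization": f"Bearer {api_token}"}, timeout=30)
    resp.raise_for_status()
    results = {}
    for queue in resp.json():            # assumed shape: [{"name": ..., "bps_samples": [...]}, ...]
        samples = queue["bps_samples"]
        if len(samples) >= 2:
            results[queue["name"]] = quantiles(samples, n=20)[-1]
    return results

def write_report(stats, path="queue_p95.csv"):
    """Write a simple CSV that a report or a Grafana CSV/SQL data source could consume."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["queue", "p95_bps"])
        for name, p95 in sorted(stats.items()):
            writer.writerow([name, round(p95)])
```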

Comments

Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!

—————-

Hashtags: #CiscoChampion #TechFieldDay #TheNetCraftsmenWay #NetworkTraffic

Twitter: @pjwelcher

Disclosure Statement
Cisco Certified 20 Years

NetCraftsmen Services

Did you know that NetCraftsmen does network / datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at info@ncm2020.ainsleystaging.com.
