BGP Traffic Engineering
One of my recent blogs alluded to inbound QoS. This topic came up in a recent consulting engagement. After considerable thought and some research (sounds better than “I googled some things”, doesn’t it?), I’ve concluded the following:
TL;DR: Inbound QoS is rather limited in what it can do for you.
Let’s take a look at what you can and cannot control as far as QoS for inbound traffic.
Unless you can afford large amounts of mostly unused bandwidth, sheer bandwidth does not necessarily solve QoS issues. Bursts of traffic will likely occur, and when they do, congestion will drop random packets. That typically happens on outbound transmission to the next device.
Let’s focus on a WAN or Internet setting, as that’s where most people desire inbound QoS.
The fun thing here is that any congestion and inbound packet drops will be occurring at the MPLS WAN or ISP egress router’s interface. Which is outside your control, under most current business models. Good luck even getting statistics on what that egress router is dropping on your interface, or even finely grained utilization data, depending on your ISP.
I’ll spare you the full rant here. What I’ve seen in the way of business WAN or Internet reporting by providers … I’ll be polite and say that most of it is “inadequate”, “pointless”, “pathetic”, not to mention “useless”. Total number of bytes in 24 hours, or data points once per hour …
If you’re an ISP or WAN provider, don’t you want customers to see they have congestion and therefore need to buy more bandwidth? Hourly averages hide that.
Getting back to inbound QoS … Here’s what I do believe works, to a degree: inbound traffic policing. If you pick a class of TCP-based traffic and drop the excess, the senders will more or less throttle back, keeping that class’s traffic level around the policing level. Note that this is policing to an absolute, fixed traffic level, e.g. 30% of the link.
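To make the policing idea concrete, here is a minimal token-bucket sketch in Python. It is an illustration only, not any vendor's implementation: the class name, the 100 Mbps link speed, and the 15000-byte burst size are all assumptions I picked for the example. It only shows the drop decision; the TCP throttle-back happens at the sender in response to those drops.

```python
class TokenBucketPolicer:
    """Single-rate policer sketch: traffic above the configured rate is dropped."""

    def __init__(self, rate_bps: float, burst_bytes: int):
        self.rate_bytes_per_s = rate_bps / 8.0
        self.burst_bytes = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full bucket
        self.last_time = 0.0

    def allow(self, now: float, packet_bytes: int) -> bool:
        # Refill tokens for the elapsed time, capped at the burst size.
        elapsed = now - self.last_time
        self.last_time = now
        self.tokens = min(self.burst_bytes,
                          self.tokens + elapsed * self.rate_bytes_per_s)
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True   # conforming: forward the packet
        return False      # exceeding: drop the packet


# Police one class to 30% of a hypothetical 100 Mbps link.
LINK_BPS = 100e6
policer = TokenBucketPolicer(rate_bps=0.30 * LINK_BPS, burst_bytes=15000)

# Offer 1500-byte packets at full line rate for one second.
packet_bytes = 1500
interval = packet_bytes * 8 / LINK_BPS      # seconds between packets
n_packets = int(1.0 / interval)
forwarded = sum(policer.allow(i * interval, packet_bytes)
                for i in range(n_packets))
forwarded_share = forwarded / n_packets
print(f"forwarded {forwarded} of {n_packets} packets ({forwarded_share:.1%})")
```

The sender here ignores the drops, which is the UDP-style worst case. The forwarded share still lands near the policed 30%, because the policer enforces a fixed ceiling regardless of what else is on the link.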
For UDP traffic and general IP traffic, that generally doesn’t work, unless the UDP-based application has some built-in awareness of packet loss and pacing. If someone decides to trash your Internet connection by spewing UDP at you, about all you can do is call your ISP and say the magic acronyms “DoS / DDoS”, unless you have one of the tools that helps with that. If that’s happening on your WAN, you need to look at the traffic source, etc. and fix the problem.
And that’s about it! You can’t do shaping inbound on Cisco devices. Even if you could do it, it’s not clear what problem it would solve.
You could get creative. One approach I’ve seen attempted is to use a VRF (Cisco) or virtual router (Juniper) to pair the inbound interface with an outbound one. You then apply shaping and / or queues and priorities / percentages of bandwidth to the traffic on that outbound interface (working around being unable to do them on the inbound interface).
The logic is that by converting an inbound problem to an outbound one, we can apply all the good tools for outbound traffic. Like percentages and shaping. All true! Yes, we can configure this.
BUT: There’s still the question of whether doing so actually does what you want it to do.
Remember that outbound QoS only applies when there is more traffic than can be sent immediately, so that there is a queue of packets waiting to be forwarded. QoS is about managing that queue of packets that have yet to be sent. Which ones get sent first or next, and which ones are more likely to be dropped to preserve queue space for higher priority packets.
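A toy model may help here. The sketch below is my own simplification, not how any particular router implements queuing: a two-class queue where high-priority packets are sent first and low-priority packets are refused once the queue passes a threshold, which preserves space for the high-priority class. The class name, capacity, and threshold are hypothetical.

```python
from collections import deque


class TwoClassQueue:
    """Toy outbound queue: strict priority on send, earlier drop for low priority."""

    def __init__(self, capacity: int, low_limit: int):
        self.capacity = capacity    # total packets the queue can hold
        self.low_limit = low_limit  # depth at which low-priority packets are dropped
        self.high = deque()
        self.low = deque()

    def enqueue(self, packet: str, high_priority: bool) -> bool:
        depth = len(self.high) + len(self.low)
        if high_priority:
            if depth >= self.capacity:
                return False        # queue completely full: drop
            self.high.append(packet)
            return True
        if depth >= self.low_limit:
            return False            # keep the remaining space for high priority
        self.low.append(packet)
        return True

    def dequeue(self):
        # High-priority packets always go first.
        if self.high:
            return self.high.popleft()
        if self.low:
            return self.low.popleft()
        return None                 # nothing waiting: QoS has nothing to decide


q = TwoClassQueue(capacity=10, low_limit=6)
bulk_accepted = sum(q.enqueue(f"B{i}", high_priority=False) for i in range(8))
x_accepted = sum(q.enqueue(f"X{i}", high_priority=True) for i in range(3))
send_order = [q.dequeue() for _ in range(4)]
print(bulk_accepted, x_accepted, send_order)
```

Note that everything interesting happens only because packets are waiting. If each packet were sent the moment it arrived, the queue would stay empty and neither the priority nor the drop threshold would ever come into play.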
Suppose then that you’re feeding inbound traffic through a VRF to an outbound physical interface with QoS applied.
Suppose you’re doing QoS like most sites do, using percentages of bandwidth, which may more or less equate in coding to pro-rated opportunities to transmit. Suppose high-priority app X has a traffic lull and low-priority bulk app B is getting more than its share of bandwidth.
If and when the next spurt of priority app X traffic hits its outbound queue, that may trigger some drops of bulk app B. If app B is TCP based, that will signal the sender to slow down.
The logical snag here is that if app B is blasting away as fast as it can, how does the app X traffic make it into the router to hit the outbound queue and cause some app B drops, thereby sending an indirect signal to the sender to slow down on sending app B? Whew!
If we’re talking WAN here, the answer might be “the provider’s egress QoS”. If Internet, however, egress QoS is not an expected feature. So, does the app X traffic even make it into the router? Luck of the queueing draw? But if we’re talking 1 Gbps in and 1 Gbps out, there’s going to be no ensuing congestion on the egress interface, hence no drops …
In the case in question, bulk app B was replication over VPN tunnels, mixed with general Internet traffic. The latter could (maybe) be classified and marked and assigned high or low priority QoS by the router. The app B traffic ditto — or was already marked (DSCP bits in the IP header).
Conclusion #1: You really have to police the inbound replication to preserve a bandwidth pool for other traffic. If the other traffic steadily fills its portion of the bandwidth during business hours, then the paired outbound QoS probably won’t be very helpful.
When you have a mix of VPN tunnel and Internet traffic, coming up with effective QoS is hard.
If you separate the VPN tunnel and the Internet on different physical links (which adds cost for having two Internet links), then what I have recommended elsewhere is to apply QoS and shaping to what goes into the tunnel at the other end.
The challenge in that case comes when multiple sites are sending traffic to a common site. One would like to allow each site to use all the unused bandwidth at the receiving end. I don’t know how to make that work. If there are three tunnel source sites, you could have them cap outbound traffic to say 1/4 of the receiving end’s bandwidth, leaving some receiving bandwidth for other traffic, and live with some unused bandwidth.
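The trade-off in that static cap is easy to put numbers on. The arithmetic below assumes a 1000 Mbps receiving link and assumes every active site sends right at its cap; both are illustrative assumptions, as is the helper name.

```python
from fractions import Fraction


def idle_receive_mbps(receive_mbps, per_site_share, active_sites, other_mbps):
    """Receiving-end bandwidth left idle, assuming each active site sends at its cap."""
    per_site_cap = receive_mbps * per_site_share
    used = active_sites * per_site_cap + other_mbps
    return receive_mbps - used


RECEIVE_MBPS = 1000          # hypothetical receiving-end link speed
SHARE = Fraction(1, 4)       # each of the three sites is capped at 1/4

# All three sites busy, other traffic using the remaining quarter: nothing idle.
all_busy = idle_receive_mbps(RECEIVE_MBPS, SHARE, active_sites=3, other_mbps=250)

# Only one site busy, light other traffic: most of the link sits idle,
# yet the one busy site still cannot exceed its 250 Mbps cap.
one_busy = idle_receive_mbps(RECEIVE_MBPS, SHARE, active_sites=1, other_mbps=100)

print(f"idle with all sites busy: {all_busy} Mbps")
print(f"idle with one site busy:  {one_busy} Mbps")
```

That second case is the "live with some unused bandwidth" cost: the cap protects the receiving end in the worst case by giving up throughput in the common case.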
QoS logic at the sending sites is de-coupled from queue state at the receiving site(s), as it has to be. There’s no way to communicate the extremely short-lived queue state at site A to site B — the time needed for QoS decisions is orders of magnitude less than the time it takes to signal between sites.
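Some rough arithmetic shows the size of that gap. The 1 Gbps link speed, the 1500-byte packet, and the 20 ms one-way delay between sites are assumptions chosen for illustration, not measurements.

```python
# Time to put one full-size packet on the wire, i.e. roughly the
# timescale on which an outbound queue makes its decisions.
PACKET_BITS = 1500 * 8
LINK_BPS = 1_000_000_000                 # hypothetical 1 Gbps link
serialization_ns = PACKET_BITS * 1_000_000_000 // LINK_BPS

# Hypothetical one-way delay for a "my queue is filling" signal
# to travel from site A to site B.
ONE_WAY_DELAY_NS = 20_000_000            # 20 ms

packets_sent_meanwhile = ONE_WAY_DELAY_NS // serialization_ns

print(f"one packet serializes in {serialization_ns / 1000:.0f} microseconds")
print(f"packets sent before the signal arrives: {packets_sent_meanwhile}")
```

Under these assumptions, well over a thousand packets go out in the time one signal takes to arrive, so the queue state being reported is stale long before the far end can act on it.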
There are vendors selling devices that claim to do inbound QoS. Exinda is one I’ve heard of.
I found a rather old Network World article by a Riverbed author talking about techniques for intervening in the TCP signaling. The short version: manipulating TCP flows so that TCP thinks there is congestion (let’s call it “premature congestion”) might preserve some headroom for new flows to get their packets in through the inbound interface, so that the queuing can throttle back other apps.
In terms of the VRF tying inbound to an outbound interface, if we have queuing feeding a shaper to say 90 or 95% of the bandwidth, then there might be some headroom left. This seems like one area where some lab experimentation would be handy.
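Short of a lab, a crude fluid model at least shows why the shaper rate matters. This is a sketch under simplifying assumptions (constant arrival rate, no TCP feedback, 1 ms time steps), and the function name is hypothetical. It compares the earlier 1 Gbps in, 1 Gbps out case with a paired outbound interface shaped to 95%.

```python
def backlog_kbit(arrival_mbps: int, service_mbps: int, duration_ms: int) -> int:
    """Queue backlog after duration_ms, stepping in 1 ms ticks.

    1 Mbps sustained for 1 ms is 1 kbit, so the per-tick change in
    backlog (in kbit) is simply arrival_mbps - service_mbps.
    """
    backlog = 0
    for _ in range(duration_ms):
        backlog = max(0, backlog + arrival_mbps - service_mbps)
    return backlog


# 1 Gbps arriving, 1 Gbps leaving: no queue ever forms, so the outbound
# QoS policy never has anything to reorder or drop.
unshaped = backlog_kbit(arrival_mbps=1000, service_mbps=1000, duration_ms=100)

# Same arrivals, outbound shaped to 95%: a backlog builds, which is
# what gives the queuing policy something to act on.
shaped = backlog_kbit(arrival_mbps=1000, service_mbps=950, duration_ms=100)

print(f"backlog at line rate:  {unshaped} kbit")
print(f"backlog shaped to 95%: {shaped} kbit")
```

The model says nothing about whether the resulting drops reach the right senders in time, which is exactly the part that needs real testing.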
I think I’m seeing WAN accelerators and other inline packet / flow manipulation boxes as a dwindling market. That is likely also why those vendors are getting into the SD-WAN market. The driver for this may be cost as speeds increase. It’s cheaper to buy bandwidth than to buy processors that can do some calculations (flow counter updates, etc.) on every packet on a 1 or 10 Gbps or faster interface.
QoS is a complex topic, with some subtleties to it.
I’d love to be able to control inbound traffic. Especially as applications like Skype (voice, video, app sharing) shift to being Internet-based. How do you use QoS to protect such traffic as it comes in from the Internet?
Short Answer: find an ISP that doesn’t laugh when you say ‘QoS for Internet traffic’?
For that matter, how do you even classify Internet traffic for internal QoS purposes, especially if it is HTTPS? The SaaS provider’s server IPs probably change often.
The URL or the TLS “SNI” may help there, but in general, that would seem to require tying initial packet exchange to later traffic flows — something a good bit more complex than matching on IP and port. Or doing something similar with DNS.
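To show what "doing something similar with DNS" might look like, here is a minimal sketch. It is not how NBAR or any other product works internally; the class name, the domain suffix, and the traffic class names are all hypothetical, and the addresses come from the documentation ranges. The idea is to remember which server IPs were returned for names you care about, then classify later flows by server IP.

```python
class DnsAwareClassifier:
    """Sketch: map observed DNS answers to traffic classes, then classify by server IP."""

    def __init__(self, app_suffixes: dict):
        # e.g. {"collab.example.com": "voice-video"} (hypothetical names)
        self.app_suffixes = app_suffixes
        self.ip_to_class = {}

    def _class_for_name(self, name: str):
        name = name.lower().rstrip(".")
        for suffix, traffic_class in self.app_suffixes.items():
            if name == suffix or name.endswith("." + suffix):
                return traffic_class
        return None

    def observe_dns_answer(self, name: str, addresses: list) -> None:
        # Called for each DNS response seen on the wire.
        traffic_class = self._class_for_name(name)
        if traffic_class is not None:
            for address in addresses:
                self.ip_to_class[address] = traffic_class

    def classify(self, server_ip: str) -> str:
        # Later flows carry no hostname, only the server IP.
        return self.ip_to_class.get(server_ip, "default")


classifier = DnsAwareClassifier({"collab.example.com": "voice-video"})
classifier.observe_dns_answer("media.collab.example.com.", ["192.0.2.10", "192.0.2.11"])
classifier.observe_dns_answer("www.example.org", ["198.51.100.7"])

known = classifier.classify("192.0.2.11")
unknown = classifier.classify("198.51.100.7")
print(known, unknown)
```

Even this toy version has to keep state that ties an earlier exchange to later flows, and a real one would also need entries to expire with the DNS TTL and to cope with clients that use cached or encrypted DNS.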
Do products actually do that? Cisco says that NBAR can track the initial DNS. It also can apparently look at the TLS SNI or certificate common name (CN) field. What percent of traffic does that cover and not cover? I have no data.
Security tends to get a lot more attention than QoS, so maybe there’s hope for better classification of such Internet traffic going forward.
From the QoS perspective, that brings us full circle. Even if you can classify inbound Internet / SaaS traffic, how helpful is that?
Comments are welcome, whether in agreement or constructive disagreement with the above. I enjoy hearing from readers and carrying on deeper discussion via comments. Thanks in advance!
Hashtags: #CiscoChampion #TechFieldDay #TheNetCraftsmenWay #Routing
Did you know that NetCraftsmen does network /datacenter / security / collaboration design / design review? Or that we have deep UC&C experts on staff, including @ucguerilla? For more information, contact us at firstname.lastname@example.org.