Improving NGINX Performance with Kernel TLS and SSL_sendfile

drewg123 · on April 4, 2022

We've been running kTLS + SSL sendfile on FreeBSD at Netflix for the last 6 or 7 years. (We had local patches to nginx, before nginx did them "right", and 2 versions of kTLS before the 2nd version was upstreamed to FreeBSD). The savings in terms of CPU use and memory BW are pretty substantial. Especially when you use a NIC which can do in-line kTLS offload, then things basically go back to pre-TLS costs because the buffers are not touched at all by the CPU.

BTW, FreeBSD 14 supports cha-cha poly. But is far more CPU intensive than GCM, so I'd advise against using it.

tinco · on April 4, 2022

Was CPU a bottleneck for your throughput, or did you just want some more headroom on those machines?

I've been in the room with some web scale companies, but never was aware of any big excess TLS termination costs, but maybe I just didn't ask and they had some specialized hardware/software that I didn't know about.

toast0 · on April 4, 2022

I'm not at Netflix, just summarizing. Check out the Netflix presentations like [1]; at 400gbps, the bottleneck is RAM and NUMA fabric. Doing bulk crypto on the CPU requires double the RAM bandwidth, and that kills the throughput regardless of the cost of cipher computation.

TLS termination costs are a large factor for Netflix because their CDN appliances want to have as much throughput as possible in a small space. At 'web scale', it wouldn't be a big deal to be happy with 100gbps througput and use 4 boxes; but getting ISPs to install more boxes is hard, it's better to get one box to work as fast as possible.

Personally, I've seen it be an issue; dual xeon 2690 v1 or maybe v2 couldn't push 10G via TLS, but could via plain text. But the v3 chips had better acceleration and could do 2x10G no problem. I never got access to more than 2x10G networking, so no idea what the limit was. That was under managed hosting (dedicated bare metal), so no ability to do networking beyond what was offered. At Facebook, they tended to use smaller servers and more of them, and I didn't deal with TLS as much.

[1] https://people.freebsd.org/~gallatin/talks/euro2021.pdf

drewg123 · on April 4, 2022

The limit for FreeBSD software kTLS with our workload is a little over 100Gb/s (using a 2x100G NIC) on a 2697a v4. I think most people only get 25-30Gb/s out of the same CPU when not using kTLS.

drewg123 · on April 4, 2022

The bottleneck is both CPU and memory bandwidth. Doing kTLS in software (or in a lookaside accelerator, like QAT) doubles memory bandwidth requirements because something (CPU or accelerator hardware) needs to read the buffer to be encrypted, and write the result into a new buffer.

And doing TLS in userspace, rather than kTLS is even worse, because it disables sendfile. That means you now have extra copies across the user/kernel boundary.

vasco · on April 4, 2022

Not at netflix but at scale any savings in CPU usage translate directly into spend savings. If you have 100 machines at peak and autoscaling rules in place, you can generally assume that a 1% saving in CPU usage translates into one less machine running. You never get "headroom", you just have less machines running at any given time.

tinco · on April 4, 2022

That's only true if CPU is the bottleneck, which is why I asked. I would expect network throughput to be the bottleneck, but maybe that's wrong nowadays with cheap 200gbit nics.

jfindley · on April 4, 2022

Video streaming is a bit different from "normal" TLS web serving. For video streaming the network is much more likely to be the bottleneck, but traditional TLS web serving tends to send a lot of fairly small packets, and saturating even 40gbe with small packets of TLS is non-trivial.

My experience is that at small (think less than 700 bytes or so) packet sizes, TLS overhead can be a very real cost as you scale up, and it's easy to hit "walls" where really you just need to buy more servers because the engineering cost of getting your servers to really saturate their NICs isn't worth it. How much of a big deal that is for you will depend a lot on exactly what sort of a thing you're serving, though.

drewg123 · on April 4, 2022

I'm curious why TLS packets are small in your experience. We use 16k TLS records, and send at whatever MSS the client has negotatied with us. Thats normally something "normal", like 1420-1460 bytes. Plus we use TSO, so the NIC sees large sends of up to 64KB.

cperciva · on April 4, 2022

Random question: Do you force Netflix clients onto the ciphers which are most efficient for the Netflix servers, or are there cases (I'm thinking mobile devices particularly) where it makes sense to use the ciphers which are most efficient for the clients?

jchw · on April 4, 2022

Pretty sure the best choice is basically always AES GCM because most modern chipsets can hardware offload that. Curious to hear the answer, though.

stragies · on April 4, 2022

Genuine quesion: In 2022, is that still dependent on "Can I trust my (HW?)random number generator"?

jchw · on April 4, 2022

I don’t believe so. AES is symmetric-key encryption; if you do it incorrectly, it doesn’t decrypt at all. The only place you can mess up, then, is in the mode, which I don’t believe is accelerated by most (any?) AES encryption instruction sets; instead, they mostly handle doing an individual round of encryption/decryption.

Public key crypto systems seem like they would be much scarier to have hardware acceleration, though I’m sure if you broke it down to low level enough bits you could make it impossible for the hardware to “break” it (aside from, well, if it decided to just maliciously subsitute your code for its own. But it could do that without extension ISAs.)

amadvance · on April 4, 2022

GCM is typically implemented in Intel assembler with the PCLMULQDQ instruction

https://www.intel.com/content/dam/www/public/us/en/documents...

CorrectHorseBat · on April 4, 2022

How is the story any different for public key crypto?

jchw · on April 4, 2022

I am not a cryptography expert, so please take what I say with a grain of nacl. The parameters you choose for public key crypto systems can weaken the cryptographic properties of the result. For example, nonce reuse can single-handedly destroy your security. Similarly, with RSA, padding is crucial to security. Even if you write the extension ISAs such that they never choose parameters, there’s still probably some room for their to be broken output that seems to work.

netheril96 · on April 4, 2022

It of course depends on the random number generator. That is how you create the keys and sometimes the initialization vector. But AES itself is deterministic so it’s not possible to backdoor that part.

LinAGKar · on April 4, 2022

x86 and ARMv8 do, but not ARMv7

mobilio · on April 4, 2022

And one huge exception - RPi CPU.

toast0 · on April 4, 2022

There are cases where it makes sense, but I'm not sure that mobile devices that are likely to be playing video is it. Chances are, they'll be burning more power on the screen backlight than the CPU to do AES vs something else (assuming they're not accelerated for AES). There's two sides to the argument, but easing the burden on servers lets one server serve more clients; it's easier to justify 2x the cost for crypto on the client than on the server, because most clients aren't bottlenecked on crypto and some servers are. (Of course, I'm usually a server engineer, so of course I want my servers to have less work ;)

Choosing ciphers for ease of the client makes more sense, IMHO, when the client is really constrained, like a feature phone or tiny IoT things.

Aissen · on April 4, 2022

Just a note to anyone wanting to use kTLS: make sure to benchmark it first, like in the article. Depending on the CPU architecture, it might even be slower than plain userspace TLS.

Also, while the tx side has seen lots of investment (from CDN companies/owners), the receive side usually comes later. For instance, it's not supported for TLS 1.3 in openssl (although there's an open PR).

georgia_peach · on April 4, 2022

How long before we push everything into the kernel?

zurn · on April 4, 2022

Unikernels are here today.

The other way to eliminate protection boundary crossings is to push everything to userspace, a la Snabb, DPDK etc.

_dh54 · on April 4, 2022

Generally the kernel is involved when it comes to making use of hardware. Specialized hardware emerges when widely applicable bottlenecks are identified, like rendering 3D graphics, decoding video, or in this case TLS encryption. Not everything is destined for the kernel as userspace has generally desirable properties as well.

georgia_peach · on April 4, 2022

Last I checked, accelerated crypto instructions were unprivileged. Going by the article, this is just jamming the entire TLS hokey-pokey into kernelspace to avoid a copy.

TLS session management is rather hairy. Judging by their Linux numbers, I'd take the performance hit over pushing something that complicated into the kernel.

megous · on April 4, 2022

There are also DMA based crypto accelerators (quite common, almost all computers in my home have one). Even those cheap $10 Orange Pi's.

Aissen · on April 5, 2022

The kernel code shouldn't do any session management. It just takes over once the handshake is done; the rest happens in userspace.

Also, you're thinking of CPU-accelerated crypto, but you're missing two other use cases: here, coupled with sendfile(2), you can reduce the number of back-and-forth between kernel and userspace when you already know what will be written on the socket. The other use case is the (few, for now) network cards that do TLS in hardware, meaning going at line-rate, whatever your CPU speed is (not sure you can do single-thread 200Gbit/s crypto on x86_64).

_dh54 · on April 4, 2022

To be fair the hardware in question is not just accelerated crypto operations. From what I’ve read on this topic there are network cards that handle the end-to-end TLS protocol (sans negotiation).

In general you have a point but it’s a judgement call like many things in engineering. Even in the absence of specialized TLS hardware, TLS operations are so common, there is a strong case for pushing it in the kernel if that improves efficiency by a double digit percentage.

SSL_sendfile in particular is an efficiency boon for large static site hosts, it could result in significantly less hardware waste and/or reduced power consumption.

yencabulator · on April 4, 2022

But if we're talking about offloading to a NIC, that can often be accessed directly from userspace, with a lot less kernel complexity.

Pushing complexity into the kernel only makes sense when it's not immediately thereafter offloaded to easily-isolated hardware.

_dh54 · on April 4, 2022

> that can often be accessed directly from userspace, with a lot less kernel complexity.

Can you be more specific? Is this something that could be done with non-root privileges and without explicit coordination? Like normal TCP sockets?

yencabulator · on April 5, 2022

This is how it normally looks:

Modern hardware interacts with software ("drivers") by ringbuffers & data areas.

To support this use case, the hardware generally provides multiple "logical devices". For example, https://en.wikipedia.org/wiki/Single-root_input/output_virtu.... They're defined so that handing an untrusted party control of a logical device limits what they can do, e.g. what VLANs or other network overlays they can interact with. Each one gets its own ringbuffers etc.

Another use case for the same hardware idea is virtual machines. Here, you can think of a userspace process as a virtual machine, minus all the overheads and pretense of being a whole computer.

A userspace process is given access to the memory areas containing ringbuffers & data areas for a logical NIC. A library acts as a driver, and controls the NIC. All interaction is just reads & writes to memory, after setup the kernel is not involved at all.

_dh54 · on April 5, 2022

But in this setup, only a single process can use the logical device provided by the NIC. In kTLS case then multiple processes can share the logical device transparently.

yencabulator · on April 5, 2022

Sure, but there's usually something like 16 logical devices even in the less fancy hardware. And we're talking about nginx, generally there's only one of those -- the VM use case demands more, and that's the primary demand for the feature.

toast0 · on April 4, 2022

Avoiding copies is useful, but AFAIK, this isn't jamimg the whole hokey-pokey into kernelspace. Only the bulk encryption (and probably? packetizing too), user space negotiates the session and just tells the kernel what it is. If there's anything that's not just application data, the kernel leaves it for userland.

Bulk ciphering isn't hairy, and it's the same approach as IPSEC; userland negotiates sessions, kernel does the bulk ciphers.

imhoguy · on April 4, 2022

Awaiting kPHP ;) /s

jrwr · on April 4, 2022

I am working on a MicroPHP to shove into something like a RP2040 since PHP is really just a shell script for a whole bunch of C functions... but PHP eats too much ram currently on the micros

rascul · on April 4, 2022

> Alpine Linux 3.11–3.14 – Kernel is built with the CONFIG_TLS=n option, which disables building kTLS as a module or as part of the kernel.

I wonder if this is still the case with 3.15?

Edit:

I figured I could check for myself. I don't know for sure what the default kernel package is, but there apparently is a linux-lts package. After installing this package, it leaves a config-lts file in /boot which, when grepped, returns:

# CONFIG_TLS is not set

The more I learn about Alpine (and musl), the more I don't want to use them. It appears as if I have an inherent performance penalty serving https web sites with nginx when I do it from Alpine.

stormbrew · on April 4, 2022

> It appears as if I have an inherent performance penalty serving https web sites with nginx when I do it from Alpine.

This is a weirdly alarmist take on this? If you're trying to use bleeding edge kernel features, which this basically is, you should probably feel comfortable using an alternative kernel because odds are pretty good you're gonna have to update sooner rather than later for some bug fix or other.

It's just not really reasonable to expect all distros to enable all kernel flags all the time, a lot of them are not really proven safe or secure. Especially when they're new.

rascul · on April 4, 2022

> This is a weirdly alarmist take on this

It's an observation. I didn't intend for it to be alarmist, if you interpreted it that way then perhaps I could have worded it differently.

> If you're trying to use bleeding edge kernel features, which this basically is

It's been in the kernel since 2017 (the article noted kernel 4.13 which was released 2017 when I looked it up). That doesn't seem very bleeding edge to me.

> It's just not really reasonable to expect all distros to enable all kernel flags all the time

Of course. And I've already been considering moving away from Alpine for at least some use cases, and this can lead me to use move away for more use cases.

stormbrew · on April 4, 2022

> It's an observation. I didn't intend for it to be alarmist, if you interpreted it that way then perhaps I could have worded it differently.

Alarmist is perhaps the wrong word. What I mean is that this is a very strange and high bar for choosing a distro. You aren't "suffering a penalty by using alpine," you're being a beta tester by using a non-LTS ubuntu with a bunch of random flags on or whatever. You can also just.. use a different kernel version with alpine (or whatever distro), no one's stopping you.

> It's been in the kernel since 2017 (the article noted kernel 4.13 which was released 2017 when I looked it up). That doesn't seem very bleeding edge to me.

You can't actually use the version in 4.13 though, you need at least 4.17, because apparently that's what openssl 3.0.0 requires.

Now 4.17 has been around for a while too! But you also need openssl 3.0.0 to make practical use of it, and that's only been out since sept 2021. And also had a massive number of breaking changes.

And then you have to be using a newer kernel version than that to get tlsv3 ciphers apparently. Looks like somewhere around 5.10, though it doesn't explicitly say in the article afaict. If you don't use that then maybe you're gaining some speed but you're also downgrading your security.

And then you need to use bleeding edge nginx and manually compile that against openssl3.

So yeah. It's technically been there for years. But in practical terms no one (or very few people at least) have been using it in anger until the last few months. A new syscall in linux is "bleeding edge" for a while.

Honestly the kernel is the least of your concerns here.

limoce · on April 4, 2022

What if I'm running NGINX inside a Alpine-based docker container, but the host OS is Ubuntu? I guess the kernel is okay but OpenSSL provided by Alpine is not ready for kTLS, so NGINX will not enable kTLS as well.

rascul · on April 4, 2022

I didn't consider containers. I mostly run Alpine in virtual machines.

rascul · on April 4, 2022

Hah I just realized I could check the kernel config in a vm. Earlier I looked for a kernel package to install in a container to see what the config was. Maybe I've been drinking too much. Anyway, in one of my Alpine virtual machines, /boot/config-virt has CONFIG_TLS unset.

bigbizisverywyz · on April 4, 2022

The article actually mentions Alpine as one of the dist. where it's not supported by default:

>The following OSs do not support kTLS, for the indicated reason:

>Alpine Linux 3.11–3.14 – Kernel is built with the CONFIG_TLS=n option, which disables building kTLS as a module or as part of the kernel.

and even recommends building OpenSSL and Nginx 3.0 yourself anyway, so looks like it will be a while before this might be available out-of-the-box for most major dists. But of course everything is OSS so you can DIY if you don't mind getting some ./configure under your fingernails :)

_ikke_ · on April 4, 2022

A feature request to enable it: https://gitlab.alpinelinux.org/alpine/aports/-/issues/13664

Shared404 · on April 4, 2022

The way I see it is that it's just trade-off's to be chosen between.

I like alpine because it's simple enough even for me to wrap my head around and understand what's going on. None of what I'm serving is high traffic or complex enough for this to matter to my usecase - and I suspect this applies to many people's situation.

I appreciate musl/alpine for their stability and simplicity I suppose, and a bit of performance is an OK price to pay in my mind.

schoen · on April 4, 2022

(2021)

KarlKemp · on April 4, 2022

November of 2021, to be exact. If you want to tag this article, you will have to tag others with (2022) sooner than one might expect.

1500100900 · on April 4, 2022

There's so much churn on hacker news that everything gets old after a couple of nanoseconds. So I think it would be best to just add a field for a 128-bit timestamp of the original submission and present it with high precision.

winrid · on April 4, 2022

This would make Nchan even faster, neat.

WJW · on April 4, 2022

How so? According to the article this works only for serving static files. Nchan, cool as it is, has nothing to do with static files.

winrid · on April 5, 2022

Ah, I see, I merely skimmed. Thanks for the correction.