If you are coding your application against libaio, chances are the entire point of the system is to run that application.
High-performance software generally does divide the system up so that certain cores serve certain roles, e.g. IRQ handling. This is often done to exploit the NUMA architecture, ensuring that memory accesses only hit the local memory controllers and locally attached PCIe where possible.
If you are going to these lengths, you are rarely, if ever, concerned with sharing the system with anyone else.
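Roughly the kind of thing that looks like this NUMA-locality sketch (illustrative only, assuming libnuma is available; link with -lnuma, and the allocation size is made up):

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Find out which NUMA node the current CPU belongs to... */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    /* ...and allocate the hot data structure on that node, so accesses
     * stay on the local memory controller. */
    size_t sz = 64 * 1024 * 1024;            /* 64 MiB, arbitrary */
    void *buf = numa_alloc_onnode(sz, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("allocated %zu bytes on node %d (cpu %d)\n", sz, node, cpu);
    numa_free(buf, sz);
    return 0;
}
```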
It's not at all unusual for our HFT and other realtime customers to dedicate cores to particular tasks. They often boot with isolcpus to reduce the number of CPUs visible to the scheduler, then dedicate those CPUs to servicing interrupts or running critical programs. They also turn off all power management so the CPUs are burning at 100% all the time.
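A minimal sketch of the pinning half of that (the core number is illustrative, assuming the box was booted with something like isolcpus=2,3 so cores 2 and 3 are hidden from the general scheduler):

```c
/* Pin the calling thread to an isolated core (say core 3). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* The scheduler will now keep this thread on 'core' only. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int rc = pin_to_core(3);                 /* core number is illustrative */
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        return 1;
    }
    /* ... run the latency-critical loop here ... */
    printf("pinned to core 3\n");
    return 0;
}
```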
You do this to minimize latency. If you need things to happen in microseconds instead of milliseconds, this is what you have to do.
It's a deliberate tradeoff: it is better to waste electricity and ensure all your deadlines are met than to have some be late and some be early.
Changing the speed of a CPU is very expensive in time and can take a couple of milliseconds to complete (of course, most of this is caused by how it is mechanized in the OS).
Transitions from C0 to C1 and back are very fast [0], and basically any application has at least a core and probably several doing kernel things like servicing network IRQs.
I agree that HFT applications should not be using deep C states.
[0] Unless you have a misguided feature like C1E auto-promotion on. Fortunately, recent Linux kernels will override the firmware and turn it off. On Sandy Bridge and possibly newer CPUs with C1E auto-promotion on, resuming from C1 can take many, many milliseconds.
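One common way to keep cores out of deep C states from userspace is the kernel's PM QoS interface; a rough sketch (needs root, and the fd must stay open for as long as the constraint should hold):

```c
/* Ask PM QoS to keep CPU exit latency at 0 us, which in practice keeps
 * cores out of deep C states. The request is dropped when fd is closed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }

    int32_t max_latency_us = 0;          /* 0 = stay in the shallowest state */
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) != sizeof(max_latency_us)) {
        perror("write");
        return 1;
    }

    /* ... run the latency-critical work while fd stays open ... */
    pause();                             /* placeholder: hold the constraint */
    close(fd);
    return 0;
}
```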
Dedicating cores to specific server tasks is normal in modern high-performance software architectures. It can very substantially improve throughput of the system. For server code generally, and data intensive software specifically, there is a reasonable and valuable presumption that your hardware resources are dedicated to your task. This assumption greatly simplifies the design of high-performance software and tends to match the deployment environment of performance-sensitive software in any case.
Many high-performance software applications are constrained by network/disk bandwidth, so you would have cores doing no productive work regardless. This allows you to re-deploy those unused cores in other creative and useful ways. Specialization of cores has the added benefit of minimizing lock contention and making it simpler to reason about complex interactions between threads.
I don't know if it is the case, but I would happily trade off two cores to get better I/O in most cases on our nodes servicing I/O-bound jobs. Otherwise, those CPU cores are just idling anyway, waiting on I/O.
With CPU development currently trending towards more and more cores and ever smaller improvements in single-core performance, dedicating spare cores for such performance improvements is not at all a bad idea.
You can batch 1000 writes to the same socket into a single syscall, but you can’t batch writing to 1000 different sockets into one syscall. (And it is actually a common occurrence to write the same message to 100 sockets at the same time, e.g. in an IRC server with 100 people in a room.)
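The "same socket" half of that is just writev(); a minimal sketch (the socket fd and message list are assumed to come from the caller):

```c
/* Batch many small messages to ONE socket in a single syscall. */
#include <limits.h>      /* IOV_MAX */
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

ssize_t send_batch(int sock, char *msgs[], size_t nmsgs)
{
    struct iovec iov[64];                /* keep well under IOV_MAX */
    size_t n = nmsgs < 64 ? nmsgs : 64;

    for (size_t i = 0; i < n; i++) {
        iov[i].iov_base = msgs[i];
        iov[i].iov_len  = strlen(msgs[i]);
    }
    /* One syscall, many buffers -- but all of them go to the same fd.
     * There is no equivalent single call for 1000 *different* sockets
     * (sendmmsg() batches datagrams, but still on one socket). */
    return writev(sock, iov, (int)n);
}
```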
There's io_submit, which lets you batch I/Os on any number of fds. It appears to lack socket support, but I don't see any reason why that couldn't be added.
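For what it's worth, a minimal libaio sketch (link with -laio) of queueing writes to several different fds with one io_submit() call; the fds and buffers are assumed to be set up by the caller, and real asynchronous behaviour on files generally wants O_DIRECT with aligned buffers, otherwise the submit may block:

```c
#include <libaio.h>
#include <stdio.h>
#include <string.h>

#define NR 4

int submit_batch(int fds[NR], void *bufs[NR], size_t len)
{
    io_context_t ctx = 0;
    int rc = io_setup(NR, &ctx);
    if (rc < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-rc)); return -1; }

    struct iocb cbs[NR], *cbp[NR];
    for (int i = 0; i < NR; i++) {
        io_prep_pwrite(&cbs[i], fds[i], bufs[i], len, 0);
        cbp[i] = &cbs[i];
    }

    /* One syscall queues I/O against NR different descriptors. */
    rc = io_submit(ctx, NR, cbp);
    if (rc < 0) { fprintf(stderr, "io_submit: %s\n", strerror(-rc)); io_destroy(ctx); return -1; }

    /* Reap completions (blocking here for simplicity). */
    struct io_event events[NR];
    int done = io_getevents(ctx, rc, NR, events, NULL);
    printf("submitted %d, completed %d\n", rc, done);

    io_destroy(ctx);
    return 0;
}
```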
There is a fundamental difference between disk I/O and network I/O that causes them to be treated differently.
For disk I/O, you have absolute control over the number and type of events that may be pending, allowing you to strictly bound resource consumption and defer handling of those events indefinitely with no consequences. Similarly, you can opportunistically use unused disk I/O capacity to bring future I/O events forward with little cost e.g. pre-fetching. The worst case scenario you have to deal with in terms of disk I/O is self-inflicted, which makes for relatively simple engineering.
Network I/O events, by contrast, are exogenous and unpredictable. You often have little control over the timing or the quantity of these events, and the only limit on potential resource consumption is the bandwidth of your network hardware. Not only do you have to handle the worst case scenario in these cases, you also have little control over what a worst case scenario can look like. This leads to very different design decisions and interactions with the I/O subsystem versus disk.
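The pre-fetching point above can be as cheap as a hint to the kernel; a small sketch (the offset and length are illustrative, and the advice is only a hint, not a guarantee):

```c
/* Opportunistic read-ahead for disk I/O you know you will need soon. */
#include <fcntl.h>
#include <stdio.h>

int prefetch_region(int fd, off_t offset, off_t len)
{
    int rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", rc);
    return rc;
}
```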
io_submit is part of the Linux AIO subsystem. This patch is literally an extension to that subsystem to avoid doing a kernel call to submit AIO requests and collect results.
Sure, but this patch is only useful for single-purpose, HFT-like workloads. Batched io_submit on sockets would be useful for any application sending many small packets to lots of clients, such as game servers.
This is basically how the entirety of DPDK using a poll-mode driver works. Each worker thread is doing busy polling on a shared memory segment, ensuring minimal packet processing latency.
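A conceptual sketch of that kind of poll-mode worker: spin on a single-producer/single-consumer ring in shared memory instead of sleeping in a syscall. The ring layout here is hypothetical, not the DPDK API:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 1024

struct spsc_ring {
    _Atomic uint32_t head;               /* written by producer */
    _Atomic uint32_t tail;               /* written by consumer */
    void *slots[RING_SLOTS];
};

static void handle_packet(void *pkt) { (void)pkt; /* ... */ }

void poll_worker(struct spsc_ring *ring)
{
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    for (;;) {
        uint32_t head = atomic_load_explicit(&ring->head, memory_order_acquire);
        while (tail != head) {
            handle_packet(ring->slots[tail % RING_SLOTS]);
            tail++;
        }
        atomic_store_explicit(&ring->tail, tail, memory_order_release);
        /* No sleep, no syscall: burn the core to keep latency minimal. */
    }
}
```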
This is a bit different though. DPDK completely bypasses the kernel and userspace directly talks with the network adapter.
I think this is mostly for disk I/O, and the userspace component still communicates with the kernel, which controls the hardware and performs permission checks.
Yes. But actually you don’t only donate the cores to the application but also the I/O device that is acted upon (e.g. the drive or network adapter). This also makes sense for some very specific high-performance tasks, but not for a general-purpose system.
If the AMD leaks are to be believed, we will have cheap (less than about $400) 16- and 12-core consumer processors next year. I’d happily trade two of those cores off against syscall overhead.