If you are coding your application against libaio, chances are the entire point of the system is to run that application.
High-performance software generally does divide the system up so that certain cores serve certain roles, e.g. IRQ handling. This is often done to exploit the NUMA architecture, ensuring that memory accesses only hit the local memory controllers and locally attached PCIe where possible.
If you are going to these lengths, you are rarely, if ever, concerned with sharing the system with anyone else.
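Roughly the kind of thing that looks like this NUMA-locality sketch (illustrative only, assuming libnuma is available; link with -lnuma, and the allocation size is made up):

```c
#define _GNU_SOURCE
#include <numa.h>
#include <sched.h>
#include <stdio.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA not available\n");
        return 1;
    }

    /* Find out which NUMA node the current CPU belongs to... */
    int cpu  = sched_getcpu();
    int node = numa_node_of_cpu(cpu);

    /* ...and allocate the hot data structure on that node, so accesses
     * stay on the local memory controller. */
    size_t sz = 64 * 1024 * 1024;            /* 64 MiB, arbitrary */
    void *buf = numa_alloc_onnode(sz, node);
    if (!buf) {
        perror("numa_alloc_onnode");
        return 1;
    }

    printf("allocated %zu bytes on node %d (cpu %d)\n", sz, node, cpu);
    numa_free(buf, sz);
    return 0;
}
```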
It's not at all unusual for our HFT and other realtime customers to dedicate cores to particular tasks. They often boot with isolcpus to reduce the number of CPUs visible to the scheduler, then dedicate those CPUs to servicing interrupts or running critical programs. They also turn off all power management so the CPUs are burning at 100% all the time.
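A minimal sketch of the pinning half of that (the core number is illustrative, assuming the box was booted with something like isolcpus=2,3 so cores 2 and 3 are hidden from the general scheduler):

```c
/* Pin the calling thread to an isolated core (say core 3). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    /* The scheduler will now keep this thread on 'core' only. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void)
{
    int rc = pin_to_core(3);                 /* core number is illustrative */
    if (rc != 0) {
        fprintf(stderr, "pthread_setaffinity_np: %s\n", strerror(rc));
        return 1;
    }
    /* ... run the latency-critical loop here ... */
    printf("pinned to core 3\n");
    return 0;
}
```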
You do this to minimize latency. If you need things to happen in microseconds instead of milliseconds, this is what you have to do.
It's a deliberate tradeoff: it is better to waste electricity and ensure all your deadlines are met than to have some be late and some be early.
Changing the speed of a CPU is very expensive in time and can take a couple of milliseconds to complete (of course, most of this is caused by how it is mechanized in the OS).
Transitions from C0 to C1 and back are very fast [0], and basically any application has at least a core and probably several doing kernel things like servicing network IRQs.
I agree that HFT applications should not be using deep C states.
[0] Unless you have a misguided feature like C1E auto-promotion on. Fortunately, recent Linux kernels will override the firmware and turn it off. On Sandy Bridge and possibly newer CPUs with C1E auto-promotion on, resuming from C1 can take many, many milliseconds.
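One common way to keep cores out of deep C states from userspace is the kernel's PM QoS interface; a rough sketch (needs root, and the fd must stay open for as long as the constraint should hold):

```c
/* Ask PM QoS to keep CPU exit latency at 0 us, which in practice keeps
 * cores out of deep C states. The request is dropped when fd is closed. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/dev/cpu_dma_latency", O_RDWR);
    if (fd < 0) {
        perror("open /dev/cpu_dma_latency");
        return 1;
    }

    int32_t max_latency_us = 0;          /* 0 = stay in the shallowest state */
    if (write(fd, &max_latency_us, sizeof(max_latency_us)) != sizeof(max_latency_us)) {
        perror("write");
        return 1;
    }

    /* ... run the latency-critical work while fd stays open ... */
    pause();                             /* placeholder: hold the constraint */
    close(fd);
    return 0;
}
```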
Dedicating cores to specific server tasks is normal in modern high-performance software architectures. It can very substantially improve throughput of the system. For server code generally, and data intensive software specifically, there is a reasonable and valuable presumption that your hardware resources are dedicated to your task. This assumption greatly simplifies the design of high-performance software and tends to match the deployment environment of performance-sensitive software in any case.
Many high-performance software applications are constrained by network/disk bandwidth, so you would have cores doing no productive work regardless. This allows you to re-deploy those unused cores in other creative and useful ways. Specialization of cores has the added benefit of minimizing lock contention and making it simpler to reason about complex interactions between threads.
I don't know if it is the case, but I would happily trade off two cores to get better I/O in most cases on our nodes servicing I/O-bound jobs. Otherwise, those CPU cores are just idling anyway, waiting on I/O.
With CPU development currently trending towards more and more cores and ever smaller improvements in single-core performance, dedicating spare cores for such performance improvements is not at all a bad idea.
You can batch 1000 writes to the same socket into a single syscall, but you can’t batch writing to 1000 different sockets into one syscall. (And it is actually a common occurrence to write the same message to 100 sockets at the same time, e.g. in an IRC server with 100 people in a room.)
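The "same socket" half of that is just writev(); a minimal sketch (the socket fd and message list are assumed to come from the caller):

```c
/* Batch many small messages to ONE socket in a single syscall. */
#include <limits.h>      /* IOV_MAX */
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

ssize_t send_batch(int sock, char *msgs[], size_t nmsgs)
{
    struct iovec iov[64];                /* keep well under IOV_MAX */
    size_t n = nmsgs < 64 ? nmsgs : 64;

    for (size_t i = 0; i < n; i++) {
        iov[i].iov_base = msgs[i];
        iov[i].iov_len  = strlen(msgs[i]);
    }
    /* One syscall, many buffers -- but all of them go to the same fd.
     * There is no equivalent single call for 1000 *different* sockets
     * (sendmmsg() batches datagrams, but still on one socket). */
    return writev(sock, iov, (int)n);
}
```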
There's io_submit, which lets you batch I/Os on any number of fds. It appears to lack socket support, but I don't see any reason why that couldn't be added.
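For what it's worth, a minimal libaio sketch (link with -laio) of queueing writes to several different fds with one io_submit() call; the fds and buffers are assumed to be set up by the caller, and real asynchronous behaviour on files generally wants O_DIRECT with aligned buffers, otherwise the submit may block:

```c
#include <libaio.h>
#include <stdio.h>
#include <string.h>

#define NR 4

int submit_batch(int fds[NR], void *bufs[NR], size_t len)
{
    io_context_t ctx = 0;
    int rc = io_setup(NR, &ctx);
    if (rc < 0) { fprintf(stderr, "io_setup: %s\n", strerror(-rc)); return -1; }

    struct iocb cbs[NR], *cbp[NR];
    for (int i = 0; i < NR; i++) {
        io_prep_pwrite(&cbs[i], fds[i], bufs[i], len, 0);
        cbp[i] = &cbs[i];
    }

    /* One syscall queues I/O against NR different descriptors. */
    rc = io_submit(ctx, NR, cbp);
    if (rc < 0) { fprintf(stderr, "io_submit: %s\n", strerror(-rc)); io_destroy(ctx); return -1; }

    /* Reap completions (blocking here for simplicity). */
    struct io_event events[NR];
    int done = io_getevents(ctx, rc, NR, events, NULL);
    printf("submitted %d, completed %d\n", rc, done);

    io_destroy(ctx);
    return 0;
}
```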
There is a fundamental difference between disk I/O and network I/O that causes them to be treated differently.
For disk I/O, you have absolute control over the number and type of events that may be pending, allowing you to strictly bound resource consumption and defer handling of those events indefinitely with no consequences. Similarly, you can opportunistically use unused disk I/O capacity to bring future I/O events forward with little cost e.g. pre-fetching. The worst case scenario you have to deal with in terms of disk I/O is self-inflicted, which makes for relatively simple engineering.
Network I/O events, by contrast, are exogenous and unpredictable. You often have little control over the timing or the quantity of these events, and the only limit on potential resource consumption is the bandwidth of your network hardware. Not only do you have to handle the worst case scenario in these cases, you also have little control over what a worst case scenario can look like. This leads to very different design decisions and interactions with the I/O subsystem versus disk.
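The pre-fetching point above can be as cheap as a hint to the kernel; a small sketch (the offset and length are illustrative, and the advice is only a hint, not a guarantee):

```c
/* Opportunistic read-ahead for disk I/O you know you will need soon. */
#include <fcntl.h>
#include <stdio.h>

int prefetch_region(int fd, off_t offset, off_t len)
{
    int rc = posix_fadvise(fd, offset, len, POSIX_FADV_WILLNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", rc);
    return rc;
}
```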
io_submit is part of the Linux AIO subsystem. This patch is literally an extension to that subsystem to avoid doing a kernel call to submit AIO requests and collect results.
Sure, but this patch is only useful for single-purpose, HFT-like workloads. Batched io_submit on sockets would be useful for any application sending many small packets to lots of clients, such as game servers.
This is basically how the entirety of DPDK using a poll-mode driver works. Each worker thread is doing busy polling on a shared memory segment, ensuring minimal packet processing latency.
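A conceptual sketch of that kind of poll-mode worker: spin on a single-producer/single-consumer ring in shared memory instead of sleeping in a syscall. The ring layout here is hypothetical, not the DPDK API:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SLOTS 1024

struct spsc_ring {
    _Atomic uint32_t head;               /* written by producer */
    _Atomic uint32_t tail;               /* written by consumer */
    void *slots[RING_SLOTS];
};

static void handle_packet(void *pkt) { (void)pkt; /* ... */ }

void poll_worker(struct spsc_ring *ring)
{
    uint32_t tail = atomic_load_explicit(&ring->tail, memory_order_relaxed);
    for (;;) {
        uint32_t head = atomic_load_explicit(&ring->head, memory_order_acquire);
        while (tail != head) {
            handle_packet(ring->slots[tail % RING_SLOTS]);
            tail++;
        }
        atomic_store_explicit(&ring->tail, tail, memory_order_release);
        /* No sleep, no syscall: burn the core to keep latency minimal. */
    }
}
```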
This is a bit different though. DPDK completely bypasses the kernel and userspace directly talks with the network adapter.
I think this is mostly for disk I/O, and the userspace component still communicates with the kernel, which controls the hardware and performs permission checks.
Yes. But actually you don’t only donate the cores to the application but also the I/O device that is acted upon (e.g. the drive or network adapter). This also makes sense for some very specific high-performance tasks, but not for a general-purpose system.
If the AMD leaks are to be believed, we will have cheap (less than about $400) 16- and 12-core consumer processors next year. I’d happily trade two of those cores off against syscall overhead.