setjmp()/longjmp() will work, but they're somewhat inefficient: at least under POSIX, they save and restore the signal mask, which means two round-trips to the kernel for a single coroutine context switch.
My web server uses coroutines, and for x86 and x86-64 it uses open-coded assembly routines to yield/resume, with fallbacks to setjmp()/longjmp() on other architectures.
It works fairly well, performance-wise. In fact, it's one of the top-performing servers/frameworks in TechEmpower's Web Framework Benchmarks[1].
I wrote a similar article explaining how everything is put together here[2].
Given the various efforts like libtask, lthread, Boost coroutines, etc., it seems like the low-level assembly trickery ought to be isolated and standardized. Maybe some new methods (similar to the setcontext family) should be proposed to the glibc project. Something not at risk of deprecation.
The Boost guys factored the context switching and stack allocation stuff into a separate library, Boost Context[0], which supports a bunch of architectures[1]. I'm sure their fcontext code could be lifted. They claim that a switch takes about 8 ns on modern x86-64[2].
The Boost Coroutine library is built on top, adds type safety, ensures the stack is unwound when contexts are destroyed, and enables propagation of exceptions across switches.
$ ls -lh /usr/lib/libboost_context.so.1.58.0
-rwxr-xr-x 1 root root 55K May 30 09:58 /usr/lib/libboost_context.so.1.58.0
Not sure if something like this should be part of glibc. Just recently there was an ABI break due to the fact that jmp_buf is exposed in the headers to allow embedding the struct[1].
(Also, I ended up mixing up ucontext.h with setjmp.h in my comment above; Lwan uses ucontext.h as a fallback. There are coroutine implementations that will use setjmp/longjmp, or at the very least reuse the jmp_buf struct and roll their own asm, though.)
Glibc seems like a good place to put it to me. If everyone prefers hand-coding assembly primitives that work around suboptimal standard methods (ucontext), then my first instinct would be to look into making those existing methods more optimal.
(Note: in case it wasn't clear, I'm not suggesting we put all of coroutines into glibc, just the stack dancing stuff)
[1] https://www.techempower.com/benchmarks/
[2] http://tia.mat.br/blog/html/2014/10/06/life_of_a_http_reques...