Thanks, I used that neat Knuth coin-flip trick from your code.
Why does the comment in the file say that its lock-free implementation is half as fast as the 'sequential' one? Where does the slowness come from? Is it all those while(1) loops retrying atomic operations until they win the race?
It's used to manage code fragments that need to be accessed in signal handlers.