If you twist your perspective a little, you can change how you view the stack: instead of looking at it as a literal, linear blob of data, consider it the 'data' or 'scope' pointer of a closure (closures are frequently implemented as a pair of data and code pointers). A key element here is the continuation closure, which is implicitly passed in to closures when they get called; the 'RET' type of CPU opcode is the normal way of calling this closure.
If one carefully adds stack splitting and duplication capabilities to an ordinary procedural language, it amounts to a conversion to continuation-passing style. Not only does that cut down on some of the drawbacks of massive multithreading on 32-bit (address space exhaustion, though not the expense of context switches), but it also permits interesting implementation strategies (e.g. making call/cc work in C or languages implemented in terms of C, which can in turn be applied to implementation of stateless servers).
Stack splitting is also central to efficiently implementing parallel logic programming languages; it was the missing insight needed for the efficient or-parallelism that the Fifth Generation computing project was hoping for.
Curious idea, clearly borrowing from managed runtimes with garbage collected stack frames.
Random idea (feasible only where address space is plentiful): why not let the operating system's virtual memory subsystem worry about allocating stack space? Reserve a large stack with mmap(), let page faults be responsible for actually allocating the memory, and periodically munmap() and re-mmap() the upper part of the stack.
This is exactly how Windows allocates the stack. The address space is reserved, but only the first page is committed; the next page after that is called the guard page. Touching the guard page causes an exception, and the first-chance handler for that exception commits the page, sets up a new guard page at the next position, and resumes execution.
Compilers targeting Windows need to generate stack-touching prologues for procedures that allocate more than 4K of local variable data; otherwise they risk code first touching a page beyond the guard page, which causes a much more unpleasant access violation.
Ah, I vaguely remember encountering that years ago now that you mention it. I assume it doesn't actually ever purge pages that go unused for a while unless the thread exits?
Hm, isn't that "random idea" the conventional way of assigning thread stacks? Just carve out a piece of virtual address space and put the stack pointer there; the OS maps in memory when the application actually touches the pages. This works great on 64-bit systems.
(the difference with your idea is that the memory is never unmapped / returned to the system unless the thread dies)
I realise all user space memory is at some point allocated using a virtual memory mapping. I'm not so sure that explicitly reserving pages which aren't committed is the conventional way to allocate stacks for new threads. In any case, if you don't explicitly purge the pages, a thread will end up using a lot more stack space than necessary if it required a large stack early on in its lifetime.
I've quickly skimmed through it, but while reading I wondered how he's going to handle alloca() (and later I read about dynamic arrays).
So the solution is to allocate them dynamically and free them at the end of the routine. That's all good, but what if you longjmp() while an alloca() allocation is still live? That would leak... unless unwinding is done and an RAII kind of type is made for alloca (or maybe wrap alloca in C++ RAII, or maybe that's what the compiler would be doing).
Here's a related paper from 2003 called 'Capriccio: Scalable Threads for Internet Services'. It uses call graph analysis to figure out where to place checkpoints that allocate and deallocate stack frame chunks.
The only scalable thing for "Internet Services" I know of is non-blocking I/O with just one thread per physical CPU thread (i.e. avoiding thread context switching).
Edit: after RTFM, the purpose of the paper is simulating threads that handle blocking I/O on top of epoll and/or select. Great paper.
> It becomes possible to run millions of threads (either full NPTL threads or co-routines) in a 32-bit address space
If the minimal "reserved stack space" per thread is 4K, then a million threads would consume about 4GB of address space. That's not quite feasible in a 32-bit address space. Would most threads run with even less than 4K of reserved stack space?
On a 64-bit architecture, this is completely pointless, as there is no problem with the address space. Memory is also not a problem, since all modern operating systems allocate stack memory as needed.
So, my question is: why spend time on implementing this?
> Memory is also not a problem, since all modern operating systems allocate stack memory as needed.
A modern OS can only allocate memory as needed with a page-level granularity.
For many millions of threads, this is not good enough and will waste GBs of memory completely unnecessarily. Maybe in a decade wasting GBs of memory will be negligible (but then we might want trillions of threads!). And page sizes seem to be growing too, so the OS granularity is too coarse for micro-threads.