I made Web49 because there are not many good tools for WebAssembly out there. WABT is close, but its interpreter is too slow and the tools are megabytes in size each. Wasm3 is a bit faster, but it only contains an interpreter, nothing else.
Tooling for WebAssembly is held mostly by the browser vendors. It is such a nice format to work with when one removes all the fluff. WebAssembly tooling should not take seconds to do what should take milliseconds, and it should be able to be used as a library, not just a command line program.
I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly. Web49 started as a way to compile WebAssembly to MiniVM, but soon pivoted into its own interpreter and tooling. I could not be happier with it in its current form and am excited to see what else it can do, with more work.
> I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly.
As compared to hand-written assembly or the tailcall technique you describe. But (for the benefit of onlookers) a threaded switch, especially using (switch-like) computed gotos, is still more performant than a traditional function dispatch table.
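For onlookers, here is a minimal sketch of that kind of computed-goto threading (GNU C only, since it uses the labels-as-values extension; the toy opcodes and names are made up for illustration). Each handler jumps straight to the next handler instead of falling back to a central switch:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const uint8_t *pc) {
        /* table of label addresses, indexed by opcode */
        static void *const dispatch[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
        int64_t stack[64], *sp = stack;

    #define NEXT() goto *dispatch[*pc++]

        NEXT();
    op_push:  *sp++ = *pc++;                          NEXT();  /* push immediate byte */
    op_add:   sp--; sp[-1] += *sp;                    NEXT();  /* pop two, push sum   */
    op_print: printf("%lld\n", (long long)sp[-1]);    NEXT();
    op_halt:  return;
    }

    int main(void) {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);   /* prints 5 */
    }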
Has there been any movement in GCC wrt the tailcalls feature?
One of the limitations of computed gotos is the inability to derive the address of a label from outside the function. You always end up with some amount of superfluous conditional code for selecting the address inside the function, or indexing through a table. Several years ago when exploring this space I discovered a hack, albeit one that only works with GCC (IIRC), at least as of ~10 years ago. GCC supports nested function definitions, nested functions have visibility of the containing function's goto labels (notwithstanding that you're not supposed to make use of them), and most surprisingly GCC also supports attaching __attribute__((constructor)) to nested function definitions. This means you can export a map of goto labels that can be used to initialize VM data structures, permitting (in theory) more efficient direct threading.
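For reference, the conventional table-indexing workaround mentioned at the start looks roughly like this (a hypothetical sketch in the dispatch-table style above, handler bodies omitted):

    /* Since &&labels are only visible inside run(), accept a NULL program as a
     * request to hand the dispatch table back to the caller for direct threading. */
    void *const *run(const uint8_t *pc) {
        static void *const dispatch[] = { &&op_push, &&op_add, &&op_halt };
        if (pc == NULL)               /* the superfluous conditional in question */
            return dispatch;
        goto *dispatch[*pc++];
    op_push: /* ... */ ;
    op_add:  /* ... */ ;
    op_halt: return NULL;
    }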
The tailcall technique is a much more sane and profitable approach, of course.
The goto labels can be exported much more directly using inline asm. Further, inline asm can now represent control flow, so you can define the labels in inline asm and the computed jump at the end of an opcode. That's pretty robust to compiler transforms. Just looked up an interpreter in that style:
Because the labels are defined in assembly, not in C, accessing them from outside the function is straightforward. I wrote a whole load of these at some point; there's probably a version of those macros somewhere that compiles to jumps through a C computed goto as well.
Mike's primary complaint is bad register allocation. It is very important to keep the most important state consistently in registers. In my experience, compilers still struggle to do good register allocation in big and branchy functions.
Even perfect branch prediction cannot solve the problem of unnecessary spills.
Very true. I imagine that grouping instructions that use the same registers into their own functions would help with that (arithmetic expressions tend to generate sequences like this). Then you loop within this function while the next instruction is in the same group, and only return to the outer global instruction loop otherwise. If you design the bytecode carefully, you can probably do group checks with a simple bitmask.
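A hypothetical sketch of that last point, assuming the bytecode is laid out so the top bits of an opcode identify its group:

    #include <stdint.h>

    /* Hypothetical encoding: bits 7..5 of the opcode name its group, so checking
     * whether the next instruction stays in the current handler group is a single
     * mask-and-compare. */
    #define OP_GROUP_MASK 0xE0
    #define GROUP_ARITH   0x20   /* OP_ADD, OP_SUB, ... all share the 0x2_ prefix */

    static inline int same_group(uint8_t op, uint8_t group) {
        return (op & OP_GROUP_MASK) == group;
    }

    /* inside the arithmetic group's function:
     *   while (same_group(code[pc], GROUP_ARITH)) { ...handle it locally... }
     * otherwise return to the outer dispatch loop */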
Nearly. You need `register`, and you also need to pass the variables into (potentially no-op) inline asm. `register int v("eax")` iirc, but it's been years since I did this.
The 'register' is indeed largely ignored, but it has the additional somewhat documented meaning of 'when this variable goes into inline asm, it needs to be in that register'. In between asm blocks it can be elsewhere - stack or whatever - but it still gives the regalloc a really clear guide to work from.
It's `register int v asm("eax")`. However, they are very easily elided, especially at higher optimization levels; compilers are very open about this [1].
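A minimal sketch of that idiom (GCC/Clang on x86-64 assumed; the function and variable names are made up):

    #include <stdint.h>

    /* A local register variable is only guaranteed to sit in the named register while
     * it is an asm operand; between asm statements the allocator may still move or
     * spill it, which is why these are so easily elided at higher optimization levels. */
    static uint64_t step(uint64_t acc) {
        register uint64_t r asm("rax") = acc;
        asm volatile("" : "+r"(r));   /* empty asm: forces r to be live in %rax here */
        return r + 1;
    }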
I read a research paper showing that the branch prediction issues are a non-issue with modern predictors (e.g. ITTAGE). It is of course true that register spills happen, but they are not bad enough to warrant hand-written assembly. Especially when you simulate AOT-compiled code (e.g. RISC-V and WASM), you will already be 3-10x faster than Lua. For my purposes of using this kind of emulator for scripting, it is already fine.
Throw instruction counting into the mix, and you can even be faster than LuaJIT, although I'm not sure how it manages to screw up the counting so badly. I wrote a little bit about it here: https://medium.com/@fwsgonzo/time-to-first-instruction-53a04...
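A hypothetical sketch of how cheap that counting can be when folded into the dispatch macro from the computed-goto example above (`budget` and `out_of_gas` are assumed to be declared in the interpreter function):

    /* Each instruction costs one decrement and one highly predictable branch. */
    #define NEXT() do { if (--budget == 0) goto out_of_gas; \
                        goto *dispatch[*pc++]; } while (0)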