I made Web49 because there are not many good tools for WebAssembly out there. WABT is close, but its interpreter is too slow and the tools are megabytes in size each. Wasm3 is a bit faster, but it only contains an interpreter, nothing else.
Tooling for WebAssembly is held mostly by the browser vendors. It is such a nice format to work with when one removes all the fluff. WebAssembly tooling should not take seconds to do what should take milliseconds, and it should be able to be used as a library, not just a command line program.
I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly. Web49 started as a way to compile WebAssembly to MiniVM, but soon pivoted into its own interpreter and tooling. I could not be happier with it in its current form and am excited to see what else it can do, with more work.
> I developed a unique way to write interpreters based on threaded code jumps and basic block versioning when I made MiniVM (https://github.com/FastVM/minivm). It was both larger and more dynamic than WebAssembly.
As compared to hand-written assembly or the tailcall technique you describe. But (for the benefit of onlookers) a threaded switch, especially using (switch-like) computed gotos, is still more performant than a traditional function dispatch table.
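For onlookers, here is a minimal sketch of that kind of computed-goto threading (GNU C only, since it uses the labels-as-values extension; the toy opcodes and names are made up for illustration). Each handler jumps straight to the next handler instead of falling back to a central switch:

    #include <stdint.h>
    #include <stdio.h>

    enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

    static void run(const uint8_t *pc) {
        /* table of label addresses, indexed by opcode */
        static void *const dispatch[] = { &&op_push, &&op_add, &&op_print, &&op_halt };
        int64_t stack[64], *sp = stack;

    #define NEXT() goto *dispatch[*pc++]

        NEXT();
    op_push:  *sp++ = *pc++;                          NEXT();  /* push immediate byte */
    op_add:   sp--; sp[-1] += *sp;                    NEXT();  /* pop two, push sum   */
    op_print: printf("%lld\n", (long long)sp[-1]);    NEXT();
    op_halt:  return;
    }

    int main(void) {
        const uint8_t prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
        run(prog);   /* prints 5 */
    }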
Has there been any movement in GCC wrt the tailcalls feature?
One of the limitations of computed gotos is the inability to derive the address of a label from outside the function. You always end up with some amount of superfluous conditional code for selecting the address inside the function, or indexing through a table. Several years ago when exploring this space I discovered a hack, albeit one that only works with GCC (IIRC), at least as of ~10 years ago. GCC supports nested function definitions, nested functions have visibility of the containing function's goto labels (notwithstanding that you're not supposed to make use of them), and most surprisingly GCC also supports attaching __attribute__((constructor)) to nested function definitions. This means you can export a map of goto labels that can be used to initialize VM data structures, permitting (in theory) more efficient direct threading.
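For reference, the conventional table-indexing workaround mentioned at the start looks roughly like this (a hypothetical sketch in the dispatch-table style above, handler bodies omitted):

    /* Since &&labels are only visible inside run(), accept a NULL program as a
     * request to hand the dispatch table back to the caller for direct threading. */
    void *const *run(const uint8_t *pc) {
        static void *const dispatch[] = { &&op_push, &&op_add, &&op_halt };
        if (pc == NULL)               /* the superfluous conditional in question */
            return dispatch;
        goto *dispatch[*pc++];
    op_push: /* ... */ ;
    op_add:  /* ... */ ;
    op_halt: return NULL;
    }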
The tailcall technique is a much more sane and profitable approach, of course.
The goto labels can be exported much more directly using inline asm. Further, inline asm can now represent control flow, so you can define the labels in inline asm and the computed jump at the end of an opcode. That's pretty robust to compiler transforms. Just looked up an interpreter in that style:
Because the labels are defined in assembly, not in C, accessing them from outside the function is straightforward. I wrote a whole load of these at some point; there's probably a version of those macros somewhere that compiles to jumps through a C computed goto as well.
Mike's primary complaint is bad register allocation. It is very important to keep the most important state consistently in registers. In my experience, compilers still struggle to do good register allocation in big and branchy functions.
Even perfect branch prediction cannot solve the problem of unnecessary spills.
Very true. I imagine that grouping instructions that use the same registers into their own functions would help with that (arithmetic expressions tend to generate sequences like this). Then you loop within this function while the next instruction is in the same group, and only return to the outer global instruction loop otherwise. If you design the bytecode carefully, you can probably do group checks with a simple bitmask.
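A hypothetical sketch of that last point, assuming the bytecode is laid out so the top bits of an opcode identify its group:

    #include <stdint.h>

    /* Hypothetical encoding: bits 7..5 of the opcode name its group, so checking
     * whether the next instruction stays in the current handler group is a single
     * mask-and-compare. */
    #define OP_GROUP_MASK 0xE0
    #define GROUP_ARITH   0x20   /* OP_ADD, OP_SUB, ... all share the 0x2_ prefix */

    static inline int same_group(uint8_t op, uint8_t group) {
        return (op & OP_GROUP_MASK) == group;
    }

    /* inside the arithmetic group's function:
     *   while (same_group(code[pc], GROUP_ARITH)) { ...handle it locally... }
     * otherwise return to the outer dispatch loop */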
Nearly. You need `register`, and you also need to pass the variables into (potentially no-op) inline asm. `register int v("eax")` iirc, but it's been years since I did this.
The 'register' is indeed largely ignored, but it has the additional somewhat documented meaning of 'when this variable goes into inline asm, it needs to be in that register'. In between asm blocks it can be elsewhere - stack or whatever - but it still gives the regalloc a really clear guide to work from.
It's `register int v asm("eax")`. However, they are very easily elided, especially at higher optimization levels; compilers are very open about this [1].
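A minimal sketch of that idiom (GCC/Clang on x86-64 assumed; the function and variable names are made up):

    #include <stdint.h>

    /* A local register variable is only guaranteed to sit in the named register while
     * it is an asm operand; between asm statements the allocator may still move or
     * spill it, which is why these are so easily elided at higher optimization levels. */
    static uint64_t step(uint64_t acc) {
        register uint64_t r asm("rax") = acc;
        asm volatile("" : "+r"(r));   /* empty asm: forces r to be live in %rax here */
        return r + 1;
    }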
I read a research paper showing that the branch prediction issues are a non-issue with modern predictors (e.g. ITTAGE). It is of course true that register spills happen, but they are not bad enough to warrant hand-written assembly. Especially when you simulate AOT-compiled code (e.g. RISC-V and WASM), you will already be 3-10x faster than Lua. For my purposes of using this kind of emulator for scripting, it is already fine.
Throw instruction counting into the mix, and you can even be faster than LuaJIT, although I'm not sure how it manages to screw up the counting so badly. I wrote a little bit about it here: https://medium.com/@fwsgonzo/time-to-first-instruction-53a04...
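A hypothetical sketch of how cheap that counting can be when folded into the dispatch macro from the computed-goto example above (`budget` and `out_of_gas` are assumed to be declared in the interpreter function):

    /* Each instruction costs one decrement and one highly predictable branch. */
    #define NEXT() do { if (--budget == 0) goto out_of_gas; \
                        goto *dispatch[*pc++]; } while (0)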