
I not only welcome reasonable scepticism, but I do my best to facilitate it. I have accrued sufficient evidence over time of my own fallibility, and idiocy, that I now try to give people the opportunity to spot mistakes so that I might correct them. As a happy bonus, this also gives people a way of verifying whether the work was done in the right spirit or not.

To that end we work in the open, so all the evidence you need to back up your assertions, or assuage your doubts, has been available since the first day we started:

* Here's the experiment, with its 1025 commits going back to 2015: https://github.com/softdevteam/warmup_experiment/ -- note that the benchmarks were slurped in before we'd even got many of the VMs compiling.

* You can also see from the first commit that we simply slurped in the CLBG benchmarks wholesale from a previous paper, done some time before I had any inkling that there might be warmup problems: https://github.com/ltratt/vms_experiment/

* Here's the repo for the paper itself, where you can see us getting to grips with what we were seeing over several years: https://github.com/softdevteam/warmup_paper

* The snapshots of the paper we released are at https://arxiv.org/abs/1602.00602v1 -- the first version ("V1") clearly shows problems, but we had no statistical analysis (note that the first version has a different author list than the final version, and the author added later was a stats expert).

* The raw data for the releases of the experiment are at https://archive.org/download/softdev_warmup_experiment_artef... so you can run your own statistical analysis on them.
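To make "run your own statistical analysis" concrete, here's a minimal sketch of one crude way to eyeball warmup in per-iteration timing data. This is my own illustrative code, not the paper's method (which uses a proper statistical analysis of the full run); the function name, parameters, and thresholds are all hypothetical:

```python
# Hypothetical sketch: a crude check for "warmup" in per-iteration
# benchmark timings. NOT the experiment's actual analysis -- it simply
# compares the mean of the earliest iterations against the mean of the
# latest ones.

def looks_like_warmup(times, head=0.1, tail=0.5, threshold=0.95):
    """Return True if the last `tail` fraction of iterations is
    noticeably faster than the first `head` fraction.
    `times` is a list of per-iteration timings in seconds."""
    n = len(times)
    early = times[: max(1, int(n * head))]
    late = times[int(n * (1 - tail)):]
    early_mean = sum(early) / len(early)
    late_mean = sum(late) / len(late)
    return late_mean < threshold * early_mean

# A run that speeds up once JIT compilation kicks in...
warming = [1.0] * 10 + [0.5] * 90
# ...versus one that stays flat throughout.
flat = [1.0] * 100

print(looks_like_warmup(warming))  # True
print(looks_like_warmup(flat))     # False
```

A real analysis has to cope with noise, slowdowns, and runs that never stabilise at all, which is why the paper's classification is considerably more involved than this.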

To be clear, our paper is (or, at least, I hope it is) careful to scope its assertions. It doesn't say "VMs never warm up" or even "VMs only warm up X% of the time". It says "in this cross-language, cross-VM suite of small benchmarks we observed warmup X% of the time, and that might suggest there are broader problems, but we can't say for sure". There are various possible hypotheses which could explain what we saw, including "only microbenchmarks, or this set of microbenchmarks, show this problem". Personally, that doesn't feel like the most likely explanation, but I have been wrong about bigger things before!


