
From the arxiv pdf: "We disabled Turbo Boost and set the processor to run at its highest clock speed" for their comparison.

Doesn't their setup bias towards the AVX2 solution?



For what it's worth, I've tested my AVX2 copy of their code on a couple of different machines, and found it faster than my POPCNT-based implementation. (See my comments elsewhere here.)

Here are the numbers from my laptop for my benchmark:

  2048 Tanimoto (RDKit Morgan radius=2) 1000 queries
  chemfp search using threshold Tanimoto arena, index, single threaded (popcnt_128_128)
    threshold T=0.4 popcnt 19.33 ms/query  (T2: 19.81 ms check: 189655.09 run-time: 39.2 s) (allow specialized)
  chemfp search using threshold Tanimoto arena, index, single threaded (avx2_256)
    threshold T=0.4 avx2 12.63 ms/query  (T2: 14.04 ms check: 189655.09 run-time: 26.7 s) (allow specialized)
That's 19.33 ms/query on 2048-bit (256 byte) bitstrings using POPCNT unrolled to two loops of 128 bytes each, and 12.63 ms/query using AVX2 specialized for 256 bytes.

My testing showed there was no advantage for a fully unrolled POPCNT implementation. One thing to know is that there is only one execution port for POPCNT on my Intel processor. An AMD processor with 4 execution ports (Ryzen, I think?) may be faster. I don't have that processor, and my limited understanding of the gcc-generated assembly suggests it isn't optimized for that case.



> An AMD processor with 4 execution ports (Ryzen, I think?) may be faster.

FYI m0zg agreed: "AMD can retire 4 (!) popcounts per cycle per core, if your code is able to feed it": https://news.ycombinator.com/item?id=20916023


I'm one of the authors on the paper. This is a good point. I think the answer is "it might". In practice (as 'dalke' points out) the AVX2 approach still ends up faster on common Intel hardware with Turbo enabled, but there might be cases where it doesn't, and the measured degree of difference certainly might change with Turbo on or off.

Complicating things, some but not all AVX2 operations benefit from Turbo frequencies as well. Among other things, it depends on the exact mix of instructions. The technical term is "license": regardless of the Turbo setting, your processor chooses a frequency depending on the license that the instruction mix allows. Daniel has a blog post (co-written with Travis) on how this affects AVX512, but it also affects AVX2: https://lemire.me/blog/2018/09/07/avx-512-when-and-how-to-us.... The specifics are not well documented, and Travis probably has the best understanding of the exact behavior of anyone I know.

Anyway, we turn Turbo off and try to run the processor at a constant frequency because it makes some of our other measurements more reliable. But yes, when claiming that a SIMD solution is faster than a scalar solution, we should probably also test whether the conclusion holds with Turbo turned back on. I think it's still best practice to do most measurements with it off, but I'll try to remember to do an extra test with it turned on if this comes up in future papers.


Thanks for your answer - great blog link - most relevant:

"Intel made more aggressive use of AVX512 instructions in earlier versions of the icc compiler, but has since removed most use unless the user asks for it with a special command line option."

For other readers, short summary copied from the link with very light editing:

* There are heavy and light AVX instructions. "Heavy" instructions roughly are those involving floating point operations or integer multiplications operating on 512 bits.

* Intel cores can run in one of three modes: license 0 (L0) is the normal, fastest mode; license 1 (L1) is slower; and license 2 (L2) is the slowest. To get into license 2, you need sustained use of heavy 512-bit instructions, where "sustained" means approximately one such instruction every cycle.

* The processor does not immediately move to a higher license when it encounters heavy instructions: it first executes them with reduced performance (say 4x slower), and only when there are many of them does it change its frequency. Otherwise, any other 512-bit instructions will move the core to L1.

* Downclocking is per core and for a short time after you have used particular instructions (e.g., ~2ms).

* The downclocking of a core is based on: the current license level of that core, and also the total number of active cores on the same CPU socket (irrespective of the license level of the other cores).

The constraints and suggested workarounds are complex. Also the word choice "license" by Intel is bizarre (since it implies the throttling is to reduce performance for business reasons rather than technical reasons?).





