I made a variant that is (on my Apple m1 machine) 20x faster than the naive C ve...

fuber2018 · on July 7, 2023

I assume the M1's SIMD registers are wider/more numerous than just the couple of size_t registers used for the loading/masking/accumulating inner loop in your run_swtches().

You can speedup the code by unrolling your inner loop a few times (try 4x or 8x) - it does mean that your overflow prevention limit is lowered (to a multiple of the unrolled grouping number) and run a few more times. But the speedup offsets the increased bookkeeping.

A version I played with showed increased speed by saving the in-progress accumulation in an array and then doing the final accumulation after the main loop is done. But that may be due to the CPU arch/compiler I'm using.

fuber2018 · on July 7, 2023

If this code only runs on one compiler version/CPU arch, then ASSUMING the compiler will do the RIGHT THING and auto-vectorize the code is okay.

But if your code will be cross-platform/run on different OSes/CPU arch's, then a SWAR version may be more consistently performant - no need to guess if the compiler's optimization heuristics decided to go with the general purpose CPU registers or faster SIMD registers.

Downside is that the devs are exposed to the gnarly optimized code.

fuber2018 · on July 7, 2023

Almost the same as my SWAR version - which is what you're doing.

But aren't you reading off the end of the buffer in your memcpy(&w...)? Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

I just passed in the string length since the caller had that info, otherwise you'd scan the whole string again looking for the zero terminator.

orlp · on July 7, 2023

> But aren't you reading off the end of the buffer in your memcpy(&w...)?

If we go by the absolute strictest interpretation of the C standard my above implementation is UB.

But in practice, if p is word-aligned and is at least valid for 1 byte, then you will not pagefault for reading a whole word. In fact, this is how GCC/musl implement strlen itself.

> Say with an empty input string whose start address is aligned to sizeof(size_t) bytes?

Then the start address is valid (it must contain the null byte), and aligned to a word boundary, in which case I assume it is ok to also read a whole word there.

gpderetta · on July 7, 2023

If I read it correctly, your implementation might read beyond the end of the buffer, and if it crosses a page boundary into an unmapped page, it will segfault. That's one of the many evils of null terminated strings.

orlp · on July 7, 2023

If we go by the absolute strictest interpretation of the C standard, yes, the above is UB.

But in practice no one has page boundaries that cross word boundaries, and I align to a word boundary before doing the word-by-word loop.

gpderetta · on July 7, 2023

good point of course!