Doesn't the Unix Shell implementation break the "threading" constraint, in that each program in the pipeline has its own main thread, and all the programs run concurrently?
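For concreteness, here is a minimal Rust sketch of what the shell sets up (the filename is made up, and the stages could be any programs): two separate processes, each with its own main thread, connected by a pipe and running concurrently.

    use std::process::{Command, Stdio};

    fn main() -> std::io::Result<()> {
        // Equivalent of `cat input.txt | wc -w`: two processes,
        // each with its own main thread, connected by a pipe.
        let cat = Command::new("cat")
            .arg("input.txt")
            .stdout(Stdio::piped())
            .spawn()?;
        let mut wc = Command::new("wc")
            .arg("-w")
            .stdin(Stdio::from(cat.stdout.unwrap()))
            .spawn()?;
        wc.wait()?;
        Ok(())
    }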
Or should the "threading" constraint really have been stated as "you can use multiple threads, but only if they all have their own address space"?
Alternatively, given the multi-core capabilities of even budget/mobile/small systems these days (even the Raspberry Pi 4 has a quad-core CPU), isn't restricting the implementation to a single thread a weird artificial limitation in 2022? Also, a GPU-based implementation would be interesting, and GPUs get most of their power from their massively parallel core architecture.
If I were doing the benchmark, I would just run it on a VM with a single physical core. There are good reasons for comparing single-threaded performance, since it may not be necessary or desirable to scale that way. For example, suppose it's a wc API or something; limiting each request to a single thread can be better since you get fairer scheduling and less data sharing, which is almost always better for multiprocessing. Even shared immutable data can cause problems, since the CPU may not realize the data is actually immutable.
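To make that last point concrete, here's a minimal Rust sketch (the workload is made up): the payload is never mutated, but every Arc::clone still writes the shared reference count, so the cache line holding that count bounces between cores anyway.

    use std::sync::Arc;
    use std::thread;

    fn main() {
        // Immutable payload shared across threads.
        let data: Arc<Vec<u8>> = Arc::new(vec![0u8; 1 << 20]);
        let handles: Vec<_> = (0..4)
            .map(|_| {
                let data = Arc::clone(&data);
                thread::spawn(move || {
                    // Made-up per-request work: clone the shared handle
                    // the way a thread-per-request server might. Each
                    // clone/drop writes the shared refcount, causing
                    // cache-coherence traffic despite the immutable data.
                    for _ in 0..1_000_000 {
                        let tmp = Arc::clone(&data);
                        drop(tmp);
                    }
                })
            })
            .collect();
        for h in handles {
            h.join().unwrap();
        }
    }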
I agree GPUs would probably do well on this problem, though it depends on how expensive the memory copying to/from the GPU ends up being. However, if you are going for all-out performance, it seems like you could use SIMD, like memchr or something.
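As a sketch of that SIMD option, assuming the Rust memchr crate (whose scanning is SIMD-accelerated on platforms that support it), counting lines the way wc -l does:

    // Assumes memchr = "2" in Cargo.toml.
    use memchr::memchr_iter;

    // Count lines the way `wc -l` does: one per newline byte.
    fn count_lines(data: &[u8]) -> usize {
        memchr_iter(b'\n', data).count()
    }

    fn main() {
        let data = std::fs::read("input.txt").expect("read input");
        println!("{}", count_lines(&data));
    }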
> If I were doing the benchmark, I would just run it on a VM with a single physical core.
I think that's definitely an environment worth benchmarking on - but I don't think it should be the only environment to benchmark on.
Also, I don't think it's a good reason to limit implementations to a single thread, even if that is your benchmark environment. It can be worth seeing how well an implementation that's capable of taking advantage of multiple cores/CPUs does when it's only given one core to work with.
It's probably worth optimising for and benchmarking different threading strategies too - does your implementation create one thread per "work unit" and let the scheduler work them all out, or does it create a thread pool with one thread per core (or maybe one thread per core, plus one) and assign work units to each thread until they're done? And how do those strategies work if they're only given a single core?
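Here's a rough Rust sketch of the one-thread-per-core strategy (newline counting stands in for the real work; real chunk boundaries for word counting would need more care):

    use std::thread;

    // One chunk per core; each scoped thread counts its own chunk.
    fn count_newlines_parallel(data: &[u8]) -> usize {
        let cores = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        // Round up so every byte lands in some chunk.
        let chunk = ((data.len() + cores - 1) / cores).max(1);
        thread::scope(|s| {
            data.chunks(chunk)
                .map(|c| s.spawn(move || c.iter().filter(|&&b| b == b'\n').count()))
                .collect::<Vec<_>>()
                .into_iter()
                .map(|h| h.join().unwrap())
                .sum()
        })
    }

On a single core this degenerates gracefully: the threads just run one after another, which is part of what makes it worth benchmarking both ways.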
The single core case is definitely worth testing, but it seems odd to limit implementations to a single thread because of it. If you think you can go faster with a threaded implementation, you should be able to try that out.
> … worth seeing how well an implementation that's capable of taking advantage of multiple cores/CPUs does when it's only given one core to work with.
I did something like that: programs written for multi-core were forced onto one core, alongside programs not written for multi-core.
IIRC, that difference wasn't something anyone ever expressed interest in.