Python speed up is probably from small integer caching, a sorted array will have...

exyi · 2025-04-05T16:37:47 1743871067

I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.

I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it gets stuck on a memory access, which forces it to speculate much further, which in turn makes it backtrack a longer path if a branch was mispredicted. I'm obviously no expert on this, feel free to correct me

The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)

cma · 2025-04-05T17:02:53 1743872573

How big is the pointed to small integer? With alignment etc. I'm seeing some stuff saying 256 of them would fill an 8KB L1. Plus other stuff for the interpreter might overfill it. Sorted that would be less of an issue.

Larger range one being slower unsorted yes makes sense because of allocation order no longer matching the iteration order.

exyi · 2025-04-05T18:07:42 1743876462

I don't know how large are those boxes, but normal CPU L1 cache has 32 or 48KB which should be plenty for this. Python opcodes for this program are going to be tiny, and the interpreter itself uses the instruction-L1 cache (which is another 32-48KB). I hope the sequential scan of the big array won't flush the L1 cache (there should be 12-way associativity with LRU, so I don't see how it could).

Anyway, there is no need to have 256 integers, just 2 is enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted)

bgwalter · 2025-04-05T16:46:11 1743871571

That seems very likely. The benchmark should probably use a range that is guaranteed to be outside of the cached smallint range.

exyi · 2025-04-05T18:24:23 1743877463

Then you are back to what the article discusses. Each integer is in a separate box, those boxes are allocated in one order, sorting the array by value will shuffle it by address and it will be much slower. I tested this as well, see the other comment.