The Python speedup is probably from small integer caching: a sorted array will have runs of adjacent pointers to the same integers. The compiled-language one is probably branch prediction, right?
I intentionally stayed in the small integer range to avoid benchmarking the cache. 256 distinct values should fit into L1 just fine in both cases.
I'm now thinking that the difference might be even larger if we instead avoid small integers and let the CPU get stuck chasing pointers. The idea is that it stalls on a memory access, which forces it to speculate much further, which in turn makes it backtrack along a longer path if a branch was mispredicted. I'm obviously no expert on this, so feel free to correct me.
The results for 1B range instead of 255 are 17.6 ms for unsorted / 68.2 ms for sorted! We are back to what the original article observed and it's a way stronger effect than what branch prediction can offer. So don't sort your arrays, keep them in the order the boxed values were allocated ;)
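A minimal sketch of this kind of comparison, for anyone who wants to try it. It is not the exact benchmark behind the numbers above: the helper names (`make_data`, `conditional_sum`, `time_ms`, `compare`), the list size, and the midpoint threshold are all assumptions made for illustration, and absolute timings will differ by machine and interpreter version.

```python
import random
import time


def make_data(n, value_range, seed=0):
    # Hypothetical helper: n random ints in [0, value_range).
    rng = random.Random(seed)
    return [rng.randrange(value_range) for _ in range(n)]


def conditional_sum(data, threshold):
    # The branchy loop: only values at or above the threshold are added.
    total = 0
    for x in data:
        if x >= threshold:
            total += x
    return total


def time_ms(data, threshold, repeats=5):
    # Best-of-N wall-clock time in milliseconds.
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        conditional_sum(data, threshold)
        best = min(best, (time.perf_counter() - start) * 1000)
    return best


def compare(n, value_range):
    # Time the same list before and after sorting it in place.
    data = make_data(n, value_range)
    threshold = value_range // 2
    unsorted_ms = time_ms(data, threshold)
    data.sort()  # same objects, now ordered by value instead of allocation order
    sorted_ms = time_ms(data, threshold)
    return unsorted_ms, sorted_ms


if __name__ == "__main__":
    # 256 stays inside CPython's small-int cache; 10**9 gets one box per element.
    for value_range in (256, 10**9):
        u, s = compare(1_000_000, value_range)
        print(f"range {value_range}: unsorted {u:.1f} ms / sorted {s:.1f} ms")
```

Sorting in place matters here: `data.sort()` only reorders the pointers in the list, so the sorted run touches exactly the same int objects as the unsorted run, just in a different order.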
How big is the small integer object being pointed to? With alignment etc., I'm seeing some stuff saying 256 of them would fill an 8KB L1. Plus other interpreter data might overfill it. With a sorted array that would be less of an issue.
Yes, the larger-range one being slower when sorted makes sense, because the iteration order no longer matches the allocation order.
I don't know how large those boxes are, but a normal CPU L1 data cache has 32 or 48KB, which should be plenty for this. The Python opcodes for this program are going to be tiny, and the interpreter itself uses the L1 instruction cache (which is another 32-48KB). I hope the sequential scan of the big array won't flush the L1 cache (there should be 12-way associativity with LRU, so I don't see how it could).
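The box size and the number of distinct boxes can be checked directly. The following is a rough sketch assuming CPython, where `id()` returns the object's address and ints in a small range are shared from a cache; the helper `distinct_boxes` and the sample sizes are made up for illustration. `sys.getsizeof` reports the object size without any allocator padding, so the real footprint per box may be somewhat larger.

```python
import random
import sys

# Size in bytes of one boxed int as reported by the interpreter
# (implementation-specific, and excludes allocator padding).
print("sys.getsizeof(1) =", sys.getsizeof(1))


def distinct_boxes(values):
    # Number of distinct objects referenced by the list.
    # Every element is kept alive by the list, so each live object has a unique id().
    return len({id(v) for v in values})


rng = random.Random(0)
small = [rng.randrange(256) for _ in range(100_000)]    # inside the small-int cache
large = [rng.randrange(10**9) for _ in range(100_000)]  # far outside it

print("small range:", distinct_boxes(small), "boxes for", len(set(small)), "values")
print("large range:", distinct_boxes(large), "boxes for", len(set(large)), "values")
```

Multiplying the number of boxes by the reported size gives a lower bound on the working set the loop has to keep in cache, on top of the pointer array itself.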
Anyway, there is no need to have 256 integers, just 2 is enough. When I try that, the results are similar: 17.5 ms (unsorted) / 12.5 ms (sorted)
Then you are back to what the article discusses. Each integer is in a separate box, and those boxes are allocated in one order. Sorting the array by value will shuffle it by address, and it will be much slower. I tested this as well, see the other comment.
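One way to see the "shuffled by address" effect, sketched under the assumption of CPython (where `id()` is the memory address, an implementation detail). The helper name `ascending_fraction` is hypothetical. It measures how often the next element lives at a higher address than the current one; the exact numbers depend on the allocator, so none are claimed here.

```python
import random


def ascending_fraction(values):
    # Fraction of adjacent pairs whose addresses increase.
    # Relies on CPython's id() being the object's memory address.
    addrs = [id(v) for v in values]
    ups = sum(1 for a, b in zip(addrs, addrs[1:]) if b > a)
    return ups / (len(addrs) - 1)


rng = random.Random(0)
# Large range, so each element is its own freshly allocated box.
data = [rng.randrange(10**9) for _ in range(100_000)]

print("allocation order:", ascending_fraction(data))
data.sort()  # reorders the pointers; the boxes themselves do not move
print("sorted by value: ", ascending_fraction(data))
```

A fraction near 1.0 means a scan walks memory mostly front to back, which is friendly to the hardware prefetcher; a fraction near 0.5 means consecutive elements are at effectively random addresses.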