In languages with GCs or managed in some other way, knowing the upper bound of loop iterations is also important because it allows you to remove the check-if-we-need-to-stop-the-world operation from the loop.
This applies even if the loop isn't unrolled. For example if I see a loop that definitely doesn't iterate more than 2^32 times (because it doesn't overflow) and I can see it only has a few simple instructions in it, then I may decide to omit the check from the loop because I know it can't take that long.
In Java specifically, I've gotten good results by following this technique:
- start with an empty routine for the main loop to be optimized
- add the simplest loop you can that iterates over your source data, written such that has a simple obvious bounds check
- test the loop in a microbenchmark harness and look at the generated assembly if necessary to verify that bounds checks get hoisted out, the loop is being unrolled, etc.
- slowly add more code and continuously verify that assumptions are met
- if the loop body gets too complex and the JVM stops optimizing it properly, try and split the loop into multiple seperate (possibly nested) loops that are in their own functions, and repeat the process
- when finished, experiment with inlining loops again
Coming at it from the other angle - a complex loop that you poke until the generated code is decent - is much more frustrating. E.g. manually unrolling loops is fairly pointless - you really really want the JVM to unroll the loops, because it will do so with fewer bounds checks that you can't eliminate yourself. Start with the JVM producing good fast code, and keep it that way as the complexity grows.
Yes I understand that the java.util.concurrent code includes a lot of loops that look unusual, and in fact they're carefully constructed in such a way as to just avoid the poll (the check-if-we-need-to-stop-the-world part) for inner-loop concurrency operations.
Just checked on my system (fairly recent macbook pro) and a loop that does nothing but increment a memory location and runs 2^32 times takes 8 seconds. Doesn't Java stop the world more often than that?
I think something that accesses main memory goes beyond 'simple' instructions for this definition.
This is an example of a Java compiler phase doing it, although here it looks like it's only doing it for empty loops (or loops that end up empty due to hoisting), so it's not quite what was going on about https://github.com/graalvm/graal-core/blob/0a68438187e7e3196....
This applies even if the loop isn't unrolled. For example if I see a loop that definitely doesn't iterate more than 2^32 times (because it doesn't overflow) and I can see it only has a few simple instructions in it, then I may decide to omit the check from the loop because I know it can't take that long.