In Java specifically, I've gotten good results by following this technique:

- start with an empty routine for the main loop to be optimized

- add the simplest loop you can that iterates over your source data, written so that it has a simple, obvious bounds check

- test the loop in a microbenchmark harness and, if necessary, look at the generated assembly to verify that bounds checks get hoisted out, the loop is being unrolled, etc.

- slowly add more code and continuously verify that assumptions are met

- if the loop body gets too complex and the JVM stops optimizing it properly, try to split the loop into multiple separate (possibly nested) loops that each live in their own function, and repeat the process

- when finished, experiment with inlining the split-out loops again
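
To make the steps above concrete, here is a minimal sketch of what the starting loop and a later split might look like. The class and method names (LoopSketch, sum, scale, scaledSum) are made up for illustration, and whether the JIT actually hoists the bounds checks or unrolls these loops is something you'd still have to verify yourself for your JVM, e.g. in a harness like JMH and by reading the generated assembly.

```java
final class LoopSketch {
    // Starting point: the simplest loop over the source data. The bound is
    // data.length itself, so the range of i is obvious to the compiler.
    static long sum(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            total += data[i];
        }
        return total;
    }

    // A second simple loop in its own method, instead of growing the body
    // of the first one. The bound is computed once, up front.
    static void scale(int[] src, int[] dst, int factor) {
        int n = Math.min(src.length, dst.length);
        for (int i = 0; i < n; i++) {
            dst[i] = src[i] * factor;
        }
    }

    // The "complex" operation expressed as two simple loops in sequence.
    // Allocating scratch here is only for the sake of a short example.
    static long scaledSum(int[] src, int factor) {
        int[] scratch = new int[src.length];
        scale(src, scratch, factor);
        return sum(scratch);
    }

    public static void main(String[] args) {
        int[] data = {1, 2, 3, 4};
        System.out.println(sum(data));
        System.out.println(scaledSum(data, 2));
    }
}
```

Once each piece is known to compile well, fusing scale and sum back into one loop is the "experiment with inlining" step; it's worth re-checking the generated code after doing that rather than assuming it stayed good.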

Coming at it from the other angle - a complex loop that you poke until the generated code is decent - is much more frustrating. For example, manually unrolling loops is fairly pointless: you really want the JVM to do the unrolling, because it can drop bounds checks that you can't eliminate yourself. Start with the JVM producing good, fast code, and keep it that way as the complexity grows.
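
For anyone unsure what "manually unrolling" means here, this is a rough sketch of the two shapes. The names (UnrollSketch, sumPlain, sumUnrolledByHand) are hypothetical, and the example says nothing about timings; it only shows that the hand-unrolled version computes the same result with more index arithmetic (i + 1, i + 2, i + 3) and a tail loop, all of which the JIT then has to reason about.

```java
final class UnrollSketch {
    // The plain loop: one array access per iteration, bound is data.length.
    static long sumPlain(int[] data) {
        long total = 0;
        for (int i = 0; i < data.length; i++) {
            total += data[i];
        }
        return total;
    }

    // Hand-unrolled by 4, plus a tail loop for the leftover elements.
    // Each data[i + k] is a separate access the compiler must prove in range.
    static long sumUnrolledByHand(int[] data) {
        long total = 0;
        int i = 0;
        int limit = data.length - 3; // last i for which i + 3 is a valid index, plus one
        for (; i < limit; i += 4) {
            total += data[i];
            total += data[i + 1];
            total += data[i + 2];
            total += data[i + 3];
        }
        for (; i < data.length; i++) { // remaining 0 to 3 elements
            total += data[i];
        }
        return total;
    }
}
```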