What is absolutely crazy is that a lot of "primitive computation" is basically free. Like, there are more than likely some memory stalls either way, and they just happen in the pauses rather than adding to the execution time.
I learned this the hard way when I was writing an emulator for a retro machine. Say you're blitting sprites to the screen. As you're drawing a line, each pixel either has some portion of a sprite on it, or doesn't. Your instinct is to do it just like the hardware and 'chase the beam', checking at each pixel, or even just at each line, whether a sprite is present.
Super wrong. You're far better off precomputing the whole sprite bitmap -- even if you don't end up using it -- and doing a bulk operation to display it or not display it. Doing that "is a sprite here?" check in a loop is super, super expensive, more expensive than just blitting the sprites and throwing the result away. The two approaches look roughly like the sketch below.
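A minimal sketch of the contrast, in C, under assumptions of my own (8-bit indexed pixels, one scanline at a time; the names here are hypothetical, not from any particular emulator):

```c
#include <stdint.h>

/* Naive "chase the beam" version: a data-dependent branch per pixel. */
void draw_line_branchy(uint8_t *dst, const uint8_t *sprite_px,
                       const uint8_t *sprite_present, int width)
{
    for (int x = 0; x < width; x++) {
        if (sprite_present[x])   /* hard-to-predict branch, taken per pixel */
            dst[x] = sprite_px[x];
    }
}

/* Precomputed version: the sprite line and a mask are built up front,
 * and the blit is unconditional -- select with bit ops, not a branch.
 * mask[x] is 0xFF where a sprite pixel exists, 0x00 elsewhere. */
void draw_line_bulk(uint8_t *dst, const uint8_t *sprite_px,
                    const uint8_t *mask, int width)
{
    for (int x = 0; x < width; x++)
        dst[x] = (uint8_t)((sprite_px[x] & mask[x]) | (dst[x] & ~mask[x]));
}
```

The second loop does "wasted" work when no sprite is present, but it has no branches for the predictor to miss, and a compiler can usually auto-vectorize it.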
The most efficient approach ended up being to precompute things into boolean bitset vectors and use SIMD operations to act on those.
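Roughly like this, if you write the selection out with explicit SSE4.1 intrinsics (again a hedged sketch with hypothetical names, assuming 8-bit pixels, a per-pixel 0xFF/0x00 mask, and a width that's a multiple of 16):

```c
#include <stdint.h>
#include <immintrin.h>   /* compile with -msse4.1 or equivalent */

/* Branch-free sprite compositing, 16 pixels per iteration.
 * mask[i] is 0xFF where the sprite pixel is opaque, 0x00 where transparent. */
void composite_line_sse(uint8_t *dst, const uint8_t *sprite_px,
                        const uint8_t *mask, int width)
{
    for (int x = 0; x < width; x += 16) {
        __m128i bg = _mm_loadu_si128((const __m128i *)(dst + x));
        __m128i fg = _mm_loadu_si128((const __m128i *)(sprite_px + x));
        __m128i m  = _mm_loadu_si128((const __m128i *)(mask + x));
        /* Pick fg where the mask byte's high bit is set, else bg:
         * 16 per-pixel selects in one instruction, no branches. */
        __m128i out = _mm_blendv_epi8(bg, fg, m);
        _mm_storeu_si128((__m128i *)(dst + x), out);
    }
}
```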
Even if you're not doing GPU stuff, the fastest way to compute these days is to think of everything in bulk, bulk, bulk. The hardware we have now is extremely efficient at vector and matrix operations, so try to take advantage of it.
(After doing this I have a hunch that a lot of the classic machine emulators out there, VICE etc., could be made way faster if they were rewritten in this way.)