Hacker News

> That can lead you to some pretty counter-intuitive optimizations because it's often faster to do more compute work if it means you touch less memory in the process.

This is not specific to GPUs: this kind of optimization is pretty common on CPUs too, where latency kills you and 200 cycles spent on compute can actually be faster than a single cache miss spent fetching data. It is pretty common in many SIMD algorithms, actually.

Memory is currently lagging behind compute on almost every type of modern hardware, and the gap will very likely get worse, not better.


