
For a very deep dive into the subject, this is a great writeup:

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)

(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)



This was a huge inspiration for the post! I tried to highlight it in the blog, but it might have gotten buried.

There are a few things I couldn't figure out how to get access to, or wasn't sure were even possible. For example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.

I had a hard time finding information on whether that's even possible with my M2 and Metal, or on the general memory access patterns. CUDA seems to have much better documentation in this regard.
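
For readers unfamiliar with the term: warp tiling adds an extra level to the tiling hierarchy, between the block tile in shared memory and each thread's register tile. A rough structural sketch in CUDA (tile sizes and names are made up for illustration; the loads and stores are elided as comments, so this is a shape, not a working kernel):

```cuda
// Hypothetical tile sizes, chosen only for illustration.
#define BM 128  // block tile rows     (C tile computed per thread block)
#define BN 128  // block tile cols
#define BK 16   // K-slice loaded into shared memory per iteration
#define WM 64   // warp tile rows      (C sub-tile computed per warp)
#define WN 32   // warp tile cols
#define TM 8    // thread tile rows    (C micro-tile held in registers)
#define TN 4    // thread tile cols

__global__ void sgemm_warptiled_sketch(const float *A, const float *B,
                                       float *C, int M, int N, int K) {
    __shared__ float As[BM * BK];
    __shared__ float Bs[BK * BN];

    // Level 2: which WM x WN warp tile inside the block tile this warp owns.
    const int warpId  = threadIdx.x / warpSize;
    const int warpRow = warpId / (BN / WN);
    const int warpCol = warpId % (BN / WN);
    (void)warpRow; (void)warpCol;

    // Level 3: per-thread accumulator registers for a TM x TN micro-tile.
    float acc[TM][TN] = {};

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Level 1: all threads cooperatively load the BM x BK slice of A
        // and the BK x BN slice of B into shared memory. (elided)
        __syncthreads();
        for (int k = 0; k < BK; ++k) {
            // Each thread reads its slice of the warp tile from As/Bs
            // and accumulates TM * TN fused multiply-adds into acc. (elided)
        }
        __syncthreads();
    }
    // Write acc back to the thread's TM x TN piece of C. (elided)
}
```

The payoff of the warp level is that threads in the same warp reuse the same shared-memory rows/columns, which improves bank-conflict behavior and register reuse. Whether Metal exposes enough warp-level (simdgroup-level) control to replicate this is exactly the open question above.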


That's a nice tutorial, but just to be clear: it's not a deep dive in any sense. It covers the bog-standard tricks, but not MMA and WMMA, which today are table stakes for a fast matmul, and not software pipelining either. It's basically a good summary of the basics.
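
For context on what the tutorial skips: WMMA is CUDA's C++ API for issuing warp-wide tensor-core matrix multiply-accumulate instructions. A minimal sketch, following the standard pattern from NVIDIA's docs (assumes half-precision inputs and that M, N, K are multiples of 16; grid/block sizing is left to the caller):

```cuda
#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B, with A (MxK) and
// B (KxN) in row-major half precision and C accumulated in float.
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
    // One warp per output tile: warpM/warpN index 16x16 tiles of C.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load one 16x16 tile of A and B; the last argument is the
        // leading dimension of the source matrix.
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        // One warp-wide tensor-core multiply-accumulate.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```

A production kernel would stage tiles through shared memory and overlap loads with MMA issue (the software pipelining mentioned above); this sketch only shows the fragment API itself.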


It was a deep dive as of maybe 2015. I don't know if anyone has done something similar for modern GEMMs. Maybe the CUTLASS or Colfax people?



