
For a very deep dive into the subject, this is a great writeup:

How to Optimize a CUDA Matmul Kernel for cuBLAS-like Performance (https://siboehm.com/articles/22/CUDA-MMM)

(It's CUDA-specific, so there may be aspects that can't yet be ported to WGPU)



This was a huge inspiration for the post! I tried to highlight it in the blog, but it might have gotten buried.

There are a few things I couldn't figure out how to get access to, or wasn't sure were even possible. For example, a lot of Simon's article takes advantage of the warp scheduler and warp tiling.

I had a hard time finding information on whether that's even possible with my M2 and Metal, or on the general memory access patterns. CUDA seems to have much better documentation in this regard.
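
For readers unfamiliar with the term: warp tiling adds an extra level to the tiling hierarchy, between the block tile in shared memory and each thread's register tile. A rough structural sketch in CUDA (tile sizes and names are made up for illustration; the loads and stores are elided as comments, so this is a shape, not a working kernel):

```cuda
// Hypothetical tile sizes, chosen only for illustration.
#define BM 128  // block tile rows     (C tile computed per thread block)
#define BN 128  // block tile cols
#define BK 16   // K-slice loaded into shared memory per iteration
#define WM 64   // warp tile rows      (C sub-tile computed per warp)
#define WN 32   // warp tile cols
#define TM 8    // thread tile rows    (C micro-tile held in registers)
#define TN 4    // thread tile cols

__global__ void sgemm_warptiled_sketch(const float *A, const float *B,
                                       float *C, int M, int N, int K) {
    __shared__ float As[BM * BK];
    __shared__ float Bs[BK * BN];

    // Level 2: which WM x WN warp tile inside the block tile this warp owns.
    const int warpId  = threadIdx.x / warpSize;
    const int warpRow = warpId / (BN / WN);
    const int warpCol = warpId % (BN / WN);
    (void)warpRow; (void)warpCol;

    // Level 3: per-thread accumulator registers for a TM x TN micro-tile.
    float acc[TM][TN] = {};

    for (int k0 = 0; k0 < K; k0 += BK) {
        // Level 1: all threads cooperatively load the BM x BK slice of A
        // and the BK x BN slice of B into shared memory. (elided)
        __syncthreads();
        for (int k = 0; k < BK; ++k) {
            // Each thread reads its slice of the warp tile from As/Bs
            // and accumulates TM * TN fused multiply-adds into acc. (elided)
        }
        __syncthreads();
    }
    // Write acc back to the thread's TM x TN piece of C. (elided)
}
```

The payoff of the warp level is that threads in the same warp reuse the same shared-memory rows/columns, which improves bank-conflict behavior and register reuse. Whether Metal exposes enough warp-level (simdgroup-level) control to replicate this is exactly the open question above.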


That's a nice tutorial, but just to be clear: it's not a deep dive in any sense. It covers the bog-standard tricks, but not MMA and WMMA, which today are table stakes for a fast matmul, and not software pipelining either. It's basically a good summary of the basics.
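
For context on what the tutorial skips: WMMA is CUDA's C++ API for issuing warp-wide tensor-core matrix multiply-accumulate instructions. A minimal sketch, following the standard pattern from NVIDIA's docs (assumes half-precision inputs and that M, N, K are multiples of 16; grid/block sizing is left to the caller):

```cuda
#include <mma.h>
using namespace nvcuda;

// Each warp computes one 16x16 tile of C = A * B, with A (MxK) and
// B (KxN) in row-major half precision and C accumulated in float.
__global__ void wmma_gemm(const half *A, const half *B, float *C,
                          int M, int N, int K) {
    // One warp per output tile: warpM/warpN index 16x16 tiles of C.
    int warpM = (blockIdx.x * blockDim.x + threadIdx.x) / warpSize;
    int warpN = blockIdx.y * blockDim.y + threadIdx.y;

    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c_frag;
    wmma::fill_fragment(c_frag, 0.0f);

    for (int k = 0; k < K; k += 16) {
        // Load one 16x16 tile of A and B; the last argument is the
        // leading dimension of the source matrix.
        wmma::load_matrix_sync(a_frag, A + warpM * 16 * K + k, K);
        wmma::load_matrix_sync(b_frag, B + k * N + warpN * 16, N);
        // One warp-wide tensor-core multiply-accumulate.
        wmma::mma_sync(c_frag, a_frag, b_frag, c_frag);
    }
    wmma::store_matrix_sync(C + warpM * 16 * N + warpN * 16, c_frag, N,
                            wmma::mem_row_major);
}
```

A production kernel would stage tiles through shared memory and overlap loads with MMA issue (the software pipelining mentioned above); this sketch only shows the fragment API itself.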


It was a deep dive as of maybe 2015. I don't know if anyone has done something similar for modern GEMMs. Maybe the CUTLASS or Colfax people?



