
A couple of years ago, I wanted much the same thing in HLSL, for a Direct3D 11.0 compute shader. Here’s the fastest version I managed to write back then: https://github.com/Const-me/Cgml/blob/master/Mistral/Mistral...

As you can see, I implemented 32×32 tiling using thread groups of 32×8 threads and two groupshared buffers that hold tiles of the input matrices. Each thread accumulates into local variables: 32 / 8 = 4 accumulators per thread.
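
For readers who don't want to dig through the shader, here is a rough CPU-side emulation of that scheme in plain Python. It is only a sketch, not the linked HLSL: the function name `tiled_matmul` is made up for illustration, and the mapping of each thread to output rows gy, gy + 8, gy + 16, gy + 24 is my assumption about how the 4 accumulators could be laid out. It also assumes all matrix dimensions are multiples of 32.

```python
TILE = 32                    # tile edge: 32x32 tiles
GROUP_X, GROUP_Y = 32, 8     # thread group of 32x8 threads
ACC = TILE // GROUP_Y        # 32 / 8 = 4 accumulators per thread


def tiled_matmul(a, b):
    """Multiply a (M x K) by b (K x N), both given as lists of rows.

    Emulates one thread group per 32x32 output tile. All dimensions
    must be multiples of TILE (an assumption of this sketch).
    """
    m, k, n = len(a), len(b), len(b[0])
    assert m % TILE == 0 and k % TILE == 0 and n % TILE == 0
    c = [[0.0] * n for _ in range(m)]
    for ty in range(0, m, TILE):
        for tx in range(0, n, TILE):
            # Per-thread local accumulators, indexed [gy][gx][i].
            acc = [[[0.0] * ACC for _ in range(GROUP_X)] for _ in range(GROUP_Y)]
            for tk in range(0, k, TILE):
                # Stand-ins for the two groupshared buffers; a real shader
                # loads them cooperatively and then hits a group barrier.
                tile_a = [row[tk:tk + TILE] for row in a[ty:ty + TILE]]
                tile_b = [row[tx:tx + TILE] for row in b[tk:tk + TILE]]
                for gy in range(GROUP_Y):
                    for gx in range(GROUP_X):
                        for i in range(ACC):
                            r = gy + i * GROUP_Y   # assumed row mapping
                            s = acc[gy][gx][i]
                            for kk in range(TILE):
                                s += tile_a[r][kk] * tile_b[kk][gx]
                            acc[gy][gx][i] = s
            # Each thread writes its 4 accumulated values to the output tile.
            for gy in range(GROUP_Y):
                for gx in range(GROUP_X):
                    for i in range(ACC):
                        c[ty + gy + i * GROUP_Y][tx + gx] = acc[gy][gx][i]
    return c
```

The point of the layout is that every value loaded into a shared tile gets reused by many threads, and each thread keeps its 4 partial sums in registers until the whole K dimension has been consumed.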



What's the perf like?


Sorry, I have not benchmarked it against cuBLAS, Eigen, or similar libraries; I wrote it for ML inference.

I have implemented a profiler on top of D3D11_QUERY_TIMESTAMP and D3D11_QUERY_TIMESTAMP_DISJOINT queries, and tweaked the compute shader to minimize the time reported by these queries for my specific use case.
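
The arithmetic behind that kind of measurement is simple enough to sketch in Python. This is not the profiler itself, just the conversion step, and the helper name `gpu_elapsed_ms` is hypothetical. It assumes you have already read back two timestamp tick counts plus the `Frequency` and `Disjoint` fields that the disjoint query returns (`D3D11_QUERY_DATA_TIMESTAMP_DISJOINT`).

```python
def gpu_elapsed_ms(begin_ticks, end_ticks, frequency, disjoint):
    """Convert two GPU timestamp readings into milliseconds.

    frequency is in ticks per second. Returns None when the disjoint
    flag is set (the timestamps are then unreliable) or when the
    frequency is zero.
    """
    if disjoint or frequency == 0:
        return None
    return (end_ticks - begin_ticks) * 1000.0 / frequency
```

For example, `gpu_elapsed_ms(1000, 3000, 1_000_000, False)` gives 2.0 ms, while any measurement taken with the disjoint flag set is discarded rather than trusted.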



