As you can see, I implemented 32×32 tiling using thread groups of 32×8 threads, with two groupshared buffers to stage tiles of the input matrices. Each thread accumulates into local variables: 32 / 8 = 4 accumulators per thread.
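Roughly, the shader structure looks like this. It's a simplified sketch rather than the exact code; the buffer names, constant buffer layout, and the assumptions of row-major float matrices with dimensions that are multiples of 32 are all illustrative:

    // Sketch of a tiled matmul: C (M x N) = A (M x K) * B (K x N), row-major.
    // Assumes M, N, K are multiples of 32; no edge handling for brevity.
    #define TILE 32
    #define ROWS_PER_THREAD 4   // 32 rows per tile / 8 threads in Y

    cbuffer Constants : register(b0)
    {
        uint M, N, K;
    };

    StructuredBuffer<float>   matA : register(t0);
    StructuredBuffer<float>   matB : register(t1);
    RWStructuredBuffer<float> matC : register(u0);

    // Two groupshared buffers holding the current 32x32 tiles of A and B.
    groupshared float tileA[TILE][TILE];
    groupshared float tileB[TILE][TILE];

    [numthreads(32, 8, 1)]
    void main(uint3 groupId : SV_GroupID, uint3 tid : SV_GroupThreadID)
    {
        const uint col = groupId.x * TILE + tid.x;                        // output column
        const uint rowBase = groupId.y * TILE + tid.y * ROWS_PER_THREAD;  // first of 4 output rows

        float acc[ROWS_PER_THREAD] = { 0.0f, 0.0f, 0.0f, 0.0f };

        for (uint k0 = 0; k0 < K; k0 += TILE)
        {
            // Each thread loads 4 elements of each input tile (8 x 4 = 32 rows).
            [unroll]
            for (uint i = 0; i < ROWS_PER_THREAD; i++)
            {
                uint r = tid.y * ROWS_PER_THREAD + i;
                tileA[r][tid.x] = matA[(groupId.y * TILE + r) * K + (k0 + tid.x)];
                tileB[r][tid.x] = matB[(k0 + r) * N + col];
            }
            GroupMemoryBarrierWithGroupSync();

            // Accumulate partial dot products for the 4 output rows of this thread.
            for (uint k = 0; k < TILE; k++)
            {
                float b = tileB[k][tid.x];
                [unroll]
                for (uint i = 0; i < ROWS_PER_THREAD; i++)
                    acc[i] += tileA[tid.y * ROWS_PER_THREAD + i][k] * b;
            }
            GroupMemoryBarrierWithGroupSync();
        }

        [unroll]
        for (uint i = 0; i < ROWS_PER_THREAD; i++)
            matC[(rowBase + i) * N + col] = acc[i];
    }

Each 32×8 group produces one 32×32 tile of the output, so every thread owns a 1×4 column strip of that tile, which is where the 4 accumulators come from.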
Sorry, I haven't benchmarked against cuBLAS, Eigen, or similar; I built this for ML inference.
I implemented a profiler on top of D3D11_QUERY_TIMESTAMP and D3D11_QUERY_TIMESTAMP_DISJOINT queries, and tuned the compute shader to minimize the time those queries report for my specific use case.
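The timing pattern with those queries is roughly the following. This is a simplified sketch, not my actual profiler; the function and variable names are placeholders, and the spin-wait on GetData is fine for offline measurement but not something you'd do in a real-time loop:

    // Sketch of GPU timing with D3D11 timestamp queries around one dispatch.
    #include <d3d11.h>
    #include <cstdint>
    #include <cstdio>

    void TimeDispatch(ID3D11Device* device, ID3D11DeviceContext* ctx)
    {
        D3D11_QUERY_DESC desc = {};
        ID3D11Query *disjoint = nullptr, *tsBegin = nullptr, *tsEnd = nullptr;

        desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
        device->CreateQuery(&desc, &disjoint);
        desc.Query = D3D11_QUERY_TIMESTAMP;
        device->CreateQuery(&desc, &tsBegin);
        device->CreateQuery(&desc, &tsEnd);

        ctx->Begin(disjoint);
        ctx->End(tsBegin);                 // timestamp before the work

        // ... the compute shader dispatch being profiled goes here ...
        // ctx->Dispatch(groupsX, groupsY, 1);

        ctx->End(tsEnd);                   // timestamp after the work
        ctx->End(disjoint);

        // Spin until the GPU has produced the results (OK for offline profiling).
        D3D11_QUERY_DATA_TIMESTAMP_DISJOINT dj = {};
        while (ctx->GetData(disjoint, &dj, sizeof(dj), 0) == S_FALSE) {}

        uint64_t t0 = 0, t1 = 0;
        while (ctx->GetData(tsBegin, &t0, sizeof(t0), 0) == S_FALSE) {}
        while (ctx->GetData(tsEnd, &t1, sizeof(t1), 0) == S_FALSE) {}

        // Timestamps are only meaningful if the GPU clock wasn't disjoint
        // (e.g. no frequency change) between Begin and End.
        if (!dj.Disjoint)
            printf("GPU time: %.3f ms\n", double(t1 - t0) * 1000.0 / double(dj.Frequency));

        tsEnd->Release();
        tsBegin->Release();
        disjoint->Release();
    }

The disjoint query supplies the tick frequency and tells you whether the measurement is valid at all; the two timestamp queries bracket the dispatch you care about.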