
WebGPU cannot even come close, unfortunately, since it has no support for hardware-specific memory or warp-level primitives (like TMA or tensor cores). It's not like it gets 80% of peak performance; it gets under 30% of peak for anything involving compute-heavy matrix multiplications.
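To make that concrete: the portable version of a matmul in WGSL has to issue every multiply-accumulate as an ordinary scalar instruction per thread, and there is no way to express a tensor-core MMA. A minimal sketch of the naive baseline (the Params struct and binding layout are just illustrative, not from any particular codebase):

    struct Params {
      M : u32,
      N : u32,
      K : u32,
    }

    @group(0) @binding(0) var<uniform> params : Params;
    @group(0) @binding(1) var<storage, read> a : array<f32>;
    @group(0) @binding(2) var<storage, read> b : array<f32>;
    @group(0) @binding(3) var<storage, read_write> c : array<f32>;

    // One thread per output element of C = A * B. Every FMA here is a
    // plain ALU op, which is why utilization stays far below peak.
    @compute @workgroup_size(8, 8)
    fn main(@builtin(global_invocation_id) gid : vec3<u32>) {
      let row = gid.y;
      let col = gid.x;
      if (row >= params.M || col >= params.N) {
        return;
      }
      var acc = 0.0;
      for (var k = 0u; k < params.K; k = k + 1u) {
        acc = acc + a[row * params.K + k] * b[k * params.N + col];
      }
      c[row * params.N + col] = acc;
    }

Tiling and vectorization help, but they still bottom out at ordinary ALU instructions, so the gap to tensor-core kernels stays large.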


> don't have support for hardware specific memory

I have no experience with WebGPU, but if you mean group shared memory, I believe it is supported. See this demo: https://compute.toys/view/25
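For anyone who hasn't used it: WGSL exposes this through the workgroup address space plus workgroupBarrier(). A minimal sketch of a block-sum reduction staged through workgroup memory (binding layout is illustrative, and it assumes the input length is a multiple of the workgroup size):

    const WG_SIZE : u32 = 256u;

    @group(0) @binding(0) var<storage, read> input : array<f32>;
    @group(0) @binding(1) var<storage, read_write> partial_sums : array<f32>;

    // Group-shared scratch, visible to all threads in one workgroup.
    var<workgroup> tile : array<f32, WG_SIZE>;

    @compute @workgroup_size(WG_SIZE)
    fn main(@builtin(local_invocation_id) lid : vec3<u32>,
            @builtin(workgroup_id) wid : vec3<u32>,
            @builtin(global_invocation_id) gid : vec3<u32>) {
      // Stage one element per thread into workgroup memory.
      tile[lid.x] = input[gid.x];
      workgroupBarrier();

      // Tree reduction: halve the number of active threads each step.
      var stride = WG_SIZE / 2u;
      while (stride > 0u) {
        if (lid.x < stride) {
          tile[lid.x] = tile[lid.x] + tile[lid.x + stride];
        }
        workgroupBarrier();
        stride = stride / 2u;
      }

      if (lid.x == 0u) {
        partial_sums[wid.x] = tile[0u];
      }
    }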


I tried using workgroup shared memory and found it slower than just recomputing everything in each thread, although I may have been doing something dumb.

I'm excited to try subgroups, though: https://developer.chrome.com/blog/new-in-webgpu-128#experime...
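For anyone curious, in recent Chrome builds this looks roughly like the sketch below: an enable directive plus subgroup builtins and functions like subgroupAdd. The feature is experimental, has to be requested on the device, and the exact gating has changed across releases, so treat this as a rough illustration (binding layout is mine; input length assumed to be a multiple of the workgroup size). The appeal versus the workgroup-memory version above is that the reduction happens in registers, with no barriers:

    enable subgroups;

    @group(0) @binding(0) var<storage, read> input : array<f32>;
    @group(0) @binding(1) var<storage, read_write> partial_sums : array<f32>;

    @compute @workgroup_size(64)
    fn main(@builtin(global_invocation_id) gid : vec3<u32>,
            @builtin(local_invocation_id) lid : vec3<u32>,
            @builtin(workgroup_id) wid : vec3<u32>,
            @builtin(subgroup_invocation_id) lane : u32,
            @builtin(subgroup_size) sg_size : u32) {
      // Sum across the subgroup entirely in registers: no workgroup
      // memory traffic and no workgroupBarrier() needed.
      let sum = subgroupAdd(input[gid.x]);

      // One lane per subgroup writes a partial result. Subgroup size
      // is hardware-dependent (often 32 or 64), so the output index
      // is derived from it rather than hard-coded.
      if (lane == 0u) {
        let per_wg = 64u / sg_size;
        partial_sums[wid.x * per_wg + lid.x / sg_size] = sum;
      }
    }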


I've heard the WebGPU working group wants to close the gap on tensor core support.


You're definitely right; 80% was a bit of an overestimate, especially with respect to CUDA.

It would be cool to see if there's some way to get better access to those lower-level primitives, but I'd be surprised.

Subgroup support does seem like a step in the right direction, though!



