WebGPU can't even come close, unfortunately, since it has no support for hardware-specific memory or warp-level primitives (like TMA or tensor cores). It's not like it gets 80% of perf; it gets under 30% of peak for anything involving compute-heavy matrix multiplication.
I tried using workgroup shared memory and found it slower than just recomputing everything in each thread, although I may have been doing something dumb (roughly the kind of tiled kernel sketched below).
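For anyone curious, here's a minimal sketch of what I mean by the workgroup-shared-memory approach, with the WGSL embedded in a TypeScript string. The binding layout, the Dims struct, and the 16x16 tile size are all illustrative assumptions, not anything from a real codebase:

```ts
// Hypothetical tiled matmul: each 16x16 workgroup stages tiles of A and B
// in workgroup memory, then every thread accumulates from the shared tiles.
const tiledMatmulWGSL = /* wgsl */ `
  struct Dims { M : u32, N : u32, K : u32 }

  @group(0) @binding(0) var<storage, read>       A    : array<f32>; // M x K, row-major
  @group(0) @binding(1) var<storage, read>       B    : array<f32>; // K x N, row-major
  @group(0) @binding(2) var<storage, read_write> C    : array<f32>; // M x N, row-major
  @group(0) @binding(3) var<uniform>             dims : Dims;

  const TILE : u32 = 16u;

  var<workgroup> tileA : array<array<f32, 16>, 16>;
  var<workgroup> tileB : array<array<f32, 16>, 16>;

  @compute @workgroup_size(16, 16)
  fn main(@builtin(global_invocation_id) gid : vec3<u32>,
          @builtin(local_invocation_id)  lid : vec3<u32>) {
    let row = gid.y;
    let col = gid.x;
    var acc = 0.0;
    let numTiles = (dims.K + TILE - 1u) / TILE;

    // No early return: every thread must reach the barriers below,
    // so out-of-range threads just load zeros and skip the final store.
    for (var t = 0u; t < numTiles; t = t + 1u) {
      let aCol = t * TILE + lid.x;
      let bRow = t * TILE + lid.y;

      // Cooperative load: each thread fetches one element of each tile,
      // zero-padding past the matrix edges.
      var aVal = 0.0;
      if (row < dims.M && aCol < dims.K) {
        aVal = A[row * dims.K + aCol];
      }
      tileA[lid.y][lid.x] = aVal;

      var bVal = 0.0;
      if (bRow < dims.K && col < dims.N) {
        bVal = B[bRow * dims.N + col];
      }
      tileB[lid.y][lid.x] = bVal;

      workgroupBarrier(); // tiles fully written before anyone reads them

      for (var k = 0u; k < TILE; k = k + 1u) {
        acc = acc + tileA[lid.y][k] * tileB[k][lid.x];
      }

      workgroupBarrier(); // done reading before the next load overwrites
    }

    if (row < dims.M && col < dims.N) {
      C[row * dims.N + col] = acc;
    }
  }
`;
```

Even with this shape, the two workgroupBarrier() calls per tile plus whatever the browser's shader compiler does with the workgroup arrays can apparently eat the bandwidth savings, which would match what I saw.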