HarHarVeryFunny on July 16, 2024 | on: Run CUDA, unmodified, on AMD GPUs

How much performance difference is there between writing a kernel in a high-level language/framework like PyTorch (torch.compile) or Triton, and hand-optimizing? Are you writing kernels in PTX? What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good as or better than (most) humans writing hand-optimized PTX?

The CUDA version of LCZero was around 2x or 3x faster than the TensorFlow(?) version, iirc.