How much performance difference is there between writing a kernel in a high level language/framework like PyTorch (torch.compile) or Triton, and hand optimizing? Are you writing kernels in PTX?
What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good or better than (most) humans writing hand-optimized PTX?
What's your opinion on the future of writing optimized GPU code/kernels - how long before compilers are as good or better than (most) humans writing hand-optimized PTX?