
I compiled your function with SCALE for gfx1030:

        .p2align        2                               ; -- Begin function _Z15ptx_thread_voteff
        .type   _Z15ptx_thread_voteff,@function
  _Z15ptx_thread_voteff:                  ; @_Z15ptx_thread_voteff
  ; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        v_cmp_ge_f32_e32 vcc_lo, v0, v1
        s_cmp_eq_u32 vcc_lo, -1
        s_cselect_b32 s4, -1, 0
        v_cndmask_b32_e64 v0, 0, 1, s4
        s_setpc_b64 s[30:31]
  .Lfunc_end1:
        .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
                                        ; -- End function


What were the safety concerns you had? This code seems to be something like `return __all_sync(0xffffffff, rSq >= rCritSq) ? 1 : 0;`, right?
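
For reference, spelled out as a full function (the signature comes from demangling _Z15ptx_thread_voteff; treat this as a sketch reconstructed from the assembly, not necessarily your exact source):

    // Sketch reconstructed from the generated assembly, not the original source.
    // _Z15ptx_thread_voteff demangles to ptx_thread_vote(float, float).
    __device__ int ptx_thread_vote(float rSq, float rCritSq) {
        // __all_sync returns non-zero only if the predicate is true for every
        // participating thread in the warp (0xffffffff = all 32 lanes).
        return __all_sync(0xffffffff, rSq >= rCritSq) ? 1 : 0;
    }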


It's supposed to be waiting for all threads to vote.

I'm not familiar enough with AMD to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what SCALE generated, but again, I'm not familiar with AMD's instructions.


Indeed, no extra synchronisation is needed here due to the nature of the hardware (threads in a warp can't get out of sync with each other).

Even on NVIDIA, you could've written this without the asm, as discussed above!


Yeah I think, after this snippet was written, CUDA added __all_sync as an intrinsic. The divergent code before this was plain-ish CUDA, and this snippet makes the threads wait on the comparison vote before recursing.
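
If I remember right, the pre-intrinsic pattern was inline PTX along these lines (a rough sketch from memory using the same names as above, not the exact original code):

    // Rough sketch of the old inline-PTX vote idiom; not the exact original snippet.
    __device__ int ptx_thread_vote(float rSq, float rCritSq) {
        int result;
        asm volatile("{\n\t"
                     ".reg .pred p;\n\t"
                     "setp.ge.f32 p, %1, %2;\n\t"  // per-thread comparison
                     "vote.all.pred p, p;\n\t"     // true only if every active lane's p is true
                     "selp.s32 %0, 1, 0, p;\n\t"
                     "}"
                     : "=r"(result)
                     : "f"(rSq), "f"(rCritSq));
        return result;
    }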

So in the AMD version, the compiler correctly realized the synchronization was on the comparison, and emitted the AMD equivalent right before it. That seems like a straightforward transform here.

It'd be interesting to understand which NVIDIA primitives map cleanly vs which don't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, so I'd expect most of our PTX to port for similar reasons. The story is a bit different for the libraries we call. E.g., cuDF probably has little compute-tier PTX directly, but it will call NVIDIA libraries and use weird IO bits like cuFile / GPUDirect Storage.



