
I compiled your function with SCALE for gfx1030:

        .p2align        2                               ; -- Begin function _Z15ptx_thread_voteff
        .type   _Z15ptx_thread_voteff,@function
  _Z15ptx_thread_voteff:                  ; @_Z15ptx_thread_voteff
  ; %bb.0:                                ; %entry
        s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
        s_waitcnt_vscnt null, 0x0
        v_cmp_ge_f32_e32 vcc_lo, v0, v1
        s_cmp_eq_u32 vcc_lo, -1
        s_cselect_b32 s4, -1, 0
        v_cndmask_b32_e64 v0, 0, 1, s4
        s_setpc_b64 s[30:31]
  .Lfunc_end1:
        .size   _Z15ptx_thread_voteff, .Lfunc_end1-_Z15ptx_thread_voteff
                                        ; -- End function


What were the safety concerns you had? This code seems to be something like `return __all_sync(0xffffffff, rSq >= rCritSq) ? 1 : 0;`, right?
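
For reference, spelled out as a full function (the signature comes from demangling _Z15ptx_thread_voteff; treat this as a sketch reconstructed from the assembly, not necessarily your exact source):

    // Sketch reconstructed from the generated assembly, not the original source.
    // _Z15ptx_thread_voteff demangles to ptx_thread_vote(float, float).
    __device__ int ptx_thread_vote(float rSq, float rCritSq) {
        // __all_sync returns non-zero only if the predicate is true for every
        // participating thread in the warp (0xffffffff = all 32 lanes).
        return __all_sync(0xffffffff, rSq >= rCritSq) ? 1 : 0;
    }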


It's supposed to be waiting for all threads to vote.

I'm not familiar enough with AMD to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what SCALE generated, but again, I'm not familiar with AMD's instructions.


Indeed, no extra synchronisation is needed here due to the nature of the hardware (threads in a warp can't get out of sync with each other).

Even on NVIDIA, you could've written this without the asm, as discussed above!


Yeah I think, after this snippet was written, CUDA added __all_sync as an intrinsic. The divergent code before this was plain-ish CUDA, and this snippet makes the threads wait on the comparison vote before recursing.
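
If I remember right, the pre-intrinsic pattern was inline PTX along these lines (a rough sketch from memory using the same names as above, not the exact original code):

    // Rough sketch of the old inline-PTX vote idiom; not the exact original snippet.
    __device__ int ptx_thread_vote(float rSq, float rCritSq) {
        int result;
        asm volatile("{\n\t"
                     ".reg .pred p;\n\t"
                     "setp.ge.f32 p, %1, %2;\n\t"  // per-thread comparison
                     "vote.all.pred p, p;\n\t"     // true only if every active lane's p is true
                     "selp.s32 %0, 1, 0, p;\n\t"
                     "}"
                     : "=r"(result)
                     : "f"(rSq), "f"(rCritSq));
        return result;
    }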

So in the AMD version, the compiler correctly realized the synchronization was on the comparison, and emitted the AMD equivalent right before it. That seems like a straightforward transform here.

It'd be interesting to understand which NVIDIA primitives map cleanly vs which don't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, so I'd expect most of our PTX to port for similar reasons. The story is a bit different for the libraries we call. E.g., cuDF probably has little compute-tier PTX directly, but it will call NVIDIA libraries and use weird IO bits like cuFile / GPUDirect Storage.



