I don't really see how any code that depends heavily on the underlying hardware can "just work" on AMD. Most serious CUDA code is aware of register file and shared memory sizes, wgmma instructions, optimal tensor core memory & register layouts, tensor memory accelerator instructions, etc...
Presumably that stuff doesn't "just work" but they don't want to mention it?
A lot of our hw-aware bits are parameterized, where we fill in constants based on the available hw. Doable to port, same as we do whenever new Nvidia architectures come out.
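For a sense of what that kind of parameterization can look like (a minimal sketch under my own assumptions, not their actual code): query the device once and derive constants from the reported limits instead of hard-coding an architecture. The same pattern works with hipGetDeviceProperties on AMD.

```
#include <cuda_runtime.h>

// Hypothetical helper: derive hw-aware constants from device properties
// rather than baking in a particular architecture.
struct HwParams {
    int smem_per_block;   // usable shared memory per block, in bytes
    int warp_size;        // 32 on NVIDIA; 64 on most AMD wavefronts
    int max_threads;      // per-block thread limit
};

HwParams query_hw_params(int device) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    return HwParams{
        static_cast<int>(prop.sharedMemPerBlock),
        prop.warpSize,
        prop.maxThreadsPerBlock,
    };
}
```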
But yeah, we have tricky bits that inline PTX, and... that will be more annoying to redo.
Oh, that will be interesting to understand, as PTX gets into trickier hw-arch-specific phenomena that different brands disagree on, like memory models. Neat!
Looks like the PTX translation is via another project ZLUDA, though how they bridge the differences in memory/consistency/etc models safely remains unclear to me...
SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm blocks to LLVM IR.
To put it in a less compiler-engineer-y way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++.
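To make that concrete, here's a hedged illustration of the idea (not SCALE's actual output): a block of inline PTX doing a warp ballot, next to one hypothetical plain-CUDA equivalent a compiler could produce instead.

```
// Inline-PTX warp ballot: each lane contributes one bit based on its predicate.
__device__ unsigned ballot_ptx(int pred) {
    unsigned result;
    asm volatile(
        "{\n\t"
        ".reg .pred %%p1;\n\t"
        "setp.ne.s32 %%p1, %1, 0;\n\t"
        "vote.ballot.b32 %0, %%p1;\n\t"
        "}"
        : "=r"(result) : "r"(pred));
    return result;
}

// The hypothetical "C++ you could have written instead": same effect via an
// intrinsic, which on AMD would map to its own ballot builtin.
__device__ unsigned ballot_intrinsic(int pred) {
    return __ballot_sync(0xFFFFFFFFu, pred);   // CUDA 9+, full-warp mask
}
```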
Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation).
Differing consistency _between architectures_ is the AMDGPU backend's problem.
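As a hedged illustration of "at least as synchronising" (my sketch, not necessarily the actual lowering): a PTX acquire load at .gpu scope can be satisfied by anything with at-least-acquire ordering at device scope, even a deliberately over-synchronised fallback.

```
// Inline-PTX acquire load at GPU scope (sm_70+ / PTX ISA 6.0+).
__device__ int load_acquire_ptx(const int* p) {
    int v;
    asm volatile("ld.acquire.gpu.s32 %0, [%1];"
                 : "=r"(v) : "l"(p) : "memory");
    return v;
}

// A stronger-than-needed portable version: a plain load followed by a
// device-scope fence is at least as synchronising as the documented
// acquire semantics, just potentially slower.
__device__ int load_acquire_portable(const volatile int* p) {
    int v = *p;
    __threadfence();
    return v;
}
```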
Again, I'm guessing there might be an equivalent, simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for doing so safely.
===
FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, and other libs), minimal CUDA, or minimal OpenCL, with only small traces of inline PTX. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable, along the lines of the sketch below.
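Something like this (a hedged sketch: the helper name is made up, while the platform macros and special register are real):

```
// Keep the inline PTX on NVIDIA targets, fall back to something predictable
// elsewhere. Hypothetical helper, for illustration only.
__device__ inline unsigned lane_id() {
#if defined(__CUDA_ARCH__) && !defined(__HIP_PLATFORM_AMD__)
    unsigned lid;
    asm("mov.u32 %0, %%laneid;" : "=r"(lid));   // NVIDIA-only special register
    return lid;
#else
    return threadIdx.x % warpSize;              // portable fallback (1-D blocks)
#endif
}
```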
It's supposed to be waiting for all threads to vote
I'm not familiar enough with AMD to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what that gave, but again, I'm not familiar with AMD's instructions.
Yeah, I think __all_sync was added to CUDA as an intrinsic after this snippet was written. The divergent code before this was plain-ish CUDA, and this snippet ensures the threads wait on the comparison vote before recursing.
So in the AMD version, the compiler correctly realized the synchronization was on the comparison and added the AMD equivalent right before it. That seems like a straightforward transform here.
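A hedged reconstruction of the kind of pattern being described (not the actual snippet from the thread): each thread does a comparison, the warp votes on it, and recursion only continues while someone disagrees; the AMD flavour just swaps in its own vote intrinsic at the same point.

```
// CUDA flavour: warp-wide vote on the comparison before recursing.
__device__ bool keep_recursing_cuda(int my_key, int pivot) {
    int done = (my_key <= pivot);
    return !__all_sync(0xFFFFFFFFu, done);   // recurse while any lane disagrees
}

// What a translated HIP/AMD flavour could look like: the vote sits right at
// the comparison it synchronises on, over the whole wavefront.
__device__ bool keep_recursing_hip(int my_key, int pivot) {
    int done = (my_key <= pivot);
    return !__all(done);
}
```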
It'd be interesting to understand which Nvidia primitives map cleanly and which don't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, so I'd expect most of our PTX to port for similar reasons. The story is a bit different for libraries we call. E.g., cudf probably has little compute-tier PTX directly, but will call Nvidia libs and use weird IO bits like cuFile / GPU Direct Storage.
Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?
Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.
But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!
We add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend, provided the instruction exists and the backend knows about it.
If there's no such instruction at all, you can instead write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (it depends on the situation).
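A hedged sketch of that emulation path (nothing like a real wmma/mfma lowering, and it ignores the per-lane fragment layout real tensor hardware uses; it's just the shape of the idea):

```
#include <cuda_fp16.h>

// Naive software stand-in for a 16x16x16 matrix-multiply-accumulate tile,
// computed with plain FMAs. A call to something like this could be emitted
// when no matching hardware instruction exists.
__device__ void mma_16x16x16_emulated(float* D, const __half* A,
                                      const __half* B, const float* C) {
    for (int m = 0; m < 16; ++m) {
        for (int n = 0; n < 16; ++n) {
            float acc = C[m * 16 + n];
            for (int k = 0; k < 16; ++k) {
                acc = fmaf(__half2float(A[m * 16 + k]),
                           __half2float(B[k * 16 + n]), acc);
            }
            D[m * 16 + n] = acc;
        }
    }
}
```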
It's speculation, but I think it's similar with CPUs: nobody guarantees the code will run exactly the way you set it up. You may want to use some specific register, but if the processor thinks another register can fulfill the task, it'll use that one while still behaving as if your code executed as expected. Maybe AMD's GPU can sufficiently simulate the behavior of Nvidia hardware so that higher abstractions are unaware that something different is happening under the hood.
Pretty much. Compilers can do a lot more than people give them credit for. At least AMD document their hardware, so it is actually possible to know low-level details. PTX can obfuscate that surprisingly badly for Nvidia targets.