I don't really see how any code that depends heavily on the underlying hardware can "just work" on AMD. Most serious CUDA code is aware of register file and shared memory sizes, wgmma instructions, optimal tensor core memory & register layouts, tensor memory accelerator instructions, etc...
Presumably that stuff doesn't "just work" but they don't want to mention it?
A lot of our hw-aware bits are parameterized, where we fill in constants based on the available hw. Doable to port, same as we do whenever new Nvidia architectures come out.
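For a sense of what that kind of parameterization can look like (a minimal sketch under my own assumptions, not their actual code): query the device once and derive constants from the reported limits instead of hard-coding an architecture. The same pattern works with hipGetDeviceProperties on AMD.

```
#include <cuda_runtime.h>

// Hypothetical helper: derive hw-aware constants from device properties
// rather than baking in a particular architecture.
struct HwParams {
    int smem_per_block;   // usable shared memory per block, in bytes
    int warp_size;        // 32 on NVIDIA; 64 on most AMD wavefronts
    int max_threads;      // per-block thread limit
};

HwParams query_hw_params(int device) {
    cudaDeviceProp prop{};
    cudaGetDeviceProperties(&prop, device);
    return HwParams{
        static_cast<int>(prop.sharedMemPerBlock),
        prop.warpSize,
        prop.maxThreadsPerBlock,
    };
}
```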
But yeah, we have tricky bits that inline PTX, and... that will be more annoying to redo.
Oh, that will be interesting to understand, as PTX gets into trickier hw-arch-specific phenomena that different brands disagree on, like memory models. Neat!
Looks like the PTX translation is via another project ZLUDA, though how they bridge the differences in memory/consistency/etc models safely remains unclear to me...
SCALE does not use any part of ZLUDA. We have modified the clang frontend to convert inline PTX asm blocks to LLVM IR.
To put it in a less compiler-engineer-y way: for any given block of PTX, there exists a hypothetical sequence of C++/CUDA code you could have written to achieve the same effect, but on AMD (perhaps using funky __builtin_... functions if the code includes shuffles/ballots/other-weird-gpu-stuff). Our compiler effectively converts the PTX into that hypothetical C++.
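To make that concrete, here's a hedged illustration of the idea (not SCALE's actual output): a block of inline PTX doing a warp ballot, next to one hypothetical plain-CUDA equivalent a compiler could produce instead.

```
// Inline-PTX warp ballot: each lane contributes one bit based on its predicate.
__device__ unsigned ballot_ptx(int pred) {
    unsigned result;
    asm volatile(
        "{\n\t"
        ".reg .pred %%p1;\n\t"
        "setp.ne.s32 %%p1, %1, 0;\n\t"
        "vote.ballot.b32 %0, %%p1;\n\t"
        "}"
        : "=r"(result) : "r"(pred));
    return result;
}

// The hypothetical "C++ you could have written instead": same effect via an
// intrinsic, which on AMD would map to its own ballot builtin.
__device__ unsigned ballot_intrinsic(int pred) {
    return __ballot_sync(0xFFFFFFFFu, pred);   // CUDA 9+, full-warp mask
}
```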
Regarding memory consistency etc.: NVIDIA document the "CUDA memory consistency model" extremely thoroughly, and likewise, the consistency guarantees for PTX. It is therefore sufficient to ensure that we use operations at least as synchronising as those called for in the documented semantics of the language (be it CUDA or PTX, for each operation).
Differing consistency _between architectures_ is the AMDGPU backend's problem.
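As a hedged illustration of "at least as synchronising" (my sketch, not necessarily the actual lowering): a PTX acquire load at .gpu scope can be satisfied by anything with at-least-acquire ordering at device scope, even a deliberately over-synchronised fallback.

```
// Inline-PTX acquire load at GPU scope (sm_70+ / PTX ISA 6.0+).
__device__ int load_acquire_ptx(const int* p) {
    int v;
    asm volatile("ld.acquire.gpu.s32 %0, [%1];"
                 : "=r"(v) : "l"(p) : "memory");
    return v;
}

// A stronger-than-needed portable version: a plain load followed by a
// device-scope fence is at least as synchronising as the documented
// acquire semantics, just potentially slower.
__device__ int load_acquire_portable(const volatile int* p) {
    int v = *p;
    __threadfence();
    return v;
}
```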
Again, I'm guessing there might be an equivalent, simpler program involving AMD's __ballot, but I'm unsure of the true equivalence wrt safety, and it seems like a tricky rewrite as it needs to (afaict) decompile to recover the higher-level abstraction. Normally it's easier to compile down or sideways (translate), and it's not clear to me these primitives are 1:1 for doing so safely.
===
FWIW, this is all pretty cool. We stay away from PTX -- most of our app code is higher-level, whether RAPIDS (GPU dataframes, GPU ML, and other libs), minimal CUDA, or minimal OpenCL, with only small traces of inline PTX. So more realistically, if we had the motivation, we'd likely explore just #ifdef'ing it with something predictable, along the lines of the sketch below.
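Something like this (a hedged sketch: the helper name is made up, while the platform macros and special register are real):

```
// Keep the inline PTX on NVIDIA targets, fall back to something predictable
// elsewhere. Hypothetical helper, for illustration only.
__device__ inline unsigned lane_id() {
#if defined(__CUDA_ARCH__) && !defined(__HIP_PLATFORM_AMD__)
    unsigned lid;
    asm("mov.u32 %0, %%laneid;" : "=r"(lid));   // NVIDIA-only special register
    return lid;
#else
    return threadIdx.x % warpSize;              // portable fallback (1-D blocks)
#endif
}
```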
It's supposed to be waiting for all threads to vote
I'm not familiar enough with AMD to know if additional synchronization is needed. ChatGPT recommended adding barriers beyond what that gave, but again, I'm not familiar with AMD's instructions.
Yeah, I think __all_sync was added to CUDA as an intrinsic after this snippet was written. The divergent code before this was plain-ish CUDA, and this snippet ensures the threads wait on the comparison vote before recursing.
So in the AMD version, the compiler correctly realized the synchronization was on the comparison and added the AMD equivalent right before it. That seems like a straightforward transform here.
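A hedged reconstruction of the kind of pattern being described (not the actual snippet from the thread): each thread does a comparison, the warp votes on it, and recursion only continues while someone disagrees; the AMD flavour just swaps in its own vote intrinsic at the same point.

```
// CUDA flavour: warp-wide vote on the comparison before recursing.
__device__ bool keep_recursing_cuda(int my_key, int pivot) {
    int done = (my_key <= pivot);
    return !__all_sync(0xFFFFFFFFu, done);   // recurse while any lane disagrees
}

// What a translated HIP/AMD flavour could look like: the vote sits right at
// the comparison it synchronises on, over the whole wavefront.
__device__ bool keep_recursing_hip(int my_key, int pivot) {
    int done = (my_key <= pivot);
    return !__all(done);
}
```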
It'd be interesting to understand which Nvidia primitives map cleanly and which don't. The above is a fairly simple barrier. We avoided PTX as much as we could and wrote it as simply as we could, so I'd expect most of our PTX to port for similar reasons. The story is a bit different for libraries we call. E.g., cudf probably has little compute-tier PTX directly, but will call Nvidia libs and use weird IO bits like cuFile / GPU Direct Storage.
Do you reverse it back into C++ that does the corresponding FMAs manually instead of using tensor hardware? Or are you able to convert it into a series of __builtin_amdgcn_mfma_CDFmt_MxNxKABFmt instructions that emulate the same behavior?
Rather awkwardly, you've asked about an instruction that isn't currently implemented. :D Support for wmma and friends is in development.
But in general the answer to your question is yes: we use AMD-specific builtins where available/efficient to make things work. Otherwise many things would be unrepresentable, not just slow!
We add one: it's trivial to add a compiler builtin to carry the instruction from the frontend to the backend, provided the instruction exists and the backend knows about it.
If there's no such instruction at all, you can instead write a C++ function to replicate the behaviour and codegen a call to it. Since the PTX blocks are expanded during initial IR generation, it all inlines nicely by the end. Of course, such software emulation is potentially suboptimal (it depends on the situation).
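A hedged sketch of that emulation path (nothing like a real wmma/mfma lowering, and it ignores the per-lane fragment layout real tensor hardware uses; it's just the shape of the idea):

```
#include <cuda_fp16.h>

// Naive software stand-in for a 16x16x16 matrix-multiply-accumulate tile,
// computed with plain FMAs. A call to something like this could be emitted
// when no matching hardware instruction exists.
__device__ void mma_16x16x16_emulated(float* D, const __half* A,
                                      const __half* B, const float* C) {
    for (int m = 0; m < 16; ++m) {
        for (int n = 0; n < 16; ++n) {
            float acc = C[m * 16 + n];
            for (int k = 0; k < 16; ++k) {
                acc = fmaf(__half2float(A[m * 16 + k]),
                           __half2float(B[k * 16 + n]), acc);
            }
            D[m * 16 + n] = acc;
        }
    }
}
```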
It's speculation, but I think it's similar with CPUs: nobody guarantees the code will run exactly the way you set it up. You may want to use some specific register, but if the processor thinks another register can fulfill the task, it'll use that one while still behaving as if your code executed as expected. Maybe AMD's GPU can sufficiently simulate the behavior of Nvidia hardware so that higher abstractions are unaware that something different is happening under the hood.
Pretty much. Compilers can do a lot more than people give them credit for. At least AMD document their hardware, so it is actually possible to know low-level details. PTX can obfuscate that surprisingly badly for Nvidia targets.