Hacker News | anthonix1's comments

... which also has a much lower power cap


Not that much lower, 295W vs 355W, and for LLM inference VRAM bandwidth is the main bottleneck. But the price is ridiculous.


Yeah I would suggest taking a look at PyTorch on AMD before saying stuff like "scaled_dot_product_attention is an NVIDIA CUDA kernel exposed as a PyTorch function", because that is demonstrably false.

Also, FWIW, I would suggest getting a small Llama 3.1 model training fast before trying to do a big 405B model -- faster to iterate and almost everything you'll learn on the small models will scale to the 405B.


Thanks for the feedback! I appreciate you pointing that out. My understanding was based on the PyTorch documentation for scaled_dot_product_attention (https://pytorch.org/docs/stable/generated/torch.nn.functiona...). - "The function may call optimized kernels for improved performance when using the CUDA backend. For all other backends, the PyTorch implementation will be used."

And I was trying to make a broader point about the lack of transparency (in performance, lower-level impl) in PyTorch when running on NVIDIA vs. non-NVIDIA hardware.
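For what it's worth, the "PyTorch implementation will be used" fallback the docs mention is just the textbook computation softmax(QK^T/sqrt(d))V. Here is a stdlib-only Python sketch of that reference math, purely illustrative (this is not PyTorch's actual fallback code, just the computation every backend must reproduce):

```python
import math

def sdpa_reference(q, k, v):
    """Reference scaled dot-product attention over plain Python lists of rows:
    softmax(q @ k^T / sqrt(d)) @ v. Every backend (fused CUDA kernel, ROCm,
    or the pure-PyTorch fallback) must reproduce this math; only the
    implementation strategy differs."""
    d = len(q[0])
    out = []
    for qi in q:
        # attention scores of this query against every key
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        m = max(scores)                       # stabilise the softmax
        e = [math.exp(s - m) for s in scores]
        z = sum(e)
        w = [x / z for x in e]                # softmax weights, rows sum to 1
        out.append([sum(wi * vj[c] for wi, vj in zip(w, v))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
out = sdpa_reference(q, q, q)   # self-attention on a tiny 2x2 example
```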


> And I was trying to make a broader point about the lack of transparency (in performance, lower-level impl) in PyTorch when running on NVIDIA vs. non-NVIDIA hardware.

I don't quite understand this argument. Lack of transparency from running PyTorch so instead we're gonna leave it all to XLA? How does this solve the "transparency" issue?


Having a common library function that is either lightning fast or dog slow depending on the hardware is not a great position to be in.

Moreover, this will get worse as more CUDA-specific features are added to PyTorch with ad-hoc fallback functions.

I guess OP is saying that XLA is more transparent in this regard, because it wouldn't use functions like these and the code it generates would be on par performance-wise?


> it wouldn’t use functions like these and the code it generates would be on par performance-wise

Perhaps if XLA generated all functions from scratch, this would be more compelling. But XLA relies very heavily on pattern-matching to common library functions (e.g. cuDNN), and those patterns will certainly work better on Nvidia GPUs than on AMD GPUs.

In this way, I think explicitly calling the common library functions is actually much more transparent.


[flagged]


are you at all confident that this isn't hallucinated? I'd never trust an answer like this from an LLM


Did you verify everything else it said is true?


Any direct comparisons to 8xH100? 2 toks/sec seems very slow!

I haven't done any LoRA training on MI300x myself, but I have done Llama 3.1 full training on 8xMI300x and got pretty close to 8xH100 performance with my own kernels (ROCm is just too slow).


Oops, my calculation was wrong. Let me add an edit to the blog, thanks for pointing it out!

My train step was taking 30s.

And I was using a batch size of 16 and a sequence length of 64, making the training speed (16*64/30) ≈ 34 tokens per second (for fine-tuning in JAX eager mode).

(I haven't done a comparison with 8xH100)
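For anyone checking the arithmetic, the throughput estimate above is just tokens per step divided by step time. A tiny Python sketch, using the numbers from the comment:

```python
# Throughput from the comment's numbers: batch 16, seq length 64, 30 s/step
batch_size, seq_len, step_seconds = 16, 64, 30.0
tokens_per_step = batch_size * seq_len            # 1024 tokens per train step
tokens_per_sec = tokens_per_step / step_seconds
print(f"{tokens_per_sec:.1f} tokens/sec")         # -> 34.1 tokens/sec
```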


That’s approximately 0.8% MFU; an H100 would get more like 30% or 40% MFU if well tuned.

405e9 parameters

2 flops per matrix multiply per parameter

3 matrix multiplies for (forward, backward param, and backward activation) passes

batch size 16

seq length 64

1.3 petaflops per second per GPU in bfloat16

8 GPUs

30 seconds

So that’s 0.8% = (405e9 * 2 * 3 * 16 * 64 / 30) / (1.3e15 * 8)

Note that I’m ignoring the attention flops in this simplified calculation, but they would be a second order effect at this sequence length

Also note that I’m assuming full weight training, not LoRA. The result would be lower MFU if using LoRA.

These MI300X results are promising functionally (it's tough to get any model this big running) but they have a long way to go on perf. It's also single node. The biggest issues I've seen on MI300X are related to scaling to multiple nodes.

EDIT: The blog seems to indicate it is using LoRA. So we should remove the backward param pass from the equation above. Backward param only applies to adaptor weights, which are much more than 10x smaller, so we set it to 0 in the approximation. So we get

0.53% = (405e9 * 2 * 2 * 16 * 64 / 30) / (1.3e15 * 8)
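For anyone who wants to plug in their own numbers, the estimate above can be written as a small helper. It keeps the same simplifications as the comment (attention FLOPs ignored, 2 FLOPs per parameter per token per matmul); the function name is just illustrative:

```python
def mfu(params, matmuls_per_param, batch, seq, step_seconds,
        peak_flops_per_gpu, n_gpus):
    """Model FLOPs utilization: achieved matmul FLOPs / peak hardware FLOPs.
    Uses the 2-FLOPs-per-parameter-per-token approximation and ignores
    attention FLOPs, as in the comment above."""
    achieved = params * 2 * matmuls_per_param * batch * seq / step_seconds
    peak = peak_flops_per_gpu * n_gpus
    return achieved / peak

# Full fine-tuning: forward + backward-activation + backward-param = 3 matmuls
full = mfu(405e9, 3, 16, 64, 30, 1.3e15, 8)    # ~0.008  (0.8%)
# LoRA: the backward-param matmul on the frozen base weights drops out = 2
lora = mfu(405e9, 2, 16, 64, 30, 1.3e15, 8)    # ~0.0053 (0.53%)
```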


We've significantly optimized multinode on AMD MI300X for both stability and performance at TensorWave. There were certainly a lot of challenges, but we've become experts at multinode on AMD. We'd be happy to show you the latest results! They are quite compelling


Promises... promises... https://x.com/DarrickHorton/status/1828902979875017003

Maybe challenging for Darrick, but in reality, you just need to follow the directions and configure things correctly...

https://techdocs.broadcom.com/us/en/storage-and-ethernet-con...


If I feed your reply to an LLM, will it be able to tell me what you’re talking about?

/edit I think it did alright…


I tried with Llama 3.1 and it hallucinated MFU. It also completely misinterpreted multi-node scaling.

==========================================================

Let's break down the results described in the post.

Context: The post is discussing the performance of a large language model on a MI300X GPU, which is a high-performance computing (HPC) system. The model has approximately 405 billion parameters and is trained using a batch size of 16 and sequence length of 64.

Key metrics:

MFU (Million Floating-Point Operations per Second): This is a measure of the model's performance, specifically the number of floating-point operations (FLOPs) it can perform per second.

FLOPs: The number of floating-point operations required to perform a matrix multiplication, which is a fundamental operation in deep learning.

GPU performance: The MI300X GPU is capable of 1.3 petaflops (1.3 x 10^15 FLOPs) per second in bfloat16 (a 16-bit floating-point format).

Calculations:

The author provides two calculations to estimate the MFU of the model:

Initial calculation: Assuming full weight training (not LoRA), the author estimates the MFU as:

405 billion parameters

2 FLOPs per matrix multiply per parameter

3 matrix multiplies (forward, backward parameter, and backward activation)

Batch size 16

Sequence length 64

30 seconds to complete the calculation

1.3 petaflops per second per GPU

8 GPUs

The calculation yields an MFU of approximately 0.8%.

Revised calculation: After correcting the assumption to use LoRA (a technique that reduces the number of FLOPs), the author revises the calculation by removing the backward parameter pass, which is only applied to adaptor weights (much smaller than the main weights). This yields an MFU of approximately 0.53%.

Interpretation:

The results indicate that the MI300X GPU is not yet optimized for this large language model, with an MFU of only 0.53% (or 0.8% in the initial calculation). This is a relatively low performance compared to the theoretical maximum of 1.3 petaflops per second per GPU. The author notes that the biggest issues are related to scaling to multiple nodes, suggesting that the performance may improve when running on a larger cluster.

The revised calculation using LoRA reduces the MFU by about 33%, indicating that using this technique can lead to a more efficient use of the GPU resources.

==========================================================

MFU means model FLOPs utilization. It is a measure of efficiency from 0% to 100%. 100% means that the model is running at maximum possible efficiency, i.e. 1.3 petaflops per GPU.

==========================================================

In that case, the results indicate that the MI300X GPU is running the large language model at a relatively low efficiency, with an MFU of 0.53% (or 0.8% in the initial calculation).

This means that the model is only utilizing a tiny fraction of the GPU's maximum theoretical performance of 1.3 petaflops per second. In other words, the model is not fully utilizing the GPU's capabilities, and there is a significant amount of headroom for optimization.

To put this into perspective, an MFU of 100% would mean that the model is running at the maximum possible efficiency, using 1.3 petaflops per second per GPU. An MFU of 0.53% or 0.8% is extremely low, indicating that the model is running at a tiny fraction of its potential performance.

The author's comment that the MI300X results are "promising functionally" suggests that the model is able to run, but the low MFU indicates that there are significant opportunities for optimization and performance improvement.


MFU being maximal floating point utilization, right? I used ChatGPT 4o and it otherwise looks very similar to your Llama result


> The blog seems to indicate it is using LoRA. So we should remove the backward param pass from the equation above. Backward param only applies to adaptor weights

The backward pass still runs on the non-adapter weights. But yeah, 10 TFLOPS/GPU, especially at such a tiny sequence length, is very bad compared to what you can get on Nvidia. And I believe the difference would be even larger at large sequence lengths.


Backward activations do, but typically not backward weight gradients.

Why compute gradients with respect to weights that aren't going to be updated?

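To make that concrete, here is a toy linear-layer backward in plain Python (a hypothetical helper, purely for illustration): dx must always be computed so gradients keep flowing to earlier layers (and to LoRA adapters), while dW can be skipped entirely when W is frozen.

```python
def linear_backward(W, x, dy, weight_frozen):
    """Backward through y = W @ x for one sample (plain Python lists).
    dx is always required to propagate gradients upstream; dW is the
    'backward param' matmul, skippable when W is frozen (e.g. LoRA base)."""
    # dx = W^T @ dy  -- the "backward activation" pass
    dx = [sum(W[r][c] * dy[r] for r in range(len(W))) for c in range(len(x))]
    if weight_frozen:
        return dx, None  # skip the "backward param" matmul entirely
    # dW = outer(dy, x)  -- the "backward param" pass
    dW = [[dy[r] * x[c] for c in range(len(x))] for r in range(len(W))]
    return dx, dW

dx, dW = linear_backward([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0], [1.0, 0.0],
                         weight_frozen=True)   # dx computed, dW skipped
```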

Does JAX have its own implementations of matmul, flash attention, etc.? Or does it use the ROCm implementations like PyTorch does (e.g., hipBLASLt, Composable Kernel FA, etc.)?

Not too familiar with JAX, but the abysmal PyTorch training perf on MI300x is in large part attributable to the slow perf of the ROCm libraries it is using under the hood.


JAX has a sub-system called Pallas[1] with a Triton-like programming model and an example implementation of Flash Attention [2]. It is quite fast. On TPUs I've heard that the XLA compiler already emits a flash-attention-like computation graph for a regular JAX implementation of attention so there's no need to have some specialized kernel in that case.

1. https://jax.readthedocs.io/en/latest/pallas/index.html

2. https://github.com/jax-ml/jax/blob/main/jax/experimental/pal...


Don't bother with the rectilinear pakeha layouts, do your half adders in curvilinear patterns, Koru style


I am pakeha and tangata tiriti .... now I have an idea for the next window (I was thinking about a neuron ...)


Ahh, gotcha.

Well yeah, I reckon you render a full custom 4004 w/ koru-patterned transistors into a roughly 4m x 4m stained glass panel. Would look good as the foyer panel for the CS dept at the University of Waikato.


Probably best to tape it out to https://tinytapeout.com/ first


Do they support curvilinear cells?


The simple OpenLane flow just uses standard cells, but you have the ability to just tape out polygons, so curvy-ish is possible


(I already have a 4-bit CPU there)


I ported Karpathy's llm.c repo to AMD devices [1], and have trained GPT2 from scratch with 10B tokens of fineweb-edu on a 4x 7900XTX machine in just a few hours (about $2 worth of electricity) [2].

I've also trained the larger GPT2-XL model from scratch on bigger CDNA machines.

Works fine.

[1] https://github.com/anthonix/llm.c [2] https://x.com/zealandic1


I just tried it with llm.c ... seems to be missing quite a few key components, such as cublasLt, bfloat16 support, NVTX3, and compiler flags such as -t

And it's linked against an old release of ROCm.

So it's unclear to me how it's supposed to be an improvement over something like hipify


Greetings, I work on SCALE.

It appears we implemented `--threads` but not `-t` for the compiler flag. Oops. In either case, the flag has no effect at present, since fatbinary support is still in development, and that's the only part of the process that could conceivably be parallelised.

That said: clang (and hence the SCALE compiler) tends to compile CUDA much faster than nvcc does, so this lack of the parallelism feature is less problematic than it might at first seem.

NVTX support (if you want more than just "no-ops to make the code compile") requires cooperation with the authors of profilers etc., which has not so far been available.

bfloat16 is not properly supported by AMD anyway: the hardware doesn't do it, and HIP's implementation just lies and does the math in `float`. For that reason we haven't prioritised putting together the API.

cublasLt is a fair cop. We've got a ticket :D.


Hi, why do you believe that bfloat16 is not supported? Can you please provide some references (specifically the part about the hardware "doesn't do it")?

For the hardware you are focussing on (gfx11), the reference manual [2] and the list of supported LLVM gfx11 instructions [1] describe the bfloat16 vdot & WMMA operations, and these are in fact implemented and working in various software such as Composable Kernel and rocBLAS, which I have used (and can guarantee they are not simply being run as float). I've also used these in the AMD fork of llm.c [3].

Outside of gfx11, I have also used bfloat16 in CDNA2 & 3 devices, and they are working and being supported.

Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Cheers, -A

[1] https://llvm.org/docs/AMDGPU/AMDGPUAsmGFX11.html [2] https://www.amd.com/content/dam/amd/en/documents/radeon-tech... [3] http://github.com/anthonix/llm.c
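Since the thread turns on what bfloat16 actually is: it's simply a float32 with the low 16 mantissa bits dropped (sign + 8-bit exponent + 7-bit mantissa). A stdlib-only Python sketch demonstrates the format; it uses plain truncation for simplicity, whereas real hardware typically rounds to nearest even:

```python
import struct

def to_bfloat16_bits(x):
    """Truncate an IEEE-754 float32 to bfloat16 by keeping the top 16 bits
    (sign + 8-bit exponent + 7-bit mantissa)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bfloat16_bits(b):
    """Widen a bfloat16 bit pattern back to float32 by zero-filling."""
    (x,) = struct.unpack("<f", struct.pack("<I", b << 16))
    return x

# bfloat16 keeps float32's full exponent range but only ~3 decimal digits
pi_bf16 = from_bfloat16_bits(to_bfloat16_bits(3.14159265))  # -> 3.140625
```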


> Hi, why do you believe that bfloat16 is not supported?

Apologies, I appear to be talking nonsense. I conflated bfloat16 with nvidia's other wacky floating point formats. This is probably my cue to stop answering reddit/HN comments and go to bed. :D

So: ahem: bfloat16 support is basically just missing the fairly boring header.

> Regarding cublasLt, what is your plan for support there? Pass everything through to hipblasLt (hipify style) or something else?

Pretty much that, yes. Not much point reimplementing all the math libraries when AMD is doing that part of the legwork already.


OK, so in the case of llm.c, if you're just including the HIP headers, using hipblasLt, etc., what would be the benefit of using SCALE instead of hipify?


Seems like a big benefit would come from not forking the codebase into two versions!


I have not been impressed by the perf. Slower than PyTorch for LLMs, and PyTorch is actually stable on AMD (I've trained 7B/13B models).. so the stability issues seem to be more of a tinygrad problem and less of an AMD problem, despite George's ramblings [0][1]

[0] https://github.com/tinygrad/tinygrad/issues/4301 [1] https://x.com/realAnthonix/status/1800993761696284676


Yeah, I just reproduced the GPT2 from scratch results in 8.75 hours on 4x 7900 XTX. The fork is here: https://github.com/anthonix/llm.c


Maybe get a 7900 XTX. 122 TFLOPS of BF16/FP16 for less than $1k, and I'm getting 55.4% MFU.


This is not an apples-to-apples comparison, as it's running across GPUs and a much bigger model

