Right. So, like I said, using one CPU core, you can exceed 1 TFLOP/s, leaving al...

adrian_b · on Nov 13, 2024

Your initial claim was ambiguous.

It sounded like you claimed that using only one core you already reach 1 TFLOP/s, implying that you could reach more than that by using more cores, which is false.

Now you have clarified that your actual claim is that it is good that a single core can reach the maximum throughput of the shared matrix operation accelerator.

This is correct, but there is no essential difference between this and a Zen 5 CPU that reaches this throughput by using only half of the cores, while having the other half of the cores free to do any other tasks.

stephencanon · on Nov 14, 2024

What’s the power draw of however many Zen 5 cores you have to tie up to hit, say, 1.5 TFLOP/s on sgemm?

(Also, that’s an M2 number, since that’s what OP was talking about. Someone will presumably post M4 benchmarks for BLAS sometime soon, if they haven’t already.)

menaerus · on Nov 13, 2024

A top-of-the-line AMD Zen 5 core can sustain ~80 GFLOPS at FP64 and ~160 GFLOPS at FP32 using AVX-512, 2x FMA units and a clock frequency of ~5 GHz.
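
For anyone who wants to redo this kind of per-core estimate, here is a minimal sketch of the usual peak-throughput arithmetic. The function name `peak_gflops` and its parameters are made up for illustration, and the inputs are just the figures quoted above (512-bit vectors, 2 FMA units, ~5 GHz), not measurements. It assumes every FMA unit issues one full-width FMA per cycle.

```python
def peak_gflops(vector_bits, element_bits, fma_units, clock_ghz, flops_per_fma=2):
    """Theoretical per-core peak in GFLOP/s (hypothetical helper).

    Assumes each FMA unit retires one full-width FMA every cycle.
    flops_per_fma=2 counts the multiply and the add separately,
    which is the usual convention for GEMM benchmarks.
    """
    lanes = vector_bits // element_bits  # elements per vector register
    return lanes * fma_units * flops_per_fma * clock_ghz


# Inputs taken from the comment above: AVX-512, 2 FMA units, ~5 GHz.
fp64_peak = peak_gflops(512, 64, fma_units=2, clock_ghz=5.0)
fp32_peak = peak_gflops(512, 32, fma_units=2, clock_ghz=5.0)

# Same inputs, but counting each FMA as a single operation.
fp64_fma_as_one = peak_gflops(512, 64, fma_units=2, clock_ghz=5.0, flops_per_fma=1)
fp32_fma_as_one = peak_gflops(512, 32, fma_units=2, clock_ghz=5.0, flops_per_fma=1)

print(f"FP64: {fp64_peak:.0f} GFLOP/s, FP32: {fp32_peak:.0f} GFLOP/s")
print(f"FMA counted once: FP64 {fp64_fma_as_one:.0f}, FP32 {fp32_fma_as_one:.0f}")
```

Under these assumptions the formula gives 160 GFLOP/s for FP64 and 320 GFLOP/s for FP32. The ~80 and ~160 figures fall out if each FMA is counted as one operation, or equivalently if only one 512-bit FMA per cycle is sustained. Which reading applies to a real chip depends on details this sketch does not model.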

This is way, way lower than what you claim the M2 Pro is capable of, and since I'm comparing it against a state-of-the-art datacenter CPU, I'm curious how you got to this number.

The M2 Pro core runs at a much lower frequency, which seems to be around ~3.4 GHz. And I couldn't find any information about the supported SVE vector widths or the number of FMA units.