It sounded like you claimed that using only one core you already reach 1 TFLOP/s, implying that you could reach more than that by using more cores, which is false.
Now you have clarified that you actually claim that it is good that when using a single core you can reach the maximum throughput of the shared matrix operation accelerator.
This is correct, but there is no essential difference between this and a Zen 5 CPU that reaches this throughput by using only half of the cores, while having the other half of the cores free to do any other tasks.
What’s the power draw of however many zen 5 cores you have to tie up to hit, say, 1.5tflop/s on sgemm?
(Also, that’s a M2 number, since that’s what OP was talking about. Someone will presumably post M4 benchmarks for BLAS sometime soon, if they haven’t already.)
Top of the line AMD zen5 core can sustain ~80GFLOPS@FP64 and ~160GFLOPS@FP32 using AVX-512, 2x FMA units and ~5Ghz of clock frequency.
This is way way lower than what you claim M2 Pro is capable of and since I'm comparing it against the state-of-the-art datacenter CPU I'm curious how did you get to this number?
M2 Pro core runs at much lower frequency, what it seems to be around ~3.4GHz. And I couldn't find any information about SVE vector widths supported nor number of FMAs.