I believe the point being made was that this could be done in the CPU faster tha...

Const-me · on Nov 12, 2024

Yeah, but not on a single core.

In my desktop computer, I have a Ryzen 7 8700G CPU, which has 8 Zen 4 cores, 4.2 GHz base frequency, 65W TDP. Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core. You're gonna need all 8 cores to achieve 1 theoretical TFlops.
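The arithmetic behind those numbers can be sketched in a few lines of Python. This is only a back-of-the-envelope check using the figures quoted above (32 FLOP/cycle, 4.2 GHz base clock, 8 cores); the helper name `peak_gflops` is made up for illustration, and real workloads rarely reach the theoretical peak.

```python
def peak_gflops(flop_per_cycle, ghz, cores=1):
    """Theoretical peak: FLOP/cycle * cycles per nanosecond * core count."""
    return flop_per_cycle * ghz * cores

# Figures taken from the comment above, not measured.
per_core = peak_gflops(32, 4.2)
all_cores = peak_gflops(32, 4.2, cores=8)

print(f"{per_core:.1f} GFLOPS per core")
print(f"{all_cores / 1000:.2f} TFLOPS on 8 cores")
```

So a single core lands at roughly 134 GFLOPS, and it takes all 8 cores to get slightly past 1 TFLOPS.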

BTW, the integrated GPU inside the same 8700G processor can theoretically do 8.2 TFlops FP32.

menaerus · on Nov 13, 2024

> Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core.

Isn't it that zen4 doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units?

Because of this, a single AVX-512 instruction will occupy both FMA units and therefore I think that the theoretical limit for a single zen4 core should be half of the 134 GFLOPS number?

Const-me · on Nov 13, 2024

One FMA counts as two floating-point operations: one multiplication and one addition.

According to uops.info, Zen 4 cores can do two 8-wide FMA instructions per cycle, or one 16-wide FMA per cycle. See VFMADD132PS (YMM, YMM, YMM) and VFMADD132PS (ZMM, ZMM, ZMM) respectively, the throughput column is labelled TP. That’s where the 32 FLOP/cycle number comes from.
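To make the counting explicit, here is a small Python sketch of how 32 FLOP/cycle falls out of the vector width and the per-cycle throughput described above. The function name `flop_per_cycle` and its parameters are illustrative only; the throughput values (two YMM FMAs or one ZMM FMA per cycle) are the ones quoted in this comment.

```python
FLOPS_PER_FMA = 2  # one multiply + one add per lane

def flop_per_cycle(vector_bits, element_bits, instr_per_cycle):
    """FLOP/cycle = lanes per vector * 2 (FMA) * instructions issued per cycle."""
    lanes = vector_bits // element_bits
    return lanes * FLOPS_PER_FMA * instr_per_cycle

# 256-bit YMM: 8 FP32 lanes, two FMA instructions per cycle
ymm = flop_per_cycle(256, 32, instr_per_cycle=2)
# 512-bit ZMM: 16 FP32 lanes, one FMA instruction per cycle
zmm = flop_per_cycle(512, 32, instr_per_cycle=1)

print(ymm, zmm)  # both come out to 32
```

Either way the core retires 16 FP32 lanes of FMA per cycle, which is why the two encodings give the same peak.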

> doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units

That’s correct, AVX512 doesn’t deliver more FLOPs on that CPU. The throughput of 32-byte FMA and 64-byte FMA is the same, 32 FLOP/cycle for FP32 numbers.

menaerus · on Nov 13, 2024

> One FMA counts as two floating-point operations: one multiplication and one addition.

Right. This is where the discrepancy comes from. I counted FMA as a single FLOP.

Const-me · on Nov 14, 2024

BTW, it’s the same for GPUs. In DXBC shader byte code, the mad instruction does FMA. When reporting theoretical FLOPs, GPU vendors count that as 2 float operations.

For example, I have a GeForce 4070 Ti Super in my desktop. The chip has 8448 execution units; nVidia calls them CUDA cores but I don’t like the name, the correct number is 66 cores where each core can do 4 wavefronts of 32 threads each. Anyway, these EUs can do one FP32 FMA each cycle, and the boost clock frequency is 2.61 GHz. Multiplying these two numbers results in 22.04928E+12 cycles*EU/second. Counting each FMA as 2 FLOPs doubles that, and indeed nVidia reports 44E+12 FLOPs peak FP32 performance of the GPU.
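The same bookkeeping for the GPU, as a rough Python sketch. All inputs are the figures from this comment (66 cores, 4 × 32 lanes each, 2.61 GHz boost); the variable names are made up for illustration, and the clock is kept in MHz so the arithmetic stays in exact integers.

```python
cores = 66                 # "real" cores (what nVidia calls SMs)
lanes_per_core = 4 * 32    # 4 wavefronts of 32 threads each
eus = cores * lanes_per_core          # 8448 execution units

boost_mhz = 2610           # 2.61 GHz boost clock
fma_mops = eus * boost_mhz            # millions of FMAs per second
tflops = 2 * fma_mops / 1e6           # each FMA counted as 2 FLOPs

print(eus)                       # 8448
print(fma_mops)                  # 22049280, i.e. 22.04928E+12 FMA/s
print(f"{tflops:.1f} TFLOPS")    # about 44
```

The factor of 2 on the last step is the only difference between the EU*cycles figure and the vendor's advertised number.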

saagarjha · on Nov 13, 2024

I am told the numbers above require the core to have a matrix multiply unit (such as SME).