In my desktop computer, I have Ryzen 7 8700G CPU, which has 8 Zen 4 cores, 4.2 GHz base frequency, 65W TDP. Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core. You gonna need all 8 cores to achieve 1 theoretical TFlops.
BTW, integrated GPU inside the same 8700G processor can theoretically do 8.2 TFlops FP32.
> Theoretically, when doing FP32 FMA, each CPU core can do 32 FLOP/cycle. At the base frequency, this translates into 134 GFlops per core.
Isn't it that zen4 doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units?
Because of this, a single AVX-512 instruction will occupy both FMA units and therefore I think that the theoretical limit for a single zen4 core should be half of the 134 GFLOPS number?
One FMA counts as two floating-point operations: one multiplication and one addition.
According to uops.info, Zen 4 cores can do two 8-wide FMA instructions per cycle, or one 16-wide FMA per cycle. See VFMADD132PS (YMM, YMM, YMM) and VFMADD132PS (ZMM, ZMM, ZMM) respectively, the throughput column is labelled TP. That’s where 32 FLOP/cycle number comes from.
> doesn't have "native" support for AVX-512 but "mimics" it through 2x 256-bit FMA units
That’s correct, AVX512 doesn’t deliver more FLOPs on that CPU. The throughput of 32-byte FMA and 64-byte FMA is the same, 32 FLOP/cycle for FP32 numbers.
BTW, it’s the same for GPUs. In DXBC shader byte code, mad instruction does FMA. When reporting theoretical FLOPs, GPU vendors count that as 2 float operations.
For example, I have GeForce 4070 Ti Super in my desktop. The chip has 8448 execution units; nVidia calls them CUDA cores but I don’t like the name, the correct number is 66 cores where each core can do 4 wavefronts of 32 threads each.
Anyway, these EUs can do one FP32 FMA each cycle, and the boost clock frequency is 2.61 GHz.
Multiplying these two numbers results in 22.04928E+12 cycles*EU/second, and nVidia reports 44E+12 FLOPs peak FP32 performance of the GPU.