
It's still "fast enough" for even 120b models in practice, and you don't need to muck around with building a multi-GPU rig (and figuring out how to e.g. cool it properly).

It's definitely not what you'd want for your data center, but for home tinkering it has a very clear niche.



> It's still "fast enough" for even 120b models in practice

Is it? This is very subjective. The Mac Studio would not be "fast enough" for me on even a 70b model, not necessarily because its output is slow, but because the prompt evaluation speed is quite bad. See [0] for example numbers; on Llama 3 70B at Q4_K_M quantization, it takes an M2 Ultra with 192GB about 8.5 seconds just to evaluate a 1024-token prompt. A machine with six 3090s (which would likely come in cheaper than the Mac Studio) is over six times faster at prompt evaluation.

A 120b model is likely going to be something like 1.5-2x slower still at prompt evaluation (for a dense model, the compute scales roughly with parameter count), rendering it pretty much unusable (again, for me).

[0] https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inferen...
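As a rough back-of-the-envelope check (a sketch using only the numbers quoted above; the 1.5-2x slowdown and the ~6x multi-GPU speedup are taken from this thread, not measured here), the time-to-first-token for a longer prompt works out to:

    # Back-of-the-envelope time-to-first-token from prompt-evaluation throughput.
    # The 8.5 s / 1024-token figure is the M2 Ultra + Llama 3 70B Q4_K_M number above.

    PROMPT_TOKENS = 1024
    M2_ULTRA_EVAL_S = 8.5

    tok_per_s = PROMPT_TOKENS / M2_ULTRA_EVAL_S  # ~120 tokens/s prompt evaluation

    def time_to_first_token(prompt_tokens: int, eval_tok_per_s: float) -> float:
        """Seconds spent evaluating the prompt before the first output token."""
        return prompt_tokens / eval_tok_per_s

    # 70B on the M2 Ultra, hypothetical 8k-token prompt:
    print(time_to_first_token(8192, tok_per_s))      # ~68 s

    # Hypothetical 120B, assuming the 1.5-2x slowdown mentioned above (worst case):
    print(time_to_first_token(8192, tok_per_s / 2))  # ~136 s

    # 6x 3090 rig, assuming the ~6x faster prompt evaluation cited above:
    print(time_to_first_token(8192, tok_per_s * 6))  # ~11 s

So with a long prompt you'd be waiting on the order of a minute or two before the first token on the Mac, versus ~10 seconds on the multi-GPU rig, which is the difference being argued about.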



