
It's not cost-free. It comes at the cost of greatly increased latency. 29.9 seconds per token with Llama 3.1-70B. This is from Table 1 (pg 8) of the paper.


That is s/token and not token/s. The cost is high.

The actual goal of the article is to highlight that we can optimise the overall speed by decreasing link latency. Yes, link latency, because it's not one machine but several low-end devices used together to serve the 70B LLM.
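
To make that concrete, here's a rough per-token cost model for splitting one model across several small devices. Every number below is my own assumption for illustration, not a figure from the paper:

    # Rough per-token cost model for pipelining one 70B model across several
    # small devices; all numbers are assumptions for illustration only.
    num_layers = 80              # Llama 3.1-70B has 80 transformer layers
    num_devices = 8              # hypothetical cluster of small devices
    compute_per_layer_s = 0.01   # assumed on-device compute per layer
    link_latency_s = 0.05        # assumed per-hop network latency

    hops_per_token = num_devices - 1   # activations cross each link once per token
    token_time = num_layers * compute_per_layer_s + hops_per_token * link_latency_s
    link_share = hops_per_token * link_latency_s / token_time

    print(f"~{token_time:.2f} s/token, {link_share:.0%} of which is link latency")

The structure of the formula is the point: compute scales with the number of layers, while the link term scales with the number of device-to-device hops, so shaving link latency directly shaves per-token time.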


Am I just misunderstanding, or is the paper using "latency" when what they really mean is "throughput"?

In other words, if I want 100 tokens of output, do I have to wait 2990 seconds? If so, the terminology seems unnecessarily confusing.
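
A quick back-of-the-envelope with the quoted figure, assuming a single, strictly sequential decode stream (in that case latency and throughput are just reciprocals; they only diverge once requests are batched or pipelined so many tokens are in flight at once):

    seconds_per_token = 29.9     # figure quoted from Table 1 of the paper
    tokens_wanted = 100

    print(tokens_wanted * seconds_per_token, "s total")   # 2990.0 s, roughly 50 min
    print(round(1 / seconds_per_token, 3), "tokens/s")    # ~0.033 tokens/s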


Ah the disk swap method


Is there any predictability/patterns for neuron/layer activation? If so, would it be reasonable to have a second tiny model that specifically tries to predict activation and preemptively swap those into memory?


This isn't how neural networks work.

For vanilla models, you always use all the weights. That isn't true for mixture-of-experts, though, and in that setting, your approach has merit.
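
To illustrate the mixture-of-experts case, here is a toy gating sketch (my own example, not any particular model's code): the gate picks the top-k experts per token, so only those experts' weights are needed, which is what would make a "predict and prefetch" scheme plausible.

    import numpy as np

    # Toy mixture-of-experts gate (illustrative sketch only).
    # Only the top-k experts chosen by the gate are needed for a given token,
    # so in principle you could prefetch just those experts' weights.
    rng = np.random.default_rng(0)
    hidden_dim, num_experts, top_k = 64, 8, 2
    gate_weights = rng.standard_normal((hidden_dim, num_experts))

    def experts_to_fetch(token_state):
        logits = token_state @ gate_weights        # gating scores per expert
        return np.argsort(logits)[-top_k:]         # indices of the top-k experts

    token_state = rng.standard_normal(hidden_dim)
    print("experts needed for this token:", experts_to_fetch(token_state))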


Depends on the architecture, but generally you just move through the layers linearly. Simple iteration.

The number of layers, and the amount of time spent in each of them, make me think any benefit from pre-loading the next layer is negligible.

You really need the entire model on the device for it to be performant.
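
A toy simulation of that point (all timings below are made-up assumptions): prefetching layer i+1 while computing layer i only hides min(load time, compute time) per layer, so when loading dominates, the win is tiny.

    import threading, time

    # Toy simulation: overlap loading layer i+1 with computing layer i.
    # Timings are arbitrary; only the structure of the result matters.
    NUM_LAYERS = 16
    LOAD_S, COMPUTE_S = 0.05, 0.005      # assumed: loading dominates compute

    def load_layer(i):
        time.sleep(LOAD_S)               # stand-in for reading weights from storage

    def run_layer(i):
        time.sleep(COMPUTE_S)            # stand-in for the layer's matmuls

    start = time.time()
    load_layer(0)
    for i in range(NUM_LAYERS):
        prefetch = None
        if i + 1 < NUM_LAYERS:
            prefetch = threading.Thread(target=load_layer, args=(i + 1,))
            prefetch.start()             # overlap next layer's load with compute
        run_layer(i)
        if prefetch:
            prefetch.join()              # still bottlenecked by the slow load
    with_prefetch = time.time() - start
    naive = NUM_LAYERS * (LOAD_S + COMPUTE_S)
    print(f"{with_prefetch:.2f} s with prefetch vs ~{naive:.2f} s without")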


It's not disk swap. It's multi-device LLM serving.


That looked like an analogy. Back in the days of a mechanical arm moving magnetic fields around in our PCs, you could have the illusion of infinite RAM as long as you were OK with microsecond operations taking two million times longer. This is akin to that.


I think the point is that it has the same sort of latency tradeoff that disk swap did: it's awful, but sometimes better than nothing.



