
It's not cost-free. It comes at the cost of greatly increased latency. 29.9 seconds per token with Llama 3.1-70B. This is from Table 1 (pg 8) of the paper.


That is s/token and not token/s. The cost is high.

The actual goal of the article is to highlight that we can optimise the overall speed by decreasing link latency. Yes, link latency, because it's not one machine but several low-end devices used together to serve the 70B LLM.
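
To make that concrete, here's a rough per-token cost model for splitting one model across several small devices. Every number below is my own assumption for illustration, not a figure from the paper:

    # Rough per-token cost model for pipelining one 70B model across several
    # small devices; all numbers are assumptions for illustration only.
    num_layers = 80              # Llama 3.1-70B has 80 transformer layers
    num_devices = 8              # hypothetical cluster of small devices
    compute_per_layer_s = 0.01   # assumed on-device compute per layer
    link_latency_s = 0.05        # assumed per-hop network latency

    hops_per_token = num_devices - 1   # activations cross each link once per token
    token_time = num_layers * compute_per_layer_s + hops_per_token * link_latency_s
    link_share = hops_per_token * link_latency_s / token_time

    print(f"~{token_time:.2f} s/token, {link_share:.0%} of which is link latency")

The structure of the formula is the point: compute scales with the number of layers, while the link term scales with the number of device-to-device hops, so shaving link latency directly shaves per-token time.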


Am I just misunderstanding, or is the paper using "latency" when what they really mean is "throughput"?

In other words, if I want 100 tokens of output, do I have to wait 2990 seconds? If so, the terminology seems unnecessarily confusing.
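
A quick back-of-the-envelope with the quoted figure, assuming a single, strictly sequential decode stream (in that case latency and throughput are just reciprocals; they only diverge once requests are batched or pipelined so many tokens are in flight at once):

    seconds_per_token = 29.9     # figure quoted from Table 1 of the paper
    tokens_wanted = 100

    print(tokens_wanted * seconds_per_token, "s total")   # 2990.0 s, roughly 50 min
    print(round(1 / seconds_per_token, 3), "tokens/s")    # ~0.033 tokens/s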


Ah the disk swap method


Is there any predictability/patterns for neuron/layer activation? If so, would it be reasonable to have a second tiny model that specifically tries to predict activation and preemptively swap those into memory?


This isn't how neural networks work.

For vanilla models, you always use all the weights. That isn't true for mixture-of-experts, though, and in that setting, your approach has merit.
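
To illustrate the mixture-of-experts case, here is a toy gating sketch (my own example, not any particular model's code): the gate picks the top-k experts per token, so only those experts' weights are needed, which is what would make a "predict and prefetch" scheme plausible.

    import numpy as np

    # Toy mixture-of-experts gate (illustrative sketch only).
    # Only the top-k experts chosen by the gate are needed for a given token,
    # so in principle you could prefetch just those experts' weights.
    rng = np.random.default_rng(0)
    hidden_dim, num_experts, top_k = 64, 8, 2
    gate_weights = rng.standard_normal((hidden_dim, num_experts))

    def experts_to_fetch(token_state):
        logits = token_state @ gate_weights        # gating scores per expert
        return np.argsort(logits)[-top_k:]         # indices of the top-k experts

    token_state = rng.standard_normal(hidden_dim)
    print("experts needed for this token:", experts_to_fetch(token_state))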


Depends on the architecture, but generally you just move through the layers linearly. Simple iteration.

The number of layers, and the amount of time spent in each of them, make me think any benefit from pre-loading the next layer is negligible.

You really need the entire model on the device for it to be performant.
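
A toy simulation of that point (all timings below are made-up assumptions): prefetching layer i+1 while computing layer i only hides min(load time, compute time) per layer, so when loading dominates, the win is tiny.

    import threading, time

    # Toy simulation: overlap loading layer i+1 with computing layer i.
    # Timings are arbitrary; only the structure of the result matters.
    NUM_LAYERS = 16
    LOAD_S, COMPUTE_S = 0.05, 0.005      # assumed: loading dominates compute

    def load_layer(i):
        time.sleep(LOAD_S)               # stand-in for reading weights from storage

    def run_layer(i):
        time.sleep(COMPUTE_S)            # stand-in for the layer's matmuls

    start = time.time()
    load_layer(0)
    for i in range(NUM_LAYERS):
        prefetch = None
        if i + 1 < NUM_LAYERS:
            prefetch = threading.Thread(target=load_layer, args=(i + 1,))
            prefetch.start()             # overlap next layer's load with compute
        run_layer(i)
        if prefetch:
            prefetch.join()              # still bottlenecked by the slow load
    with_prefetch = time.time() - start
    naive = NUM_LAYERS * (LOAD_S + COMPUTE_S)
    print(f"{with_prefetch:.2f} s with prefetch vs ~{naive:.2f} s without")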


It's not disk swap. It's multi-device LLM serving.


That looked like an analogy. Back in the days of a mechanical arm moving magnetic fields around in our PCs, you could have the illusion of infinite RAM as long as you were OK with microsecond operations taking two million times longer. This is akin to that.


I think the point is that it has the same sort of latency tradeoff that disk swap did: it's awful, but sometimes better than nothing.



