
> if you run it at the full 262144 tokens of context you'll need ~65gb of ram

What is the relationship between context size and RAM required? Isn't the size of RAM related only to number of parameters and quantization?



The context cache (or KV cache) is where intermediate results are stored: one key/value entry per token in the context. Its size depends on the model architecture and dimensions.

KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped query attention (GQA), which reduces num_key_value_heads compared to a multi-head attention (MHA) model.

With batch size 1 (for low-latency, single-user inference), 32k context (the recommendation in the model card), and fp16 precision:

2 * 1 * 32768 * 8 * 128 * 36 * 2 bytes ≈ 4.5 GiB.

I think, anyway. It's hard to keep up with this stuff. :)
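
For what it's worth, the arithmetic is easy to sanity-check in a few lines of Python (the dimensions here are the ones assumed above, not read from any particular config):

    # KV-cache size from the formula above; all dimensions are assumptions.
    def kv_cache_bytes(batch_size, context_len, num_kv_heads, head_dim,
                       num_layers, element_size):
        # the leading 2 is for the separate key and value tensors
        return 2 * batch_size * context_len * num_kv_heads * head_dim * num_layers * element_size

    size = kv_cache_bytes(batch_size=1, context_len=32768, num_kv_heads=8,
                          head_dim=128, num_layers=36, element_size=2)  # fp16 = 2 bytes
    print(f"{size / 2**30:.1f} GiB")  # -> 4.5 GiB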


Yes, but you can quantise the KV cache too, just like you can the weights.
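
In the formula above that just shrinks element_size, so (ignoring the small per-block overhead for scales that quantised caches add) the cache roughly halves at q8 and quarters at q4:

    # Rough effect of cache quantisation; 0.5 bytes/value for q4 is an approximation.
    for name, element_size in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
        size = 2 * 1 * 32768 * 8 * 128 * 36 * element_size
        print(name, f"{size / 2**30:.2f} GiB")
    # -> fp16 4.50 GiB, q8 2.25 GiB, q4 1.12 GiB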


A 24 GB GPU can run a ~30B-parameter model at 4-bit quantization with about 8k-12k of context before all of the VRAM is occupied.


Not quite true. It depends on the number of KV heads. GLM4 32B at IQ4 quant with Q8 context can run its full context in only 20 GiB of VRAM.
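
A back-of-the-envelope sketch of why the KV-head count matters so much, assuming ~17 GiB of 4-bit weights on a 24 GiB card and reusing the head_dim/layer/fp16 numbers from upthread (all assumptions, not any particular model's config):

    # How far the VRAM left over after the weights stretches, per KV-head count.
    leftover = (24 - 17) * 2**30                      # assumed ~17 GiB of weights
    for num_kv_heads in (2, 8, 48):                   # aggressive GQA .. MHA-like
        per_token = 2 * num_kv_heads * 128 * 36 * 2   # key+value bytes per token, fp16
        print(num_kv_heads, "KV heads:", leftover // per_token, "tokens of context")
    # -> roughly 204k, 51k, and 8.5k tokens respectively

Which is roughly why an MHA-style model can exhaust 24 GB around 8k of context while a GQA model with few KV heads fits a much longer context in the same space.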


No. Your KV cache is kept in memory also.


What's the space complexity with respect to context size? And who is working on bringing it down to linear?


I mean... where do you think the context is stored?



