
> if you run it at the full 262144 tokens of context you'll need ~65gb of ram

What is the relationship between context size and RAM required? Isn't the size of RAM related only to number of parameters and quantization?



The context cache (or KV cache) is where intermediate results are stored: one key/value entry per token in the context. Its size depends on the model architecture and dimensions.

KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped query attention (GQA), which reduces num_key_value_heads compared to a multi-head attention (MHA) model.

With batch size 1 (for low-latency, single-user inference), 32k context (the recommendation in the model card), and fp16 precision:

2 * 1 * 32768 * 8 * 128 * 36 * 2 bytes ≈ 4.5 GiB.

I think, anyway. It's hard to keep up with this stuff. :)
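
For what it's worth, the arithmetic is easy to sanity-check in a few lines of Python (the dimensions here are the ones assumed above, not read from any particular config):

    # KV-cache size from the formula above; all dimensions are assumptions.
    def kv_cache_bytes(batch_size, context_len, num_kv_heads, head_dim,
                       num_layers, element_size):
        # the leading 2 is for the separate key and value tensors
        return 2 * batch_size * context_len * num_kv_heads * head_dim * num_layers * element_size

    size = kv_cache_bytes(batch_size=1, context_len=32768, num_kv_heads=8,
                          head_dim=128, num_layers=36, element_size=2)  # fp16 = 2 bytes
    print(f"{size / 2**30:.1f} GiB")  # -> 4.5 GiB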


Yes, but you can quantise the KV cache too, just like you can the weights.
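
In the formula above that just shrinks element_size, so (ignoring the small per-block overhead for scales that quantised caches add) the cache roughly halves at q8 and quarters at q4:

    # Rough effect of cache quantisation; 0.5 bytes/value for q4 is an approximation.
    for name, element_size in [("fp16", 2.0), ("q8", 1.0), ("q4", 0.5)]:
        size = 2 * 1 * 32768 * 8 * 128 * 36 * element_size
        print(name, f"{size / 2**30:.2f} GiB")
    # -> fp16 4.50 GiB, q8 2.25 GiB, q4 1.12 GiB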


A 24 GB GPU can run a ~30B-parameter model at 4-bit quantization with about 8k-12k of context before all of the VRAM is occupied.


Not quite true. It depends on the number of KV heads. GLM4 32B at IQ4 quant with Q8 context can run its full context in only 20 GiB of VRAM.
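
A back-of-the-envelope sketch of why the KV-head count matters so much, assuming ~17 GiB of 4-bit weights on a 24 GiB card and reusing the head_dim/layer/fp16 numbers from upthread (all assumptions, not any particular model's config):

    # How far the VRAM left over after the weights stretches, per KV-head count.
    leftover = (24 - 17) * 2**30                      # assumed ~17 GiB of weights
    for num_kv_heads in (2, 8, 48):                   # aggressive GQA .. MHA-like
        per_token = 2 * num_kv_heads * 128 * 36 * 2   # key+value bytes per token, fp16
        print(num_kv_heads, "KV heads:", leftover // per_token, "tokens of context")
    # -> roughly 204k, 51k, and 8.5k tokens respectively

Which is roughly why an MHA-style model can exhaust 24 GB around 8k of context while a GQA model with few KV heads fits a much longer context in the same space.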


No. Your KV cache is kept in memory also.


What's the space complexity with respect to context size? And who is working on bringing it down to linear?


I mean... where do you think the context is stored?



