Hacker News

> On a 4K context conversation, that means you would be waiting about 3.5min between turns before tokens started outputting.

Wouldn't the time be negligible with inter-turn KV caching? Many inference providers already do this.
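For reference, the quoted figure implies a prefill rate of roughly 4096 / 210 ≈ 19.5 tok/s, assuming the entire 4K-token prompt is reprocessed from scratch each turn (a back-of-envelope check, not a measured number):

```python
# Back-of-envelope: implied prefill speed if the full context is
# reprocessed every turn, using the numbers from the quote above.
context_tokens = 4096      # 4K-token conversation
wait_seconds = 3.5 * 60    # ~3.5 min wait before first token
prefill_tps = context_tokens / wait_seconds
print(f"Implied prefill speed: {prefill_tps:.1f} tok/s")  # ~19.5 tok/s
```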



Yes, for single-user multi-turn use, KV-cache reuse could help a lot. vLLM supports this via Automatic Prefix Caching (APC), so you'd be able to take advantage of it with Strix Halo now. llama.cpp has had a "prompt-cache" option, but when I last looked it was a bit weird (it only works for non-interactive use, saving and loading the cache to disk), so it might not help on the Mac side.
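The idea behind prefix caching can be sketched in a few lines: if the new turn's prompt shares a prefix with a previous one, only the new suffix needs prefilling. This is a toy model using a dict keyed by token prefixes; real engines like vLLM's APC do this at the attention-block level, and `prefill_with_cache` / the `kv_for_*` placeholders are hypothetical names, not any library's API:

```python
def longest_cached_prefix(cache, tokens):
    """Return how many leading tokens already have cached KV entries."""
    n = 0
    while n < len(tokens) and tuple(tokens[:n + 1]) in cache:
        n += 1
    return n

def prefill_with_cache(cache, tokens):
    """Only 'compute' KV entries for tokens past the cached prefix.

    Returns the number of tokens actually prefilled this turn.
    """
    start = longest_cached_prefix(cache, tokens)
    for i in range(start, len(tokens)):
        cache[tuple(tokens[:i + 1])] = f"kv_for_{i}"  # stand-in for real KV tensors
    return len(tokens) - start

cache = {}
turn1 = list(range(4000))             # 4K-token first prompt: cold, full prefill
turn2 = turn1 + list(range(50))       # next turn appends 50 tokens
print(prefill_with_cache(cache, turn1))  # 4000
print(prefill_with_cache(cache, turn2))  # 50 (only the new suffix)
```

With the cache warm, the per-turn wait scales with the new tokens rather than the whole conversation, which is why the 3.5-minute figure mostly disappears.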



