Hacker News

> On a 4K context conversation, that means you would be waiting about 3.5min between turns before tokens started outputting.

Wouldn't the time be negligible with inter-turn KV caching? Many inference providers already do this.
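For reference, the quoted figure implies a prefill rate of roughly 4096 / 210 ≈ 19.5 tok/s, assuming the entire 4K-token prompt is reprocessed from scratch each turn (a back-of-envelope check, not a measured number):

```python
# Back-of-envelope: implied prefill speed if the full context is
# reprocessed every turn, using the numbers from the quote above.
context_tokens = 4096      # 4K-token conversation
wait_seconds = 3.5 * 60    # ~3.5 min wait before first token
prefill_tps = context_tokens / wait_seconds
print(f"Implied prefill speed: {prefill_tps:.1f} tok/s")  # ~19.5 tok/s
```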



Yes, for single-user multi-turn use, KV-cache reuse could help a lot. vLLM supports this via Automatic Prefix Caching (APC), so you'd be able to take advantage of it with Strix Halo now. llama.cpp has had a "prompt-cache" option, but when I last looked it was a bit weird (it only works for non-interactive use, saving and loading the cache to disk), so it might not help on the Mac side.
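The idea behind prefix caching can be sketched in a few lines: if the new turn's prompt shares a prefix with a previous one, only the new suffix needs prefilling. This is a toy model using a dict keyed by token prefixes; real engines like vLLM's APC do this at the attention-block level, and `prefill_with_cache` / the `kv_for_*` placeholders are hypothetical names, not any library's API:

```python
def longest_cached_prefix(cache, tokens):
    """Return how many leading tokens already have cached KV entries."""
    n = 0
    while n < len(tokens) and tuple(tokens[:n + 1]) in cache:
        n += 1
    return n

def prefill_with_cache(cache, tokens):
    """Only 'compute' KV entries for tokens past the cached prefix.

    Returns the number of tokens actually prefilled this turn.
    """
    start = longest_cached_prefix(cache, tokens)
    for i in range(start, len(tokens)):
        cache[tuple(tokens[:i + 1])] = f"kv_for_{i}"  # stand-in for real KV tensors
    return len(tokens) - start

cache = {}
turn1 = list(range(4000))             # 4K-token first prompt: cold, full prefill
turn2 = turn1 + list(range(50))       # next turn appends 50 tokens
print(prefill_with_cache(cache, turn1))  # 4000
print(prefill_with_cache(cache, turn2))  # 50 (only the new suffix)
```

With the cache warm, the per-turn wait scales with the new tokens rather than the whole conversation, which is why the 3.5-minute figure mostly disappears.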



