
It sounds like you are trying to chat with the base model when you should be using a chat model.


No, I’m using 9b-chat-q8_0 on a 4090


Turns out that Ollama on Windows will run multiple models in parallel, consuming all available VRAM and RAM. Changing the limit to 1 fixed the issue, and now it's working great! However, the context length for the output is very small: only 1024 tokens.
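For anyone else hitting this: a minimal sketch of capping how many models/requests Ollama keeps resident, using its documented server environment variables (set these before the server starts; on Windows they go in the system environment or a PowerShell session):

```shell
# Limit the Ollama server to one loaded model and one parallel request,
# so a single model gets all available VRAM.
# PowerShell (Windows):
#   $env:OLLAMA_MAX_LOADED_MODELS = "1"
#   $env:OLLAMA_NUM_PARALLEL = "1"
# Linux/macOS:
export OLLAMA_MAX_LOADED_MODELS=1   # at most one model resident at a time
export OLLAMA_NUM_PARALLEL=1        # at most one request processed in parallel
ollama serve
```

Restart the Ollama service after changing these so the new limits take effect.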


That's some really strange behavior; I don't know why it would cause poor results rather than just poor performance.

Can you configure the context size with `/set parameter num_ctx N`? On my laptop with an RTX A3000 12GB I can run `yi-coder:9b-chat` (Q4_0) with 32768 context and it produces good results quickly. That uses 11GB of VRAM so it's maxed out for this setup.
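In case it helps, here's a sketch of both ways to set the context size: interactively in the REPL (as above), or baked into a derived model via a Modelfile so you don't have to set it every session (model name and file path are just examples):

```shell
# Option 1: set it for the current interactive session
#   ollama run yi-coder:9b-chat
#   >>> /set parameter num_ctx 32768

# Option 2: bake it into a derived model with a Modelfile
cat > Modelfile <<'EOF'
FROM yi-coder:9b-chat
PARAMETER num_ctx 32768
EOF
ollama create yi-coder-32k -f Modelfile
ollama run yi-coder-32k
```

The interactive setting only lasts for that session; the Modelfile route makes the larger context the default for the new model tag.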


Solved, see:

https://github.com/01-ai/Yi-Coder/issues/6#issuecomment-2334...

Works very well now! 65K input tokens with 8192 output tokens is no longer an issue on my 4090. (It maxes out at 22 GB of VRAM.)
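If you're calling the server programmatically rather than through the REPL, the same limits can be passed per request via Ollama's REST API `options` field; a sketch (the prompt and exact values are illustrative):

```shell
# Ask the local Ollama server for a completion with a 65K context window
# and up to 8192 output tokens (num_predict caps generated tokens).
curl http://localhost:11434/api/generate -d '{
  "model": "yi-coder:9b-chat",
  "prompt": "Write a function that reverses a linked list.",
  "stream": false,
  "options": {
    "num_ctx": 65536,
    "num_predict": 8192
  }
}'
```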


Awesome! Glad to hear you got it sorted out.



