This package seems to use llama_cpp for local inference [1], so you can probably use anything supported by that [2]. However, it looks like it just passes the OCR output to the model for correction - the language model never actually sees the original image.
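Something like this pattern, I'd guess (untested sketch using the llama-cpp-python bindings; the model path and prompt are made up):

```python
from llama_cpp import Llama

# Hypothetical GGUF path - any chat model llama.cpp supports should do.
llm = Llama(model_path="./models/mistral-7b-instruct.Q4_K_M.gguf", n_ctx=4096)

ocr_output = "Tbe quick brovvn fox jumps ovcr the lazy d0g."

# Only the OCR text goes in - the model never sees the source image.
response = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "Fix the OCR errors. Output only the corrected text."},
        {"role": "user", "content": ocr_output},
    ],
    temperature=0.0,
)
print(response["choices"][0]["message"]["content"])
```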
That said, there are some large language models you can run locally that accept image input: Phi-3-Vision [3], LLaVA [4], MiniCPM-V [5], etc.
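llama.cpp can actually run some of these itself. Here's an untested sketch using llama-cpp-python's LLaVA chat handler (the weight and projector paths are made up):

```python
import base64
from llama_cpp import Llama
from llama_cpp.llama_chat_format import Llava15ChatHandler

def to_data_uri(path: str) -> str:
    # The chat handler takes image URLs; a base64 data URI works for local files.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

# Hypothetical paths: LLaVA GGUF weights plus the matching CLIP projector file.
llm = Llama(
    model_path="./models/llava-v1.5-7b.Q4_K_M.gguf",
    chat_handler=Llava15ChatHandler(clip_model_path="./models/mmproj-model-f16.gguf"),
    n_ctx=2048,
)

response = llm.create_chat_completion(
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": to_data_uri("scan.png")}},
            {"type": "text", "text": "Transcribe the text in this image."},
        ],
    }],
)
print(response["choices"][0]["message"]["content"])
```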
LLaVA specifically might not be great for OCR, though; IIRC it scales all input images to 336 x 336, meaning it will only spot details that are visible at that scale.
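You can preview roughly what the model "sees" by doing the same resize yourself with Pillow - if you can't read the text in the 336 x 336 copy, LLaVA probably can't either:

```python
from PIL import Image

# Squash a page scan down to the encoder's input size and eyeball it.
img = Image.open("scan.png")
img.resize((336, 336)).save("scan_336.png")  # small print usually turns to mush
```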
I've had very poor results using LLaVA for OCR. It's slow and usually can't transcribe more than a few words. I think this is because it just uses CLIP to encode the image into a single embedding vector for the LLM.
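You can see that bottleneck with the plain CLIP encoder from transformers (illustrative only, not what this package does): the pooled output is one fixed-size vector no matter how much text is on the page.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14-336")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14-336")

inputs = processor(images=Image.open("scan.png"), return_tensors="pt")
with torch.no_grad():
    embedding = model.get_image_features(**inputs)

# One vector for the entire page, however dense the text.
print(embedding.shape)  # torch.Size([1, 768])
```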
The latest LLaVA architecture is supposed to improve on this, but there are better-suited architectures if all you want is OCR.
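TrOCR is one example of a purpose-built option - an image encoder paired with a text decoder, trained only on transcription. Untested sketch via transformers; note it expects single text lines, so full pages need line detection first:

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

# One cropped line of printed text in, a transcription out.
pixel_values = processor(images=Image.open("line.png").convert("RGB"),
                         return_tensors="pt").pixel_values
ids = model.generate(pixel_values)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```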